[PYTHON] Beautiful Soup spills: a large HTML file gets truncated

Background

I wanted to extract only the necessary text from a report (in HTML format) produced by a certain company's C++ static analysis tool, so I started with Python 2.7 + Beautiful Soup 4.

bs4test.py


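from bs4 import BeautifulSoup

# Parse the report; when lxml is installed, bs4 picks it as the HTML parser.
with open("rep_38248_dev1.html") as f:
    soup = BeautifulSoup(f)
print soup.prettify("shift_jis")  # Python 2 print statement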

What? It only reads about 2,500 lines, roughly 1/10 of the entire HTML (about 26,000 lines)! I was deflated, at a loss, and stuck.
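
A quick way to confirm this kind of truncation is to count the tags in the raw file independently of bs4 and compare. The following is only a minimal sketch, written for Python 3 (the article itself uses Python 2.7, where the module is named HTMLParser); TagCounter, count_start_tags and the sample string are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser  # Python 3 standard library


class TagCounter(HTMLParser):
    """Counts start tags without building a tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are counted once.
        self.count += 1


def count_start_tags(html_text):
    """Return the number of start tags found in html_text."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


sample = "<html><body><p>one</p><p>two</p><br/></body></html>"
print(count_start_tags(sample))
```

The result can be compared with len(soup.find_all(True)) for the same file. The two numbers will not match exactly, because each parser repairs markup in its own way, but a tenfold gap like the one above points to dropped input rather than a parsing difference.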

Investigation

The usual approach is to look online for people who have hit the same problem. I searched Google right away... and found nothing. Sigh.

With no other option, I dug into the bs4 source code. The cause was a bug around the feed() method of lxml, the parser that bs4 delegates the actual parsing to: when it was fed a huge HTML text in one go, it dropped the rest of the input partway through.
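
One common way to avoid handing a parser a huge string in a single feed() call is to pass the input in small chunks. The sketch below illustrates that idea only; the article does not show how bs4 actually fixed its XML builder, so treating this as the fix is an assumption. It uses the standard library's html.parser as a stand-in so that it runs without lxml, and CHUNK_SIZE, TextCollector and feed_in_chunks are hypothetical names.

```python
import io
from html.parser import HTMLParser  # stand-in for lxml's feed interface

CHUNK_SIZE = 512  # hypothetical chunk size chosen for this sketch


class TextCollector(HTMLParser):
    """Collects text nodes so we can check that nothing was dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def feed_in_chunks(parser, markup, chunk_size=CHUNK_SIZE):
    """Feed markup to parser piece by piece, then close it."""
    stream = io.StringIO(markup)
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        parser.feed(data)
    parser.close()
    return parser


big_html = "<html><body>" + "<p>row</p>" * 5000 + "</body></html>"
collector = feed_in_chunks(TextCollector(), big_html)
print("".join(collector.chunks).count("row"))
```

Text nodes may arrive split across chunk boundaries, which is why the collected pieces are joined before counting.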

The fix is to comment out LXMLTreeBuilder.feed() in bs4/builder/_lxml.py. With the override gone, LXMLTreeBuilder falls back to the feed() it inherits from the XML builder, LXMLTreeBuilderForXML, which for some reason had already been fixed.

bs4/builder/_lxml.py


class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):

    features = [LXML, HTML, FAST, PERMISSIVE]
    is_xml = False

    def default_parser(self, encoding):
        return etree.HTMLParser

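# feed() override commented out so that the inherited
# LXMLTreeBuilderForXML.feed() is used instead.
#    def feed(self, markup):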
#        encoding = self.soup.original_encoding
#        try:
#            self.parser = self.parser_for(encoding)
#            self.parser.feed(markup)
#            self.parser.close()
#        except (UnicodeDecodeError, LookupError, etree.ParserError), e:
#            raise ParserRejectedMarkup(str(e))
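
Why commenting out the method is enough comes down to Python's method resolution order. The sketch below uses empty stand-in classes that borrow the names from the listing above; it is not bs4's real code, and the base class layout (both parents deriving from a common TreeBuilder, with HTMLTreeBuilder defining no feed() of its own) is an assumption made for illustration.

```python
class TreeBuilder(object):
    def feed(self, markup):
        raise NotImplementedError()


class HTMLTreeBuilder(TreeBuilder):
    pass  # assumed to define no feed() of its own


class LXMLTreeBuilderForXML(TreeBuilder):
    def feed(self, markup):
        return "fixed feed"  # stands in for the already-fixed method


class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
    # feed() is deliberately not defined here, mirroring the
    # commented-out override in the patched file.
    pass


# The lookup order shows where feed() will be found.
print([cls.__name__ for cls in LXMLTreeBuilder.__mro__])
print(LXMLTreeBuilder().feed("<html></html>"))
```

Because HTMLTreeBuilder comes first in the order but has no feed(), the lookup continues to LXMLTreeBuilderForXML before it reaches the common base class.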

Conclusion

Googling again, I found a related post on the beautifulsoup Google Groups forum. LXMLTreeBuilderForXML.feed() seems to have been fixed around that time, so it looks like the same fix was simply never applied to LXMLTreeBuilder...
