[PYTHON] How to speed up instantiation of BeautifulSoup

This article shares what I learned while speeding up an image-search bot that uses BeautifulSoup. I hope it helps anyone struggling with slow scraping.

How to do it

You can speed up instantiation by passing the site's character encoding to BeautifulSoup through the **from_encoding** argument.

from urllib import request
import bs4

page = request.urlopen("https://news.yahoo.co.jp/")
html = page.read()
# Pass the character encoding of the site you are scraping to from_encoding (utf-8 for Yahoo News)
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
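
If you would rather not hardcode the encoding, one option is to read it from the HTTP response itself. Here is a minimal sketch, assuming the server declares a charset in its Content-Type header; the headers object returned by urlopen exposes it via get_content_charset(), which returns None when no charset is declared:

from urllib import request
import bs4

page = request.urlopen("https://news.yahoo.co.jp/")
html = page.read()
# Parses the Content-Type header, e.g. "text/html; charset=UTF-8" -> "utf-8".
# If the header carries no charset this is None, and BeautifulSoup falls back
# to its own (slow) detection.
encoding = page.headers.get_content_charset()
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding=encoding)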

How to check the character code

Basically, it appears after charset= in a meta tag in the page's HTML.

<!--In the case of Yahoo News-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
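
You can also have bs4 perform this meta-tag lookup for you. Here is a minimal sketch using EncodingDetector.find_declared_encoding from bs4.dammit (the same helper that appears in the source excerpt later in this article; availability may depend on your bs4 version):

from urllib import request
from bs4.dammit import EncodingDetector

html = request.urlopen("https://news.yahoo.co.jp/").read()
# Scans the start of the document for an XML declaration or an HTML
# <meta> charset declaration and returns it as a string (or None).
print(EncodingDetector.find_declared_encoding(html, is_html=True))  # utf-8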

Comparison of execution time

I verified this with the following script, taking a timestamp immediately before and after creating each instance.

verification_bs4.py


from urllib import request as req
import bs4
import time
import copy

url = "https://news.yahoo.co.jp/"
page = req.urlopen(url)
html = page.read()
page.close()

start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, None")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, None")

start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, None)")

start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, utf-8")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="utf-8")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, utf-8")

start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, utf-8)")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="utf-16")
# The resulting tree is empty because the specified encoding does not match the document
print('{:.5f}'.format(time.time() - start) + "[s] lxml, utf-16")

Here is the output.

% python verification_bs4.py
2.10937[s] html.parser, None
2.00081[s] lxml, None
0.04704[s] copy(lxml, None)
0.03124[s] html.parser, utf-8
0.03115[s] lxml, utf-8
0.04188[s] copy(lxml, utf-8)
0.01651[s] lxml, utf-16

Summary

By specifying the character encoding in **from_encoding**, we were able to speed up instantiation considerably. The code samples I have seen that complain BeautifulSoup is slow never pass from_encoding, so I suspect that is the cause.

For those who have time

I wondered why the library behaves this way, so I read the source code. However, I don't write Python very often, so I may be off the mark somewhere. The source code is here.

Why it is slow

The slowdown is probably due to the **EncodingDetector** class defined in **bs4/dammit.py**. Here is a partial excerpt of the code.

class EncodingDetector:
    """Suggests a number of possible encodings for a bytestring.

    Order of precedence:

    1. Encodings you specifically tell EncodingDetector to try first
    (the override_encodings argument to the constructor).

    2. An encoding declared within the bytestring itself, either in an
    XML declaration (if the bytestring is to be interpreted as an XML
    document), or in a <meta> tag (if the bytestring is to be
    interpreted as an HTML document.)

    3. An encoding detected through textual analysis by chardet,
    cchardet, or a similar external library.

    4. UTF-8.

    5. Windows-1252.
    """
    @property
    def encodings(self):
        """Yield a number of encodings that might work for this markup.

        :yield: A sequence of strings.
        """
        tried = set()
        for e in self.override_encodings:
            if self._usable(e, tried):
                yield e

        # Did the document originally start with a byte-order mark
        # that indicated its encoding?
        if self._usable(self.sniffed_encoding, tried):
            yield self.sniffed_encoding

        # Look within the document for an XML or HTML encoding
        # declaration.
        if self.declared_encoding is None:
            self.declared_encoding = self.find_declared_encoding(
                self.markup, self.is_html)
        if self._usable(self.declared_encoding, tried):
            yield self.declared_encoding

        # Use third-party character set detection to guess at the
        # encoding.
        if self.chardet_encoding is None:
            self.chardet_encoding = chardet_dammit(self.markup)
        if self._usable(self.chardet_encoding, tried):
            yield self.chardet_encoding

        # As a last-ditch effort, try utf-8 and windows-1252.
        for e in ('utf-8', 'windows-1252'):
            if self._usable(e, tried):
                yield e
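
To see this order of precedence in action, you can instantiate EncodingDetector directly and iterate over its encodings property. Here is a minimal sketch using the override_encodings argument named in the docstring above (exact constructor arguments may vary by bs4 version; note that iterating past the declared encoding would trigger the chardet step):

from urllib import request
from bs4.dammit import EncodingDetector

html = request.urlopen("https://news.yahoo.co.jp/").read()
detector = EncodingDetector(html, override_encodings=["utf-8"], is_html=True)
# Candidates are yielded in the documented order; the override encoding
# comes first, so a parser that succeeds with it never pays for chardet.
for encoding in detector.encodings:
    print(encoding)
    break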


Judging from the comments and the code, instantiation is slow because the detector works through candidates 1 to 5 in order until one succeeds. Item 2 shows that the charset guess from the meta tag mentioned earlier also happens automatically, so the design lets you use the library without ever checking the website's source for the character code. In practice, though, you usually do look at the page source when scraping, so I don't think it needs to be this slow. (I haven't verified which step is the actual bottleneck, so if anyone knows, please tell me.)
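
If you want to hunt for the bottleneck yourself, a first experiment could be to time step 3 in isolation. Here is a minimal sketch, assuming the third-party chardet package (one of the detectors bs4 falls back to) is installed:

import time
from urllib import request

import chardet

html = request.urlopen("https://news.yahoo.co.jp/").read()

# chardet runs a statistical analysis over the whole byte string; my guess
# is that this accounts for most of the ~2 s seen without from_encoding.
start = time.time()
print(chardet.detect(html))  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print('{:.5f}'.format(time.time() - start) + "[s] chardet.detect")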

Why copy is fast

In the timing script above, the instance was duplicated with the copy.copy() method. The reason this is fast can be found in __copy__ in bs4/__init__.py. Here is a partial excerpt of the code.

__init__.py


class BeautifulSoup(Tag):

    def __copy__(self):
        """Copy a BeautifulSoup object by converting the document to a string and parsing it again."""
        copy = type(self)(
            self.encode('utf-8'), builder=self.builder, from_encoding='utf-8'
        )

        # Although we encoded the tree to UTF-8, that may not have
        # been the encoding of the original markup. Set the copy's
        # .original_encoding to reflect the original object's
        # .original_encoding.
        copy.original_encoding = self.original_encoding
        return copy

It is fast because the encoding is fixed to utf-8 here. Conversely, if the scraped site uses an encoding other than utf-8, copying becomes slower. The following script measures this against kakaku.com, which uses shift_jis.

verification_bs4_2.py


from urllib import request as req
import bs4
import time
import copy

url = "https://kakaku.com/"
page = req.urlopen(url)
html = page.read()
page.close()

start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, None")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, None")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="shift_jis")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, shift_jis")

start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, shift_jis)")

Here is the output.

% python verification_bs4_2.py
0.11084[s] html.parser, None
0.08563[s] lxml, None
0.08643[s] lxml, shift_jis
0.13631[s] copy(lxml, shift_jis)

As shown above, copy is slower here than it was for utf-8. Oddly, though, with shift_jis the execution speed barely changed even when nothing was passed to **from_encoding**. ~~I can't explain this one anymore~~

Finally

Thank you for reading this far! Sorry the ending got a bit messy. Given that more than 90% of the websites in the world use utf-8, I do wonder why it is still this slow. I wrote this article because none of the top search results about BeautifulSoup being slow mention this, which felt like a problem. If you found it useful, an "LGTM" would be encouraging.

Reference: https://stackoverrun.com/ja/q/12619706
