Scraping with Python and Beautiful Soup

There is already a lot of material on scraping with Python, both around the web and on Qiita, but much of it recommends pyquery as the easy option. Personally, I would like more people to know how good Beautiful Soup is, so that is what I will use here.

By the way, this entry is mostly a summary of the Beautiful Soup 4 documentation. See the documentation for more information.

English http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Japanese http://kondou.com/BS4/

Common misunderstanding

It is sometimes said that pyquery is easier to use than Beautiful Soup because it lets you work with HTML using CSS selectors, jQuery-style, but **the same thing can be done with Beautiful Soup.** (I am not sure about older versions.) I explain how below.

About version

The current version is Beautiful Soup 4. Please note that many articles out there cover older versions. That said, code that worked with Beautiful Soup 3 will in many cases still work after switching to Beautiful Soup 4.

Installation

$ pip install beautifulsoup4

Basic usage

Creating a BeautifulSoup object

When you already have the HTML as a plain string, it looks like the following.

from bs4 import BeautifulSoup

html = """
	<html>
	...
	</html>
"""

soup = BeautifulSoup(html)

Also, Beautiful Soup cannot fetch a URL directly, so when scraping a website it is best to combine it with urllib2 or a similar library.

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://example.com")

soup = BeautifulSoup(html)
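
The code above is for Python 2. On Python 3 the same thing can be done with urllib.request; a minimal sketch, with the parser named explicitly to avoid the warning discussed below.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://example.com")
soup = BeautifulSoup(html, "html.parser")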

If you get a warning about the HTML parser here, specify a parser as the message suggests. (For details, see the About HTML parser section below.)

soup = BeautifulSoup(html, "html.parser")

Simple ways to get tags

To get all a tags from the HTML

soup.find_all("a")

The object returned here, a bs4.element.ResultSet, can be treated like a list.

To get only the first matching tag instead of all of them

soup.find("a")

Or

soup.a

soup.find ("a ") and soup.a will return None if the tag does not exist in the HTML.

Information from the obtained tag

To get an attribute of the obtained tag

soup.a.get("href")

To get the text inside the obtained tag

soup.a.string

Of course you can also get nested tags

soup.p.find_all("a")

Get tags with specific conditions

You can easily narrow down the tags you get by filtering on attributes. For example, to get all a tags whose class is link and whose href is /link, such as <a href="/link" class="link">

soup.find_all("a", class_="link", href="/link")

Or

soup.find_all("a", attrs={"class": "link", "href": "/link"})

Note that class is a reserved word in Python, so the keyword argument becomes class_.

Also, you do not have to specify the tag name.

soup.find_all(class_="link", href="/link")
soup.find_all(attrs={"class": "link", "href": "/link"})

Getting tags using regular expressions

To get all tags whose name starts with b, such as the b and body tags

import re
soup.find_all(re.compile("^b"))

To get all tags whose href attribute contains "link"

import re
soup.find_all(href=re.compile("link"))

To get all a tags whose text contains "hello"

import re
soup.find_all("a", text=re.compile("hello"))

Get tags using CSS selectors

If you use select instead of find_all, you can get tags using CSS selectors.

soup.select("#link1")
soup.select('a[href^="http://"]')
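
For example, with a small made-up snippet, select returns a list of matching tags.

from bs4 import BeautifulSoup

html = '''
<ul>
  <li><a id="link1" class="link" href="http://example.com/">example</a></li>
  <li><a id="link2" class="link" href="/local">local</a></li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

soup.select("#link1")              # the tag with id="link1"
soup.select("li a.link")           # both a tags, via a descendant selector
soup.select('a[href^="http://"]')  # only the absolute http:// link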

Rewrite

Add attributes to tags

a = soup.find("a")
a["target"] = "_blank"

Use `unwrap` to remove the tag

html = '''
<div>
    <a href="/link">spam</a>
</div>
'''

soup = BeautifulSoup(html)
soup.div.a.unwrap()

soup.div
# <div>spam</div>

Conversely, if you want to add a new tag, create one with soup.new_tag and add it with wrap.

html = '''
<div>
    <a href="/link">spam</a>
</div>
'''

soup = BeautifulSoup(html)
soup.div.a.wrap(soup.new_tag("p"))
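
The a tag is now wrapped in the new p tag; checking soup.div.p shows the result.

soup.div.p
# <p><a href="/link">spam</a></p>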

In addition, there are many other manipulation methods, such as `insert` and `extract`, so you can flexibly add and remove contents and tags.
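
For example, `extract` pulls a tag out of the tree and returns it; a minimal sketch with a fresh snippet.

soup = BeautifulSoup('<div><a href="/link">spam</a></div>', "html.parser")
a = soup.div.a.extract()

a
# <a href="/link">spam</a>

soup.div
# <div></div>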

Output

By calling prettify, you can format the document neatly and output it as a string.

soup.prettify()

# <html>
#  <head>
#   <title>
#    Hello
#   </title>
#  </head>
#  <body>
#   <div>
#    <a href="/link">
#     spam
#    </a>
#   </div>
#   <div>
#    ...
#   </div>
#  </body>
# </html>
soup.div.prettify()

# <div>
#  <a href="/link">
#   spam
#  </a>
# </div>

About HTML parser

Beautiful Soup normally uses Python's standard html.parser, but if lxml or html5lib is installed, those are used in preference. To specify a parser explicitly, do the following.

soup = BeautifulSoup(html, "lxml")

If your Python version is old, html.parser may not be able to parse correctly. In my environment, parsing worked with Python 2.7.3 but not with Python 2.6.

To be safe, install lxml or html5lib whenever possible so that parsing works reliably. Note, however, that lxml depends on external C libraries, so installing it may take extra steps depending on your environment.
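
In practice this just means installing the parser you want

$ pip install lxml html5lib

and then naming it when creating the soup, for example

soup = BeautifulSoup(html, "html5lib")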

Digression

Uses of Beautiful Soup

In my case, I run a site of my own that collects articles from multiple blogs into a database. Normally I take the articles from RSS, but some feeds only carry a small number of entries, so in those cases I read the HTML with Beautiful Soup and save its contents.

Also, when displaying the body of a saved blog post, I remove unnecessary advertisements and set the target attribute on a tags so that links open in a new tab.
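
As a rough sketch of that kind of cleanup (the .ad class and the sample HTML are just hypothetical placeholders; decompose removes a tag together with its contents).

from bs4 import BeautifulSoup

# Hypothetical saved blog body, for illustration only
article_html = '''
<div class="entry">
  <div class="ad">advertisement</div>
  <p><a href="/post">read more</a></p>
</div>
'''

soup = BeautifulSoup(article_html, "html.parser")

# Strip advertisement blocks (".ad" is a made-up class name)
for ad in soup.select(".ad"):
    ad.decompose()

# Make every link open in a new tab
for a in soup.find_all("a"):
    a["target"] = "_blank"

cleaned = str(soup)  # the ad div is gone and the link now has target="_blank"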

Reference: http://itkr.net

I think Beautiful Soup is excellent for such applications.
