[Python3] Understand the basics of Beautiful Soup


I recently started learning Python. To deepen my understanding of web scraping, I am summarizing the basics here in my own words.

How the web works

I will skip this topic in this article, but if you develop distributed systems, you need to understand it to some extent.

What is Beautiful Soup?

This is the main subject. Books and other sources describe it as a library that parses HTML. See also the official site. It has the following three features.

  1. It provides methods for navigating, searching, and modifying the parse tree.
  2. It handles encodings automatically (unless Beautiful Soup cannot determine the document's encoding).
     --Incoming documents are converted to Unicode.
     --Outgoing documents are converted to UTF-8.
  3. You can choose which parser to use.
     --html.parser: Part of the standard library. Its processing speed is moderate.
     --lxml: Third-party library. Its main strength is high processing speed.
     --html5lib: Third-party library. It is highly capable: it supports HTML5 syntax and parses pages the same way a web browser does, but it is slower than the others.
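
As a rough illustration of these features, here is a minimal sketch. The sample HTML and the `make_soup` helper are made up for this example. It assumes only that Beautiful Soup 4 is installed, and it falls back to html.parser when lxml is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up sample document for illustration only.
html = (
    "<html><head><title>Sample</title></head>"
    "<body><p class='lead'>Hello <b>world</b></p>"
    "<a href='/a'>A</a></body></html>"
)


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)

# Navigate, search, and modify the tree.
title_text = soup.title.string      # navigate by tag name
first_link = soup.find("a")         # search for the first <a> tag
first_link["href"] = "/changed"     # modify an attribute in place

# Text comes back as a Unicode str; encode() produces UTF-8 bytes.
paragraph_text = soup.p.get_text()
encoded = soup.encode("utf-8")

print(title_text, first_link["href"], paragraph_text, type(encoded).__name__)
```

Trying lxml first and catching `FeatureNotFound` is just one way to pick a parser; passing "html.parser" directly, as the rest of this article does, avoids the extra dependency.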

Install Beautiful Soup

Install the BeautifulSoup library.

--I am on macOS, so I use the "pip3" command.
--The latest version of Beautiful Soup is 4.9.1 (as of May 23, 2020).

Run the following command in a terminal.

> pip3 install beautifulsoup4

If the import below succeeds in the Python interactive shell, the installation worked. bs4 is the name of the package you import from.

>>> from bs4 import BeautifulSoup
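
To confirm which version was installed, one option is to print the package's version string. This is a small sketch that assumes the `bs4` package exposes a `__version__` attribute, which the releases I know of do.

```python
import bs4

# Print the installed Beautiful Soup version string.
print(bs4.__version__)
```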

Try to extract information from a website using BeautifulSoup

This time, we will extract the titles and URLs from the news list on Yahoo! JAPAN.


Implementation

--Use requests to fetch the page.
--Use BeautifulSoup to parse the elements.
--Use re to match the items with a regular expression.
--Identify the tag structure to extract with the browser's developer tools.
--This time, the items can be found by matching href attributes that contain "news.yahoo.co.jp/pickup".
--Import the re module, which is part of the standard library, to use regular expressions.
--Check the official documentation later.
--Extract the text and the href attribute from the matched items.



import requests
from bs4 import BeautifulSoup
import re

url = "https://www.yahoo.co.jp/"

# Fetch the page with requests
result = requests.get(url)
# Parse the HTML
bs = BeautifulSoup(result.text, "html.parser")
# Find elements whose href contains "news.yahoo.co.jp/pickup"
# (the dots are escaped so they match literal dots only)
news_list = bs.find_all(href=re.compile(r"news\.yahoo\.co\.jp/pickup"))

# Print the text and the href attribute of each matched element
for news in news_list:
    print("{0} , {1}".format(news.get_text(), news.get('href')))

Execution result

3 prefectures released Mask shoppers, https://news.yahoo.co.jp/pickup/6360522
Rice discusses resumption of nuclear test US newspaper, https://news.yahoo.co.jp/pickup/6360527
Light and dark NEW at Subaru and Mitsubishi Corona, https://news.yahoo.co.jp/pickup/6360528
Antimalarial drug increased risk of death NEW, https://news.yahoo.co.jp/pickup/6360523
A woman in her 80s with a seismic intensity of 4 broke before dawn, https://news.yahoo.co.jp/pickup/6360529
Mask delivery in Iwate Voice of nowadays NEW, https://news.yahoo.co.jp/pickup/6360521
Equestrian club pinch I want to avoid culling, https://news.yahoo.co.jp/pickup/6360510
Rina Akiyama gives birth to a second baby boy NEW, https://news.yahoo.co.jp/pickup/6360531

The "NEW" label was extracted along with some titles. If it is unwanted, it can be removed with a string replacement (this implementation does not do so).
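
One possible way to drop the label is sketched below. It runs offline against made-up HTML, so the markup (a span holding "NEW" inside the link) is an assumption about the real page, and `clean_title` is a hypothetical helper introduced only for this example.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page; the structure is an assumption.
sample_html = """
<ul>
  <li><a href="https://news.yahoo.co.jp/pickup/1">First headline<span>NEW</span></a></li>
  <li><a href="https://news.yahoo.co.jp/pickup/2">Second headline</a></li>
  <li><a href="https://example.com/other">Unrelated link</a></li>
</ul>
"""

bs = BeautifulSoup(sample_html, "html.parser")
pickup = re.compile(r"news\.yahoo\.co\.jp/pickup")


def clean_title(text):
    """Hypothetical helper: trim whitespace and remove a trailing "NEW" label."""
    return re.sub(r"\s*NEW$", "", text.strip())


# Collect (title, href) pairs for links that match the pattern.
results = [
    (clean_title(a.get_text()), a.get("href"))
    for a in bs.find_all(href=pickup)
]

for title, href in results:
    print("{0} , {1}".format(title, href))
```

Note that this simple pattern would also cut a headline that genuinely ends in "NEW", so removing the label's own element from the tree may be safer once the real tag structure has been checked in the developer tools.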

In conclusion

This article covered only the basics, but I would like to deepen my understanding further by reading the official documentation.

