[Python3] Understand the basics of Beautiful Soup

Introduction

I have started learning Python. To deepen my understanding of web scraping, I am summarizing what I learned in my own words.

How the web works

I will omit the details in this article, but if you are developing a distributed system, you need to understand it to some extent. Personally, I recommend this book for learning: [Technologies that support the Web: HTTP, URI, HTML, and REST (WEB+DB PRESS plus)](https://www.amazon.co.jp/dp/4774142042)

What is Beautiful Soup?

This is the main subject. Books and other references describe Beautiful Soup as a library for parsing HTML. See also the official site. It has the following three features.

  1. It provides methods for navigating, searching, and modifying the parse tree.
  2. It converts encodings automatically (unless Beautiful Soup cannot detect the encoding of the document).
     - Incoming documents are converted to Unicode.
     - Outgoing documents are converted to UTF-8.
  3. You can choose which parser to use.
     - html.parser: Standard library. Its processing speed is neither fast nor slow.
     - lxml: Third-party library. Characterized by high processing speed.
     - html5lib: Third-party library. It supports HTML5 syntax and parses pages the same way a web browser does, but it is slower than the others.
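The first and third features can be illustrated with a minimal sketch. The HTML fragment and variable names below are made up for illustration, and html.parser is used because it needs no extra installation.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration.
html = """
<html><body>
  <h1 id="top">News</h1>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

# The second argument selects the parser; html.parser ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Navigating: move from a tag to its parent.
first_link = soup.find("a")
parent_name = first_link.parent.name  # name of the enclosing tag

# Searching: collect every <a> tag and read its href attribute.
hrefs = [a.get("href") for a in soup.find_all("a")]

# Modifying: replace the text inside the <h1> tag.
soup.h1.string = "Latest news"
heading = soup.h1.get_text()

print(parent_name, hrefs, heading)
```

Passing "lxml" or "html5lib" instead of "html.parser" should work the same way here, provided the corresponding library is installed.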

Install Beautiful Soup

Install the BeautifulSoup library.

- Since I'm using macOS, I use the "pip3" command.
- The latest version of BeautifulSoup is 4.9.1 (as of May 23, 2020).

Run the following command in a terminal.

> pip3 install beautifulsoup4

If the import succeeds, the installation worked. bs4 is the package name, and BeautifulSoup is the class imported from it.

>>> from bs4 import BeautifulSoup
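As a slightly fuller check, the following sketch prints the installed version and parses a one-line document. The sample markup is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# bs4 exposes its version string as bs4.__version__.
version = bs4.__version__
print("Beautiful Soup version:", version)

# Parsing a tiny document confirms that the class is usable.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
text = soup.p.get_text()
print(text)
```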

Try to extract information from a website using BeautifulSoup

This time, we will extract the title and URL of the news list of YAHOO! JAPAN.


To implement

- Use requests to get the site information.
- Use BeautifulSoup to parse the elements.
- Use re to select items with a regular expression.
- Identify the tag structure to extract with the browser's developer tools.
- This time, the items can be found by matching "news.yahoo.co.jp/pickup" in the href attribute.
- Import the re module, which is part of the standard library, to use regular expressions.
- Check the official documentation later.
- Extract the text and the href attribute from the matched items.
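The href matching step can be tried on its own with the re module. When a compiled pattern is passed as an attribute filter, Beautiful Soup tests it with search(), so the pattern may match anywhere in the href. The sample hrefs below are made up for illustration; the sketch also shows why escaping the dots gives a stricter match, since an unescaped "." matches any character.

```python
import re

# Escaped dots match only a literal "."; this is the stricter pattern.
strict = re.compile(r"news\.yahoo\.co\.jp/pickup")
# Unescaped dots match any single character; this is the looser pattern.
loose = re.compile("news.yahoo.co.jp/pickup")

# Hypothetical href values used only for illustration.
hrefs = [
    "https://news.yahoo.co.jp/pickup/6360522",
    "https://news.yahoo.co.jp/categories/domestic",
    "https://newsXyahoo.co.jp/pickup/1",
]

# search() looks for the pattern anywhere in the string.
strict_matched = [h for h in hrefs if strict.search(h)]
loose_matched = [h for h in hrefs if loose.search(h)]

print(strict_matched)
print(loose_matched)
```

For a page like this one the looser pattern is unlikely to cause trouble in practice, but the escaped form states the intent more precisely.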

Code

ScrapingSample.py


import requests
from bs4 import BeautifulSoup
import re

url = "https://www.yahoo.co.jp/"

# Fetch the page with requests
result = requests.get(url)
# Parse the HTML
bs = BeautifulSoup(result.text, "html.parser")
# Find elements whose href contains "news.yahoo.co.jp/pickup"
news_list = bs.find_all(href=re.compile(r"news\.yahoo\.co\.jp/pickup"))

# Print the text and the href attribute of each matched element
for news in news_list:
    print("{0} , {1}".format(news.get_text(), news.get('href')))

Execution result

3 prefectures released Mask shoppers, https://news.yahoo.co.jp/pickup/6360522
Rice discusses resumption of nuclear test US newspaper, https://news.yahoo.co.jp/pickup/6360527
Light and dark NEW at Subaru and Mitsubishi Corona, https://news.yahoo.co.jp/pickup/6360528
Antimalarial drug increased risk of death NEW, https://news.yahoo.co.jp/pickup/6360523
A woman in her 80s with a seismic intensity of 4 broke before dawn, https://news.yahoo.co.jp/pickup/6360529
Mask delivery in Iwate Voice of nowadays NEW, https://news.yahoo.co.jp/pickup/6360521
Equestrian club pinch I want to avoid culling, https://news.yahoo.co.jp/pickup/6360510
Rina Akiyama gives birth to a second baby boy NEW, https://news.yahoo.co.jp/pickup/6360531

The "NEW" label is extracted together with the title. If it is not needed, it can be removed with a string replacement (this implementation does not do so).
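One way to remove the label is sketched below. It runs offline against a made-up HTML fragment, so the real page markup may differ, and the helper names strip_new_label and extract_pickup_links are hypothetical. It assumes the label appears at the end of the link text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the link list; the real markup may differ.
html = """
<ul>
  <li><a href="https://news.yahoo.co.jp/pickup/1">Headline one<span>NEW</span></a></li>
  <li><a href="https://news.yahoo.co.jp/pickup/2">Headline two</a></li>
  <li><a href="https://example.com/other">Other link</a></li>
</ul>
"""


def strip_new_label(title):
    """Remove a trailing "NEW" marker and surrounding whitespace (hypothetical helper)."""
    return re.sub(r"\s*NEW\s*$", "", title)


def extract_pickup_links(html_text):
    """Return (title, url) pairs for <a> tags whose href contains the pickup path."""
    soup = BeautifulSoup(html_text, "html.parser")
    links = soup.find_all("a", href=re.compile(r"news\.yahoo\.co\.jp/pickup"))
    return [(strip_new_label(a.get_text()), a.get("href")) for a in links]


pairs = extract_pickup_links(html)
for title, url in pairs:
    print(f"{title}, {url}")
```

Note that a headline that genuinely ends with the word "NEW" would also be trimmed, so a stricter rule may be needed depending on how the label is marked up.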

In conclusion

The content here was simple, but I would like to deepen my understanding by reading the official documentation.
