Beautiful Soup is a library for parsing HTML and XML and extracting information from them. It has no download function of its own, so use it in combination with `urllib`.
The basic usage of Beautiful Soup is shown below.
# Library import
from bs4 import BeautifulSoup
html1 = """
<html><body>
<h1>Scraping</h1>
<p>Web page analysis</p>
<p>Extraction of arbitrary parts</p>
</body></html>
"""
# HTML parsing
soup = BeautifulSoup(html1, 'html.parser')
# Extract any element
h1 = soup.html.body.h1
p1 = soup.html.body.p
# next_sibling is applied twice: the first sibling is the newline text between the two <p> tags
p2 = p1.next_sibling.next_sibling
print(h1.string)
print(p1.string)
print(p2.string)
Execution result
Scraping
Web page analysis
Extraction of arbitrary parts
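Chaining `next_sibling` depends on the exact whitespace between tags, so it can be fragile. As a minimal sketch of an alternative (not part of the original walkthrough), `find_all` collects every `<p>` element in document order. The variable name `paragraphs` is only an illustration.

```python
from bs4 import BeautifulSoup

html1 = """
<html><body>
<h1>Scraping</h1>
<p>Web page analysis</p>
<p>Extraction of arbitrary parts</p>
</body></html>
"""

soup = BeautifulSoup(html1, "html.parser")

# The first sibling of the first <p> is the newline text node, not the next tag
p1 = soup.html.body.p
first_sibling = p1.next_sibling
print(repr(first_sibling))

# find_all returns all matching tags, regardless of the whitespace between them
paragraphs = [p.string for p in soup.find_all("p")]
for text in paragraphs:
    print(text)
```

This way the code keeps working even if the line breaks between the tags are removed or more are added.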
Scraping by using Beautiful Soup and `urllib` together
# Library import
import urllib.request as req
from bs4 import BeautifulSoup
url = "https://api.aoikujira.com/zip/xml/1500042"
res = req.urlopen(url)
# Parse the data retrieved by urlopen()
soup = BeautifulSoup(res, 'html.parser')
ken = soup.find("ken").string
shi = soup.find("shi").string
cho = soup.find("cho").string
print(ken, shi, cho)
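Note that `find` returns `None` when a tag is not present, so calling `.string` on the result would raise an `AttributeError`. The following is a rough sketch of guarding against that, run offline against a string instead of the network. The XML below is a hypothetical sample with placeholder values: only the tag names `ken`, `shi`, and `cho` come from the code above, and the real response may be structured differently. The helper `text_of` and the tag name `banchi` are also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; the actual API response may look different
sample_xml = """
<address>
<ken>Sample-ken</ken>
<shi>Sample-shi</shi>
<cho>Sample-cho</cho>
</address>
"""


def text_of(soup, tag_name):
    """Return the text of the first matching tag, or None if it is missing."""
    tag = soup.find(tag_name)
    return tag.string if tag is not None else None


soup = BeautifulSoup(sample_xml, "html.parser")
ken = text_of(soup, "ken")
shi = text_of(soup, "shi")
cho = text_of(soup, "cho")
missing = text_of(soup, "banchi")  # this tag does not exist in the sample

print(ken, shi, cho)
print(missing)
```

The same helper could be used with the `soup` built from `req.urlopen(url)`, assuming the response contains these tags.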
For reference, the book I used publishes its code on GitHub: "Python Scraping & Machine Learning Development Techniques" (revised and expanded edition).