[PYTHON] Scraping with Beautiful Soup

Environment: Mac, Python 3

Advance preparation

Install Beautiful Soup and lxml

$ pip install beautifulsoup4
$ pip install lxml

An error was printed partway through, but the installation completed successfully and I have had no problems so far.

Basic form for creating soup

from bs4 import BeautifulSoup
import urllib.request

# When fetching HTML from the web
url = '××××××××××××'
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
html = response.read()
soup = BeautifulSoup(html, "lxml")

# When opening a local HTML file directly
with open("index.html") as f:
    soup = BeautifulSoup(f, "lxml")
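As a minimal sketch of the loading step, the following shows that BeautifulSoup accepts a string, bytes (what response.read() returns), or an open file object. The inline HTML is a made-up stand-in for a real page, and I use the built-in "html.parser" here only so the sketch runs without lxml; that choice is my assumption, not part of the setup above.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical inline document standing in for a downloaded page or index.html.
HTML = "<html><head><title>test title</title></head><body><p>hello</p></body></html>"

# "html.parser" ships with Python; swap in "lxml" if it is installed.
soup_from_str = BeautifulSoup(HTML, "html.parser")
# Bytes, like the value returned by response.read(), are decoded automatically.
soup_from_bytes = BeautifulSoup(HTML.encode("utf-8"), "html.parser")
# Any object with a read() method works, such as an open file.
soup_from_file = BeautifulSoup(io.StringIO(HTML), "html.parser")

for s in (soup_from_str, soup_from_bytes, soup_from_file):
    print(s.title.string)
```

All three soups should describe the same document, so which form you use mostly depends on where the HTML comes from.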

What to do next

Get the element by specifying the tag that contains the information you want.

Frequently used ways to specify elements


-Specify class
   soup.find(class_='class_name')
   #class is a reserved word in Python, so write class_ with a trailing underscore. Plain class= is a syntax error.
-Specify id
   soup.find(id="id_name")
   #id is not a reserved word, so no underscore is needed.
-Specify the tag name as well
   soup.find('li', class_='class_name')
   soup.find('div', id="id_name")
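The lookups above can be tried on a small document like this. The HTML, the class name "other", and the variable names are made up for illustration, and "html.parser" is used only so the sketch runs without lxml. It also shows the attrs= dictionary form as an alternative to class_, and that find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical document reusing the placeholder names from the list above.
HTML = """
<div id="id_name">
  <ul>
    <li class="class_name">first</li>
    <li class="other">second</li>
  </ul>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

by_class = soup.find(class_="class_name")                   # any tag with that class
by_id = soup.find(id="id_name")                             # any tag with that id
by_tag_and_class = soup.find("li", class_="class_name")     # tag name plus class
by_attrs = soup.find("li", attrs={"class": "class_name"})   # same search via a dict
missing = soup.find("li", class_="no_such_class")           # no match

print(by_class.string)
print(by_id.name)
print(missing)
```

Because a failed find() gives None, it is worth checking the result before accessing attributes on it.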

find() returns only the first match. To get every match, use find_all(), which returns a list.

images = soup.find_all('img')
for img in images:
    pass  # process each img here

# CSS selectors can also be used with select()
soup.select("p > a")
soup.select('a[href="http://example.com/"]')
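Here is a small sketch of find_all() and select() side by side. The HTML and the file names a.png and b.png are invented for the example, and "html.parser" is used so it runs without lxml. Note that "p > a" only matches an a tag that is a direct child of a p tag, so the link nested inside a span is not picked up.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one link directly under <p>, one nested inside <span>.
HTML = """
<div>
  <p><a href="http://example.com/">direct child</a></p>
  <p><span><a href="http://example.com/other">nested deeper</a></span></p>
  <img src="a.png"/>
  <img src="b.png"/>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

# find_all() returns every match, so it can be looped over.
srcs = [img["src"] for img in soup.find_all("img")]

# select() takes a CSS selector and also returns a list.
child_links = soup.select("p > a")
exact_links = soup.select('a[href="http://example.com/"]')

# select_one() is the CSS counterpart of find(): first match or None.
first_link = soup.select_one("a")

print(srcs)
print([a.string for a in child_links])
print(first_link["href"])
```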

Execution sample

The samples below assume the HTML has already been loaded into soup.

Sample 1: Get the text between the tags

sample.html


<html>
  <title>test title</title>
</html>

>>> soup.title
<title>test title</title>
>>> soup.title.string
'test title'

Appending .string to a tag returns the text inside it.
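One thing to watch: .string only works when the tag has a single child. As a sketch (the p tag below is my own addition to the sample, and "html.parser" is used so it runs without lxml), a tag with mixed content gives None for .string, while get_text() joins all the text inside it.

```python
from bs4 import BeautifulSoup

# The sample title plus a hypothetical <p> with mixed text and tags.
HTML = "<html><title>test title</title><p>Hello <b>world</b>!</p></html>"
soup = BeautifulSoup(HTML, "html.parser")

title_text = soup.title.string  # single text child, so the text is returned
p_string = soup.p.string        # several children, so .string is None
p_text = soup.p.get_text()      # concatenates every piece of text inside <p>

print(title_text)
print(p_string)
print(p_text)
```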

Sample 2: Extract the src of the img tag

sample.html


<html>
  <div id="hoge">
    <img class="fuga" src="http://××.com/sample.jpg "/>
  </div>
</html>

First, get the div tag with id="hoge".

>>> div = soup.find('div', id="hoge")
>>> div
<div id="hoge">
  <img class="fuga" src="http://××.com/sample.jpg "/>
</div>

Next, get the img tag with class="fuga" from inside the div.

>>> img = div.find('img', class_='fuga')
>>> img
<img class="fuga" src="http://××.com/sample.jpg "/>
>>> img['src']
'http://××.com/sample.jpg '

In this case you don't actually need to get the div first, since soup.find('img', class_='fuga') would find the same tag directly. I added the div because I wanted a sample that narrows the search down step by step.
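Narrowing down does matter when the same class appears in more than one place. The sketch below assumes a second img with class "fuga" outside the div; that extra tag and the example.com URLs are made up for illustration, and "html.parser" is used so it runs without lxml. It also shows .get(), which returns None for a missing attribute where img['alt'] would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of sample 2 with a second "fuga" image outside the div.
HTML = """
<html>
  <div id="hoge">
    <img class="fuga" src="http://example.com/sample.jpg"/>
  </div>
  <img class="fuga" src="http://example.com/outside.jpg"/>
</html>
"""
soup = BeautifulSoup(HTML, "html.parser")

# Searching the whole document finds both images.
all_fuga = soup.find_all("img", class_="fuga")

# Searching from the div only sees tags nested inside it.
div = soup.find("div", id="hoge")
inside = div.find_all("img", class_="fuga")

# The same narrowing written as a single CSS selector.
via_css = soup.select_one("div#hoge img.fuga")

# .get() returns None for a missing attribute instead of raising KeyError.
alt = inside[0].get("alt")

print(len(all_fuga), len(inside))
print(via_css["src"])
print(alt)
```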

Reference: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
