[PYTHON] How to search HTML data using Beautiful Soup

1. Overview

I embedded various tags in HTML for data analysis and tried several ways to have an automated test check whether the embedded data is displayed correctly.

One of them, Python's BeautifulSoup, let me search the target HTML data, so I summarize the method here. Note: this article covers only how to search; it does not cover analysis.

2. Concept of search method

To get the target HTML data with BeautifulSoup, first find a tag (the data enclosed in <>) to use as the starting point. Then narrow the search by specifying, one by one, the information contained in that tag.

If the starting point can be identified by a unique value, you only need to specify that one item, which keeps the search code simple.

The examples from Section 3 onward assume that they are embedded in the following program.


import requests
import re
from bs4 import BeautifulSoup

res = requests.get('Describe the URL to be analyzed here')
c = res.content
soup = BeautifulSoup(c, 'html.parser')

#Put one of the example statements here

print(elems)
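
To try the examples without fetching a page over the network, the same skeleton can be run against an HTML string. This is only a minimal sketch: the `html` sample below is made up for illustration and is not part of the original setup.

```python
from bs4 import BeautifulSoup

# Made-up sample page (an assumption for illustration, not a real site).
html = """
<html><head><title>Sample</title>
<meta name="test" content="demo">
<script>var x = 1;</script></head>
<body><h1>Heading</h1>
<div><a class="test" href="http://example.com/a.html">User guide</a></div>
</body></html>
"""

# Parse the string exactly as the downloaded content would be parsed.
soup = BeautifulSoup(html, "html.parser")

# Any example statement from the following sections can go here.
elems = soup.find_all("script")

print(elems)
```

Swapping the `html` string for `res.content` gives back the network-based version above.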

3. Basic data acquisition method

To get a single element, use find; to get every matching element, use find_all (there is also a method called select, but it is not covered here). The following examples use find_all.
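
The difference between the two methods can be sketched as follows. The HTML string is a made-up sample; the point is only what each call returns when something matches and when nothing does.

```python
from bs4 import BeautifulSoup

html = "<p>first</p><p>second</p>"  # made-up sample HTML
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")               # a single Tag: the first match
all_p = soup.find_all("p")           # a list-like result with every match
missing = soup.find("table")         # None when nothing matches
none_found = soup.find_all("table")  # an empty result when nothing matches

print(first.get_text(), len(all_p), missing, len(none_found))
```

Because find can return None, code that uses its result usually needs a check before accessing attributes.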

・ Pattern to specify tags directly

#Structure of the tag you want to search
<script>~</script>
elems = soup.find_all("script")

Use this to search for elements by tag name alone.

・ Pattern to specify multiple tags (using a list)

#Structure of the tag you want to search
<h1>~</h1>
<div>~</div>
elems = soup.find_all(["h1","div"])

Use this when you want to search for several kinds of tags at once.
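
As a small sketch of the list form, the following runs the same call on a made-up HTML string and shows that the results come back in document order, regardless of the order of the names in the list.

```python
from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Note</p><div>Body</div>"  # made-up sample HTML
soup = BeautifulSoup(html, "html.parser")

# Names are listed as div, h1, but matches follow the order in the document.
elems = soup.find_all(["div", "h1"])
names = [e.name for e in elems]

print(names)
```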

・ Pattern to specify keywords (exact match)

#Structure of the tag you want to search
<a class = "test">~</a>
elems = soup.find_all(class_="test")

Use this when a tag has an attribute whose value is assigned with "=". When specifying a class, write "class_", because class is a reserved word in Python. To match any of several values, pass them as a list using [], as shown below.

elems = soup.find_all(id=["test1", "test2"])
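
A short sketch of both forms on a made-up HTML string. One detail worth noting: class is a multi-valued attribute, so class_="test" also matches a tag whose class is "test big".

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = (
    '<a class="test big" id="test1">A</a>'
    '<a class="other" id="test2">B</a>'
    '<a class="test" id="test3">C</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Matches any tag that has "test" among its CSS classes.
by_class = [e.get_text() for e in soup.find_all(class_="test")]

# Matches tags whose id is either of the listed values.
by_id = [e.get_text() for e in soup.find_all(id=["test1", "test2"])]

print(by_class, by_id)
```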

・ Pattern to specify keywords (partial match)

#Structure of the tag you want to search
<a href="http://○○/△△.html">~</a>
elems = soup.find_all(href=re.compile("http://"))

Use this to match only part of the value assigned with "=".
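
The following sketch shows how the compiled pattern is applied. The pattern is searched for anywhere in the attribute value, so anchors such as ^ are needed to pin it to the start. The URLs are made-up placeholders.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = (
    '<a href="http://example.com/a.html">A</a>'
    '<a href="/local/b.html">B</a>'
    '<a href="https://example.com/c.html">C</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Only values that contain the literal text "http://".
plain_http = [e.get_text() for e in soup.find_all(href=re.compile("http://"))]

# Values that start with either http:// or https://.
any_http = [e.get_text() for e in soup.find_all(href=re.compile(r"^https?://"))]

print(plain_http, any_http)
```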

・ Pattern to specify keywords (exact match using the attrs argument)

#Structure of the tag you want to search
<a href="http://○○/△△.html">~</a>
elems = soup.find_all(attrs={"href":"http://○○/△△.html"})

Use "attrs" when an attribute name cannot be written as a keyword argument, such as an HTML5 data-* attribute, or when it collides with a parameter of find_all, such as name.

#Example)
× elems = soup.find_all("meta",name="test")
⇒TypeError: find_all() got multiple values for argument 'name'

○ elems = soup.find_all("meta",attrs={"name":"test"})
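
As a sketch of both cases, the following uses attrs for a name attribute and for a data-* attribute. The attribute name data-id and the HTML string are made up for illustration; data-id=... could not be written as a keyword argument because of the hyphen.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = '<meta name="test" content="x"><div data-id="42">D</div>'
soup = BeautifulSoup(html, "html.parser")

# "name" is already the first parameter of find_all, so it goes in attrs.
metas = soup.find_all("meta", attrs={"name": "test"})

# A hyphenated attribute name is not a valid Python keyword argument.
data = soup.find_all(attrs={"data-id": "42"})

print(len(metas), data[0].get_text())
```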

・ Pattern to check whether an attribute is present (True if present, False if absent)

#Structure of the tag you want to search
<a href="http://○○/△△.html">~</a>
elems = soup.find_all(href=True)

An attribute such as href can hold any value, so use this when you want to find every tag that has the attribute regardless of its value. To find tags that do not have the attribute, specify False as follows:

elems = soup.find_all(id=False)
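
A small sketch of the presence check on a made-up HTML string. Combining the check with a tag name keeps the False case from matching every unrelated tag that lacks the attribute.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = '<a href="/a.html" id="x">A</a><a>B</a><p id="y">C</p>'
soup = BeautifulSoup(html, "html.parser")

# Every tag that has an href attribute, whatever its value.
with_href = [e.get_text() for e in soup.find_all(href=True)]

# Only <a> tags that have no id attribute.
a_without_id = [e.get_text() for e in soup.find_all("a", id=False)]

print(with_href, a_without_id)
```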

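・ Pattern to search for text enclosed in tags (exact match)

#Structure of the tag you want to search
<a href="http://○○/△△.html">User guide</a> #I want to search only for this "User guide" text
elems = soup.find_all(text='User guide')

Use this when you want to extract only the text enclosed in tags. The result is the matching text itself, not the tag. Recent versions of Beautiful Soup call this argument string; text is its older name.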

・ Pattern to search for text enclosed in tags (partial match)

#Structure of the tag you want to search
<a href="http://○○/△△.html">User guide</a> #I want to search only for this "User guide" text
elems = soup.find_all(text=re.compile("guide"))

Use this when you want to extract only the text enclosed in tags, matching on part of the text rather than the whole.
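
Both text searches can be sketched together as follows. This sketch uses the argument name string, which Beautiful Soup 4.4.0 and later accept in place of text; the HTML string is a made-up sample. Attribute values such as href are not searched, only the text between tags.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = '<a href="/guide.html">User guide</a><a href="/faq.html">FAQ</a>'
soup = BeautifulSoup(html, "html.parser")

# Exact match: the whole text must equal the given value.
exact = soup.find_all(string="User guide")

# Partial match: the pattern may appear anywhere in the text.
partial = soup.find_all(string=re.compile("guide"))

# The results are text nodes, which behave like ordinary str objects.
print(exact, partial)
```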

・ Pattern that specifies the number of tags to collect (only find_all can be used)

#Structure of the tag you want to search
<p>test1</p> #I want to get only here
<p>test2</p>
elems = soup.find_all('p', limit=1)

Use this when you want to get only the first few of several matching tags.
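
A brief sketch of limit on a made-up HTML string: the search stops as soon as the requested number of matches has been found, and the matches are taken from the top of the document.

```python
from bs4 import BeautifulSoup

html = "<p>test1</p><p>test2</p><p>test3</p>"  # made-up sample HTML
soup = BeautifulSoup(html, "html.parser")

# Collect at most two <p> tags, in document order.
first_two = [e.get_text() for e in soup.find_all("p", limit=2)]

print(first_two)
```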

4. Advanced version

To find specific structured data, you will often combine the basic patterns above.

・ Specify tags and keywords

#Structure of the tag you want to search
<meta name="test">
elems = soup.find_all("meta",attrs={"name":"test"})

Use this to search for a meta tag that has name="test".

・ Specify tags and text

#Structure of the tag you want to search
<a href="http://○○/△△.html">It's a test</a>
elems = soup.find_all("a",text="It's a test")

Use this to search for an a tag whose text is "It's a test".

・ Specify multiple attributes with attrs

#Structure of the tag you want to search
<a href="http://○○/△△.html" title="test">It's a test</a>
elems = soup.find_all(attrs={"title":"test","href":"http://○○/△△.html"})
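
The combined patterns can be sketched on a made-up HTML string as follows. The sketch uses string, the newer name of the text argument, and placeholder example.com URLs. When a tag name is given together with string, the result is the tag itself rather than its text.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = (
    '<a href="http://example.com/a.html" title="test">It\'s a test</a>'
    '<a href="http://example.com/b.html" title="test">Other</a>'
    '<span title="test">It\'s a test</span>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag name plus text: only <a> tags whose text matches exactly.
by_tag_text = soup.find_all("a", string="It's a test")

# Several attributes at once: every listed attribute must match.
by_attrs = soup.find_all(
    attrs={"title": "test", "href": "http://example.com/a.html"}
)

print(by_tag_text[0]["href"], len(by_attrs))
```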

・ Specify tags before and after the starting point

If the target cannot be pinned down because there are several similar tags, decide on a tag to use as the starting point and search before or after it. Use "next_element" and "previous_element" for this: next_element moves to the element parsed immediately after, and previous_element to the one parsed immediately before. Text between tags, including whitespace and line breaks, also counts as an element.

#Structure of the tag you want to search
<ul>
 <li>
  <a href="http://○○/△△.html">test</a>
 </li>
</ul>

#The part I want to get
<ul>
 <li>
  <a href="http://□□/☆☆.html">test2</a> #Start from this tag
 </li>
</ul>
elems = soup.find_all("a",href="http://□□/☆☆.html")
elems2 = elems[0].previous_element.previous_element #Use .previous_element twice to step back through <li> to <ul> (assuming no whitespace between the tags)
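
How far each step moves depends on whitespace in the source, which the following sketch illustrates with two made-up HTML strings (a ul/li/a structure with a placeholder URL). In the compact version two steps back reach the outer list tag; in the indented version the first step back lands on a whitespace text node instead.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML without whitespace between the tags.
compact = '<ul><li><a href="http://example.com/b.html">test2</a></li></ul>'
a = BeautifulSoup(compact, "html.parser").find("a")

li = a.previous_element                   # the <li> parsed just before <a>
ul = a.previous_element.previous_element  # the <ul> parsed before that
text = a.next_element                     # the text node inside <a>

# The same structure, indented with line breaks and spaces.
indented = '<ul>\n <li>\n  <a href="http://example.com/b.html">test2</a>\n </li>\n</ul>'
a2 = BeautifulSoup(indented, "html.parser").find("a")

ws = a2.previous_element                      # whitespace text node, not <li>
li2 = a2.previous_element.previous_element    # <li> is one more step back

print(li.name, ul.name, text, repr(ws), li2.name)
```

If the page is indented, it may be easier to skip text nodes by checking each element's name while stepping.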

・ Create and specify a function

You can also create functions to retrieve data from complex tag structures.

#Structure of the tag you want to search
<a class="test">~</a> #I want to get only here
<a id="test">~</a>
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

elems = soup.find_all(has_class_but_no_id)
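
As a runnable sketch of the function filter, the following applies has_class_but_no_id to a made-up HTML string. The function is called once per tag and the tag is kept when it returns True.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration.
html = (
    '<a class="test">only class</a>'
    '<a id="test">only id</a>'
    '<a class="x" id="y">both</a>'
    '<p>neither</p>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # Keep tags that define a class attribute but no id attribute.
    return tag.has_attr("class") and not tag.has_attr("id")


texts = [e.get_text() for e in soup.find_all(has_class_but_no_id)]

print(texts)
```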
