1. Overview

I embedded various tags in HTML for data analysis and tried several ways to have automated tests check whether the embedded data is displayed correctly.

Among them, I was able to locate the target HTML data using Python's BeautifulSoup, so I summarize the method here. **Note: this article covers only the search method and does not describe the analysis itself.**

2. Concept of search method

To get the target HTML data with BeautifulSoup, first identify the tag (the part enclosed in <>) that serves as the starting point. Then narrow the search by specifying, one by one, the information contained in that tag.

If the starting point can be identified by a unique value, specifying that single item is enough, which keeps the search program simple.

The examples from Chapter 3 onward assume that the following program is in place.


import requests
import re
from bs4 import BeautifulSoup

res = requests.get('Describe the URL to be analyzed here')
c=res.content
soup = BeautifulSoup(c,'html.parser')

#Place one of the example snippets from the following chapters here

print(elems)
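To try the examples without network access, the same setup can be sketched with an inline HTML string instead of requests. The markup below is a made-up stand-in for a downloaded page, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the content of a downloaded page.
html = """
<html><head><title>Sample</title><script>var x = 1;</script></head>
<body><h1>Heading</h1><p>Body text</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Any example from the later chapters can be placed here.
elems = soup.find_all("script")

print(elems)
```

Swapping the html string for res.content gives back the original setup.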

3. Basic data acquisition method

To get a single element, use find; to get every matching element, use find_all (there is also a method called select, but it is out of scope here). The examples below use find_all.

・ Pattern to specify tags directly

#Structure of the tag you want to search
<script>～</script>

elems = soup.find_all("script")

Use this when searching for elements by tag name alone.
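As a small sketch of the difference between find and find_all mentioned at the start of this chapter, the following uses a made-up HTML fragment. find returns a single tag (or None when nothing matches), while find_all returns a list-like result that is empty when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")               # the first matching tag
all_p = soup.find_all("p")           # every matching tag
missing = soup.find("table")         # None, because no <table> exists
none_found = soup.find_all("table")  # empty result, not None

print(first.get_text(), len(all_p), missing, len(none_found))
```

Because find can return None, it is safer to check the result before accessing attributes on it.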

・ Pattern to specify multiple tags (using a list)

#Structure of the tag you want to search
<h1>～</h1>
<div>～</div>

elems = soup.find_all(["h1","div"])

Use this when you want to search for several kinds of tags at once.
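A minimal sketch of the list form, using a made-up fragment: the results come back in document order, not grouped by tag name, and an outer tag is returned before the tags nested inside it.

```python
from bs4 import BeautifulSoup

# Made-up fragment: an outer <div> containing an <h1>, a <p>, and an inner <div>.
html = "<div><h1>Title</h1><p>skip</p><div>inner</div></div>"
soup = BeautifulSoup(html, "html.parser")

elems = soup.find_all(["h1", "div"])
names = [e.name for e in elems]

print(names)
```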

・ Pattern to specify keywords (exact match)

#Structure of the tag you want to search
<a class = "test">～</a>

elems = soup.find_all(class_="test")

Use this when a tag has an attribute whose value is assigned with "=". When specifying a class, the keyword must be written as "class_", because class is a reserved word in Python. To match any of several values, pass a list with [] as shown below.

elems = soup.find_all(id=["test1", "test2"])
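The following sketch, on made-up markup, shows two details of keyword matching. For class, the value is compared against each individual class of the tag, so class_="test" also matches a tag with class="test other" but not one with class="testing". A list of values matches a tag if any one of the values matches.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<a class="test" id="test1">one</a>
<a class="test other" id="test2">two</a>
<a class="testing" id="test3">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="test")
by_ids = soup.find_all(id=["test1", "test3"])

class_texts = [a.get_text() for a in by_class]
id_texts = [a.get_text() for a in by_ids]
print(class_texts, id_texts)
```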

・ Pattern to specify keywords (partial match)

#Structure of the tag you want to search
<a href="http://○○/△△.html">～</a>

elems = soup.find_all(href=re.compile("http://"))

Use this to match part of the value assigned with "=".
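As a sketch of partial matching on made-up links: a compiled regular expression is applied as a search, so the pattern may match anywhere in the value unless it is anchored with ^ or $. The example.com URLs are placeholders.

```python
import re
from bs4 import BeautifulSoup

# Made-up links for illustration only.
html = """
<a href="http://example.com/a.html">absolute</a>
<a href="/relative/b.html">relative</a>
<a href="https://example.com/c.html">secure</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches values that contain "http://" anywhere.
plain = soup.find_all(href=re.compile("http://"))
# Anchored pattern that accepts both http and https at the start of the value.
either = soup.find_all(href=re.compile(r"^https?://"))

plain_texts = [a.get_text() for a in plain]
either_texts = [a.get_text() for a in either]
print(plain_texts, either_texts)
```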

・ Pattern to specify keywords (exact match using the attrs argument)

#Structure of the tag you want to search
<a href="http://○○/△△.html">～</a>

elems = soup.find_all(attrs={"href":"http://○○/△△.html"})

Use "attrs" when the attribute name cannot be used as a keyword argument, such as an HTML5 data-* attribute, or name, which collides with the first parameter of find_all.

#Example)
×　elems = soup.find_all("meta",name="test")
⇒TypeError: find_all() got multiple values for argument 'name'

○　elems = soup.find_all("meta",attrs={"name":"test"})
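A short sketch of both cases on made-up markup: name needs attrs because it collides with the first parameter of find_all, and data-role needs attrs because a hyphen is not allowed in a Python keyword argument.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<meta name="test" content="x">
<div data-role="page">d</div>
<div data-role="dialog">e</div>
"""
soup = BeautifulSoup(html, "html.parser")

metas = soup.find_all("meta", attrs={"name": "test"})
pages = soup.find_all(attrs={"data-role": "page"})

print(metas[0]["content"], pages[0].get_text())
```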

・ Pattern to check whether an attribute is present (True if the attribute exists, False if it does not)

#Structure of the tag you want to search
<a href="http://○○/△△.html">～</a>

elems = soup.find_all(href=True)

Attributes such as href can hold any value, so use this when you want to find every tag that has the attribute, whatever its value. To find tags that do not have the attribute, specify False as follows.

elems = soup.find_all(id=False)
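The presence check can be sketched as follows on made-up markup. The tag name "a" is added here only to keep the results limited to links.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<a href="/x" id="link1">has both</a><a href="/y">no id</a><a>bare</a>'
soup = BeautifulSoup(html, "html.parser")

with_href = soup.find_all("a", href=True)   # <a> tags that have an href
without_id = soup.find_all("a", id=False)   # <a> tags that have no id

href_texts = [a.get_text() for a in with_href]
no_id_texts = [a.get_text() for a in without_id]
print(href_texts, no_id_texts)
```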

・ Pattern to search for text enclosed in tags (exact match)

#Structure of the tag you want to search
<a href="http://○○/△△.html">User guide</a> #I want to search only for this "User guide"

elems = soup.find_all(text='User guide')

Use this when you want to extract only the text enclosed in tags. (In Beautiful Soup 4.4.0 and later, the same argument is also available under the name string.)

・ Pattern to search for text enclosed in tags (partial match)

#Structure of the tag you want to search
<a href="http://○○/△△.html">User guide</a> #I want to search only for this "User guide"

elems = soup.find_all(text=re.compile("User"))

As above, this extracts only the text enclosed in tags, but by partial match rather than exact match.
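A sketch of both text searches on made-up markup. It uses the string argument, which is the name used for text in Beautiful Soup 4.4.0 and later. When only the text is given, the results are the text nodes themselves rather than tags, and the enclosing tag can be reached through .parent.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<a href="/guide.html">User guide</a><p>Installation guide</p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="User guide")
partial = soup.find_all(string=re.compile("guide"))

# Each result is a text node; .parent gives the tag that encloses it.
parents = [s.parent.name for s in partial]
print(exact, partial, parents)
```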

・ Pattern to limit the number of tags collected (find_all only)

#Structure of the tag you want to search
<p>test1</p> #I want to get only here
<p>test2</p>

elems = soup.find_all('p', limit=1)

Use this when you want to get only a specified number of the matching tags, counted from the first match.
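A minimal sketch of limit on made-up markup: the search stops once the requested number of matches has been collected, so the first matches in document order are returned.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = "<p>test1</p><p>test2</p><p>test3</p>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("p", limit=2)
texts = [p.get_text() for p in first_two]
print(texts)
```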

4. Advanced version

To find data with a specific structure, **you will often combine the basic patterns above**.

・ Specify tags and keywords

#Structure of the tag you want to search
<meta name="test">

elems = soup.find_all("meta",attrs={"name":"test"})

Use this to search for meta tags whose name attribute is "test".

・ Specify tags and text

#Structure of the tag you want to search
<a href="http://○○/△△.html">It's a test</a>

elems = soup.find_all("a",text="It's a test")

Use this to search for a tags whose text is "It's a test".

・ Specify multiple items with attrs

#Structure of the tag you want to search
<a href="http://○○/△△.html" title="test">It's a test</a>

elems = soup.find_all(attrs={"title":"test","href":"http://○○/△△.html"})
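The three combinations above can be sketched together on made-up markup. When a tag name is combined with a text condition, the results are the tags themselves, not the text nodes. When several attributes are given in attrs, a tag must satisfy all of them. The example.com URL and the string argument (the newer name of text) are choices made for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<meta name="test" content="m">
<a href="http://example.com/a.html" title="test">It's a test</a>
<a href="http://example.com/a.html" title="other">It's a test</a>
<p>It's a test</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag_attr = soup.find_all("meta", attrs={"name": "test"})
by_tag_text = soup.find_all("a", string="It's a test")
by_two_attrs = soup.find_all(
    attrs={"title": "test", "href": "http://example.com/a.html"}
)

print(len(by_tag_attr), len(by_tag_text), len(by_two_attrs))
```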

・ Specify tags before and after the starting point

If the target cannot be pinpointed because several similar tags exist, **choose an element that can serve as the starting point and search before and after it.** Use "next_element" and "previous_element" for this: next_element moves to the element that follows, and previous_element moves to the one that precedes.

#Structure of the tag you want to search
<ul>
　<li>
　　<a href="http://○○/△△.html">test</a>
　</li>
</ul>

#This is the block I want to get
<ul>
　<li>
　　<a href="http://□□/☆☆.html">test2</a> #Start from this tag
　</li>
</ul>

elems = soup.find_all("a",href="http://□□/☆☆.html")
#Apply .previous_element twice to step back toward the enclosing <ul><li>.
#Text nodes (including whitespace between tags) count as elements, so the number of steps depends on the markup.
elems2 = elems[0].previous_element.previous_element
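The following sketch shows how the number of previous_element steps depends on whitespace, using made-up markup with placeholder example.com URLs. In compact markup, two steps from the a tag reach the enclosing ul. In indented markup, the first step lands on a whitespace text node instead. find_parent is shown as an alternative that does not depend on the layout; it is not part of the original example.

```python
from bs4 import BeautifulSoup

# Compact markup: no whitespace between tags.
compact = (
    "<ul><li><a href='http://example.com/1.html'>test</a></li></ul>"
    "<ul><li><a href='http://example.com/2.html'>test2</a></li></ul>"
)
soup = BeautifulSoup(compact, "html.parser")
start = soup.find("a", href="http://example.com/2.html")

li = start.previous_element   # the <li> that opens just before <a>
ul = li.previous_element      # the <ul> that opens just before <li>

# Alternative that does not depend on whitespace: walk up to the parent <ul>.
same_ul = start.find_parent("ul")

# Indented markup: the node just before <a> is a whitespace text node.
indented = "<ul>\n  <li>\n    <a href='http://example.com/2.html'>test2</a>\n  </li>\n</ul>"
soup2 = BeautifulSoup(indented, "html.parser")
gap = soup2.find("a").previous_element

print(li.name, ul.name, same_ul is ul, repr(gap))
```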

・ Create and specify a function

You can also create functions to retrieve data from complex tag structures.

#Structure of the tag you want to search
<a class="test">～</a> #I want to get only here
<a id="test">～</a>

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

elems = soup.find_all(has_class_but_no_id)
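A runnable sketch of the function pattern on made-up markup: find_all calls the function once for each tag and keeps the tags for which it returns True.

```python
from bs4 import BeautifulSoup

# Made-up markup covering all four combinations of class and id.
html = (
    '<a class="test">only class</a>'
    '<a id="test">only id</a>'
    '<a class="x" id="y">both</a>'
    "<a>neither</a>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # True only for tags that define class and do not define id.
    return tag.has_attr("class") and not tag.has_attr("id")


elems = soup.find_all(has_class_but_no_id)
texts = [e.get_text() for e in elems]
print(texts)
```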
