I embedded various tag data in HTML for data analysis, and tried several approaches to see whether an automated test could check that the embedded data is output correctly.
Of these, Python's BeautifulSoup let me search for the target HTML data, so I will summarize how. **Note: this article covers only search methods, not analysis.**
To get the target HTML data with BeautifulSoup, first pick the tag (the data enclosed in <>) that serves as the starting point. Then narrow the search by specifying, one at a time, the information that the starting tag contains.
If the starting point can be identified by a unique value, specifying that single item is enough, which keeps the search program simple.
To run the example sentences from Chapter 3 onward, the following program is assumed to be in place.
import re
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with Python's built-in HTML parser
res = requests.get('Describe the URL to be analyzed here')
c = res.content
soup = BeautifulSoup(c, 'html.parser')

# Put one of the example sentences here (each one assigns elems)
print(elems)
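To try the example sentences without a live URL, the same setup can be sketched against an HTML string. The markup and the example.com address below are hypothetical stand-ins for the page that requests would fetch.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for res.content
html = """
<html><head><title>Sample</title><script>var a = 1;</script></head>
<body><h1>Heading</h1>
<a class="test" href="http://example.com/a.html">User guide</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# An example sentence goes here; this one collects every <a> tag
elems = soup.find_all("a")
print(elems)
```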
Use find to get a single element and find_all to get every matching element (there is also a method called select, but it is out of scope here). The following example sentences use find_all.
#Structure of the tag you want to search
<script>~</script>
elems = soup.find_all("script")
Use this to search for the part enclosed by one kind of tag.
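As a minimal sketch of how find and find_all differ on the same tag name, assuming a small hypothetical HTML string rather than a fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <script> tags
html = "<head><script>var a = 1;</script><script>var b = 2;</script></head>"
soup = BeautifulSoup(html, "html.parser")

all_scripts = soup.find_all("script")  # list-like result with every match
first_script = soup.find("script")     # only the first match
missing = soup.find("table")           # None when nothing matches

print(len(all_scripts))
print(first_script.string)
print(missing)
```

Because find returns None when nothing matches, checking the result before accessing attributes avoids an AttributeError.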
#Structure of the tag you want to search
<h1>~</h1>
<div>~</div>
elems = soup.find_all(["h1","div"])
Use this when you want to search for several kinds of tags at once.
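A small sketch of the list form, assuming hypothetical markup; the results come back in document order, not grouped by tag name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing the tags we want with one we do not
html = "<div><h1>Title</h1><p>skip</p></div><div>second</div>"
soup = BeautifulSoup(html, "html.parser")

elems = soup.find_all(["h1", "div"])

# Tag names in the order they appear in the document
names = [tag.name for tag in elems]
print(names)
```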
#Structure of the tag you want to search
<a class = "test">~</a>
elems = soup.find_all(class_="test")
Use this when a tag has an attribute whose value is assigned with "=". When specifying a class, write "class_", because class is a reserved word in Python. To match any one of several values, pass them in a list with [] as shown below.
elems = soup.find_all(id=["test1", "test2"])
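The two keyword forms can be sketched together as follows. The markup is hypothetical; note that class_ matches a tag when any one of its space-separated classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> carries two classes
html = (
    '<a class="test big" id="test1">one</a>'
    '<a class="other" id="test2">two</a>'
    '<a id="test3">three</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="test")        # matches "test big" as well
by_id = soup.find_all(id=["test1", "test2"])   # either id is accepted

print([tag.get_text() for tag in by_class])
print([tag["id"] for tag in by_id])
```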
#Structure of the tag you want to search
<a href="http://○○/△△.html">~</a>
elems = soup.find_all(href=re.compile("http://"))
Use this to search for a partial match on the value assigned with "=".
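Beautiful Soup applies a compiled pattern with a search, so the pattern can match anywhere in the value. The sketch below, using hypothetical links, contrasts an unanchored pattern with one anchored at the start of the value.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical links: plain http, https, and a relative URL that embeds "http://"
html = (
    '<a href="http://example.com/a.html">a</a>'
    '<a href="https://example.com/b.html">b</a>'
    '<a href="/local?next=http://x">c</a>'
)
soup = BeautifulSoup(html, "html.parser")

loose = soup.find_all(href=re.compile("http://"))       # match anywhere in the value
anchored = soup.find_all(href=re.compile(r"^http://"))  # match only at the start

print([tag.get_text() for tag in loose])
print([tag.get_text() for tag in anchored])
```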
#Structure of the tag you want to search
<a href="http://○○/△△.html">~</a>
elems = soup.find_all(attrs={"href":"http://○○/△△.html"})
Use "attrs" for attribute names that cannot be written as keyword arguments, such as HTML5 data-* attributes, or that collide with an argument of find_all, such as name.
#Example)
× elems = soup.find_all("meta",name="test")
⇒TypeError: find_all() got multiple values for argument 'name'
○ elems = soup.find_all("meta",attrs={"name":"test"})
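Both cases where attrs is needed can be sketched on hypothetical markup: the name attribute, which collides with the first argument of find_all, and a data-* attribute, whose hyphen is not valid in a Python keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; data-foo is an illustrative attribute name
html = '<meta name="test" content="x"><div data-foo="value">d</div>'
soup = BeautifulSoup(html, "html.parser")

# name="test" as a keyword would clash with find_all's own name argument
metas = soup.find_all("meta", attrs={"name": "test"})

# data-foo="value" cannot be written as a keyword because of the hyphen
data_tags = soup.find_all(attrs={"data-foo": "value"})

print(metas[0]["content"])
print(data_tags[0].get_text())
```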
#Structure of the tag you want to search
<a href="http://○○/△△.html">~</a>
elems = soup.find_all(href=True)
Attributes such as href can hold any value, so use True when you want to find every tag that has the attribute, whatever its value. To find tags that do not have the attribute, specify False as follows.
elems = soup.find_all(id=False)
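A short sketch of both forms on hypothetical markup. Note that a filter such as id=False, used without a tag name, looks at every tag in the document, so the result can be large on a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with href, one without, and a <p> with an id
html = '<a href="http://example.com/">linked</a><a>bare</a><p id="x">p</p>'
soup = BeautifulSoup(html, "html.parser")

with_href = soup.find_all(href=True)  # tags that have an href, whatever its value
no_id = soup.find_all(id=False)       # tags that have no id attribute

print([tag.get_text() for tag in with_href])
print([tag.name for tag in no_id])
```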
#Structure of the tag you want to search
<a href="http://○○/△△.html">User guide</a> #I want to search only for this "User guide"
elems = soup.find_all(text='User guide')
Use this when you want to extract only the text enclosed in tags. The text must match exactly.
#Structure of the tag you want to search
<a href="http://○○/△△.html">User guide</a> #I want to search only for this "User guide"
elems = soup.find_all(text=re.compile("guide"))
Use this when you want to extract only the text enclosed in tags, using a partial match rather than an exact match.
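The exact and partial text searches can be compared side by side. This sketch assumes Beautiful Soup 4.4.0 or later, where the argument written as text in this article is also available under the name string. The markup is hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with two pieces of text containing "guide"
html = '<a href="http://example.com/guide.html">User guide</a><p>Admin guide</p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="User guide")             # whole text must match
partial = soup.find_all(string=re.compile("guide"))    # text only needs to contain "guide"

# The results are text nodes, not tags; .parent leads back to the enclosing tag
parents = [s.parent.name for s in partial]

print(exact)
print(partial)
print(parents)
```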
#Structure of the tag you want to search
<p>test1</p> #I want to get only here
<p>test2</p>
elems = soup.find_all('p', limit=1)
Use this to cap the number of results when several tags match.
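A minimal sketch of limit on hypothetical markup; the search stops once the requested number of matches has been collected:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three paragraphs
html = "<p>test1</p><p>test2</p><p>test3</p>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("p", limit=2)
texts = [p.get_text() for p in first_two]
print(texts)
```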
To find specific structural data, **you will often combine the basic patterns above**.
#Structure of the tag you want to search
<meta name="test">
elems = soup.find_all("meta",attrs={"name":"test"})
Use this to search for a meta tag that has name="test".
#Structure of the tag you want to search
<a href="http://○○/△△.html">It's a test</a>
elems = soup.find_all("a",text="It's a test")
Use this to search for an a tag whose text is "It's a test".
#Structure of the tag you want to search
<a href="http://○○/△△.html" title="test">It's a test</a>
elems = soup.find_all(attrs={"title":"test","href":"http://○○/△△.html"})
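The combinations above can be sketched on one hypothetical document. Each extra condition narrows the result further. The string argument used here assumes Beautiful Soup 4.4.0 or later and corresponds to text in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links and a span sharing the same title
html = (
    '<a href="http://example.com/a.html" title="test">It\'s a test</a>'
    '<a href="http://example.com/b.html" title="test">Other</a>'
    '<span title="test">It\'s a test</span>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag name + one attribute: both links, but not the span
by_attr = soup.find_all("a", attrs={"title": "test"})

# Tag name + two attributes: only the first link
by_both = soup.find_all(
    "a", attrs={"title": "test", "href": "http://example.com/a.html"}
)

# Tag name + text: returns the <a> tag itself, not just its text
by_text = soup.find_all("a", string="It's a test")

print(len(by_attr), len(by_both), len(by_text))
```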
Also, if the target structural data cannot be pinpointed because there are several similar tags, **decide on structural data to serve as the starting point and search before and after it.** Use "next_element" and "previous_element" to move forward and backward (next_element moves to the element parsed immediately after, previous_element to the one parsed immediately before).
#Structure of the tag you want to search
<ul>
<li>
<a href="http://○○/△△.html">test</a>
</li>
</ul>
#I want to get it here
<ul>
<li>
<a href="http://□□/☆☆.html">test2</a> #Get data from here
</li>
</ul>
elems = soup.find_all("a",href="http://□□/☆☆.html")
elems2 = elems[0].previous_element.previous_element # Use .previous_element twice to step back through <li> to <ul> (whitespace between tags also counts as an element)
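The stepping can be sketched on hypothetical markup written without whitespace between tags, so that each previous_element step lands on a tag. When the HTML contains line breaks between tags, each line break is parsed as its own text node and takes up a step. find_parent, shown last, is one alternative that does not depend on whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, deliberately written without whitespace between tags
html = (
    '<ul><li><a href="http://example.com/a.html">test</a></li></ul>'
    '<ul><li><a href="http://example.com/b.html">test2</a></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", href="http://example.com/b.html")

li = link.previous_element   # the <li> parsed just before the link
ul = li.previous_element     # the <ul> parsed just before that <li>
text = link.next_element     # the text node inside the link

# Alternative: walk up the tree by tag name instead of counting steps
ul_via_parent = link.find_parent("ul")

print(li.name, ul.name, text)
```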
You can also create functions to retrieve data from complex tag structures.
#Structure of the tag you want to search
<a class="test">~</a> #I want to get only here
<a id="test">~</a>
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
elems = soup.find_all(has_class_but_no_id)
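The function filter can be tried on a hypothetical snippet as follows. find_all calls the function once for each tag and keeps the tags for which it returns True.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class only, id only, and both
html = (
    '<a class="test">one</a>'
    '<a id="test">two</a>'
    '<a class="x" id="y">three</a>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # True only for tags that define class and do not define id
    return tag.has_attr("class") and not tag.has_attr("id")


elems = soup.find_all(has_class_but_no_id)
texts = [tag.get_text() for tag in elems]
print(texts)
```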