Aidemy 2020/9/30
Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary article. Thank you! This is the second post on scraping. Nice to meet you.
What to learn this time
・Scraping methods (see Scraping 1 for the preparatory crawling)
・Scraping multiple pages
・(Review) Scraping means fetching a web page and extracting the necessary data from it.
・There are two approaches to scraping, "regular expressions (the re module)" and "using a third-party library". This time, scraping is done with the more common approach, "using a third-party library".
・XML is a markup language that builds web pages in the same way as HTML. It is more extensible than HTML.
・In HTML and XML there is text surrounded by something like <title></title>; the whole thing is called an element, and the <title> part is called a tag.
・You can scrape easily by using the BeautifulSoup(decoded web page, "parser") method.
・A parser is a program that analyzes (parses) character strings. There are several parsers, each with its own characteristics, and you specify one of them. Examples include "html.parser", which needs no additional library, "lxml", which can process at high speed, and "xml", which supports XML.
# Import requests for crawling and BeautifulSoup for scraping
from bs4 import BeautifulSoup
import requests

# Fetch the page (requests.get returns a Response object)
response = requests.get("https://www.google.co.jp")

# Scrape: decode the response with its text attribute and parse it with the "xml" parser
soup = BeautifulSoup(response.text, "xml")
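As a side note on the parser argument, here is a minimal sketch showing the same response parsed with each of the three parsers named above (assuming the third-party lxml package is installed for "lxml" and "xml"):

# "html.parser" needs no additional library
soup_html = BeautifulSoup(response.text, "html.parser")
# "lxml" is fast but requires the lxml package
soup_lxml = BeautifulSoup(response.text, "lxml")
# "xml" supports XML and also requires the lxml package
soup_xml = BeautifulSoup(response.text, "xml")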
・The necessary data can be extracted from the parsed data produced in the previous section. There are the following two methods.
・If the parsed data is in the variable soup, soup.find("tag name or attribute name") extracts only the first element with that tag or attribute. If you use find_all instead of find, all matching elements are extracted as a list.
・If you want to scrape by class attribute, add class_="class name" to the arguments.
・If the parsed data is in the variable soup, soup.select_one("CSS selector") extracts only the first element that matches the selector. If you use select instead of select_one, all matching elements are extracted as a list.
・A CSS selector is a way of expressing elements in CSS notation. It can also be used to specify an element inside another element (e.g. the h1 element inside a body element is "body > h1").
・As a handy trick, you can copy elements and CSS selectors with Chrome's developer tools, so you can pick out the data you want in a visually clear way without having to print out the decoded data and search through it.
Google_title = soup.find("title")  # <title>Google</title>
Google_h1 = soup.select("body > h1")  # [] (an empty list, because there is no h1 element inside the body element)
・As it stands, Google_title is output with the title tag still attached, but by using the text attribute you can get just the text.
print(Google_title.text)  # Google
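To round out the two extraction methods above, here is a minimal sketch of find_all with the class_ keyword and of select_one (the "entry-title" class name is hypothetical, purely for illustration):

# Extract every h2 element whose class attribute is "entry-title" (hypothetical class name)
titles = soup.find_all("h2", class_="entry-title")
print([t.text for t in titles])  # [] on the Google page, since no such elements exist

# select_one returns only the first element matching the CSS selector, or None if nothing matches
first_link = soup.select_one("body a")
if first_link is not None:
    print(first_link.text)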
・With the methods so far, you can only scrape one page at a time. If you want to scrape multiple pages, get the URLs of the other pages from the links on the top page, and then scrape each URL in a loop.
・The URL of each page can be obtained by joining the top page's URL with the href attribute (the link to the page) of each a element.
top="http://scraping.aidemy.net"
r=requests.get(top)
soup=BeautifulSoup(r.text,"lxml")
url_lists=[]
#Get the URL of another page from the link
#(The method is to first get all the a tags, use get to code the href attribute for each, and connect it to the topURL to make it a URL.)
urls = soup.find_all("a")
for url in urls:
url = top + url.get("href")
url_lists.append(url)
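Note that the simple string concatenation above assumes every href holds a relative path. A more robust variant (a sketch using the standard library's urllib.parse.urljoin, not part of the original course code) also handles absolute URLs and a tags without an href:

from urllib.parse import urljoin

url_lists = []
for a_tag in soup.find_all("a"):
    href = a_tag.get("href")
    if href:  # skip a tags that have no href attribute
        # urljoin keeps absolute URLs as they are and joins relative ones to the base
        url_lists.append(urljoin(top, href))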
・Once you have the URLs of the other pages, actually scrape them. As mentioned above, scraping should be performed on every URL in a loop.
・In the following, the photo titles (listed in h3 tags) are scraped from all the pages acquired in the previous section and gathered into a single list for display.
photo_lists = []

# Fetch each page obtained in the previous section, then scrape the photo titles with BeautifulSoup
for url in url_lists:
    r2 = requests.get(url)
    soup = BeautifulSoup(r2.text, "lxml")
    photos = soup.find_all("h3")
    # Add each photo title to the list with the h3 tag stripped off
    for photo in photos:
        photo_text = photo.text
        photo_lists.append(photo_text)

print(photo_lists)  # ['Minim incididunt pariatur', 'Voluptate', ... (omitted)]
・When scraping a crawled page, first parse it with the BeautifulSoup method.
・Any data can be extracted from the parsed data; use find() or select_one() to extract it.
・If you use the text attribute on the extracted data, the tags are stripped and only the text is obtained.
・When scraping multiple pages at once, extract the links from the top page, join them to the base URL, and scrape each resulting URL individually.
That's all for this time. Thank you for reading this far.