Web Scraping with Python for Beginners (1): Improved Version

This article is an improved version of the source code from the previous article, Beginners Web Scraping with Python (1). By changing how I use the bs4 methods, I fixed the problem of extra items appearing at the end of the extracted Yahoo News headlines, so this is a memo of that change.

Roadmap for learning web scraping in Python

(1) Get scraping of the target content working locally for now. ← Currently here
(2) Link the locally scraped results to Google Spreadsheet.
(3) Run it automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(5) Try free serverless automatic execution in the cloud (maybe Cloud Functions + Cloud Scheduler).

Features of the sample program (1)

・Get website information using requests
・Parse HTML with Beautiful Soup
・~~Search for a specific string with the re library (to identify headline news)~~
・Extract titles and links cleanly using only bs4 methods
・Display all news titles and links from the result list on the console

Previous sample source

requests-test.py


import requests
from bs4 import BeautifulSoup
import re

# Download the page with requests
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) # HTTP status code, normally 200 (OK)
print('headers[Content-Type]:',response.headers['Content-Type']) # headers is dict-like, so look up Content-Type by key
print('encoding: ',response.encoding) # Encoding

# Pass the downloaded HTML and the "html.parser" parser to BeautifulSoup()
soup = BeautifulSoup(response.text, "html.parser")

# Extract only the elements whose href attribute contains "news.yahoo.co.jp/pickup"
# (dots are escaped so that they match a literal "." only)
elems = soup.find_all(href = re.compile(r"news\.yahoo\.co\.jp/pickup"))

# Print the title and link of each extracted news item to the console
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])
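
The status code printed above can also be checked in code before the HTML is parsed. The following is a minimal sketch, not part of the original script: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        print('request failed:', exc)
        return None
    return response.text


# A URL without a scheme is rejected by requests before any network access,
# so this call exercises the error path without touching the site.
result = fetch_html('news.yahoo.co.jp')
print(result)
```

Returning None instead of raising is only one possible design; for a script run from cron it may be preferable to let the exception propagate so that the failure is visible in the logs.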

Improved sample source

Find the div with the 'topicsList' CSS class, then find_all the 'li' tags inside it, then find the 'a' tag in each li.

requests-test.py


import requests
from bs4 import BeautifulSoup

# Download the page with requests
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)

# Pass the downloaded HTML and the "html.parser" parser to BeautifulSoup()
soup = BeautifulSoup(response.text, "html.parser")
print('soup: ',type(soup))

topicsindex = soup.find('div', class_='topicsList')
#topicsindex = soup.find('div', attrs={'class': 'topicsList'})
print('topicsindex: ',type(topicsindex))

#### Method (1)
# Extract the li tags first, then pull out the a tag from each one inside the for loop
topics = topicsindex.find_all('li')
#print(topics)
print('topics',type(topics))

# Print the title and link of each extracted news item to the console
for topic in topics:
    print(topic.find('a').contents[0])
    print(topic.find('a').attrs['href'])

#### Method (2)
# Extract down to the a tags with a list comprehension, then loop over them
headlines = [i.find('a') for i in topicsindex.find_all('li')]
print(headlines)
print(type(headlines))
# Print the title and link of each extracted news item to the console
for headline in headlines:
    print(headline.contents[0])
    print(headline.attrs['href'])

This time, I also learned various ways to extract elements with bs4. I worked out a strategy while roughly analyzing the HTML of Yahoo! News, which is complicated and hard to follow. (Screenshot 2020-09-12 17.58.47.png)

The main topics of the top news are structured as follows:
・They are defined inside the 'topicsList' class
・Each news item is listed with an 'li' tag inside it
・The link is attached via the href attribute of the 'a' tag
So the strategy is to pull these out in that order.
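
The same class → li → a order can also be written as a single CSS selector with Beautiful Soup's select() method. This is a different technique from the find / find_all calls used in this article, shown only for comparison. The HTML below is a small hypothetical snippet that imitates the structure described above (the "other" div and the example.com link are made up); the real page is larger and may change.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating the structure: div.topicsList > li > a
SAMPLE_HTML = """
<div class="topicsList">
  <ul>
    <li><a href="https://news.yahoo.co.jp/pickup/6370970">Two people died in flames after a car crash in Hakone</a></li>
    <li><a href="https://news.yahoo.co.jp/pickup/6370965">In the tumulus?Notice in the HP survey map</a></li>
  </ul>
</div>
<div class="other">
  <ul><li><a href="https://example.com/">Unrelated link</a></li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.topicsList li a" means: a tags inside li tags inside div.topicsList
anchors = soup.select("div.topicsList li a")

# Collect (title, link) pairs; get_text() returns the text inside the tag
pairs = [(a.get_text(), a.get("href")) for a in anchors]

for title, href in pairs:
    print(title)
    print(href)
```

The link in the "other" div is skipped because the selector only matches inside the topicsList div.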

The code around soup and find is shown below.

requests-test.py


soup = BeautifulSoup(response.text, "html.parser")
topicsindex = soup.find('div', class_='topicsList')
topics = topicsindex.find_all('li')
for topic in topics:    
    print(topic.find('a').contents[0])
    print(topic.find('a').attrs['href'])

The following prints the type of each of these objects.

requests-test.py


print('soup: ',type(soup))
print('topicsindex: ',type(topicsindex))
print('topics: ',type(topics))
print('topic: ',type(topic))

Here are the results.

bash


soup:  <class 'bs4.BeautifulSoup'>
topicsindex:  <class 'bs4.element.Tag'>
topics:  <class 'bs4.element.ResultSet'>
topic:  <class 'bs4.element.Tag'>

soup is a 'BeautifulSoup' object, the result of finding the div by its CSS class is a 'Tag' object, and the result of find_all on the li tags is a 'ResultSet' object (which I could not find in the documentation). 'ResultSet' is a subclass of Python's list, and its individual elements are also 'Tag' objects. Since each element 'topic' of 'topics' is a Tag object, methods such as find can be used on it. The string enclosed by the tag (that is, the news title) can be taken from contents, and the href attribute can also be extracted.
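
These type relationships can be checked with isinstance instead of reading the printed class names. The sketch below runs on a one-item hypothetical HTML string rather than the live page, and the variable names ending in _is_ are introduced only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, ResultSet, Tag

# Hypothetical one-item snippet with the same div > li > a structure
html = (
    '<div class="topicsList"><ul>'
    '<li><a href="https://news.yahoo.co.jp/pickup/6370962">'
    'Docomo account Get information from fake HP</a></li>'
    '</ul></div>'
)

soup = BeautifulSoup(html, "html.parser")
topicsindex = soup.find('div', class_='topicsList')
topics = topicsindex.find_all('li')
topic = topics[0]
title = topic.find('a').contents[0]

# find() returns a Tag; find_all() returns a ResultSet, which is also a list
index_is_tag = isinstance(topicsindex, Tag)
topics_is_resultset = isinstance(topics, ResultSet)
topics_is_list = isinstance(topics, list)
topic_is_tag = isinstance(topic, Tag)

# The text inside a tag is a NavigableString, which behaves like a str
title_is_navigable = isinstance(title, NavigableString)
title_is_str = isinstance(title, str)

print(index_is_tag, topics_is_resultset, topics_is_list, topic_is_tag)
print(title_is_navigable, title_is_str)
```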

I tried two approaches: method (1), which extracts the li tags and then pulls out the a tag inside the for loop, and method (2), which extracts down to the a tags with a list comprehension and then loops over them. Both give the same result. Topics with photos are no longer extracted in strange ways as they were last time!
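
The difference between the previous and the improved approach can be reproduced offline. The sketch below assumes, purely for illustration, that the page also contains a photo link to a pickup URL outside the topics list; the "photo" div, thumb.jpg, and the caption are hypothetical and not taken from the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML: two headlines plus a photo link outside the topics list
SAMPLE_HTML = """
<div class="topicsList">
  <ul>
    <li><a href="https://news.yahoo.co.jp/pickup/6370961">Two Fujii crowns and five consecutive losses to Toyoshima Ryuo</a></li>
    <li><a href="https://news.yahoo.co.jp/pickup/6370964">17-year-old "new and youngest Go player" born</a></li>
  </ul>
</div>
<div class="photo">
  <a href="https://news.yahoo.co.jp/pickup/6370964"><img src="thumb.jpg" alt=""/>Photo caption</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Previous approach: match the href attribute anywhere on the page
by_href = soup.find_all(href=re.compile(r"news\.yahoo\.co\.jp/pickup"))
previous = [str(elem.contents[0]) for elem in by_href]

# Improved approach: only look inside the topicsList div
topicsindex = soup.find('div', class_='topicsList')

# Method (1): loop over the li tags and find the a tag in each
method1 = []
for topic in topicsindex.find_all('li'):
    a = topic.find('a')
    method1.append((str(a.contents[0]), a.attrs['href']))

# Method (2): collect the a tags first with a list comprehension
headlines = [li.find('a') for li in topicsindex.find_all('li')]
method2 = [(str(a.contents[0]), a.attrs['href']) for a in headlines]

print(previous)
print(method1 == method2)
```

In this snippet the href match also picks up the photo link, whose first child is an img tag rather than a title, while both scoped methods return only the two headlines.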

bash


Docomo account Get information from fake HP
https://news.yahoo.co.jp/pickup/6370962
Political words to decide to resign
https://news.yahoo.co.jp/pickup/6370953
Two people died in flames after a car crash in Hakone
https://news.yahoo.co.jp/pickup/6370970
In the tumulus?Notice in the HP survey map
https://news.yahoo.co.jp/pickup/6370965
Brain disease in the fetus Selected abortion
https://news.yahoo.co.jp/pickup/6370957
Mountains of scrap metal in the suburbs Why
https://news.yahoo.co.jp/pickup/6370958
Two Fujii crowns and five consecutive losses to Toyoshima Ryuo
https://news.yahoo.co.jp/pickup/6370961
17-year-old "new and youngest Go player" born
https://news.yahoo.co.jp/pickup/6370964

By the way, the result of extracting the a tags in advance with method (2) is an ordinary 'list' object. Looking at Chrome's element inspector I could not make sense of anything, but once extracted this far, you can see it has a simple shape that is easy for beginners to follow.
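
Because the result is a plain list of Tag objects, it is easy to turn into simple records. The sketch below is one possible way to do that; the third li without a link is a hypothetical case added to show that find('a') returns None when nothing matches, and the dictionary keys are names chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Two items modelled on the output in this article, plus a hypothetical li without a link
html = """<div class="topicsList"><ul>
<li><a data-ual-gotocontent="true" href="https://news.yahoo.co.jp/pickup/6370962">Docomo account Get information from fake HP</a></li>
<li><a data-ual-gotocontent="true" href="https://news.yahoo.co.jp/pickup/6370953">Political words to decide to resign<span aria-label="NEW" class="labelIcon labelIcon-NEW"></span></a></li>
<li>No link in this item</li>
</ul></div>"""

soup = BeautifulSoup(html, "html.parser")
topicsindex = soup.find('div', class_='topicsList')

# find('a') returns None for an li without a link, so those are filtered out
found = (li.find('a') for li in topicsindex.find_all('li'))
headlines = [a for a in found if a is not None]

rows = []
for a in headlines:
    # The NEW label is a span inside the a tag; contents[0] is still the title text
    is_new = a.find('span', attrs={'aria-label': 'NEW'}) is not None
    rows.append({'title': str(a.contents[0]), 'href': a.get('href'), 'new': is_new})

print(type(headlines))
for row in rows:
    print(row)
```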

bash


<class 'list'>
[<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:1;" href="https://news.yahoo.co.jp/pickup/6370962">Docomo account Get information from fake HP</a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:2;" href="https://news.yahoo.co.jp/pickup/6370953">Political words to decide to resign<span aria-label="NEW" class="labelIcon labelIcon-NEW"></span></a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:3;" href="https://news.yahoo.co.jp/pickup/6370970">Two people died in flames after a car crash in Hakone<span aria-label="NEW" class="labelIcon labelIcon-NEW"></span></a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:4;" href="https://news.yahoo.co.jp/pickup/6370965">In the tumulus?Notice in the HP survey map</a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:5;" href="https://news.yahoo.co.jp/pickup/6370957">Brain disease in the fetus Selected abortion<span aria-label="NEW" class="labelIcon labelIcon-NEW"></span></a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:6;" href="https://news.yahoo.co.jp/pickup/6370958">Mountains of scrap metal in the suburbs Why<span aria-label="NEW" class="labelIcon labelIcon-NEW"></span></a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:7;" href="https://news.yahoo.co.jp/pickup/6370961">Two Fujii crowns and five consecutive losses to Toyoshima Ryuo</a>, 
<a data-ual-gotocontent="true" data-ylk="rsec:tpc_maj;slk:title;pos:8;" href="https://news.yahoo.co.jp/pickup/6370964">17-year-old "new and youngest Go player" born</a>]

Afterword

This time, I tried various things while reading the official documentation. The royal road is to keep the official documentation as the foundation while also referring to the websites of those who came before us.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (English: Beautiful Soup version 4.9.1 as of 9/13/2020)
http://kondou.com/BS4/ (Japanese: Beautiful Soup version 4.2.0 as of 9/13/2020)
