Scraping Google News search results in Python (2): Using Beautiful Soup

If you search Google News for keywords or phrases you are interested in, it displays about 100 articles sorted by relevance and release date. To trace how a hit food product emerged, you can search for keywords and phrases likely to be related to it, dig through past news, and check on Google Trends how much interest rose around the time each story was published. This makes it possible to explore the process that led up to the hit, and also to catch topics that may lead to new hits. In a previous report, "Scraping Google News in Python and editing in R", I introduced how to parse the Google News RSS feed in Python with feedparser. With that method, however, the summary text has been identical to the title text since around October 2019.

Therefore, this time I will introduce a script that uses Beautiful Soup to extract article information from the Google News search results page. Unlike feedparser, which returns article information already organized, here you have to search the search results web page for the location of the article information yourself and specify what to extract by tags, elements, and attributes.

In this article, I will show how to locate the article information you want to retrieve with Google Chrome, and a script that extracts the article information from the obtained page structure using the requests and Beautiful Soup libraries.
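Before going into the page analysis, here is a minimal sketch of the fetch-and-parse flow that the full script below is built on (the search URL format is the one used later in this article; only the skeleton is shown):

import requests
from bs4 import BeautifulSoup

#Fetch a Google News search results page and parse it into a navigable tree
res = requests.get("https://news.google.com/search?q=test&hl=ja&gl=JP&ceid=JP%3Aja")
soup = BeautifulSoup(res.content, "html.parser")
print(soup.title.text)   #Print the page title to confirm the fetch and parse worked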

1. Analyzing the article information on the search results page with Google Chrome

The search word used was "Tapiru" (タピる), which was selected as one of the top ten words in the 2019 new words and buzzwords awards. The search returns a results page like the one examined below.

To examine the structure of this page, hover over an article title, right-click, and choose "Inspect" at the bottom of the menu that appears.


The HTML element tree of the page is displayed on the upper right. In this window, locate the article information and identify the tags and attributes needed to extract it.


The displayed HTML code may look daunting, but the information you need is always near this light-blue zone, so it is important to search carefully and persistently. Just below the light-blue zone, clicking ▶ opens the lower layer and displays the title text "# Tapiru's English do you know? ...". This confirmed that the first article's information is written near the light-blue zone.

Now look for the grouping tag div (see the reference on the div tag at the end of this article) in the gray part to find the top-level tag that contains this article's information:

▼<div class="xrnccd"

The article information we want lies in the layers below this tag, so we can roughly select the information of all (about 100) articles at once by using this tag's identifying class, "xrnccd", as the Beautiful Soup selector. The script below assigns the information of every article found to articles.

articles = soup.select(".xrnccd")
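As a quick sanity check that the selector matched (Google News class names change over time, so "xrnccd" may need to be re-inspected), you can print how many blocks were found; the two lines below are illustration code, not part of the final script:

print(len(articles))           #Number of article blocks matched by the selector
print(articles[0].text[:60])   #Start of the first block's text (fails if nothing matched)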

Next, find and extract the parts holding each article's title, summary, URL of the original article, and release date. The title text "# Tapiru's English ..." sits just below the light-blue zone. Clicking ▶ directly below it opens the lower layer, and just beneath a <span class=... element the first few lines of the article text are displayed. This text does not appear on the search results web page itself, but it was hidden here; I will call it the summary.


The script that gets this text is summary = entry.find("span").text.

For the article's release date, click ▶ on the <div class="Qmr... element just below to open the lower layer; directly under <time class=... there was datetime="2019-12-13....


The script to get this datetime is time_elm = entry.find("time").
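Some articles carry no release date, so a defensive version of this lookup (equivalent in effect to the try/except used in the full script below) looks like this:

time_elm = entry.find("time")    #None when the article block has no <time> element
if time_elm is not None and time_elm.get("datetime"):
    ymd = time_elm.get("datetime")[0:10]   #For example "2019-12-13"
else:
    ymd = "0000-00-00"    #Placeholder used when no release date is present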

Finally, the URL of the article page: it is in the light-blue part shown by the inspector, which means the link information is attached to the article's title.


<a class="VDXfz" jsname="hXuDdf" jslog="85008; 2:https://prtimes.jp/main/thml/rd/p/000001434.000011710.html;

The https://~ part is what we want. The first two attempts, ~~url_elm = entry.find("a")~~ and ~~url_elm = entry.find("a", class_= "VDXfz")~~, were abandoned, and I settled on:

url_elm = entry.find("article")
link = url_elm.get("jslog")

Now for the full script. Unneeded characters surrounding the acquired information are trimmed away; note that str.lstrip() and str.rstrip() treat their argument as a set of characters rather than an exact prefix or suffix, and can eat the tail of a URL, so the script below splits the jslog value apart instead. If there is no release date information, "0000-00-00" is substituted in the exception handler. The acquired information is converted into a data frame with the pandas library and saved to a csv file.
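As an aside, here is why lstrip()/rstrip() are risky for this kind of trimming; the URL is made up for illustration:

s = "85008; 2:https://example.com/page.html; track:click"
print(s.rstrip("; track:click"))    #-> 85008; 2:https://example.com/page.htm  (the final "l" is also stripped)
print(s.split("2:", 1)[1].split(";", 1)[0])    #-> https://example.com/page.html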

2. Google News search results scraping script

Environment

Windows 10, Python 3.6.2

Script

google_news


#Import the required libraries
import pandas as pd    #To save the scraping results to a csv file in data frame format
import pprint    #To display part of a data frame
from bs4 import BeautifulSoup  #To parse the acquired web page and extract information
import requests     #To get web page information
import urllib.parse       #To URL-encode the search keyword

#URL-encode the search word "Tapiru" and embed it in the search results page URL
s = "Tapiru"
s_quote = urllib.parse.quote(s)
url_b4 = 'https://news.google.com/search?q=' + s_quote + '&hl=ja&gl=JP&ceid=JP%3Aja'

#Get the search results page
res = requests.get(url_b4)
soup = BeautifulSoup(res.content, "html.parser")

#Select the blocks holding the information of all articles
articles = soup.select(".xrnccd")

#Loop over the articles and collect each one's information into a list
news = list()   #Empty list to collect the results

for i, entry in enumerate(articles, 1):
    title = entry.find("h3").text
    summary = entry.find("span").text
    summary = title + "。" + summary
    #url_elm = entry.find("a") was changed to find("article")
    url_elm = entry.find("article")
    link = url_elm.get("jslog")
    #jslog looks like "85008; 2:https://...; track:click", so slice out the URL
    link = link.split("2:", 1)[1]    #Delete the leading "85008; 2:"
    link = link.split(";", 1)[0]     #Delete the trailing "; track:click"
    time_elm = entry.find("time")
    try:    #Exception handling for articles without a release date
        ymd = time_elm.get("datetime")
    except AttributeError:
        ymd = "0000-00-00"
    ymd = ymd[0:10]
    ymd = ymd.replace("-", "/")    #e.g. "2019/12/13"
    sortkey = ymd[0:4] + ymd[5:7] + ymd[8:10]    #yyyymmdd key for sorting by date

    tmp = {    #Store one article's information as a dictionary
        "title": title,
        "summary": summary,
        "link": link,
        "published": ymd,
        "sortkey": sortkey
    }

    news.append(tmp)    #Add each article's information to the list

#Convert to a data frame and save as a csv file
news_df = pd.DataFrame(news)
pprint.pprint(news_df.head())    #Display the first 5 rows to check the data
filename = s + ".csv"
news_df.to_csv(filename, encoding='utf-8-sig', index=False)
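Running the script leaves a csv file named after the search word (here, Tapiru.csv). When comparing against Google Trends it can be reloaded and put in date order through the sortkey column, for example:

df = pd.read_csv("Tapiru.csv")
df = df.sort_values("sortkey")    #Chronological order via the yyyymmdd key
print(df[["published", "title"]].head())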

This Google News search script is used in the following articles:

[Find the seeds of food hits in data science! (1) --The secret of Lawson's Basque hit](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613407003507)

[Let's find the seeds of food hits! (2) --"Complete meal" and "Weathering with You recipe" from June to August 2019](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613422742161)

[Let's find the seeds of food hits! (3) --September 2019 is the food from Taiwan following bubble tea, especially "cheese tea"](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613447159392)

Let's find the seeds of food hits! --Sweet potato pie in October 2019

**Seeds of food hits expected in 2020 --Cheese balls--**

Reference:

What is HTML? If you read this, even beginners can definitely write HTML!
What is an HTML div class? Commentary with examples in 5 minutes
