[PYTHON] Get the site update date seriously

(Update) I have since turned what I made here into a class: [Python] Get the update date of news articles from HTML

It's hard to get the update date of the site

Examining the response headers may reveal the last modified date for static sites.

get_lastmodified.py


import requests
res = requests.head('https://www.kantei.go.jp')
print(res.headers['Last-Modified'])
#Mon, 17 Feb 2020 08:27:02 GMT

(Previous article) [Python] Get the last updated date of the website

This works for some news sites and many Japanese government sites, but most sites do not send a Last-Modified header, and the lookup fails:

KeyError: 'last-modified'
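
One way to avoid the KeyError is to look the header up with .get() and parse it only when it is present. The following is a minimal sketch, not part of the original script: the helper name parse_last_modified is made up for illustration, and it takes the headers mapping (for example res.headers from requests) as an argument so it can be tried without a network connection.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if absent or malformed."""
    value = headers.get('Last-Modified')
    if value is None:
        return None
    try:
        # HTTP dates use the same format as e-mail dates, so the stdlib parser applies.
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


# With requests this would be: parse_last_modified(requests.head(url).headers)
print(parse_last_modified({'Last-Modified': 'Mon, 17 Feb 2020 08:27:02 GMT'}))
print(parse_last_modified({}))
```

A None result is the signal to fall back to the URL or to scraping.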

That leaves two main approaches.

Approach 1: Look at the URL

The URL may contain a date string such as 2019/05/01 or 2019-05-01. When one is present, extracting it is a simple and reliable method.
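
As a rough sketch of this approach, the helper below scans a URL for a few common date shapes. The function name date_from_url, the pattern list, and the restriction to years starting with 19 or 20 are assumptions made for illustration; real sites may need other patterns.

```python
import datetime
import re
from typing import Optional

# Date shapes to look for. The year is limited to 19xx/20xx so that
# long numeric article IDs are less likely to be mistaken for dates.
URL_DATE_PATTERNS = [
    # 2019/05/01 or 2019-05-01 (same separator on both sides)
    re.compile(r'(?<!\d)(?P<y>(?:19|20)\d{2})(?P<sep>[/-])(?P<m>\d{1,2})(?P=sep)(?P<d>\d{1,2})(?!\d)'),
    # 20190501 (exactly eight digits)
    re.compile(r'(?<!\d)(?P<y>(?:19|20)\d{2})(?P<m>\d{2})(?P<d>\d{2})(?!\d)'),
]


def date_from_url(url: str) -> Optional[datetime.datetime]:
    """Return the first valid date found in the URL, or None."""
    for pattern in URL_DATE_PATTERNS:
        for match in pattern.finditer(url):
            try:
                return datetime.datetime(int(match['y']), int(match['m']), int(match['d']))
            except ValueError:
                continue  # looked like a date but is not a valid one
    return None


print(date_from_url('https://edition.cnn.com/2020/02/17/tech/jetman-dubai-trnd/index.html'))
print(date_from_url('https://www.yomiuri.co.jp/national/20200218-OYT1T50158/'))
print(date_from_url('https://www.bbc.com/news/world-asia-china-51540981'))
```

The last call returns None because the BBC URL carries only an article ID, which is the case where scraping is needed.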

Approach 2: Scraping

This is what you ultimately have to rely on.

So, combining these techniques, I automatically extract the update date from the news sites I usually read. In the code below, the BeautifulSoup object obtained for each page is called soup, the extracted update date is converted to a datetime object, and regular expressions are used to extract and reshape strings.

get_lastmodified.py


import bs4
import datetime
import re

The news sites I examined

CNN, Bloomberg, BBC, Reuters, Wall Street Journal, Forbes Japan, Newsweek, Asahi Shimbun, Nikkei, Sankei Shimbun, Yomiuri Shimbun, Mainichi Shimbun

CNN

https://edition.cnn.com/2020/02/17/tech/jetman-dubai-trnd/index.html

get_lastmodified.py


print(soup.select('.update-time')[0].getText())
#Updated 2128 GMT (0528 HKT) February 17, 2020 

timestamp_temp_hm = re.search(r'Updated (\d{4}) GMT', str(soup.select('.update-time')[0].getText()))
timestamp_temp_bdy = re.search(r'(January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4})', str(soup.select('.update-time')[0].getText()))
print(timestamp_temp_hm.groups())
print(timestamp_temp_bdy.groups())
#('2128',)
#('February', '17', '2020')
timestamp_tmp = timestamp_temp_bdy.groups()[2]+timestamp_temp_bdy.groups()[1]+timestamp_temp_bdy.groups()[0]+timestamp_temp_hm.groups()[0]
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%Y%d%B%H%M")
print(news_timestamp)
#2020-02-17 21:28:00


#If it's just the date, you can get it from the URL
URL = "https://edition.cnn.com/2020/02/17/tech/jetman-dubai-trnd/index.html"
news_timestamp = re.search(r'\d{4}/\d{1,2}/\d{1,2}', URL)
print(news_timestamp.group())
#2020/02/17
news_timestamp = datetime.datetime.strptime(news_timestamp.group(), "%Y/%m/%d")
print(news_timestamp)
#2020-02-17 00:00:00

Comment: I have not verified whether the string 'Updated' is always present. Every CNN article URL except summary pages contains the date, so taking the date from the URL looks like the safer choice.
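
Since the page text may not always have the expected shape, the two sources can be chained: try the text first, then fall back to the URL. This is only a sketch under that assumption; cnn_timestamp and CNN_URL are illustrative names, and the function works on the text already extracted from the .update-time element.

```python
import datetime
import re
from typing import Optional

CNN_URL = 'https://edition.cnn.com/2020/02/17/tech/jetman-dubai-trnd/index.html'
MONTHS = ('January|February|March|April|May|June|July|'
          'August|September|October|November|December')


def cnn_timestamp(text: str, url: str) -> Optional[datetime.datetime]:
    """Parse the time from the page text; fall back to the date in the URL."""
    hm = re.search(r'(\d{4}) GMT', text)  # does not require the word 'Updated'
    bdy = re.search(r'(' + MONTHS + r') (\d{1,2}), (\d{4})', text)
    if hm and bdy:
        month, day, year = bdy.groups()
        return datetime.datetime.strptime(
            f'{month} {day} {year} {hm.group(1)}', '%B %d %Y %H%M')
    in_url = re.search(r'\d{4}/\d{1,2}/\d{1,2}', url)
    if in_url:
        return datetime.datetime.strptime(in_url.group(), '%Y/%m/%d')
    return None


print(cnn_timestamp('Updated 2128 GMT (0528 HKT) February 17, 2020', CNN_URL))
print(cnn_timestamp('', CNN_URL))  # text missing: only the date survives
```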

Bloomberg

https://www.bloomberg.co.jp/news/articles/2020-02-17/Q5V6BO6JIJV101

get_lastmodified.py


print(soup.select('time')[0].string)
#
#          2020年2月18日 7:05 JST
#
#Remove the spaces and line breaks inside the tag before parsing
timestamp_tmp = re.sub(' ','',str(soup.select('time')[0].string))
timestamp_tmp = re.sub('\n','',timestamp_tmp)
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%Y年%m月%d日%H:%MJST")
print(news_timestamp)
#2020-02-18 07:05:00

#The date alone can also be taken from the URL
URL = "https://www.bloomberg.co.jp/news/articles/2020-02-17/Q5V6BO6JIJV101"
timestamp_tmp = re.search(r'\d{4}-\d{1,2}-\d{1,2}', URL)
print(timestamp_tmp.group())
#2020-02-17
news_timestamp = datetime.datetime.strptime(timestamp_tmp.group(), "%Y-%m-%d")
print(news_timestamp)
#2020-02-17 00:00:00

Comment: The tag contains line breaks and spaces, so it takes some extra cleanup.
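
The cleanup can also be done without regular expressions. The sketch below assumes the tag text looks like 2020年2月18日 7:05 JST surrounded by line breaks and indentation; parse_bloomberg_time is a made-up helper name.

```python
import datetime


def parse_bloomberg_time(raw: str) -> datetime.datetime:
    """Drop every whitespace character, then parse the compacted string."""
    # str.split() with no argument splits on any run of spaces, tabs, or newlines.
    compact = ''.join(raw.split())
    return datetime.datetime.strptime(compact, '%Y年%m月%d日%H:%MJST')


print(parse_bloomberg_time('\n          2020年2月18日 7:05 JST\n        '))
```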

BBC

https://www.bbc.com/news/world-asia-china-51540981

get_lastmodified.py


print(soup.select("div.date.date--v2")[0].string)
#18 February 2020
news_timestamp = datetime.datetime.strptime(soup.select("div.date.date--v2")[0].string, "%d %B %Y")
print(news_timestamp)
#2020-02-18 00:00:00

Comment: I could not find where the detailed time is published.

Reuters

https://jp.reuters.com/article/apple-idJPKBN20C0GP

get_lastmodified.py


print(soup.select(".ArticleHeader_date")[0].string)
#February 18, 2020 /  6:11 AM /an hour ago updated

m1 = re.match(r'(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}',str(soup.select(".ArticleHeader_date")[0].string))
print(m1.group())
#February 18, 2020

#Capture AM/PM as well so that afternoon times are not read as morning
m2 = re.search(r'\d{1,2}:\d{2} (AM|PM)',str(soup.select(".ArticleHeader_date")[0].string))
print(m2.group())
#6:11 AM

news_timestamp = datetime.datetime.strptime(m1.group()+' '+m2.group(), "%B %d, %Y %I:%M %p")
print(news_timestamp)
#2020-02-18 06:11:00

Wall Street Journal

https://www.wsj.com/articles/solar-power-is-beginning-to-eclipse-fossil-fuels-11581964338

get_lastmodified.py


print(soup.select(".timestamp.article__timestamp")[0].string)
#
#          Feb. 17, 2020 1:32 pm ET
#

news_timestamp = re.sub(' ','',str(soup.select(".timestamp.article__timestamp")[0].string))
news_timestamp = re.sub('\n','',news_timestamp)
print(news_timestamp)
#Feb.17,20201:32pmET
#Capture am/pm too; without it 1:32 pm would be read as 01:32
news_timestamp = re.match(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?(\d{1,2}),(\d{4})(\d{1,2}):(\d{2})(am|pm)',str(news_timestamp))
print(news_timestamp.groups())
#('Feb', '17', '2020', '1', '32', 'pm')
tmp = news_timestamp.groups()
timestamp_tmp = tmp[0]+' '+ tmp[1].zfill(2)+' '+tmp[2]+' '+tmp[3].zfill(2)+' '+tmp[4]+' '+tmp[5]
print(timestamp_tmp)
#Feb 17 2020 01 32 pm
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%b %d %Y %I %M %p")
print(news_timestamp)
#2020-02-17 13:32:00

Forbes Japan

https://forbesjapan.com/articles/detail/32418

get_lastmodified.py


print(soup.select("time")[0].string)
#2020/02/18 12:00
news_timestamp = datetime.datetime.strptime(soup.select("time")[0].string, "%Y/%m/%d %H:%M")
print(news_timestamp)
#2020-02-18 12:00:00

Newsweek

https://www.newsweek.com/fears-rise-over-coronavirus-american-cruise-passenger-diagnosed-after-previously-showing-no-1487668

get_lastmodified.py


print(soup.select('time')[0].string)
# On 2/17/20 at 12:11 PM EST
#Capture AM/PM as well so that afternoon times parse correctly
m = re.search(r'(\d{1,2})/(\d{1,2})/(\d{2}) at (\d{1,2}:\d{2}) (AM|PM)', str(soup.select('time')[0].string))
print(m.groups())
#('2', '17', '20', '12:11', 'PM')
tmp = m.groups()
timestamp_tmp = tmp[0].zfill(2)+' '+ tmp[1].zfill(2)+' '+'20'+tmp[2]+' '+tmp[3]+' '+tmp[4]
print(timestamp_tmp)
#02 17 2020 12:11 PM
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%m %d %Y %I:%M %p")
print(news_timestamp)
#2020-02-17 12:11:00

Asahi Shimbun

https://www.asahi.com/articles/ASN2K7FQKN2KUHNB00R.html

get_lastmodified.py


print(soup.select('time')[0].string)
#2020年2月18日 12時25分
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日 %H時%M分")
print(news_timestamp)
#2020-02-18 12:25:00

Comment: Static and easy to understand. At first glance the notation does not vary between categories, which helps.

Nikkei

https://r.nikkei.com/article/DGXMZO5556760013022020TL1000

get_lastmodified.py


print(soup.select('time')[1].string)
#2020年2月18日 11:00
news_timestamp = datetime.datetime.strptime(soup.select('time')[1].string, "%Y年%m月%d日 %H:%M")
print(news_timestamp)
#2020-02-18 11:00:00

https://www.nikkei.com/article/DGXLASFL18H2S_Y0A210C2000000

get_lastmodified.py


print(soup.select('.cmnc-publish')[0].string)
#2020/2/18 7:37
news_timestamp = datetime.datetime.strptime(soup.select('.cmnc-publish')[0].string, "%Y/%m/%d %H:%M")
print(news_timestamp)
#2020-02-18 07:37:00

https://www.nikkei.com/article/DGXKZO55678940V10C20A2MM8000

get_lastmodified.py


print(soup.select('.cmnc-publish')[0].string)
#2020/2/15付
news_timestamp = datetime.datetime.strptime(soup.select('.cmnc-publish')[0].string, "%Y/%m/%d付")
print(news_timestamp)
#2020-02-15 00:00:00

Comment: The notation varies. I found three variants at a glance, but there may be more.
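
When one site uses several notations, a simple way to cope is to try each known format in turn. The sketch below assumes the three Nikkei variants look like 2020年2月18日 11:00, 2020/2/18 7:37 and 2020/2/15付; NIKKEI_FORMATS and parse_first_match are names invented for this example.

```python
import datetime
from typing import Iterable, Optional

# Assumed notations, in the order they are tried.
NIKKEI_FORMATS = ('%Y年%m月%d日 %H:%M', '%Y/%m/%d %H:%M', '%Y/%m/%d付')


def parse_first_match(text: str, formats: Iterable[str]) -> Optional[datetime.datetime]:
    """Return the result of the first format that fits, or None if none does."""
    text = text.strip()
    for fmt in formats:
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue  # this notation does not fit, try the next one
    return None


for sample in ('2020年2月18日 11:00', '2020/2/18 7:37', '2020/2/15付'):
    print(parse_first_match(sample, NIKKEI_FORMATS))
```

A newly discovered notation then only requires one more entry in the tuple.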

Sankei Shimbun

https://www.sankei.com/world/news/200218/wor2002180013-n1.html

get_lastmodified.py


print(soup.select('#__r_publish_date__')[0].string)
#2020.2.18 13:10
news_timestamp = datetime.datetime.strptime(soup.select('#__r_publish_date__')[0].string, "%Y.%m.%d %H:%M")
print(news_timestamp)
#2020-02-18 13:10:00

Comment: On closer inspection, the URL itself appears to include the date and even the time.

Yomiuri Shimbun

https://www.yomiuri.co.jp/national/20200218-OYT1T50158/

get_lastmodified.py


print(soup.select('time')[0].string)
#2020/02/18 14:16
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y/%m/%d %H:%M")
print(news_timestamp)
#2020-02-18 14:16:00

Comment: The date alone can also be obtained from the URL.

Mainichi Shimbun

https://mainichi.jp/articles/20180803/ddm/007/030/030000c

get_lastmodified.py


print(soup.select('time')[0].string)
#2018年8月3日 東京朝刊
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日 東京朝刊")
print(news_timestamp)
#2018-08-03 00:00:00

https://mainichi.jp/articles/20200218/dde/012/030/033000c

get_lastmodified.py


print(soup.select('time')[0].string)
#2020年2月18日 東京夕刊
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日 東京夕刊")
print(news_timestamp)
#2020-02-18 00:00:00

https://mainichi.jp/articles/20200218/k00/00m/010/047000c

get_lastmodified.py


print(soup.select('time')[0].string)
#2020年2月18日 09時57分
#The last-updated time is in soup.select('time')[1].string
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日 %H時%M分")
print(news_timestamp)
#2020-02-18 09:57:00

https://mainichi.jp/premier/politics/articles/20200217/pol/00m/010/005000c

get_lastmodified.py


print(soup.select('time')[0].string)
#2020年2月18日
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日")
print(news_timestamp)
#2020-02-18 00:00:00

Comment: In the Mainichi Shimbun, only articles published solely in the online edition give the time to the minute. Articles from the morning edition, the evening edition, and Mainichi Premier give only the date, which is the same information as in the URL.
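
Instead of one strptime format per variant, a single regular expression can cover Japanese dates with or without a time. This is a sketch, assuming text such as 2018年8月3日 東京朝刊 or 2020年2月18日 09時57分; JP_DATE and parse_japanese_date are illustrative names.

```python
import datetime
import re
from typing import Optional

JP_DATE = re.compile(
    r'(\d{4})年(\d{1,2})月(\d{1,2})日'      # year, month, day
    r'(?:\s*(\d{1,2})[時:](\d{1,2}))?'       # optional time: 09時57分 or 7:05
)


def parse_japanese_date(text: str) -> Optional[datetime.datetime]:
    """Parse a Japanese date; the time defaults to 00:00 when it is missing."""
    m = JP_DATE.search(text)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.group(1, 2, 3))
    hour = int(m.group(4)) if m.group(4) else 0
    minute = int(m.group(5)) if m.group(5) else 0
    try:
        return datetime.datetime(year, month, day, hour, minute)
    except ValueError:
        return None  # digits matched but do not form a valid date


print(parse_japanese_date('2018年8月3日 東京朝刊'))
print(parse_japanese_date('2020年2月18日 09時57分'))
```

Anything after the date that is not a time, such as the edition name, is simply ignored.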

Summary table

| News site | From the response header | From the URL | From the HTML contents |
|---|---|---|---|
| CNN | - | Date | Date and time |
| Bloomberg | - | Date | Date and time |
| BBC | - | - | Date |
| Reuters | - | - | Date and time |
| Wall Street Journal | - | - | Date and time |
| Forbes Japan | - | - | Date and time |
| Newsweek | - | - | Date and time |
| Asahi Shimbun | Date and time | - | Date and time |
| Nikkei | - | - | Date and time |
| Sankei Shimbun | Date and time | Date and time | Date and time |
| Yomiuri Shimbun | - | Date | Date and time |
| Mainichi Shimbun | - | Date | Date and time* |

* Only for articles published solely in the online edition.

Thoughts

Quite apart from the language, the date notation differs from site to site. Even within a single news site the notation varies, and I have not been able to confirm every variant. I did not find any site where the date could be read from the URL but not from the HTML. I called this a combination of techniques, but the result would be the same with scraping alone. With this method you have to read the tags and class names for each site, so covering every site, even just news sites, looks quite difficult. Please let me know if there is a better way.
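
One way to keep the per-site differences manageable is to collect them in a table keyed by host name. This is only a sketch of that idea, not a complete solution: PARSERS and parse_for_site are hypothetical names, the formats are the ones used earlier in this article, and the text is assumed to have been scraped from the page already.

```python
import datetime
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

Parser = Callable[[str], datetime.datetime]
strptime = datetime.datetime.strptime

# Hypothetical registry: host name -> function that parses the scraped text.
PARSERS: Dict[str, Parser] = {
    'forbesjapan.com': lambda s: strptime(s.strip(), '%Y/%m/%d %H:%M'),
    'www.yomiuri.co.jp': lambda s: strptime(s.strip(), '%Y/%m/%d %H:%M'),
    'www.sankei.com': lambda s: strptime(s.strip(), '%Y.%m.%d %H:%M'),
    'www.bbc.com': lambda s: strptime(s.strip(), '%d %B %Y'),
}


def parse_for_site(url: str, text: str) -> Optional[datetime.datetime]:
    """Pick the parser registered for the URL's host; None if unknown or unparsable."""
    parser = PARSERS.get(urlparse(url).hostname or '')
    if parser is None:
        return None
    try:
        return parser(text)
    except ValueError:
        return None


print(parse_for_site('https://www.sankei.com/world/news/200218/wor2002180013-n1.html',
                     '2020.2.18 13:10'))
print(parse_for_site('https://www.bbc.com/news/world-asia-china-51540981',
                     '18 February 2020'))
```

Supporting another site still means reading its HTML, but the change is confined to one new entry.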

