I want to sort crawled sites by the date they were last updated, but I didn't know how to get that date, so I looked it up.
I want to get the time stamp of a file placed on the WEB with python. Posted on 2017/10/13 14:41
Last-Modified
The HTTP Last-Modified response header contains the date and time at which the origin server believes the resource was last modified. It is used as a validator to determine whether a received or stored resource is still the same. It is less accurate than the ETag header and serves as a fallback for it.
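To make the "validator" role concrete, here is a minimal sketch of a conditional request. The names conditional_headers and has_changed are my own for illustration, and I am assuming the server honors If-Modified-Since / If-None-Match by answering 304 Not Modified; a server that ignores them simply returns 200.

```python
def conditional_headers(last_modified=None, etag=None):
    """Build validator headers from values saved on a previous fetch."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def has_changed(url, last_modified=None, etag=None):
    """Return False only when the server answers 304 Not Modified."""
    # Imported here so conditional_headers() works without requests installed.
    import requests

    res = requests.get(
        url, headers=conditional_headers(last_modified, etag), timeout=10
    )
    return res.status_code != 304
```

For example, has_changed(url, last_modified=saved_value) would re-check a page using the Last-Modified string stored from an earlier crawl.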
get_lastmodified.py
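import datetime
import requests

# HEAD request: fetch only the response headers, not the body
res = requests.head('https://www.kantei.go.jp')
print(res.headers['Last-Modified'])

# Parse the HTTP date string (always expressed in GMT) into a datetime
html_timestamp = datetime.datetime.strptime(res.headers['Last-Modified'], "%a, %d %b %Y %H:%M:%S GMT")
print(html_timestamp)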
% python get_lastmodified.py
Mon, 17 Feb 2020 08:27:02 GMT
2020-02-17 08:27:02
The script also converts the header value to a datetime, which prints in the standard format.
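Since the original goal was sorting crawled sites by update date, here is a rough sketch of that step. The function names and the sample header values are made up for illustration, and it swaps strptime for the standard library's email.utils.parsedate_to_datetime. Pages without a usable Last-Modified header are placed last. With real responses you would pass res.headers in place of the plain dicts.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_last_modified(headers):
    """Return Last-Modified as an aware UTC datetime, or None if unusable."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # malformed date string
        return None
    if dt.tzinfo is None:  # e.g. a "-0000" zone yields a naive datetime
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def sort_by_last_modified(pages):
    """pages: list of (url, headers) pairs. Newest first, undated pages last."""
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(
        pages, key=lambda p: parse_last_modified(p[1]) or oldest, reverse=True
    )


# Illustrative sample data, not real crawl results
pages = [
    ("https://example.com/a", {"Last-Modified": "Mon, 17 Feb 2020 08:27:02 GMT"}),
    ("https://example.com/b", {}),
    ("https://example.com/c", {"Last-Modified": "Tue, 18 Feb 2020 00:00:00 GMT"}),
]
for url, headers in sort_by_last_modified(pages):
    print(url, parse_last_modified(headers))
```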
This method is not reliable for dynamic sites, so I thought about it a little more.
Get the site update date seriously