This article introduces the schedule module, which is useful for running jobs on a regular basis. schedule comes in handy when you want to perform the same work at regular intervals (every few minutes, hours, or days), and it is convenient for collecting information by web scraping.
After summarizing the basic usage of the schedule module, we present a script that periodically executes scraping as an implementation example.
The schedule module can be installed with pip. Run the following command at the command prompt:
pip install schedule
sample_schedule.py
import schedule
import time
# Job function to execute
def job():
    print("job execution")

# Register the job to run every minute
schedule.every(1).minutes.do(job)

# Register the job to run every hour
schedule.every(1).hours.do(job)

# Register the job to run every day at 11:00 AM
schedule.every().day.at("11:00").do(job)

# Register the job to run every Sunday
schedule.every().sunday.do(job)

# Register the job to run every Wednesday at 13:15
schedule.every().wednesday.at("13:15").do(job)

# Monitor for pending jobs; the job function runs at the registered times
while True:
    schedule.run_pending()
    time.sleep(1)
This is a sample script using schedule. The job function runs at the registered intervals, printing "job execution" each time. `schedule.every()` registers a job together with its execution interval: a job can run every few minutes, every few hours, or at a specific day and time.
# Run the job every minute
schedule.every(1).minutes.do(job)
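Beyond fixed intervals, `do()` forwards any extra arguments to the job function, and registered jobs can be cancelled or cleared. The snippet below is a minimal sketch of those features; the notify function and its message argument are made up for illustration.
import schedule

def notify(message):
    # Extra arguments passed to do() are forwarded to the job function
    print(message)

# Run every 10 seconds, passing an argument to the job
job_handle = schedule.every(10).seconds.do(notify, message="10-second check")

# Cancel one job, or remove every registered job
schedule.cancel_job(job_handle)
schedule.clear()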
The jobs registered with `schedule.every()` are actually executed by `schedule.run_pending()`. Calling `schedule.run_pending()` just once only checks for pending jobs a single time and the script then exits, so you need to keep calling it inside an infinite loop with a while statement. By looping forever, the same processing continues to run at the registered intervals.
while True:
    schedule.run_pending()
    time.sleep(1)
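Sleeping for one second on every pass is simple and works fine. If you would rather wake up only when the next job is due, the library also provides `schedule.idle_seconds()`, which reports how long until the next scheduled run; the following is a sketch of that variation, assuming the helper behaves as documented.
import schedule
import time

def job():
    print("job execution")

schedule.every(1).minutes.do(job)

# Sleep until the next job is due instead of polling every second
while True:
    idle = schedule.idle_seconds()
    if idle is None:
        break  # no jobs registered, nothing left to run
    if idle > 0:
        time.sleep(idle)
    schedule.run_pending()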
Using the schedule module introduced above, I wrote a script that performs scraping on a regular basis. It accesses Yahoo News every hour and retrieves the title and URL of each news item.
scraping_schedule.py
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import re
import schedule
import time
def job():
    try:
        html = urlopen('https://news.yahoo.co.jp/topics')
    except HTTPError as e:
        print(e)
    except URLError as e:
        print(e)
    else:
        # Access Yahoo News topics and collect the news links
        bs = BeautifulSoup(html.read(), 'lxml')
        newsList = bs.find('div', {'class': 'topicsListAllMain'}).find_all('a')
        # Print the title and URL of each news item in the collected list
        for news in newsList:
            if re.match('^(https://)', news.attrs['href']):
                print(news.get_text())
                print(news.attrs['href'])

# Run the job every hour
schedule.every(1).hours.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
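One detail worth noting: a job registered with `every(1).hours` runs for the first time one hour after registration, not at startup. If you also want an immediate first run, one option is to trigger the registered jobs once with `schedule.run_all()` before entering the loop, as in this sketch (the print in job stands in for the scraping above).
import schedule
import time

def job():
    print("job execution")  # stand-in for the scraping job above

schedule.every(1).hours.do(job)

# Run every registered job once immediately, then fall back to the schedule
schedule.run_all()

while True:
    schedule.run_pending()
    time.sleep(1)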
This article introduced the schedule module, which can execute jobs at regular intervals. I mainly use it for scraping, but since it can automate any kind of recurring task, I think it has a wide range of uses.