[Python] Fixed-point observation of specific data on the Web by automatically running a web browser on a server (Ubuntu 16.04) (3) ~ Automatic execution with cron ~

Background

There was data on the Web whose value changed in real time. I decided to write a program to check the value periodically, but writing plain scraping code was a hassle because the page requires a login. As a workaround, I decided to use Selenium to drive a web browser and scrape from there. This series summarizes the process as a memorandum.

I could have run the browser as a batch job on the PC at hand, but having a web browser launch itself on the PC I use every day would get in the way, so instead I run it unattended on a rental server (Ubuntu 16.04).

More specifically, the overall picture is as follows.

(1) Launch a web browser from Python → explained in Part1
(2) Operate the web browser with Selenium and process the web data → explained in Part2
(3) Store the processed data in MongoDB → Part3 (this post)
(4) Automatically execute the Python program covering (1) to (3) with cron → Part3 (this post)
(5) If the value fluctuates beyond a certain amount, notify by e-mail → bonus

The program that automatically acquires specific data on the Web was completed in Part1 and Part2, so now let's configure cron to run it automatically.

Environment

OS: Ubuntu 16.04 (Sakura VPS)
python: version 3.5
mongoDB: version 2.6.10
PhantomJS: version 2.1.1

Step1) Set up cron

#Check that cron is running
sudo service cron status

#Edit the crontab
crontab -e

Add the following line to the crontab:

*/5 * * * * <path_to_python>/python3 <path_to_file>/test.py >> <path_to_log>/test.log 2>&1

Specify the Python program created in Part1 and Part2 as the job, as above. With this entry, the browser starts up every 5 minutes and fetches the specific data from the specific site; stdout and stderr are both appended to test.log.
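Since every run appends to the same log file, it helps to print a marker at the start of each execution so the entries can be told apart. A minimal sketch (the log_run_header name is just an illustration, not part of the original script):

import datetime

#Hypothetical helper: print a timestamp header at each cron run
#so the entries appended to test.log can be told apart
def log_run_header():
    now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print("===== run at {} =====".format(now))

if __name__ == "__main__":
    log_run_header()
    #... the scraping code from Part1/Part2 goes here ...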

Step2) Preparation to save the fetched data in DB

Everything from here on is a bonus. I built this fixed-point observation program on a fresh, default Ubuntu install, so let me also note how to set up the DB storage.

If writing the output to a txt file is enough for you, none of this is necessary; something like

File output


#Append the scraped data to a text file
with open("test.txt", "a+") as f:
    f.write(data)

will be fine (the with block closes the file automatically).
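For fixed-point observation it is also handy to record when each value was taken. A minimal sketch (data here is a placeholder for the scraped value):

import datetime

data = "<observed value>"  #placeholder for the scraped data

#Append one timestamped line per observation
with open("test.txt", "a+") as f:
    f.write("{}\t{}\n".format(datetime.datetime.now().isoformat(), data))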

In my case, though, I actually store the data in MongoDB.

Install MongoDB

Follow the steps below to install.

1) Register the public key
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

2) Create mongodb.list
echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list

3) Install the package
sudo apt-get update
sudo apt-get install mongodb-10gen

4) Create mongod.service
sudo vim /lib/systemd/system/mongod.service

 ▼ Contents of mongod.service

 [Unit]
 Description=MongoDB Database Service
 Wants=network.target
 After=network.target

 [Service]
 ExecStart=/usr/bin/mongod --config /etc/mongod.conf
 ExecReload=/bin/kill -HUP $MAINPID
 Restart=always
 User=mongodb
 Group=mongodb
 StandardOutput=syslog
 StandardError=syslog

 [Install]
 WantedBy=multi-user.target

(Reference URL) http://qiita.com/pelican/items/bb9b5290bb73acedc282
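Note that after creating a new unit file, systemd usually needs a reload (sudo systemctl daemon-reload) before it will recognize the service.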

Install pymongo

Install the pymongo package for operating MongoDB from Python:

pip3 install pymongo

Start MongoDB

sudo systemctl start mongod
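At this point you can check from Python that the connection works. A minimal sketch (the qiita / test names are placeholders for this check):

from pymongo import MongoClient

#Connect to the local MongoDB instance (default port 27017)
client = MongoClient('localhost:27017')

#Insert a throwaway document and print its generated ObjectId
db = client["qiita"]
result = db["test"].insert_one({"hello": "mongo"})
print(result.inserted_id)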

Step3) Complete program for fixed-point observation of Web data by automatic browser execution

Let me now write a simple fixed-point observation program that combines Part1, Part2, and this post.

As in Part2, the data I actually observe cannot be published, so this time the program automatically fetches the posts in Qiita's top feed instead.

Here is a summary of what the program does.

(1) Launch the PhantomJS browser
(2) Automatically log in to Qiita and fetch the titles of the top 20 posts in the feed
(3) Store the titles from (2) in a list and write them to MongoDB

The actual program code is as follows.

import time
import datetime

from selenium import webdriver
from bs4 import BeautifulSoup
from pymongo import MongoClient

URL = "https://qiita.com/"
USERID = "<YOUR_USER_ID>"
PASS = "<YOUR_PASSWORD>"

#Automatic startup of PhantomJS and access to Qiita
browser = webdriver.PhantomJS(executable_path='<path/to/phantomjs>')
browser.get(URL)
time.sleep(3)

#Login page
browser.find_element_by_id("identity").send_keys(USERID)
browser.find_element_by_id("password").send_keys(PASS)
browser.find_element_by_xpath('//input[@name="commit"]').click()
time.sleep(5)

#Get a list of posts on the home screen
html = browser.page_source.encode('utf-8')
soup = BeautifulSoup(html, "lxml")
posts_source = soup.select(".item-box-title > h1 > a")

#Organize the post titles into a list
posts = []
for post in posts_source:
    posts.append(post.text.strip())

#Get the time of fixed point observation
output = {}
output["date"] = str(datetime.date.today())
output["datetime"] = str(datetime.datetime.today().strftime("%H:%M:%S"))
output["content"] = posts

#Store in MongoDB (database: qiita, collection: new_posts)
mongo = MongoClient('localhost:27017')
db = mongo["qiita"]
new_posts = db["new_posts"]
new_posts.insert_one(output)

#Close the browser (quit() also terminates the PhantomJS process)
browser.quit()

That's all there is to it. By running this program periodically with cron, the 20 latest post titles from the logged-in Qiita feed are recorded each time. (It's a test program written for this post, so it's not all that practical ^^;)
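To check the accumulated observations later, something like the following reads the records back (a minimal sketch assuming the same qiita / new_posts names as above):

from pymongo import MongoClient

client = MongoClient('localhost:27017')
new_posts = client["qiita"]["new_posts"]

#Print each stored observation: date, time, and the first three titles
for doc in new_posts.find().sort("date", -1):
    print(doc["date"], doc["datetime"], doc["content"][:3])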

By adapting this program, you should be able to perform fixed-point observation of data on all sorts of web pages, whether or not GET/POST requests such as a login are involved.
