[PYTHON] Fixed-point observation of specific data on the Web by automatically executing the Web browser on the server (Ubuntu16.04) (1) -Web browser installation-

Purpose

There was some data on the Web whose value changes in real time. I decided to write a program that checks the value regularly, but writing scraping code was troublesome because the page requires a login. As a workaround, I decided to use Selenium to drive a web browser and scrape through it. I will summarize the process here as a memorandum.

Running the browser as a scheduled batch job on the PC at hand would also have worked, but having a browser launch itself on the machine I use every day would be a nuisance. So instead, let's run everything on a rental server (Ubuntu 16.04).

More specifically, the plan looks like this:

(1) Launch a web browser from Python → Part 1 (this post)
(2) Operate the web browser with Selenium and process the web data → Part 2
(3) Store the processed data in MongoDB → Part 3
(4) Run the Python program that does (1)-(3) automatically with cron → Part 3
(5) If there is a certain fluctuation in the value, notify by e-mail → Bonus

Environment

OS: Ubuntu 16.04 (Sakura VPS)
Python: 3.5

Step1) Select a web browser

Initially I was working with Google Chrome, but the headless browser PhantomJS turned out to be a better fit, so I switched to it. Being headless, PhantomJS runs without a display, which is exactly what you want on a server.

Step2) Install PhantomJS 2.1.1

I installed PhantomJS by following the steps below.

$ wget -O /tmp/phantomjs-2.1.1-linux-x86_64.tar.bz2 https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
$ cd /tmp
$ bzip2 -dc /tmp/phantomjs-2.1.1-linux-x86_64.tar.bz2 | tar xvf -
$ sudo mv /tmp/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin/
$ phantomjs --version
> 2.1.1

If the version check on the last line prints the version you intended to install (here, 2.1.1), the installation is done.

(Reference URL) http://bit.ly/2bRGnYI

Step3) Install Selenium

We will use Selenium for the fixed-point observation of data on the Web. Selenium is a testing tool for web applications: instead of a human operating the browser, Selenium operates it.

In short, Selenium lets you drive the browser freely from program code. For example, you can perform operations such as "click the button labeled login" or "type an e-mail address into the text box named userid". You can also retrieve the HTML of the currently open page, which is what we will use for scraping.
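To give a feel for the scraping step, here is a minimal sketch of extracting a page's title from its HTML using only Python's standard-library html.parser. The html_text string below is a stand-in: in the real script, the HTML would come from Selenium via browser.page_source.

```python
from html.parser import HTMLParser

# Minimal sketch: pull the <title> out of an HTML string, the way one
# might post-process HTML obtained from browser.page_source.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in HTML; a real run would feed browser.page_source instead.
html_text = "<html><head><title>Google</title></head><body></body></html>"
parser = TitleParser()
parser.feed(html_text)
print(parser.title)  # -> Google
```

For anything beyond a quick check like this, a dedicated parser such as BeautifulSoup is usually more convenient, but the idea is the same: Selenium hands you the HTML, and you pick out the values you want.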

I drive Selenium from Python 3. Install Selenium with the steps below, which first install pip3.

$ sudo apt-get install python3-setuptools
$ sudo easy_install3 pip
$ pip3 install selenium

Step4) Selenium and PhantomJS test program

As a test, use Selenium and PhantomJS to open the Google top page and read the value of its title tag.

selenium test program


import time
from selenium import webdriver

# Point executable_path at the phantomjs binary on your machine
browser = webdriver.PhantomJS(executable_path='<path/to/phantomjs>')
browser.get("http://www.google.com")
time.sleep(3)  # give the page a moment to load

print(browser.title)

browser.quit()  # quit() also terminates the phantomjs process

The code above is intuitive, and even Selenium beginners should not be confused by it. If it prints "Google", the program is working correctly.

The only part that needs attention is the webdriver.PhantomJS() call. When starting PhantomJS this way, you must pass the path to the phantomjs binary as an argument, in the form webdriver.PhantomJS(executable_path='...') as in the code above. If you don't know the path, run

$ which phantomjs

and use the path it returns.
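This lookup can also be done from Python itself. A small sketch using only the standard library (shutil.which mirrors the shell's which command):

```python
import shutil

# shutil.which mirrors the shell's `which`: it searches PATH for an
# executable and returns its full path, or None if it is not found.
phantomjs_path = shutil.which("phantomjs")

if phantomjs_path is not None:
    print(phantomjs_path)
    # browser = webdriver.PhantomJS(executable_path=phantomjs_path)
else:
    print("phantomjs not found on PATH")
```

This avoids hard-coding the path into the script, so it keeps working if phantomjs is installed in a different location.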

browser.get(...) sends a GET request to the specified URL. The response is held by the browser object, and browser.title then gives the value of the page's title tag.

Now that we have confirmed that Selenium and PhantomJS start correctly, the next step is to actually scrape and extract specific data.

Continued in Part 2.
