There is a web page whose values change in real time, and I wanted a program that checks those values periodically. Writing plain scraping code was a hassle because the page requires a login, so instead I decided to drive a web browser with Selenium and scrape that way. This post summarizes the process as a memorandum.
I could have run the browser as a batch job on the PC at hand, but having a browser constantly launching on the machine I use every day gets in the way. So I run everything unattended on a rental server (Ubuntu 16.04).
More specifically, the plan is as follows:
(1) Launch a web browser from Python → explained in Part 1
(2) Drive the browser with Selenium and process the web data → Part 2 (this post)
(3) Store the processed data in MongoDB → explained in Part 3
(4) Run the Python program for (1)-(3) automatically with cron → explained in Part 3
(5) Send an e-mail notification when the value moves by a certain amount → bonus
Last time I covered launching a web browser (PhantomJS) from Python on Ubuntu. This time I would like to actually scrape the page and extract the information I want.
As I wrote in the earlier "Purpose" section, my original goal is to periodically fetch data from a web page that sits behind a login and store it in a DB. I cannot show the actual page here, so as an example, the Python program below logs in to Qiita automatically.
Automatic login to Qiita with PhantomJS
import time
from selenium import webdriver
from pyvirtualdisplay import Display

URL = "https://qiita.com/"
USERID = "<YOUR_USER_ID>"
PASS = "<YOUR_PASSWORD>"

# Start the virtual display (the pyvirtualdisplay setup from Part 1)
display = Display(visible=0, size=(1024, 768))
display.start()

# Launch PhantomJS and access Qiita
browser = webdriver.PhantomJS(executable_path='<path/to/phantomjs>')
browser.get(URL)
time.sleep(3)

# Fill in the login form and submit it
browser.find_element_by_id("identity").send_keys(USERID)
browser.find_element_by_id("password").send_keys(PASS)
browser.find_element_by_xpath('//input[@name="commit"]').click()
time.sleep(5)

# Print the page title to confirm the login succeeded
print(browser.title)

browser.close()
display.stop()
After executing the above, "Home - Qiita" is printed, which confirms that the browser has reached the logged-in top page of your account.
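Incidentally, the fixed time.sleep() waits are fragile on a slow server. A minimal sketch using Selenium's explicit waits instead, reusing the browser object from the listing above (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the login form to appear ...
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "identity"))
)
# ... and, after submitting, for the title of the logged-in top page
WebDriverWait(browser, 10).until(EC.title_contains("Home"))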
While we're at it, let's scrape the page that PhantomJS has opened. There are several ways to scrape with Python, but this time we will use BeautifulSoup4. Installation itself is easy:
pip3 install beautifulsoup4 lxml
is all it takes (lxml is the HTML parser used in the code below).
Let's scrape the top screen after logging in to Qiita.
After logging in to Qiita, the feed tab on the top screen shows the latest posts. Up to 20 titles are displayed there, so let's write a program that fetches those 20 titles automatically. (There is not much practical point to this; it is just an example.)
▼ Specifically, we fetch the post names in the feed on the screen shown below.
▼ The program code is as follows.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from pyvirtualdisplay import Display

URL = "https://qiita.com/"
USERID = "<YOUR_USER_ID>"
PASS = "<YOUR_PASSWORD>"

# Start the virtual display (see Part 1)
display = Display(visible=0, size=(1024, 768))
display.start()

# Launch PhantomJS and access Qiita
browser = webdriver.PhantomJS(executable_path='<path/to/phantomjs>')
browser.get(URL)
time.sleep(3)

# Fill in the login form and submit it
browser.find_element_by_id("identity").send_keys(USERID)
browser.find_element_by_id("password").send_keys(PASS)
browser.find_element_by_xpath('//input[@name="commit"]').click()
time.sleep(5)

# Get the list of posts on the home screen
html = browser.page_source.encode('utf-8')
soup = BeautifulSoup(html, "lxml")
posts_source = soup.select(".item-box-title > h1 > a")

# Print the post titles, numbered from 1
for i, post in enumerate(posts_source, 1):
    print(str(i) + ":" + post.text.strip())

browser.close()
display.stop()
▼ If the program prints output like the following, it worked:
1:babel-Talk about publishing to npm repository with babe to light with just cli
2:Using UX302NC with Raspberry Pi
3:N in Rails and MySQL-Implement full-text search with FULL TEXT INDEX using grammed data
4:I made a module to make accessibility functions in Atom
5:Use emoji not on the cheat sheet on GitHub
....
As you can see from the code above, when you drive the browser through Selenium, the source of the currently open page is available as
browser.page_source
Encode it with the encode method, as in the sample, if necessary.
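In fact, the encode step is optional: BeautifulSoup accepts either bytes or a str, so the page source can also be passed in directly.

# The page source can be handed to BeautifulSoup as-is
soup = BeautifulSoup(browser.page_source, "lxml")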
The HTML obtained this way is turned into a BeautifulSoup object so that the values of specific tags can be extracted. In the sample,
posts_source = soup.select(".item-box-title > h1 > a")
selects the a elements that are direct children of h1 elements, which are in turn direct children of elements with the class item-box-title. For more detail on scraping with Beautiful Soup, see its documentation.
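To see how this child combinator behaves, here is a tiny self-contained example; the HTML snippet is made up to mimic the structure of Qiita's feed at the time:

from bs4 import BeautifulSoup

# Hypothetical markup imitating the feed's .item-box-title > h1 > a structure
html = """
<div class="item-box-title"><h1><a href="/p/1">First post</a></h1></div>
<div class="item-box-title"><h1><a href="/p/2">Second post</a></h1></div>
"""
soup = BeautifulSoup(html, "lxml")
for a in soup.select(".item-box-title > h1 > a"):
    print(a.text)  # -> First post, then Second post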
I used fetching titles from the Qiita home screen as the example, but as you can see from running the code above, the same pattern applies to a wide range of pages.
So far we have driven the browser automatically and extracted specific data. All that remains is to run the program periodically with cron or the like. Details follow next time.
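As a small preview, a crontab entry that runs the script at the top of every hour could look like this (both paths are placeholders for your environment):

# m h dom mon dow  command   (edit with: crontab -e)
0 * * * * /usr/bin/python3 /path/to/scraper.py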