Install Chrome from the command line on Sakura VPS (Ubuntu) and launch it from Python with a virtual display and Selenium

Purpose

Writing web scraping code for pages that require POST requests, such as login pages, is tedious. I used selenium to eliminate that hassle: selenium drives the browser automatically, handles the operations that require POST, and then performs the scraping.

Environment

OS: Ubuntu 16.04 (Sakura VPS)

Step1) Install Chrome from the command line

mkdir download
cd download
wget  https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
rm google-chrome-stable_current_amd64.deb

(Reference URL) http://bit.ly/2bBK3Ku

Step2) Preparing to start Google Chrome

You can start Chrome by typing google-chrome on the command line, but starting it in this state caused two problems:

  1. The dependencies are broken.
  2. There is no screen (of course).

The fixes for each are shown below.

Problem 1) Dependency repair

I fixed it with the following commands.

sudo apt-get update
sudo apt-get -f install

Problem 2) There is no screen

(Option 1) Install a GUI desktop

You can install a GUI desktop with the following command, but I gave up on this approach because it seemed likely to take a long time.

GUI desktop installation


sudo apt-get -y install ubuntu-desktop

(Option 2) Install a virtual display

Install a virtual display and run Chrome on the virtual display.

The procedure is:

① Install the virtual display xvfb
② Install selenium and pyvirtualdisplay to operate Chrome from Python
③ Write a Chrome startup program in Python

The specific work procedure is described in Step 3.

Step3) Launch Google Chrome

Step ①) Install xvfb

I installed the virtual display xvfb with the following command.

Install xvfb


sudo apt-get install xvfb
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.20/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
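
After these steps, it can be worth confirming that both binaries are actually visible on PATH before launching anything. A minimal standard-library sketch (the find_tool helper is mine, not from this article):

```python
import shutil

def find_tool(name):
    """Return the absolute path of an executable found on PATH, or None."""
    return shutil.which(name)

# Pre-flight check for the binaries installed in the steps above.
for tool in ("google-chrome", "chromedriver"):
    path = find_tool(tool)
    print(tool, "->", path if path else "NOT FOUND")
```

If either line prints NOT FOUND, revisit the install and symlink commands above before continuing.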

Step ②) Install selenium and pyvirtualdisplay

To operate Chrome from Python, install the selenium package for driving Chrome and the pyvirtualdisplay package for controlling the virtual display xvfb.

Selenium is a testing tool for web applications: instead of a human controlling the browser, Selenium controls it. pyvirtualdisplay is a package for controlling the virtual display xvfb from Python.

I installed both with the commands below. (Since pip3 was not installed, I install it first.)

sudo apt-get install python3-setuptools
sudo easy_install3 pip
pip3 install pyvirtualdisplay selenium

Step ③) Write a Chrome startup program in Python

I ran the following code.

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Chrome()
browser.get('http://www.google.co.jp')
print(browser.title)

browser.quit()
display.stop()

The code above should cause little confusion. Lines 1 and 2 import the virtual display and selenium. Line 4 defines the virtual display and line 5 starts it. webdriver.Chrome() on line 6 starts Chrome on the virtual display, line 7 fetches the source of google.co.jp, and line 8 prints the title tag element of the fetched page.
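
browser.title gives you the title directly, but if you ever need to extract it from raw HTML yourself (for example from browser.page_source), a standard-library sketch like the following would work. The TitleParser class here is a hypothetical helper of my own, not part of selenium:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only react to the first <title> we encounter.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title = data
            self._in_title = False

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

# In practice the HTML would come from browser.page_source.
sample = "<html><head><title>Google</title></head><body></body></html>"
parser = TitleParser()
parser.feed(sample)
print(parser.title)  # -> Google
```

This is only a sketch; for real scraping a dedicated parser such as BeautifulSoup is usually more convenient.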

Now you have an environment where Chrome can be started from the CLI alone.

How to actually scrape?

When actually scraping, I use PhantomJS instead of Chrome. PhantomJS is a headless browser, so it needs no virtual display, and it can also scrape pages rendered with Javascript, which is useful. If you want to work with PhantomJS, please check here.

However, you may still want to use Chrome, since it lets you test while watching how the browser actually behaves. If you want to scrape with Chrome, please see the page here.

If you replace the

browser = webdriver.PhantomJS(executable_path='')

part with

browser = webdriver.Chrome()

it will work ^^ (To repeat, please note that Javascript code cannot be scraped.)
