When you build a machine learning system in Python, you often use open data published on the Internet, that is, data that can be used without restrictions such as patents or licenses. Naturally, you use the acquired data to update the model, and at some point you want to automate the collection itself. This time, I will try using Selenium for web scraping.
Automated (mechanical) downloads are often prohibited, so please read the terms of service of the site you are accessing carefully.
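In addition to the terms of service, you can also check a site's robots.txt programmatically before scraping. A minimal sketch using Python's standard urllib.robotparser (the URL and rules here are illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Against a real site you would do:
#   rp.set_url('https://example.com/robots.txt'); rp.read()
# Here we parse inline example rules so the sketch runs offline.
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'https://example.com/data.csv'))   # True (allowed)
print(rp.can_fetch('*', 'https://example.com/private/x'))  # False (disallowed)
```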
As far as I checked, there are several scraping frameworks, but since I had not used them before, I decided to try Selenium. Selenium is a tool that automates browser operations on the Web, and here I use that automation for scraping.
A driver called WebDriver is required between Selenium and the browser (Chrome in this case).
I use Python via Anaconda, so I install Selenium with conda.
conda install selenium
For WebDriver, you can either download the binary from http://chromedriver.chromium.org/downloads and copy it to a location on your Path, or install it as a package. This time, I install it with conda.
conda install chromedriver-binary
Now that the environment is ready, let's first run the sample from the chromedriver site below. It displays the Google site, searches for ChromeDriver, sleeps for 5 seconds, and exits. http://chromedriver.chromium.org/getting-started
test1.py
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.google.com/')
time.sleep(5) # Let the user actually see something!
search_box = driver.find_element_by_name('q')
search_box.send_keys('ChromeDriver')
search_box.submit()
time.sleep(5) # Let the user actually see something!
driver.quit()
python test1.py
When you run it, the browser starts and searches for ChromeDriver. The browser shows the message "Chrome is being controlled by automated test software." Easy!
Since just running the sample is a bit lonely, I will make a sample that downloads a CSV from the following site.
- Open the specified URL (blurred as XXXXX here instead of the actual URL)
- Download the CSV linked as "Daily data (CSV)" (save to C:\project)
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
download_dir = 'C:\\project'
prefs = {}
prefs['download.prompt_for_download'] = False
prefs['download.directory_upgrade'] = True
prefs['download.default_directory'] = download_dir
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://XXXXXXXXXXXXXXXXXXXXXX')
time.sleep(5)  # wait for the page to load
url = driver.find_element_by_partial_link_text("Daily data(CSV)")
driver.command_executor._commands["send_command"] = (
"POST",
'/session/$sessionId/chromium/send_command'
)
params = {
'cmd': 'Page.setDownloadBehavior',
'params': {
'behavior': 'allow',
'downloadPath': download_dir
}
}
driver.execute("send_command", params=params)
driver.get(url.get_attribute('href'))
time.sleep(5)  # wait for the download to finish
driver.quit()
In the following part of the code, the browser is hidden by passing --headless as an option when it starts.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
The following part of the code specifies where to save the downloaded file.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
download_dir = 'C:\\project'
prefs = {}
prefs['download.prompt_for_download'] = False
prefs['download.directory_upgrade'] = True
prefs['download.default_directory'] = download_dir
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=options)
In the case of --headless, additional code is required to specify the save destination of the downloaded file.
driver.command_executor._commands["send_command"] = (
"POST",
'/session/$sessionId/chromium/send_command'
)
params = {
'cmd': 'Page.setDownloadBehavior',
'params': {
'behavior': 'allow',
'downloadPath': download_dir
}
}
driver.execute("send_command", params=params)
This completes the download! Easy.
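By the way, the fixed time.sleep(5) before quitting may be too short for large files. A sketch of polling the download directory instead (wait_for_download is a hypothetical helper I wrote for illustration, not part of Selenium):

```python
import os
import time

def wait_for_download(directory, suffix='.csv', timeout=30):
    """Poll `directory` until a file ending in `suffix` appears and no
    partial Chrome downloads (.crdownload) remain, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        names = os.listdir(directory)
        finished = [n for n in names if n.endswith(suffix)]
        partial = [n for n in names if n.endswith('.crdownload')]
        if finished and not partial:
            return os.path.join(directory, finished[0])
        time.sleep(0.5)
    raise TimeoutError(f'no {suffix} file appeared in {directory}')

# In the script above, this would replace the final sleep:
#   driver.get(url.get_attribute('href'))
#   csv_path = wait_for_download(download_dir)
```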
The code grows as the task gets more complicated, but if you have used Selenium before, it may be a natural choice for web scraping. The downside is that there are more things to install, such as browser drivers. It would be convenient if a Selenium script recorded in the browser worked as-is, but that still seems incomplete.