Web scraping using Selenium (Python)

When you build a machine learning system in Python, you can use open data published on the Internet, that is, data that can be used free of restrictions from control mechanisms such as patents. Naturally, the acquired data is used to update the model, and there are times when you want to automate acquiring it. This time, I will try web scraping with Selenium.

Notes

Automated downloads are often prohibited, so please read the terms and conditions of the site you are accessing carefully.

Typical web scraping library

As far as I could find, there are several options to choose from. I have not used Selenium for this before, so this time I will try it. To my knowledge, Selenium is a tool for automating browser operations on the Web; here, that automation is used for scraping.

Environment

A driver called WebDriver is required between Selenium and the browser (Chrome this time).

Selenium installation

I use Anaconda for Python, so I install Selenium with conda.

conda install selenium

WebDriver installation

For WebDriver, you can either download the binary file from http://chromedriver.chromium.org/downloads and copy it to a directory on your PATH, or install it with pip. This time I install it as a package, using conda instead of pip.

conda install chromedriver-binary
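Before going further, it can help to confirm that the driver is actually reachable. The following is a minimal sketch, assuming the executable is named chromedriver (chromedriver.exe on Windows); find_driver is a hypothetical helper name and not part of Selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at", path)
```

If the driver was installed through a Python package, the binary may live inside the package directory rather than on PATH, so a "not found" result here is a hint to check the installation, not proof that Selenium will fail.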

Sample execution

Now that the environment is ready, let's first run the sample from the ChromeDriver site below. It opens Google, searches for ChromeDriver, sleeps for 5 seconds, and exits. http://chromedriver.chromium.org/getting-started

test1.py


import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://www.google.com/')
time.sleep(5)  # Let the user actually see something!
# find_element(By.NAME, ...) replaces find_element_by_name, which newer Selenium releases removed
search_box = driver.find_element(By.NAME, 'q')
search_box.send_keys('ChromeDriver')
search_box.submit()
time.sleep(5)  # Let the user actually see something!
driver.quit()

Run it with:

python test1.py

When executed, the browser starts and searches for ChromeDriver. The browser displays the message "Controlled by automated testing software". It's that easy!

Open data acquisition sample

Since just running the sample would be a little underwhelming, I will write a sample that downloads a CSV file from a site as follows.

- Open the specified URL (masked here as XXXXX instead of the actual URL)
- Download the CSV linked from "Daily data(CSV)" (saved in C:\project)

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window

download_dir = 'C:\\project'
prefs = {}
prefs['download.prompt_for_download'] = False
prefs['download.directory_upgrade'] = True
prefs['download.default_directory'] = download_dir

options.add_experimental_option('prefs', prefs)

# Pass the options with options=; the older chrome_options= keyword is deprecated
driver = webdriver.Chrome(options=options)

driver.get('https://XXXXXXXXXXXXXXXXXXXXXX')
time.sleep(5)  # crude wait for the page to finish loading
url = driver.find_element(By.PARTIAL_LINK_TEXT, "Daily data(CSV)")

# In headless mode, downloads must be allowed explicitly through the DevTools command below
driver.command_executor._commands["send_command"] = (
    "POST",
    '/session/$sessionId/chromium/send_command'
)
params = {
    'cmd': 'Page.setDownloadBehavior',
    'params': {
        'behavior': 'allow',
        'downloadPath': download_dir
    }
}
driver.execute("send_command", params=params)

driver.get(url.get_attribute('href'))

time.sleep(5)  # crude wait for the download to finish
driver.quit()

Run browser hidden

The following part of the code hides the browser window by passing --headless as an option when starting the browser.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # start Chrome without showing a window
driver = webdriver.Chrome(options=options)

Specifying the save destination of the download file

The following part of the code specifies where to save the download file.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
download_dir = 'C:\\project'
prefs = {}
prefs['download.prompt_for_download'] = False
prefs['download.directory_upgrade'] = True
prefs['download.default_directory'] = download_dir
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(options=options)
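The three preference keys above can also be assembled by a small helper, which makes it easy to reuse them with a different folder. This is only a sketch: build_download_prefs is a hypothetical name, and converting the folder to an absolute path is an assumption about what Chrome expects for download.default_directory.

```python
import os


def build_download_prefs(download_dir):
    """Return Chrome preferences that save downloads to download_dir without prompting."""
    # Assumption: Chrome wants an absolute path for the download directory.
    absolute_dir = os.path.abspath(download_dir)
    return {
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "download.default_directory": absolute_dir,
    }


prefs = build_download_prefs("downloads")
print(prefs["download.default_directory"])
```

The returned dictionary can be passed to options.add_experimental_option('prefs', prefs) in the same way as the hand-built prefs in the article's code.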

With --headless, additionally specify the save destination of the download file

With --headless, additional code is required to specify the save destination of the download file.

driver.command_executor._commands["send_command"] = (
    "POST",
    '/session/$sessionId/chromium/send_command'
)
params = {
    'cmd': 'Page.setDownloadBehavior',
    'params': {
        'behavior': 'allow',
        'downloadPath': download_dir
    }
}
driver.execute("send_command", params=params)
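As a possible alternative to registering send_command by hand, the Chrome driver in recent Selenium releases offers execute_cdp_cmd, which sends a DevTools command directly. The sketch below assumes that method is available in your Selenium version; download_behavior_params and allow_headless_downloads are hypothetical helper names.

```python
def download_behavior_params(download_dir):
    """Build the parameters for the Page.setDownloadBehavior DevTools command."""
    return {"behavior": "allow", "downloadPath": download_dir}


def allow_headless_downloads(driver, download_dir):
    """Ask Chrome to save downloads to download_dir through execute_cdp_cmd."""
    # Assumption: `driver` is a selenium.webdriver.Chrome instance
    # that provides execute_cdp_cmd(command_name, command_arguments).
    return driver.execute_cdp_cmd(
        "Page.setDownloadBehavior", download_behavior_params(download_dir)
    )
```

With a driver created as in the article, the call would be allow_headless_downloads(driver, download_dir), placed before the driver.get(...) that triggers the download.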

This completes the download! Easy.
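One caveat: the fixed time.sleep(5) in the code may end before a large file has finished downloading. A minimal sketch of one alternative is to poll the download folder until a CSV file is present and no partial file remains. It assumes Chrome marks unfinished downloads with a .crdownload suffix; wait_for_download is a hypothetical helper, not a Selenium function.

```python
import time
from pathlib import Path


def wait_for_download(download_dir, pattern="*.csv", timeout=30.0, poll=0.5):
    """Return the matching files once they exist and no partial download remains.

    Raises TimeoutError if that does not happen within `timeout` seconds.
    """
    directory = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        # Assumption: Chrome names in-progress downloads with a .crdownload suffix.
        partial = list(directory.glob("*.crdownload"))
        finished = sorted(directory.glob(pattern))
        if finished and not partial:
            return finished
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download in {directory}")
        time.sleep(poll)
```

Calling wait_for_download(download_dir) in place of the last time.sleep(5) would be one way to use it. Note that it returns immediately if a matching file from an earlier run is already in the folder, so an empty folder per run is the safer setup.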

Summary

The code is likely to grow once you do more complicated things, but if you have already used Selenium, it is a reasonable choice for web scraping. The downside is that there is more to install, such as browser drivers. It would be convenient if a script created with Selenium could simply run in the browser as is, but that does not seem to be fully possible yet.

A site that seems to be helpful for doing complicated things
