When you build a machine learning system in Python, you often use open data published on the Internet, that is, data that can be used without restrictions such as patents or licenses. Naturally, you use the acquired data to update the model, and at some point you want to automate the collection itself. This time, I will try using Selenium for web scraping.
Automated (mechanical) downloads are often prohibited, so please read the terms of service of the site you are accessing carefully.
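In addition to the terms of service, you can also check a site's robots.txt programmatically before scraping. A minimal sketch using Python's standard urllib.robotparser (the URL and rules here are illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Against a real site you would do:
#   rp.set_url('https://example.com/robots.txt'); rp.read()
# Here we parse inline example rules so the sketch runs offline.
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'https://example.com/data.csv'))   # True (allowed)
print(rp.can_fetch('*', 'https://example.com/private/x'))  # False (disallowed)
```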
As far as I checked, there are several scraping frameworks, but since I had not used them before, I decided to try Selenium. Selenium is a tool that automates browser operations on the Web, and here I use that automation for scraping.
A driver called WebDriver is required between Selenium and the browser (Chrome in this case).
I use Python via Anaconda, so I install Selenium with conda.
conda install selenium
For WebDriver, you can either download the binary from http://chromedriver.chromium.org/downloads and copy it to a location on your Path, or install it as a package. This time, I install it with conda.
conda install chromedriver-binary
Now that the environment is ready, let's first run the sample from the chromedriver site below. It displays the Google site, searches for ChromeDriver, sleeps for 5 seconds, and exits. http://chromedriver.chromium.org/getting-started
test1.py
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.google.com/')
time.sleep(5) # Let the user actually see something!
search_box = driver.find_element_by_name('q')
search_box.send_keys('ChromeDriver')
search_box.submit()
time.sleep(5) # Let the user actually see something!
driver.quit()
python test1.py
When you run it, the browser starts and searches for ChromeDriver. The browser shows the message "Chrome is being controlled by automated test software." Easy!
Since just running the sample is a bit lonely, I will make a sample that downloads a CSV from the following site.
- Open the specified URL (blurred as XXXXX here instead of the actual URL)
- Download the CSV linked as "Daily data (CSV)" (save to C:\project)
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
download_dir = 'C:\\project'
prefs = {}
prefs['download.prompt_for_download'] = False
prefs['download.directory_upgrade'] = True
prefs['download.default_directory'] = download_dir
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://XXXXXXXXXXXXXXXXXXXXXX')
time.sleep(5)  # wait for the page to load
url = driver.find_element_by_partial_link_text("Daily data(CSV)")
driver.command_executor._commands["send_command"] = (
"POST",
'/session/$sessionId/chromium/send_command'
)
params = {
'cmd': 'Page.setDownloadBehavior',
'params': {
'behavior': 'allow',
'downloadPath': download_dir
}
}
driver.execute("send_command", params=params)
driver.get(url.get_attribute('href'))
time.sleep(5)  # wait for the download to finish
driver.quit()
In the following part of the code, the browser is hidden by passing --headless as an option when it starts.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
The following part of the code specifies where to save the downloaded file.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
download_dir = 'C:\\project'
prefs = {}
prefs['download.prompt_for_download'] = False
prefs['download.directory_upgrade'] = True
prefs['download.default_directory'] = download_dir
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=options)
In the case of --headless, additional code is required to specify the save destination of the downloaded file.
driver.command_executor._commands["send_command"] = (
"POST",
'/session/$sessionId/chromium/send_command'
)
params = {
'cmd': 'Page.setDownloadBehavior',
'params': {
'behavior': 'allow',
'downloadPath': download_dir
}
}
driver.execute("send_command", params=params)
This completes the download! Easy.
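By the way, the fixed time.sleep(5) before quitting may be too short for large files. A sketch of polling the download directory instead (wait_for_download is a hypothetical helper I wrote for illustration, not part of Selenium):

```python
import os
import time

def wait_for_download(directory, suffix='.csv', timeout=30):
    """Poll `directory` until a file ending in `suffix` appears and no
    partial Chrome downloads (.crdownload) remain, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        names = os.listdir(directory)
        finished = [n for n in names if n.endswith(suffix)]
        partial = [n for n in names if n.endswith('.crdownload')]
        if finished and not partial:
            return os.path.join(directory, finished[0])
        time.sleep(0.5)
    raise TimeoutError(f'no {suffix} file appeared in {directory}')

# In the script above, this would replace the final sleep:
#   driver.get(url.get_attribute('href'))
#   csv_path = wait_for_download(download_dir)
```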
The code grows as the task gets more complicated, but if you have used Selenium before, it may be a natural choice for web scraping. The downside is that there are more things to install, such as browser drivers. It would be convenient if a Selenium script recorded in the browser worked as-is, but that still seems incomplete.