Scraping with Python, Selenium and Chromedriver

What I want to do

I want to collect the bibliographic information and abstracts of all papers published in a journal on ScienceDirect.

First, learn the basics of scraping

(Reference: https://codezine.jp/article/detail/12230) Scraping is basically done with the requests package and the Beautiful Soup 4 package, so first install those:

pip install requests beautifulsoup4

So I tried something like this.

import requests
from bs4 import BeautifulSoup

#Send a request to the URL to be scraped and get the HTML
res = requests.get('https://www.ymori.com/books/python2nen/test1.html')

#Create a BeautifulSoup object from the response HTML
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)

For the time being I was able to get the HTML as text, but how do I do things like turning on a check box or clicking a button?
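Before moving on, note that the soup object can already pull static elements out of the HTML. A minimal sketch (the HTML string below is a made-up stand-in, not the actual test page):

```python
from bs4 import BeautifulSoup

# A stand-in HTML snippet, used instead of fetching a live page
html = """
<html><head><title>Test page</title></head>
<body>
  <h1>Papers</h1>
  <a href="/paper1">First paper</a>
  <a href="/paper2">Second paper</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract the title text and all link destinations
print(soup.title.string)                        # Test page
print([a['href'] for a in soup.find_all('a')])  # ['/paper1', '/paper2']
```

This works fine for content that is present in the initial HTML, which is exactly where it stops being enough for dynamic pages.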

Selenium and ChromeDriver

After investigating, it turns out that Beautiful Soup cannot handle pages that dynamically change their displayed content with JavaScript. Looking into what to do about this, I arrived at a package called Selenium. (Reference: https://qiita.com/Fujimon_fn/items/16adbd86fad609d993e8) Apparently it lets you do something like RPA: it operates a web browser step by step, the same way a human would. However, to use it you also need a driver that matches the browser you are using. (Reference: https://kurozumi.github.io/selenium-python/installation.html#drivers)

Installation

Install Selenium and ChromeDriver. Before installing, check the ChromeDriver page for the driver version that matches your version of Chrome (in my case it was 84.0.4147.30). (Reference: https://qiita.com/hanzawak/items/2ab4d2a333d6be6ac760)

pip install selenium chromedriver-binary==84.0.4147.30

Once installed this way, you don't need to set a path (but you do need to `import chromedriver_binary`). Alternatively, you can download the exe file directly from the ChromeDriver site to, say, c:\work and pass the path explicitly instead. In that case you don't need to import the package.

Let's start it up for now

The following is an example of explicitly passing the path instead of using `import chromedriver_binary`.

OpenBrowser.py


import requests
from selenium import webdriver  #  import chromedriver_binary

load_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/204/suppl/C"
driver = webdriver.Chrome(executable_path='c:/work/chromedriver.exe')  #  driver = webdriver.Chrome()
driver.get(load_url)

Then the browser launched by itself and jumped to the page at the specified address. It really is RPA.

Page manipulation

What I want to do on the journal's page is the following:

  1. First, click "Select All" in the upper left to turn on the check boxes of all the displayed papers.
  2. Click "Export Citations" to display a dialog for downloading bibliographic information.
  3. In that dialog, click "Export citation and abstract to text". → The data is then downloaded as a text file.
  4. When the text file has finished downloading, click "Previous Vol/Issue" at the top of the page to go to the previous volume's page.

If you loop this process, you can get the information for all the papers. So, let's find out how to click "Select All", "Export Citations", and "Export citation and abstract to text".

Basically, you find the target you want to operate in the page the driver loaded — by ID, class name, Name attribute, and so on — and send it .click(). So first, look for "Select All". Open the page in Chrome and press F12 to bring up the developer tools, then press Ctrl+F to open the search box and search for "Select All". You will find the place where "Select All" appears; it turns out to be a button tag. Well, that makes sense, although it didn't look like a button at first glance, so I was a little surprised. Right-click this button tag and choose Copy ⇒ Copy selector to get the CSS selector.

Back in the source code, paste the CSS selector you just copied. However, all you actually need is the part starting with "button". That gives:

button = driver.find_element_by_css_selector("button.button-link.button-link-secondary.js-select-all")
button.click()

Waiting until the element becomes accessible

However, if I just add the above to OpenBrowser.py and run it, it fails. That's because when you pass the URL to Chrome, the element isn't accessible immediately: there is a time lag before the HTML is actually fetched from the URL, parsed, and the browser can access the element. So you have to wait a while. This page imports time.sleep() from the time package and uses that, but as the manual describes, that method is not smart. Instead, I use the Wait feature that comes with WebDriver, which gives the source below. (Reference: https://qiita.com/uguisuheiankyo/items/cec03891a86dfda12c9a) (Reference: https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html) The manual only showed how to specify an element by ID, but for a CSS selector you use CSS_SELECTOR. (Reference: https://selenium-python.readthedocs.io/locating-elements.html)
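Under the hood, an explicit wait just polls a condition repeatedly until it succeeds or a timeout expires. A minimal stdlib sketch of that polling pattern (my own illustration of the idea, not Selenium's actual implementation):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    end = time.monotonic() + timeout
    while time.monotonic() < end:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Example: wait until a (simulated) element becomes available
state = {"ready_at": time.monotonic() + 1.0}

def element_clickable():
    return time.monotonic() >= state["ready_at"]

print(wait_until(element_clickable, timeout=5.0, poll=0.1))  # True
```

WebDriverWait does essentially this, with the condition being one of the expected_conditions helpers evaluated against the driver.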

I actually tried using Wait, but it didn't work: the timing was always off somehow and an error occurred. So in the end I decided to fall back on time.sleep() after all.

WaitAndOperation.py


import time
import requests
from selenium import webdriver  #  import chromedriver_binary
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

load_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/204/suppl/C"
driver = webdriver.Chrome(executable_path='c:/work/chromedriver.exe')  #  driver = webdriver.Chrome()
driver.get(load_url)
#WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located)
time.sleep(5)
#WebDriverWait(driver, 20).until(
#    EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button-link.button-link-secondary.js-select-all"))
#)
button = driver.find_element_by_css_selector("button.button-link.button-link-secondary.js-select-all")
button.click()

The same applies to other elements ...

So, for the other buttons, "Export Citations" and "Export citation and abstract to text", get the CSS selector in the same way and add the code to click them.

On the other hand, "Previous Vol/Issue" was a link, not a button. You can specify a link with a CSS selector in the same way, but you can also access the element by the link's text. So I tried accessing it by text.

final.py


import time
import requests
#from bs4 import BeautifulSoup
# import chromedriver_binary

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

load_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/204/suppl/C"
Last_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/20/issue/1"

driver = webdriver.Chrome(executable_path='c:/work/chromedriver.exe')
driver.get(load_url)

while True:
    time.sleep(5)

    button = driver.find_element_by_css_selector("button.button-link.button-link-secondary.js-select-all")
    button.click()

    time.sleep(2)

    button2 = driver.find_element_by_css_selector("button.button-alternative.text-s.u-margin-xs-top.u-display-block.js-export-citations-button.button-alternative-primary")
    button2.click()

    time.sleep(2)

    button3 = driver.find_element_by_css_selector("button.button-link.button-link-primary.u-margin-xs-bottom.text-s.u-display-block.js-citation-type-textabs")
    button3.click()

    time.sleep(3)

    #Get the current URL
    Purl = driver.current_url
    #Break if we have reached the last volume/issue
    if Purl == Last_url:
        break

    link = driver.find_element_by_link_text('Previous vol/issue')
    link.click()
