I would like to collect bibliographic information and abstracts for all papers published in a journal on ScienceDirect.
(Reference: https://codezine.jp/article/detail/12230) It seems the standard approach uses the requests package and the Beautiful Soup 4 package. So, first install them:
pip install requests beautifulsoup4
So I tried something like this.
import requests
from bs4 import BeautifulSoup
#Send a request to the URL to be scraped and get the HTML
res = requests.get('https://www.ymori.com/books/python2nen/test1.html')
#Create a BeautifulSoup object from the response HTML
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)
For the time being, I was able to get the HTML as text, but how do I interact with the page, for example to tick a checkbox or click a button?
After investigating, it seems that Beautiful Soup cannot handle pages that dynamically change their displayed contents with JavaScript. Looking into what to do about that, I arrived at a package called Selenium. (Reference: https://qiita.com/Fujimon_fn/items/16adbd86fad609d993e8) Apparently, you can do something like RPA with it: operate the web browser step by step, just as a human would. However, for this you need a driver that matches the browser you are using. (Reference: https://kurozumi.github.io/selenium-python/installation.html#drivers)
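To see why the plain requests + Beautiful Soup approach falls short, here is a stdlib-only sketch (the HTML string is hypothetical, not the real ScienceDirect markup): a parser only ever sees the markup the server sent, so anything JavaScript would have inserted after page load simply isn't there.

```python
from html.parser import HTMLParser

# Hypothetical server response: the page ships with one button and
# relies on JavaScript to add the rest of the UI after load.
STATIC_HTML = '<html><body><button class="js-select-all">Select all</button></body></html>'

class ButtonCollector(HTMLParser):
    """Collects the class attribute of every <button> tag seen in the markup."""
    def __init__(self):
        super().__init__()
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self.buttons.append(dict(attrs).get("class", ""))

parser = ButtonCollector()
parser.feed(STATIC_HTML)
print(parser.buttons)  # only markup present in the raw response is visible
```

Static parsing can read this button, but it can never click it, and it never sees elements that JavaScript renders later — which is exactly the gap Selenium fills.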
Install Selenium and chromedriver. Before installing, check the ChromeDriver page for the driver version that matches your Chrome version (in my case it was 84.0.4147.30). (Reference: https://qiita.com/hanzawak/items/2ab4d2a333d6be6ac760)
pip install selenium chromedriver-binary==84.0.4147.30
Once installed this way, you don't need to set a path (but you do need `import chromedriver_binary`). Alternatively, you can download the exe directly from the ChromeDriver site to, say, c:\work, and pass the path explicitly; in that case you don't need to import the package.
The following is an example of explicitly passing the path instead of using `import chromedriver_binary`.
OpenBrowser.py
import requests
from selenium import webdriver # import chromedriver_binary
load_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/204/suppl/C"
driver = webdriver.Chrome(executable_path='c:/work/chromedriver.exe') # driver = webdriver.Chrome()
driver.get(load_url)
Then the browser started up by itself and jumped to the page at the specified address. It really is RPA.
What I want to do is: open the journal's page, select all the papers, export their citations and abstracts, then move to the previous volume/issue and repeat. If I loop this, I can get the information for all the papers. So, I need to find out how to click "Select All", "Export Citations", and "Export citation and abstract to text".
Basically, you find the target you want to operate on in the page the driver has loaded — by ID, class name, Name attribute, and so on — and send `.click()` to it.
So, first look for "Select All".
Open the page in Chrome and press F12 to bring up the developer tools. Then press Ctrl+F to open the search box and search for "Select All". You will find the place where "Select All" is written.
It turns out it was implemented with a button tag. Well, that makes sense, but it didn't look like a button at first glance, so I was a little surprised.
For the time being, right-click on this button tag and select Copy ⇒ Copy selector to get the CSS Selector.
So, back in the source code, paste the CSS selector you just copied. However, all you need is the part starting with "button.", so:
button = driver.find_element_by_css_selector("button.button-link.button-link-secondary.js-select-all")
button.click()
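The "Copy selector" value from DevTools is usually a long, brittle path; the shortened form above is just the tag name plus its classes joined with dots. A small sketch of that idea (the class string is taken from the button found above):

```python
# Build a compound CSS selector ("tag.class1.class2...") from a tag name
# and the space-separated class attribute copied out of DevTools.
def compound_selector(tag, class_attr):
    classes = class_attr.split()
    return tag + "".join("." + c for c in classes)

selector = compound_selector("button", "button-link button-link-secondary js-select-all")
print(selector)  # button.button-link.button-link-secondary.js-select-all
```

Class-based selectors like this survive page-layout changes better than the full DevTools path, though they can still break if the site renames its classes.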
However, if I just add the above to OpenBrowser.py and run it, it fails. That's because when you pass the URL to Chrome, the page is not accessible immediately: there is a time lag before the browser fetches the HTML, parses it, and the elements become accessible.
So I have to wait for a while.
The page above imports time.sleep() from the time package and uses it. However, this method is not smart, as described in the manual, so I use the Wait feature that comes with WebDriver instead. Hence the following sources:
(Reference: https://qiita.com/uguisuheiankyo/items/cec03891a86dfda12c9a)
(Reference: https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html)
The manual only showed how to specify an element by ID, but for a CSS selector you use By.CSS_SELECTOR. (Reference: https://selenium-python.readthedocs.io/locating-elements.html)
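Under the hood, WebDriverWait just polls a condition repeatedly until it returns something truthy or a timeout expires. A minimal stdlib sketch of that polling idea (the names here are mine, not Selenium's):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.
    Mimics the core loop behind Selenium's WebDriverWait.until()."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Toy condition: becomes truthy on the third call, like an element
# that only appears once the page has finished rendering.
calls = {"n": 0}
def element_ready():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

print(wait_until(element_ready, timeout=5.0, poll=0.01))  # element
```

This is why Wait is preferred over a fixed sleep: it returns as soon as the condition holds instead of always burning the full delay.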
I actually tried it, but it didn't work with Wait: the timing always seems to be off and an error occurs. So I decided to fall back on time.sleep() after all.
WaitAndOperation.py
import time
import requests
from selenium import webdriver # import chromedriver_binary
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
load_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/204/suppl/C"
driver = webdriver.Chrome(executable_path='c:/work/chromedriver.exe') # driver = webdriver.Chrome()
driver.get(load_url)
#WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located)
time.sleep(5)
#WebDriverWait(driver, 20).until(
# EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button-link.button-link-secondary.js-select-all"))
#)
button = driver.find_element_by_css_selector("button.button-link.button-link-secondary.js-select-all")
button.click()
So, for "Export Citations" and "Export citation and abstract to text" as well, get their CSS selectors and add the code to click them in the same way.
On the other hand, "Previous Vol/Issue" was a link, not a button. You can target a link with a CSS selector in the same way, but you can also access the element by the link's text, so I tried accessing it by text.
final.py
import time
import requests
#from bs4 import BeautifulSoup
# import chromedriver_binary
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
load_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/204/suppl/C"
Last_url = "https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety/vol/20/issue/1"
driver = webdriver.Chrome(executable_path='c:/work/chromedriver.exe')
driver.get(load_url)
while True:
    time.sleep(5)
    # Select all papers in the current issue
    button = driver.find_element_by_css_selector("button.button-link.button-link-secondary.js-select-all")
    button.click()
    time.sleep(2)
    # Open the "Export Citations" menu
    button2 = driver.find_element_by_css_selector("button.button-alternative.text-s.u-margin-xs-top.u-display-block.js-export-citations-button.button-alternative-primary")
    button2.click()
    time.sleep(2)
    # Choose "Export citation and abstract to text"
    button3 = driver.find_element_by_css_selector("button.button-link.button-link-primary.u-margin-xs-bottom.text-s.u-display-block.js-citation-type-textabs")
    button3.click()
    time.sleep(3)
    # Get the current URL
    Purl = driver.current_url
    # Break if we have reached the last issue
    if Purl == Last_url:
        break
    # Move on to the previous volume/issue
    link = driver.find_element_by_link_text('Previous vol/issue')
    link.click()
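The link lookup by visible text used for "Previous vol/issue" above can be sketched with the stdlib alone (the HTML snippet is hypothetical, standing in for the real navigation links): scan the anchors and keep the href of the one whose text matches.

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Records the href of the <a> tag whose visible text equals `target`."""
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.href = None
        self._current = None  # href of the <a> currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current is not None and data.strip() == self.target:
            self.href = self._current

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None

# Hypothetical markup standing in for the journal's navigation links
HTML = '<a href="/vol/203">Previous vol/issue</a><a href="/vol/205">Next vol/issue</a>'
finder = LinkFinder("Previous vol/issue")
finder.feed(HTML)
print(finder.href)  # /vol/203
```

Selenium's find_element_by_link_text does this matching for you inside the live browser, which is why it works even on pages rendered by JavaScript.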