[PYTHON] Automatic scraping of reCAPTCHA site every day (2/7: scraping)

  1. Requirement definition ~ python environment construction
  2. **Create the site scraping mechanism**
  3. Process the downloaded file (xls) to create the final product (csv)
  4. Create file download from S3 / file upload to S3
  5. Break through the reCAPTCHA with 2captcha
  6. Run it in a Docker container
  7. Register it with AWS Batch

Creating the scraping mechanism

The target site is built with React and driven by JavaScript, so I use Selenium.

Review the file structure


├── app
│   ├── drivers              # put the selenium drivers here
│   └── source
│       └── scraping.py      # the scraping logic
└── tmp
    ├── files
    │   └── download         # files downloaded by scraping go here
    └── logs                 # logs (selenium log etc.)

Driver download

You need to decide which browser Selenium will drive. There were three main candidates:

  - PhantomJS: once the go-to headless browser
  - Chrome: recently gained a headless mode and looks set to become the mainstream choice
  - Firefox: the browser Selenium supported first

At first I was using Chrome, but it didn't play well with Xvfb (which comes up later in this series), so I decided to use Firefox. Download the driver and place it under /drivers/. [^1]

Also, in order to run Firefox's geckodriver, Firefox itself must be installed on the OS. If you haven't installed it yet, download it from the official site.
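
If you want to confirm that both Firefox and the driver are visible before running anything, a quick check like the following works (a sketch of my own, assuming `firefox` is on your PATH and geckodriver sits in `app/drivers/`; neither path is from the original article):

import subprocess

# Print the installed versions; a FileNotFoundError means the binary is missing.
for cmd in (['firefox', '--version'], ['./app/drivers/geckodriver', '--version']):
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    print(out.splitlines()[0])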

[^1]: I believe I downloaded the latest version (the macOS build)... Also, on macOS it seems that if you place the driver in a standard location on the PATH you don't have to specify its path at startup, but I don't use that method this time.

Prepare the download destination folder

Finally, time to code. The overall flow is:

  1. Put the downloaded file in a specific folder
  2. Process the file from step 1 into the deliverable data (csv) and put it in another folder
  3. Upload the csv from step 2 to S3

To get started, prepare the download destination folder.

scraping.py


# Create tmp/files/download/YYYYMMDD/ relative to this script's location
date = datetime.now().strftime('%Y%m%d')
dldir_name = os.path.abspath(__file__ + '/../../../tmp/files/download/{}/'.format(date))
dldir_path = Path(dldir_name)
dldir_path.mkdir(parents=True, exist_ok=True)  # create parent dirs too; no error if it already exists
download_dir = str(dldir_path.resolve())

The import statements and such are collected together at the end of this article. ...The code feels a bit verbose, I think, but it worked, so I'm fine with it.

Write the code up to starting Selenium

Next, the code up to the point where Selenium starts geckodriver. With Firefox, a download dialog appears by default, so you need to set several preferences so that it does not show up.

scraping.py


driver_path = os.path.abspath(__file__ + '/../../drivers/geckodriver')  # location of the driver binary
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)  # 2 = use the custom download directory set below
fp.set_preference("browser.download.dir", download_dir)
fp.set_preference("browser.download.manager.showWhenStarting", False)  # don't show the download manager window (legacy pref)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
log_path = os.path.abspath(__file__ + '/../../../tmp/logs/geckodriver.log')  # by default the log file is generated next to the executable
driver = webdriver.Firefox(firefox_profile=fp, executable_path=driver_path, service_log_path=log_path)

driver.maximize_window()  # the site hides its side menu at small window sizes...
driver.implicitly_wait(10)  # fail if an element is not found within 10 seconds

Be careful with the browser.helperApps.neverAsk.saveToDisk preference: only the MIME types listed there skip the "Do you want to download?" dialog. The xls file downloaded here turned out to be **application/vnd.openxmlformats-officedocument.spreadsheetml.sheet**. [^2]

[^2]: The official MIME type for xls is different (application/vnd.ms-excel), but you need to specify the type of the file the server actually sends.
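
If you are unsure which MIME type the server actually sends, one way to check is a HEAD request, sketched below with a hypothetical URL. This assumes the file is reachable without login and that the requests library is installed; if not, the Network tab of the browser's developer tools shows the same Content-Type header.

import requests

# Ask for the headers only; the Content-Type value is what goes into
# browser.helperApps.neverAsk.saveToDisk.
resp = requests.head('https://example.com/path/to/report.xls')  # hypothetical URL
print(resp.headers.get('Content-Type'))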

By the way, the same setup was this easy with Chrome:

scraping.py


driver_path = os.path.abspath(__file__ + '/../../drivers/chromedriver')
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": download_dir})  # a single pref sets the download folder
driver = webdriver.Chrome(executable_path=driver_path, options=options)

Login process

In the case of this site, the following was enough to log in.

scraping.py


# Log in
driver.get(LOGIN_URL)
mail_address = driver.find_element_by_id("mail_address")
mail_address.send_keys(config.mail_address)
password = driver.find_element_by_id("password")
password.send_keys(config.password)
submit_button = driver.find_element_by_css_selector("button.submit")
submit_button.click()

…However, a reCAPTCHA appears when you click the button, so it has to be cleared manually for now. I'll solve it automatically with 2captcha later in this series, but until then I clear it myself whenever it comes up.

To make the script wait until the captcha is cleared, insert an explicit wait. In the example below, it waits up to 100 seconds.

scraping.py


try:
    WebDriverWait(driver, 100).until(
        # wait for an element that only appears after a successful login
        EC.presence_of_element_located((By.CSS_SELECTOR, "<element displayed after login>"))
    )
except TimeoutException:
    driver.quit()
    exit('Error: failed to log in')
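
Incidentally, the script can also announce when manual intervention is needed by checking for the reCAPTCHA widget first (my own sketch; it assumes the widget is rendered in an iframe whose src contains "recaptcha", which is the usual case):

# Sketch: detect the reCAPTCHA iframe so the operator knows to step in.
if driver.find_elements_by_css_selector('iframe[src*="recaptcha"]'):
    print('reCAPTCHA detected - please solve it manually (waiting up to 100s)')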

Data acquisition

Now, it's finally time to fetch the data. I did it like this.

scraping.py


# Search with the search word
search_box = driver.find_element_by_xpath(SEARCH_BOX_XPATH)
search_box.clear()
search_box.click()
search_box.send_keys(word)  # type into the search box
time.sleep(INTERVAL)  # wait for React to finish processing
search_box.send_keys(Keys.RETURN)  # Return key selects the first search result

# Open the menu
try:
    driver.find_element_by_xpath(MENU_XPATH).click()
except ElementNotInteractableException:
    # the menu item can't be clicked -> its accordion is not open yet
    driver.find_element_by_xpath(MENU_OPEN_XPATH).click()
    driver.find_element_by_xpath(MENU_XPATH).click()

# Download
driver.find_element_by_xpath(DOWNLOAD_XPATH).click()

  - With recent reactive front-end frameworks it is hard to pin elements down with CSS paths, so I basically used XPath to locate elements.
  - With React and the like, many actions failed unless I waited for the JavaScript to finish. Make good use of time.sleep().
  - When an element cannot be interacted with, ElementNotInteractableException is thrown. There doesn't seem to be a direct method to ask "does this element exist and is it clickable?", so make good use of the exception (see the sketch below for a workaround).
  - On a normal site you can often download files by hitting their URLs directly, without clicking anything... but this site required actual clicks.
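
On that exception point: a workaround I find handy (my own sketch, not from the original article) is to check is_displayed() before clicking instead of reacting to the exception:

# Sketch: is_displayed() reports whether the element is visible, so the
# accordion can be opened up front instead of catching the exception.
menu = driver.find_element_by_xpath(MENU_XPATH)
if not menu.is_displayed():  # accordion closed, so the item is not clickable yet
    driver.find_element_by_xpath(MENU_OPEN_XPATH).click()
    menu = driver.find_element_by_xpath(MENU_XPATH)  # re-find in case React re-rendered
menu.click()

Also note that click() returns as soon as the download starts, not when it finishes. Below is a minimal sketch to wait for completion before quitting the driver (my own addition; it relies on Firefox writing in-progress downloads as *.part files, and reuses the time and Path imports already in scraping.py):

def wait_for_download(dir_path, timeout=60):
    """Poll the download directory until no *.part files remain."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = list(Path(dir_path).iterdir())
        if files and not any(f.suffix == '.part' for f in files):
            return files
        time.sleep(1)
    raise TimeoutError('download did not finish within {} seconds'.format(timeout))

wait_for_download(download_dir)
driver.quit()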

Complete

Connect everything introduced so far and the download part is complete for now! Finally, the import statements.

scraping.py


import os
import time
from datetime import datetime
from pathlib import Path

from selenium import webdriver
from selenium.common.exceptions import ElementNotInteractableException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

import config  # local module holding the login credentials (not shown in this article)

To be continued in the next part.
