I tried scraping with Python

Purpose

I had been solving problems on paiza and AtCoder, but I wanted to actually build something with programming. I was also inspired by this article (the gist being that most of what you need to build web services can be learned through scraping). This time, I scrape the companies listed on Rikunabi Direct and register them in a DB.

Reasons for choosing Rikunabi Direct

Rikunabi Direct is a service that, each week, picks out and introduces a handful of companies it judges to match the job seeker, based on the industries of interest the job seeker has registered. Job seekers cannot search on their own. A list of all listed companies is provided, but reading through more than 16,000 companies from end to end would be hugely time-consuming, if not reckless. So I decided to scrape the company information and register it in a DB to make it searchable.

Since scraping is slow (about 2.5 seconds per company), I launched two PhantomJS instances and sped things up with parallel processing.

All code

GitHub

Procedure

Required packages

main.py


import utils
import os
import MySQLdb
import time
import selenium
import settings
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

utils.py


import sys
import requests
import csv
import os
import MySQLdb
import settings
import time
import selenium
from selenium import webdriver

Start PhantomJS

main.py


NUMBER_OF_BROWSERS = 2

browser = utils.generate_browser()
browser_2 = utils.generate_browser()
browser_list = [browser, browser_2]

browsers = []
for i in range(NUMBER_OF_BROWSERS):
    browsers.append([browser_list[i], USER_IDs[i], PASS_WORDs[i]])

utils.py


PHANTOMJS_PATH = '/usr/local/bin/phantomjs'

def generate_browser():
    browser = webdriver.PhantomJS(executable_path=PHANTOMJS_PATH)
    print('PhantomJS initializing')
    browser.implicitly_wait(3)
    return browser

implicitly_wait() is used to make the driver wait while PhantomJS starts up.

Enter your user ID and password to log in

main.py


TIME_TO_WAIT = 5
USER_IDs = [USER_ID, USER_ID_2]
PASS_WORDs = [PASS_WORD, PASS_WORD_2]

for browser_param in browsers:
    utils.login(browser_param[1], browser_param[2], browser_param[0])
    utils.set_wait_time(TIME_TO_WAIT, browser_param[0])
    utils.check_current_url(browser_param[0])

utils.py


def set_wait_time(time, browser):
    browser.set_page_load_timeout(time)

def login(user, pass_word, browser):
    #Access the login page
    url_login = 'https://rikunabi-direct.jp/2020/login/'
    browser.get(url_login)
    #Access check: exit if the login page is unreachable
    test = requests.get(url_login)
    status_code = test.status_code
    if status_code == 200:
        print('HTTP status code ' + str(status_code) + ': accessed the login page')
    else:
        print('HTTP status code ' + str(status_code) + ': could not access the login page')
        sys.exit()
    time.sleep(5)
    #Enter the user ID and password
    #user ID
    element = browser.find_element_by_name('accountId')
    element.clear()
    element.send_keys(user)
    print('entered the user ID')
    #password
    element = browser.find_element_by_name('password')
    element.clear()
    element.send_keys(pass_word)
    print('entered the password')
    #Submit
    submit = browser.find_element_by_xpath("//img[@alt='Login']")
    submit.click()
    print('clicked the login button')

browser.get(url) navigates to the page, and browser.find_element_by_name() locates the desired element by the value of its name attribute. The element to find looks like this:

<input type="text" name="accountId" autocomplete="off" value="" ...

Get the element with browser.find_element_by_name('accountId'). Clear the text box of the fetched element with clear() and enter the user ID with send_keys(). Enter the password in the same way. The submit button is located via XPath, which you can get from Chrome's developer tools: right-click the element > Copy > Copy XPath. (At first I did not know this handy trick and wrote absolute paths like /html/body/..., which caused me a lot of trouble.) For the find_element_by_* methods, I referred to the links in the reference section.
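
For illustration, the two styles look like this; the absolute path below is a made-up example of the brittle approach, while the relative locator is the one actually used in login().


#Brittle: an absolute XPath breaks as soon as any ancestor element changes (hypothetical path)
submit = browser.find_element_by_xpath('/html/body/div[2]/form/p/img')
#Robust: a relative XPath that matches the login image by its alt attribute
submit = browser.find_element_by_xpath("//img[@alt='Login']")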

Go to the list of all listed companies and get each company's page URL

main.py


#If the csv does not exist, collect all URLs into an array and export them as csv
if not os.path.exists(URL_PATH):
    utils.move_to_company_list(browsers[0][0])
    url_arr = utils.get_url(NUMBER_OF_COMPANY, browsers[0][0])
    utils.export_csv(url_arr, URL_PATH)
    utils.browser_close(browsers[0][0])
else:
    #If the csv exists, read it and store the URLs in url_arr
    url_arr = utils.import_csv(URL_PATH)

utils.py


def move_to_company_list(browser):
    #Go to the page of all listed companies
    element = browser.find_element_by_link_text('All listed companies')
    element.click()
    #It will be opened in another tab, so move to the second tab
    browser.switch_to_window(browser.window_handles[1])

#Get URLs of all listed companies
def get_url(number_of_company, browser):
    url_arr = []
    for i in range(2, number_of_company):
        url_xpath = '/html/body/div/div/table/tbody/tr[{0}]/td/ul/li/a'.format(i)

        element = browser.find_element_by_xpath(url_xpath)
        url = element.get_attribute('href')
        url_arr.append(url)

        print(str(i))
        print(url)

    return url_arr

#Export array to CSV
def export_csv(arr, csv_path):
    with open(csv_path, 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(arr)

#Close the current tab and return to the first tab
def browser_close(browser):
    browser.close()
    browser.switch_to_window(browser.window_handles[0])

def import_csv(csv_path):
    if os.path.exists(csv_path):
        with open(csv_path, 'r') as f:
            data = list(csv.reader(f))  #two-dimensional array; the 0th element is the array of URLs
        return data[0]
    else:
        print('csv does not exist')
        sys.exit()

Collecting the URLs takes a long time every run, so once they have been fetched I write them out to a csv and reuse it.

Split the array of URLs among the browsers

main.py


url_arrs = list(np.array_split(url_arr, NUMBER_OF_BROWSERS))
for i in range(NUMBER_OF_BROWSERS):
    print('length of array{0} : '.format(i) + str(len(url_arrs[i])))

DB connection

main.py


connector = MySQLdb.connect(
    unix_socket=DB_UNIX_SOCKET,
    host=DB_HOST, user=DB_USER, passwd=DB_PASS_WORD, db=DB_NAME
)
corsor = connector.cursor()

Perform scraping for each URL

main.py


#Perform the scraping in each browser (parallel processing)
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="thread") as executor:
    for i in range(NUMBER_OF_BROWSERS):
        executor.submit(utils.scraping_process, browsers[i][0], url_arrs[i], corsor, connector)

utils.py


def open_new_page(url, browser):
    try:
        browser.execute_script('window.open()')
        browser.switch_to_window(browser.window_handles[1])
        browser.get(url)
    except selenium.common.exceptions.TimeoutException:
        browser_close(browser)
        print('connection timeout')
        print('retrying ...')
        open_new_page(url, browser)

def content_scraping(corsor, connector, browser):
    #Find a scraping target
    name_element = browser.find_element_by_class_name('companyDetail-companyName')
    position_element = browser.find_element_by_xpath('//div[@class="companyDetail-sectionBody"]/p[1]')
    job_description_element = browser.find_element_by_xpath('//div[@class="companyDetail-sectionBody"]/p[2]')
    company_name = name_element.text
    position = position_element.text
    job_description = job_description_element.text
    url = browser.current_url

    casual_flag = is_exist_casual(browser)

    #----------Below DB registration process----------#  
    #INSERT
    corsor.execute('INSERT INTO company_data_2 SET name="{0}", url="{1}", position="{2}", description="{3}", is_casual="{4}"'.format(company_name, url, position, job_description, casual_flag))
    connector.commit()

def scraping_process(browser, url_arr, corsor, connector):
    count = 0

    for url in url_arr:
        open_new_page(url, browser)
        print('{0} scraping start'.format(count))
        check_current_url(browser)

        try:
            content_scraping(corsor, connector, browser)
        except selenium.common.exceptions.NoSuchElementException:
            print('This company is no longer listed')
        except MySQLdb._exceptions.ProgrammingError:
            print('SQL programming error')

        browser_close(browser)
        print('{0} scraping process end.'.format(count))
        count += 1

The text inside a retrieved element can be read with element.text. In open_new_page(), when the expected exception (a TimeoutException in this case) occurs, the function recursively calls itself to retry the connection. I learned about recursive functions while working through problems on AtCoder and elsewhere; I had no idea where they would come up in real code, but this time I managed to create a use case of my own.
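
As written, the retry has no upper bound, so a URL that keeps timing out would recurse forever. Below is a minimal sketch of the same function with a short wait and a retry cap; the max_retries value and the 5-second sleep are my own additions, not part of the original code (browser_close is the helper defined above).


def open_new_page(url, browser, max_retries=3):
    try:
        browser.execute_script('window.open()')
        browser.switch_to_window(browser.window_handles[1])
        browser.get(url)
    except selenium.common.exceptions.TimeoutException:
        browser_close(browser)
        if max_retries == 0:
            print('giving up on ' + url)
            return
        print('connection timeout, retrying ...')
        time.sleep(5)  #wait a little before retrying (my own addition)
        open_new_page(url, browser, max_retries - 1)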

There are five pieces of information to scrape: company name, URL, job type, job description, and whether casual dress is allowed at work. Each element is located by its XPath or class name.

Some listed companies have stopped posting. In that case, find_element_by_* tries to fetch an element that does not exist and raises NoSuchElementException, so I catch it. ProgrammingError is raised by MySQLdb when job_description and the like contain single or double quotes. If anyone knows how to do something like PHP's PDO prepared statements here, please tell me.
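
For reference, MySQLdb can escape the values itself if you pass them separately with %s placeholders instead of building the SQL string with format(). A minimal sketch against the same table and columns as above, which should also avoid the ProgrammingError caused by quotes in the scraped text:


#Placeholder version: MySQLdb escapes each value, so quotes in the text are safe
corsor.execute(
    'INSERT INTO company_data_2 (name, url, position, description, is_casual) '
    'VALUES (%s, %s, %s, %s, %s)',
    (company_name, url, position, job_description, casual_flag)
)
connector.commit()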

Speeding up

scraping_process(), which does the actual work, follows the flow of opening the target page > locating and fetching the elements > registering them in the DB. The longest part of this flow is from opening a page to being able to fetch the first element, because the page takes a while to render after it is opened. To mitigate this slowdown, the work is parallelized with ThreadPoolExecutor from concurrent.futures: while browser 1 is waiting for its page to render, browser 2 can keep working. This makes it much faster than processing with a single browser.
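
As a rough estimate using the numbers above: 16,000 companies at about 2.5 seconds each is roughly 11 hours with a single browser, so if the two browsers' page-load waits overlap well, the total should come down to somewhere around half of that.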

Result

I was able to successfully scrape nearly 16,000 companies. I learned a lot from writing a recursive function on my own, studying the HTML structure in order to write XPaths, and wrapping the final processing into a function so that it could be parallelized.

Reference

*1 Selenium Python Bindings 4. Find Elements
[Python] Selenium usage memo
Summary of how to select elements in Selenium
Selenium API (reverse lookup)
Check the existence of the file with python
Take and verify XPath in Chrome
Parallel task execution using concurrent.futures in Python
I thoroughly investigated parallel and concurrent processing in Python
I operated MySQL from Python3 on Mac
