Scraping a JavaScript-rendered website with Python

Overview

I scraped a site that builds its DOM with JavaScript using Python, so here are my notes on the procedure.

In short, I used a combination of Scrapy and Selenium.

Scrapy

Scrapy is a framework for implementing crawlers.

You implement the crawler as classes that satisfy the interfaces defined by the framework: the crawler itself as a subclass of *Spider*, the scraped information as a subclass of *Item*, and the processing applied to scraped items as a *Pipeline* component.

Scrapy provides a command called *scrapy* that lets you list the crawlers you've created and launch them.

Selenium

Selenium is a tool for controlling a browser programmatically. It can be used from various languages, including Python, and is often used for automated testing of websites and apps. With it you can execute JavaScript and scrape the HTML source, including dynamically generated DOM.

Originally it seems to have controlled the browser via JavaScript, but now it controls the browser by sending messages to it directly.

When I tried it in my environment (OS X), Safari and Chrome couldn't be used without installing something like an extension, while Firefox worked as-is. If you use PhantomJS, you can scrape without opening a window, so it can also be used on a server.

Why use Scrapy

You can scrape with Selenium alone, without using Scrapy, but the merits of Scrapy that come to mind are:

- It scrapes multiple pages in parallel (multithreading? multiprocessing?).
- It neatly summarizes logs: how many pages were crawled, how many errors occurred, and so on.
- It avoids crawling the same page twice.
- It provides various configuration options, such as the crawl interval.
- You can combine CSS and XPath to extract data from the DOM (the combination is quite convenient!).
- It can output crawl results as JSON or XML.
- It lets you write the crawler with a clean program design (better than designing one yourself, I think).

At first, studying the framework felt like a hassle, so I implemented the crawler with just Selenium and PyQuery, but once I started writing logging and error handling myself, the feeling of reinventing the wheel grew too strong and I gave up on that approach.

Before implementing, I read the Scrapy documentation from the beginning through the Settings page, as well as the *Architecture overview* and *Downloader Middleware* pages.

Use Scrapy and Selenium in combination

The original idea comes from this Stack Overflow post.

Scrapy's architecture looks like this (from the Scrapy documentation).

[Figure: Scrapy architecture overview]

Customize the *Downloader Middleware* so that Scrapy's Spider scrapes pages through Selenium.

Downloader Middleware implementation

A Downloader Middleware is implemented as an ordinary class that implements *process_request*.

selenium_middleware.py


# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
from selenium.webdriver import Firefox


# A single shared browser instance; swap this for another
# webdriver (e.g. PhantomJS) to run without a window.
driver = Firefox()


class SeleniumMiddleware(object):

    def process_request(self, request, spider):
        # Let the browser fetch and render the page (executing its
        # JavaScript), then hand the rendered source back to Scrapy.
        driver.get(request.url)

        return HtmlResponse(driver.current_url,
                            body=driver.page_source,
                            encoding='utf-8',
                            request=request)


def close_driver():
    # quit() shuts down the browser and the driver process;
    # close() would only close the current window.
    driver.quit()

If you register this Downloader Middleware, *Spider* will call *process_request* before scraping each page. For more information, see here (http://doc.scrapy.org/en/1.0/topics/downloader-middleware.html).

Since *process_request* returns an *HtmlResponse* instance, no Downloader Middleware is called after it. Note that, depending on the priority setting, the default Downloader Middlewares (such as the one that parses robots.txt) may therefore never be called.
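As a settings sketch of how that priority interacts with the defaults (the project name `some_crawler` is the one used later in this post; `RobotsTxtMiddleware` sits at priority 100 in Scrapy's defaults):

```python
# settings fragment: lower numbers have process_request called earlier
DOWNLOADER_MIDDLEWARES = {
    # priority 0 runs before every default middleware, so returning a
    # response here short-circuits them (robots.txt is never checked)
    "some_crawler.selenium_middleware.SeleniumMiddleware": 0,
    # to keep robots.txt handling, a priority above 100 would instead
    # let RobotsTxtMiddleware run first:
    # "some_crawler.selenium_middleware.SeleniumMiddleware": 200,
}
```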

Above I'm using Firefox; if you want to use PhantomJS instead, rewrite the *driver* variable.
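That swap is a one-line change at the top of *selenium_middleware.py* (assuming the PhantomJS binary is installed and on your PATH):

```python
from selenium.webdriver import PhantomJS

# headless browser: no window, so it also works on a server
driver = PhantomJS()
```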

Downloader Middleware registration

Register *SeleniumMiddleware* in the *Spider* class you want to use it with.

some_spider.py


# -*- coding: utf-8 -*-
import scrapy

from ..selenium_middleware import close_driver


class SomeSpider(scrapy.Spider):
    name = "some_spider"
    allowed_domains = ["somedomain"]
    start_urls = (
        'http://somedomain/',
    )
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "some_crawler.selenium_middleware.SeleniumMiddleware": 0,
        },
        "DOWNLOAD_DELAY": 0.5,
    }

    def parse(self, response):
        # Crawler processing goes here: the response body is the
        # JavaScript-rendered source returned by SeleniumMiddleware.
        pass

    def closed(self, reason):
        # Called when the spider finishes; shut the browser down.
        close_driver()
