Festival scraping with Python and Scrapy

Introduction

This article uses a scraping framework called Scrapy for web scraping in Python.

Official documentation

Scrapy architecture

The framework overview is covered in the official documentation (https://doc.scrapy.org/en/latest/topics/architecture.html).

Data flow

(Architecture diagram: see the official documentation linked above.)

Components

Scrapy Engine

The Scrapy Engine controls each component, as shown in the data flow diagram above.

Scheduler

The Scheduler receives requests from the Scrapy Engine, enqueues them, and feeds them back to the engine later when the engine asks for the next request.

Downloader

The Downloader fetches web pages and feeds them to the Scrapy Engine, which in turn passes them to the Spiders.

Spiders

Spiders are custom classes written by Scrapy users to parse responses and extract Items (scraped data) or additional requests to follow.

This is the component I mainly customize.

Item Pipeline

The Item Pipeline processes an Item once the Spider has extracted it. Typical tasks include cleansing, validation, and persistence (such as storing items in a database).

For example, you can write items to a JSON file or save them to a database.
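As a rough illustration, a minimal validation pipeline could look like the sketch below. The class name is hypothetical and this pipeline is not part of the article's project; to enable it you would register it under ITEM_PIPELINES in settings.py (e.g. {'scraping.pipelines.FestivalValidationPipeline': 300}).

# A minimal Item Pipeline sketch (hypothetical): drop items that are
# missing a name, pass everything else through unchanged.
from scrapy.exceptions import DropItem


class FestivalValidationPipeline(object):
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('Missing name in %r' % item)
        return item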

Downloader middlewares

Downloader middlewares, located between Scrapy Engine and Downloader, are specific hooks that handle requests passed by Scrapy Engine to Downloader and responses passed by Downloader to Scrapy Engine.

I will not use them this time.

Spider middlewares

Spider middlewares are specific hooks that sit between the Scrapy Engine and Spider and can handle Spider inputs (responses) and outputs (Items and requests).

I will not use them this time.

Install Requirements

The libraries Scrapy depends on (libxml2, libxslt, OpenSSL, zlib, etc.) are installed in the Dockerfile below.

Implementation outline of the festival scraping app

- Building a local development environment using Docker
- Creating a project
- Implementing the scraping
- Outputting the result as a JSON file

The site targeted for scraping is Let's Enjoy Tokyo (http://www.enjoytokyo.jp).

Creating a Dockerfile

FROM python:3.6-alpine

RUN apk add --update --no-cache \
    build-base \
    python-dev \
    zlib-dev \
    libxml2-dev \
    libxslt-dev \
    openssl-dev \
    libffi-dev

ADD pip_requirements.txt /tmp/pip_requirements.txt
RUN pip install -r /tmp/pip_requirements.txt

ADD ./app /usr/src/app
WORKDIR /usr/src/app

pip_requirements.txt looks like this

# Production
# =============================================================================
scrapy==1.4.0


# Development
# =============================================================================
flake8==3.3.0
flake8-mypy==17.3.3
mypy==0.511

Build Docker image

docker build -t festival-crawler-app .

Start Docker container

# Run from the project root directory
docker run -itd -v $(pwd)/app:/usr/src/app \
--name festival-crawler-app \
festival-crawler-app

Enter the started container

docker exec -it festival-crawler-app /bin/sh

Creating a project

# scrapy startproject <project_name> [project_dir]
scrapy startproject scraping .

The directory structure before and after creating the project looks like this.

Before creating the project

├── Dockerfile
├── app
└── pip_requirements.txt

After creating the project

.
├── Dockerfile
├── app
│   ├── scraping
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       └── __init__.py
│   └── scrapy.cfg
└── pip_requirements.txt

Configuring settings.py

DOWNLOAD_DELAY sets how long the Downloader waits between consecutive page downloads from the same website.

Be sure to set this value so that you do not overload the website being scraped.

Check the target site's robots.txt; if a Crawl-delay is specified, it is a good idea to use that value.

DOWNLOAD_DELAY = 10  # sec
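As a quick sketch, you can check for a Crawl-delay with Python's standard library (the '*' user agent below is just an example):

# Check whether the target site's robots.txt specifies a Crawl-delay.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser('http://www.enjoytokyo.jp/robots.txt')
parser.read()
print(parser.crawl_delay('*'))  # None if no Crawl-delay directive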

Implementation of app/scraping/items.py

# -*- coding: utf-8 -*-

import scrapy
from scrapy.loader.processors import (
    Identity,
    MapCompose,
    TakeFirst,
)
from w3lib.html import (
    remove_tags,
    strip_html5_whitespace,
)


class FestivalItem(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    term_text = scrapy.Field(
        input_processor=MapCompose(remove_tags, strip_html5_whitespace),
        output_processor=TakeFirst(),
    )
    stations = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Identity(),
    )

The Item class defines the data handled by the scraper. This time, we define the following fields:

- Festival name
- Holding period
- Nearest stations

`input_processor` and `output_processor` are very convenient: they let you process values as they are collected and as they are output, respectively.
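As a small standalone sketch (not part of the project), this is what those processors do to a list of extracted values:

from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

values = ['<span>Fireworks Aquarium</span>', '<span>other</span>']
cleaned = MapCompose(remove_tags)(values)  # ['Fireworks Aquarium', 'other']
print(TakeFirst()(cleaned))                # 'Fireworks Aquarium'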

Implementation of app/scraping/spiders/festival.py

# -*- coding: utf-8 -*-

from typing import Generator

import scrapy
from scrapy.http import Request
from scrapy.http.response.html import HtmlResponse

from scraping.itemloaders import FestivalItemLoader


class FestivalSpider(scrapy.Spider):
    name: str = 'festival:august'

    def start_requests(self) -> Generator[Request, None, None]:
        url: str = (
            'http://www.enjoytokyo.jp'
            '/amuse/event/list/cate-94/august/'
        )
        yield Request(url=url, callback=self.parse)

    def parse(
        self,
        response: HtmlResponse,
    ) -> Generator[Request, None, None]:
        for li in response.css('#result_list01 > li'):
            loader = FestivalItemLoader(
                response=response,
                selector=li,
            )
            loader.set_load_items()
            yield loader.load_item()

The implementation here describes the actual scraping. The class variable `name: str = 'festival:august'` is the spider name used from the command line: with the definition above, you can run the scrape with `scrapy crawl festival:august`.
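As a quick check, you can confirm that the spider is registered in the project:

scrapy list  # should print festival:august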

Implementation of app/scraping/itemloaders.py

# -*- coding: utf-8 -*-

from scrapy.loader import ItemLoader

from scraping.items import FestivalItem


class FestivalItemLoader(ItemLoader):
    default_item_class = FestivalItem

    def set_load_items(self) -> None:
        s: str = '.rl_header .summary'
        self.add_css('name', s)
        s = '.rl_main .dtstart::text'
        self.add_css('term_text', s)
        s = '.rl_main .rl_shop .rl_shop_access'
        self.add_css('stations', s)

This may not be the canonical usage, but I thought listing the CSS selectors in festival.py would be confusing, so I defined them in the `ItemLoader` instead. The `set_load_items` method is a custom addition; it does not exist on the base `ItemLoader` class.

Scraping command

The output destination can be specified with the -o option. If you want to save the results to a database or similar, you need to implement that separately; a rough sketch follows the command below.

scrapy crawl festival:august -o result.json
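For example, persisting items to SQLite could be done with an Item Pipeline like this hedged sketch (the file and table names are illustrative, the pipeline is not used in this article, and it would also need to be registered in ITEM_PIPELINES):

# A hypothetical pipeline that stores scraped items in SQLite.
import sqlite3


class SqliteFestivalPipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('festivals.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS festival '
            '(name TEXT, term_text TEXT, stations TEXT)'
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Stations are stored as a comma-separated string for simplicity.
        self.conn.execute(
            'INSERT INTO festival VALUES (?, ?, ?)',
            (item.get('name'), item.get('term_text'),
             ','.join(item.get('stations', []))),
        )
        return item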

Now you can get the festival information for August:

[
	{
		"name": "ECO EDO Nihonbashi Art Aquarium 2017 ~ Edo / Goldfish Ryo ~ & Night Aquarium",
		"stations": [
			"Shin-Nihombashi Station"
		],
		"term_text": "2017/07/07(Money)"
	},
	{
		"name": "Milky Way Illuminations ~ Seien Campaign ~",
		"stations": [
			"Akabanebashi Station (5-minute walk)"
		],
		"term_text": "2017/06/01(wood)"
	},
	{
		"name": "Fireworks Aquarium by NAKED",
		"stations": [
			"Shinagawa Station"
		],
		"term_text": "2017/07/08(soil)"
	}
	...
]

Normally you would also need to handle pagination properly and parse the date and time strings, but this was enough to get basic festival scraping working.
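For reference, here is a hedged sketch of how parse() could follow pagination; the '.next a' selector is a guess, not the site's actual markup. Date strings such as '2017/07/07(Fri)' could then be parsed with datetime.strptime(term_text[:10], '%Y/%m/%d').

# Inside FestivalSpider; the pagination selector below is hypothetical.
def parse(self, response):
    for li in response.css('#result_list01 > li'):
        loader = FestivalItemLoader(response=response, selector=li)
        loader.set_load_items()
        yield loader.load_item()

    # Follow the next page if a link exists.
    next_page = response.css('.next a::attr(href)').extract_first()
    if next_page:
        yield Request(response.urljoin(next_page), callback=self.parse)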

Please refer to the source of the festival scraper below.

GitHub

That's all.

Recommended Posts

Festive scraping with Python, scrapy
Scraping with Python
Scraping with Python
Scraping with Python (preparation)
Try scraping with Python.
Scraping with Python + PhantomJS
Scraping with scrapy shell
Scraping with Selenium [Python]
Scraping with Python + PyQuery
Scraping RSS with Python
I tried scraping with Python
Web scraping with python + JupyterLab
Scraping with selenium in Python
Scraping with Selenium + Python Part 1
Scraping with chromedriver in python
Scraping with Selenium in Python
Easy web scraping with Scrapy
Scraping with Tor in Python
Scraping weather forecast with python
Scraping with Selenium + Python Part 2
I tried scraping with python
Web scraping beginner with python
[Scraping] Python scraping
Try scraping with Python + Beautiful Soup
Scraping with Node, Ruby and Python
Web scraping with Python ① (Scraping prior knowledge)
Scraping with Selenium in Python (Basic)
Scraping with Python, Selenium and Chromedriver
Web scraping with Python First step
I tried web scraping with python.
Scraping with Python and Beautiful Soup
Let's do image scraping with Python
Get Qiita trends with Python scraping
"Scraping & machine learning with Python" Learning memo
Get weather information with Python & scraping
Get property information by scraping with python
Python scraping notes
Scraping with selenium
FizzBuzz with Python3
Python Scraping get_ranker_categories
Scraping with selenium ~ 2 ~
WEB scraping with Python (for personal notes)
Statistics with python
Python with Go
Automate simple tasks with Python Part1 Scraping
Getting Started with Python Web Scraping Practice
I tried scraping Yahoo News with Python
Twilio with Python
Integrate with Python
Python Scraping eBay
Play with 2016-Python
AES256 with python
Tested with Python
Scraping with Selenium
Restart with Scrapy
[Personal note] Web page scraping with python3
python starts with ()
Web scraping with Python ② (Actually scraping stock sites)
Horse Racing Site Web Scraping with Python
Python Scraping get_title
with syntax (Python)