Write Spider tests in Scrapy

When I tried to write unit tests for Scrapy, the process turned out to be a bit unusual and there wasn't much information available, so I've summarized what I learned. Given the nature of crawlers, where the target HTML can change at any time, I think these tests are best used to shorten the crawl-and-check cycle during implementation rather than for strict validation. (This article is mainly about Spider unit tests. Tests for Pipelines and the like are out of scope, since they can be written normally with unittest and friends.)

TL;DR

Use Spiders Contracts

    def parse(self, response):
        """ This function parses a sample response. Some contracts are mingled
        with this docstring.

        @url http://www.amazon.com/s?field-keywords=selfish+gene
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Author Year Price
        """

Basic usage of Spiders Contracts

I think the quickest way to get a feel for them is the sample code below. (Python 3.6.2, Scrapy 1.4.0)

myblog.py


    def parse_list(self, response):
        """List screen parsing process

        @url http://www.rhoboro.com/index2.html
        @returns item 0 0
        @returns requests 0 10
        """
        for detail in response.xpath('//div[@class="post-preview"]/a/@href').extract():
            yield Request(url=response.urljoin(detail), callback=self.parse_detail)
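For context, here is a minimal spider skeleton that this method could live in. This is my own sketch; the class name and the use of start_urls are placeholders, not taken from the original project.

    from scrapy import Request, Spider


    class MyBlogSpider(Spider):
        name = 'my_blog'
        start_urls = ['http://www.rhoboro.com/index2.html']

        def parse(self, response):
            # Reuse the list-parsing logic for the start page
            return self.parse_list(response)

        def parse_list(self, response):
            # Contracts omitted here; see the docstring above
            for detail in response.xpath('//div[@class="post-preview"]/a/@href').extract():
                yield Request(url=response.urljoin(detail), callback=self.parse_detail)

        def parse_detail(self, response):
            # Covered later in this article
            pass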

Make Custom Contracts

Creating a subclass

Contracts can be extended by creating your own subclasses of Contract. Register the created contracts in settings.py.

contracts.py


    # -*- coding: utf-8 -*-

    from scrapy.contracts import Contract
    from scrapy.exceptions import ContractFail


    class ItemValidateContract(Contract):
        """Check whether the Item is as expected.

        Since scraped values can change at any time,
        I think it is best to test only fields whose values you expect to be invariant.
        Anything beyond checking for missing fields should probably be done in a Pipeline.
        """
        name = 'item_validate'  # This name becomes the tag used in docstrings

        def post_process(self, output):
            # output is the list of objects the callback yielded
            item = output[0]
            if 'title' not in item:
                raise ContractFail('title is invalid.')


    class CookiesContract(Contract):
        """Contract that adds cookies to the (Scrapy) Request.

        @cookies key1 value1 key2 value2
        """
        name = 'cookies'

        def adjust_request_args(self, kwargs):
            # Convert self.args into a dict and pass it as the cookies argument
            kwargs['cookies'] = {t[0]: t[1]
                                 for t in zip(self.args[::2], self.args[1::2])}
            return kwargs
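Besides adjust_request_args and post_process, a Contract can also override a pre_process hook, which receives the response before the callback runs. A minimal sketch of my own (the status check and tag name are illustrative, not from the article):

    class StatusOkContract(Contract):
        """Fail fast when the response status is not 200.

        @status_ok
        """
        name = 'status_ok'

        def pre_process(self, response):
            if response.status != 200:
                raise ContractFail('got status %d' % response.status)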

User code

The code on the side that uses these contracts looks like this. The numbers in SPIDER_CONTRACTS are priorities, in the same style as middleware orders.

settings.py


    ...
    SPIDER_CONTRACTS = {
        'item_crawl.contracts.CookiesContract': 10,
        'item_crawl.contracts.ItemValidateContract': 20,
    }
    ...

myblog.py


    def parse_detail(self, response):
        """Detail screen parsing process

        @url http://www.rhoboro.com/2017/08/05/start-onomichi.html
        @returns item 1
        @scrapes title body tags
        @item_validate
        @cookies index 2
        """
        item = BlogItem()
        item['title'] = response.xpath('//div[@class="post-heading"]//h1/text()').extract_first()
        item['body'] = response.xpath('//article').xpath('string()').extract_first()
        item['tags'] = response.xpath('//div[@class="tags"]//a/text()').extract()
        item['index'] = response.request.cookies['index']
        yield item
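For reference, the BlogItem used above needs fields like the following. The original items.py isn't shown in the article, so this is a reconstruction based on the fields the spider sets:

    from scrapy import Field, Item


    class BlogItem(Item):
        title = Field()
        body = Field()
        tags = Field()
        index = Field()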

Run the test

Run the checks with scrapy check spidername. Naturally, this is faster than running scrapy crawl spidername, because it only fetches the pages specified in the @url tags.

(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog
.....
----------------------------------------------------------------------
Ran 5 contracts in 8.919s

OK
(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog
...FF
======================================================================
FAIL: [my_blog] parse_detail (@scrapes post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/__init__.py", line 134, in wrapper
    self.post_process(output)
  File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/default.py", line 89, in post_process
    raise ContractFail("'%s' field is missing" % arg)
scrapy.exceptions.ContractFail: 'title' field is missing

======================================================================
FAIL: [my_blog] parse_detail (@item_validate post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/__init__.py", line 134, in wrapper
    self.post_process(output)
  File "/Users/rhoboro/github/scrapy/crawler/crawler/contracts.py", line 18, in post_process
    raise ContractFail('title is invalid.')
scrapy.exceptions.ContractFail: title is invalid.

----------------------------------------------------------------------
Ran 5 contracts in 8.552s

FAILED (failures=2)

By the way, this is what it looks like when something goes wrong elsewhere. (This is what I got when I forgot to register the contracts in settings.py.) To be honest, the output gives far too little information to go on, which makes this hard to debug; note that it even reports OK after running 0 contracts.

(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog
Unhandled error in Deferred:


----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
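Incidentally, scrapy check -l lists the discovered spiders and their contract-annotated callbacks without fetching anything, which is a quick way to confirm that your contracts were actually registered. The output below is a sketch of what it would look like for this project, following the format shown in the Scrapy docs:

    (venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check -l
    my_blog
      * parse_list
      * parse_detail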
