Write Spider tests in Scrapy

When I tried to write unit tests for Scrapy, the process turned out to be a bit unusual and there wasn't much information available, so I've summarized what I learned. Given the nature of crawlers, where the target HTML can change at any time, I think these tests are best used to shorten the crawl-and-check cycle during implementation rather than for strict validation. (This article is mainly about Spider unit tests. Tests for Pipelines and the like are out of scope, since they can be written normally with unittest and friends.)

TL;DR

Use Spiders Contracts

    def parse(self, response):
        """ This function parses a sample response. Some contracts are mingled
        with this docstring.

        @url http://www.amazon.com/s?field-keywords=selfish+gene
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Author Year Price
        """

Basic usage of Spiders Contracts

I think the quickest way to get a feel for them is the sample code below. (Python 3.6.2, Scrapy 1.4.0)

myblog.py


    def parse_list(self, response):
        """List screen parsing process

        @url http://www.rhoboro.com/index2.html
        @returns item 0 0
        @returns requests 0 10
        """
        for detail in response.xpath('//div[@class="post-preview"]/a/@href').extract():
            yield Request(url=response.urljoin(detail), callback=self.parse_detail)
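For context, here is a minimal spider skeleton that this method could live in. This is my own sketch; the class name and the use of start_urls are placeholders, not taken from the original project.

    from scrapy import Request, Spider


    class MyBlogSpider(Spider):
        name = 'my_blog'
        start_urls = ['http://www.rhoboro.com/index2.html']

        def parse(self, response):
            # Reuse the list-parsing logic for the start page
            return self.parse_list(response)

        def parse_list(self, response):
            # Contracts omitted here; see the docstring above
            for detail in response.xpath('//div[@class="post-preview"]/a/@href').extract():
                yield Request(url=response.urljoin(detail), callback=self.parse_detail)

        def parse_detail(self, response):
            # Covered later in this article
            pass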

Make Custom Contracts

Creating a subclass

Contracts can be extended by creating your own subclasses of Contract. Register the created contracts in settings.py.

contracts.py


    # -*- coding: utf-8 -*-

    from scrapy.contracts import Contract
    from scrapy.exceptions import ContractFail


    class ItemValidateContract(Contract):
        """Check whether the Item is as expected.

        Since scraped values can change at any time,
        I think it is best to test only fields whose values you expect to be invariant.
        Anything beyond checking for missing fields should probably be done in a Pipeline.
        """
        name = 'item_validate'  # This name becomes the tag used in docstrings

        def post_process(self, output):
            # output is the list of objects the callback yielded
            item = output[0]
            if 'title' not in item:
                raise ContractFail('title is invalid.')


    class CookiesContract(Contract):
        """Contract that adds cookies to the (Scrapy) Request.

        @cookies key1 value1 key2 value2
        """
        name = 'cookies'

        def adjust_request_args(self, kwargs):
            # Convert self.args into a dict and pass it as the cookies argument
            kwargs['cookies'] = {t[0]: t[1]
                                 for t in zip(self.args[::2], self.args[1::2])}
            return kwargs
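Besides adjust_request_args and post_process, a Contract can also override a pre_process hook, which receives the response before the callback runs. A minimal sketch of my own (the status check and tag name are illustrative, not from the article):

    class StatusOkContract(Contract):
        """Fail fast when the response status is not 200.

        @status_ok
        """
        name = 'status_ok'

        def pre_process(self, response):
            if response.status != 200:
                raise ContractFail('got status %d' % response.status)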

User code

The code on the side that uses these contracts looks like this. The numbers in SPIDER_CONTRACTS are priorities, in the same style as middleware orders.

settings.py


    ...
    SPIDER_CONTRACTS = {
        'item_crawl.contracts.CookiesContract': 10,
        'item_crawl.contracts.ItemValidateContract': 20,
    }
    ...

myblog.py


    def parse_detail(self, response):
        """Detail screen parsing process

        @url http://www.rhoboro.com/2017/08/05/start-onomichi.html
        @returns item 1
        @scrapes title body tags
        @item_validate
        @cookies index 2
        """
        item = BlogItem()
        item['title'] = response.xpath('//div[@class="post-heading"]//h1/text()').extract_first()
        item['body'] = response.xpath('//article').xpath('string()').extract_first()
        item['tags'] = response.xpath('//div[@class="tags"]//a/text()').extract()
        item['index'] = response.request.cookies['index']
        yield item
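For reference, the BlogItem used above needs fields like the following. The original items.py isn't shown in the article, so this is a reconstruction based on the fields the spider sets:

    from scrapy import Field, Item


    class BlogItem(Item):
        title = Field()
        body = Field()
        tags = Field()
        index = Field()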

Run the test

Run the checks with scrapy check spidername. Naturally, this is faster than running scrapy crawl spidername, because it only fetches the pages specified in the @url tags.

(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog
.....
----------------------------------------------------------------------
Ran 5 contracts in 8.919s

OK
(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog
...FF
======================================================================
FAIL: [my_blog] parse_detail (@scrapes post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/__init__.py", line 134, in wrapper
    self.post_process(output)
  File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/default.py", line 89, in post_process
    raise ContractFail("'%s' field is missing" % arg)
scrapy.exceptions.ContractFail: 'title' field is missing

======================================================================
FAIL: [my_blog] parse_detail (@item_validate post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/__init__.py", line 134, in wrapper
    self.post_process(output)
  File "/Users/rhoboro/github/scrapy/crawler/crawler/contracts.py", line 18, in post_process
    raise ContractFail('title is invalid.')
scrapy.exceptions.ContractFail: title is invalid.

----------------------------------------------------------------------
Ran 5 contracts in 8.552s

FAILED (failures=2)

By the way, this is what it looks like when something goes wrong elsewhere. (This is what I got when I forgot to register the contracts in settings.py.) To be honest, the output gives far too little information to go on, which makes this hard to debug; note that it even reports OK after running 0 contracts.

(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog
Unhandled error in Deferred:


----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
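Incidentally, scrapy check -l lists the discovered spiders and their contract-annotated callbacks without fetching anything, which is a quick way to confirm that your contracts were actually registered. The output below is a sketch of what it would look like for this project, following the format shown in the Scrapy docs:

    (venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check -l
    my_blog
      * parse_list
      * parse_detail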
