Restart with Scrapy

Scrapy, a Python web-crawling framework, can pause and resume crawls: you can interrupt a run partway through and later pick up where it left off. This is useful when you are scraping a large number of pages in a long-running job.

The official documentation covers this under Jobs: pausing and resuming crawls.

Overview of the feature

I prepared the following spider to try the feature out. It just downloads six pages from http://quotes.toscrape.com and logs their contents.

toscrape-restart.py


import scrapy


class QuotesSpider(scrapy.Spider):
    name = "toscrape-restart"

    custom_settings = {
        # Don't make requests in parallel
        "CONCURRENT_REQUESTS": 1,
        # Put a delay between requests so the crawl is easier to interrupt
        "DOWNLOAD_DELAY": 10,
        # http://quotes.toscrape.com/robots.txt doesn't exist, so don't try to fetch it
        "ROBOTSTXT_OBEY": False,
    }

    def start_requests(self):
        # Maintain state between batches (see below)
        self.logger.info(self.state.get("state_key1"))
        self.state["state_key1"] = {"key": "value"}
        self.state["state_key2"] = 0

        urls = [
            "http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/",
            "http://quotes.toscrape.com/page/3/",
            "http://quotes.toscrape.com/page/4/",
            "http://quotes.toscrape.com/page/5/",
            "http://quotes.toscrape.com/page/6/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.info(
            "first quote author: " + response.css("small.author::text").get()
        )

The above spider can be started with the following command.

scrapy crawl toscrape-restart

That runs it as a normal, one-shot crawl. To make the crawl resumable, set JOBDIR as follows.

scrapy crawl toscrape-restart -s JOBDIR=crawls/restart-1

Run this way, Scrapy creates a crawls/restart-1 directory that stores the information needed for resuming, and the crawl can then be restarted from it. (If the directory doesn't exist, Scrapy creates it, so you don't need to prepare it in advance.) Start the spider with the above command and interrupt it with Ctrl-C while it is running. For example, if you stop it right after the first page has been fetched, the output looks like this.

$ scrapy crawl toscrape-restart -s JOBDIR=crawls/restart-1

(Omitted)

2020-03-24 14:43:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-03-24 14:43:04 [toscrape-restart] INFO: first quote author: Albert Einstein
^C2020-03-24 14:43:06 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2020-03-24 14:43:06 [scrapy.core.engine] INFO: Closing spider (shutdown)
2020-03-24 14:43:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-03-24 14:43:18 [toscrape-restart] INFO: first quote author: Marilyn Monroe
2020-03-24 14:43:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

(Omitted)

The crawl was interrupted after the second page was fetched. Once interrupted like this, you can resume by running exactly the same command as before.

$ scrapy crawl toscrape-restart -s JOBDIR=crawls/restart-1

(Omitted)

2020-03-24 14:46:07 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/page/1/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-03-24 14:46:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/3/> (referer: None)
2020-03-24 14:46:10 [toscrape-restart] INFO: first quote author: Pablo Neruda
2020-03-24 14:46:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/4/> (referer: None)
2020-03-24 14:46:21 [toscrape-restart] INFO: first quote author: Dr. Seuss
2020-03-24 14:46:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/5/> (referer: None)
2020-03-24 14:46:35 [toscrape-restart] INFO: first quote author: George R.R. Martin
2020-03-24 14:46:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/6/> (referer: None)
2020-03-24 14:46:47 [toscrape-restart] INFO: first quote author: Jane Austen
2020-03-24 14:46:47 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-24 14:46:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

(Omitted)

Pages 1 and 2 were reported as Filtered duplicate request and were not fetched again. Pages 3 onward, which had not been fetched before the interruption, were then crawled normally.
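Incidentally, as the log message notes, only the first duplicate is reported by default. If you want every filtered request logged, Scrapy has a DUPEFILTER_DEBUG setting; a minimal sketch of enabling it in the spider's custom_settings used above:

    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "DOWNLOAD_DELAY": 10,
        "ROBOTSTXT_OBEY": False,
        # Log every request the dupefilter skips, not just the first one
        "DUPEFILTER_DEBUG": True,
    }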

Keep state between batches

Scrapy's restart mechanism can also pass information between runs through the spider's state attribute: store something in state and you can read it back the next time the spider starts. Concretely, the toscrape-restart.py spider above stores values like this:

self.state["state_key1"] = {"key": "value"}
self.state["state_key2"] = 0

Since state is a dict-like object, you can perform ordinary dictionary operations on it. In the example above, the key state_key1 stores the value {"key": "value"} and the key state_key2 stores the value 0. Running the spider gives:

$ scrapy crawl toscrape-restart -s JOBDIR=crawls/restart-1

(Omitted)

2020-03-24 15:19:54 [toscrape-restart] INFO: None
2020-03-24 15:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-03-24 15:19:55 [toscrape-restart] INFO: first quote author: Albert Einstein
^C2020-03-24 15:19:56 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2020-03-24 15:19:56 [scrapy.core.engine] INFO: Closing spider (shutdown)
2020-03-24 15:20:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-03-24 15:20:07 [toscrape-restart] INFO: first quote author: Marilyn Monroe
2020-03-24 15:20:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

(Omitted)

The first INFO line logs None. It comes from self.logger.info(self.state.get("state_key1")): on the first run nothing is stored in state yet, so the lookup returns None. The spider then stores the values in state before being interrupted. Now run it again.

$ scrapy crawl toscrape-restart -s JOBDIR=crawls/restart-1

(Omitted)

2020-03-24 15:29:31 [toscrape-restart] INFO: {'key': 'value'}
2020-03-24 15:29:31 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/page/1/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-03-24 15:29:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/3/> (referer: None)
2020-03-24 15:29:32 [toscrape-restart] INFO: first quote author: Pablo Neruda
2020-03-24 15:29:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/4/> (referer: None)
2020-03-24 15:29:42 [toscrape-restart] INFO: first quote author: Dr. Seuss
2020-03-24 15:29:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/5/> (referer: None)
2020-03-24 15:29:56 [toscrape-restart] INFO: first quote author: George R.R. Martin
2020-03-24 15:30:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/6/> (referer: None)
2020-03-24 15:30:10 [toscrape-restart] INFO: first quote author: Jane Austen
2020-03-24 15:30:10 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-24 15:30:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

(Omitted)

After the restart, the first INFO line logs {'key': 'value'}. The information stored before the interruption is available again on re-execution.
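Since state supports normal dict operations, you can also update it incrementally. As a minimal sketch (not part of the original spider), parse could use the state_key2 entry as a page counter that survives interruption and resumption:

    def parse(self, response):
        # Count pages fetched across all runs; state is persisted in JOBDIR
        self.state["state_key2"] = self.state.get("state_key2", 0) + 1
        self.logger.info("pages fetched so far: %s", self.state["state_key2"])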

Other things I noticed

That covers the basic behavior; below are a few other things I verified and noticed along the way.

Role of JOBDIR

When you start a crawl in resumable mode, a directory with the name passed on the command line is created. Inside it are a directory called requests.queue and files called requests.seen and spider.state. I haven't checked exactly what requests.queue is used for, but judging by the name it seems to persist the queue of pending requests.
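For the JOBDIR used above, the layout looks like this:

crawls/restart-1/
├── requests.queue/
├── requests.seen
└── spider.state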

spider.state is where the state from the previous section is stored. It's a pickle file, so you can inspect its contents with the following command.

python -m pickle spider.state

With the values from the example in the previous section, the output is as follows, confirming that the information really is stored there.

{'state_key1': {'key': 'value'}, 'state_key2': 0}
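If you'd rather inspect it from Python, here is a minimal sketch using the standard pickle module (the path assumes the JOBDIR used above):

import pickle

# Load the spider state persisted in the job directory
with open("crawls/restart-1/spider.state", "rb") as f:
    state = pickle.load(f)

print(state)  # -> {'state_key1': {'key': 'value'}, 'state_key2': 0}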

requests.seen, on the other hand, contains hashed strings. I haven't investigated it precisely either, but as the name suggests it records the requests already seen (fetched) during execution, and requests recorded there are skipped on re-execution.
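These hashes appear to be Scrapy's request fingerprints. A minimal sketch of computing one yourself with the request_fingerprint utility (my assumption being that the default dupefilter writes exactly these digests into requests.seen):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# One hex digest like this per line is what requests.seen holds
req = Request("http://quotes.toscrape.com/page/1/")
print(request_fingerprint(req))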

Re-execute after completion

If the crawl runs to completion without being interrupted, re-running it ends immediately without doing anything, because every URL has already been fetched. Scrapy doesn't clean up or overwrite the JOBDIR for you, so to crawl again you have to delete the JOBDIR or pass a different one.
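For example, to start over you could remove the old job directory, or simply point JOBDIR at a new one:

rm -r crawls/restart-1
scrapy crawl toscrape-restart -s JOBDIR=crawls/restart-2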
