[PYTHON] Change a Scrapy Spider's settings at runtime

Overview

Premise

In Scrapy, a Spider's crawling behavior can be configured in the project settings file (settings.py) and in the spider's custom settings (custom_settings). However, if these are the only mechanisms you use, you need a separate Spider for each combination of settings (crawl depth, priority, ...), which makes maintenance very costly.

Contents of this article

Looking into the options of the scrapy command solved this problem, so I describe the approach here.

What we want to do

Switch settings such as the crawl depth without changing the Spider source code.

Additional arguments

By adding "-s DEPTH_LIMIT=2" (with no spaces around "=") to the command, the crawl depth can be set at the moment the command is executed.

cmdline example


scrapy crawl sample_crawler -s DEPTH_LIMIT=2
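
Because "-s" applies to a single invocation, the same spider can be run under several configurations from a small wrapper script. The following is a minimal sketch of that idea; the helper name build_crawl_command is hypothetical and not part of Scrapy, and the script only prints the commands rather than running them.

```python
import shlex


def build_crawl_command(spider_name, overrides):
    """Return the argv list for `scrapy crawl` with one -s flag per setting.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    argv = ["scrapy", "crawl", spider_name]
    for name, value in overrides.items():
        # No spaces around "=": each -s flag takes a single NAME=VALUE token.
        argv += ["-s", f"{name}={value}"]
    return argv


if __name__ == "__main__":
    for depth in (1, 2, 3):
        argv = build_crawl_command("sample_crawler", {"DEPTH_LIMIT": depth})
        # Print only; pass argv to subprocess.run(argv, check=True) to run it.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the command as a list of tokens rather than one string avoids quoting problems if a setting value ever contains spaces.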

Code example

Spider code

spider.py


from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SampleCrawler(CrawlSpider):
    name = 'sample_crawler'
    allowed_domains = ['www.crawler-test.com']
    start_urls = ['https://www.crawler-test.com/']

    # Default depth for this spider; "-s DEPTH_LIMIT=..." on the command line overrides it.
    custom_settings = {
        'DEPTH_LIMIT': 1
    }
    rules = [
        # Links under /mobile/ (excluding /redirects/) are passed to parse_test1.
        Rule(
            LinkExtractor(
                allow=('/mobile/',),
                deny=('/redirects/',)
            ),
            callback='parse_test1'
        ),
        # Every other link except /test1/ is followed without a callback.
        Rule(
            LinkExtractor(
                allow=('',),
                deny=('/test1/',)
            ),
        ),
    ]

    def __init__(self, *args, **kw):
        # Prints the class attribute as written above, not the merged settings.
        print('custom_settings: {}'.format(self.custom_settings))

        super().__init__(*args, **kw)

    def parse_test1(self, response):
        # response.meta['depth'] is the link depth at which this page was reached.
        print('response url: {} depth: {}'.format(response.url, response.meta.get('depth')))

Execution log 1

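The "Overridden settings:" entry in the log shows the settings in effect for this run. Without extra arguments, DEPTH_LIMIT is 1, the value from custom_settings.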

run.log


2020-10-30 23:04:21 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: ScrapySample)
2020-10-30 23:04:21 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.18362-SP0
2020-10-30 23:04:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-30 23:04:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ScrapySample',
 'DEPTH_LIMIT': 1,
 'NEWSPIDER_MODULE': 'ScrapySample.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['ScrapySample.spiders']}
2020-10-30 23:04:21 [scrapy.extensions.telnet] INFO: Telnet Password: ed92b4e8e23a16a0
2020-10-30 23:04:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
custom_settings: {'DEPTH_LIMIT': 1}

Execution log 2

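Change the arguments of the command and run it again. Argument added: "-s DEPTH_LIMIT=2". "Overridden settings:" now shows DEPTH_LIMIT as '2' (a string, because it came from the command line). The custom_settings line printed from __init__ still shows 1: the command-line value takes precedence when the settings are merged, and the class attribute itself is not modified.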

run.log


2020-10-30 23:08:46 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: ScrapySample)
2020-10-30 23:08:46 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.18362-SP0
2020-10-30 23:08:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-30 23:08:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ScrapySample',
 'DEPTH_LIMIT': '2',
 'NEWSPIDER_MODULE': 'ScrapySample.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['ScrapySample.spiders']}
2020-10-30 23:08:46 [scrapy.extensions.telnet] INFO: Telnet Password: 22caac9b06a21a64
2020-10-30 23:08:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
custom_settings: {'DEPTH_LIMIT': 1}
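
The two logs are consistent with Scrapy's settings precedence, in which command-line values outrank a spider's custom_settings. The sketch below is a simplified model of that behavior, not Scrapy's implementation: ToySettings is a hypothetical stand-in for scrapy.settings.Settings, and only the priority numbers are taken from Scrapy itself.

```python
# Priority levels with the values Scrapy assigns to them; the higher one wins.
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class ToySettings:
    """Simplified stand-in for scrapy.settings.Settings (illustration only)."""

    def __init__(self):
        self._values = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        # Replace the stored value only if the new priority is equal or higher.
        if current is None or level >= current[1]:
            self._values[name] = (value, level)

    def get(self, name, default=None):
        return self._values[name][0] if name in self._values else default

    def getint(self, name, default=0):
        # Command-line values arrive as strings, so convert on read.
        return int(self.get(name, default))


custom_settings = {"DEPTH_LIMIT": 1}  # class attribute, as in spider.py

settings = ToySettings()
settings.set("DEPTH_LIMIT", "2", "cmdline")  # from "-s DEPTH_LIMIT=2"
for name, value in custom_settings.items():
    settings.set(name, value, "spider")  # lower priority, so it is ignored

print(repr(settings.get("DEPTH_LIMIT")))  # the raw value stays a string
print(settings.getint("DEPTH_LIMIT"))     # the effective depth as an int
print(custom_settings)                    # the class attribute is untouched
```

In a real spider, the merged value is available through the spider's settings attribute once the spider is bound to a crawler, which is a more reliable thing to print than the custom_settings class attribute.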
