In Scrapy, a spider's settings can be defined in the project configuration file (settings.py) and in the spider's custom_settings attribute. If you rely on those alone, however, you end up creating a separate Spider for each combination of settings (crawl depth, priority, ...), which carries a very high maintenance cost.
After investigating the scrapy command I found a solution, which I describe below.
The goal is to switch the following setting without changing the Spider's source code.
By adding "-s DEPTH_LIMIT=2" to the command, the crawl depth can be set at execution time.
cmdline example
scrapy crawl sample_crawler -s DEPTH_LIMIT=2
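As a side note, each `-s NAME=VALUE` pair is split on the first `=`, so the value always arrives as a string (which is why the log later shows `'DEPTH_LIMIT': '2'` with quotes). The following is a simplified sketch of that parsing, not Scrapy's actual parser:

```python
import shlex

# Simplified sketch: collect -s NAME=VALUE pairs from a command line.
# Every value is a string at this point; Scrapy converts types on access.
argv = shlex.split("scrapy crawl sample_crawler -s DEPTH_LIMIT=2 -s ROBOTSTXT_OBEY=False")
overrides = {}
for i, token in enumerate(argv):
    if token == "-s":
        name, _, value = argv[i + 1].partition("=")
        overrides[name] = value
print(overrides)  # {'DEPTH_LIMIT': '2', 'ROBOTSTXT_OBEY': 'False'}
```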
Spider code
spider.py
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class SampleCrawler(scrapy.spiders.CrawlSpider):
    name = 'sample_crawler'
    allowed_domains = ['www.crawler-test.com']
    start_urls = ['https://www.crawler-test.com/']

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    rules = [
        Rule(
            LinkExtractor(
                allow=('/mobile/',),
                deny=('/redirects/',)
            ),
            callback='parse_test1'
        ),
        Rule(
            LinkExtractor(
                allow=('',),
                deny=('/test1/',)
            ),
        ),
    ]

    def __init__(self, *args, **kw):
        print('custom_settings: {}'.format(self.custom_settings))
        super(SampleCrawler, self).__init__(*args, **kw)

    def parse_test1(self, response):
        print('response url: {} depth: {}'.format(response.url, response.meta.get('depth')))
Run the spider first without the extra argument; the "Overridden settings" log entry shows the active settings.
run.log
2020-10-30 23:04:21 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: ScrapySample)
2020-10-30 23:04:21 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.18362-SP0
2020-10-30 23:04:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-30 23:04:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ScrapySample',
'DEPTH_LIMIT': 1,
'NEWSPIDER_MODULE': 'ScrapySample.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['ScrapySample.spiders']}
2020-10-30 23:04:21 [scrapy.extensions.telnet] INFO: Telnet Password: ed92b4e8e23a16a0
2020-10-30 23:04:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
custom_settings: {'DEPTH_LIMIT': 1}
Now re-execute the command with the added argument "-s DEPTH_LIMIT=2".
run.log
2020-10-30 23:08:46 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: ScrapySample)
2020-10-30 23:08:46 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.18362-SP0
2020-10-30 23:08:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-30 23:08:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ScrapySample',
'DEPTH_LIMIT': '2',
'NEWSPIDER_MODULE': 'ScrapySample.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['ScrapySample.spiders']}
2020-10-30 23:08:46 [scrapy.extensions.telnet] INFO: Telnet Password: 22caac9b06a21a64
2020-10-30 23:08:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
custom_settings: {'DEPTH_LIMIT': 1}
In the second run, the overridden settings show 'DEPTH_LIMIT': '2' (as a string, since it came from the command line), so the command-line value wins over the spider's custom_settings. Note that the custom_settings class attribute itself still prints {'DEPTH_LIMIT': 1}: it is just a class attribute, and Scrapy resolves the effective value separately, by priority.
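The reason the command-line value wins is Scrapy's settings priority order. The priority names and values below mirror Scrapy's SETTINGS_PRIORITIES (default=0, command=10, project=20, spider=30, cmdline=40); MiniSettings is a hypothetical stand-in written for illustration, not Scrapy's actual API:

```python
# Minimal model of Scrapy's settings resolution: each setting keeps the
# value set at the highest priority. Priority values mirror
# scrapy.settings.SETTINGS_PRIORITIES.
SETTINGS_PRIORITIES = {'default': 0, 'command': 10, 'project': 20,
                       'spider': 30, 'cmdline': 40}

class MiniSettings:
    def __init__(self):
        self._store = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority):
        prio = SETTINGS_PRIORITIES[priority]
        # Only overwrite if the new priority is at least as high.
        if name not in self._store or prio >= self._store[name][1]:
            self._store[name] = (value, prio)

    def getint(self, name):
        # Command-line values arrive as strings; convert on access.
        return int(self._store[name][0])

s = MiniSettings()
s.set('DEPTH_LIMIT', 1, 'spider')     # like custom_settings
s.set('DEPTH_LIMIT', '2', 'cmdline')  # like -s DEPTH_LIMIT=2
print(s.getint('DEPTH_LIMIT'))  # 2: the cmdline value wins
```

Because 'cmdline' (40) outranks 'spider' (30), the `-s` argument overrides custom_settings, and settings.py ('project', 20) is overridden by both.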