[Python] [Scrapy] Modify / process extracted URLs

Example scenario

Suppose we use Scrapy's CrawlSpider to crawl a site by following links from an item list page to individual item overview pages, and from each overview page to an item detail page, scraping and saving the information on the detail pages.

The correspondence between pages and URLs is as follows.

| Page | URL |
| --- | --- |
| Item list | example.com/list |
| Item overview | example.com/item/(ID)/ |
| Item details | example.com/item/(ID)/details |

For a site with this structure, if we append details to each overview-page link extracted from the list page and request the detail pages directly, we skip the overview pages entirely: the number of requests to the target site is roughly halved, and the crawl finishes faster. Two birds with one stone! An implementation example follows.
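The rewrite itself is just string concatenation on the overview URL, matching the lambda used in the spider below (the function name here is illustrative, not part of the article's code):

```python
def to_details_url(overview_url):
    """Append 'details/' to an overview URL like example.com/item/(ID)/."""
    return overview_url + 'details/'

# An overview link extracted from the list page...
print(to_details_url('http://example.com/item/42/'))
# ...becomes a detail-page URL we can request directly.
```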

Implementation

In the *process_value* argument of LinkExtractor, describe the URL transformation as a lambda expression.

example.py


from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/list']  # Item list page

    rules = [
        Rule(LinkExtractor(
            # Extract URLs that contain /item/
            allow=r'.*/item/.*',
            # Append 'details/' to each extracted URL
            process_value=lambda x: x + 'details/',
            ), callback='parse_details'),
    ]

    def parse_details(self, response):
        # (omitted)
        pass
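One caveat: allow=r'.*/item/.*' also matches any detail-page links that happen to appear on the site, and the lambda would append 'details/' to those as well. A slightly more defensive *process_value* (a sketch; the helper name is my own, not from the article) rewrites only URLs that look like overview pages:

```python
def add_details(url):
    # Rewrite only overview URLs of the form .../item/(ID)/ ;
    # leave URLs that already point at a details page untouched.
    if url.endswith('/') and not url.rstrip('/').endswith('details'):
        return url + 'details/'
    return url

# Used as: LinkExtractor(allow=r'.*/item/.*', process_value=add_details)
print(add_details('http://example.com/item/42/'))         # overview -> details
print(add_details('http://example.com/item/42/details'))  # left as-is
```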

That's all!
