Introduction to Scrapy (3)

Introduction

Introduction to Scrapy (1)
Introduction to Scrapy (2)

In the previous articles, I tried calling a Web API with Scrapy. This time, let's create a Spider that downloads files.

Creating a Spider

This time, we will create a Spider that downloads MLB-related data as zip files, using the data published in the Sean Lahman Database. The downloaded zip files will be saved to a directory of our choice.

The Spider's processing flow is as follows: parse() extracts the links to the zip files from the start page, and parse_item() saves each downloaded file to disk.

get_csv_spider.py


# -*- coding:utf-8 -*-

from scrapy import Spider
from scrapy.http import Request


class GetCSVSpider(Spider):
    name = 'get_csv_spider'
    allowed_domains = ['seanlahman.com']

    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
    }

    # Directory in which to save the downloaded zip files (any path will do)
    DIR_NAME = '/tmp/csv/'

    # Endpoint (the URLs where crawling starts)
    start_urls = ['http://seanlahman.com/baseball-archive/statistics/']

    # Extract the download URLs from the page
    def parse(self, response):
        for href in response.css('.entry-content a[href*=csv]::attr(href)'):
            full_url = response.urljoin(href.extract())

            # Create a Request for each extracted URL and download it
            yield Request(full_url, callback=self.parse_item)

    # Save the contents of the downloaded file
    def parse_item(self, response):
        file_name = '{0}{1}'.format(self.DIR_NAME, response.url.split('/')[-1])

        # response.body is raw bytes, so open the file in binary mode
        with open(file_name, 'wb') as f:
            f.write(response.body)
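Note that the spider assumes the save directory /tmp/csv/ already exists. As a small defensive variant of parse_item (my own sketch, not part of the original spider; os.makedirs with exist_ok requires Python 3), the directory can be created on first use:

import os  # add at the top of get_csv_spider.py

    def parse_item(self, response):
        # Create the save directory on first use; a no-op if it already exists
        os.makedirs(self.DIR_NAME, exist_ok=True)

        file_name = os.path.join(self.DIR_NAME, response.url.split('/')[-1])
        with open(file_name, 'wb') as f:
            f.write(response.body)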

Run

Run the crawl using the command bundled with Scrapy. Note that runspider takes the path to the spider file, so the .py extension is required.

scrapy runspider get_csv_spider.py
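Settings can also be overridden from the command line with the -s option, which takes priority over the spider's custom_settings. For example, to slow the crawl down beyond the 1.5-second delay configured above:

scrapy runspider get_csv_spider.py -s DOWNLOAD_DELAY=3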

When you run the command, a log like the following is printed to the console. It contains useful information such as the URLs being retrieved, response statuses, byte counts, and summary statistics.

2016-12-06 10:02:22 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
2016-12-06 10:02:22 [scrapy] INFO: Overridden settings: {'TELNETCONSOLE_ENABLED': False, 'SPIDER_MODULES': ['crawler.main.spiders'], 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 1}
2016-12-06 10:02:22 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2016-12-06 10:02:22 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-06 10:02:22 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-06 10:02:22 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-06 10:02:22 [scrapy] INFO: Spider opened
2016-12-06 10:02:22 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-06 10:02:23 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/baseball-archive/statistics/> (referer: None)
2016-12-06 10:02:28 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman30_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:35 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman51-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:38 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman_50-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:39 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman53_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:41 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman56-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:41 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman52_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:42 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman54_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:47 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman591-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:49 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman55_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:49 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman57-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:52 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman58-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:55 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:55 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:03:00 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman2012-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:03:00 [scrapy] INFO: Closing spider (finished)
2016-12-06 10:03:00 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4518,
 'downloader/request_count': 15,
 'downloader/request_method_count/GET': 15,
 'downloader/response_bytes': 104279737,
 'downloader/response_count': 15,
 'downloader/response_status_count/200': 15,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 6, 1, 3, 0, 285944),
 'log_count/DEBUG': 15,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 15,
 'scheduler/dequeued': 15,
 'scheduler/dequeued/memory': 15,
 'scheduler/enqueued': 15,
 'scheduler/enqueued/memory': 15,
 'start_time': datetime.datetime(2016, 12, 6, 1, 2, 22, 878024)}
2016-12-06 10:03:00 [scrapy] INFO: Spider closed (finished)

Now that the crawl is complete, let's check whether the files were actually downloaded. Everything appears to have been downloaded safely.

tree /tmp/csv
/tmp/csv
├── lahman-csv_2014-02-14.zip
├── lahman-csv_2015-01-24.zip
├── lahman2012-csv.zip
├── lahman30_csv.zip
├── lahman51-csv.zip
├── lahman52_csv.zip
├── lahman53_csv.zip
├── lahman54_csv.zip
├── lahman55_csv.zip
├── lahman56-csv.zip
├── lahman57-csv.zip
├── lahman58-csv.zip
├── lahman591-csv.zip
└── lahman_50-csv.zip

0 directories, 14 files
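Beyond listing the files, the standard library's zipfile module can confirm that each archive is a readable zip. Here is a minimal sketch (my own check script, not from the original article, assuming the same /tmp/csv directory as above):

check_zips.py

import os
import zipfile

CSV_DIR = '/tmp/csv'

for name in sorted(os.listdir(CSV_DIR)):
    path = os.path.join(CSV_DIR, name)
    # is_zipfile() inspects the magic bytes, so a truncated download fails here
    status = 'OK' if zipfile.is_zipfile(path) else 'BROKEN'
    print('{0}: {1}'.format(name, status))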

In closing

Scrapy also makes it easy to write a file download process. Because Scrapy is a crawling framework, developers can concentrate on the parts that the framework calls into. Next time, I will cover the item pipeline processing that I skipped in this article. Stay tuned!
