[PYTHON] Introduction to Scrapy (1)

Introduction to Scrapy (1)

Introduction

There are various approaches to web scraping in Python. In this article, we'll take Scrapy, a framework for scraping, as a subject and learn about Scrapy while actually creating a simple sample.

What is Scrapy

Scarpy is a fast, high-level scraping framework. It has various functions related to website crawling and scraping. The main functions are divided into components, and the user creates a program by creating classes related to each component.

コンポーネント図 From http://doc.scrapy.org/en/1.0/topics/architecture.html

The main components are:

Scrapy Engine

Scheduler

Downloader

Spiders

Item Pipeline

As you can see, Scrapy has various functions. This time, first create Spider, which is the basic concept of Scrapy, I will write a program to get the URL posted on the Advent Calendar on Qiita.

Installation

First, install with pip.

pip install scrapy

Creating a Spider

Next, create Spider, which is one of the components. Spider has a URL endpoint to start the crawl process, Describe the process for extracting the URL.

# -*- coding: utf-8 -*-

import scrapy


class QiitaSpider(scrapy.Spider):
    name = 'qiita_spider'

    #Endpoint (list the URL to start crawling)
    start_urls = ['http://qiita.com/advent-calendar/2015/categories/programming_languages']

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
    }

    #Describe the URL extraction process
    def parse(self, response):
        for href in response.css('.adventCalendarList .adventCalendarList_calendarTitle > a::attr(href)'):
            full_url = response.urljoin(href.extract())

            #Create a Request based on the extracted URL and download it
            yield scrapy.Request(full_url, callback=self.parse_item)

    #Create an Item to extract and save the contents based on the downloaded page
    def parse_item(self, response):

        urls = []
        for href in response.css('.adventCalendarItem_entry > a::attr(href)'):
            full_url = response.urljoin(href.extract())
            urls.append(full_url)

        yield {
            'title': response.css('h1::text').extract(),
            'urls': urls,
        }

Run

Scrapy comes with a lot of commands. This time to run Spider Run Spider using the runspider command. You can use the -o option to save the result created by parse_item to a file in JSON format.

scrapy runspider qiita_spider.py -o advent_calendar.json

result

The execution result is as follows. I was able to get a list of the titles and posted URLs of each Advent calendar!

{
    "urls": [
      "http://loboskobayashi.github.io/pythonsf/2015/12-01/recomending_PythonSf_one-liners_for_general_python_codes/",
      "http://qiita.com/csakatoku/items/444db2d0e421265ec106",
      "http://qiita.com/csakatoku/items/86904adaa0922e80069c",
      "http://qiita.com/Hironsan/items/fb6ee6b8e0291a7e5a1d",
      "http://qiita.com/csakatoku/items/77a36888d0851f6d4ff0",
      "http://qiita.com/csakatoku/items/a5b6c3604c0d411154fa",
      "http://qiita.com/sharow/items/21e421a2a40275ee9bb8",
      "http://qiita.com/Lspeciosum/items/d248b64d1bdcb8e39d32",
      "http://qiita.com/CS_Toku/items/353fd4b0fd9ed17dc152",
      "http://hideharaaws.hatenablog.com/entry/django-upgrade-16to18",
      "http://qiita.com/FGtatsuro/items/0efebb9b58374d16c5f0",
      "http://shinyorke.hatenablog.com/entry/2015/12/12/143121",
      "http://qiita.com/wrist/items/5759f894303e4364ebfd",
      "http://qiita.com/kimihiro_n/items/86e0a9e619720e57ecd8",
      "http://qiita.com/satoshi03/items/ae616dc080d085604b06",
      "http://qiita.com/CS_Toku/items/32028e65a8bfa97266d6",
      "http://fx-kirin.com/python/nfp-linear-reggression-model/"
    ],
    "title": [
      "Python \u305d\u306e2 Advent Calendar 2015"
    ]
  },
  {
    "urls": [
      "http://qiita.com/nasa9084/items/40f223b5b44f13ef2925",
      "http://studio3104.hatenablog.com/entry/2015/12/02/120957",
      "http://qiita.com/Tsutomu-KKE@github/items/29414e2d4f30b2bc94ae",
      "http://qiita.com/icoxfog417/items/913bb815d8d419148c33",
      "http://qiita.com/sakamotomsh/items/ab6c15b971587905ef43",
      "http://cocu.hatenablog.com/entry/2015/12/06/022100",
      "http://qiita.com/masashi127/items/ca092c13e1300f4f6ade",
      "http://qiita.com/kaneshin/items/269bc5f156d86f8a91c4",
      "http://qiita.com/wh11e7rue/items/15603e8970c36ab9733d",
      "http://qiita.com/ohkawa/items/368e6bcadc56a118adaf",
      "http://qiita.com/teitei_tk/items/5c5c9e653b3a13108d12",
      "http://sinhrks.hatenablog.com/entry/2015/12/13/215858"
    ],
    "title": [
      "Python Advent Calendar 2015"
    ]
  },

At the end

This time, I created Spider, which is a basic component of Scrapy, and performed scraping. If you use Scrapy, the framework will take care of the routine processing related to crawling. Therefore, developers can describe and develop only the parts that are really necessary for services and applications, such as URL extraction processing and data storage processing. From the next time onwards, we will cover cache and save processing. looking forward to!

Introduction to Scrapy (2)

Recommended Posts

Introduction to Scrapy (1)
Introduction to Scrapy (3)
Introduction to Scrapy (2)
Introduction to Scrapy (4)
Introduction to MQTT (Introduction)
Introduction to Supervisor
Introduction to Tkinter 1: Introduction
Introduction to PyQt
[Linux] Introduction to Linux
Introduction to discord.py (2)
Introduction to discord.py
Introduction to Lightning pytorch
Introduction to Web Scraping
Introduction to Nonparametric Bayes
Introduction to EV3 / MicroPython
Introduction to Python language
Introduction to TensorFlow-Image Recognition
Introduction to OpenCV (python)-(2)
Introduction to PyQt4 Part 1
Introduction to Dependency Injection
Introduction to Private Chainer
Introduction to machine learning
AOJ Introduction to Programming Topic # 1, Topic # 2, Topic # 3, Topic # 4
Introduction to electronic paper modules
A quick introduction to pytest-mock
Introduction to dictionary lookup algorithm
Introduction to Monte Carlo Method
[Learning memorandum] Introduction to vim
Introduction to PyTorch (1) Automatic differentiation
opencv-python Introduction to image processing
Introduction to Python Django (2) Win
Introduction to Cython Writing [Notes]
An introduction to private TensorFlow
Kubernetes Scheduler Introduction to Homebrew
An introduction to machine learning
[Introduction to cx_Oracle] Overview of cx_Oracle
A super introduction to Linux
Introduction
AOJ Introduction to Programming Topic # 7, Topic # 8
[Introduction to pytorch-lightning] First Lit ♬
Introduction to Anomaly Detection 1 Basics
Introduction to RDB with sqlalchemy Ⅰ
[Introduction to Systre] Fibonacci Retracement ♬
Introduction to Nonlinear Optimization (I)
Introduction to serial communication [Python]
AOJ Introduction to Programming Topic # 5, Topic # 6
Introduction to Deep Learning ~ Learning Rules ~
[Introduction to Python] <list> [edit: 2020/02/22]
Introduction to Python (Python version APG4b)
An introduction to Python Programming
[Introduction to cx_Oracle] (8th) cx_Oracle 8.0 release
Introduction to discord.py (3) Using voice
An introduction to Bayesian optimization
Deep Reinforcement Learning 1 Introduction to Reinforcement Learning
Super introduction to machine learning
Introduction to Ansible Part ③'Inventory'
Series: Introduction to cx_Oracle Contents
[Introduction] How to use open3d
Introduction to Python For, While
Introduction to Deep Learning ~ Backpropagation ~
Introduction to Ansible Part ④'Variable'