There are various approaches to web scraping in Python. In this article, we'll take Scrapy, a scraping framework, as our subject and learn how it works by building a simple sample.

What is Scrapy

Scrapy is a fast, high-level scraping framework. It provides a wide range of features for crawling and scraping websites. These features are split into components, and you build a program by writing classes for the components you need.

The main components are:

Scrapy Engine

Scheduler

Downloader

Spiders

Item Pipeline

As you can see, Scrapy offers a lot of functionality. In this article, we'll start with the Spider, the most basic concept in Scrapy, and write a program that collects the URLs of the articles posted to the Advent Calendars on Qiita.


First, install with pip.

pip install scrapy

Creating a Spider

Next, create a Spider, which is one of the components. A Spider defines the URLs where the crawl starts and describes how to extract URLs from the downloaded pages.

# -*- coding: utf-8 -*-

import scrapy


class QiitaSpider(scrapy.Spider):
    name = 'qiita_spider'

    # Endpoint (list the URLs where crawling starts)
    start_urls = ['http://qiita.com/advent-calendar/2015/categories/programming_languages']

    # Wait 1 second between requests so the site is not overloaded
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
    }

    # Describe the URL extraction process
    def parse(self, response):
        for href in response.css('.adventCalendarList .adventCalendarList_calendarTitle > a::attr(href)'):
            full_url = response.urljoin(href.extract())

            # Create a Request from the extracted URL and download the page
            yield scrapy.Request(full_url, callback=self.parse_item)

    # Extract the contents of the downloaded page and yield them as an item
    def parse_item(self, response):
        urls = []
        for href in response.css('.adventCalendarItem_entry > a::attr(href)'):
            full_url = response.urljoin(href.extract())
            urls.append(full_url)

        yield {
            'title': response.css('h1::text').extract(),
            'urls': urls,
        }
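
The urljoin step is worth a closer look. The href values on a page are often relative, and response.urljoin resolves them against the URL of the page being parsed. As a rough sketch, the same resolution can be tried with the standard library's urllib.parse.urljoin, which as far as I know is what response.urljoin relies on internally. The href values below are made-up placeholders, not ones taken from Qiita.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL used by the spider)
page_url = 'http://qiita.com/advent-calendar/2015/categories/programming_languages'

# Hypothetical href values, for illustration only
hrefs = [
    '/advent-calendar/2015/python',   # root-relative path
    'http://example.com/entry/1',     # already absolute
]


def resolve_all(base, links):
    """Resolve each link against base, similar to calling response.urljoin per link."""
    return [urljoin(base, link) for link in links]


for url in resolve_all(page_url, hrefs):
    print(url)
```

A root-relative path is joined to the scheme and host of the page, while a URL that is already absolute is returned unchanged. This is why the spider can pass every extracted href through urljoin without checking its form first.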


Scrapy comes with many commands. Here, we run the Spider with the runspider command. The -o option saves the items yielded by parse_item to a file in JSON format.

scrapy runspider qiita_spider.py -o advent_calendar.json
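
If you would rather not pass -o every time, recent Scrapy versions (2.1 and later, as far as I know) also accept the output file through the FEEDS setting. The following is only a sketch of what the spider's custom_settings might look like under that assumption. The "encoding" entry is optional and is there so that non-ASCII titles are written as-is instead of as \uXXXX escapes.

```python
# Sketch of a custom_settings dict for QiitaSpider (assumes Scrapy 2.1+ for FEEDS)
custom_settings = {
    # Wait 1 second between requests
    "DOWNLOAD_DELAY": 1,
    # Output file -> export options; stands in for "-o advent_calendar.json"
    "FEEDS": {
        "advent_calendar.json": {"format": "json", "encoding": "utf8"},
    },
}
```

With a setting like this in the spider, running scrapy runspider qiita_spider.py without -o should write the same file.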


The execution result is as follows. I was able to get a list of the titles and posted URLs of each Advent Calendar!

[
  {
    "urls": [
      ...
    ],
    "title": [
      "Python \u305d\u306e2 Advent Calendar 2015"
    ]
  },
  {
    "urls": [
      ...
    ],
    "title": [
      "Python Advent Calendar 2015"
    ]
  }
]

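The \u305d\u306e in the first title is simply how the JSON exporter escapes non-ASCII characters. Loading the file with a JSON parser restores the original text. Here is a minimal sketch using Python's json module. It parses an inline sample in the same shape as the output instead of reading advent_calendar.json, and the example.com URLs are placeholders, not real results.

```python
import json

# Inline sample shaped like advent_calendar.json; the URLs are placeholders
sample = r'''
[
  {"urls": ["http://example.com/a", "http://example.com/b"],
   "title": ["Python \u305d\u306e2 Advent Calendar 2015"]},
  {"urls": ["http://example.com/c"],
   "title": ["Python Advent Calendar 2015"]}
]
'''


def summarize(items):
    """Return (title, number_of_urls) pairs for the scraped calendars."""
    return [(" ".join(item["title"]), len(item["urls"])) for item in items]


# json.loads turns the \uXXXX escapes back into the original characters
items = json.loads(sample)
for title, count in summarize(items):
    print(f"{title}: {count} URLs")
```

To work on the real output, replace json.loads(sample) with json.load on the opened advent_calendar.json file.
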
Conclusion

In this article, I created a Spider, a basic component of Scrapy, and used it to scrape a site. With Scrapy, the framework takes care of the routine processing involved in crawling, so developers can concentrate on writing only the parts their service or application really needs, such as URL extraction and data storage. Next time, we'll cover caching and saving data. Stay tuned!

