[PYTHON] Download images from "Irasutoya" using Scrapy


・ I'm using Scrapy here simply because I wanted to try it out.
・ For a task of this size, Beautiful Soup is a better fit than Scrapy.

The download target is this page: we download all the playing card images on the pages it links to.

1. Install Scrapy

$ pip install scrapy
$ scrapy version #Check version
Scrapy 1.8.0

2. Create a project


$ scrapy startproject download_images

The project directory has been created.

$ cd download_images
download_images $ tree
├── download_images
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

2-2. Setting the request transmission interval

Uncomment DOWNLOAD_DELAY in settings.py and set the interval between requests (unit: seconds). If the interval is too short, the crawl can look like a DoS attack, so be sure to set it. (Some sites will block you.)
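As a minimal sketch, the uncommented line in settings.py could look like the following. The value 3 is only an assumption for illustration; pick an interval that suits the target site.

```python
# settings.py (excerpt)
# Wait this many seconds between consecutive requests to the same site.
# The value 3 is an assumed example, not a recommendation from this article.
DOWNLOAD_DELAY = 3
```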



2-3. Enable the cache.

Just uncomment the settings that start with HTTPCACHE_ in settings.py. This saves you from hitting the same page over and over while you experiment.


HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Download the image

3-1. Create Spider

Create a template

$ scrapy genspider download_images_spider www.irasutoya.com

Run the command from the project root directory, marked below:

download_images #← here
├── download_images
│   ├── ...
│   ├── ..
│   ├── .
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

After that, the spiders directory looks like this:

├── download_images
│   ├── ...
│   ├── ..
│   ├── .
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-37.pyc
│       └── download_images_spider.py
└── scrapy.cfg

3-3. Edit the created template file


# -*- coding: utf-8 -*-
import os
import time
import urllib.request

import scrapy

class DownloadImagesSpiderSpider(scrapy.Spider):
    name = 'download_images_spider'
    allowed_domains = ['www.irasutoya.com']
    start_urls = [
        'https://www.irasutoya.com/2010/05/numbercardspade.html', #Spades (numbers)
        'https://www.irasutoya.com/2017/05/facecardspade.html', #Spade (picture card)

        'https://www.irasutoya.com/2010/05/numbercardheart.html', #Heart (number)
        'https://www.irasutoya.com/2017/05/facecardheart.html', #Heart (picture card)

        'https://www.irasutoya.com/2010/05/numbercarddiamond.html', #Diamond (number)
        'https://www.irasutoya.com/2017/05/facecarddiamond.html', #Diamond (picture card)

        'https://www.irasutoya.com/2010/05/numbercardclub.html', #Club (number)
        'https://www.irasutoya.com/2017/05/facecardclub.html', #Club (picture card)

        'https://www.irasutoya.com/2017/05/cardjoker.html', #Joker

        'https://www.irasutoya.com/2017/05/cardback.html', #Back side
    ]
    dest_dir = '/Users/~~~/images' #Download destination directory

    def parse(self, response):
        #Depending on the web page, you need to rewrite the CSS selector to the appropriate one.
        for image in response.css('div.separator img'):
            #URL of the file to download
            image_url = image.css('::attr(src)').extract_first().strip()

            #File name of the image to download
            file_name = image_url[image_url.rfind('/') + 1:]

            #If the image download destination does not exist, create it
            if not os.path.exists(self.dest_dir):
                os.makedirs(self.dest_dir)

            #Download the image and save it in the destination directory
            urllib.request.urlretrieve(image_url, os.path.join(self.dest_dir, file_name))

            time.sleep(1) #Download interval is 1 second
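
The spider above takes the file name by slicing after the last '/'. If a src attribute ever carries a query string or comes without a scheme, a slightly more defensive variant might look like the sketch below. The helper names absolute_image_url and file_name_from_url are hypothetical and not part of the original spider. Only the standard library is used.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def absolute_image_url(page_url, src):
    """Resolve a possibly relative or scheme-less src against the page URL."""
    return urljoin(page_url, src.strip())


def file_name_from_url(image_url):
    """Return the last path segment of the URL, ignoring query and fragment."""
    path = urlsplit(image_url).path
    return posixpath.basename(unquote(path))
```

Inside parse, these could be combined as file_name_from_url(absolute_image_url(response.url, src)), where src is the value extracted from the img tag.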

Run the spider from the project root with `scrapy crawl download_images_spider`, and all the playing card images are downloaded to dest_dir.

