・ I'm using Scrapy here simply because I wanted to try it out.
・ For a task this small, Beautiful Soup would definitely be the better choice.
・ The goal is to download all the playing-card images from the irasutoya pages listed in start_urls.
$ pip install scrapy
...
..
.
$ scrapy version #Check version
Scrapy 1.8.0
2-1.
$ scrapy startproject download_images
The project directory has been created.
$ cd download_images
download_images $ tree
.
├── download_images
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
Uncomment DOWNLOAD_DELAY in settings.py and set the interval between requests (unit: seconds).
If requests are sent too close together, the crawl can look like a DoS attack, so be sure to set this.
(Some sites will block you otherwise.)
settings.py
...
..
.
DOWNLOAD_DELAY = 3
.
..
...
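For a rough sense of what this value means in practice: as far as I know, Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting is enabled by default, and with it the actual wait between requests to the same site is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only illustrates that arithmetic. The function names delay_bounds and sample_delay are hypothetical and are not part of Scrapy.

```python
import random


def delay_bounds(download_delay, randomize=True):
    """Return the (min, max) wait in seconds between two requests.

    Hypothetical helper: it mirrors the documented behaviour of
    DOWNLOAD_DELAY combined with RANDOMIZE_DOWNLOAD_DELAY, not Scrapy's code.
    """
    if randomize:
        return (0.5 * download_delay, 1.5 * download_delay)
    return (download_delay, download_delay)


def sample_delay(download_delay, randomize=True):
    """Draw one wait time from the range computed above."""
    low, high = delay_bounds(download_delay, randomize)
    return random.uniform(low, high)


if __name__ == "__main__":
    # With DOWNLOAD_DELAY = 3, print the range of possible waits.
    print(delay_bounds(3))
```

So with DOWNLOAD_DELAY = 3, each request would wait somewhere between 1.5 and 4.5 seconds unless randomization is turned off.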
Next, simply uncomment the lines that start with HTTPCACHE_.
This enables the HTTP cache, which saves you from requesting the same pages over and over during trial and error.
settings.py
.
..
...
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
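One detail worth spelling out is HTTPCACHE_EXPIRATION_SECS = 0: in Scrapy's documentation, zero means cached responses never expire, so pages fetched once are reused on every later run until the httpcache directory is deleted. The sketch below illustrates that rule under my own assumptions. The function is_expired and its parameters are hypothetical names, not Scrapy's API.

```python
import time


def is_expired(stored_at, expiration_secs, now=None):
    """Decide whether a cached response should be treated as stale.

    stored_at       -- Unix timestamp when the response was cached
    expiration_secs -- value of HTTPCACHE_EXPIRATION_SECS (0 = never expire)
    now             -- current Unix timestamp (defaults to time.time())
    """
    if now is None:
        now = time.time()
    # A non-positive expiration disables expiry entirely.
    return 0 < expiration_secs < now - stored_at


if __name__ == "__main__":
    one_day_ago = time.time() - 24 * 60 * 60
    print(is_expired(one_day_ago, 0))     # never expires
    print(is_expired(one_day_ago, 3600))  # older than one hour
```

If the target pages change while you are experimenting, deleting the cache directory or setting a positive expiration forces a fresh download.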
Create a spider template.
$ scrapy genspider download_images_spider www.irasutoya.com
Run the command from the directory marked below:
download_images #← here
├── download_images
│   ├── ...
│   ├── ..
│   ├── .
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
After that, the spiders directory looks like this:
download_images
├── download_images
│   ├── ...
│   ├── ..
│   ├── .
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-37.pyc
│       └── download_images_spider.py
└── scrapy.cfg
download_images_spider.py
# -*- coding: utf-8 -*-
import os, time, urllib.request  # time is needed for sleep(); urllib.request must be imported explicitly
import scrapy
from download_images.items import DownloadImagesItem
class DownloadImagesSpiderSpider(scrapy.Spider):
    name = 'download_images_spider'
    allowed_domains = ['www.irasutoya.com']
    start_urls = [
        'https://www.irasutoya.com/2010/05/numbercardspade.html', #Spades (numbers)
        'https://www.irasutoya.com/2017/05/facecardspade.html', #Spade (picture card)
        'https://www.irasutoya.com/2010/05/numbercardheart.html', #Heart (number)
        'https://www.irasutoya.com/2017/05/facecardheart.html', #Heart (picture card)
        'https://www.irasutoya.com/2010/05/numbercarddiamond.html', #Diamond (number)
        'https://www.irasutoya.com/2017/05/facecarddiamond.html', #Diamond (picture card)
        'https://www.irasutoya.com/2010/05/numbercardclub.html', #Club (number)
        'https://www.irasutoya.com/2017/05/facecardclub.html', #Club (picture card)
        'https://www.irasutoya.com/2017/05/cardjoker.html', #Joker
        'https://www.irasutoya.com/2017/05/cardback.html', #Back side
    ]
    dest_dir = '/Users/~~~/images' #Download destination directory
    def parse(self, response):
        #Rewrite the CSS selector to suit the target web page.
        for image in response.css('div.separator img'):
            #URL of the file to download
            image_url = image.css('::attr(src)').extract_first().strip()
            #File name of the image to download
            file_name = image_url[image_url.rfind('/') + 1:]
            #If the image download destination does not exist, create it
            os.makedirs(self.dest_dir, exist_ok=True)
            
            #download
            urllib.request.urlretrieve(image_url, os.path.join(self.dest_dir, file_name))
            time.sleep(1) #Download interval is 1 second
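The file name above is cut out of the URL with rfind('/'), which works for these pages but would keep a query string in the name and does not handle relative or scheme-less src values. As a more defensive sketch, the same step can be done with the standard library's urllib.parse. The helper name image_file_name and the example.com URLs are hypothetical and only for illustration.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def image_file_name(page_url, src):
    """Return (absolute_url, file_name) for an <img> src found on page_url."""
    # Resolve relative and protocol-relative URLs against the page URL.
    absolute_url = urljoin(page_url, src.strip())
    # Use only the path component so query strings do not end up in the name.
    path = urlsplit(absolute_url).path
    file_name = unquote(posixpath.basename(path))
    return absolute_url, file_name


if __name__ == "__main__":
    page = 'https://www.irasutoya.com/2010/05/numbercardspade.html'
    # Hypothetical src value, not taken from the real page.
    print(image_file_name(page, ' //example.com/img/card_spade_01.png?x=1 '))
```

Inside parse(), Scrapy's own response.urljoin(src) does the same URL resolution, so only the file-name part would need a helper like this.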
Run the spider from the project directory:
$ scrapy crawl download_images_spider
All the playing-card images have now been downloaded!
