[PYTHON] Automatically download images with scraping

Hello, this is @ibusan. As the title suggests, this time I implemented a program that automatically downloads images by scraping. I am writing it up as a memorandum so I can look back at it if I forget.

Overview

There are many possible targets for scraping, but this time I decided to automatically collect the fan kits for the game "Princess Connect! Re: Dive", which I am currently hooked on. There are far too many to download by hand.

Preparation

First, prepare the environment for scraping. The environment built this time is as follows.

  1. Anaconda
  2. ChromeDriver

Anaconda is a platform that provides Python packages for data science; you can install it from the link above. ChromeDriver is the driver needed to control Chrome programmatically. Once Anaconda is installed, you can install the driver with:

pip install chromedriver-binary=='Driver version'

The driver version must match the version of Chrome you have installed. The following site is helpful when installing ChromeDriver: ChromeDriver installation procedure

The libraries used are as follows. selenium, beautifulsoup4, and requests can be installed with pip; os and time are part of the Python standard library, so no installation is needed.

  1. selenium
  2. BeautifulSoup (beautifulsoup4)
  3. requests
  4. os
  5. time

Policy

This time, we will proceed with the implementation according to the following procedure.

  1. Get the URLs of all the pages (page 1, page 2, and so on) of the fan kit site
  2. Get the URLs of all the fan kits on each page
  3. Go to each URL obtained in step 2 and download the fan kit
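The procedure above can be sketched as a skeleton. The function names here are my own illustration, not from the article, and the bodies are stubs; the actual script below is written as one flat block.

```python
# Hypothetical outline of the three steps; names are illustrative only.
def get_page_urls(top_url):
    """Step 1: collect the URL of every page of the fan kit list."""
    return []  # stub

def get_fankit_urls(page_url):
    """Step 2: collect the URL of every fan kit on one list page."""
    return []  # stub

def download_fankit(fankit_url):
    """Step 3: visit a fan kit page and save its images to a folder."""

def main(top_url):
    for page_url in get_page_urls(top_url):
        for fankit_url in get_fankit_urls(page_url):
            download_fankit(fankit_url)
```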

Coding

Now that the policy has been decided, we can start coding.

from selenium import webdriver
import time
import os
from bs4 import BeautifulSoup
import requests

First, import the libraries: the five listed in Preparation.

#Launch Google Chrome
browser = webdriver.Chrome("/Users/ibuki_sakata/opt/anaconda3/lib/python3.7/site-packages/chromedriver_binary/chromedriver")
browser.implicitly_wait(3)

Next, launch Chrome using ChromeDriver and selenium. The second line starts the browser; the path in parentheses is the path to ChromeDriver (adjust it to your own environment). The third line sets an implicit wait of 3 seconds, which tells the driver to wait up to that long for elements to appear before raising an error.

#Go to URL
url_pricone = "https://priconne-redive.jp/fankit02/"
browser.get(url_pricone)
time.sleep(3)

The first line specifies the URL of the fan kit top page, and the second line makes the browser navigate to it. browser.get works much like the GET method in HTTP. The time.sleep(3) on the third line pauses for 3 seconds to let the page finish loading.

#Get the URL of all fan kit web pages
current_url = browser.current_url
html = requests.get(current_url)
bs = BeautifulSoup(html.text, "html.parser")
fankitPage = bs.find("ul", class_="page-nav").find_all("li")
page = []

for li_tag in fankitPage:
    a_tag = li_tag.find("a")
    if(a_tag.get('class')):
        page.append(current_url)
    else:
        page.append(a_tag.get("href"))

Here we get the URLs of all the pages (the first page, the second page, and so on). BeautifulSoup is used to extract them. Many sites explain how to use BeautifulSoup in detail, so I will not cover it here.
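As a minimal, self-contained sketch of what the loop above does (the HTML snippet is a made-up stand-in for the real page, and the `page/2/` URL is hypothetical): the link for the page currently on display carries a class and no usable href, so the current URL is stored for it instead.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page's navigation HTML: the link for the
# currently displayed page carries a class, the others carry plain hrefs.
nav_html = """
<ul class="page-nav">
  <li><a class="current" href="#">1</a></li>
  <li><a href="https://priconne-redive.jp/fankit02/page/2/">2</a></li>
</ul>
"""

current_url = "https://priconne-redive.jp/fankit02/"
bs = BeautifulSoup(nav_html, "html.parser")
page = []
for li_tag in bs.find("ul", class_="page-nav").find_all("li"):
    a_tag = li_tag.find("a")
    if a_tag.get("class"):        # current page: use the URL we are already on
        page.append(current_url)
    else:
        page.append(a_tag.get("href"))
```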

#Download fan kit
for p in page:
    html = requests.get(p)
    browser.get(p)
    time.sleep(1)
    bs = BeautifulSoup(html.text, "html.parser")
    ul_fankit_list = bs.find("ul", class_="fankit-list")
    li_fankit_list = ul_fankit_list.find_all("li")
    fankit_url = []
    for li_tag in li_fankit_list:
        a_tag = li_tag.find("a")
        fankit_url.append(a_tag.get("href"))

    for url in fankit_url:
        browser.get(url)
        time.sleep(1)
        html_fankit = requests.get(url)
        bs_fankit = BeautifulSoup(html_fankit.text, "html.parser")
        h3_tag = bs_fankit.find("h3")
        title = h3_tag.text
        os.makedirs(title, exist_ok=True)
        ul_dl_btns = bs_fankit.find_all("ul", class_="dl-btns")
        for i,ul_tag in enumerate(ul_dl_btns, start=0):
            li_tag = ul_tag.find("li")
            a_tag = li_tag.find("a")
            img_url = a_tag.get("href")
            browser.get(img_url)
            time.sleep(1)
            print(img_url)
            img = requests.get(img_url)
            with open(title + "/{}.jpg".format(i), "wb") as f:
                f.write(img.content)
        browser.back()
    

This part downloads the fan kits. The basics are the same as before: fetch the HTML source with requests, parse it with BeautifulSoup, and extract the desired tags. Each image is saved by opening a file in binary mode and writing the image data fetched with requests.
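The save step can be sketched in isolation. The folder name and the bytes below are placeholders: in the real script the title comes from the page's h3 tag and the bytes come from requests.get(img_url).content.

```python
import os

title = "sample-fankit"                      # hypothetical folder name
fake_image_bytes = b"\xff\xd8\xff\xe0 jpeg"  # stand-in for downloaded image data

os.makedirs(title, exist_ok=True)            # one folder per fan kit
path = os.path.join(title, "0.jpg")          # images are numbered 0.jpg, 1.jpg, ...
with open(path, "wb") as f:                  # "wb": write in binary mode
    f.write(fake_image_bytes)
```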

Execution result

(Screenshot of the execution result, taken 2020-06-07)

In this way, the images are downloaded for each fan kit.
