How to collect images in Python

Introduction

I wanted to collect images for deep learning and found the article listed in the reference section below. Its code no longer worked because the structure of the web page had changed, so I rewrote it.

Code

image_download.py


import json
import time
import urllib.parse
import urllib.request

import requests

def scraping(url, max_page_num):
    # Build the URL of each search results page
    page_list = get_page_list(url, max_page_num)
    # Collect the image URLs from every page
    all_img_src_list = []
    for page in page_list:
        try:
            img_src_list = get_img_src_list(page)
            all_img_src_list.extend(img_src_list)
        except Exception as e:
            # Skip pages that fail to load or parse, but report them
            print(f'Skipped {page}: {e}')
    return all_img_src_list


def get_img_src_list(url):
    # Fetch the search results page
    response = requests.get(url)
    webtext = response.text

    # The original article used Beautiful Soup, but that no longer found the images,
    # so the JSON embedded in the page's <script> tag is parsed instead.
    start_word = '<script>__NEXT_DATA__ = '
    start_num = webtext.find(start_word)
    webtext_start = webtext[start_num + len(start_word):]
    end_word = ';__NEXT_LOADED_PAGES__='
    
    end_num = webtext_start.find(end_word)
    webtext_all = webtext_start[:end_num]
    web_dic = json.loads(webtext_all)
    img_src_list = [img['imageSrc'] for img in web_dic["props"]["initialProps"]["pageProps"]["algos"]]

    return img_src_list


def get_page_list(url, max_page_num):
    img_num_per_page = 20  # Number of images per page (offset step); changing this changes the number of downloads
    page_list = [f'{url}{i*img_num_per_page+1}' for i in range(max_page_num)]
    return page_list

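def download_img(src, dist_path):
    time.sleep(1)  # Wait one second between downloads to limit the load on the server
    try:
        with urllib.request.urlopen(src) as data:
            img = data.read()
            with open(dist_path, 'wb') as f:
                f.write(img)
    except Exception as e:
        # Report the failure and move on to the next image
        print(f'Failed to download {src}: {e}')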


def main():
    search_words = ["Kanna Hashimoto"]  # List the words you want to search for
    for num, search_word in enumerate(search_words):
        # Percent-encode the search word so spaces and non-ASCII characters are safe in the URL
        url = f"https://search.yahoo.co.jp/image/search?p={urllib.parse.quote(search_word)}&ei=UTF-8&b="
        max_page_num = 20
        all_img_src_list = scraping(url, max_page_num)

        # Download the images
        for i, src in enumerate(all_img_src_list):
            download_img(src, f'./img/image_{num}_{i}.jpg')  # Change the save destination as needed


if __name__ == '__main__':
    main()
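
The core of get_img_src_list is cutting a JSON blob out of the HTML between two marker strings. The following is a minimal, network-free sketch of that step only. The helper name extract_image_urls and the sample HTML are hypothetical, and unlike the script it raises an explicit error when a marker is missing instead of passing a bad slice to json.loads.

```python
import json

START_MARKER = '<script>__NEXT_DATA__ = '
END_MARKER = ';__NEXT_LOADED_PAGES__='


def extract_image_urls(html):
    """Return the imageSrc values from the JSON embedded between the two markers."""
    start = html.find(START_MARKER)
    if start == -1:
        raise ValueError('start marker not found; the page layout may have changed')
    start += len(START_MARKER)
    end = html.find(END_MARKER, start)
    if end == -1:
        raise ValueError('end marker not found; the page layout may have changed')
    data = json.loads(html[start:end])
    algos = data['props']['initialProps']['pageProps']['algos']
    return [item['imageSrc'] for item in algos]


# Hypothetical, heavily shortened page used only for illustration
sample_html = (
    '<html><script>__NEXT_DATA__ = '
    '{"props": {"initialProps": {"pageProps": {"algos": ['
    '{"imageSrc": "https://example.com/a.jpg"}, '
    '{"imageSrc": "https://example.com/b.jpg"}]}}}}'
    ';__NEXT_LOADED_PAGES__=[]</script></html>'
)
print(extract_image_urls(sample_html))
```

Failing loudly on a missing marker makes it easier to notice when the page structure changes again, which is what broke the original article's code.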

If you create an img folder and run image_download.py with Python, the images are saved in that folder. The result looks like this: [image.png]
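
The img folder could also be created by the script itself, so a run does not fail when the folder is missing. This is a minimal sketch using pathlib. prepare_output_path is a hypothetical helper that is not part of the script, and its file-name pattern simply mirrors the one used in main().

```python
import tempfile
from pathlib import Path


def prepare_output_path(base_dir, word_index, image_index):
    """Create base_dir if needed and return the path for one image file."""
    out_dir = Path(base_dir)
    # parents=True creates missing parent folders; exist_ok=True makes repeated calls safe
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f'image_{word_index}_{image_index}.jpg'


# Demonstration in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = prepare_output_path(Path(tmp) / 'img', 0, 3)
    path.write_bytes(b'placeholder bytes, not a real JPEG')
    print(path.name, path.parent.is_dir())
```

The returned Path can be passed directly to open(), so it would fit the dist_path argument of download_img.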

Be careful: scraping puts a load on the server you are accessing!
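
One way to act on this warning is to check a site's robots.txt before fetching and to keep a pause between requests, as download_img already does with time.sleep(1). Below is a minimal offline sketch using the standard-library urllib.robotparser. The robots.txt rules, the example.com URLs, the user-agent string and the helper name allowed_urls are all made up for illustration and do not describe any real site's policy.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real script would load the site's own file
# with RobotFileParser.set_url(...) followed by read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that robots_txt permits user_agent to fetch, plus the crawl delay."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    permitted = [u for u in urls if parser.can_fetch(user_agent, u)]
    # crawl_delay() returns None when the file sets no Crawl-delay for this agent
    delay = parser.crawl_delay(user_agent)
    return permitted, delay


urls = [
    'https://example.com/images/a.jpg',
    'https://example.com/private/b.jpg',
]
permitted, delay = allowed_urls(SAMPLE_ROBOTS_TXT, 'my-image-collector', urls)
print(permitted, delay)
```

If a delay is reported, it could replace the fixed one-second sleep; when it is None, keeping a conservative pause of your own is still a reasonable default.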

Reference

I tried to automatically collect images of Kanna Hashimoto with Python!!
