Introducing a package that makes it easy to collect the images needed for deep learning, a task that is otherwise quite troublesome.
You can collect images from search engines, gather posted images from SNS, and automatically download images from web pages.
It seems the Google crawler cannot be used yet because of a specification change in Google's image search engine. The crawler was fixed four days before this article was posted (2020-10-10), so I expect it will be working again soon.
Download from Bing and Baidu
from icrawler.builtin import BaiduImageCrawler, BingImageCrawler

bing_crawler = BingImageCrawler(
    downloader_threads=4,
    storage={'root_dir': 'C:\\Users\\Desktop\\0\\your_dir'})
bing_crawler.crawl(keyword='cat', filters=None, offset=0, max_num=10)
Specify where to save the images with storage. If you pass a bare name to root_dir instead of a full path, a directory with that name is created automatically under the working directory and the images are collected there.
Specify the search term with keyword.
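To see where a bare root_dir name ends up, here is a minimal sketch using only the standard library (resolve_storage_dir is a hypothetical helper for illustration, not part of icrawler):

```python
from pathlib import Path

def resolve_storage_dir(root_dir):
    """Show where a given root_dir value would land:
    a bare name resolves under the current working directory."""
    return Path(root_dir).expanduser().resolve()

# A bare name like 'your_dir' resolves to <current working dir>/your_dir
print(resolve_storage_dir('your_dir'))
```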
baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'your_image_dir'})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=100,
                    min_size=(200, 200), max_size=None)
Even when max_num is set to 1000, only around 800 images are actually downloaded. If you run the crawler again with the same directory specified, files whose name and extension already exist there are skipped.
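Because fewer files than max_num may actually arrive, it can help to count what landed in the directory before resuming a crawl. A minimal sketch using only the standard library (count_downloaded is a hypothetical helper, and the extension list is an assumption):

```python
import os

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}

def count_downloaded(root_dir):
    """Count the image files already sitting in the storage directory."""
    if not os.path.isdir(root_dir):
        return 0
    return sum(
        1 for name in os.listdir(root_dir)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS)
```

One possible use: pass this count as file_idx_offset when resuming, so newly downloaded files continue the numbering instead of colliding with existing names; check the icrawler documentation to confirm this fits your crawler.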
GreedyImageCrawler searches a website from one end and downloads every image it finds, so you will need to sort through the results after downloading.
from icrawler.builtin import GreedyImageCrawler

greedy_crawler = GreedyImageCrawler(storage={'root_dir': 'di'})
greedy_crawler.crawl(domains='https://URL with the image you want to download.html',
                     max_num=10, min_size=None, max_size=None)
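Since the greedy crawler grabs everything it finds, duplicates are common. As one way to sort the results afterwards, here is a sketch that removes byte-identical files using only the standard library (dedupe_images is a hypothetical helper for illustration, not part of icrawler):

```python
import hashlib
import os

def dedupe_images(root_dir):
    """Delete byte-identical duplicate files, keeping the first one seen
    (in sorted filename order). Returns the names of removed files."""
    seen = {}
    removed = []
    for name in sorted(os.listdir(root_dir)):
        path = os.path.join(root_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen[digest] = name
    return removed
```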
Downloads images based on Flickr search results. Requires simple user registration with an email address, name, and so on (I could not confirm whether a Gmail address works).
After signing in, you can request an API key.
Enter non-commercial use, the purpose of use, and so on, and run the code once the API key has been issued.
from datetime import date
from icrawler.builtin import FlickrImageCrawler

flickr_crawler = FlickrImageCrawler('Issued key here',
                                    storage={'root_dir': 'image_dir'})
flickr_crawler.crawl(max_num=100, tags='cat,dog',
                     min_upload_date=date(2019, 5, 1))
Until a while ago there was a problem where only 100 images could be downloaded, but downloading itself still worked. Now I cannot even confirm that downloading works at all.
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    feeder_threads=1,
    parser_threads=1,
    downloader_threads=4,
    storage={'root_dir': 'er'})
google_crawler.crawl(keyword='cat', offset=0, max_num=10,
                     min_size=(200, 200), max_size=None, file_idx_offset=0)
Reference site: Welcome to icrawler (official documentation)