[PYTHON] Introducing "icrawler" (0.6.3), an automatic image collection package useful for machine learning

This article introduces a package that automates image collection, one of the more tedious parts of deep learning with images.

It can collect images from search engines, collect images posted on SNS, and automatically download images from web pages.

The crawler for Google appears to be unusable for now because of a specification change in Google's image search engine. The Google crawler was fixed 4 days before this article was posted (2020-10-10), so I think this will improve soon.

Download from search engine

Download from Bing and Baidu

from icrawler.builtin import BaiduImageCrawler, BingImageCrawler, GoogleImageCrawler


bing_crawler = BingImageCrawler(downloader_threads=4,
                                storage={'root_dir': 'C:\\Users\\Desktop\\0\\your_dir'})
bing_crawler.crawl(keyword='cat', filters=None, offset=0, max_num=10)

Specify where the images are saved with storage. If you give root_dir a plain name instead of a path, a folder with that name is created automatically in the working directory and the images are collected there.

Specify the search term with keyword.
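
To check in advance where a given root_dir value will end up, here is a minimal sketch using only the standard library. The helper name resolve_root_dir is hypothetical and not part of icrawler. It assumes that root_dir is handed to the file system unchanged, so a plain name is interpreted relative to the current working directory.

```python
from pathlib import Path


def resolve_root_dir(root_dir: str) -> Path:
    """Return the absolute folder a root_dir value would point to.

    Hypothetical helper (not part of icrawler): a plain name such as
    'your_dir' is taken relative to the current working directory,
    while an absolute path is returned unchanged.
    """
    path = Path(root_dir)
    return path if path.is_absolute() else Path.cwd() / path


if __name__ == "__main__":
    # Prints the working directory followed by 'your_dir'.
    print(resolve_root_dir("your_dir"))
```

Running this from the directory where you start the crawler shows the folder that a plain name would map to.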

baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'your_image_dir'})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=100,
                    min_size=(200, 200), max_size=None)

Even when max_num is set to 1000, only about 800 images are downloaded. If you specify a directory that was already used, a download is skipped when a file with the same name and extension already exists.
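
Because the number of files actually saved can fall short of max_num, it may help to count what arrived after a crawl. The sketch below uses only the standard library. count_images and IMAGE_SUFFIXES are names invented for illustration, not icrawler APIs, and the list of extensions is an assumption.

```python
from pathlib import Path

# Extensions treated as images here; this list is an assumption.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(root_dir: str) -> int:
    """Count image files directly inside root_dir (non-recursive)."""
    folder = Path(root_dir)
    if not folder.is_dir():
        return 0
    return sum(
        1
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    requested = 1000
    saved = count_images("your_image_dir")
    print(f"requested {requested}, saved {saved}")
```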

Download from website

This crawls the website from end to end and downloads every image it finds, so you need to sort the results after the download.

from icrawler.builtin import GreedyImageCrawler

greedy_crawler = GreedyImageCrawler(storage={'root_dir': 'di'})
greedy_crawler.crawl(domains='https://URL with the image you want to download.html',
                     max_num=10, min_size=None, max_size=None)
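
Since the greedy crawler saves everything it finds, a mechanical first pass can make the manual sorting easier. The following is a minimal sketch under that assumption: it groups the downloaded files into subfolders by file extension using only the standard library. The function name sort_by_extension is hypothetical, and deciding which images are actually relevant still has to be done by hand.

```python
import shutil
from pathlib import Path


def sort_by_extension(root_dir: str) -> dict:
    """Move each file in root_dir into a subfolder named after its extension.

    Returns a mapping of extension -> number of files moved.
    """
    folder = Path(root_dir)
    moved = {}
    # Take a snapshot of the listing, because subfolders are created below.
    for path in list(folder.iterdir()):
        if not path.is_file():
            continue
        ext = path.suffix.lower().lstrip(".") or "no_extension"
        target_dir = folder / ext
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved[ext] = moved.get(ext, 0) + 1
    return moved


if __name__ == "__main__":
    # 'di' is the root_dir used in the example above.
    print(sort_by_extension("di"))
```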

Download from SNS (Flickr)

This downloads images based on Flickr search results. It requires a simple user registration with an email address and name (confirmation may not work with a Google Mail address?).

After signing in, request an API key to use the crawler.

API Request

Indicate that the use is non-profit, enter the purpose of use and so on, then run the code once the API key has been issued.


from datetime import date
from icrawler.builtin import FlickrImageCrawler

flickr_crawler = FlickrImageCrawler('Issued key here',
                                    storage={'root_dir': 'image_dir'})
flickr_crawler.crawl(max_num=100, tags='cat,dog',
                     min_upload_date=date(2019, 5, 1))
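
To avoid pasting the issued key directly into a script, one option is to read it from an environment variable. This is a minimal sketch, not something the icrawler documentation prescribes: the variable name FLICKR_API_KEY and the helper names get_flickr_api_key and crawl_flickr are assumptions made for illustration. The crawl arguments are the same as in the example above.

```python
import os
from datetime import date


def get_flickr_api_key(env_var: str = "FLICKR_API_KEY") -> str:
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def crawl_flickr(root_dir: str = "image_dir") -> None:
    """Run the Flickr crawl from the article with a key taken from the environment."""
    # Imported here so the key helper above works without icrawler installed.
    from icrawler.builtin import FlickrImageCrawler

    crawler = FlickrImageCrawler(get_flickr_api_key(),
                                 storage={"root_dir": root_dir})
    crawler.crawl(max_num=100, tags="cat,dog",
                  min_upload_date=date(2019, 5, 1))


if __name__ == "__main__":
    crawl_flickr()
```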

For Google (I have not been able to confirm that this works)

Until a while ago, the problem was that only 100 images could be downloaded, but downloading itself worked. Now I can't even confirm that downloading works.

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    feeder_threads=1,
    parser_threads=1,
    downloader_threads=4,
    storage={'root_dir': 'er'})

google_crawler.crawl(keyword='cat', offset=0, max_num=10,
                     min_size=(200,200), max_size=None, file_idx_offset=0)

That's all.

Reference site: Welcome to icrawler
