Introducing a package that makes it easy to collect the images needed for deep learning, a task that is otherwise quite troublesome.
You can collect images from search engines, gather posted images from SNS, and automatically download images from web pages.
It seems the Google crawler cannot be used yet because of a specification change in Google's image search engine. The crawler was fixed four days before this article was posted (2020-10-10), so I expect it will be working again soon.
Download from Bing and Baidu
from icrawler.builtin import BaiduImageCrawler, BingImageCrawler

bing_crawler = BingImageCrawler(
    downloader_threads=4,
    storage={'root_dir': 'C:\\Users\\Desktop\\0\\your_dir'})
bing_crawler.crawl(keyword='cat', filters=None, offset=0, max_num=10)
Specify where to save the images with storage. If you pass a bare name to root_dir instead of a full path, a directory with that name is created automatically under the working directory and the images are collected there.
Specify the search term with keyword.
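To see where a bare root_dir name ends up, here is a minimal sketch using only the standard library (resolve_storage_dir is a hypothetical helper for illustration, not part of icrawler):

```python
from pathlib import Path

def resolve_storage_dir(root_dir):
    """Show where a given root_dir value would land:
    a bare name resolves under the current working directory."""
    return Path(root_dir).expanduser().resolve()

# A bare name like 'your_dir' resolves to <current working dir>/your_dir
print(resolve_storage_dir('your_dir'))
```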
baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'your_image_dir'})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=100,
                    min_size=(200, 200), max_size=None)
Even when max_num is set to 1000, only around 800 images are actually downloaded. If you run the crawler again with the same directory specified, files whose name and extension already exist there are skipped.
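Because fewer files than max_num may actually arrive, it can help to count what landed in the directory before resuming a crawl. A minimal sketch using only the standard library (count_downloaded is a hypothetical helper, and the extension list is an assumption):

```python
import os

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}

def count_downloaded(root_dir):
    """Count the image files already sitting in the storage directory."""
    if not os.path.isdir(root_dir):
        return 0
    return sum(
        1 for name in os.listdir(root_dir)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS)
```

One possible use: pass this count as file_idx_offset when resuming, so newly downloaded files continue the numbering instead of colliding with existing names; check the icrawler documentation to confirm this fits your crawler.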
GreedyImageCrawler searches a website from one end and downloads every image it finds, so you will need to sort through the results after downloading.
from icrawler.builtin import GreedyImageCrawler

greedy_crawler = GreedyImageCrawler(storage={'root_dir': 'di'})
greedy_crawler.crawl(domains='https://URL with the image you want to download.html',
                     max_num=10, min_size=None, max_size=None)
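Since the greedy crawler grabs everything it finds, duplicates are common. As one way to sort the results afterwards, here is a sketch that removes byte-identical files using only the standard library (dedupe_images is a hypothetical helper for illustration, not part of icrawler):

```python
import hashlib
import os

def dedupe_images(root_dir):
    """Delete byte-identical duplicate files, keeping the first one seen
    (in sorted filename order). Returns the names of removed files."""
    seen = {}
    removed = []
    for name in sorted(os.listdir(root_dir)):
        path = os.path.join(root_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen[digest] = name
    return removed
```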
Downloads images based on Flickr search results. Requires simple user registration with an email address, name, and so on (I could not confirm whether a Gmail address works).
After signing in, you can request an API key.
Enter non-commercial use, the purpose of use, and so on, and run the code once the API key has been issued.
from datetime import date
from icrawler.builtin import FlickrImageCrawler

flickr_crawler = FlickrImageCrawler('Issued key here',
                                    storage={'root_dir': 'image_dir'})
flickr_crawler.crawl(max_num=100, tags='cat,dog',
                     min_upload_date=date(2019, 5, 1))
Until a while ago there was a problem where only 100 images could be downloaded, but downloading itself still worked. Now I cannot even confirm that downloading works at all.
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    feeder_threads=1,
    parser_threads=1,
    downloader_threads=4,
    storage={'root_dir': 'er'})
google_crawler.crawl(keyword='cat', offset=0, max_num=10,
                     min_size=(200, 200), max_size=None, file_idx_offset=0)
Reference site: Welcome to icrawler (official documentation)