[Python] Collect images easily with icrawler!

I used icrawler to collect images for machine learning, so here is a short introduction to it.

What is icrawler

icrawler is a Python framework for collecting images by web crawling. You can collect images with only a few lines of code.

Installation

pip

$ pip install icrawler

anaconda

$ conda install -c hellock icrawler
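
After installing, it can be handy to know where the package ended up on disk, for example when you need to open one of its built-in modules by hand. The following is a minimal sketch using only the standard library; `locate_module` is a hypothetical helper name, not part of icrawler.

```python
import importlib.util


def locate_module(name):
    """Return the source file path of an importable module, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package (e.g. "icrawler") is not installed.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Prints the path of icrawler's Google crawler module,
    # or None if icrawler is not installed in this environment.
    print(locate_module("icrawler.builtin.google"))
```

Run it with the same Python interpreter (or conda environment) that you installed icrawler into, otherwise it will report a different location or None.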

How to use

from icrawler.builtin import BingImageCrawler

crawler = BingImageCrawler(storage={"root_dir": './images'})
crawler.crawl(keyword='Cat', max_num=100)

  • root_dir: the directory where the images are saved.
  • keyword: the search keyword for the images you want to collect.
  • max_num: the number of images to collect.
  • You can replace BingImageCrawler with another built-in ImageCrawler; Google and Flickr are also available.
  • List of available crawlers: https://icrawler.readthedocs.io/en/latest/builtin.html
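
For machine learning you usually want one folder per class. The sketch below shows one way to do that with the same BingImageCrawler call as above, creating a new crawler for each keyword. The function names `keyword_dir` and `crawl_keywords` and the folder layout are my own assumptions for illustration, not part of icrawler.

```python
import os


def keyword_dir(root_dir, keyword):
    """Build a per-keyword folder path, e.g. ./images/black_cat (assumed layout)."""
    safe = keyword.strip().lower().replace(" ", "_")
    return os.path.join(root_dir, safe)


def crawl_keywords(keywords, root_dir="./images", max_num=100):
    """Download up to max_num images per keyword into separate folders."""
    # Imported here so keyword_dir can be used even when icrawler is not installed.
    from icrawler.builtin import BingImageCrawler

    for keyword in keywords:
        # One crawler per keyword, each with its own root_dir.
        crawler = BingImageCrawler(
            storage={"root_dir": keyword_dir(root_dir, keyword)})
        crawler.crawl(keyword=keyword, max_num=max_num)


if __name__ == "__main__":
    # Needs network access. The number of images actually saved
    # may be lower than max_num.
    crawl_keywords(["cat", "dog"], max_num=50)
```

To try another search engine, swap the imported class (for example GoogleImageCrawler) and leave the rest unchanged.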

What to do if you get a json.decoder.JSONDecodeError when using Google

  1. Find google.py.
    • Example (using Anaconda): C:\Users\hoge\anaconda3\envs\env1\Lib\site-packages\icrawler\builtin\google.py
    • If you installed with pip, look up where the package is installed and navigate to google.py from there.
    • https://qiita.com/t-fuku/items/83c721ed7107ffe5d8ff
  2. Replace the parse method in google.py with the following. The parse method is around line 144 (the exact line may differ depending on the installed version).
    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        # Commented-out lines are the original code, kept for comparison.
        #image_divs = soup.find_all('script')
        image_divs = soup.find_all(name='script')
        for div in image_divs:
            #txt = div.text
            txt = str(div)
            # Skip script tags that do not carry the image data.
            #if not txt.startswith('AF_initDataCallback'):
            if 'AF_initDataCallback' not in txt:
                continue
            if 'ds:0' in txt or 'ds:1' not in txt:
                continue
            #txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$",
            #             "\\2", txt, 0, re.DOTALL)
            #meta = json.loads(txt)
            #data = meta[31][0][12][2]
            #uris = [img[1][3][0] for img in data if img[0] == 1]

            # Instead of parsing the JSON, pull image URLs out with a regex.
            uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
            return [{'file_url': uri} for uri in uris]
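
To see what the replacement regex does on its own, here is a small standalone sketch that applies the same pattern to a made-up script string. The sample text and the name `extract_image_urls` are invented for illustration; real Google responses are much longer and may be structured differently.

```python
import re

# Same pattern as in the patched parse method: a non-greedy match from
# "http" up to the first ".jpg", ".png" or ".bmp".
IMAGE_URL_PATTERN = re.compile(r'http.*?\.(?:jpg|png|bmp)')


def extract_image_urls(script_text):
    """Return every substring that looks like an image URL."""
    return IMAGE_URL_PATTERN.findall(script_text)


# Made-up script content, only for demonstration.
sample = (
    "AF_initDataCallback({key: 'ds:1', data:["
    "[\"https://example.com/a/cat1.jpg\",600,400],"
    "[\"https://example.com/b/cat2.png\",800,600]"
    "]});"
)

print(extract_image_urls(sample))
# ['https://example.com/a/cat1.jpg', 'https://example.com/b/cat2.png']
```

Because the match stops at the extension, anything after it (such as a query string) is cut off, and URLs with other extensions such as .jpeg or .gif are not picked up cleanly by this pattern.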

References

https://github.com/hellock/icrawler
https://github.com/hellock/icrawler/issues/65
