[PYTHON] Made icrawler easier to use for machine learning data collection

Introduction

The Python library icrawler is useful for collecting image data for machine learning. As the official example shows, it can be installed and used with very little code.

Installation

pip install icrawler 
   or 
conda install -c hellock icrawler

Execution example (search Google for "cat" and download 100 images)

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)

As you can see, images can be downloaded in just three lines.

Improvements for easier use

I added the following two improvements to make the library easier to use:

① Specify multiple search terms (from an external file)
② Avoid duplicate images

For (1), the search terms are listed line by line in an external text file, which the script reads at startup. For (2), the image's URL is encoded into the file name, so when the same image is downloaded again it gets the same name and the duplicate is effectively skipped when saving.
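To illustrate (2), here is a minimal sketch of the naming scheme used by Base64NameDownloader, which is defined later in img_collection.py (the URL here is an arbitrary example):

import base64
from six.moves.urllib.parse import urlparse

# Two downloads of the same URL yield the same file name,
# so the second write collides with the first and the duplicate is skipped.
url = 'https://example.com/images/cat.jpg'  # hypothetical URL for illustration
url_path = urlparse(url)[2]                 # '/images/cat.jpg'
filename = base64.b64encode(url_path.encode()).decode() + '.jpg'
print(filename)                             # 'L2ltYWdlcy9jYXQuanBn.jpg'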

filesystem.py (a patch to icrawler's bundled storage/filesystem.py; the commented-out lines show the original write logic)


# Imports as in icrawler's original storage/filesystem.py.
import os
import os.path as osp

import six

from .base import BaseStorage


class FileSystem(BaseStorage):
    """Use filesystem as storage backend.

    The id is filename and data is stored as text files or binary files.
    """

    def __init__(self, root_dir):
        self.root_dir = root_dir

    def write(self, id, data):
        filepath = osp.join(self.root_dir, id)
        folder = osp.dirname(filepath)
        if not osp.isdir(folder):
            try:
                os.makedirs(folder)
            except OSError:
                pass
        mode = 'w' if isinstance(data, six.string_types) else 'wb'
        # Original code:
        # with open(filepath, mode) as fout:
        #     fout.write(data)
        try:
            with open(filepath, mode) as fout:
                fout.write(data)
        except FileNotFoundError:
            # Skip files that cannot be written instead of aborting the crawl.
            pass
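To apply this patch, you need to edit the installed copy of icrawler's storage/filesystem.py. If you are not sure where pip put it, a quick way to check (a generic Python trick, not icrawler-specific) is:

# Print the location of the installed module to find the file to edit.
import icrawler.storage.filesystem
print(icrawler.storage.filesystem.__file__)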

Implementation

The full implementation is shown below.

img_collection.py


import base64
from icrawler import ImageDownloader
from six.moves.urllib.parse import urlparse
from icrawler.builtin import BaiduImageCrawler
from icrawler.builtin import BingImageCrawler
from icrawler.builtin import GoogleImageCrawler
import argparse, os

parser = argparse.ArgumentParser(description='img_collection')
parser.add_argument('--output_dir', default="", type=str,
                    help='path of the output directory')
parser.add_argument('--N', default=10, type=int,
                    help='upper limit on the number of images per search term')
parser.add_argument('--engine', choices=['baidu', 'bing', 'google'],
                    default="bing", type=str, help='search engine to use')
args = parser.parse_args()


class Base64NameDownloader(ImageDownloader):
    def get_filename(self, task, default_ext):
        url_path = urlparse(task['file_url'])[2]
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        # Encode the URL path into the file name (works for Python 3),
        # so the same image URL always maps to the same file name.
        filename = base64.b64encode(url_path.encode()).decode()
        return '{}.{}'.format(filename, extension)


def get_crawler(args, dir_name):
    if args.engine == "baidu":
        crawler = BaiduImageCrawler(downloader_cls=Base64NameDownloader,
                                    storage={'root_dir': dir_name})
    elif args.engine == "bing":
        crawler = BingImageCrawler(downloader_cls=Base64NameDownloader,
                                   storage={'root_dir': dir_name})
    elif args.engine == "google":  # currently not working
        crawler = GoogleImageCrawler(storage={'root_dir': dir_name})
    return crawler


if __name__ == "__main__":
    # Read the search terms, one per line.
    with open('./setting.txt', mode='r', encoding="utf-8") as f:
        read_data = list(f)

    print("SELECTED ENGINE : " + args.engine)

    for line in read_data:
        keyword = line.strip()
        if not keyword:  # skip blank lines
            continue
        print("SEARCH WORD : " + keyword)
        print("NUM IMAGES  : " + str(args.N))
        # One output subdirectory per search term, spaces replaced by underscores.
        dir_name = os.path.join(args.output_dir, keyword.replace(' ', '_'))

        # Initialize the crawler and start collecting.
        crawler = get_crawler(args, dir_name)
        crawler.crawl(keyword=keyword, max_num=args.N)

Create setting.txt in the same directory as img_collection.py. Each line holds one search term; in the example below, three search terms are specified, so there is no need to run the script separately for each term.

setting.txt


Cat cat
Cat adult
Cat child
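With this setting.txt, the script creates one subdirectory per search term under --output_dir (spaces replaced by underscores, as in the code above), so the output looks roughly like this:

out/
    Cat_cat/
    Cat_adult/
    Cat_child/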

How to use

Enter as many search terms as you like in setting.txt, then run the script with the following options:

--N: upper limit on the number of images to acquire per search term (max 1000; in practice, fewer than 1000 are retrieved because of communication errors and missing pages)
--output_dir: path of the output directory
--engine: search engine to use; choose from bing and baidu (google currently does not work)

python img_collection.py  --N 1000 --output_dir D:\hogehoge\WebCrawler\out --engine bing

Result

Even if you request 1000 images, only about 600 actually remain, mainly because of communication errors. Still, you can collect a large number of cat images in one run. The output is split into one directory per search term, but if you later gather the images into a single directory, duplicates are essentially merged, because identical images (identical URLs) have conflicting file names.
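For example, here is a minimal sketch of such a merge (the folder names and paths are assumptions based on the example run below; this is not part of the article's script):

import os
import shutil

# Hypothetical paths matching the example command line in this article.
src_root = r'D:\hogehoge\WebCrawler\out'
merged = os.path.join(src_root, 'merged')
os.makedirs(merged, exist_ok=True)

for sub in os.listdir(src_root):
    subdir = os.path.join(src_root, sub)
    if not os.path.isdir(subdir) or sub == 'merged':
        continue
    for name in os.listdir(subdir):
        dst = os.path.join(merged, name)
        # The same source URL produces the same base64 file name,
        # so an existing file means the image is a duplicate.
        if os.path.exists(dst):
            continue
        shutil.copy2(os.path.join(subdir, name), dst)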

(Image: neko.png)

Finally

Web crawling involves delicate rights issues (copyright, terms of service), so be sure to handle the collected images appropriately for your intended use.
