Image collection Python script for creating datasets for machine learning


To get highly accurate results from machine learning on a given problem, both the way the dataset is built and the design of the learner matter. Of the two, the amount of data prepared for the dataset is especially important for successful training. The first step in collecting training data is to look for an existing database, one that predecessors have organized from the data overflowing on the Web or from data they created themselves. Such a convenient database does not always exist, however, so you may have to collect the data yourself. As a simple study example, I wrote a Python script that collects images of people, then crops and saves only the faces.

Execution environment


The script takes a query, searches the Web for related images and collects them, detects faces with OpenCV, and then crops and saves each face. The search engine used is Bing's image search.


The source code of this script is located at the link below. Each function in the script is described in the sections that follow.

Import

The modules imported by this script are as follows.

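import sys
import os
try:
    import commands as cmd      # Python 2
except ImportError:
    import subprocess as cmd    # Python 3 provides getstatusoutput here
import cv2
import time
import copy
from argparse import ArgumentParser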

getHtml

A function that sends a query and gets the HTML of the search results page. Since cmd.getstatusoutput returns a (status, output) tuple, only the output is returned, that is, the HTML that wget -O - writes to standard output.

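def getHtml(query):
    # The space after "-O -" matters: without it the query is glued onto the option.
    # As written, query is handed to wget unchanged; no search URL is prepended here.
    return cmd.getstatusoutput("wget -O - " + query)[1]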
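Concatenating the query directly into a shell string becomes fragile once the query contains spaces or shell metacharacters. As a minimal sketch (Python 3, not the original script's code), the command can be built as an argument list and run without a shell. The names `build_wget_command` and `fetch_html` are hypothetical, and `url` is assumed to be a complete search URL.

```python
import subprocess


def build_wget_command(url):
    """Return the wget argument list that writes the page to stdout."""
    # -q silences progress output; "-O -" sends the document to stdout.
    return ["wget", "-q", "-O", "-", url]


def fetch_html(url, timeout=30):
    """Run wget without a shell and return the page body as text."""
    result = subprocess.run(
        build_wget_command(url),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Because the URL is passed as a single list element, no quoting is needed even when it contains spaces or ampersands.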

extractImageURL

A function that receives HTML and a list of image formats, and extracts from the HTML only the links that end in one of the specified formats.

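def extractImageURL(html, suffix):
    url = []
    text = html.split('\n')
    for sen in text:
        if sen.find('<div class="item">') >= 0:
            element = sen.split('<div class="item">')
            for num in range(len(element)):
                for suf in suffix:
                    hpos = element[num].find("href")
                    spos = element[num].find(suf)
                    # Skip elements that lack either the link or the suffix
                    if hpos < 0 or spos < 0:
                        continue
                    # Start of the href value: skip past 'href="'
                    snum = hpos + 6
                    # End of the link: just past the image suffix
                    lnum = spos + len(suf)
                    if lnum > snum:
                        # The slice between the two offsets is the image URL
                        url.append(element[num][snum:lnum])
    return url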
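Offset arithmetic of this kind depends heavily on the exact markup of the page. As a hedged alternative sketch (Python 3, standard library only, not the original script's method), the same idea of keeping only links with a given suffix can be written with `html.parser`. The class name `ImageLinkParser`, the function `extract_image_urls`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collect href values that end with one of the given suffixes."""

    def __init__(self, suffixes):
        super().__init__()
        # str.endswith accepts a tuple of candidate endings
        self.suffixes = tuple(s.lower() for s in suffixes)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(self.suffixes):
                self.urls.append(value)


def extract_image_urls(html, suffixes):
    parser = ImageLinkParser(suffixes)
    parser.feed(html)
    return parser.urls


# Hypothetical markup, only to show the call
sample = (
    '<div class="item"><a href="http://example.com/a.jpg">a</a></div>'
    '<div class="item"><a href="http://example.com/b.html">b</a></div>'
)
print(extract_image_urls(sample, ["jpg", "png"]))  # ['http://example.com/a.jpg']
```

The parser only looks at `href` attributes of `a` tags, so text that merely happens to contain "jpg" elsewhere in the element is not picked up.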

saveImg

A function that temporarily saves the desired images locally from the links extracted by extractImageURL. It creates a directory called Original inside the output directory (opbase) and saves the images there.

def saveImg(opbase, url):
    dir = opbase + '/Original/'
    if not (os.path.exists(dir)):
        # Create the download directory (and opbase itself, if needed)
        os.makedirs(dir)
    for u in url:
        os.system('wget -P ' + dir + ' ' + u)
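The download step can also be done without spawning wget for every image. The following is a minimal sketch under that assumption (Python 3, standard library only); `local_path_for` and `save_images` are hypothetical names, and a failed download is simply skipped so that one bad link does not stop the run.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_path_for(url, out_dir):
    """Map an image URL to a file path inside out_dir."""
    name = os.path.basename(urlparse(url).path)
    return os.path.join(out_dir, name or "unnamed")


def save_images(opbase, urls):
    """Download each URL into <opbase>/Original, skipping failures."""
    out_dir = os.path.join(opbase, "Original")
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for u in urls:
        path = local_path_for(u, out_dir)
        try:
            urllib.request.urlretrieve(u, path)
        except (OSError, ValueError) as err:
            # URLError and HTTPError are subclasses of OSError;
            # ValueError covers malformed URLs
            print("skip", u, err)
            continue
        saved.append(path)
    return saved
```

Returning the list of saved paths makes it possible to hand exactly those files to the cropping step instead of listing the directory again.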

cropFace

A function that detects faces in the saved images, crops them, and saves them. Face detection uses the pre-trained Haar cascade classifiers (haar cascade) that ship with OpenCV. Assuming that only frontal faces are to be extracted, the script lets you choose among four cascade files. Regarding the accuracy of the models, I referred to the following link. The cropped images are saved in a Crop directory created inside opbase.

def cropFace(opbase, path, imsize, method):
    dir = opbase + '/Crop/'
    if not (os.path.exists(dir)):
        os.makedirs(dir)
    # Choose the cascade once, outside the image loop
    if method == 1:
        face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_default.xml')
    elif method == 2:
        face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_alt.xml')
    elif method == 3:
        face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_alt2.xml')
    else:
        face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_alt_tree.xml')
    for p in path:
        img = cv2.imread(opbase + '/Original/' + p)
        if img is None:
            # Not a readable image (for example a failed download); skip it
            continue
        # cv2.imread returns BGR, so convert from BGR rather than RGB
        gImg = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gImg, 1.3, 5)
        for num in range(len(faces)):
            # Each detection is (x, y, width, height)
            x, y, w, h = faces[num]
            cropImg = copy.deepcopy(img[y:y+h, x:x+w])
            resizeImg = cv2.resize(cropImg, (imsize, imsize))
            filename = dir + os.path.splitext(p)[0] + '_' + str(num + 1) + '.tif'
            cv2.imwrite(filename, resizeImg)
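The index order in the crop is easy to get wrong: detectMultiScale reports each face as (x, y, width, height), while the image array is indexed row first, that is, by y and then x. The sketch below illustrates only that arithmetic, using nested lists as a stand-in for the NumPy array that OpenCV returns; `crop_box` is a hypothetical helper, and on a real image the equivalent expression is the `img[y:y+h, x:x+w]` slice.

```python
def crop_box(img, box):
    """Crop a row-major image (list of rows) to box = (x, y, w, h)."""
    x, y, w, h = box
    # Rows are selected by y (vertical), columns by x (horizontal)
    return [row[x:x + w] for row in img[y:y + h]]


# 4x5 toy "image" whose pixel value encodes (row, column) as 10*row + col
img = [[10 * r + c for c in range(5)] for r in range(4)]
print(crop_box(img, (1, 2, 3, 2)))  # [[21, 22, 23], [31, 32, 33]]
```

The box starts at column 1 and row 2, so the first pixel of the crop is 21 (row 2, column 1), which confirms that y selects the row and x the column.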

main

The main part of the script. Run it with the following options, which are defined through the parser.

In addition, the output directory of this script is created with the specified query name.

if __name__ == "__main__":
    ap = ArgumentParser(description='')
    # With nargs='*' the defaults must be lists; a bare string would be iterated character by character
    ap.add_argument('--query', '-q', nargs='*', default=['hoge'], help='Specify Query of Image Collection')
    ap.add_argument('--suffix', '-s', nargs='*', default=['jpg'], help='Specify Image Suffix')
    ap.add_argument('--imsize', '-i', type=int, default=100, help='Specify Image Size of Crop Face Image')
    ap.add_argument('--method', '-m', type=int, default=1, help='Specify Method Flag (1 : Haarcascades Frontalface Default, 2 : Haarcascades Frontalface Alt1, 3 : Haarcascades Frontalface Alt2, Otherwise : Haarcascades Frontalface Alt Tree)')
    args = ap.parse_args()

    t = time.ctime().split(' ')
    if t.count('') == 1:
        # ctime() pads single-digit days with a space, which leaves an empty field
        t.remove('')
    # Path Separator
    psep = '/'
    for q in args.query:
        opbase = q
        # Delete trailing path separator
        if opbase.endswith(psep):
            opbase = opbase[:-1]
        # Add current directory (except for absolute paths)
        if not opbase.startswith(psep):
            if (opbase.find('./') == -1):
                opbase = './' + opbase
        # Create Opbase
        opbase = opbase + '_' + t[1] + t[2] + t[0] + '_' + t[4] + '_' + t[3].split(':')[0] + t[3].split(':')[1] + t[3].split(':')[2]
        if not (os.path.exists(opbase)):
            print('Output Directory not exist! Create...')
            os.makedirs(opbase)
        print('Output Directory: ' + opbase)

        html = getHtml(q)
        url = extractImageURL(html, args.suffix)
        saveImg(opbase, url)
        cropFace(opbase, os.listdir(opbase + '/Original'), args.imsize, args.method)
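Two details of the main block are worth isolating: list-valued defaults for options declared with nargs='*', and the timestamped output directory name. The following is a minimal sketch of both (Python 3); `parse_args` and `output_dir_name` are hypothetical helpers, and `time.strftime` is used in place of splitting `time.ctime()`, so the day of the month is zero-padded rather than space-padded.

```python
import time
from argparse import ArgumentParser


def parse_args(argv):
    ap = ArgumentParser(description="Collect face images per query")
    # With nargs='*', the default should be a list so that iterating
    # over it yields whole words rather than single characters
    ap.add_argument("--query", "-q", nargs="*", default=["hoge"])
    ap.add_argument("--suffix", "-s", nargs="*", default=["jpg"])
    return ap.parse_args(argv)


def output_dir_name(query, when=None):
    """Return '<query>_<MonDDDay>_<YYYY>_<HHMMSS>', prefixed with './' if relative."""
    when = time.localtime() if when is None else when
    base = query.rstrip("/")
    if not base.startswith(("/", "./")):
        base = "./" + base
    return base + "_" + time.strftime("%b%d%a_%Y_%H%M%S", when)


args = parse_args(["-q", "Gacky", "Becky"])
for q in args.query:
    print(output_dir_name(q))
```

Passing an explicit `when` value keeps the naming reproducible, which is convenient when the same timestamp should be shared by several queries in one run.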


To see how much noise gets mixed in, I will show the results obtained for the queries "Gacky" and "Becky", chosen purely out of personal interest.


As one would expect from the world-famous Gacky, the results were mostly Gacky, although some monster-like noise was mixed in. For Becky, on the other hand, the script also pulled in people other than the intended Becky who presumably go by the same name. It is unavoidable by design that more generic queries contain more noise, but there is clearly room for improvement. The most important point to improve is the number of images collected: the number of images per query is overwhelmingly small, so issues remain overall.


Image collection is commonly done with the Google Custom Search API, the Bing Search API, and similar services, which give far better accuracy and far more images. The challenge this time was to see how far I could get without those APIs, but I would like to try the API-based approach as well. The HTML analysis was also done by trial and error, so the extraction method itself probably has problems; a convenient and widely used module such as Beautiful Soup may be a better choice. Incidentally, Gacky is making a great leap forward with the love dance (although she was an angel to begin with, so she was already flying), while Becky is restarting from the edge of a cliff, and I suspect this is reflected in the results. I hope she does her best as MC of Ainori.
