To obtain highly accurate results with machine learning on a given problem, both how the data set is built and how the learner is designed matter. Of these, the amount of data prepared when creating the data set is particularly important for successful learning. The first step in collecting training data is to look for an existing database that others have already organized, built from data overflowing on the Web or data they created themselves. However, such a convenient database does not always exist, so you may need to collect the data yourself. So this time, as a simple study example, I wrote a Python script that collects images of people and then crops and saves only the faces.
The script takes a query, searches the Web for related images, collects them, performs face detection with OpenCV, and then crops and saves the detected faces. Bing image search is used as the search engine.
The source code of this script is available at the link below. https://github.com/tokkuman/FaceImageCollection Each function in the script is described in the following.
Import
The modules imported in this script are as follows. Note that the commands module is Python 2 only; in Python 3 its role is covered by subprocess.
import sys
import os
import commands as cmd
import cv2
import time
import copy
from argparse import ArgumentParser
getHtml
A function that takes a query and fetches the HTML of the search results page. Since cmd.getstatusoutput returns (status, output) as a tuple, only the HTML fetched by wget -O - (index [1], the output) is returned.
def getHtml(query):
    return cmd.getstatusoutput("wget -O - https://www.bing.com/images/search?q=" + query)[1]
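Since the description above hinges on the tuple that getstatusoutput returns, here is a minimal sketch of that behavior (Python 2; the echo command is just an illustration, not part of the script):

import commands as cmd

# getstatusoutput returns (exit_status, captured_output) as a tuple.
status, output = cmd.getstatusoutput('echo hello')
print status    # 0 (the command exited successfully)
print output    # 'hello' (everything the command printed)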
extractImageURL
A function that receives the HTML and a list of image suffixes, and extracts from the HTML only the links of the specified formats.
def extractImageURL(html, suffix):
    url = []
    text = html.split('\n')
    for sen in text:
        if sen.find('<div class="item">') >= 0:
            element = sen.split('<div class="item">')
            for num in range(len(element)):
                for suf in suffix:
                    # skip past 'href="' (6 characters)
                    snum = element[num].find('href') + 6
                    lnum = element[num].find(suf)
                    # find() returns -1 when the suffix is absent, so test
                    # before adding len(suf); otherwise -1 + len(suf) can
                    # still be positive and garbage gets appended.
                    if lnum >= 0:
                        url.append(element[num][snum:lnum + len(suf)])
                        break
    return url
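To make the slicing concrete, a quick sketch on a fabricated fragment (this markup shape is an assumption for illustration; actual Bing HTML differs and changes over time):

html = '<div class="item"><a href="http://example.com/foo.jpg">x</a></div>'
print extractImageURL(html, ['jpg'])
# -> ['http://example.com/foo.jpg']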
saveImg
A function that saves the images locally from the links extracted by extractImageURL. It creates a directory called Original inside the output directory (opbase) and saves the images there.
def saveImg(opbase, url):
    dir = opbase + '/Original/'
    if not (os.path.exists(dir)):
        os.mkdir(dir)
    for u in url:
        try:
            os.system('wget -P ' + dir + ' ' + u)
        except:
            continue
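As an aside, shelling out to wget makes the script depend on an external command. A minimal sketch of the same idea using only the Python 2 standard library (an alternative, not what the script itself does; the helper name saveImgStdlib is made up here):

# Hedged alternative to the os.system('wget ...') call above, using
# urllib from the Python 2 standard library. Illustrative only.
import os
import urllib

def saveImgStdlib(opbase, url):
    dir = opbase + '/Original/'
    if not os.path.exists(dir):
        os.mkdir(dir)
    for u in url:
        try:
            # Save under the file name at the end of the URL.
            urllib.urlretrieve(u, os.path.join(dir, u.split('/')[-1]))
        except IOError:
            continue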
cropFace
A function that detects faces in the saved images, crops them, and saves the crops. Face detection uses the trained Haar cascade models bundled with OpenCV. The script assumes only frontal faces are extracted and lets you choose between four cascade models. Regarding the accuracy of each model, I referred to the following link (http://stackoverflow.com/questions/4440283/how-to-choose-the-cascade-file-for-face-detection). The cropped images are saved in a Crop directory created inside opbase.
def cropFace(opbase, path, imsize, method):
    dir = opbase + '/Crop/'
    if not (os.path.exists(dir)):
        os.mkdir(dir)
    for p in path:
        img = cv2.imread(opbase + '/Original/' + p)
        # cv2.imread returns BGR images, so convert from BGR (not RGB) to gray.
        gImg = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        if method == 1:
            face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_default.xml')
        elif method == 2:
            face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_alt.xml')
        elif method == 3:
            face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_alt2.xml')
        else:
            face_cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_alt_tree.xml')
        faces = face_cascade.detectMultiScale(gImg, 1.3, 5)
        for num in range(len(faces)):
            # Each detection is (x, y, width, height) in pixel coordinates.
            x, y, w, h = faces[num]
            cropImg = copy.deepcopy(img[y:y+h, x:x+w])
            resizeImg = cv2.resize(cropImg, (imsize, imsize))
            filename = dir + p[:-4] + '_' + str(num + 1) + '.tif'
            cv2.imwrite(filename, resizeImg)
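For reference, a minimal standalone sketch of the detection call used above; 'sample.jpg' is a placeholder file name, and the keyword names spell out the two positional arguments (1.3, 5) passed to detectMultiScale:

import cv2

# 'sample.jpg' is a placeholder; the cascade path assumes the layout above.
img = cv2.imread('sample.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier('haarcascade/haarcascade_frontalface_default.xml')
# scaleFactor=1.3 shrinks the search pyramid by 30% per step;
# minNeighbors=5 requires five overlapping hits before accepting a face.
faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('detected.jpg', img)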
main
The main routine of the script. It is executed with the options defined below via ArgumentParser. The output directory is created from the specified query name, with a timestamp appended.
if __name__ == "__main__":
    ap = ArgumentParser(description='ImageCollection.py')
    # nargs='*' yields a list on the command line, so the defaults are lists
    # as well; a bare string default would be iterated character by character.
    ap.add_argument('--query', '-q', nargs='*', default=['hoge'], help='Specify Query of Image Collection')
    ap.add_argument('--suffix', '-s', nargs='*', default=['jpg'], help='Specify Image Suffix')
    ap.add_argument('--imsize', '-i', type=int, default=100, help='Specify Image Size of Crop Face Image')
    ap.add_argument('--method', '-m', type=int, default=1, help='Specify Method Flag (1 : Haarcascades Frontalface Default, 2 : Haarcascades Frontalface Alt1, 3 : Haarcascades Frontalface Alt2, Without : Haarcascades Frontalface Alt Tree)')
    args = ap.parse_args()
    # time.ctime() pads single-digit days with an extra space, which
    # split(' ') turns into an empty element; drop it if present.
    t = time.ctime().split(' ')
    if t.count('') == 1:
        t.pop(t.index(''))
    # Path Separator
    psep = '/'
    for q in args.query:
        opbase = q
        # Delete Trailing Path Separator
        if (opbase[len(opbase) - 1] == psep):
            opbase = opbase[:len(opbase) - 1]
        # Add Current Directory (Except for Absolute Path)
        if not (opbase[0] == psep):
            if (opbase.find('./') == -1):
                opbase = './' + opbase
        # Create Opbase Name with Timestamp
        opbase = opbase + '_' + t[1] + t[2] + t[0] + '_' + t[4] + '_' + t[3].split(':')[0] + t[3].split(':')[1] + t[3].split(':')[2]
        if not (os.path.exists(opbase)):
            os.mkdir(opbase)
            print 'Output Directory not exist! Create...'
        print 'Output Directory:', opbase
        html = getHtml(q)
        url = extractImageURL(html, args.suffix)
        saveImg(opbase, url)
        cropFace(opbase, os.listdir(opbase + '/Original'), args.imsize, args.method)
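Assuming the script is saved as ImageCollection.py (the file name is an assumption), a typical invocation looks like this; each query produces its own timestamped output directory containing Original/ and Crop/:

python ImageCollection.py -q gacky becky -s jpg png -i 100 -m 1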
To see how much noise gets mixed in, I show the results obtained by throwing the queries "Gacky" and "Becky", purely as a personal hobby.
As expected of Gacky, although some monster noise was mixed in, the results were generally Gacky. Becky, on the other hand, also pulled in people who merely resemble Becky. Given how the script works, it is unavoidable that more general queries contain more noise, but there is clearly room for improvement. The most important issue is the number of collected images: the count per query is overwhelmingly small, so challenges remain overall.
For image collection it is common to use the Google Custom Search API, the Bing Search API, and so on, which give overwhelmingly better accuracy and far larger collections. This time the challenge was to see how far I could get without those APIs, but I would like to try the API-based approach as well. The extraction step probably also suffered because the HTML parsing method was improvised; the commonly used Beautiful Soup module would be a convenient alternative (a sketch follows below). Also, Gacky is making a great leap forward with the love dance (although she was originally an angel, so she was already flying), while Becky is restarting from the bottom of a cliff, which is probably reflected in these results. Good luck with the MC job on Ainori.
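For the Beautiful Soup route mentioned above, a hedged sketch (assuming the bs4 and requests packages are installed; targeting img tags and src attributes is an assumption about the page structure, not verified against current Bing markup):

import requests
from bs4 import BeautifulSoup

# Fetch the results page and let Beautiful Soup do the parsing instead of
# splitting raw HTML by hand. Illustrative only.
html = requests.get('https://www.bing.com/images/search', params={'q': 'gacky'}).text
soup = BeautifulSoup(html, 'html.parser')
urls = [img['src'] for img in soup.find_all('img') if img.get('src', '').endswith('.jpg')]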