A Python script that automatically collects typical images using Bing image search

Purpose

This script automatically collects typical images for a query. In recent years, deep learning has become popular in the image field (it originally took off in the audio field ...); it now appears at many academic conferences and has become established in shared tasks. However, training requires a huge amount of data, and everything from collection to annotation is costly.

So let's collect the tagged image data that machine learning methods such as deep learning need! I wrote this script with that purpose in mind.

Collection of typical images

This time, we will try to automate image collection using Bing's image search. The code below does something like crawling and scraping, but as a learning exercise I implemented it without convenient modules (BeautifulSoup, urllib, etc.).

Although it is described as collecting typical images, in practice it only fetches the top N search results.

collect_img.py


#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
import subprocess
import sys


# Fetch the HTML of the image search results for the query
def get_HTML(query):

    # Pass the arguments as a list so the query is never interpreted by a shell
    proc = subprocess.run(
        ["wget", "-q", "-O", "-", "https://www.bing.com/images/search?q=" + query],
        stdout=subprocess.PIPE,
    )

    return proc.stdout.decode("utf-8", errors="replace")

# Extract the jpg image URLs
def extract_URL(html):

    url = []
    ptn = re.compile(r'<a href="(.+\.jpg)" class="thumb"')

    for sent in html.split('\n'):
        if '<div class="item">' in sent:
            for element in sent.split('<div class="item">'):
                # match() only succeeds at the start of each fragment
                mtch = ptn.match(element)
                if mtch:
                    url.append(mtch.group(1))

    return url

# Save the images locally
def get_IMG(dir, url):

    for u in url:
        # wget reports failure through its exit status, not an exception
        status = subprocess.call(["wget", "-P", dir, u])
        if status != 0:
            sys.stderr.write("failed to download: " + u + "\n")


if __name__ == "__main__":

    argvs = sys.argv # argvs[1]: image search query, argvs[2]: destination directory (only when saving)
    query = argvs[1] # e.g. leopard

    html = get_HTML(query)

    url = extract_URL(html)

    for u in url:
        print(u)

    # Enable when you want to save the images locally
    #get_IMG(argvs[2], url)
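
To see what the extraction step actually matches, here is a minimal sketch that applies the same regular expression and the same `<div class="item">` split as extract_URL to a hand-written line. The sample markup and the example.com URLs are invented for illustration and only imitate the structure the script expects; the helper name `extract_from_line` is also hypothetical.

```python
import re

# Same pattern and splitting marker as extract_URL in collect_img.py.
PATTERN = re.compile(r'<a href="(.+\.jpg)" class="thumb"')
ITEM_MARKER = '<div class="item">'

# Hypothetical one-line fragment imitating the markup the script expects;
# the example.com URLs are invented for illustration.
SAMPLE_LINE = (
    '<div class="item"><a href="http://example.com/a.jpg" class="thumb">x</a></div>'
    '<div class="item"><a href="http://example.com/b.png" class="thumb">y</a></div>'
    '<div class="item"><a href="http://example.com/c.jpg" class="thumb">z</a></div>'
)


def extract_from_line(line):
    """Return the .jpg URLs found at the start of each item fragment."""
    urls = []
    for fragment in line.split(ITEM_MARKER):
        # match() anchors at the start of the fragment, so the <a> tag
        # must come immediately after the marker.
        match = PATTERN.match(fragment)
        if match:
            urls.append(match.group(1))
    return urls


print(extract_from_line(SAMPLE_LINE))
# ['http://example.com/a.jpg', 'http://example.com/c.jpg']
```

The .png entry is skipped because the pattern only accepts URLs ending in .jpg, and a fragment whose anchor tag does not come first would be skipped as well.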

Run

Execution method

Execute as follows from the command line. The dir argument can be omitted when get_IMG is not used (the images are not saved).

collect_img.py


$ python collect_img.py query dir

- query: Search word for the image you want (e.g. leopard)
- dir: Image save destination directory (./img/*)
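
The script itself reads sys.argv directly. As a possible variant, the "dir is optional" behaviour could be expressed with argparse, as sketched below. The function name `parse_args` and the help texts are assumptions for illustration, not part of the original script.

```python
import argparse


def parse_args(argv):
    """Parse the query and an optional save directory."""
    parser = argparse.ArgumentParser(
        description="Collect image URLs for a query and optionally save them."
    )
    parser.add_argument("query", help="search word, e.g. leopard")
    # nargs="?" makes the positional argument optional; it defaults to None.
    parser.add_argument("dir", nargs="?", default=None,
                        help="directory to save images into (omit to only list URLs)")
    return parser.parse_args(argv)


args = parse_args(["leopard"])
print(args.query, args.dir)   # leopard None

args = parse_args(["leopard", "./img"])
print(args.query, args.dir)   # leopard ./img
```

With this shape, the main block could call get_IMG only when `args.dir` is not None, instead of commenting the call in and out.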

Execution result

Here are some of the results collected with the query "leopard". First, the list of acquired image URLs is as follows (only part of it is shown).

http://images.china.cn/attachement/jpg/site1007/20120720/00016c8b5de01172f9e82e.jpg
http://farm2.static.flickr.com/1254/1174179702_fe9c9a5d2c_b.jpg
http://www.katzen-und-kater.de/Grosskatzen/Leopard/Leopard5.jpg
...

Here are some of the acquired images.

[Image: leopard] [Image: leopard] [Image: leopard]

From the above, we can see that the images were acquired properly. However, the script does not remove noise by calculating the similarity between images; it simply fetches the top N results. (Another issue is that it cannot yet collect an arbitrarily large number of images.)

Summary

This time, I wrote a script that collects typical images from Bing image search, with the goal of automatically collecting annotated image data for machine learning. For the annotations, I think the query can be used as is. The following two issues remain for the future.

- Collecting an arbitrary number (or an unlimited number) of images
- Deleting noisy images based on criteria such as the similarity between images

Since this script relies on a property of image search engines, namely that the top results are often typical images, I think the second issue above deserves serious thought. I'd like to implement it next time.
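
As a rough idea of what similarity-based noise removal could look like, here is a minimal sketch. It uses an average hash and Hamming distance, which is just one possible criterion and not something this article has settled on. To stay self-contained it works on tiny hand-made grayscale grids; real images would first have to be decoded and downscaled with an imaging library, which is omitted here. All function names and the threshold are hypothetical.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def drop_outliers(images, max_mean_distance):
    """Keep the names of images whose mean hash distance to the others is small."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    kept = []
    for name, h in hashes.items():
        others = [hamming(h, other) for n, other in hashes.items() if n != name]
        if not others or sum(others) / len(others) <= max_mean_distance:
            kept.append(name)
    return kept


# Tiny hand-made 2x2 grayscale "images" (hypothetical data, not real photos).
images = {
    "a": [[200, 190], [10, 20]],
    "b": [[210, 180], [15, 25]],
    "c": [[10, 20], [200, 190]],   # inverted layout: the outlier
    "d": [[205, 185], [12, 22]],
}
print(drop_outliers(images, max_mean_distance=2))
# ['a', 'b', 'd']
```

The idea is that if most top results are typical, an image that is far from all the others on average is a likely candidate for noise. The threshold would need tuning on real data.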
