[PYTHON] Collect machine learning training image data on your own (Google Custom Search API, Pikachu edition)

Introduction

The first step in creating a machine learning model that recognizes objects in images is to collect a large number of training images. Images of general items such as dogs and cars can be downloaded from services such as ImageNet, but images of characters such as Pikachu or Anpanman are not available there. So I came up with the idea of collecting images using Google search. In this article, I will introduce how to collect image data for machine learning using the Google Custom Search API.

A companion article covers the same task with the Tumblr API.

Creating a custom search engine

First, create a custom search engine with CSE.

First, click Add under Edit Search Engine.

Next, fill in the form. One caveat: enter some placeholder value such as "*.com" for "Search site". If you think "No, I want to search all sites!" and enter just "*", you will never get past this step. (I was stuck here for quite a while.) The setting will be changed later so that everything is searched. Fill in the remaining settings as appropriate and press the Create button.

Pressing the Create button completes the creation. Then select the control panel.

Do three things in this control panel. First, turn on image search.

Next, delete the "*.com" entry you added earlier from the list of sites to search.

Finally, change "Search only added sites" to "Search the entire web with an emphasis on added sites".

Well done. The custom search engine is now set up. Make a note of the ID that appears when you press "Search Engine ID"; the script needs it later.

Enable the Custom Search API and get an API key

Then enable the Custom Search API. This is very easy. Go to https://console.developers.google.com (create a project if you don't have one), select Library in the left menu, and select Custom Search API. On the page that opens, press "Enable" to enable the API.

Now get an API key from Credentials in the left menu. Under Create Credentials, select API key.

A key is created as soon as you select it, so make a note of it.

That took a while, but now everything is ready!
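Before collecting images in bulk, it may help to confirm that the search engine ID and API key work with a single request. The following is a minimal sketch using only the standard library. The function names and the environment variable names CSE_ID and CSE_API_KEY are hypothetical, chosen for illustration; the endpoint and query parameters are the same ones used in the collection script in this article.

```python
import json
import os
import urllib.parse
import urllib.request

API_PATH = "https://www.googleapis.com/customsearch/v1"


def build_search_url(cx, key, query, start=1, num=10):
    """Return the request URL for one page of image search results."""
    params = {
        "cx": cx,                # Search engine ID
        "key": key,              # API key
        "q": query,              # Search word
        "searchType": "image",   # Restrict results to images
        "start": start,          # 1-based index of the first result
        "num": num,              # Results per request
    }
    return API_PATH + "?" + urllib.parse.urlencode(params)


def fetch_first_page(cx, key, query):
    """Call the API once and return the list of result items (may be empty)."""
    with urllib.request.urlopen(build_search_url(cx, key, query)) as resp:
        body = json.load(resp)
    return body.get("items", [])


if __name__ == "__main__":
    # CSE_ID and CSE_API_KEY are hypothetical environment variable names.
    cx = os.environ.get("CSE_ID")
    key = os.environ.get("CSE_API_KEY")
    if cx and key:
        for item in fetch_first_page(cx, key, "Pikachu"):
            print(item["link"])
```

If the ID and key are valid, this should print up to ten image URLs. An HTTP error here usually points to a problem with the key, the ID, or the API not being enabled.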

Collect images

Collect images using the custom search engine created above together with the Custom Search API and the API key.

The script is very simple, as shown below. It saves the images as numbered files (0.png, 1.png, and so on) in a directory called images under the directory where it is run. For the search engine ID and API key, enter the values you noted above. (Install the requests library with pip if needed.)

correct_image.py


import os
import shutil

import requests

API_PATH = "https://www.googleapis.com/customsearch/v1"
PARAMS = {
  "cx" : "999999999999999999:abcdefghi", # Search engine ID
  "key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx", # API key
  "q"  : "Pikachu", # Search word
  "searchType": "image", # Search type
  "start" : 1, # Starting index
  "num" : 10   # Number of results per request (10 by default, which is also the maximum)
}
LOOP = 100  # Number of search requests (pages) to make
image_idx = 0

# Create the output directory if it does not exist yet
os.makedirs("images", exist_ok=True)

for x in range(LOOP):
  # Advance the starting index one page at a time: 1, 11, 21, ...
  PARAMS.update({'start': PARAMS["num"] * x + 1})
  items_json = requests.get(API_PATH, PARAMS).json()["items"]
  for item_json in items_json:
    path = "images/" + str(image_idx) + ".png"
    # Stream the image so it is written to disk without loading it all into memory
    r = requests.get(item_json['link'], stream=True)
    if r.status_code == 200:
      with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
      image_idx+=1

When I actually ran this, I got images such as 0.png, 1.png, and 2.png.
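Note that the script names every file .png regardless of the actual image format. One possible refinement, sketched below, is to choose the extension from the Content-Type header of the download response and to skip responses that are not images. This assumes the server sends an accurate Content-Type; the mapping table and the function names are hypothetical and not part of the original script.

```python
# Hypothetical mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_for(content_type):
    """Map a Content-Type header value to an extension, or None if not a known image type."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=binary" and normalize case
    mime = content_type.split(";")[0].strip().lower()
    return EXTENSIONS.get(mime)


def image_path(index, content_type, directory="images"):
    """Build the output path for an image, or return None to signal 'skip this one'."""
    ext = extension_for(content_type)
    if ext is None:
        return None
    return directory + "/" + str(index) + ext
```

In the download loop, the path could then be built with image_path(image_idx, r.headers.get("Content-Type")), skipping the item when the result is None.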

Finally

Later, when I tried to collect a large number of images with this method, I noticed that the script failed with the following error:

Traceback (most recent call last):
  File "get_image.py", line 31, in <module>
    items_json = requests.get(API_PATH, PARAMS).json()["items"]
KeyError: 'items'

It turned out that no more than 100 images can be acquired for a search word. Apparently, the Google Custom Search API does not return results from page 11 onward, so the response has no "items" key. (There was a page that mentioned this, but I lost the link.)
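Given this limit, the loop can be kept from requesting pages that will fail. The sketch below computes the starting index of each page and caps the total at 100 results, and it reads "items" defensively so that a response without that key yields an empty list instead of a KeyError. The cap of 100 is taken from the behavior observed in this article and should be treated as an assumption; the function names are hypothetical.

```python
MAX_RESULTS = 100  # Limit observed in this article; treat as an assumption
PAGE_SIZE = 10     # Results per request


def start_indices(wanted, page_size=PAGE_SIZE, max_results=MAX_RESULTS):
    """Return the 1-based start index of each page needed to fetch `wanted` results."""
    limit = min(wanted, max_results)
    return list(range(1, limit + 1, page_size))


def extract_items(response_json):
    """Return the result items, or an empty list when the response has no "items" key."""
    return response_json.get("items", [])


# Asking for 1000 results still yields only the ten pages the API will serve.
print(start_indices(1000))
```

In the collection script, the outer loop could iterate over start_indices(...) and stop as soon as extract_items(...) returns an empty list.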
