[PYTHON] Creating training data

Introduction

This time, we create training data from the image data collected last time. Passing the image data to TensorFlow as-is makes the computation slow, so we convert it to NumPy array format to shorten the computation time.

Source code

import

from PIL import Image
import os, glob
import numpy as np
from sklearn import model_selection

Preparation for conversion process

classes = ["monkey", "boar", "crow"]
num_classes = len(classes)
image_size = 50

X = []
Y = []

This time we classify monkeys, boars, and crows, so we store those keywords in classes. All images are resized to 50x50. X holds the image data, and Y holds the labels that indicate whether each image is a monkey (0), boar (1), or crow (2).
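As a small sketch of how the labels come about: the loop that follows uses enumerate(classes), so each class name is paired with its position in the list. The dictionary name label_of is only used here for illustration and does not appear in the original code.

```python
classes = ["monkey", "boar", "crow"]

# enumerate() pairs each class name with its position in the list;
# that position is the integer label that ends up in Y.
label_of = {name: index for index, name in enumerate(classes)}
print(label_of)  # {'monkey': 0, 'boar': 1, 'crow': 2}
```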

for index, classlabel in enumerate(classes):
    photos_dir = "./" + classlabel
    files = glob.glob(photos_dir + "/*.jpg")
    for i, file in enumerate(files):
        if i >= 141: break  # cap every class at 141 images, the size of the smallest class
        image = Image.open(file)
        image = image.convert("RGB")
        image = image.resize((image_size, image_size))
        data = np.asarray(image)
        X.append(data)
        Y.append(index)
X = np.array(X)
Y = np.array(Y)

glob.glob() returns a list of the file paths that match a wildcard pattern, so files holds data like the following.

['./monkey\\49757184328.jpg', 
 './monkey\\49767449258.jpg', 
 ...
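To see the pattern matching in isolation, here is a minimal sketch that runs against a temporary directory. The file names a.jpg, b.jpg, and notes.txt are made up for the example. glob.glob() returns paths in no guaranteed order, so the sketch sorts them before printing.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for the ./monkey directory.
    photos_dir = os.path.join(root, "monkey")
    os.makedirs(photos_dir)
    for name in ["a.jpg", "b.jpg", "notes.txt"]:
        open(os.path.join(photos_dir, name), "w").close()

    # Only paths ending in .jpg match; notes.txt is skipped.
    files = sorted(glob.glob(os.path.join(photos_dir, "*.jpg")))
    matched_names = [os.path.basename(f) for f in files]

print(matched_names)  # ['a.jpg', 'b.jpg']
```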

For each image, we open the file, convert it to RGB (three channels with 256 levels each), and resize it to 50x50. Then we convert it to a NumPy array (which seems to be faster than a Python list).
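The per-image steps can be tried without any photos on disk. The sketch below assumes a blank 120x80 grayscale image created in memory as a stand-in for a real file; everything else follows the same convert, resize, and asarray sequence.

```python
from PIL import Image
import numpy as np

image_size = 50

# Hypothetical stand-in for a photo on disk: a 120x80 grayscale image.
image = Image.new("L", (120, 80), color=128)

image = image.convert("RGB")                    # three channels, 8 bits each
image = image.resize((image_size, image_size))  # aspect ratio is not preserved
data = np.asarray(image)

print(data.shape, data.dtype)  # (50, 50, 3) uint8
```

Each image therefore becomes a 50x50x3 array of unsigned 8-bit integers, whatever its original size or color mode.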

The X and Y created in this way contain the following data.

X


Array of shape (423, 50, 50, 3)
[[[[ 89  92  60]
   [ 85  84  52]
   [ 91  84  51]
   ...
   [177 178  24]
   [142 145  15]
   [231 219  35]]
   ...

Y


Array of length 423
[0 0 ... 1 1 ... 2 2 ...]
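A quick way to check that the labels are balanced is to count them per class. The sketch below rebuilds a label array with the same layout (141 labels for each of the three classes, 423 in total) instead of loading real images.

```python
import numpy as np

# Stand-in for Y: 141 labels each of 0 (monkey), 1 (boar), 2 (crow).
Y = np.repeat(np.arange(3), 141)

counts = np.bincount(Y)  # number of samples per label
print(Y.shape, counts)   # (423,) [141 141 141]
```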

Digression

Two functions are used here to convert to a NumPy array: data = np.asarray(image) and X = np.array(X). Both behave the same when converting a list, but they differ when the input is already a NumPy array: np.array makes a copy by default, while np.asarray returns the input array itself without copying. Reference: https://punhundon-lifeshift.com/array_asarray
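The difference shows up as soon as the original array is modified. A minimal sketch, with the variable names copied and same chosen only for this example:

```python
import numpy as np

a = np.zeros((2, 2), dtype=np.uint8)

copied = np.array(a)   # np.array copies the data by default
same = np.asarray(a)   # np.asarray hands back the very same ndarray

a[0, 0] = 255          # change the original afterwards

print(copied[0, 0], same[0, 0])  # 0 255
print(copied is a, same is a)    # False True
```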

Saving training data

Use the train_test_split function to split X and Y into training data and model validation data, and save them under the file name "animal.npy".

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, Y)
xy = (X_train, X_test, y_train, y_test)
np.save("./animal.npy", xy)

X_train and y_train each contain 317 samples, and X_test and y_test each contain 106. That is, about 75% of X and Y goes to the training set and about 25% to the test set, which is the default ratio of train_test_split.
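One caveat about the np.save call: the four arrays have different shapes, so NumPy has to pack the tuple into a single object array, and newer NumPy versions may refuse to do that. As a possible alternative (not what the original code does), np.savez stores each array under its own name. The sketch below uses zero-filled arrays with the shapes given above in place of the real split, and the file name animal.npz is an assumption.

```python
import os
import tempfile
import numpy as np

# Dummy arrays with the same shapes as the split described above.
X_train = np.zeros((317, 50, 50, 3), dtype=np.uint8)
X_test = np.zeros((106, 50, 50, 3), dtype=np.uint8)
y_train = np.zeros(317, dtype=np.int64)
y_test = np.zeros(106, dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "animal.npz")  # hypothetical file name
    # Each keyword argument becomes a named array inside the archive.
    np.savez(path, X_train=X_train, X_test=X_test,
             y_train=y_train, y_test=y_test)

    # Read every array back into memory before the file is removed.
    with np.load(path) as loaded:
        restored = {key: loaded[key] for key in loaded.files}

shapes = {key: value.shape for key, value in restored.items()}
print(shapes)
```

Loading by name also avoids having to remember the order in which the four arrays were packed.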
