This article is a record of an independent summer-vacation study by an engineer around forty. I took on the challenge of classifying guitar images with a CNN. There is little that is technically new here, but I could not find any prior examples using guitars as the subject, so I am publishing the results anyway.
The environment is my home PC:
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Memory: 32GB
Geforce GTX 1080 (Founders Edition)
Ubuntu 16.04
Python 3.5.3
Keras (TensorFlow backend)
The dataset was scraped from web image searches, and the labels were then corrected by hand. There are 13 label classes, listed below.
The labels are biased toward Fender/Gibson solid-body models simply because labeled images of them happened to be easy to collect; the bias has no particular meaning.
This time, 250 images are prepared for each class; 50 images randomly selected from each class are used as validation samples, and the remaining 200 as training samples.
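The per-class split can be sketched in plain Python (a minimal sketch; the function name, seed, and example filenames are my own, not from the original setup):

```python
import random

def split_class(paths, n_val=50, seed=0):
    """Randomly hold out n_val images of one class for validation."""
    rng = random.Random(seed)
    paths = list(paths)
    rng.shuffle(paths)
    return paths[n_val:], paths[:n_val]  # (train, validation)

train, val = split_class(["img%03d.jpg" % i for i in range(250)])
print(len(train), len(val))  # 200 50
```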
Keras ships with a 50-layer ResNet as a built-in application, so I decided to use it for the time being.
As provided it is a 1000-class classification model, so the fully connected head is cut off (`include_top=False`) and a fully connected layer for the desired classification is grafted on. The grafted part is minimal: a single fully connected layer + softmax.
```python
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Flatten
from keras.models import Model

# Load ResNet-50 without its 1000-class head, then graft on our own classifier.
resnet = ResNet50(include_top=False, input_shape=(224, 224, 3), weights="imagenet")
h = Flatten()(resnet.output)
model_output = Dense(len(classes), activation="softmax")(h)
model = Model(resnet.input, model_output)
```
Here, setting `weights="imagenet"` initializes the weights with the result of training on ImageNet. Starting training from that state is fine-tuning, i.e. transfer learning. Note that this time I do not freeze the pretrained layers; the weights of all layers are updated during training.
With `weights=None`, the weights are initialized with random numbers; in other words, the model learns from scratch, with no transfer.
This time, I ran the experiment both with and without transfer.
Since the number of sample images is relatively small, the training data needs to be inflated. Here I implement data augmentation using Keras's ImageDataGenerator.
```python
from keras.preprocessing.image import ImageDataGenerator

# Random affine transforms and flips to inflate the training set.
train_gen = ImageDataGenerator(
    rotation_range=45.,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True)
train_flow = train_gen.flow_from_directory(
    directory="./image/train",
    batch_size=32,
    target_size=(224, 224),
    classes=[_["key"] for _ in classes]
)
```
Inflating the training data with affine transformations is this easy. ~~Moreover, image-file loading + augmentation runs in parallel and asynchronously with training, which is a nice design. Personally, I thought this alone made Keras worth using.~~
**Addition/correction:** The fact that data preprocessing in Keras runs in parallel and asynchronously with training is a function of [Keras's training engine (via OrderedEnqueuer / GeneratorEnqueuer in fit_generator)](https://github.com/fchollet/keras/blob/135efd66d00166ca6df32f218d086f98cc300f1e/keras/engine/training.py#L1834-L2096), not something provided by ImageDataGenerator itself. The original wording was misleading due to my misunderstanding, so I have corrected it.
For optimization, I adopted SGD with momentum, following the original ResNet paper.
```python
from keras.optimizers import SGD

optimizer = SGD(decay=1e-6, momentum=0.9, nesterov=True)
```
I also tried Adam and others, but as is commonly said, SGD with momentum worked best for ResNet.
This time, one step is defined as 1000 mini-batches of 32 samples each, and validation accuracy is checked after every step. Training is stopped when the validation accuracy converges (early stopping).
```python
from keras.callbacks import EarlyStopping

# Stop once validation performance has not improved for 20 steps.
es_cb = EarlyStopping(patience=20)
```
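For reference, the pieces above would be wired together roughly as follows. This is a sketch under my own assumptions: the snapshot filename `best.h5`, the epoch count, and the validation generator `val_flow` are not from the article.

```python
from keras.callbacks import ModelCheckpoint

# Keep a snapshot of the best weights seen so far ("best.h5" is a
# hypothetical filename, not from the article).
ck_cb = ModelCheckpoint("best.h5", monitor="val_acc", save_best_only=True)

# Wiring it together (assumes `model`, `optimizer`, `es_cb`, `train_flow`,
# and a `val_flow` built like the training generator above):
# model.compile(optimizer=optimizer, loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit_generator(train_flow, steps_per_epoch=1000,  # 1000 batches = one "step"
#                     validation_data=val_flow, validation_steps=50,
#                     epochs=200, callbacks=[es_cb, ck_cb])
```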
Let's see the result.
First, the case with transfer learning. The accuracy progressed as follows.
Blue is the training curve and orange is the validation curve.
Validation accuracy fluctuates, but at step 36 the training accuracy reaches 99.9% and the validation accuracy 100%. Accuracy rises and falls after that, but for now I will use the snapshot from step 36 as the deliverable. Incidentally, completing the 36 steps took about 5 hours.
On the other hand, here is the case without transfer.
Compared to transfer learning, training progresses slowly and the accuracy is worse. The best score is 99% training accuracy and 84% validation accuracy. The large gap between training and validation accuracy shows that the model's generalization performance is low.
The effect of transfer learning is enormous.
I took photos of my own guitars and ran inference on them, using the model trained with transfer.
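Reading predictions off the model's softmax output can be sketched like this (the helper name and the example probability vector are my own; `class_names` stands for the same ordered class list passed to flow_from_directory):

```python
import numpy as np

def top_predictions(probs, class_names, k=3):
    """Map a softmax probability vector to the k most likely labels."""
    order = np.argsort(probs)[::-1][:k]
    return [(class_names[i], float(probs[i])) for i in order]

# In practice probs = model.predict(x)[0] from the trained model;
# here a made-up vector for illustration:
print(top_predictions(np.array([0.1, 0.7, 0.2]),
                      ["Jaguar", "Jazzmaster", "LesPaul"], k=2))
# [('Jazzmaster', 0.7), ('LesPaul', 0.2)]
```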
Jazzmaster
LesPaul
** Acoustic guitar **
It seems to be working properly.
What about a guitar that is not in the training data?
Duo Sonic
A convincing result, since the Duo-Sonic is a student model just like the Mustang.
** Mysterious guitar with built-in speaker (made by Crews, model unknown) **
It doesn't look like a Jaguar to me. It is hard to say what it does look like, but I feel a Les Paul or a Telecaster would be closer. The model seems to grasp features somewhat differently from humans.
Finally, let's play a little prank.
If I scribble a little on the Jazzmaster with a paint tool...
For some reason, it becomes a Flying V. Hmm. I wonder what the model looked at to arrive at that.
This was my first try at image recognition with deep learning, and the accuracy was higher than expected. Product tags and classifications are often wrong on musical-instrument e-commerce and auction sites, so a model like this could be used to check them.
It was also confirmed that transfer learning from a general image classification model is effective for image classification of a specific domain.
On the other hand, I also came face to face with a known problem of deep learning: it is difficult to correct misclassifications because the basis and criteria of the output are opaque. The model showed instability in that, while ideal (noise-free) input enables highly accurate classification, even a small amount of noise can change the judgment drastically. Intentionally adding noise to the input during training should produce a more robust model, so I would like to try that when I have time. (⇒ I tried it.) There is also a technique called Grad-CAM that estimates which regions the model attends to, so I would like to try it as well and observe the changes.
I used ResNet-50 as the model this time, but I have a (vague) feeling that this kind of classification task could be handled by a lighter model, so I would also like to try a shallow Network in Network and model reduction by distillation.