For the operating environment and other setup details, see the previous post, Classification of guitar images by machine learning, part 1.
**2017/10/23 postscript:** The content of this post turns out to be similar to the methods known as Cutout [^1] and Random Erasing [^2].
**2018/02/25 postscript:** Inducing misclassification by adding noise is known as adversarial perturbation, and it appears to be an active research area. The topic of this article is simply "low generalization performance is a problem", whereas research on adversarial perturbation is mainly concerned with more sophisticated deception and how to defend against it (robustness), discussed from a security point of view. A very good summary of adversarial perturbation has been published, so I will link to it.
In the experiment from the previous post, a small amount of graffiti on the input image completely changed the predicted class. This means that if any part of the target object is hidden by something, an unpredictable misclassification can result.
Human recognition is more robust: in many cases we can still identify an object by judging it as a whole, even when part of it is hidden. Can we give a machine learning model's discriminative ability the same kind of robustness?
Most of the training data used in the previous post consisted of "clean" photos such as product images from e-commerce sites, with almost no images in which the target is partially occluded by something else. Under such ideal conditions, the class can be identified well from local features alone (for example, just the layout of the pickups and controls), so there is no pressure for the model to learn to look at the whole image and capture features in combination. In other words, it overfits to the ideal conditions and fails to generalize.
So let's set things up so that the model cannot classify well unless it looks at the whole image and captures features in combination.
The idea is simple: randomly hide parts of the training images. This makes it hard to classify from a few local features alone, so the model is inevitably pushed toward learning more global, composite features.
This time, I overlaid multiple random rectangles on the training images with the following code.
```python
import numpy as np

def add_dust_to_batch_images(x):
    batch_size, height, width, channels = x.shape
    for i in range(batch_size):
        num_of_dust = np.random.randint(32)  # up to 31 rectangles per image
        # Center coordinates are drawn from a Gaussian clipped to [0, 1],
        # so most rectangles land near the middle of the image.
        dusts = zip(
            (np.clip(np.random.randn(num_of_dust) / 4. + 0.5, 0., 1.) * height).astype(int),  # pos y
            (np.clip(np.random.randn(num_of_dust) / 4. + 0.5, 0., 1.) * width).astype(int),   # pos x
            np.random.randint(1, 8, num_of_dust),    # half width
            np.random.randint(1, 8, num_of_dust),    # half height
            np.random.randint(0, 256, num_of_dust))  # brightness
        for pos_y, pos_x, half_w, half_h, b in dusts:
            top = np.clip(pos_y - half_h, 0, height - 1)
            bottom = np.clip(pos_y + half_h, 0, height - 1)
            left = np.clip(pos_x - half_w, 0, width - 1)
            right = np.clip(pos_x + half_w, 0, width - 1)
            x[i, top:bottom, left:right, :] = b
    return x

# ...

# Wrap the original batch generator so every batch gets rectangles on the fly.
noised_train_flow = ((add_dust_to_batch_images(x), y) for x, y in train_flow)
```
The number, position, size, and brightness of the rectangles are random. Since the subject guitar usually appears near the center of the frame, the coordinates are sampled so that rectangles concentrate near the center.
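As a quick check of that center bias (this snippet is my own illustration, not part of the training code): the coordinate expression is a Gaussian with mean 0.5 and standard deviation 0.25, clipped to [0, 1].

```python
import numpy as np

# Same expression used for pos y / pos x in add_dust_to_batch_images.
coords = np.clip(np.random.randn(100000) / 4. + 0.5, 0., 1.)
print(coords.mean())  # ~0.5, i.e. the center of the image
# About 68% of rectangle centers fall in the middle half of each axis.
print(((coords > 0.25) & (coords < 0.75)).mean())
```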
Here are some actually processed images.
You can see that the body outline and parts of the hardware are hidden by the added rectangles.
From the viewpoint of adding noise, I also considered inserting Dropout immediately after the input, but since the aim this time is to hide local features as described above, I decided that Dropout, which spreads noise evenly over the whole image, was not a good fit.
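For illustration, here is what that rejected alternative would do, as a minimal sketch using the modern tf.keras API (the rate and shapes are arbitrary placeholders, not values from this experiment):

```python
import numpy as np
import tensorflow as tf

# Dropout applied directly to an input batch, to see its effect.
drop = tf.keras.layers.Dropout(0.25)
batch = np.ones((1, 8, 8, 3), dtype=np.float32)
out = drop(batch, training=True)  # training=True keeps Dropout active
# Roughly 25% of individual values are zeroed, spread uniformly over the
# image: noise everywhere, rather than occlusion of contiguous regions.
print(float(np.mean(out.numpy() == 0.)))
```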
Now let's train the model on the noised input. As last time, this is transfer learning using a ResNet-50 pretrained on ImageNet.
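The model definition itself is in the previous post; as a rough sketch of this kind of setup (the class count, input size, optimizer, and step counts below are placeholders of mine, not the actual values), training with the noised generator looks something like this:

```python
import tensorflow as tf

NUM_CLASSES = 7  # hypothetical number of guitar classes

# ImageNet-pretrained ResNet-50 without its classification head.
base = tf.keras.applications.ResNet50(
    weights='imagenet', include_top=False, input_shape=(224, 224, 3))

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# The noised generator simply replaces the raw one; since train_flow loops
# forever, the wrapping generator does too.
model.fit(noised_train_flow, steps_per_epoch=100, epochs=60)
```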
Here is how accuracy evolved over training.
Surprisingly, adding the noise has almost no effect on training speed.
The best score came at step 54, with a training accuracy of 99.95% and a validation accuracy of 100%. Let's run inference again using the snapshot from that point.
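A minimal sketch of running inference from a saved snapshot (the file names and the 224x224 preprocessing are my assumptions, not taken from the original code):

```python
import numpy as np
import tensorflow as tf

# Hypothetical path to the checkpoint saved at the best step.
model = tf.keras.models.load_model('snapshot_step54.h5')

img = tf.keras.preprocessing.image.load_img('jazzmaster_with_graffiti.jpg',
                                            target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
probs = model.predict(x)[0]
print(probs.argmax())  # index of the predicted guitar class
```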
The Jazzmaster, Les Paul, and acoustic guitar without graffiti get the same good results as last time, so I will omit them.
The interesting one is the picture of the Jazzmaster with graffiti, which was somehow judged to be a "Flying V" last time. How about this time?
The misclassification has been successfully fixed.
On the other hand, here is one whose prediction changed: the Duo Sonic.
Last time it was judged to be a "Mustang", but this time a "Stratocaster". Perhaps, as a result of capturing more global features, the shapes of the pickguard and bridge are now being taken into account.
It feels like the aim was more or less achieved (roughly speaking).
I suspect what I tried this time is a fairly standard technique in the academic field, but applying it to a familiar subject deepens understanding and is fun.