Reading the paper: Deep Self-Learning From Noisy Labels

Article summary

Here, we introduce the paper Deep Self-Learning From Noisy Labels [1], which was accepted at ICCV 2019. In addition, my PyTorch implementation is publicly available on GitHub. However, the reproduced accuracy is not as high as reported, and a discussion of possible causes is given at the end of this article.

Content of the paper

Overview

This paper addresses image classification with deep learning when the training dataset contains noisy labels. "Containing noise" here means that some class labels in the dataset are incorrect. Deep learning generally requires a large dataset to train a model, but preparing a huge, accurately labeled dataset is quite laborious. When a curated dataset is built, humans usually annotate it by hand, and mistakes inevitably occur. On the other hand, if you use data crawled from the Web as-is, you can easily assemble a huge labeled dataset, but such labels generally contain even more mistakes. Under these circumstances, a deep learning model that can learn correctly from a dataset containing some amount of label noise is useful in the real world, and that is what this paper aims to achieve.

Novelty of this paper

There are many existing studies on deep learning with noisy datasets. Examples include introducing a transition matrix that models the label error rates, and designing noise-robust loss functions. However, these techniques implicitly assume that label errors occur at random. In reality, label noise often depends on the input: for example, images that are ambiguous between classes are more likely to be mislabeled. This paper proposes a learning method that matches such real-world label noise.

Another feature of this paper is that no additional information, such as a noise-free dataset, is required. Many existing studies rely on a manually cleaned dataset, but as already mentioned, manual labeling is a burden even for a limited amount of data. In this method, the model is trained in a self-learning fashion using only the noisy dataset.

Outline of the method

Training of the proposed method consists of two phases, a learning phase and a label correction phase, which alternate every epoch.

Learning phase

In the learning phase, the image classification model $F$ is trained. This phase is much like ordinary deep learning training; the only difference is the loss function, expressed as $L_{TOTAL}$ below:

$ L_{TOTAL} = (1 - \alpha) L_{CCE}(F(\theta, x), y) + \alpha L_{CCE}(F(\theta, x), \hat{y}) $

where $L_{CCE}$ is the cross-entropy loss function, $y$ is the label attached to the data, and $\hat{y}$ is the corrected label, obtained in the label correction phase described later. $\alpha$ is a hyperparameter that balances the two loss terms. In the first epoch, the label correction phase has not yet been run and no corrected labels exist, so we set $\alpha = 0$.
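As a minimal sketch, this loss could be implemented in PyTorch roughly as follows (the function name and shapes are illustrative, not from the authors' code):

```python
import torch
import torch.nn.functional as F

def self_learning_loss(logits, y_noisy, y_corrected, alpha):
    """Weighted sum of two cross-entropy terms: one against the original
    noisy labels y, one against the corrected labels y_hat."""
    loss_noisy = F.cross_entropy(logits, y_noisy)
    loss_corrected = F.cross_entropy(logits, y_corrected)
    return (1 - alpha) * loss_noisy + alpha * loss_corrected

# In the first epoch no corrected labels exist yet, so alpha = 0 and the
# loss reduces to plain cross entropy.
logits = torch.randn(8, 14)          # batch of 8, 14 classes as in Clothing1M
y = torch.randint(0, 14, (8,))
print(self_learning_loss(logits, y, y, alpha=0.0))
```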

Label correction phase

Next, we explain the label correction phase. The goal of this phase is to obtain well-corrected labels $\hat{y}$ for use in the learning phase. This is done in the following steps.

  1. Randomly draw $m$ samples from the training data of each class.
  2. Select $p$ representative prototypes from the $m$ samples of each class.
  3. For every data point, determine the corrected label from its similarity to each class's prototypes.

Before going through the steps in order, let us define the "similarity" between data points used here. First, write the deep learning model $F$ as $F = f \circ G$, where $G$ is the feature extractor and $f$ is the fully connected layer for classification. Using this, the similarity between inputs $x_1$ and $x_2$ is the cosine similarity of $G(x_1)$ and $G(x_2)$:

$ \frac{G(x_1)^\mathsf{T}G(x_2)}{||G(x_1)|| ~ ||G(x_2)||} $

In the following, "similarity" always refers to this cosine similarity between feature vectors.
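As a small sketch (my own helper, assuming `G` outputs one feature vector per image), the pairwise similarity can be computed as:

```python
import torch

def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity of feature vectors G(x).
    feats: (n, d) tensor of features from the feature extractor G."""
    normed = torch.nn.functional.normalize(feats, dim=1)  # divide by ||G(x)||
    return normed @ normed.t()          # entry (i, j) = cos(G(x_i), G(x_j))

feats = torch.randn(5, 128)             # 5 illustrative 128-dim feature vectors
print(cosine_similarity_matrix(feats).shape)  # torch.Size([5, 5])
```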

Sampling

First, $m$ data points are randomly sampled from each class. This reduces the cost of the subsequent computation, which is $O(n^2)$ in the number of data points $n$.
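A sketch of this step, assuming a hypothetical mapping `indices_by_class` from each class label to the indices of its training samples:

```python
import random

def sample_per_class(indices_by_class, m):
    """Randomly draw m sample indices from each class."""
    return {c: random.sample(idx, min(m, len(idx)))
            for c, idx in indices_by_class.items()}

print(sample_per_class({0: list(range(100)), 1: list(range(100))}, m=5))
```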

Prototype selection

Next, for each class, $p$ prototypes are selected from the $m$ samples chosen earlier. The requirement for a prototype is that it represent the features of its class well. More specifically, a good prototype satisfies the following two conditions:

・Many of the other samples' features are similar to it.
・It is not too similar to the other prototypes.

In other words, if the features within a class form $p$ clusters, we ideally want the $p$ prototypes to be the representative points of those clusters. The concrete procedure is described below.

First, compute the pairwise similarities of the $m$ features. Define the matrix $S$ so that $S_{ij}$ is the similarity between the $i$-th and $j$-th samples.

Second, we define a density $\rho$ for each sample, which indicates how densely other samples are packed around it:

$ \rho_i = \sum_{j=1}^m \operatorname{sign}(S_{ij} - S_c) $

Here, $\operatorname{sign}$ is a function that returns $1$ for positive values, $-1$ for negative values, and $0$ for $0$, and $S_c$ is a reference value, set here to the top-40% value of the entries of $S$. As a result, the more samples that are similar to a given sample, the larger its $\rho$.

Third, we define $\eta$ for each sample, which measures its similarity to denser samples:

$ \eta_i = \max_{j:\, \rho_j > \rho_i} S_{ij} \quad (\rho_i < \rho_{max}), $
$ \eta_i = \min_{j} S_{ij} \quad (\rho_i = \rho_{max}) $

That is, the sample with the highest density $\rho$ takes its minimum similarity, and every other sample takes its maximum similarity among the samples with higher density $\rho$. Consequently, the smaller $\eta$ is, the more suitable the sample is as a prototype.

And finally, we select the prototypes using $\rho$ and $\eta$. Keep in mind that a good prototype has a large $\rho$ and a small $\eta$. Here, among the samples satisfying $\eta < 0.95$, we take the top $p$ with the largest $\rho$ and use them as the prototypes of that class.
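Putting the three quantities together, a sketch of the prototype selection might look like this (a minimal illustration of the rule above, assuming `S` is the $m \times m$ similarity matrix of one class; the helper name and threshold handling are my reading of the paper):

```python
import torch

def select_prototypes(S, p, eta_threshold=0.95, density_quantile=0.6):
    """Pick p prototype indices from an (m, m) similarity matrix S.
    density_quantile=0.6 makes S_c the top-40% similarity value."""
    m = S.size(0)
    S_c = S.flatten().quantile(density_quantile)   # reference value S_c
    rho = torch.sign(S - S_c).sum(dim=1)           # density rho of each sample

    eta = torch.empty(m)
    for i in range(m):
        denser = rho > rho[i]
        if denser.any():
            eta[i] = S[i][denser].max()  # max similarity to any denser sample
        else:
            eta[i] = S[i].min()          # densest sample(s): min similarity

    # Among samples with eta below the threshold, take the p with largest rho.
    candidates = (eta < eta_threshold).nonzero().squeeze(1)
    order = rho[candidates].argsort(descending=True)
    return candidates[order][:p]

feats = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
S = feats @ feats.t()
print(select_prototypes(S, p=4))
```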

Generating corrected labels

Now that we have prototypes for each class, we compute corrected labels for all the data, simply assigning each sample to the class whose prototypes it is most similar to. For each sample, compute its average similarity to the prototypes of each class, and take the class with the largest value as the corrected label $\hat{y}$.
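A sketch of this assignment, assuming prototype features have already been extracted per class (names are mine):

```python
import torch

def correct_labels(feats, prototypes_per_class):
    """feats: (n, d) sample features; prototypes_per_class: list of (p, d)
    prototype feature tensors, one per class. Returns corrected labels."""
    feats = torch.nn.functional.normalize(feats, dim=1)
    scores = []
    for protos in prototypes_per_class:
        protos = torch.nn.functional.normalize(protos, dim=1)
        sim = feats @ protos.t()           # (n, p) cosine similarities
        scores.append(sim.mean(dim=1))     # average similarity to this class
    return torch.stack(scores, dim=1).argmax(dim=1)   # y_hat per sample

protos = [torch.randn(4, 128) for _ in range(3)]      # 3 classes, 4 prototypes
print(correct_labels(torch.randn(10, 128), protos))
```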

Implementation of the paper

From here, I describe my implementation of the paper. See GitHub for the source code. However, as already mentioned, I could not reproduce the accuracy reported in the paper; possible causes are discussed at the end.

Implementation overview

As in the paper, experiments were conducted on the Clothing1M [2] and Food-101N [3] datasets. The Food-101N dataset is apparently meant to be evaluated with the test data of Food-101 [4], so I followed that convention. Models and hyperparameters are used exactly as written in the paper.

The main execution environment is as follows.

・Python: 3.5.2
・CUDA: 10.2
・PyTorch: 0.4.1
・torchvision: 0.2.1
・NumPy: 1.17.2

Results

The accuracies reported in the paper are shown in the table below. CCE denotes training with plain cross-entropy loss.

| Dataset | CCE | Proposed method |
|---|---|---|
| Clothing1M | 69.54 | 74.45 |
| Food-101N | 84.51 | 85.11 |

On the other hand, the accuracies of my reimplementation are shown in the table below. The proposed method's accuracy is given only approximately because the run-to-run variation due to random seeds is quite large. As described later, I investigated possible causes but could not quantitatively evaluate the size of this variation. In any case, the proposed method never significantly outperformed CCE.

| Dataset | CCE | Proposed method |
|---|---|---|
| Clothing1M | 68.10 | around 64 |
| Food-101N | 85.05 | around 80 |

Possible causes of reduced accuracy

Possibility of an incorrect implementation

This section merely argues that the implementation is correct, so skip it if you are not interested.

If there were a mistake in this implementation, it would most likely be in the label correction part, since the accuracy with plain cross entropy is close to the paper's. I verified this by splitting the label correction into two parts: generating the corrected labels, and using them.

The first check concerns the part that generates the corrected labels. First, I verified individually that the related functions behave as expected. Next, I confirmed that the corrected labels agree with the original noisy labels to a reasonable extent, i.e., that label correction roughly follows the noisy labels. In addition, I quantitatively checked the accuracy of the label correction. For Clothing1M, clean labels are attached to part of the dataset, so the accuracy of label correction can be measured. To focus on the label correction itself, I assigned labels to the test data using the label correction module built from the trained model and training data, and compared the label accuracy over the whole training set with the accuracy of the corrected test labels. The results are shown in the table below (the paper states the training label accuracy as 61.74, but the original Clothing1M paper reports 61.54, so I trust the latter). We can see that the corrected labels exceed the accuracy of the original noisy training labels.

| Training data (noisy labels) | Test data (corrected labels) |
|---|---|
| 61.54 | 69.92 |

From this, the overall behavior of the label correction appears correct, which confirms that the label correction module works as intended.

The second check concerns the part that uses the corrected labels. When I rewrote the code so that the uncorrected labels were passed where the corrected labels had been, the accuracy was about the same as plain CCE. Therefore, the optimization using corrected labels also appears correct. For these two reasons, I believe implementation bugs can be ruled out.

Difference in execution environment

The hyperparameters used here are the same as in the paper. For Clothing1M, for example, the learning rate is decayed every 5 epochs for a total of 15 epochs. However, as Figure 1 shows, the first 5 epochs appear to be wasted because the learning rate is too high. I suspect one major cause of this is differences in the libraries and versions used at runtime.

[Figure 1: training curve for Clothing1M]

When to start the label correction phase

In the label correction phase, accuracy may actually degrade until the neural network can extract reasonably good features. The original paper also hints that the label correction phase is not inserted from the very beginning, but the concrete setting is not stated. This implementation reports the result when the label correction phase starts immediately after the first epoch. As far as I can read, the paper seems to start label correction after the first epoch, but the wording does not completely rule out other schedules.

In actual experiments, label correction in the first epoch sometimes caused the accuracy to drop to around 30% early in training. However, when I delayed the start of the label correction phase by several epochs, the sudden accuracy drop disappeared but the final accuracy barely changed. The result might differ if label correction were inserted midway through a much longer training schedule, but this remains an open question.

Random numbers

Although I could not quantitatively evaluate the variance of the results, judging from the volatility of the training curves over several trials, I think the influence of random seeds cannot be ignored. For Food-101N at least, the accuracy improvement reported in the original paper may not exceed the range of random variation. Incidentally, the original paper does not analyze the Food-101N experiment in that much detail.

Bonus consideration

For Food-101N, the accuracy did not improve much even in the original paper, and while I mentioned above that the improvement might be explained by random variation, I suspect the cause lies in the method itself. In Clothing1M, a 14-class problem, the corrected label matches the original label with sufficiently high probability, but in Food-101N, a 101-class problem, I confirmed that the probability of the corrected label matching the original label is lower overall than in Clothing1M. Since Food-101N has no clean labels, the accuracy of the corrected labels cannot be measured directly, but it is almost certainly lower than for Clothing1M. Considering the difficulty of separating 101 classes with a small number of prototypes, this seems a natural outcome. Therefore, when applying this method to many-class problems like Food-101N, some care seems necessary, such as using soft labels instead of decisively assigning a class, or reducing the weight of the loss term that uses corrected labels.
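As one concrete, untested sketch of that idea: instead of a hard argmax, the average similarities could be turned into a soft label with a temperature-scaled softmax (my suggestion, not part of the paper):

```python
import torch

def soft_corrected_labels(class_scores, temperature=0.1):
    """class_scores: (n, num_classes) average prototype similarities.
    A higher temperature makes the assignment less decisive."""
    return torch.softmax(class_scores / temperature, dim=1)

scores = torch.randn(4, 101)                    # e.g. 101 classes (Food-101N)
print(soft_corrected_labels(scores).sum(dim=1)) # each row sums to 1
```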

References

[1] Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep Self-Learning From Noisy Labels. In International Conference on Computer Vision, 2019.
[2] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from Massive Noisy Labeled Data for Image Classification. In Computer Vision and Pattern Recognition, 2015.
[3] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise. In Computer Vision and Pattern Recognition, 2018.
[4] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – Mining Discriminative Components with Random Forests. In European Conference on Computer Vision, 2014.
