Why Deep Metric Learning based on Softmax functions works

Introduction

In deep metric learning, loss functions such as Contrastive Loss and Triplet Loss, which directly use the similarities (and dissimilarities) between images, are widely used. However, they are also known to be hard to train, because selecting and combining the training samples is difficult, and many improvements and tricks have been proposed to deal with this. Recently, metric learning based on the Softmax function has been attracting attention because it can be trained like an ordinary classification task, without any of these data-selection difficulties. ArcFace is a typical example and is explained in detail here. Why does Softmax-based metric learning work, and is there room for further improvement? I would like to give a brief introduction to these questions.

Center Loss

Center Loss [^Center] is probably the first method to bring an element of metric learning into classification. It was explained in detail here, so I will only summarize the main points. Center Loss learns the classification and the class-center positions in the feature space at the same time, by penalizing the distance between each feature vector and the center of its class. As a result, intra-class variation shrinks, and it was presented at ECCV 2016 as a method that separates classes more effectively.

In the paper, the FC layer just before the final layer of the CNN is reduced to two dimensions, and its values are plotted directly on a 2D plane to visualize the features. The figure below visualizes the intermediate features of MNIST classified with the ordinary Softmax function.

[Figure: 2D feature distribution of MNIST trained with the ordinary Softmax loss]

The loss function is the ordinary Softmax Cross Entropy Loss plus the Center Loss term, which penalizes the distance between each feature vector and the center vector of its class.

L = L_S + \lambda L_C = -\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i + b_j}} + \frac{\lambda}{2}\sum_{i=1}^{m}\|x_i - c_{y_i}\|_2^2

Since it is not computationally realistic to recompute the exact class center $c_{y_i}$ at every step, the following update is performed for each mini-batch, and the class centers are updated incrementally.

\frac{\partial L_C}{\partial x_i} = x_i - c_{y_i}
\Delta c_j = \frac{\sum_{i=1}^{m}\delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m}\delta(y_i = j)}
c_j^{t+1} = c_j^{t} - \alpha \cdot \Delta c_j^{t}

The following is a concrete example of a class-center update. If the data belonging to class A in the mini-batch are $A_1$, $A_2$, and $A_3$, the center position $C_A$ of class A is updated as shown in the figure below.

[Figure: example of updating the class center $C_A$ from the mini-batch samples $A_1$, $A_2$, $A_3$]
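As a rough PyTorch sketch of this mechanism (the module name, hyperparameter names, and the batch averaging are my own choices, not taken from the paper's code), the center term and the mini-batch center update could look like this:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Minimal sketch of Center Loss: penalize the distance to the class center
    and update the centers from each mini-batch."""

    def __init__(self, num_classes, feat_dim, alpha=0.5):
        super().__init__()
        self.alpha = alpha  # learning rate for the center update
        # centers are buffers: they are updated manually, not by the optimizer
        self.register_buffer("centers", torch.zeros(num_classes, feat_dim))

    def forward(self, features, labels):
        # L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2  (averaged over the batch here)
        centers_batch = self.centers[labels]               # (B, D)
        return 0.5 * ((features - centers_batch) ** 2).sum(dim=1).mean()

    @torch.no_grad()
    def update_centers(self, features, labels):
        # Delta c_j = sum_i 1(y_i=j) (c_j - x_i) / (1 + sum_i 1(y_i=j))
        for j in labels.unique():
            mask = labels == j
            delta = (self.centers[j] - features[mask]).sum(dim=0) / (1.0 + mask.sum())
            self.centers[j] -= self.alpha * delta
```

During training the total loss would be something like `cross_entropy + lam * center_loss(feat, y)`, with `update_centers(feat.detach(), y)` called after each optimizer step.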

Finally, here is a visualization of the results with Center Loss. You can see that as the hyperparameter $\lambda$ increases, the intra-class variation decreases.

[Figure: MNIST feature distributions trained with Center Loss for different values of $\lambda$]

SphereFace

The authors of SphereFace observed that the feature space learned with the Softmax function naturally has an angular distribution, and argued that an angle-based margin fits it better than a Euclidean-distance margin such as Center Loss. They proposed the Angular Softmax (A-Softmax) Loss [^Sphere], which was accepted at CVPR 2017.

Let's go through this in order. First, the ordinary Softmax Cross Entropy Loss is:

L_{softmax} = \frac{1}{N}\sum_{i} -\log\frac{e^{\boldsymbol{W}_{y_i}^{T}\boldsymbol{x}_i + b_{y_i}}}{\sum_{j} e^{\boldsymbol{W}_{j}^{T}\boldsymbol{x}_i + b_j}}

Next, using the inner-product formula $\boldsymbol{a}\cdot\boldsymbol{b} = \|\boldsymbol{a}\|\,\|\boldsymbol{b}\|\cos\theta$, rewrite the inner product of $\boldsymbol{W_j}$ and $\boldsymbol{x_i}$:

L = \frac{1}{N}\sum_{i} -\log\frac{e^{\|\boldsymbol{W}_{y_i}\|\,\|\boldsymbol{x}_i\|\cos\theta_{y_i,i} + b_{y_i}}}{\sum_{j} e^{\|\boldsymbol{W}_{j}\|\,\|\boldsymbol{x}_i\|\cos\theta_{j,i} + b_j}}

Furthermore, normalize the weights so that $\|\boldsymbol{W_j}\| = 1$ and set the bias $b_j = 0$. This gives the following, which is defined as the Modified Softmax Loss:

L_{modified} = \frac{1}{N}\sum_{i} -\log\frac{e^{\|\boldsymbol{x}_i\|\cos\theta_{y_i,i}}}{\sum_{j} e^{\|\boldsymbol{x}_i\|\cos\theta_{j,i}}}

Next, let's look at the decision boundaries under $L_{modified}$. A decision boundary is the set of points where two classes have exactly the same probability.

Let $\theta_1$ be the angle between the class-1 weight $W_1$ and the feature $x_i$, and $\theta_2$ the angle between the class-2 weight $W_2$ and $x_i$. The two class probabilities are exactly equal when the following holds.

\frac{e^{\|\boldsymbol{x_i}\|\cos\theta_1}}{\sum_{j} e^{\|\boldsymbol{x_i}\|\cos(\theta_{j,i})}} = \frac{e^{\|\boldsymbol{x_i}\|\cos\theta_2}}{\sum_{j} e^{\|\boldsymbol{x_i}\|\cos(\theta_{j,i})}} \\
\|\boldsymbol{x_i}\|\cos\theta_1 = \|\boldsymbol{x_i}\|\cos\theta_2 \\
\theta_1 = \theta_2

Therefore, the probabilities of class 1 and class 2 are exactly equal when $x_i$ lies on the line that bisects the angle between the weights $W_1$ and $W_2$.

[Figure: decision boundary of the Modified Softmax Loss bisecting the angle between $W_1$ and $W_2$]

From this result we can see that features $x_i$ learned with a Softmax loss are essentially distributed along the angles of the final weights $W_j$ just before the softmax is applied.

[Figure: angular distribution of learned features around the class weight vectors]

This suggests that Euclidean-distance-based margins such as Center Loss do not mesh well with the Softmax function, and that **imposing a margin along the angle is the most effective approach**.
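To make $L_{modified}$ concrete, here is a minimal sketch in PyTorch (the toy dimensions and the function name are my own):

```python
import torch
import torch.nn.functional as F

def modified_softmax_loss(features, labels, weight):
    """L_modified: normalize each class weight to unit norm and drop the bias,
    so the logit for class j becomes ||x_i|| * cos(theta_{j,i})."""
    w_unit = F.normalize(weight, dim=1)      # enforce ||W_j|| = 1, b_j = 0
    logits = features @ w_unit.t()           # = ||x_i|| * cos(theta_{j,i})
    return F.cross_entropy(logits, labels)

# toy usage
feats = torch.randn(8, 128)
w = torch.randn(10, 128, requires_grad=True)
y = torch.randint(0, 10, (8,))
loss = modified_softmax_loss(feats, y, w)
```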

Angular Softmax (A-Softmax) Loss

The Angular Softmax (A-Softmax) Loss is defined as follows (shown here in its basic form, with $\theta_{y_i,i}$ restricted so that $\cos(m\theta_{y_i,i})$ stays monotonic):

L_{ang} = \frac{1}{N}\sum_{i} -\log\frac{e^{\|\boldsymbol{x}_i\|\cos(m\theta_{y_i,i})}}{e^{\|\boldsymbol{x}_i\|\cos(m\theta_{y_i,i})} + \sum_{j\neq y_i} e^{\|\boldsymbol{x}_i\|\cos\theta_{j,i}}}

Compared with $L_{modified}$, only the angle $\theta_{y_i,i}$ to the weight $W_{y_i}$ of the sample's own class is penalized by the factor $m$, so the decision boundary is expected to pull the features of each class closer to their own $W_{y_i}$ than the original boundary would.

[Figure: decision boundaries of the Modified Softmax Loss and the A-Softmax Loss]

CosFace and ArcFace

Following SphereFace, several methods have been proposed that take the margin along the angle, just as SphereFace advocated. CosFace [^Cosface] and ArcFace [^Arcface] are representative examples.

[Figure: comparison of the margin formulations in SphereFace, CosFace, and ArcFace]

I will omit the details, but briefly: CosFace normalizes $x_i$ as well, so the cosine of the angle between the two vectors, $\cos\theta$, can be used directly as the similarity. It also introduces the scale hyperparameter $s$, because with $\cos\theta$ alone the logits are too small and the Softmax does not train well. ArcFace then changed where the margin is applied: since the margin is added directly to the angle, the separation boundary is linear in the angle and the margin is constant over the entire interval.
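To make the difference concrete, here is a minimal sketch (my own simplification, not the authors' released implementations; the default values of $s$ and $m$ are just illustrative) of how CosFace and ArcFace modify only the target-class logit before the Softmax cross entropy:

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(features, labels, weight, s=30.0, m=0.35, mode="arcface"):
    """Cosine-based Softmax loss: x_i and W_j are L2-normalized, a margin m is
    applied only to the target-class logit, and everything is scaled by s."""
    cos = F.normalize(features, dim=1) @ F.normalize(weight, dim=1).t()  # cos(theta_{i,j})
    cos = cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    onehot = F.one_hot(labels, num_classes=weight.size(0)).bool()

    if mode == "cosface":
        target = cos - m                          # cos(theta) - m
    else:                                         # "arcface"
        target = torch.cos(torch.acos(cos) + m)   # cos(theta + m)

    logits = s * torch.where(onehot, target, cos)
    return F.cross_entropy(logits, labels)
```

The only difference between the two modes is whether the margin $m$ is subtracted from the cosine or added to the angle.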

Uniform Loss

UniformFace [^Uniformface] adds a new term called the Uniform Loss to a cosine-based Softmax loss such as SphereFace; it imposes an equidistribution constraint that spreads the class centers evenly and makes full use of the feature space. It was accepted at CVPR 2019. Uniform Loss regards each class center as a charge with inter-class repulsion on the hypersphere that forms the feature space, and minimizes their potential energy. Most existing methods aim to increase the inter-class distance and reduce the intra-class variation; Uniform Loss can, in addition, be expected to maximize the minimum inter-class distance.

In the case of **Center Loss**, it simply minimizes the distance between the samples of a class and the class center, and does not describe the relationships between classes at all.

[Figure: class arrangement under Center Loss]

In the case of **SphereFace**, **CosFace**, and **ArcFace**, a decision boundary with a margin is learned on the hypersphere; as a result, training proceeds so that intra-class variation shrinks and inter-class distance grows. However, since nothing explicitly constrains the overall distribution in the feature space, the arrangement may remain local and unbalanced.

[Figure: class arrangement under margin-based losses, which can remain locally clustered]

Uniform Loss is motivated by the fact that when charges are placed on the surface of a sphere, their potential energy is minimized when they are evenly distributed. The class centers are therefore regarded as charges, and the function representing their potential energy is used as a loss.

[Figure: class centers viewed as charges distributed uniformly on the hypersphere]

A repulsive force $F$ between two class centers is defined to be inversely proportional to the square of their distance:

F(c_{j_1}, c_{j_2}) = \frac{1}{d^{2}(c_{j_1}, c_{j_2})}

The distance function $d$ follows Coulomb's law and uses the Euclidean distance; 1 is added so that the repulsive force $F$ does not become too large:

d(c_{j_1}, c_{j_2}) = \|c_{j_1} - c_{j_2}\|_2 + 1

The Uniform Loss is then formulated as the mean of the pairwise energies over all class centers (with $M$ classes):

L_U = \frac{1}{M(M-1)}\sum_{j_1=1}^{M}\sum_{j_2 \neq j_1}\frac{1}{d(c_{j_1}, c_{j_2})}

The classification part of the loss is the cosine-based Softmax loss of SphereFace ($L_{ang}$ introduced above).

The final loss function is the sum of these two losses (with a balancing weight $\lambda$):

L = L_{ang} + \lambda L_U
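As a sketch of the uniform term alone, following the description above (pairwise energy $1/d$ with $d$ the Euclidean distance plus 1; this is my reading of the formulation, so the exact constants may differ from the paper):

```python
import torch

def uniform_loss(centers):
    """Mean pairwise 'potential energy' of the class centers.
    d(c_j, c_k) = ||c_j - c_k||_2 + 1, energy of a pair = 1 / d."""
    pairwise_dist = torch.pdist(centers, p=2) + 1.0   # all j < k pairs
    return (1.0 / pairwise_dist).mean()               # mean over all pairs
```

This term would be added to the SphereFace loss with a weight $\lambda$, as above.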

Finally, the class centers are updated in exactly the same way as in Center Loss.

As an aside: if $N$ points are placed on the surface of a sphere so that the minimum spherical distance between any two of them is maximized, what is that distance? This is known as the Tammes problem, and it is apparently one of the unsolved problems in mathematics for which no complete answer has yet been found.

P2SGrad

As joint research between SenseTime Research and The Chinese University of Hong Kong, two papers, AdaCos [^Adacos] and P2SGrad [^P2sgrad], were published simultaneously by the same authors on May 7, 2019. Both were accepted at CVPR 2019 (AdaCos as an oral). Beyond the proposed methods themselves, I found their discussion of the essential differences among cosine-based Softmax losses such as SphereFace, CosFace, and ArcFace, and of how the gradient optimization proceeds, particularly interesting. AdaCos has already been explained here, so I would like to take a closer look at P2SGrad.

What is most worth noting about P2SGrad is that the model can be optimized without formulating a specific loss function. Metric learning or not, the standard practice in deep learning is to define a loss function and update the weights and biases so that it decreases. P2SGrad instead directly derives the optimal direction and length of the gradient used to update the weight vectors in the feature space, and backpropagates it to optimize the parameters. Moreover, not having to define a loss function means not having to tune hyperparameters such as the $m$ and $s$ that appear in the loss functions of ArcFace and CosFace.

Backpropagation of Cosine Based Softmax Loss

First, let's examine the gradient of a cosine-based Softmax loss with no angular margin. The following $f_{i,j}$ is the logit of the feature $\vec{x_i}$ for class $j$.

f_{i,j}  = s \cdot \frac{\langle \vec{x_i}, \vec{W_j} \rangle}{\|\vec{x_i}\|_2\|\vec{W_j}\|_2} = s \cdot \langle \boldsymbol{\hat{x_i}}, \boldsymbol{\hat{W_j}} \rangle = s \cdot \cos\theta_{i,j} \tag{1}

$\boldsymbol{\hat{x_i}}$ and $\boldsymbol{\hat{W_j}}$ denote the normalized vectors of $\vec{x_i}$ and $\vec{W_j}$, respectively.

When the number of classes is $C$, the probability $P_{i,j}$ that the $i$-th feature belongs to class $j$ can be written with the Softmax function as follows.

P_{i,j} = Softmax(f_{i,j}) = \frac{e^{f_{i,j}}}{\sum^{C}_{k=1}e^{f_{i,k}}} \tag{2}

If the correct class is $y_i$, the Cross Entropy Loss is:

L_{CE}(\vec{x_i}) =-\log{P_{i, y_i}} = -\log{\frac{e^{f_{i,y_i}}}{\sum^{C}_{k=1}e^{f_{i,k}}}} \tag{3}

For this $L_{CE}(\vec{x_i})$, compute the gradients with respect to $\vec{x_i}$ and $\vec{W_j}$.

\frac{\partial L_{CE}(\vec{x_i})}{\partial \vec{x_i}} = \sum_{j=1}^{C}(P_{i,j} - \mathbb{1}(y_i=j)) \nabla f(\cos\theta_{i,j}) \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{x_i}} \tag{4}
\frac{\partial L_{CE}(\vec{x_i})}{\partial \vec{W_j}} = (P_{i,j} - \mathbb{1}(y_i=j)) \nabla f(\cos\theta_{i,j}) \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} \tag{5}

The indicator $\mathbb{1}(y_i=j)$ is $1$ when $j = y_i$ and $0$ otherwise. Computing $\frac{\partial \cos\theta_{i,j}}{\partial \vec{x_i}}$ and $\frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}}$ gives the following vectors.

\frac{\partial \cos\theta_{i,j}}{\partial \vec{x_i}} = \frac{1}{\|\vec{x_i}\|_2}(\boldsymbol{\hat{W_j}} - \cos\theta_{i,j} \cdot \boldsymbol{\hat{x_i}}) \tag{6}
\frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} = \frac{1}{\|\vec{W_j}\|_2}(\boldsymbol{\hat{x_i}} - \cos\theta_{i,j} \cdot \boldsymbol{\hat{W_j}}) \tag{7}
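Equations $(6)$ and $(7)$ can be sanity-checked against autograd with a few lines (my own quick check, not from the paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16, requires_grad=True)
w = torch.randn(16)

cos = F.cosine_similarity(x, w, dim=0)   # cos(theta) between x and w
cos.backward()                           # autograd gradient d cos / d x -> x.grad

# Eq. (6): d cos(theta) / d x = (w_hat - cos(theta) * x_hat) / ||x||
xd = x.detach()
x_hat, w_hat = xd / xd.norm(), w / w.norm()
analytic = (w_hat - float(cos) * x_hat) / xd.norm()

print(torch.allclose(x.grad, analytic, atol=1e-5))   # expected: True
```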

Next, compute the $\nabla f(\cos\theta_{i,j})$ part. Since $f$ is the function that produces the logits, with no angular margin $f(\cos\theta_{i,j}) = s \cdot \cos\theta_{i,j}$, so $\nabla f(\cos\theta_{i,j}) = s$. For CosFace, $f(\cos\theta_{i,y_i}) = s \cdot (\cos\theta_{i,y_i} - m)$, so $\nabla f(\cos\theta_{i,y_i}) = s$; for ArcFace, $f(\cos\theta_{i,y_i}) = s \cdot \cos(\theta_{i,y_i} + m)$, so $\nabla f(\cos\theta_{i,y_i}) = s \cdot \frac{\sin(\theta_{i,y_i} + m)}{\sin\theta_{i,y_i}}$.

In every case, $\nabla f(\cos\theta_{i,j})$ is a scalar determined by the parameters $s$, $m$ and by $\cos\theta_{i,y_i}$. Also, since a value of $s > 1$ is normally used, all of the above satisfy $\nabla f(\cos\theta_{i,j}) > 1$.

Now, let's organize $\frac{\partial L_{CE}(\vec{x_i})}{\partial \vec{W_j}}$ by cases of $j$.

For $j \neq y_i$:

\frac{\partial L_{CE}(\vec{x_i})}{\partial \vec{W_j}} = P_{i,j} \cdot \nabla f(\cos\theta_{i,j}) \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} = T \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} \tag{8}

For $j = y_i$:

\frac{\partial L_{CE}(\vec{x_i})}{\partial \vec{W_j}} = (P_{i,j} - 1) \cdot \nabla f(\cos\theta_{i,j}) \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} = U \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} \tag{9}

Since $P_{i,j}$ is a Softmax output, its range is $P_{i,j} \in [0, 1]$. In other words, $T$ and $U$ in the equations above are scalars with $T > 0$ and $U < 0$, respectively.

The weight vector $\vec{W_j}$ is optimized by backpropagation as follows.

\vec{W_j} \leftarrow \vec{W_j} - \eta\cdot \frac{\partial L_{CE}}{\partial \vec{W_j}} \tag{10}

Summarizing based on these facts, we can see the following.

- The weight parameter of the correct class, $\vec{W_{y_i}}$, is updated by backpropagation in the direction of $\frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}}$, which is perpendicular to $\vec{W_{y_i}}$. The length of the gradient vector is $(P_{i,j} - \mathbb{1}(y_i=j))\,\nabla f(\cos\theta_{i,j})$.

- The weight parameter $\vec{W_j}$ of an incorrect class is updated by backpropagation in the direction of $-\frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}}$, which is perpendicular to $\vec{W_j}$. The length of the gradient vector is again $(P_{i,j} - \mathbb{1}(y_i=j))\,\nabla f(\cos\theta_{i,j})$.

- Since $\frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}}$ is perpendicular to $\vec{W_j}$, this is the fastest, most efficient, and very reasonable update direction.

- There are various cosine-based Softmax losses such as CosFace and ArcFace, but the directions in which they update the gradients are all the same. The essential difference between them is the length of the gradient, and that greatly influences how the model is optimized.

[Figure: update directions of the weight vectors $\vec{W_j}$ under cosine-based Softmax losses]

Gradient length

As we have seen, the length of the gradient is $(P_{i,j} - \mathbb{1}(y_i=j)) \cdot \nabla f(\cos\theta_{i,j})$, so **$P_{i,j}$ has a large effect on the length of the gradient**. $P_{i,j}$ is positively correlated with the logit $f_{i,j}$, and the logit $f_{i,j}$ in turn is strongly influenced by the hyperparameters $s$ and $m$.

Let's look at ArcFace as an example, where $f_{i,y_i} = s \cdot \cos(\theta_{i,y_i} + m)$. You can see that the value of $P_{i,y_i}$ changes significantly when the hyperparameters $s$ and $m$ change, even for the same $\theta_{i,y_i}$.

[Figure: $P_{i,y_i}$ as a function of $\theta_{i,y_i}$ for different settings of $s$ and $m$ in ArcFace]
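This sensitivity is easy to reproduce numerically (a toy calculation of mine, assuming for simplicity that the $C-1$ non-target classes all sit at $\theta = \pi/2$):

```python
import math

def p_target_arcface(theta, s, m, C=1000, theta_other=math.pi / 2):
    """P_{i,y_i} for ArcFace logits, with all non-target classes at theta_other."""
    target = math.exp(s * math.cos(theta + m))
    others = (C - 1) * math.exp(s * math.cos(theta_other))
    return target / (target + others)

# same theta_{i,y_i} = 45 degrees, different (s, m)
for s, m in [(16, 0.5), (32, 0.2), (32, 0.5), (64, 0.5)]:
    print(f"s={s:>2}, m={m}:  P = {p_target_arcface(math.pi / 4, s, m):.4f}")
```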

What should be reconsidered here is the original purpose of ArcFace. ArcFace is not really interested in classification; what it wants is that the angle $\theta$ between two extracted feature vectors $\vec{x_1}$ and $\vec{x_2}$ tells us whether they are close. In other words, the size of $\theta_{i,y_i}$ is what should govern the length of the gradient (that is, $P_{i,j}$), which matters in backpropagation, and yet the hyperparameters $s$ and $m$ are strongly involved. One more thing: the value of $P_{i,y_i}$ also changes with the number of classes $C$. As the number of classes $C$ increases, the probability assigned to each incorrect class $P_{i,j}$ naturally tends to decrease. This makes sense for a closed-set classification problem, but it is not appropriate for face recognition, which is an open-set problem.

There is also $\nabla f(\cos\theta_{i,j})$ as a factor that affects the length of the gradient. This depends on the type of cosine-based Softmax loss.

For a cosine-based Softmax loss with no angular margin:

\nabla f(\cos\theta_{i,j}) = s \tag{11}

For CosFace:

\nabla f(\cos\theta_{i,j}) = s \tag{12}

For ArcFace:

\nabla f(\cos\theta_{i,j}) = s \cdot \frac{\sin(\theta_{i,j} + m)}{\sin\theta_{i,j}} \tag{13}


In other words, for the no-margin cosine-based Softmax loss and for CosFace, the contribution of $\nabla f(\cos\theta_{i,j})$ to the length of the gradient is always the constant $s$. In ArcFace, however, this contribution is negatively correlated with $\theta_{i,j}$, which is quite counter-intuitive: when $\theta_{i,y_i}$ is small, the factor coming from $P_{i,y_i}$ generally shrinks the gradient, yet ArcFace's $\nabla f$ acts to enlarge it. In short, ArcFace gives no clear geometric meaning to how $\nabla f(\cos\theta_{i,j})$ should affect the length of the gradient.

[Figure: $\nabla f(\cos\theta_{i,j})$ as a function of $\theta_{i,j}$ for each cosine-based Softmax loss]
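The three factors in equations $(11)$–$(13)$ can also be compared with a small calculation (illustrative values of $s$ and $m$ chosen by me):

```python
import math

s, m = 30.0, 0.5
for deg in (20, 45, 70, 90):
    theta = math.radians(deg)
    no_margin = s                                            # Eq. (11)
    cosface = s                                              # Eq. (12)
    arcface = s * math.sin(theta + m) / math.sin(theta)      # Eq. (13)
    print(f"theta={deg:>2}deg  no-margin={no_margin:.1f}  cosface={cosface:.1f}  arcface={arcface:.1f}")
```

The ArcFace factor shrinks as $\theta_{i,j}$ grows, which is the negative correlation mentioned above.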

P2SGrad's proposal for determining the length of the gradient

From here, I will finally explain the new method, **P2SGrad**, which determines the length of the gradient from $\theta_{i,j}$ alone, without depending on the hyperparameters, the number of classes, or the logits. P2SGrad revisits equations $(4)$ and $(5)$ and computes the gradient directly, without defining a specific loss function. First, the gradient direction is already the optimal one, as explained above, so it is left unchanged. Next, consider the length part of the gradient, namely $P_{i,j}$ and $\nabla f(\cos\theta_{i,j})$. Since the whole point is to remove the hyperparameters, $\nabla f(\cos\theta_{i,j})$ is simply set to $1$.

Next, instead of $P_{i,j}$, consider using $\cos\theta_{i,j}$. There are three reasons:

- $P_{i,j}$ and $\cos\theta_{i,j}$ have the same theoretical range of values, $[0, 1]$ (here $\theta_{i,j} \in [0, \pi/2]$).
- Unlike $P_{i,j}$, $\cos\theta_{i,j}$ is affected neither by the hyperparameters nor by the number of classes $C$.
- At inference time, face recognition is performed based on the similarity $\cos\theta$; $P_{i,j}$ is merely a probability that only applies to closed-set classification tasks.

With these in mind, P2SGrad is formulated by replacing equations $(4)$ and $(5)$ with equations $(14)$ and $(15)$, respectively.

\tilde{G}_{P2SGrad}(\vec{x_i}) = \sum_{j=1}^{C}(\cos\theta_{i,j} - \mathbb{1}(y_i=j)) \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{x_i}} \tag{14}
\tilde{G}_{P2SGrad}(\vec{W_j}) = (\cos\theta_{i,j} - \mathbb{1}(y_i=j)) \cdot \frac{\partial \cos\theta_{i,j}}{\partial \vec{W_j}} \tag{15}

This P2SGrad is concise but very rational. If $y_i = j$, the length of the gradient is positively correlated with $\theta_{i,j}$, and if $j \neq y_i$, it is negatively correlated.

Finally, the training procedure. In ordinary training, the cross-entropy loss of equation $(3)$ is computed in the forward pass, and then backpropagation applies equations $(4)$ and $(5)$ from the final layer backward to obtain, in order, the gradients of the weights and biases of each layer. With P2SGrad, equations $(14)$ and $(15)$ are computed directly from the features $\vec{x_i}$ obtained in the forward pass; after that, they are backpropagated in the same way as in ordinary training to obtain the gradients of each layer's weights and biases in order.
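In an autograd framework this amounts to backpropagating the cosine matrix with a hand-specified upstream gradient. Here is a rough PyTorch sketch of one training step (my own reading of equations $(14)$ and $(15)$; whether the batch average is taken exactly like this is an assumption on my part):

```python
import torch
import torch.nn.functional as F

def p2sgrad_step(features, weight, labels, optimizer):
    """One P2SGrad-style update: no scalar loss is defined; instead the
    upstream gradient of cos(theta_{i,j}) is set to (cos(theta_{i,j}) - 1(y_i=j))."""
    cos = F.normalize(features, dim=1) @ F.normalize(weight, dim=1).t()  # (B, C)
    onehot = F.one_hot(labels, num_classes=weight.size(0)).float()
    upstream = (cos.detach() - onehot) / features.size(0)   # batch-averaged (assumption)

    optimizer.zero_grad()
    cos.backward(gradient=upstream)   # fills .grad of weight and of earlier layers
    optimizer.step()
```

An ordinary optimizer step then consumes the accumulated `.grad` values as usual.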

Experiment

For detailed experimental results, see the original paper; for reference, here is one result on the LFW dataset. The CNN is ResNet-50, and the figure shows the relationship between the training iterations and the mean value of $\theta_{i,y_i}$.

[Figure: mean $\theta_{i,y_i}$ versus training iterations for P2SGrad and other cosine-based Softmax losses]

You can see that training with P2SGrad decreases $\theta_{i,y_i}$ faster than the other cosine-based Softmax losses.
