Introduction

Deep learning has long since established itself in the field of machine learning. In this article I give a brief overview of Deep Metric Learning, which is being applied in many fields thanks to its high performance and versatility, along with some demos. Since handwritten character recognition alone is not very interesting, we also perform anomaly detection.

Deep Metric Learning

Metric Learning, also called "distance learning", is a method that learns a transformation (mapping) from the input feature space to a feature space that reflects the similarity of the data.

In a nutshell, it learns a transformation to a feature space in which:

- Data belonging to the same class are close together
- Data belonging to different classes are far apart

In classification, even when classes lie so close together in the input space that they are hard to separate, learning a feature space in which the same class is close and different classes are far apart can improve classification accuracy.

Metric Learning itself is an old method; Deep Metric Learning designs this transformation non-linearly with a Deep Neural Network. Because it learns the "distance" between data, Deep Metric Learning is highly versatile, and thanks to its high performance it has a wide range of applications, such as:

- Information retrieval
- Image classification
- Face recognition (biometric recognition)
- Clustering
- Visualization
- Anomaly detection

It also appears frequently on Kaggle, the competition platform where data scientists from around the world compete on model performance. This time we will perform image classification and anomaly detection.

Demo 1. Handwriting recognition

We will tackle the familiar MNIST handwritten digit recognition task with Deep Metric Learning.

- Data: handwritten digit images (0-9), 28 x 28 pixels
- Training data: 50,000 images
- Test data: 10,000 images

Several Deep Metric Learning methods have been proposed. This time we use L2-constrained Softmax Loss because it trains quickly, performs well, and keeps the model itself simple.

L2-constrained Softmax Loss

L2-constrained Softmax Loss constrains the L2 norm of the output of the final layer of the Deep Neural Network to a constant $\alpha$, which is equivalent to embedding the input data on a hypersphere of radius $\alpha$.

The formula for L2-constrained Softmax Loss is:

$$
\begin{aligned}
\text{minimize } \quad & -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_{i}}^{T} f\left(\mathbf{x}_{i}\right)+b_{y_{i}}}}{\sum_{j=1}^{C} e^{W_{j}^{T} f\left(\mathbf{x}_{i}\right)+b_{j}}} \\
\text{subject to } \quad & \left\|f\left(\mathbf{x}_{i}\right)\right\|_{2}=\alpha, \quad \forall i=1,2, \dots, M
\end{aligned}
$$

By constraining the data to lie on a hypersphere, the network can learn so that the cosine similarity between data of the same class is large and the cosine similarity between data of different classes is small.
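The link between the hypersphere constraint and cosine similarity can be illustrated with a minimal NumPy sketch. The function names and toy vectors are hypothetical and are not taken from the article's code; the point is only that, once every vector has norm $\alpha$, the dot product equals $\alpha^2$ times the cosine similarity.

```python
import numpy as np

def l2_constrain(features, alpha):
    """Rescale each row so that its L2 norm equals alpha."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return alpha * features / np.maximum(norms, 1e-12)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy features: rows 0 and 1 point the same way but have very different norms
features = np.array([[10.0, 0.5], [0.1, 0.005], [-0.3, 2.0]])
alpha = 16.0
embedded = l2_constrain(features, alpha)

# On the hypersphere, dot product / alpha^2 is exactly the cosine similarity
dot_01 = float(np.dot(embedded[0], embedded[1])) / alpha**2
dot_02 = float(np.dot(embedded[0], embedded[2])) / alpha**2
print(dot_01, cosine_similarity(features[0], features[1]))
print(dot_02, cosine_similarity(features[0], features[2]))
```

After the constraint, the large difference in norm between rows 0 and 1 no longer matters; only the direction of each vector does.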

With ordinary Softmax Loss, taking face photos as an example, the L2 norm of the features tends to be large for easy images, such as a face looking straight ahead, and small for images from which features are hard to extract, such as a face turned sideways or lying down. As a result, training is dominated by the easy, front-facing images, and the difficult images with a small L2 norm tend to be ignored.

(Figure: l2ノルム.jpeg)

L2-constrained Softmax Loss overcomes this problem by making the L2 norm constant regardless of the data, making the effect of all data on Loss uniform.

Implementing L2-constrained Softmax Loss is very easy: add an L2 normalization layer and a layer that scales by the constant $\alpha$ to the output of the final layer of the Deep Neural Network, then compute the Softmax Loss as usual.
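As a rough illustration of the two added steps (normalize, then scale by $\alpha$), here is a minimal NumPy sketch of the loss on a small batch. It is not the article's Keras code; the function name, array shapes and random inputs are assumptions made for the example.

```python
import numpy as np

def l2_constrained_softmax_loss(features, labels, W, b, alpha=16.0):
    """Mean cross-entropy after rescaling each feature vector to L2 norm alpha.

    features: (M, D) outputs f(x_i) of the final feature layer
    labels:   (M,) integer class ids y_i
    W:        (D, C) classifier weights, b: (C,) biases
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    constrained = alpha * features / np.maximum(norms, 1e-12)  # ||f(x_i)||_2 = alpha
    logits = constrained @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 64))        # hypothetical 64-dimensional features
labels = np.array([0, 3, 3, 9])
W = rng.normal(scale=0.1, size=(64, 10))
b = np.zeros(10)

loss = l2_constrained_softmax_loss(features, labels, W, b)
loss_rescaled = l2_constrained_softmax_loss(100.0 * features, labels, W, b)
print(loss, loss_rescaled)
```

Multiplying the raw features by 100 leaves the loss unchanged, which is the sense in which every sample contributes on an equal footing regardless of its original norm.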

Model

The implementation uses Keras on Google Colaboratory. As shown below, we stack three convolution layers and insert the L2 constraint (a Lambda layer) between the FC (fully connected) layer and Softmax. This L2 constraint is the only difference from an ordinary Convolutional Neural Network (CNN).

**Model overview**

**Model output in Keras**

(Figure: キャプチャ.PNG)

**Training parameters**

- Number of epochs: 15
- Batch size: 128
- Hypersphere radius $\alpha$: 16

Visualization with t-SNE

To confirm the effect of Deep Metric Learning, let's use t-SNE to reduce the input feature space and the transformed feature space to two dimensions and visualize them. t-SNE is a dimensionality reduction algorithm that tries to preserve the "closeness" of data in the high-dimensional space in the low-dimensional space. Because it does not use class labels, the plot shows purely how well the data are separated in the high-dimensional space.
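For readers who want to reproduce this kind of plot, the sketch below runs scikit-learn's `TSNE` on synthetic 64-dimensional clusters. The data are a stand-in for the final-layer features, and the `perplexity`, `init` and `random_state` values are assumptions for the example, not the settings used in the article.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for 64-dimensional final-layer features: three synthetic clusters
centers = rng.normal(scale=10.0, size=(3, 64))
features = np.vstack([c + rng.normal(size=(30, 64)) for c in centers])
cluster_ids = np.repeat(np.arange(3), 30)  # used only for coloring a plot

tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedded_2d = tsne.fit_transform(features)  # labels are never passed to t-SNE
print(embedded_2d.shape)
```

The resulting 2D points can then be drawn as a scatter plot colored by `cluster_ids` (or by digit class for MNIST features).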

**1. Visualization of the input space (784 dimensions → 2 dimensions)**

(Figure: 入力空間.png)

Each point corresponds to one image, and points of the same color belong to the same digit class. Even in the input space the classes are roughly separated, but there is a lot of overlap and scatter.

**2. Visualization of the final layer of an ordinary CNN (64D → 2D)**

CNNs perform well to begin with, so even with an ordinary CNN the final layer is fairly well separated (a cluster forms for each class). If you look closely, however, there are quite a few outliers scattered among the clusters.

**3. Visualization of the final layer with L2-constrained Softmax Loss (64D → 2D)**

(Figure: l2_圧縮.png)

With L2-constrained Softmax Loss, the clusters are more clearly separated than with the ordinary CNN. Thanks to the normalization of the L2 norm, all of the data contribute to training, and there are almost no outliers. You can see why Deep Metric Learning is called "distance learning".

Identification result

The classification results for the 10,000 test images are shown below. As the clean separation in the final-layer visualization suggests, accuracy improves simply by adding the L2 normalization layer and the scaling layer to the same CNN model. Both accuracy and loss change smoothly during training, and both end up better than with the plain CNN (no L2 constraint).

| Method | CNN (no L2 constraint) | L2-constrained Softmax Loss |
|---|---|---|
| Identification rate (%) | 99.01 | 99.31 |
| Training curve (accuracy) | | |
| Training curve (loss) | | |

For reference, the 69 images that L2-constrained Softmax Loss failed to classify are shown below.

- Pred: value predicted with L2-constrained Softmax Loss
- True: correct label

(Figure: 失敗画像.png)

Many of these mistakes are understandable, and for quite a few I would want to object to the label myself. Even a human would struggle to score 100% here. Accuracy would probably improve with more training data for the weak cases.

Demo in Flask (identification)

Since I had gone to the trouble of training the model, I built a demo with Flask that runs it in real time. You can see that it classifies digits reliably. With an identification rate of 99.3% to begin with, it should be fine as long as the digit is not written very strangely.

Demo 2. Anomaly detection

Classification is solid, but as it stands, inputs that are clearly not digits are forcibly assigned to one of the digit classes, as shown below.

When the input is not a digit, we want the model to say so rather than output the closest digit. We use anomaly detection to reject non-digits while preserving classification performance.

As mentioned at the beginning, Deep Metric Learning can also be applied to anomaly detection: performing anomaly detection in a feature space that was learned to reflect the similarity of the data gives higher accuracy than performing it in the input space. Since Deep Metric Learning itself is distance learning rather than anomaly detection, a separate method is needed for the detection step. This time we use Local Outlier Factor (LOF).

Local Outlier Factor (LOF)

LOF is an anomaly detection method that focuses on the density of data in space. It is like an advanced version of k-nearest neighbor (kNN): whereas kNN cannot account for differences in spread between clusters, LOF looks at the **local density** around each point, computed from its k nearest neighbors, and can therefore detect anomalies while taking the distribution of the data into account.

Local density = 1 / average distance to k nearby points

The formulas and details are omitted, but in the case shown in the figure below, data A is far from the neighboring cluster, so we want to judge it abnormal, while data B lies within the same distribution as the neighboring cluster, so we want to judge it normal. However, B is farther from its k nearest neighbors than A is, so kNN cannot handle this case. LOF, on the other hand, sets the anomaly threshold based on the density of the surrounding data, so it can.
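The A/B situation can be reproduced with a small NumPy sketch. It uses a simplified score (the mean local density of a point's neighbors divided by the point's own local density) built from the "1 / average distance to k nearby points" definition above. This is an assumption-laden simplification, not the full LOF, which uses reachability distances; the grids, points and helper names are hypothetical.

```python
import numpy as np

def knn_mean_distance(point, data, k, exclude_self=False):
    """Mean distance from `point` to its k nearest rows of `data`, plus their indices."""
    d = np.linalg.norm(data - point, axis=1)
    order = np.argsort(d)
    if exclude_self:
        order = order[1:]  # drop the zero distance to the point itself
    idx = order[:k]
    return d[idx].mean(), idx

def simplified_lof(query, data, k=3):
    """Neighbors' mean local density divided by the query's own local density."""
    mean_dist, idx = knn_mean_distance(query, data, k)
    own_density = 1.0 / mean_dist
    neighbor_density = np.mean(
        [1.0 / knn_mean_distance(data[i], data, k, exclude_self=True)[0] for i in idx]
    )
    return neighbor_density / own_density

def grid(origin, spacing, n=5):
    axis = origin + spacing * np.arange(n)
    return np.array([(x, y) for x in axis for y in axis])

# A tight cluster (spacing 0.1) and a loose cluster (spacing 2.0)
train = np.vstack([grid(0.0, 0.1), grid(10.0, 2.0)])
point_a = np.array([1.0, 0.2])    # just outside the tight cluster
point_b = np.array([13.0, 13.0])  # inside the loose cluster

dist_a, _ = knn_mean_distance(point_a, train, k=3)
dist_b, _ = knn_mean_distance(point_b, train, k=3)
score_a = simplified_lof(point_a, train)
score_b = simplified_lof(point_b, train)
print(dist_a, dist_b)    # B is farther from its neighbors than A
print(score_a, score_b)  # yet only A has a density ratio well above 1
```

A pure distance threshold would flag B before A, whereas the density ratio flags A and leaves B alone.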

Model

Anomaly detection is performed by applying LOF to the output of the L2-constrained layer of the model trained for MNIST classification (the Deep Neural Network does not need to be retrained). If LOF judges the input abnormal, an error is output; if it judges the input normal, the Softmax classification result is output as before.

LOF is implemented with scikit-learn and fitted with the following parameters.

- n_neighbors: 20
- contamination: 0.001
- novelty: True
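A minimal sketch of this pipeline with scikit-learn's `LocalOutlierFactor` and the parameters listed above might look as follows. The features are synthetic points on a hypersphere of radius 16 standing in for the real L2-constrained layer outputs, the Softmax outputs are random placeholders, and `classify_or_reject` is a hypothetical helper, so treat this as an illustration rather than the article's actual code.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
ALPHA = 16.0  # hypersphere radius from the article

def on_sphere(x):
    """Rescale each row to L2 norm ALPHA, as the L2-constrained layer would."""
    return ALPHA * x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic stand-in for the 64-dimensional features of MNIST training images
direction = np.zeros(64)
direction[0] = 1.0
train_features = on_sphere(direction + 0.05 * rng.normal(size=(500, 64)))

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.001, novelty=True)
lof.fit(train_features)

def classify_or_reject(features, softmax_probs):
    """Return the predicted digit as a string, or "Not Digit" when LOF flags the input."""
    is_normal = lof.predict(features) == 1  # +1 = inlier, -1 = outlier
    digits = softmax_probs.argmax(axis=1)
    return [str(d) if ok else "Not Digit" for d, ok in zip(digits, is_normal)]

digit_like = on_sphere(direction + 0.05 * rng.normal(size=(200, 64)))
non_digit = on_sphere(-direction + 0.05 * rng.normal(size=(5, 64)))
fake_probs = rng.dirichlet(np.ones(10), size=205)  # placeholder Softmax outputs

accepted = classify_or_reject(digit_like, fake_probs[:200])
rejected = classify_or_reject(non_digit, fake_probs[200:])
print(accepted[:5], rejected)
```

Setting `novelty=True` is what allows `predict` to be called on data that was not part of the fit, which matches the use case of scoring new inputs at inference time.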

Data

The following two types of data sets are used as targets for anomaly detection.

| Dataset | Fashion-MNIST | Cifar-10 |
|---|---|---|
| Overview | Fashion image dataset of shirts, bags, shoes, etc. (10 classes) | Natural image dataset of airplanes, cars, dogs, etc. (10 classes) |
| Image example | | |

Neither dataset contains digits. Fashion-MNIST consists of 28 x 28 pixel grayscale images, so it is used as is; Cifar-10 consists of 32 x 32 pixel color images, so they are converted to grayscale and resized. Since the Deep Neural Network was trained only on handwritten MNIST digits, both datasets are **unknown images** from its point of view. We test whether LOF in the feature space of the final layer correctly rejects these two datasets as anomalies.
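The article does not say how the Cifar-10 images were converted, so the following is only one plausible sketch: a weighted grayscale conversion followed by nearest-neighbor resizing to 28 x 28, written in plain NumPy. The luma weights, the [0, 1] scaling and the function name are all assumptions.

```python
import numpy as np

def to_mnist_format(rgb_image, size=28):
    """Convert an (H, W, 3) uint8 color image to a (size, size) grayscale image in [0, 1]."""
    rgb = rgb_image.astype(np.float64) / 255.0
    gray = rgb @ np.array([0.299, 0.587, 0.114])  # common luma weights (assumption)
    h, w = gray.shape
    rows = np.arange(size) * h // size  # nearest-neighbor source row indices
    cols = np.arange(size) * w // size  # nearest-neighbor source column indices
    return gray[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
fake_cifar = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
converted = to_mnist_format(fake_cifar)
white = to_mnist_format(np.full((32, 32, 3), 255, dtype=np.uint8))
print(converted.shape)
```

In practice an image library with proper interpolation would likely give smoother results; the nearest-neighbor version is used here only to keep the example dependency-free.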

Identification result

As a benchmark, we use the result of applying LOF directly in the input feature space.

**LOF in the input feature space (benchmark)**

| Data | Judged normal | Judged abnormal |
|---|---|---|
| MNIST | 0.99 | 0.01 |
| Fashion-MNIST | 0.70 | 0.30 |
| Cifar-10 | 0.16 | 0.84 |

MNIST is recognized as normal 99% of the time, but the notion of normal is too broad: 70% of Fashion-MNIST and 16% of Cifar-10 are also judged normal.

**LOF on the final layer of Deep Metric Learning (L2-constrained Softmax Loss): the method used here**

| Data | Judged normal | Judged abnormal |
|---|---|---|
| MNIST | 0.99 | 0.01 |
| Fashion-MNIST | 0.12 | 0.88 |
| Cifar-10 | 0.05 | 0.95 |

While still judging 99% of MNIST as normal, the method judges 88% of Fashion-MNIST and 95% of Cifar-10 as abnormal.

If you want to reject more anomalies, you can increase contamination (the assumed proportion of outliers in the training data). The results with contamination set to 0.01 are as follows.

| Data | Judged normal | Judged abnormal |
|---|---|---|
| MNIST | 0.96 | 0.04 |
| Fashion-MNIST | 0.02 | 0.98 |
| Cifar-10 | 0.00 | 1.00 |

4% of the MNIST data is now judged abnormal, but 98% of Fashion-MNIST and all Cifar-10 images are judged abnormal. As the misclassified images showed, MNIST contains data that is hard to identify in the first place, so considering anomaly detection accuracy, this setting seems better for practical use.
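To see why raising contamination rejects more inputs, it helps to view it as a percentile cut on the training scores: to my understanding, scikit-learn's LOF places its decision threshold at the `100 * contamination` percentile of the training scores. The sketch below mimics that idea in NumPy with made-up scores (higher means more normal); the score distribution and the `threshold_for` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anomaly scores of the training data: higher = more normal
train_scores = -1.0 - np.abs(rng.normal(scale=0.1, size=10_000))

def threshold_for(contamination, scores):
    """Score below which the lowest `contamination` fraction of the training data falls."""
    return float(np.percentile(scores, 100.0 * contamination))

thresholds = {}
flagged = {}
for contamination in (0.001, 0.01):
    t = threshold_for(contamination, train_scores)
    thresholds[contamination] = t
    flagged[contamination] = float(np.mean(train_scores < t))
    print(contamination, round(t, 3), flagged[contamination])
```

A larger contamination moves the threshold toward the bulk of the normal data, so more borderline digits are rejected along with more non-digits, which is the trade-off visible in the tables above.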

Demonstration in Flask (identification + anomaly detection)

I also built a Flask demo of this that runs in real time. Inputs that are not digits are rejected as abnormal (Not Digit), while digits are still classified correctly.

Summary

These simple demos showed that distance learning with Deep Metric Learning not only improves classification accuracy but also makes the model easy to apply to anomaly detection. The concept is easy to understand, and L2-constrained Softmax Loss in particular has the advantage of being very easy to implement, since it only adds a constraint on the L2 norm.

In the future, I would like to introduce various methods while demonstrating as much as possible in this way.

The demo published at the link below currently uses a Deep Autoencoder, which is not as accurate as the anomaly detection presented here (Deep Metric Learning + Local Outlier Factor), so I would like to replace it with this approach when the time is right.

Qiita articles

- [A trap that anyone can easily create RPA](https://qiita.com/jw-automation/items/78d823a6cb278f7015b5)
- [Trap that RPA can be easily created by anyone who can build VBA](https://qiita.com/jw-automation/items/38c28016bf5162e76c59)
- [I made a UiPath coding check tool [RPA]](https://qiita.com/jw-automation/items/003dd0ef116cf968c3a8)
- [The reason why RPA is so good that you can deepen your understanding of RPA](https://qiita.com/jw-automation/items/836391dfde3fb1ac83d6)
- [Recommended books for RPA](https://qiita.com/jw-automation/items/3c141c50c7a163943fd9)
- [People who are suitable for RPA development, those who are not](https://qiita.com/jw-automation/items/828933731611b9ec601f)
- [I tried to automate sushi making](https://qiita.com/jw-automation/items/a906f7a79e72add7b577)

Demos

- UiPathCodingChecker: Analyze code from UiPath xaml files
- AI Demos: Handwritten character recognition by Deep Learning, anomaly detection, image denoising
- Automation of sushi making (YouTube): Automation of typing game sushi making with RPA x OCR
