[PYTHON] [AI] Deep Metric Learning

Introduction

It has been a long time since Deep Learning firmly established its position in the field of machine learning. This time, I would like to give a brief summary and some demos of Deep Metric Learning, which is being applied in various fields thanks to its high performance and versatility. Since handwritten character recognition alone is not very interesting, we also perform anomaly detection.

Deep Metric Learning

Metric Learning, also called "distance learning", is a method of learning a transformation (mapping) from the input feature space to a feature space that reflects the similarity of the data.

In a nutshell

- Data belonging to the same class is placed close together
- Data belonging to different classes is placed far apart

In other words, it learns a transformation into a feature space with these two properties.

In classification, even when classes lie so close together that they are difficult to tell apart, identification accuracy can be improved by learning a feature space in which the same class is close together and different classes are far apart.

Metric Learning itself is an old method. Deep Metric Learning designs this transformation non-linearly with a Deep Neural Network. Because it learns the "distance" between data, Deep Metric Learning is highly versatile, and thanks to its high performance it has a wide range of applications.

It frequently appears on Kaggle, the competition platform where data scientists from all over the world compete on model performance. This time we will use it for image classification and anomaly detection.

Demo 1. Handwriting recognition

We will do the familiar MNIST handwritten digit recognition with Deep Metric Learning.

- Data: 28 x 28 pixel images of handwritten digits 0-9
- Training data: 50,000 images
- Test data: 10,000 images

Several Deep Metric Learning methods have been proposed. This time, we will use L2-constrained Softmax Loss because of its training speed, its high performance, and the simplicity of the model itself.

L2-constrained Softmax Loss

L2-constrained Softmax Loss constrains the L2 norm of the output of the final layer of the Deep Neural Network to a constant $\alpha$. This is equivalent to embedding the input data on a hypersphere of radius $\alpha$.

The formula for L2-constrained Softmax Loss is:

$$
\begin{aligned}
\text{minimize } \quad & -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_{i}}^{T} f\left(\mathbf{x}_{i}\right)+b_{y_{i}}}}{\sum_{j=1}^{C} e^{W_{j}^{T} f\left(\mathbf{x}_{i}\right)+b_{j}}}\\
\text{subject to } \quad & \left\|f\left(\mathbf{x}_{i}\right)\right\|_{2}=\alpha, \quad \forall i=1,2, \dots, M
\end{aligned}
$$

By constraining the data to lie on the hypersphere, the network learns so that the cosine similarity between data of the same class is large and the cosine similarity between data of different classes is small.
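As a short worked step, here is why a fixed norm turns the problem into one about angles. The angle symbols $\theta_{ik}$ and $\phi_{ij}$ are notation introduced here for illustration and do not appear in the original formula. For two features that both satisfy the constraint $\|f(\mathbf{x})\|_2 = \alpha$:

$$
\left\|f(\mathbf{x}_i) - f(\mathbf{x}_k)\right\|_2^2 = 2\alpha^2 \left(1 - \cos\theta_{ik}\right)
$$

where $\theta_{ik}$ is the angle between $f(\mathbf{x}_i)$ and $f(\mathbf{x}_k)$. Likewise, each term in the exponent of the loss becomes

$$
W_j^{T} f(\mathbf{x}_i) + b_j = \alpha \left\|W_j\right\|_2 \cos\phi_{ij} + b_j
$$

where $\phi_{ij}$ is the angle between $W_j$ and $f(\mathbf{x}_i)$. With the norm fixed, the only thing about the feature that the loss can act on is its direction, so reducing the loss means increasing the cosine toward the correct class.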

With ordinary Softmax Loss, taking face photos as an example, the L2 norm of the feature tends to be large for an easy image such as a face looking straight ahead, and small for an image whose features are hard to capture, such as a face turned sideways or lying down. As a result, training is dominated by the easy frontal images, and the difficult images with small L2 norms tend to be ignored.

L2-constrained Softmax Loss overcomes this problem by making the L2 norm constant regardless of the data, making the effect of all data on Loss uniform.

The implementation of L2-constrained Softmax Loss itself is very easy: add an L2 normalization layer and a scaling layer with the constant $\alpha$ to the output of the final layer of the Deep Neural Network, then compute the Softmax Loss as usual.
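The two extra steps can be sketched in plain NumPy as follows. This is a minimal illustration rather than the article's actual code: the function name, the argument layout, and the small `eps` guard against division by zero are assumptions made for the example.

```python
import numpy as np

def l2_constrained_softmax_loss(features, weights, bias, labels, alpha=16.0, eps=1e-12):
    """Illustrative L2-constrained softmax loss (names are hypothetical).

    features: (M, D) final-layer outputs f(x_i)
    weights:  (D, C) classifier weights W
    bias:     (C,)   classifier bias b
    labels:   (M,)   integer class labels y_i
    """
    # Step 1: L2 normalization, so every feature has unit norm.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)
    # Step 2: scaling, so every feature lies on the sphere of radius alpha.
    constrained = alpha * unit
    # Ordinary softmax cross-entropy on top of the constrained features.
    logits = constrained @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(labels)), labels].mean()
    return loss, constrained
```

Because of the normalization step, multiplying `features` by any positive constant leaves the loss unchanged, which is exactly the "every sample contributes equally" property described above.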


Model

The implementation uses Keras on Google Colaboratory. The model stacks three convolution layers and inserts an L2-constrained layer (a Lambda layer) between the fully connected (FC) layer and Softmax. This L2-constrained layer is the only difference from a normal Convolutional Neural Network (CNN).
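A model of this shape might look roughly like the sketch below. Only the overall structure (three convolution layers, an FC layer, a Lambda layer for the L2 constraint, then Softmax), the 64-dimensional feature, and $\alpha = 16$ come from the article. The filter counts, kernel sizes, pooling, activations, optimizer, and the function name are all assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_l2_constrained_cnn(alpha=16.0, feature_dim=64, num_classes=10):
    """Hypothetical CNN with an L2-constrained layer before the classifier."""
    inputs = layers.Input(shape=(28, 28, 1))
    # Three convolution layers (filter counts and pooling are assumptions).
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.Flatten()(x)
    # FC layer producing the feature vector.
    x = layers.Dense(feature_dim)(x)
    # L2-constrained layer: normalize to unit length, then scale by alpha.
    x = layers.Lambda(
        lambda t: alpha * tf.math.l2_normalize(t, axis=1),
        name="l2_constrained",
    )(x)
    # Softmax classifier on top of the constrained feature.
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    # The optimizer is an assumption; the article does not state it.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Removing the `Lambda` layer gives the plain CNN used for comparison, so the two models differ in that single line.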

**Model overview** (figure)

**Model output in Keras** (figure)

**Training parameters**

- Number of epochs: 15
- Batch size: 128
- Hypersphere radius $\alpha$: 16

Visualization with t-SNE

To confirm the effect of Deep Metric Learning, let's use t-SNE to reduce the input space and the learned feature space to two dimensions and visualize them. t-SNE is a dimensionality reduction algorithm that tries to preserve the "closeness" of data in the high-dimensional space when mapping to the low-dimensional space. Because class labels are not used in the reduction, the plot shows purely how well the data is separated in the high-dimensional space.
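The reduction step can be sketched with scikit-learn's `TSNE`. The article does not say which t-SNE implementation or settings were used, so the library choice, the default perplexity, and the synthetic stand-in features below are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, seed=0):
    """Reduce (N, D) features to (N, 2) with t-SNE. Labels are never passed in."""
    tsne = TSNE(n_components=2, random_state=seed)
    return tsne.fit_transform(features)

# Synthetic stand-in for 64-dimensional final-layer features of two classes.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 64)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 64)),
]).astype(np.float32)

points = embed_2d(features)  # shape (200, 2), ready for a scatter plot
```

For the plots in this section, `features` would be the flattened 784-dimensional images or the 64-dimensional outputs of the final layer, and the class labels would only be used afterwards to color the points.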

**1. Visualization of the input space (784D → 2D)**

Each point corresponds to one image, and points of the same color belong to the same digit class. Even in the input space the classes are roughly separated, but there is a lot of overlap and scatter.

**2. Visualization of the final layer of a normal CNN (64D → 2D)**

Since CNNs themselves perform very well, even a normal CNN separates the classes fairly well in the final layer, with a cluster formed for each class. Looking closely, however, there are quite a few small outliers scattered away from their clusters.

**3. Visualization of the final layer with L2-constrained Softmax Loss (64D → 2D)**

With L2-constrained Softmax Loss, the clusters are more clearly separated than with the normal CNN. Thanks to the normalization of the L2 norm, all the data contributes to training, and there are almost no outliers. You can see why Deep Metric Learning is called "distance learning".

Identification result

The identification results for the 10,000 test images are as follows. As the high degree of separation in the final-layer visualization suggested, accuracy improves simply by adding the L2 normalization layer and the scaling layer to the same CNN model. The training curves for both accuracy and loss evolve smoothly, and both end up better than the CNN without the L2 constraint.

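| Method | CNN (no L2 constraint) | L2-constrained Softmax Loss |
|---|---|---|
| Identification rate (%) | 99.01 | 99.31 |
| Training curve (accuracy) | (figure) | (figure) |
| Training curve (loss) | (figure) | (figure) |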

Incidentally, L2-constrained Softmax Loss failed to identify 69 images. For each of them, the labels mean the following.

- Pred: value predicted by L2-constrained Softmax Loss
- True: correct label

Quite a few of them make me want to say "I can see why it got that one wrong!" Even a human would struggle to get a perfect score. Accuracy would probably increase with more training data for the weak cases.

Demo in Flask (identification)

Since I had a trained model anyway, I made a demo with Flask that processes input in real time. You can see that it identifies the digits reliably. The identification rate is 99.3% to begin with, so it seems to cope with anything that is not a very strangely written digit.

Demo 2. Anomaly detection

Although classification works well, in its current state the model forcibly assigns inputs that are clearly not digits to one of the ten classes.

When something that is not a digit is input, we want the model to say that it is not a digit, rather than output whichever digit is closest. We use anomaly detection to reject non-digit inputs while preserving the ability to classify digits.

As mentioned at the beginning, Deep Metric Learning can also be applied to anomaly detection. Performing anomaly detection in a feature space that has been learned to reflect the similarity of the data can give higher accuracy than performing it in the input space. Since Deep Metric Learning itself is distance learning rather than an anomaly detection method, a separate method is needed for the detection step. This time, we will use Local Outlier Factor (LOF).

Local Outlier Factor (LOF)

LOF is an anomaly detection method that focuses on the density of data in space. It is like an advanced version of k-nearest neighbor (kNN): while kNN cannot account for differences in spread from cluster to cluster, LOF looks at the **local density** computed from each point's k nearest neighbors, which makes it possible to detect anomalies while taking the distribution of the data into account.

Local density = 1 / average distance to k nearby points

The formulas and details are omitted here. Consider, for example, a data point A that is far from its neighboring cluster, so we want to judge it as abnormal, and a data point B that lies within the same distribution as its neighboring cluster, so we want to judge it as normal. In terms of distance to the k nearest neighbors, however, B is farther than A, so kNN cannot handle this case. LOF, on the other hand, determines the anomaly threshold relative to the density of the surrounding data, so it can handle such cases.
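The A and B situation can be reproduced numerically with the simplified local density defined above. Note that this is not the full LOF algorithm (which uses reachability distances); it is a sketch of the idea, and the grid layout, the coordinates of A and B, and all function names are made up for the example.

```python
import numpy as np

def knn(query, data, k, skip_self=False):
    """Indices and distances of the k nearest points in data."""
    dist = np.linalg.norm(data - query, axis=1)
    order = np.argsort(dist)
    if skip_self:  # query is itself a row of data, so drop the zero distance
        order = order[1:]
    idx = order[:k]
    return idx, dist[idx]

def local_density(query, data, k, skip_self=False):
    """Local density = 1 / average distance to the k nearest points."""
    _, dist = knn(query, data, k, skip_self)
    return 1.0 / dist.mean()

def density_ratio(query, data, k):
    """Average density of the neighbors divided by the query's own density."""
    idx, _ = knn(query, data, k)
    neighbor_density = np.mean(
        [local_density(data[i], data, k, skip_self=True) for i in idx]
    )
    return neighbor_density / local_density(query, data, k)

# A tight cluster (spacing 0.1) and a loose cluster (spacing 2.0).
grid = np.array([(x, y) for x in range(5) for y in range(5)], dtype=float)
data = np.vstack([grid * 0.1, grid * 2.0 + 20.0])

A = np.array([1.4, 0.2])    # outside the tight cluster
B = np.array([25.0, 25.0])  # inside the loose cluster

k = 3
dist_a = knn(A, data, k)[1].mean()
dist_b = knn(B, data, k)[1].mean()
ratio_a = density_ratio(A, data, k)
ratio_b = density_ratio(B, data, k)
print(f"kNN distance: A={dist_a:.2f}, B={dist_b:.2f}")
print(f"density ratio: A={ratio_a:.2f}, B={ratio_b:.2f}")
```

By raw kNN distance B looks more anomalous than A, but the density ratio is far above 1 only for A, because A is much sparser than its own neighbors while B is about as sparse as its neighbors.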

Model

Anomaly detection is performed by applying LOF to the output of the L2-constrained layer of the model trained for MNIST classification (there is no need to retrain the Deep Neural Net). If LOF judges the input to be abnormal, an error is output; if it judges the input to be normal, the Softmax classification result is output as before.

LOF is implemented with scikit-learn and fitted with the following parameters.

- n_neighbors: 20
- contamination: 0.001
- novelty: True
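The combined "reject or classify" step might be wired up as in the sketch below, using scikit-learn's `LocalOutlierFactor` with the parameters listed above. The helper names, the use of `-1` as the error code, and the synthetic features standing in for real L2-constrained layer outputs are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def fit_lof(train_features):
    """Fit LOF on features of the training digits (parameters from the article)."""
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.001, novelty=True)
    lof.fit(train_features)
    return lof

def classify_or_reject(lof, features, softmax_probs, reject_label=-1):
    """Return the predicted digit, or reject_label where LOF flags an anomaly."""
    is_normal = lof.predict(features) == 1  # predict gives +1 (inlier) or -1 (outlier)
    digits = softmax_probs.argmax(axis=1)
    return np.where(is_normal, digits, reject_label)

# Synthetic stand-in: features on a sphere of radius 16, clustered in one direction.
rng = np.random.default_rng(0)
center = np.zeros(64)
center[0] = 1.0
raw = center + 0.02 * rng.normal(size=(500, 64))
train_features = 16.0 * raw / np.linalg.norm(raw, axis=1, keepdims=True)

lof = fit_lof(train_features)

# One query on the opposite side of the sphere, far from all training features.
far_feature = (-16.0 * center)[np.newaxis, :]
fake_probs = rng.dirichlet(np.ones(10), size=1)  # placeholder softmax output
result = classify_or_reject(lof, far_feature, fake_probs)
```

With `novelty=True`, the estimator is fitted once on normal data and then `predict` is called on new, unseen inputs, which matches how the trained network is reused here without retraining.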

Data

The following two types of data sets are used as targets for anomaly detection.

| Dataset | Fashion-MNIST | Cifar-10 |
|---|---|---|
| Overview | Fashion image dataset of shirts, bags, shoes, etc. (10 classes) | Natural image dataset of airplanes, cars, dogs, etc. (10 classes) |
| Image example | (figure) | (figure) |

Neither dataset contains digit images. Fashion-MNIST consists of 28 x 28 pixel grayscale images, so it is used as is; Cifar-10 consists of 32 x 32 pixel color images, so they are converted to grayscale and resized. Since the Deep Neural Net was trained only on handwritten MNIST digits, both datasets are **unknown images** from the point of view of the Deep Neural Net. I would like to test whether these two datasets are correctly rejected as anomalies by LOF in the feature space of the final layer.
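The article does not say how the grayscale conversion and resizing were done, so the following is only one possible sketch. The luminance weights, the nearest-neighbor resampling, the scaling to the range 0 to 1, and the function name are all assumptions; an image library's resize function would be an equally reasonable choice.

```python
import numpy as np

def cifar_to_mnist_format(images_rgb, size=28):
    """Convert (N, 32, 32, 3) uint8 color images to (N, size, size, 1) floats."""
    # Grayscale using the common ITU-R BT.601 luminance weights (an assumption).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = images_rgb.astype(np.float32) @ weights      # (N, 32, 32)
    # Nearest-neighbor resize: pick size evenly spaced rows and columns.
    _, height, width = gray.shape
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = gray[:, rows][:, :, cols]                 # (N, size, size)
    # Scale to roughly [0, 1] and add the channel axis the CNN expects.
    return (resized / 255.0)[..., np.newaxis]

# Random stand-in for a batch of Cifar-10 images.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
converted = cifar_to_mnist_format(batch)
```

Whatever preprocessing is used, it should match the preprocessing applied to the MNIST training images, otherwise the difference in scaling alone could make inputs look anomalous.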

Identification result

As a benchmark, we use anomaly detection when LOF is applied in the input feature space.

**LOF in the input feature space: benchmark**

| Data | Judged normal | Judged abnormal |
|---|---|---|
| MNIST | 0.99 | 0.01 |
| Fashion-MNIST | 0.70 | 0.30 |
| Cifar-10 | 0.16 | 0.84 |

Although 99% of MNIST is recognized as normal, the range considered normal is too wide: 70% of Fashion-MNIST and 16% of Cifar-10 are also judged normal.

**LOF at the final layer of Deep Metric Learning (L2-constrained Softmax Loss): the method used here**

| Data | Judged normal | Judged abnormal |
|---|---|---|
| MNIST | 0.99 | 0.01 |
| Fashion-MNIST | 0.12 | 0.88 |
| Cifar-10 | 0.05 | 0.95 |

While maintaining 99% of the normal judgment of MNIST, 88% of Fashion-MNIST and 95% of Cifar-10 can be judged as abnormal.

If you want to reject more anomalies, you can increase contamination (the assumed ratio of outliers in the training data). The result with contamination set to 0.01 is as follows.

| Data | Judged normal | Judged abnormal |
|---|---|---|
| MNIST | 0.96 | 0.04 |
| Fashion-MNIST | 0.02 | 0.98 |
| Cifar-10 | 0.00 | 1.00 |

With this setting, 4% of the MNIST data is judged abnormal, but 98% of Fashion-MNIST and all of the Cifar-10 images are judged abnormal. As the misclassified images showed, MNIST contains data that is hard to identify in the first place, so considering the anomaly detection accuracy, this setting seems better for practical use.
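The effect of contamination on the rejection rate can be checked with a small experiment like the one below. The Gaussian data is a synthetic stand-in, not the article's features, and the helper name is hypothetical. In scikit-learn, contamination only moves the decision threshold; the LOF scores themselves stay the same.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def abnormal_rate(train, test, contamination):
    """Fraction of test points that LOF flags as abnormal."""
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination, novelty=True)
    lof.fit(train)
    return float(np.mean(lof.predict(test) == -1))

# Synthetic stand-in: training and test points from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 8))
test = rng.normal(size=(500, 8))

rates = {c: abnormal_rate(train, test, c) for c in (0.001, 0.01)}
for c, rate in rates.items():
    print(f"contamination={c}: abnormal rate={rate:.3f}")
```

Raising contamination can only keep or increase the share of normal data that is rejected, which is the same trade-off as in the tables above: more anomalies caught at the cost of more digits flagged.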

Demonstration in Flask (identification + anomaly detection)

I also made a demo of this that runs in real time with Flask. Inputs other than digits are rejected as abnormal (Not Digit), and you can confirm that digits are still identified as before.

Summary

These simple demos showed that distance learning with Deep Metric Learning not only improves identification accuracy but also makes the model easier to apply to anomaly detection. The concept is easy to understand, and L2-constrained Softmax Loss in particular has the advantage of being very easy to implement, because it only adds a constraint on the L2 norm.

In the future, I would like to introduce various other methods in the same way, with demos wherever possible.
