[Python] Anomaly detection by unsupervised learning: Mahalanobis distance (implementation)

As a continuation of the previous theory article, this time I will actually implement the Mahalanobis distance in Python. For the theory, see the previous article: Anomaly detection by unsupervised learning: Mahalanobis distance (theory).

Introduction

Python's scikit-learn library already implements a function that calculates the Mahalanobis distance; see the official example "Robust covariance estimation and Mahalanobis distances relevance".
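For reference, the library route looks roughly like this. This is a minimal sketch, assuming scikit-learn is installed and using EmpiricalCovariance from sklearn.covariance; the data X here is dummy data made up purely for illustration.

# Minimal sketch of the scikit-learn route (X is made-up dummy data)
import numpy as np
from sklearn.covariance import EmpiricalCovariance

X = np.random.randn(100, 3)          # 100 samples, 3 features
cov = EmpiricalCovariance().fit(X)   # estimate mean and covariance
d = np.sqrt(cov.mahalanobis(X))      # mahalanobis() returns squared distances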

However, an article that just says "run this code!" would be little more than a translation of the documentation. So this time I will implement it myself and trace how the calculation proceeds in the program.

The goal is that, after connecting all the code snippets at the end, you can calculate the distance yourself.

Environment

Assume that the file handled in this experiment has the following format: each column is a machine number, and each row is a score for that machine. (Some may object that it should be the other way around, but of course the reverse orientation works just as well.)

test.csv

1 2 3 4 5 6 7 8 9 10
a 0 6 7 2 3 3 1 0 0 1
b 1 1 11 6 0 2 1 4 1 2
c 2 12 32 5 0 1 3 4 1 1
d 3 3 7 3 2 2 2 1 2 5
e 4 6 6 3 5 1 1 1 1 3
f 5 7 9 5 0 2 2 1 1 2

In this data, the score looks extremely high only for machine number 3.

With the Mahalanobis distance, you set a certain threshold yourself and judge any record whose value exceeds that threshold as an anomaly (a minimal sketch of such a check follows below). Note that you have to set the threshold yourself each time. This time we will not set a threshold; instead, we will visualize the anomalous data by plotting it on a graph.
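For reference, the threshold check itself is a one-liner. This is a sketch only; both the distances and the threshold value here are made-up examples.

# Hypothetical sketch: mahal and threshold are example values
import numpy as np

mahal = np.array([5.4, 8.6, 28.5, 3.7])   # example distances
threshold = 10.0                           # chosen by you, each time
print(np.where(mahal > threshold)[0])      # -> [2], the anomalous record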

Implementation

The libraries to import are as follows.

  • numpy
  • scipy
  • pandas
  • matplotlib
  • pylab

The installation procedures are omitted this time; in general, pip or easy_install is the way to go.

# -*- coding: utf-8 -*-
import numpy as np
import scipy as sc
from scipy import linalg
from scipy import spatial
import scipy.spatial.distance
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager
import pylab

Variable definition

This time there are 6 rows and 10 columns, so define ROW and COLUMN as follows. Also read in the test.csv mentioned earlier.

ROW = 6
COLUMN = 10
csv = pd.read_csv('test.csv')

The variables that will be used later are also defined here.

# row: list of rows, column: list of columns, ave: mean vector,
# vcm: variance-covariance matrices
row = []
column = []
ave = [0.0 for i in range(ROW)]
vcm = np.zeros((COLUMN, ROW, ROW))
diff = np.zeros((1, ROW))
mahal = np.zeros(COLUMN)
tmp = np.zeros(ROW)

Delete missing values

Real-world data is rarely as fully populated as this test.csv. When there are missing values, they can be deleted as follows.

# Delete missing data
# axis=1 drops columns that contain missing values
trans_data = csv.dropna(axis=1)
print(trans_data)

If you print this,

  Unnamed: 0  1   2   3  4  5  6  7  8  9  10
0          a  0   6   7  2  3  3  1  0  0   1
1          b  1   1  11  6  0  2  1  4  1   2
2          c  2  12  32  5  0  1  3  4  1   1
3          d  3   3   7  3  2  2  2  1  2   5
4          e  4   6   6  3  5  1  1  1  1   3
5          f  5   7   9  5  0  2  2  1  1   2

It will be output like this.

Next, the data formatted in this way is converted to list form and stored in the row and column lists declared earlier.

# Store each row of trans_data in row, in list format
for i in range(ROW):
    row.append(list(trans_data.iloc[i]))
print(row)

# Store each column likewise (skipping the label column)
for i in range(1, COLUMN+1):
    column.append(list(trans_data.iloc[:, i]))
print(column)
#row calculation result

[['a', 0, 6, 7, 2, 3, 3, 1, 0, 0, 1], ['b', 1, 1, 11, 6, 0, 2, 1, 4, 1, 2], ['c', 2, 12, 32, 5, 0, 1, 3, 4, 1, 1], ['d', 3, 3, 7, 3, 2, 2, 2, 1, 2, 5], ['e', 4, 6, 6, 3, 5, 1, 1, 1, 1, 3], ['f', 5, 7, 9, 5, 0, 2, 2, 1, 1, 2]]

#column calculation result
[[0, 1, 2, 3, 4, 5], [6, 1, 12, 3, 6, 7], [7, 11, 32, 7, 6, 9], [2, 6, 5, 3, 3, 5], [3, 0, 0, 2, 5, 0], [3, 2, 1, 2, 1, 2], [1, 1, 3, 2, 1, 2], [0, 4, 4, 1, 1, 1], [0, 1, 1, 2, 1, 1], [1, 2, 1, 5, 3, 2]]

You can see that the data is stored row by row (and column by column) as nested lists.

Calculation of mean vector

In the theory article, I explained that the mean vector can be expressed by the following formula.

\mu = \frac{1}{m} \sum_{i=1}^m x_i

Written in Python, this looks as follows.

# Calculation of the mean value of each row
for i in range(ROW):
    # Slice off the label in the first element
    ave[i] = np.average(row[i][1:])
print(ave)
[2.2999999999999998, 2.8999999999999999, 6.0999999999999996, 3.0, 3.1000000000000001, 3.3999999999999999]

The averages are printed per row: the first value is the mean of row a, the second the mean of row b, and so on.
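Incidentally, the same mean vector can be obtained in one vectorized call. This sketch is equivalent to the loop above; ave_vec is a name introduced here purely for illustration.

# Equivalent vectorized form: drop the label column, average each row
ave_vec = trans_data.iloc[:, 1:].mean(axis=1).values
print(ave_vec)   # same six values as ave above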

Calculation of variance-covariance matrix

As mentioned in the previous article, the variance-covariance matrix is given by

\Sigma = \frac{1}{m}\sum_{i=1}^m (x_i - \mu)(x_i - \mu)^{\mathrm{T}}

We compute this in code as follows.

# Convert the lists to NumPy arrays so that NumPy operations can be used
column = np.array([column])
ave = np.array(ave)

# Find a covariance matrix for each column
# diff * np.swapaxes(diff, 0, 1) forms the outer product (x_i - mu)(x_i - mu)^T
for i in range(COLUMN):
    diff = (column[0][i] - ave)
    diff = np.array([diff])
    vcm[i] = (diff * np.swapaxes(diff, 0, 1)) / COLUMN

print(vcm)

The diff part corresponds to the difference term (x_i - \mu) in the formula. This difference is computed for each column, producing one matrix per column.

# vcm: matrix for the 1st column
[[[  5.29000000e-01   4.37000000e-01   9.43000000e-01  -0.00000000e+00
    -2.07000000e-01  -3.68000000e-01]
  [  4.37000000e-01   3.61000000e-01   7.79000000e-01  -0.00000000e+00
    -1.71000000e-01  -3.04000000e-01]
  [  9.43000000e-01   7.79000000e-01   1.68100000e+00  -0.00000000e+00
    -3.69000000e-01  -6.56000000e-01]
  [ -0.00000000e+00  -0.00000000e+00  -0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [ -2.07000000e-01  -1.71000000e-01  -3.69000000e-01   0.00000000e+00
     8.10000000e-02   1.44000000e-01]
  [ -3.68000000e-01  -3.04000000e-01  -6.56000000e-01   0.00000000e+00
     1.44000000e-01   2.56000000e-01]]

# vcm: matrix for the 2nd column
 [[  1.36900000e+00  -7.03000000e-01   2.18300000e+00   0.00000000e+00
     1.07300000e+00   1.33200000e+00]
  [ -7.03000000e-01   3.61000000e-01  -1.12100000e+00  -0.00000000e+00
    -5.51000000e-01  -6.84000000e-01]
  [  2.18300000e+00  -1.12100000e+00   3.48100000e+00   0.00000000e+00
     1.71100000e+00   2.12400000e+00]
  [  0.00000000e+00  -0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  1.07300000e+00  -5.51000000e-01   1.71100000e+00   0.00000000e+00
     8.41000000e-01   1.04400000e+00]
  [  1.33200000e+00  -6.84000000e-01   2.12400000e+00   0.00000000e+00
     1.04400000e+00   1.29600000e+00]]

…

# vcm: matrix for the 10th column
[[  1.69000000e-01   1.17000000e-01   6.63000000e-01  -2.60000000e-01
     1.30000000e-02   1.82000000e-01]
  [  1.17000000e-01   8.10000000e-02   4.59000000e-01  -1.80000000e-01
     9.00000000e-03   1.26000000e-01]
  [  6.63000000e-01   4.59000000e-01   2.60100000e+00  -1.02000000e+00
     5.10000000e-02   7.14000000e-01]
  [ -2.60000000e-01  -1.80000000e-01  -1.02000000e+00   4.00000000e-01
    -2.00000000e-02  -2.80000000e-01]
  [  1.30000000e-02   9.00000000e-03   5.10000000e-02  -2.00000000e-02
     1.00000000e-03   1.40000000e-02]
  [  1.82000000e-01   1.26000000e-01   7.14000000e-01  -2.80000000e-01
     1.40000000e-02   1.96000000e-01]]]

Find the Mahalanobis distance

The Mahalanobis distance can be calculated using the following formula.

\theta < \sqrt{(x_i - \mu)^{\mathrm{T}} \Sigma^{-1} (x_i - \mu)}

# Find the Mahalanobis distance
for i in range(COLUMN):
    # Take the pseudo-inverse (generalized inverse) of the covariance
    # matrix and transpose it for the calculation
    vcm[i] = sc.linalg.pinv(vcm[i])
    vcm[i] = vcm[i].transpose()
    # NOTE: vcm[i] is overwritten with the identity matrix here, so the
    # two lines above have no effect and the value computed below is
    # effectively the squared Euclidean distance
    vcm[i] = np.identity(ROW)
    # Generation of the difference vector
    diff = (column[0][i] - ave)
    # mahal[i] = diff^T . vcm[i] . diff (the squared distance)
    for j in range(ROW):
        tmp[j] = np.dot(diff, vcm[i][j])
    mahal[i] = np.dot(tmp, diff)
# Mahalanobis distance (printed after the square root taken at the start of the next section)
[  5.39258751   8.5720476   28.53559181   3.67151195   7.89176786
   5.88897275   4.72016949   5.00799361   6.7882251    5.8719673 ]

Now you can calculate the distance for the records in each column!
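By the way, scipy.spatial.distance (already imported above) provides a mahalanobis() function that computes the square-root form directly. A sketch, assuming the pooled covariance sigma from the comparison sketch above:

# SciPy route: mahalanobis(u, v, VI) = sqrt((u-v)^T . VI . (u-v))
VI = sc.linalg.pinv(sigma)   # (pseudo-)inverse of the covariance matrix
d = [scipy.spatial.distance.mahalanobis(x, ave, VI) for x in column[0]]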

Plotting the results

Let's plot the Mahalanobis distance obtained in this way on a graph.

# Take the square root to turn the squared values into distances
mahal = np.sqrt(mahal)

print("Mahalanobis distance")
print(mahal)
plt.bar(range(COLUMN), mahal)
plt.title("Mahalanobis distance")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("plot1.png")

plot1.png

#Mahalanobis distance
[  5.39258751   8.5720476   28.53559181   3.67151195   7.89176786
   5.88897275   4.72016949   5.00799361   6.7882251    5.8719673 ]

Plotting makes it clear that the record in the third column stands far apart from the other elements. Now the anomalous data is visible at a glance.
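If you later decide on a threshold after all, it can be drawn on the same bar chart before saving. A sketch, with a made-up threshold value:

plt.axhline(y=10.0, color='r', linestyle='--')  # threshold value is made up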

Conclusion

So, what did you think? This time, based on the theory article, I actually calculated the Mahalanobis distance using Python. The program used this time has been published as a gist on GitHub.

Since this was implemented from scratch, there may be flaws in the code. If you need precise calculations, I strongly recommend using the library instead, referring to [Robust covariance estimation and Mahalanobis distances relevance](http://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html).

I would appreciate it if you could let me know of any deficiencies or mistakes.
