[PYTHON] Anomaly detection introduction and method summary

I learned about anomaly detection, so I will summarize it.

References

I referred extensively to the following documents:

[1] Lukas Ruff, et al.: A Unifying Review of Deep and Shallow Anomaly Detection
[2] [Ide, Sugiyama: Introduction to Anomaly Detection with Machine Learning: A Practical Guide in R](https://www.amazon.co.jp/dp/4339024910)
[3] [Ide, Sugiyama: Anomaly Detection and Change Detection (Machine Learning Professional Series)](https://www.amazon.co.jp/dp/B018K6C99U)

**§1 Anomaly detection overview**

Anomaly detection application example

- Medical × anomaly detection: identification of diseased sites from medical images

image.png Source: Thomas, et al. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery

- Finance × anomaly detection: money laundering detection

image.png Source: Mark, et al. Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics

Abnormalities are diverse

There are a wide variety of anomalies, but they can be broadly divided into static anomalies and dynamic anomalies (this is my own classification, so please point out anything that seems off). A **static anomaly** means there is no relationship such as ordering between data points, and the value of the data itself is the measure of abnormality.

- For example, if the shade of the relevant region in a medical image differs significantly from healthy cases, it is judged abnormal (presence of disease).
- Such static anomalies are specifically called outliers.

A **dynamic anomaly** means there is a relationship such as ordering between data points, as in time-series data, and the behavior of the data is the measure of abnormality.

- What counts as abnormal behavior depends on the individual problem, e.g. the observed value increasing or decreasing drastically within a section, or, conversely, changing far less than usual.
- The major difference from static anomalies is that dynamic anomalies do not necessarily take anomalous values.
- For example, the central part of the graph below looks abnormal, but each value by itself is perfectly possible. The anomaly here lies in the behavior: the frequency changes.

import numpy as np
import matplotlib.pyplot as plt

x1 = np.concatenate([np.linspace(-10, -3, 50), np.zeros(50), np.linspace(3, 9, 50)])
x2 = np.concatenate([np.zeros(50), np.linspace(-3, 3, 50), np.zeros(50)])
y = np.sin(x1) + np.sin(5*x2)
plt.plot(np.linspace(-10, 9, 150), y)

image.png

Expressed as a formula: in the series data $\{x_1, \ldots, x_T\}$, even if each point $x_j$ is normal in the sense of the marginal distribution $p(x) = \int p(x, t)\,{\rm d}t$, it can still be abnormal in the sense of the time-conditional distribution $p(x_j \mid t)$, obtained by conditioning the joint distribution of the series on the time index.

However, even in this case the central part is not necessarily abnormal. For example, the graph above could be the operating signal of a factory machine whose operating mode is switched only in the central section, in which case it is not an anomaly.

As mentioned above, there are various aspects to anomalies, but reference [1] summarizes the types of anomalies in this way:

image.png Source: [1]

The Point Anomaly in the upper-left figure looks like one-off noise (an outlier) caused by an observation error, whereas the Group Anomaly looks like a group of data generated from a different distribution than the outliers; the two presumably call for different detection approaches. By the time we reach the Semantic Anomaly in the lower right, classical anomaly detection methods struggle and a deep learning approach becomes necessary.

As you can see, there is no unified solution for "abnormality", and "abnormality" must be defined on a case-by-case basis.

Anomaly detection algorithms are basically unsupervised learning

Anomaly detection algorithms basically use unsupervised learning rather than supervised learning. There are two reasons:

  1. Anomalous data is usually scarce.
  2. Anomalies are diverse and hard to model comprehensively.

Especially in settings where anomalies are critical (medicine, factories), one cannot expect to collect hundreds or thousands of anomalous samples. Reason 2 was explained at the beginning. It is easier and more straightforward to ask "normal or not normal?" than "normal or abnormal?": the former makes no statement about anomalous data and can be answered purely by understanding the structure of normal data. Unsupervised learning is precisely the framework for understanding the structure of data by estimating $p(x)$ from samples $x_i\ (i = 1, \ldots, n)$, so anomaly detection basically relies on unsupervised learning. However, it must be reasonable to assume that the sample used for unsupervised learning contains no anomalous data, or only an overwhelmingly small amount.

As an aside, reference [2] contains the following passage, which I am fond of:

The passage in Tolstoy's novel Anna Karenina, "Happy families are all alike; every unhappy family is unhappy in its own way," is suggestive.

Anomaly detection algorithms are roughly divided into four types

Now let me introduce anomaly detection algorithms concretely.

The algorithms can be broadly classified into four approaches: the Classification Approach, the Probabilistic Approach, the Reconstruction Approach, and the Distance Approach. In addition to these four categories, the following table also classifies methods by whether or not they use deep learning:

image.png Source: [1]

Each name such as PCA or OC-SVM is an anomaly detection algorithm proposed in the past. The details of the individual algorithms are introduced later, so here we only give an overview of the four approaches. (I do not fully understand the table below myself, so I would appreciate your comments.)

| Approach | Overview | Pros | Cons |
| --- | --- | --- | --- |
| Classification | Directly estimates the boundary that separates normal from abnormal data | No need to assume the shape of the distribution; SVM-based methods can be trained with relatively little data | May not work well when the normal data has a complicated structure |
| Probabilistic | Estimates the probability distribution of normal data and regards low-probability events as abnormal | Very high accuracy if the data follow the assumed model | Requires many assumptions, such as the distributional model; high-dimensional data is hard to handle |
| Reconstruction | Trains a model to reconstruct normal data well and treats data it cannot reconstruct well as abnormal | Powerful, actively researched methods such as AE and GAN can be applied | The objective is designed for dimensionality reduction / reconstruction accuracy, so it is not necessarily specialized for anomaly detection |
| Distance | Detects anomalies based on distances between data points, on the idea that abnormal data lie far from normal data | Simple idea with highly explainable output | Heavy computation (e.g. O(n^2)) |

The above table was created with reference to the following materials.

  1. Hido: Jubatus Casual Talks # 2 Anomaly Detection Introduction
  2. PANG, et al. Deep Learning for Anomaly Detection: A Review

**§2 Algorithm introduction and experiments**

Summary of methods to introduce

  • Classification Approach
    • One-Class SVM
  • Probability Approach
    • Hotelling's T2 method
    • GMM
    • KDE
  • Reconstruction Approach
    • AE
  • Distance Approach
    • k-NN
    • LOF

Classification Approach

One-Class SVM

Suppose we are given data $D = \{x_1, \ldots, x_N\}$, and assume that $D$ contains no anomalous data, or at most an overwhelmingly small number of anomalies.

In the previous section we saw that the Classification Approach directly estimates the boundary that separates normal from abnormal. **One-Class SVM is an algorithm that computes, as this boundary, a sphere that encloses the normal data.** (The "sphere" is not necessarily three-dimensional.)

image.png

If the radius is made very large we can always build a sphere that contains all of $D$, but if it is too large every anomaly will fall inside it (the false negative rate becomes 1), which is not what we want. Therefore, we

- make the volume of the sphere as small as possible, and
- allow some data in $D$ to stick out of the sphere.

Balancing these two requirements yields a sphere that generalizes well. Specifically, the optimal radius $R$ and center $b$ are obtained by solving the following optimization problem:

math


\min_{R,b,u} \left\{R^2 + C\sum_{n=1}^{N}u_n \right\} \quad {\rm s.t.} \quad \|x_n - b\|^2 \le R^2 + u_n

Here $u_n$ is a slack (penalty) term for data sticking out of the sphere, and $C$ is a parameter that controls how heavily that penalty is weighted. As with ordinary SVMs, nonlinear transformation of the data via kernel functions is also available for One-Class SVM; with an appropriate kernel function you obtain an appropriate "sphere".

Let's do a numerical experiment. The data was prepared as follows.

python


import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
np.random.seed(123)

#Mean and variance
mean1 = np.array([2, 2])
mean2 = np.array([-2,-2])
cov = np.array([[1, 0], [0, 1]])

#Generation of normal data(Generated from two normal distributions)
norm1 = np.random.multivariate_normal(mean1, cov, size=50)
norm2 = np.random.multivariate_normal(mean2, cov, size=50)
norm = np.vstack([norm1, norm2])

#Generation of anomalous data (generated from a uniform distribution)
lower, upper = -10, 10
anom = (upper - lower)*np.random.rand(20, 2) + lower   # 20 anomalous points in 2D

s=8
plt.scatter(norm[:,0], norm[:,1], s=s, color='green', label='normal')
plt.scatter(anom[:,0], anom[:,1], s=s, color='red', label='anomaly')
plt.legend()

image.png

python


from sklearn import svm
xx, yy = np.meshgrid(np.linspace(-10, 10, 500), np.linspace(-10, 10, 500))

# fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(norm)

def draw_boundary(norm, anom, clf):
    # plot the line, the points, and the nearest vectors to the plane
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.title("One-Class SVM")
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 7), cmap=plt.cm.PuBu)
    bd = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
    plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')

    nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)
    am = plt.scatter(anom[:, 0], anom[:, 1], c='red', s=s)

    plt.axis('tight')
    plt.xlim((-10, 10))
    plt.ylim((-10, 10))
    plt.legend([bd.collections[0], nm, am],
               ["learned frontier", "normal", "anomaly"],
               loc="upper left",
               prop=matplotlib.font_manager.FontProperties(size=11))
    plt.show()

draw_boundary(norm, anom, clf)

image.png

By the way, if you make the parameter $C$ smaller, i.e. make the penalty for sticking out of the sphere lighter, the boundary changes as shown below. (In sklearn this is controlled through the parameter $\nu$; since $C$ is roughly proportional to $1/\nu$, increasing $\nu$ corresponds to decreasing $C$.)

python


# fit the model
clf = svm.OneClassSVM(nu=0.5, kernel="rbf", gamma=0.1)
clf.fit(norm)
draw_boundary(norm, anom, clf)

image.png

Other proposals in this family include One-Class NN. In One-Class SVM we searched for a suitable nonlinear transformation by adjusting the kernel function; the idea of One-Class NN is to replace and automate that search with a neural network. The implementation looked somewhat involved, so I have not tried it here.
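For reference, the objective of One-Class NN (my rough paraphrase of Chalapathy et al., not something verified in this article) replaces the kernel feature map of One-Class SVM with a small network $g(Vx)$:

math


\min_{w, V, r} \; \frac{1}{2}\|w\|^2 + \frac{1}{2}\|V\|_F^2 + \frac{1}{\nu}\cdot\frac{1}{N}\sum_{n=1}^{N}\max\left(0,\; r - \langle w, g(Vx_n) \rangle\right) - r

A point is then flagged as anomalous when $r - \langle w, g(Vx) \rangle > 0$, analogously to falling outside the boundary in One-Class SVM.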

Probability Approach

Hotelling's T2 method

Hotelling's T2 method is a classical method whose theory has been worked out. In short, **by assuming a normal distribution for the data, it becomes an algorithm that exploits the fact that the anomaly score follows a $\chi^2$ distribution**.

For simplicity, we use one-dimensional data here. As before, suppose we are given data $D$, and assume that each data point $x$ follows the normal distribution

math


p(x\mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}

Fitting this normal distribution to the data by maximum likelihood estimation gives

math


\begin{align*}
    \hat{\mu} &= \frac{1}{N}\sum_{n=1}^{N}x_n \\
    \hat{\sigma}^2 &= \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{\mu})^2
\end{align*}

as the parameter estimates. In other words, we estimate that the data $D$ follow the normal distribution $p(x \mid \hat{\mu}, \hat{\sigma}^2)$.

Now let us introduce the anomaly score. An anomaly score is a statistic that, literally, indicates the degree of abnormality, and the negative log-likelihood is often used as one.

python


prob = np.linspace(0.01, 1, 100)
neg_log_likh = [-np.log(p) for p in prob]

plt.plot(prob, neg_log_likh)
plt.title('Negative Log Likelihood')
plt.ylabel('Anomaly Score')
plt.xlabel('Prob.')

image.png

As the graph shows, the negative log-likelihood takes large values where the probability is small, so it works well as an anomaly score. Computing it for the normal distribution we just estimated gives

math


-\log p(x\mid \hat{\mu}, \hat{\sigma}^2) = \frac{1}{2\hat{\sigma}^2}(x-\hat{\mu})^2 + \frac{1}{2}\log(2\pi\hat{\sigma}^2)

Since the anomaly score is a quantity defined per data point $x$, dropping the terms that do not depend on $x$, the anomaly score $a(x)$ is computed as

math


a(x) = \left( \frac{x - \hat{\mu}}{\hat{\sigma}} \right)^2

The numerator says that the farther a data point is from the mean, the higher its anomaly score. The denominator expresses how scattered the data are: for widely scattered data a large deviation from the mean is tolerated, whereas for tightly clustered data even a small deviation is not.

In fact, it is known that this anomaly score $a(x)$ follows the chi-square distribution $\chi^2_1$ with $1$ degree of freedom when the sample size $N$ is large enough.

python


from scipy.stats import chi2
x = np.linspace(0, 8, 1000)
for df in [1,2,3,4,5]:
    plt.plot(x, chi2.pdf(x ,df = df), label='deg. of freedom:'+str(df))
plt.legend()
plt.ylim(0, 0.5)
plt.title('Chi-Square Dist.')

image.png
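As a quick numerical check (my own addition, not part of the original experiments): if we draw a large Gaussian sample and compute $a(x)$ with the estimated parameters, its histogram should roughly match the $\chi^2_1$ density.

python


# Sanity check: a(x) computed on Gaussian data should roughly follow chi2 with df=1
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

np.random.seed(0)
sample = np.random.normal(loc=3.0, scale=2.0, size=10000)
mu_hat = sample.mean()
s2_hat = sample.var()                       # ML estimate of the variance
a = (sample - mu_hat)**2 / s2_hat           # anomaly score a(x)

xs = np.linspace(0.01, 8, 500)
plt.hist(a, bins=100, range=(0, 8), density=True, alpha=0.5, label='empirical a(x)')
plt.plot(xs, chi2.pdf(xs, df=1), label='chi2(df=1)')
plt.legend()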

Anomaly detection is then performed by choosing a threshold $a_{th}$, for example at the $5\%$ significance level. To summarize:

  1. Assume the data follow a normal distribution.
  2. Estimate the parameters $\mu$, $\sigma$ of the normal distribution by maximum likelihood.
  3. From the chi-square distribution, determine the threshold $a_{th}$ that separates normal from abnormal.
  4. If new data $x_{new}$ satisfies $a(x_{new}) > a_{th}$, judge it abnormal.

This is Hotelling's T2 method. I explained the one-dimensional case here, but the anomaly score follows a chi-square distribution in the multivariate case as well, with degrees of freedom equal to the number of dimensions.
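As a minimal sketch of the multivariate case (my own addition; the function name and the 5% level are arbitrary choices), the anomaly score becomes the Mahalanobis-type quantity $a(x) = (x - \hat{\mu})^{\top} \hat{\Sigma}^{-1} (x - \hat{\mu})$, thresholded at a $\chi^2$ quantile whose degrees of freedom equal the data dimension:

python


import numpy as np
from scipy.stats import chi2

def hotelling_t2(train, new, alpha=0.05):
    """Return True for rows of `new` judged anomalous at significance level alpha."""
    mu_hat = train.mean(axis=0)
    cov_hat = np.cov(train, rowvar=False)
    cov_inv = np.linalg.inv(cov_hat)
    diff = new - mu_hat
    scores = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # a(x) for each row
    threshold = chi2.ppf(1 - alpha, df=train.shape[1])       # df = number of dimensions
    return scores > threshold

# e.g. with the 2D data from the One-Class SVM experiment:
# print(hotelling_t2(norm, anom))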

Let's experiment. The data was created as follows.

python


norm = np.random.normal(0, 1, 20) #Normal data
anom = np.array([-3, 3]) #Abnormal data
plt.scatter(norm, [0]*len(norm), color='green', label='normal')
plt.scatter(anom, [0]*len(anom), color='red', label='anomaly')

image.png

python


#Maximum likelihood estimation
N = len(norm)
mu_hat = sum(norm)/N
s_hat = sum([(x-mu_hat)**2 for x in norm])/N

#Use the quantile of the chi-square distribution as the threshold
a_th_5 = chi2.ppf(q=0.95, df=1)
a_th_30 = chi2.ppf(q=0.7, df=1)

#Calculate the anomaly score of each data point
data = np.hstack([norm, anom])
anom_score = []
for d in data:
    score = (d - mu_hat)**2 / s_hat   # a(x) = (x - mu)^2 / sigma^2; s_hat is the estimated variance
    anom_score.append(score)

#drawing
norm_pred_5 = [d for d,a in zip(data, anom_score) if a<=a_th_5]
anom_pred_5 = [d for d,a in zip(data, anom_score) if a>a_th_5]
norm_pred_30 = [d for d,a in zip(data, anom_score) if a<=a_th_30]
anom_pred_30 = [d for d,a in zip(data, anom_score) if a>a_th_30]

fig, axes = plt.subplots(1,3, figsize=(15,5))

axes[0].scatter(norm, [0]*len(norm), color='green', label='normal')
axes[0].scatter(anom, [0]*len(anom), color='red', label='anomaly')
axes[0].set_title('Truth')
axes[1].scatter(norm_pred_5, [0]*len(norm_pred_5), color='green', label='normal pred')
axes[1].scatter(anom_pred_5, [0]*len(anom_pred_5), color='red', label='anomaly pred')
axes[1].set_title('Threshold:5%')
axes[2].scatter(norm_pred_30, [0]*len(norm_pred_30), color='green', label='normal pred')
axes[2].scatter(anom_pred_30, [0]*len(anom_pred_30), color='red', label='anomaly pred')
axes[2].set_title('Threshold:30%')
plt.legend()

image.png

We can see that lowering the anomaly threshold widens the region judged abnormal. The threshold should be chosen according to the problem: for example, set it low when missing an anomaly is critical, as in medical settings, and perhaps somewhat higher for tasks like data cleansing.

GMM (Gaussian Mixture Model)

Hotelling's T2 method assumed a normal distribution for the data, but this assumption is sometimes inappropriate. For example, when the data are concentrated in several places (multimodal rather than unimodal) as shown below, fitting a single normal distribution

python


lower, upper = -3, -1
data1 = (upper - lower)*np.random.rand(20) + lower
lower, upper = 1, 3
data2 = (upper - lower)*np.random.rand(30) + lower
data = np.hstack([data1, data2])

#Maximum likelihood estimation
N = len(data)
mu_hat = sum(data)/N
s_hat = sum([(x-mu_hat)**2 for x in data])/N   # variance estimate

#plot the single fitted Gaussian ("norm_dist" alias avoids shadowing the `norm` data array)
from scipy.stats import norm as norm_dist
X = np.linspace(-7, 7, 1000)
Y = norm_dist.pdf(X, mu_hat, np.sqrt(s_hat))   # scale expects the standard deviation

plt.scatter(data, [0]*len(data))
plt.plot(X, Y)

image.png

places the peak of the estimated normal distribution where no data are observed at all, as shown above. In such cases, **it is better to assume a slightly more complicated distribution instead of a single normal distribution. Here, as an example, we introduce anomaly detection by GMM (Gaussian Mixture Model)**.

GMM is a distribution that expresses multimodality by adding together several normal distributions. The number of distributions added is usually denoted $K$ and called the number of components. The GMM formula and a visualization (in one dimension) are as follows.

math


p(x\mid \pi, \mu, \sigma) = \sum_{k=1}^{K}\pi_k N(x\mid \mu_k, \sigma_k)

where

math


0 \le \pi_k \le 1,\quad \sum_{k=1}^{K}\pi_k = 1,\quad N(x\mid\mu,\sigma) = \text{(the normal density)}


python


#plot: two Gaussian components and their 50/50 mixture
from scipy.stats import norm as norm_dist
X = np.linspace(-7, 7, 1000)

Y1 = norm_dist.pdf(X, 1, 1)
Y2 = norm_dist.pdf(X, -3, 2)
Y = 0.5*Y1 + 0.5*Y2

plt.plot(X, Y1, label='Component1')
plt.plot(X, Y2, label='Component2')
plt.plot(X, Y,  label='GMM')
plt.legend()

image.png

The mean $\mu_k$, variance $\sigma_k$, and mixing ratio $\pi_k$ of each normal distribution are parameters learned from the data. The number of components $K$, however, must be fixed in advance. Parameters decided by the analyst in this way are called hyperparameters.
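Before the visual experiment, here is a minimal sketch (my own addition) of how the estimated GMM density is turned into a detector, following the same recipe as before: use the negative log-likelihood as the anomaly score and threshold it at a quantile of the training scores (the 95% quantile and $K=2$ are arbitrary choices).

python


import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_detector(train, K=2, q=0.95):
    gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=0)
    gmm.fit(train)
    train_scores = -gmm.score_samples(train)      # anomaly score = -log p(x)
    threshold = np.quantile(train_scores, q)      # threshold from the training data
    return gmm, threshold

def predict_anomaly(gmm, threshold, X):
    return -gmm.score_samples(X) > threshold      # True = judged anomalous

# e.g. gmm, th = fit_gmm_detector(norm, K=2); predict_anomaly(gmm, th, anom)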

Let's experiment with anomaly detection by GMM, and also observe how the shape of the estimated distribution changes with the number of components $K$. The data are the same as in the One-Class SVM experiment: image.png

python


def draw_GMM(norm, anom, clf, K):
    # plot the line, the points, and the nearest vectors to the plane
    xx, yy = np.meshgrid(np.linspace(-10, 10, 1000), np.linspace(-10, 10, 1000))
    Z = np.exp(clf.score_samples(np.c_[xx.ravel(), yy.ravel()]))
    Z = Z.reshape(xx.shape)

    plt.title("GMM:K="+str(K))
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 7), cmap=plt.cm.PuBu)
    
    nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)
    am = plt.scatter(anom[:, 0], anom[:, 1], c='red', s=s)

    plt.axis('tight')
    plt.xlim((-10, 10))
    plt.ylim((-10, 10))
    plt.legend([nm, am],
               ["normal", "anomaly"],
               loc="upper left",
               prop=matplotlib.font_manager.FontProperties(size=11))
    plt.show()

python


#Fitting by GMM
from sklearn.mixture import GaussianMixture

for K in [1,2,3,4]:
    clf = GaussianMixture(n_components=K, covariance_type='full')
    clf.fit(norm)
    draw_GMM(norm, anom, clf, K)

image.png image.png image.png image.png

Increasing the number of components like this captures finer and finer details of the data. Be careful, though: capturing fine details does not equal high generalization performance. The number of components can be chosen using information criteria such as WAIC (I tried WAIC separately: GitHub).
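Since the article mentions choosing the number of components with an information criterion (WAIC), here is a simpler alternative sketch using BIC, which sklearn's GaussianMixture provides directly (my own addition; the range of $K$ is arbitrary).

python


from sklearn.mixture import GaussianMixture

# Compare BIC across candidate numbers of components on the normal data
bics = []
for K in range(1, 7):
    gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=0)
    gmm.fit(norm)
    bics.append((K, gmm.bic(norm)))

best_K = min(bics, key=lambda t: t[1])[0]   # smaller BIC is better
print(bics)
print('K selected by BIC:', best_K)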

KDE (Kernel Density Estimation)

So far we have assumed a normal distribution or a Gaussian mixture and estimated its internal parameters by fitting it to the data. Such an approach is called parametric. But what if no parametric distribution fits well?

A non-parametric approach is effective in such cases, and **KDE is a typical non-parametric estimation method**. KDE estimates the probability distribution by placing a kernel (usually a Gaussian) on each data point and superimposing them. A visualization follows.

python


from scipy.stats import norm as norm_dist   # alias avoids shadowing the `norm` data array
np.random.seed(123)
lower, upper = -1, 5
data = (upper - lower)*np.random.rand(3) + lower
for i, d in enumerate(data):
    X = np.linspace(lower, upper, 500)
    Y = norm_dist.pdf(X, loc=d, scale=1)
    knl = plt.plot(X, Y, color='green', label='kernel'+str(i+1))

# KDE: superimpose a kernel on each data point and normalize
n = 3
h = 0.24
Y_pdf = []
for x in X:
    Y = 0
    for d in data:
        Y += norm_dist.pdf((x-d)/h, loc=0, scale=1)
    Y /= n*h
    Y_pdf.append(Y)
kde = plt.plot(X, Y_pdf, label='KDE')
plt.scatter(data, [0]*len(data), marker='x')
plt.ylim([-0.05, 1])
plt.legend()

image.png

In this way, the probability distribution is estimated by superimposing the kernels. If you use a normal distribution for your kernel, you need to set the variance (called bandwidth). In subsequent experiments, let's also observe the change in the estimated distribution when this bandwidth is changed.

For the details of KDE's theory and its strengths and weaknesses, [Suzuki, Data Analysis Lecture 10, "Nonparametric Density Estimation"](http://ibis.tu-tokyo.ac.jp/suzuki/lecture/2015/dataanalysis/L9.pdf) was easy to follow:

**Estimated distribution**

math


p(x) = \frac{1}{Nh}\sum_{n=1}^{N}K\left(\frac{x-X_n}{h}\right)

where $h > 0$ is the bandwidth, $K(\cdot)$ is the kernel function, and $X_n$ are the data points.

**Advantages / Disadvantages**

image.png Source: Suzuki, Data Analysis Lecture 10, "Nonparametric Density Estimation"

Let's experiment with anomaly detection by KDE. The data are the same as before: image.png The estimated distributions for different bandwidths are as follows.

python


def draw_KDE(norm, anom, clf, bw):
    # plot the estimated density and the data points
    xx, yy = np.meshgrid(np.linspace(-10, 10, 1000), np.linspace(-10, 10, 1000))
    Z = np.exp(clf.score_samples(np.c_[xx.ravel(), yy.ravel()]))
    Z = Z.reshape(xx.shape)

    plt.title("KDE:BandWidth="+str(bw))
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 7), cmap=plt.cm.PuBu)
    
    nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)
    am = plt.scatter(anom[:, 0], anom[:, 1], c='red', s=s)

    plt.axis('tight')
    plt.xlim((-10, 10))
    plt.ylim((-10, 10))
    plt.legend([nm, am],
               ["normal", "anomaly"],
               loc="upper left",
               prop=matplotlib.font_manager.FontProperties(size=11))
    plt.show()

python


from sklearn.neighbors import KernelDensity
for bw in [0.2,0.5,1.0,2.0]:
    clf = KernelDensity(bandwidth=bw, kernel='gaussian')
    clf.fit(norm)
    draw_KDE(norm, anom, clf, bw)

image.png image.png image.png image.png

As you can see, the estimated distribution becomes smoother as the bandwidth increases. Since there are kernel functions other than the Gaussian, both the bandwidth and the kernel function need to be chosen appropriately.
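One common way to choose the bandwidth is cross-validation on the held-out log-likelihood. A sketch using sklearn's GridSearchCV with KernelDensity follows (my own addition; the candidate grid is an arbitrary choice).

python


import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Search a log-spaced grid of bandwidths by 5-fold cross-validated log-likelihood
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(norm)                       # `norm` is the 2D normal data used above
print('best bandwidth:', grid.best_params_['bandwidth'])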

Reconstruction Approach

Since the Reconstruction Approach is often used for image data, we use image data here as well. The normal data is MNIST and the abnormal data is Fashion-MNIST.

python


#MNIST load
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

#Reading training data
train_mnist = datasets.MNIST(
    root="./data", 
    train=True, 
    transform=transforms.Compose([
               transforms.ToTensor(),
               transforms.Normalize((0.5,), (0.5,)) # normalize with mean=0.5, std=0.5 (values roughly in [-1, 1])
               ]), 
    download=True
)

train_mnist_loader = torch.utils.data.DataLoader(
    train_mnist, batch_size=64, shuffle=True
)

#Read test data
test_mnist = datasets.MNIST(
    root="./data", 
    train=False, 
    transform=transforms.Compose([
               transforms.ToTensor(),
               transforms.Normalize((0.5,), (0.5,)) # normalize with mean=0.5, std=0.5 (values roughly in [-1, 1])
               ]), 
    download=True
)

test_mnist_loader = torch.utils.data.DataLoader(
    test_mnist, batch_size=64, shuffle=True
)

python


# Fashion-MNIST load
#Reading training data
train_fashion = datasets.FashionMNIST(
    root="./data", 
    train=True, 
    transform=transforms.Compose([
               transforms.ToTensor(),
               transforms.Normalize((0.5,), (0.5,)) # normalize with mean=0.5, std=0.5 (values roughly in [-1, 1])
               ]), 
    download=True
)

train_fashion_loader = torch.utils.data.DataLoader(
    train_fashion, batch_size=64, shuffle=True
)

#Read test data
test_fashion = datasets.FashionMNIST(
    root="./data", 
    train=False, 
    transform=transforms.Compose([
               transforms.ToTensor(),
               transforms.Normalize((0.5,), (0.5,)) # normalize with mean=0.5, std=0.5 (values roughly in [-1, 1])
               ]), 
    download=True
)

test_fashion_loader = torch.utils.data.DataLoader(
    test_fashion, batch_size=64, shuffle=True
)

python


import matplotlib.pyplot as plt

#A function that visualizes images side by side at once
def imshow(image_set, nrows=2, ncols=10):
    plot_num = nrows * ncols
    fig, ax = plt.subplots(ncols=ncols, nrows=nrows, figsize=(15, 15*nrows/ncols))
    ax = ax.ravel()
    for i in range(plot_num):
        ax[i].imshow(image_set[i].reshape(28, 28), cmap='gray')
        ax[i].set_xticks([])
        ax[i].set_yticks([])

python


#MNIST sample visualization
imshow(train_mnist.data)

image.png

python


# Fashion-MNIST sample visualization
imshow(train_fashion.data)

image.png

AE (Auto Encoder)

An AE (Auto Encoder) is a type of neural network (NN) trained so that its output reproduces its input. Since a plain identity mapping would be pointless, it usually has an intermediate layer whose dimension is smaller than the input dimension.

image.png

What this is doing is "summarizing" the input data in the middle layer. Mathematically, it means that the input data is mapped to a low-dimensional space while retaining enough information to reconstruct it.

**Anomaly detection with an AE is based on the idea that an AE trained to reconstruct normal data should not be able to reconstruct abnormal data well.**

Now let's experiment with anomaly detection by AE. After training the AE on MNIST, we use the reconstruction error on MNIST and Fashion-MNIST as the anomaly score.

python


import torch.nn as nn
import torch.nn.functional as F

#Definition of AE
class AE(nn.Module):
    def __init__(self):
        super(AE, self).__init__()
        self.fc1 = nn.Linear(28*28, 256)
        self.fc2 = nn.Linear(256, 28*28)
        
    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        return x
ae = AE()

#Definition of loss function and optimization method
import torch.optim as optim

criterion = nn.MSELoss()
optimizer = optim.Adam(ae.parameters(), lr=0.0001)

python


#Learning loop
N_train = train_mnist.data.shape[0]
N_test = test_mnist.data.shape[0]

train_loss = []
test_loss = []
for epoch in range(10):  
    # Train step
    running_train_loss = 0.0
    for data in train_mnist_loader:
        inputs, _ = data
        inputs = inputs.view(-1, 28*28)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = ae(inputs)
        loss = criterion(outputs, inputs)
        loss.backward()
        optimizer.step()

        running_train_loss += loss.item()
    running_train_loss /= N_train
    train_loss.append(running_train_loss)
    
    # Test step
    with torch.no_grad():
        running_test_loss = 0
        for data in test_mnist_loader:
            inputs, _ = data
            inputs = inputs.view(-1, 28*28)
            outputs = ae(inputs)
            running_test_loss += criterion(outputs, inputs).item()
        running_test_loss /= N_test
        test_loss.append(running_test_loss)
    if (epoch+1)%1==0:    # print every 1 epoch
        print('epoch:{}, train_loss:{}, test_loss:{}'.format(epoch + 1, running_train_loss, running_test_loss))
        
plt.plot(train_loss, label='Train loss')
plt.plot(test_loss, label='Test loss')
plt.ylabel('Loss')
plt.xlabel('epoch')
plt.legend()

image.png

python


#Let's visualize the state of reconstruction
with torch.no_grad():
    for data in test_mnist_loader:
        inputs, _ = data
        inputs = inputs.view(-1, 28*28)
        outputs = ae(inputs)

n = 5 # num to viz
fig, axes = plt.subplots(ncols=n, nrows=2, figsize=(15, 15*2/n))
for i in range(n):
    axes[0, i].imshow(inputs[i].reshape(28, 28), cmap='gray')
    axes[1, i].imshow(outputs[i].reshape(28, 28), cmap='gray')
    axes[0, i].set_title('Input')
    axes[1, i].set_title('Reconstruct')
    axes[0, i].set_xticks([])
    axes[0, i].set_yticks([])
    axes[1, i].set_xticks([])
    axes[1, i].set_yticks([])

image.png

It seems that it has been reconstructed neatly.

Now, let's draw the distribution of the reconstruction accuracy (square error this time) of MNIST and Fashion-MNIST. If AE anomaly detection is successful, Fashion-MNIST should be less accurate.

python


#Calculation of reconstruction accuracy
mnist_loss = []
with torch.no_grad():
    for data in test_mnist_loader:
        inputs, _ = data
        for data_i in inputs:
            data_i = data_i.view(-1, 28*28)
            outputs = ae(data_i)
            loss = criterion(outputs, data_i)
            mnist_loss.append(loss.item())

fashion_loss = []
with torch.no_grad():
    for data in test_fashion_loader:
        inputs, _ = data
        for data_i in inputs:
            data_i = data_i.view(-1, 28*28)
            outputs = ae(data_i)
            loss = criterion(outputs, data_i)
            fashion_loss.append(loss.item())

python


plt.hist([mnist_loss, fashion_loss], bins=25, label=['MNIST', 'Fashion-MNIST'])
plt.xlabel('Loss')
plt.ylabel('data count')
plt.legend()
plt.show()

image.png

Indeed, the reconstruction of Fashion-MNIST does not work well, so anomaly detection based on reconstruction error looks feasible.
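To put a number on this separation, here is a short sketch (my own addition) computing the ROC AUC, treating Fashion-MNIST as the positive (anomalous) class and the reconstruction error as the score, using the loss lists computed above.

python


from sklearn.metrics import roc_auc_score

scores = mnist_loss + fashion_loss                       # anomaly score = reconstruction error
labels = [0]*len(mnist_loss) + [1]*len(fashion_loss)     # 0: normal (MNIST), 1: anomaly (Fashion-MNIST)
print('ROC AUC:', roc_auc_score(labels, scores))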

Finally, let's pick out and draw Fashion-MNIST samples with low loss (hard to judge as abnormal) and with high loss (easy to judge as abnormal).

python


lower_loss, upper_loss = 0.1, 0.8
data_low = [] #Stores data with low Loss
data_up = []  #Stores data with high Loss

with torch.no_grad():
    for data in test_fashion_loader:
        inputs, _ = data
        for data_i in inputs:
            data_i = data_i.view(-1, 28*28)
            outputs = ae(data_i)
            loss = criterion(outputs, data_i).item()
            if loss<lower_loss:
                data_low.append([data_i, outputs, loss])
            if loss>upper_loss:
                data_up.append([data_i, outputs, loss])

python


#Index for drawing randomly
np.random.seed(0)

n = 3 #Number of drawing data
lower_indices = np.random.choice(np.arange(len(data_low)), n)
upper_indices = np.random.choice(np.arange(len(data_up)), n)
indices = np.hstack([lower_indices, upper_indices])

fig, axes = plt.subplots(ncols=n*2, nrows=2, figsize=(15, 15*2/(n*2)))

#Drawing data with small Loss
for i in range(n):
    inputs = data_low[indices[i]][0]
    outputs = data_low[indices[i]][1]
    loss = data_low[indices[i]][2]
    axes[0, i].imshow(inputs.reshape(28, 28), cmap='gray')
    axes[1, i].imshow(outputs.reshape(28, 28), cmap='gray')
    axes[0, i].set_title('Input')
    axes[1, i].set_title('Small Loss={:.2f}'.format(loss))
    axes[0, i].set_xticks([])
    axes[0, i].set_yticks([])
    axes[1, i].set_xticks([])
    axes[1, i].set_yticks([])

#Drawing data with large loss
for i in range(n, 2*n):
    inputs = data_up[indices[i]][0]
    outputs = data_up[indices[i]][1]
    loss = data_up[indices[i]][2]
    axes[0, i].imshow(inputs.reshape(28, 28), cmap='gray')
    axes[1, i].imshow(outputs.reshape(28, 28), cmap='gray')
    axes[0, i].set_title('Input')
    axes[1, i].set_title('Large Loss={:.2f}'.format(loss))
    axes[0, i].set_xticks([])
    axes[0, i].set_yticks([])
    axes[1, i].set_xticks([])
    axes[1, i].set_yticks([])
plt.tight_layout()

image.png

Images with low Loss tend to have a large black area, while images with a high Loss tend to have a large white area. Since MNIST is basically an image with a large black area, if the white area is large, it will be "foreign" for AE and it seems that the reconstruction is not working well.

Distance Approach

k-NN (k-Nearest Neighbor)

k-NN measures the degree of anomaly based on distances between data points. For each point, consider the circle that contains its k nearest neighbors. The figure below shows the case $k = 5$.

image.png

As the figure shows, k-NN uses the radius of this circle as the anomaly score, based on the idea that the circle drawn around anomalous data should be larger than the one drawn around normal data. Note that both the number of neighbors $k$ and the radius threshold separating normal from abnormal must be tuned appropriately.

Let's do a numerical experiment. The threshold is the 90% quantile in the radius value distribution of normal data.

python


#Data creation
np.random.seed(123)

#Mean and covariance
mean = np.array([2, 2])
cov = np.array([[3, 0], [0, 3]])

#Generation of normal data
norm = np.random.multivariate_normal(mean, cov, size=50)

#Generation of anomalous data (generated from a uniform distribution)
lower, upper = -10, 10
anom = (upper - lower)*np.random.rand(20, 2) + lower   # 20 anomalous points in 2D

s=8
plt.scatter(norm[:,0], norm[:,1], s=s, color='green', label='normal')
plt.scatter(anom[:,0], anom[:,1], s=s, color='red', label='anomaly')
plt.legend()

image.png

I could not find a library function that performs k-NN anomaly detection end to end, so I wrote it myself.

python


#A function that computes the k-nearest-neighbor radius of every data point
def kNN(data, k=5):
    """
    input: data (array-like), k (int)
    output: list of the k-nearest-neighbor radius of each data point
    (Euclidean distance is used.)
    """
    output = []
    for d in data:
        distances = []
        for p in data:
            distances.append(np.linalg.norm(d - p))
        output.append(sorted(distances)[k])   # index k skips the zero distance to the point itself
    return output

#A function that returns 1 if the k-th neighbor distance exceeds the threshold, 0 otherwise
def kNN_scores(point, data, thres, K):
    distances = []
    for p in data:
        distances.append(np.linalg.norm(point - p))
    dist = sorted(distances)[K]
    if dist > thres:
        return 1
    else:
        return 0
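For reference, the same k-th-neighbor distance can also be obtained with scikit-learn's NearestNeighbors; this is a sketch of an equivalent implementation (it should match kNN() above up to tie handling), not how the original article did it.

python


from sklearn.neighbors import NearestNeighbors

def kNN_sklearn(data, k=5):
    # k+1 neighbors because each point's nearest neighbor is itself (distance 0)
    nbrs = NearestNeighbors(n_neighbors=k+1).fit(data)
    dist, _ = nbrs.kneighbors(data)
    return dist[:, k]   # distance to the k-th neighbor excluding the point itself

# e.g. anomaly_scores = kNN_sklearn(norm, k=5)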

python


def draw_KNN(norm, anom, thres, K, only_norm=False):
    # plot the line, the points, and the nearest vectors to the plane
    xx, yy = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
    Z = [kNN_scores(point, norm, thres, K) for point in np.c_[xx.ravel(), yy.ravel()]]
    Z = np.array(Z).reshape(xx.shape)

    plt.title("KNN:K="+str(K))
    plt.contourf(xx, yy, Z)
    
    if not only_norm:
        nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)
        am = plt.scatter(anom[:, 0], anom[:, 1], c='red', s=s)

        plt.axis('tight')
        plt.xlim((-10, 10))
        plt.ylim((-10, 10))
        plt.legend([nm, am],
                   ["normal", "anomaly"],
                   loc="upper left",
                   prop=matplotlib.font_manager.FontProperties(size=11))

    if only_norm:
        nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)

        plt.axis('tight')
        plt.xlim((-10, 10))
        plt.ylim((-10, 10))
        plt.legend([nm],
                   ["normal"],
                   loc="upper left",
                   prop=matplotlib.font_manager.FontProperties(size=11))
        
    plt.show()

python


import pandas as pd   # needed for pd.Series below; not imported earlier

for k in [1,5,30]:
    anomaly_scores = kNN(norm, k=k)
    thres = pd.Series(anomaly_scores).quantile(0.90)
    draw_KNN(norm, anom, thres, k)

image.png image.png image.png

You can see that the normal / abnormal boundary becomes smoother as k increases.

Now let's try a slightly trickier example. Suppose **the normal data are generated from two normal distributions with different densities**, as shown below.

python


np.random.seed(123)

#Mean and variance
mean1 = np.array([3, 3])
mean2 = np.array([-5,-5])
cov1 = np.array([[0.5, 0], [0, 0.5]])
cov2 = np.array([[3, 0], [0, 3]])

#Generation of normal data(Generated from two normal distributions)
norm1 = np.random.multivariate_normal(mean1, cov1, size=50)
norm2 = np.random.multivariate_normal(mean2, cov2, size=10)
norm = np.vstack([norm1, norm2])

s=8
plt.scatter(norm[:,0], norm[:,1], s=s, color='green', label='normal')
plt.legend()

image.png

python


for k in [1,5,30]:
    anomaly_scores = kNN(norm, k=k)
    thres = pd.Series(anomaly_scores).quantile(0.90)
    draw_KNN(norm, [], thres, k, only_norm=True)

image.png image.png image.png

Focusing especially on the case $k = 1$, we can see that k-NN does not work well when the normal data come from clusters with different densities. **You can imagine handling this by weighting deviations from the dense cluster heavily while tolerating deviations from the sparse cluster.**

In fact, that is the method called LOF introduced in the next section.

LOF (Local Outlier Factor)

The figure below illustrates the situation where the k-NN of the previous section breaks down.

image.png

Naturally we would like the point on the left to receive a higher anomaly score, and LOF achieves this by introducing the concept of **data density**. I will omit the mathematical details, but roughly it does the following.

image.png

The anomaly score is defined as **the ratio of the data density around a point's neighbors to the density around the point itself**.
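For reference, the standard definition omitted above (my summary of the usual LOF formulation, not taken from the article) uses the reachability distance and the local reachability density:

math


\begin{align*}
\text{reach-dist}_k(x, y) &= \max\{\, d_k(y),\ d(x, y) \,\} \\
\text{lrd}_k(x) &= \left( \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \text{reach-dist}_k(x, y) \right)^{-1} \\
\text{LOF}_k(x) &= \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\text{lrd}_k(y)}{\text{lrd}_k(x)}
\end{align*}

where $d_k(y)$ is the distance from $y$ to its $k$-th nearest neighbor and $N_k(x)$ is the set of $k$ nearest neighbors of $x$. A LOF value well above 1 means the point lies in a region sparser than its neighbors' regions, i.e. it is more likely to be an anomaly.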

Now let's experiment with LOF.

python


def draw_LOF(norm, anom, K, only_norm=False):
    # plot the line, the points, and the nearest vectors to the plane
    xx, yy = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
    Z = - model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = np.array(Z).reshape(xx.shape)

    plt.title("LOF:K="+str(K))
    plt.contourf(xx, yy, Z)
    
    if not only_norm:
        nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)
        am = plt.scatter(anom[:, 0], anom[:, 1], c='red', s=s)

        plt.axis('tight')
        plt.xlim((-10, 10))
        plt.ylim((-10, 10))
        plt.legend([nm, am],
                   ["normal", "anomaly"],
                   loc="upper left",
                   prop=matplotlib.font_manager.FontProperties(size=11))

    if only_norm:
        nm = plt.scatter(norm[:, 0], norm[:, 1], c='green', s=s)

        plt.axis('tight')
        plt.xlim((-10, 10))
        plt.ylim((-10, 10))
        plt.legend([nm],
                   ["normal"],
                   loc="upper left",
                   prop=matplotlib.font_manager.FontProperties(size=11))
        
    plt.show()

python


from sklearn.neighbors import LocalOutlierFactor

for k in [1, 5, 30]:
    model = LocalOutlierFactor(n_neighbors=k, novelty=True)
    model.fit(norm)
    draw_LOF(norm, [], k, only_norm=True)

image.png image.png image.png

In this way, anomaly detection can be performed while taking the density of each cluster into account.

More advanced anomaly detection problems

As I continue studying anomaly detection, I would like to look into the following topics:

- Anomaly detection for time-series data
- Anomaly detection for graph data
  - applicable to money laundering detection
