[PYTHON] I tried to implement anomaly detection by sparse structure learning


Continuing from the previous anomaly detection method (MSPC), this time we have organized the anomaly detection method by sparse structure learning.

What is anomaly detection by sparse structure learning?

Hotelling's $ T ^ 2 $ method and $ MSPC $, which are well-known anomaly detection methods, are methods used when monitoring data whose average value does not change. On the other hand, at the manufacturing site, not only such data, but also equipment and operation data with a data structure whose values change while maintaining a certain relationship between variables. At that time, one of the anomaly detection methods focusing on the relationships between variables (correlation, etc.) is the anomaly detection method by sparse structure learning.

Outline of method

  1. To find the correlation between variables, assume a multivariate normal distribution behind the data and find the precision matrix (the reciprocal of the covariance matrix).
  2. Since it is difficult to extract only meaningful correlations from actual data (effect of noise), we will use Sparse Inverse Covariance Matrix Optimization this time.
  3. Abnormality is defined as the difference (KL distance) in the probability distribution between normal and evaluation (online monitoring).

For details of the method, see ([Abnormality detection and change point detection](https://www.amazon.co.jp/%E7%95%B0%E5%B8%B8%E6%A4%9C%E7%9F%A5] % E3% 81% A8% E5% A4% 89% E5% 8C% 96% E6% A4% 9C% E7% 9F% A5-% E6% A9% 9F% E6% A2% B0% E5% AD% A6% E7% BF% 92% E3% 83% 97% E3% 83% AD% E3% 83% 95% E3% 82% A7% E3% 83% 83% E3% 82% B7% E3% 83% A7% E3% 83% 8A% E3% 83% AB% E3% 82% B7% E3% 83% AA% E3% 83% BC% E3% 82% BA-% E4% BA% 95% E6% 89% 8B-% E5% If you read 89% 9B / dp / 4061529080)), you will deepen your understanding. I also read this book, studied it, and put it into practice in my actual work.

Implementation of anomaly detection method by sparse structure learning

This time, I used a dataset called Water Treatment Plant from the UCI Machine Learning Repository for the sample. We used 100 normal data and the remaining 279 as evaluation data to see how much they deviated from the normal state.

The python code is below.

#Import required libraries
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
from scipy import stats 

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors.kde import KernelDensity
from sklearn.covariance import GraphicalLassoCV

df = pd.read_csv('watertreatment_mod.csv', encoding='shift_jis', header=0, index_col=0)

window_size = 10
cr = 0.99

スクリーンショット 2020-11-19 21.55.24.png

The input data looks like the above.

Next, we will divide the training data and evaluation data and standardize them.

#Split point
split_point = 100
#Divided into training data (train) and evaluation data (test)
train_df = df.iloc[:(split_point-1),]
test_df = df.iloc[split_point:,]

#Standardize data
sc = StandardScaler()
train_df_std = sc.transform(test_df)
test_df_std = sc.transform(test_df)

Next, we estimate the covariance.

#Covariance estimation
n_samples, n_features = train_df_std.shape
emp_cov = np.dot(train_df_std.T, train_df_std) / n_samples

model = GraphicalLassoCV()
cov_ = model.covariance_
prec_ = model.precision_

Next, define a function to find the KL distance.

def Calc_KL(cov_, prec_, xtest):
    """Calculate the KL distance

    cov_ : np.array
Covariance matrix of training data
    prec_ : np.array
Accuracy matrix of training data
    df : pd.DataFrame
data set

    d_ab : pd.DataFrame
KL distance
    n_samples, n_features = xtest.shape
    model_test = GraphicalLassoCV()
    except FloatingPointError:
        print("floating error")
        return d_ab

    cov__test = model_test.covariance_
    prec__test = model_test.precision_  
    #Calculate the magnitude of correlation collapse for each variable
    for i in range(n_features):
        temp_prec_a = np.r_[prec_[i:n_features,:], prec_[0:i,:]] 
        temp_prec_a = np.c_[temp_prec_a[:,i:n_features], temp_prec_a[:,0:i]]
        temp_prec_b = np.r_[prec__test[i:n_features,:], prec__test[0:i,:]] 
        temp_prec_b = np.c_[temp_prec_b[:,i:n_features], temp_prec_b[:,0:i]]
        temp_cov_a = np.r_[cov_[i:n_features,:], cov_[0:i,:]] 
        temp_cov_a = np.c_[temp_cov_a[:,i:n_features], temp_cov_a[:,0:i]]
        temp_cov_b = np.r_[cov__test[i:n_features,:], cov__test[0:i,:]] 
        temp_cov_b = np.c_[temp_cov_b[:,i:n_features], temp_cov_b[:,0:i]]
        La = temp_prec_a[:-1, :-1]
        la = temp_prec_a[:-1, -1]
        lama = temp_prec_a[-1, -1]
        Wa = temp_cov_a[:-1, :-1]
        wa = temp_cov_a[:-1, -1]
        sigmaa = temp_cov_a[-1, -1]
        Lb = temp_prec_b[:-1, :-1]
        lb = temp_prec_b[:-1, -1]
        lamb = temp_prec_b[-1, -1]
        Wb = temp_cov_b[:-1, :-1]
        wb = temp_cov_b[:-1, -1]
        sigmab = temp_cov_b[-1, -1]
        d_abp[i] = np.dot(wa, la)+0.5*(np.dot(np.dot(lb, Wb), lb)-np.dot(np.dot(la, Wa), la))+0.5*(np.log(lama/lamb)+sigmaa-sigmab)
        d_abm[i] = np.dot(wb, lb)+0.5*(np.dot(np.dot(la, Wa), la)-np.dot(np.dot(lb, Wb), lb))+0.5*(np.log(lamb/lama)+sigmab-sigmaa)
        d_ab[i] = max(-d_abp[i], -d_abm[i])
    return d_ab

Next, prepare a function that calculates the management limit by kernel density estimation.

def cl_limit(x, cr=0.99):
    """Calculate control limits

    x : np.array
KL distance
    cr : float
Boundaries of control limits

    cl : float
Boundary point of management limit
    X = x.reshape(np.shape(x)[0],1)
    bw= (np.max(X)-np.min(X))/100
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    X_plot = np.linspace(np.min(X), np.max(X), 1000)[:, np.newaxis]
    log_dens = kde.score_samples(X_plot)
    prob = np.exp(log_dens) / np.sum(np.exp(log_dens))
    calprob = np.zeros(np.shape(prob)[0])
    calprob[0] = prob[0]
    for i in range(1,np.shape(prob)[0]):
    cl = X_plot[np.min(np.where(calprob>cr))]
    return cl

Cross-validate the training data to calculate the management limit.

K = 5
cv_data_size = np.int(np.shape(train_df_std)[0]/5)
n_train_samples = np.shape(train_df_std)[0]
counter = 0

for i in range(K):
    cv_train_data=np.r_[train_df_std[0:i*cv_data_size,], train_df_std[(i+1)*cv_data_size:,]]
    if i < K-1:
    model_cv = GraphicalLassoCV()
    cov__cv = model.covariance_
    prec__cv = model.precision_
    for n in range(window_size, np.shape(cv_test_data)[0]):
        count = i*cv_data_size + n
        tempX = cv_test_data[n-window_size:n,:]
        d_ab_temp = Calc_KL(cov__cv, prec__cv, tempX)      
        if 0 == counter:
            d_ab = d_ab_temp.reshape(1,n_features)
            TimeIndices2 = TimeIndices[count]
            d_ab = np.r_[d_ab,d_ab_temp.reshape(1,n_features)]
            #Here error
            TimeIndices2 = np.vstack((TimeIndices2,TimeIndices[count]))
        counter = counter + 1

split_point = np.shape(d_ab)[0]    
d_ab_cv = d_ab[np.sum(d_ab,axis=1)!=0,:]

cl = np.zeros([n_features])
for i in range(n_features):
    cl[i] = cl_limit(d_ab_cv[:,i],cr)

Finally, the covariance of the evaluation data is estimated and visualized.

#Estimate covariance for evaluation data
n_test_samples = np.shape(test_df_std)[0]

for n in range(window_size, n_test_samples):    
    tempX = test_df_std[n-window_size:n,:]
    d_ab_temp = Calc_KL(cov_, prec_, tempX)

    d_ab = np.r_[d_ab,d_ab_temp.reshape(1,n_features)]
    TimeIndices2 = np.vstack((TimeIndices2,TimeIndices[n+n_train_samples]))

x2 = [0, np.shape(d_ab)[0]]
x3 = [split_point, split_point]
x = range(0, np.shape(TimeIndices2)[0],20)
NewTimeIndices = np.array(TimeIndices2[x])
for i in range(38):
    plt.figure(figsize=(200, 3))
    plt.subplot(1, 38, i+1)
    plt.title('%s Contribution' % (i))
    plt.plot(d_ab[:, i], marker="o")
    y2 = [cl[i],cl[i]]
    plt.plot(x2,y2,ls="-", color = "r")
    y3 = [0, np.nanmax(d_ab[:,i])]
    plt.plot(x3,y3,ls="--", color = "b")



Thank you for reading to the end. This time, we implemented anomaly detection by sparse structure learning.

If you have a request for correction, we would appreciate it if you could contact us.

Recommended Posts

I tried to implement anomaly detection by sparse structure learning
I tried to implement anomaly detection using a hidden Markov model
I tried to implement PCANet
I tried to implement StarGAN (1)
I tried to implement Deep VQE
I tried to implement adversarial validation
[Django] I tried to implement access control by class inheritance.
I tried to implement ListNet of rank learning with Chainer
I tried to implement hierarchical clustering
I tried to implement Realness GAN
I tried to implement Perceptron Part 1 [Deep Learning from scratch]
I tried to implement sentence classification by Self Attention with PyTorch
I tried to implement PLSA in Python
I tried to implement Autoencoder with TensorFlow
I tried to implement permutation in Python
I tried to implement PLSA in Python 2
I tried to implement ADALINE in Python
I tried to implement PPO in Python
I tried to implement CVAE with PyTorch
I tried to implement deep learning that is not deep with only NumPy
[Deep Learning from scratch] I tried to implement sigmoid layer and Relu layer.
I tried to implement Bayesian linear regression by Gibbs sampling in python
[Reinforcement learning] Finally surpassed humans! ?? I tried to explain / implement Agent57 (Keras-RL)
I tried to classify mnist numbers by unsupervised learning [PCA, t-SNE, k-means]
I tried to classify Oba Hana and Emiri Otani by deep learning
I tried to program bubble sort by language
I tried to implement reading Dataset with PyTorch
I tried to get an image by scraping
Deep learning learned by implementation ~ Anomaly detection (unsupervised learning) ~
I tried to implement TOPIC MODEL in Python
I tried to implement selection sort in python
I tried to classify dragon ball by adaline
I tried to implement the traveling salesman problem
[Python] Deep Learning: I tried to implement deep learning (DBN, SDA) without using a library.
I tried to predict the presence or absence of snow by machine learning.
I tried to predict the change in snowfall for 2 years by machine learning
I tried to implement various methods for machine learning (prediction model) using scikit-learn.
I tried to implement Cifar10 with SONY Deep Learning library NNabla [Nippon Hurray]
I tried to classify Oba Hana and Emiri Otani by deep learning (Part 2)
I tried to implement sentence classification & Attention visualization by Japanese BERT in PyTorch
I tried deep learning
I tried to debug.
I tried to paste
I tried to move machine learning (ObjectDetection) with TouchDesigner
I tried to implement multivariate statistical process management (MSPC)
I tried to implement and learn DCGAN with PyTorch
I tried to implement Minesweeper on terminal with python
[Anomaly detection] Detect image distortion by deep distance learning
I tried to implement a pseudo pachislot in Python
I tried to implement a recommendation system (content-based filtering)
I tried to implement Dragon Quest poker in Python
I tried to implement an artificial perceptron with python
I tried to implement time series prediction with GBDT
I tried to implement GA (genetic algorithm) in Python
I tried to implement Grad-CAM with keras and tensorflow
[Deep Learning from scratch] I tried to explain Dropout
I tried to implement SSD with PyTorch now (Dataset)
I tried to implement automatic proof of sequence calculation
I tried to compress the image using machine learning
[Keras] I tried to solve a donut-type region classification problem by machine learning [Study]
[Introduction] I tried to implement it by myself while explaining the binary search tree.