[PYTHON] I tried to implement anomaly detection by sparse structure learning

Introduction

Continuing from the previous article on an anomaly detection method (MSPC), this time I have summarized an anomaly detection method based on sparse structure learning.

What is anomaly detection by sparse structure learning?

Hotelling's $T^2$ method and MSPC, both well-known anomaly detection methods, are used to monitor data whose mean does not change. At a manufacturing site, however, there is not only such data but also equipment and operation data whose values change while maintaining certain relationships between variables. Anomaly detection by sparse structure learning is one method that focuses on these relationships between variables (correlations, etc.).
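
As background (standard Gaussian graphical model theory, added here for context): for a multivariate normal distribution with precision matrix $\Lambda = \Sigma^{-1}$, a zero entry $\Lambda_{ij} = 0$ means that variables $i$ and $j$ are conditionally independent given all the other variables, and the strength of the direct relationship between them is the partial correlation

$$
\rho_{ij} = -\frac{\Lambda_{ij}}{\sqrt{\Lambda_{ii}\,\Lambda_{jj}}}
$$

A sparse estimate of $\Lambda$ therefore keeps only the strong, direct dependencies between variables, which is exactly what we want to monitor.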

Outline of method

  1. To capture the correlations between variables, assume a multivariate normal distribution behind the data and estimate the precision matrix (the inverse of the covariance matrix).
  2. Because noise in real data makes it difficult to extract only the meaningful correlations, we use sparse inverse covariance matrix estimation (Graphical Lasso) this time.
  3. The anomaly score is defined as the per-variable difference (KL divergence) between the probability distributions of the normal data and the evaluation data (online monitoring); see the formula after this list.
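
For reference, the per-variable quantity computed later in `Calc_KL` can be written as follows (transcribed from the implementation below). For variable $i$, the precision matrix $\Lambda$ and covariance matrix $\Sigma$ of the normal model ($A$) and of the evaluation window ($B$) are partitioned so that $\lambda$ is the precision diagonal entry for variable $i$, $\boldsymbol{l}$ the corresponding off-diagonal precision column, $\sigma$ and $\mathbf{w}$ the matching pieces of the covariance matrix, and $W$ the covariance block of the remaining variables:

$$
d_{AB}^{(i)} = \mathbf{w}_A^{\top}\boldsymbol{l}_A
+ \frac{1}{2}\left(\boldsymbol{l}_B^{\top} W_B \boldsymbol{l}_B - \boldsymbol{l}_A^{\top} W_A \boldsymbol{l}_A\right)
+ \frac{1}{2}\left(\ln\frac{\lambda_A}{\lambda_B} + \sigma_A - \sigma_B\right)
$$

The reported score for variable $i$ combines both directions, $\max\!\left(-d_{AB}^{(i)},\, -d_{BA}^{(i)}\right)$, exactly as in the code.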

For details of the method, reading [Anomaly Detection and Change Detection (Kodansha Machine Learning Professional Series)](https://www.amazon.co.jp/dp/4061529080) will deepen your understanding. I also studied this book and put the method into practice in my actual work.

Implementation of anomaly detection by sparse structure learning

This time, I used the Water Treatment Plant dataset from the UCI Machine Learning Repository as the sample. The first 100 records were used as normal data, and the remaining 279 were used as evaluation data to see how far they deviate from the normal state.

The Python code is below.

# Import required libraries
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
from scipy import stats 

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity
from sklearn.covariance import GraphicalLassoCV

df = pd.read_csv('watertreatment_mod.csv', encoding='shift_jis', header=0, index_col=0)
df.head()

window_size = 10  # width of the sliding window used for scoring
cr = 0.99         # cumulative probability that defines the control limit

[Screenshot: the first few rows of the input DataFrame]

The input data looks like the above.

Next, we split the data into training data and evaluation data and standardize them.

# Split point
split_point = 100
# Divide into training data (train) and evaluation data (test)
train_df = df.iloc[:split_point, ]
test_df = df.iloc[split_point:, ]

# Standardize the data using statistics of the training data only
sc = StandardScaler()
sc.fit(train_df)
train_df_std = sc.transform(train_df)
test_df_std = sc.transform(test_df)

Next, we estimate the covariance.

# Covariance estimation
n_samples, n_features = train_df_std.shape
# Empirical covariance (for reference; GraphicalLassoCV does its own estimation)
emp_cov = np.dot(train_df_std.T, train_df_std) / n_samples

# Sparse estimation with the Graphical Lasso (regularization strength chosen by CV)
model = GraphicalLassoCV()
model.fit(train_df_std)
cov_ = model.covariance_
prec_ = model.precision_
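
Although this step is not in the original flow, it can help to check visually how sparse the learned precision matrix actually is. A minimal sketch (my addition; it only assumes the `prec_` array from the cell above):

# Optional check: visualize the sparsity pattern of the learned precision matrix.
# Off-diagonal entries shrunk to (near) zero by the Graphical Lasso indicate
# variable pairs with no direct (partial) correlation.
vmax = np.max(np.abs(prec_))
plt.figure(figsize=(6, 5))
plt.imshow(prec_, cmap="coolwarm", vmin=-vmax, vmax=vmax)
plt.colorbar(label="precision")
plt.title("Sparse precision matrix (training data)")
plt.show()

# Fraction of off-diagonal entries that are (near) zero
off_diag = prec_[~np.eye(prec_.shape[0], dtype=bool)]
print("sparsity:", np.mean(np.abs(off_diag) < 1e-8))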

Next, we define a function to compute the KL divergence.

def Calc_KL(cov_, prec_, xtest):
    """Calculate the per-variable KL divergence between the normal and test models.

    Parameters
    ----------
    cov_ : np.array
        Covariance matrix of the training (normal) data
    prec_ : np.array
        Precision matrix of the training (normal) data
    xtest : np.array
        Window of evaluation data

    Returns
    -------
    d_ab : np.array
        Per-variable anomaly score based on the KL divergence
    """
    n_samples, n_features = xtest.shape

    d_abp = np.zeros(n_features)
    d_abm = np.zeros(n_features)
    d_ab = np.zeros(n_features)

    # Fit a sparse Gaussian model to the evaluation window
    model_test = GraphicalLassoCV()
    try:
        model_test.fit(xtest)
    except FloatingPointError:
        print("floating error")
        return d_ab

    cov__test = model_test.covariance_
    prec__test = model_test.precision_
    
    # Calculate the magnitude of correlation break-down for each variable.
    # Rows/columns are cyclically rotated so that variable i comes first, then
    # the precision/covariance matrices are split into the blocks that appear
    # in the per-variable KL divergence formula.
    for i in range(n_features):
        temp_prec_a = np.r_[prec_[i:n_features, :], prec_[0:i, :]]
        temp_prec_a = np.c_[temp_prec_a[:, i:n_features], temp_prec_a[:, 0:i]]
        temp_prec_b = np.r_[prec__test[i:n_features, :], prec__test[0:i, :]]
        temp_prec_b = np.c_[temp_prec_b[:, i:n_features], temp_prec_b[:, 0:i]]
        temp_cov_a = np.r_[cov_[i:n_features, :], cov_[0:i, :]]
        temp_cov_a = np.c_[temp_cov_a[:, i:n_features], temp_cov_a[:, 0:i]]
        temp_cov_b = np.r_[cov__test[i:n_features, :], cov__test[0:i, :]]
        temp_cov_b = np.c_[temp_cov_b[:, i:n_features], temp_cov_b[:, 0:i]]
        # Blocks for variable i (now at index 0) and for the remaining variables
        la = temp_prec_a[1:, 0]      # off-diagonal precision column of variable i (normal model)
        lama = temp_prec_a[0, 0]     # precision diagonal entry of variable i (normal model)
        Wa = temp_cov_a[1:, 1:]      # covariance block of the remaining variables (normal model)
        wa = temp_cov_a[1:, 0]       # covariance column of variable i (normal model)
        sigmaa = temp_cov_a[0, 0]    # variance of variable i (normal model)
        lb = temp_prec_b[1:, 0]      # the same blocks for the test-window model
        lamb = temp_prec_b[0, 0]
        Wb = temp_cov_b[1:, 1:]
        wb = temp_cov_b[1:, 0]
        sigmab = temp_cov_b[0, 0]

        d_abp[i] = np.dot(wa, la) + 0.5*(np.dot(np.dot(lb, Wb), lb) - np.dot(np.dot(la, Wa), la)) + 0.5*(np.log(lama/lamb) + sigmaa - sigmab)
        d_abm[i] = np.dot(wb, lb) + 0.5*(np.dot(np.dot(la, Wa), la) - np.dot(np.dot(lb, Wb), lb)) + 0.5*(np.log(lamb/lama) + sigmab - sigmaa)
        d_ab[i] = max(-d_abp[i], -d_abm[i])
        
    return d_ab
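
As a quick sanity check (my addition, not part of the original article), the function can be applied to a window taken from the training data itself; scores on normal data should stay well below those of anomalous windows:

# Illustrative sanity check: score a window of the training data itself.
# Calc_KL returns all zeros if the sparse fit on the window fails.
sample_window = train_df_std[0:window_size, :]
d_ab_train = Calc_KL(cov_, prec_, sample_window)
print(d_ab_train.round(3))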

Next, we prepare a function that calculates the control limit by kernel density estimation.

def cl_limit(x, cr=0.99):
    """Calculate the control limit by kernel density estimation.

    Parameters
    ----------
    x : np.array
        Anomaly scores (KL divergence)
    cr : float
        Cumulative probability that defines the control limit

    Returns
    -------
    cl : float
        Control limit
    """
    X = x.reshape(np.shape(x)[0], 1)
    bw = (np.max(X) - np.min(X)) / 100
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    X_plot = np.linspace(np.min(X), np.max(X), 1000)[:, np.newaxis]
    log_dens = kde.score_samples(X_plot)
    # Normalize the density on the grid and accumulate it to approximate the CDF
    prob = np.exp(log_dens) / np.sum(np.exp(log_dens))
    calprob = np.cumsum(prob)

    # The control limit is the first grid point whose cumulative probability exceeds cr
    cl = X_plot[np.min(np.where(calprob > cr))]

    return cl
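
A minimal usage sketch with synthetic scores (purely illustrative; the gamma-distributed `fake_scores` are my assumption, not data from the article). The function returns the first grid point whose estimated cumulative probability exceeds `cr`:

# Illustrative only: control limit for synthetic, right-skewed scores
rng = np.random.RandomState(0)
fake_scores = rng.gamma(shape=2.0, scale=1.0, size=500)
print(cl_limit(fake_scores, cr=0.99))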

We cross-validate on the training data to calculate the control limit.

K = 5
cv_data_size = np.shape(train_df_std)[0] // K
n_train_samples = np.shape(train_df_std)[0]
counter = 0

# Time stamps from the DataFrame index, used later for the x-axis labels
TimeIndices = np.array(df.index)

for i in range(K):

    # Fold i is the CV test block; the remaining folds form the CV training data
    cv_train_data = np.r_[train_df_std[0:i*cv_data_size, ], train_df_std[(i+1)*cv_data_size:, ]]
    if i < K-1:
        cv_test_data = train_df_std[i*cv_data_size:(i+1)*cv_data_size, ]
    else:
        cv_test_data = train_df_std[i*cv_data_size:, ]

    model_cv = GraphicalLassoCV()
    model_cv.fit(cv_train_data)
    cov__cv = model_cv.covariance_
    prec__cv = model_cv.precision_

    # Slide a window over the CV test block and compute the per-variable scores
    for n in range(window_size, np.shape(cv_test_data)[0]):
        count = i*cv_data_size + n
        tempX = cv_test_data[n-window_size:n, :]
        d_ab_temp = Calc_KL(cov__cv, prec__cv, tempX)

        if counter == 0:
            d_ab = d_ab_temp.reshape(1, n_features)
            TimeIndices2 = TimeIndices[count]
        else:
            d_ab = np.r_[d_ab, d_ab_temp.reshape(1, n_features)]
            TimeIndices2 = np.vstack((TimeIndices2, TimeIndices[count]))

        counter = counter + 1

split_point = np.shape(d_ab)[0]
# Drop windows where Calc_KL failed (they return all-zero rows)
d_ab_cv = d_ab[np.sum(d_ab, axis=1) != 0, :]

# Control limit for each variable
cl = np.zeros([n_features])
for i in range(n_features):
    cl[i] = cl_limit(d_ab_cv[:, i], cr)

Finally, we score the evaluation data against the normal-state model and visualize the results.

# Compute the scores for the evaluation data against the normal-state model
n_test_samples = np.shape(test_df_std)[0]

for n in range(window_size, n_test_samples):
    tempX = test_df_std[n-window_size:n, :]

    d_ab_temp = Calc_KL(cov_, prec_, tempX)

    d_ab = np.r_[d_ab, d_ab_temp.reshape(1, n_features)]
    TimeIndices2 = np.vstack((TimeIndices2, TimeIndices[n+n_train_samples]))

x2 = [0, np.shape(d_ab)[0]]
x3 = [split_point, split_point]
x = range(0, np.shape(TimeIndices2)[0], 20)
NewTimeIndices = np.array(TimeIndices2[x])

plt.figure(figsize=(200, 3))
for i in range(n_features):
    plt.subplot(1, n_features, i+1)
    plt.title('%s Contribution' % (i))
    plt.plot(d_ab[:, i], marker="o")
    plt.xlabel("Time")
    plt.ylabel("Contribution")
    plt.xticks(x, NewTimeIndices, rotation='vertical')
    # Red solid line: control limit; blue dashed line: CV/evaluation boundary
    y2 = [cl[i], cl[i]]
    plt.plot(x2, y2, ls="-", color="r")
    y3 = [0, np.nanmax(d_ab[:, i])]
    plt.plot(x3, y3, ls="--", color="b")

plt.show()
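
As a numeric complement to the plots (my addition), we can count how many evaluation-period windows exceed each variable's control limit; rows after `split_point` correspond to the evaluation data:

# Supplemental summary: windows above the control limit per variable
eval_scores = d_ab[split_point:, :]
exceed_counts = (eval_scores > cl).sum(axis=0)
for i, c in enumerate(exceed_counts):
    print("variable %d: %d windows above the control limit" % (i, c))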

In closing

Thank you for reading to the end. This time, I implemented anomaly detection by sparse structure learning.

If you notice anything that should be corrected, I would appreciate it if you could let me know.
