- Articles sent by data scientists from the manufacturing industry
- This time, we have implemented and organized anomaly detection methods that can be used in the manufacturing industry.

Continuing from the previous anomaly detection method (MSPC), this time we have organized the anomaly detection method by sparse structure learning.

Hotelling's $ T ^ 2 $ method and $ MSPC $, which are well-known anomaly detection methods, are methods used when monitoring data whose average value does not change. On the other hand, at the manufacturing site, not only such data, but also equipment and operation data with a data structure whose values change while maintaining a certain relationship between variables. At that time, one of the anomaly detection methods focusing on the relationships between variables (correlation, etc.) is the anomaly detection method by sparse structure learning.

Outline of method

- To find the correlation between variables, assume a multivariate normal distribution behind the data and find the precision matrix (the reciprocal of the covariance matrix).
- Since it is difficult to extract only meaningful correlations from actual data (effect of noise), we will use Sparse Inverse Covariance Matrix Optimization this time.
- Abnormality is defined as the difference (KL distance) in the probability distribution between normal and evaluation (online monitoring).

For details of the method, see ([Abnormality detection and change point detection](https://www.amazon.co.jp/%E7%95%B0%E5%B8%B8%E6%A4%9C%E7%9F%A5] % E3% 81% A8% E5% A4% 89% E5% 8C% 96% E6% A4% 9C% E7% 9F% A5-% E6% A9% 9F% E6% A2% B0% E5% AD% A6% E7% BF% 92% E3% 83% 97% E3% 83% AD% E3% 83% 95% E3% 82% A7% E3% 83% 83% E3% 82% B7% E3% 83% A7% E3% 83% 8A% E3% 83% AB% E3% 82% B7% E3% 83% AA% E3% 83% BC% E3% 82% BA-% E4% BA% 95% E6% 89% 8B-% E5% If you read 89% 9B / dp / 4061529080)), you will deepen your understanding. I also read this book, studied it, and put it into practice in my actual work.

This time, I used a dataset called Water Treatment Plant from the UCI Machine Learning Repository for the sample. We used 100 normal data and the remaining 279 as evaluation data to see how much they deviated from the normal state.

The python code is below.

```
#Import required libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors.kde import KernelDensity
from sklearn.covariance import GraphicalLassoCV
df = pd.read_csv('watertreatment_mod.csv', encoding='shift_jis', header=0, index_col=0)
df.head()
window_size = 10
cr = 0.99
```

The input data looks like the above.

Next, we will divide the training data and evaluation data and standardize them.

```
#Split point
split_point = 100
#Divided into training data (train) and evaluation data (test)
train_df = df.iloc[:(split_point-1),]
test_df = df.iloc[split_point:,]
#Standardize data
sc = StandardScaler()
sc.fit(train_df)
train_df_std = sc.transform(test_df)
test_df_std = sc.transform(test_df)
```

Next, we estimate the covariance.

```
#Covariance estimation
n_samples, n_features = train_df_std.shape
emp_cov = np.dot(train_df_std.T, train_df_std) / n_samples
model = GraphicalLassoCV()
model.fit(train_df_std)
cov_ = model.covariance_
prec_ = model.precision_
```

Next, define a function to find the KL distance.

```
def Calc_KL(cov_, prec_, xtest):
"""Calculate the KL distance
Parameters
----------
cov_ : np.array
Covariance matrix of training data
prec_ : np.array
Accuracy matrix of training data
df : pd.DataFrame
data set
Returns
-------
d_ab : pd.DataFrame
KL distance
"""
n_samples, n_features = xtest.shape
d_abp=np.zeros(n_features)
d_abm=np.zeros(n_features)
d_ab=np.zeros(n_features)
model_test = GraphicalLassoCV()
try:
model_test.fit(xtest)
except FloatingPointError:
print("floating error")
return d_ab
cov__test = model_test.covariance_
prec__test = model_test.precision_
#Calculate the magnitude of correlation collapse for each variable
for i in range(n_features):
temp_prec_a = np.r_[prec_[i:n_features,:], prec_[0:i,:]]
temp_prec_a = np.c_[temp_prec_a[:,i:n_features], temp_prec_a[:,0:i]]
temp_prec_b = np.r_[prec__test[i:n_features,:], prec__test[0:i,:]]
temp_prec_b = np.c_[temp_prec_b[:,i:n_features], temp_prec_b[:,0:i]]
temp_cov_a = np.r_[cov_[i:n_features,:], cov_[0:i,:]]
temp_cov_a = np.c_[temp_cov_a[:,i:n_features], temp_cov_a[:,0:i]]
temp_cov_b = np.r_[cov__test[i:n_features,:], cov__test[0:i,:]]
temp_cov_b = np.c_[temp_cov_b[:,i:n_features], temp_cov_b[:,0:i]]
La = temp_prec_a[:-1, :-1]
la = temp_prec_a[:-1, -1]
lama = temp_prec_a[-1, -1]
Wa = temp_cov_a[:-1, :-1]
wa = temp_cov_a[:-1, -1]
sigmaa = temp_cov_a[-1, -1]
Lb = temp_prec_b[:-1, :-1]
lb = temp_prec_b[:-1, -1]
lamb = temp_prec_b[-1, -1]
Wb = temp_cov_b[:-1, :-1]
wb = temp_cov_b[:-1, -1]
sigmab = temp_cov_b[-1, -1]
d_abp[i] = np.dot(wa, la)+0.5*(np.dot(np.dot(lb, Wb), lb)-np.dot(np.dot(la, Wa), la))+0.5*(np.log(lama/lamb)+sigmaa-sigmab)
d_abm[i] = np.dot(wb, lb)+0.5*(np.dot(np.dot(la, Wa), la)-np.dot(np.dot(lb, Wb), lb))+0.5*(np.log(lamb/lama)+sigmab-sigmaa)
d_ab[i] = max(-d_abp[i], -d_abm[i])
return d_ab
```

Next, prepare a function that calculates the management limit by kernel density estimation.

```
def cl_limit(x, cr=0.99):
"""Calculate control limits
Parameters
----------
x : np.array
KL distance
cr : float
Boundaries of control limits
Returns
-------
cl : float
Boundary point of management limit
"""
X = x.reshape(np.shape(x)[0],1)
bw= (np.max(X)-np.min(X))/100
kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
X_plot = np.linspace(np.min(X), np.max(X), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
prob = np.exp(log_dens) / np.sum(np.exp(log_dens))
calprob = np.zeros(np.shape(prob)[0])
calprob[0] = prob[0]
for i in range(1,np.shape(prob)[0]):
calprob[i]=calprob[i-1]+prob[i]
cl = X_plot[np.min(np.where(calprob>cr))]
return cl
```

Cross-validate the training data to calculate the management limit.

```
K = 5
cv_data_size = np.int(np.shape(train_df_std)[0]/5)
n_train_samples = np.shape(train_df_std)[0]
counter = 0
for i in range(K):
cv_train_data=np.r_[train_df_std[0:i*cv_data_size,], train_df_std[(i+1)*cv_data_size:,]]
if i < K-1:
cv_test_data=train_df_std[i*cv_data_size:(i+1)*cv_data_size,]
else:
cv_test_data=train_df_std[i*cv_data_size:,]
model_cv = GraphicalLassoCV()
model_cv.fit(cv_train_data)
cov__cv = model.covariance_
prec__cv = model.precision_
for n in range(window_size, np.shape(cv_test_data)[0]):
count = i*cv_data_size + n
tempX = cv_test_data[n-window_size:n,:]
d_ab_temp = Calc_KL(cov__cv, prec__cv, tempX)
if 0 == counter:
d_ab = d_ab_temp.reshape(1,n_features)
TimeIndices2 = TimeIndices[count]
else:
d_ab = np.r_[d_ab,d_ab_temp.reshape(1,n_features)]
#Here error
TimeIndices2 = np.vstack((TimeIndices2,TimeIndices[count]))
counter = counter + 1
split_point = np.shape(d_ab)[0]
d_ab_cv = d_ab[np.sum(d_ab,axis=1)!=0,:]
cl = np.zeros([n_features])
for i in range(n_features):
cl[i] = cl_limit(d_ab_cv[:,i],cr)
```

Finally, the covariance of the evaluation data is estimated and visualized.

```
#Estimate covariance for evaluation data
n_test_samples = np.shape(test_df_std)[0]
for n in range(window_size, n_test_samples):
tempX = test_df_std[n-window_size:n,:]
d_ab_temp = Calc_KL(cov_, prec_, tempX)
d_ab = np.r_[d_ab,d_ab_temp.reshape(1,n_features)]
TimeIndices2 = np.vstack((TimeIndices2,TimeIndices[n+n_train_samples]))
x2 = [0, np.shape(d_ab)[0]]
x3 = [split_point, split_point]
x = range(0, np.shape(TimeIndices2)[0],20)
NewTimeIndices = np.array(TimeIndices2[x])
for i in range(38):
plt.figure(figsize=(200, 3))
plt.subplot(1, 38, i+1)
plt.title('%s Contribution' % (i))
plt.plot(d_ab[:, i], marker="o")
plt.xlabel("Time")
plt.ylabel("Contribution")
plt.xticks(x,NewTimeIndices,rotation='vertical')
y2 = [cl[i],cl[i]]
plt.plot(x2,y2,ls="-", color = "r")
y3 = [0, np.nanmax(d_ab[:,i])]
plt.plot(x3,y3,ls="--", color = "b")
plt.show()
```

Thank you for reading to the end. This time, we implemented anomaly detection by sparse structure learning.

If you have a request for correction, we would appreciate it if you could contact us.

Recommended Posts