[PYTHON] I tried to implement anomaly detection using a hidden Markov model

A series of articles by a data scientist working in the manufacturing industry
This time, I implemented anomaly detection using a hidden Markov model.

Introduction

In this article, I implement an anomaly detection method based on a hidden Markov model (HMM), a highly versatile model for time-series analysis.

What is anomaly detection using a hidden Markov model?

I will omit the theoretical details here. In outline, the method assumes a sequence of hidden "states" behind the data, builds a model of the probability of observing the data, and uses that model to detect anomalies. Its main features are briefly described below.

A versatile time-series model
Unobservable hidden states are assumed, and each state has its own output probability distribution
Anomalies in complex time-series data can be detected by combining the state transition probabilities with the output probability obtained for each state
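To make the features above concrete, here is a minimal sketch that draws data from a toy two-state Gaussian HMM. It is not part of the original article: the function name `sample_gaussian_hmm` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_gaussian_hmm(n_steps, startprob, transmat, means, stds, seed=0):
    """Draw hidden states and observations from a 1-D Gaussian HMM."""
    rng = np.random.default_rng(seed)
    n_states = len(startprob)
    states = np.empty(n_steps, dtype=int)
    obs = np.empty(n_steps)
    for t in range(n_steps):
        # The first state follows startprob; later states follow the
        # transition-matrix row of the previous state.
        probs = startprob if t == 0 else transmat[states[t - 1]]
        states[t] = rng.choice(n_states, p=probs)
        # Each hidden state has its own output (emission) distribution.
        obs[t] = rng.normal(means[states[t]], stds[states[t]])
    return states, obs

# Illustrative values only (assumptions, not estimated from the ECG data)
startprob = np.array([1.0, 0.0])
transmat = np.array([[0.95, 0.05],
                     [0.10, 0.90]])
means = np.array([0.0, 5.0])
stds = np.array([0.5, 1.0])

states, obs = sample_gaussian_hmm(200, startprob, transmat, means, stds)
```

In practice only `obs` would be observed. Fitting an HMM amounts to estimating the start probabilities, the transition matrix, and the per-state output distributions from such observations alone.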

The dataset used this time is ECG data known as the qtdb/sel102 ECG dataset.

URL: http://www.cs.ucr.edu/~eamonn/discords/

The Python code is below.

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
# Jupyter notebook magic for inline plots
%matplotlib inline

# Import the HMM implementation
from hmmlearn import hmm

# Load the tab-separated data file
df = np.loadtxt("qtdbsel102.txt", delimiter="\t")
# Use the data in the third column (index 2)
# Split it into training data and test data
train_df = df[0:3000, 2]
test_df = df[3000:6000, 2]

# Visualize the training data
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.grid(False)

ax.plot(train_df)

ax.set_title('train_df')
ax.set_ylabel('value')
ax.set_xlabel('time')

# Visualize the evaluation data
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.grid(False)

ax.plot(test_df)

ax.set_title('test_df')
ax.set_ylabel('value')
ax.set_xlabel('time')

Now that we have seen what the training data and the evaluation data look like, we estimate the distribution, that is, we fit the model to the training data.

num_states = 15

X = train_df.reshape(-1, 1)
lengths = [len(train_df)]

np.random.seed(seed=7)
model = hmm.GaussianHMM(n_components=num_states, covariance_type='full')
model.fit(X, lengths)

This time, the number of hidden states (num_states) is set to 15. With the hmmlearn library, the model is fitted with hmmlearn.hmm.GaussianHMM.fit().

Next, the degree of abnormality is calculated using the parameters obtained in the distribution estimation.

# model.score() computes the log-likelihood log p(x') of a series x'.
# Scoring every prefix x[0..i] and differencing gives, for each time i,
# the degree of abnormality -log p(x_i | x_0, ..., x_{i-1}).
logprob = np.array([model.score(train_df[0:i+1].reshape(-1, 1)) for i in range(len(train_df))])
train_abnormality = -np.append(logprob[0], np.diff(logprob))

# Set the threshold
ratio = 0.005  # Fraction of training points to be judged abnormal
threshold = np.sort(train_abnormality)[int((1-ratio)*len(train_abnormality))]
print(threshold)
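The score computed above is the negative conditional log-likelihood of each point given the points before it. As a sketch of how the same quantity could be obtained in a single pass, the scaled forward algorithm below computes it step by step for a one-dimensional Gaussian HMM. This is not the article's code: the function name `stepwise_anomaly_scores` and the toy parameters are hypothetical. For a fitted one-feature hmmlearn model, the inputs would presumably come from `model.startprob_`, `model.transmat_`, `model.means_.ravel()`, and `model.covars_.ravel()`, which should be checked against your hmmlearn version.

```python
import numpy as np

def stepwise_anomaly_scores(x, startprob, transmat, means, variances):
    """Return -log p(x_t | x_1..x_{t-1}) for every t (scaled forward algorithm)."""
    x = np.asarray(x, dtype=float)
    # Gaussian density of every observation under every state: shape (T, n_states)
    dens = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
            / np.sqrt(2 * np.pi * variances))
    scores = np.empty(len(x))
    alpha = startprob * dens[0]
    for t in range(len(x)):
        if t > 0:
            # Predict the next state distribution, then weight by the new observation.
            alpha = (alpha @ transmat) * dens[t]
        c = alpha.sum()          # p(x_t | x_1..x_{t-1})
        scores[t] = -np.log(c)
        alpha = alpha / c        # filtered state distribution p(z_t | x_1..x_t)
    return scores

# Toy parameters (assumptions for illustration, not fitted values)
startprob = np.array([0.6, 0.4])
transmat = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
means = np.array([0.0, 5.0])
variances = np.array([1.0, 1.0])

x = np.array([0.1, -0.2, 5.1, 20.0])
scores = stepwise_anomaly_scores(x, startprob, transmat, means, variances)
print(scores)  # the last point, far from both state means, gets the largest score
```

The sum of these scores equals the negative log-likelihood of the whole series, so the result should match the prefix-differencing approach up to floating-point error. One caveat of this simple version is that densities are computed in linear space, so a very extreme outlier could underflow to zero; a log-space implementation would be more robust.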

Finally, anomaly detection is performed on the evaluation data using the fitted model.

# Compute the degree of abnormality for the evaluation data
logprob = np.array([model.score(test_df[0:i+1].reshape(-1, 1)) for i in range(len(test_df))])
test_abnormality = -np.append(logprob[0], np.diff(logprob))

# Visualize the degree of abnormality of the evaluation data
# The threshold is shown as a dashed line
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.grid(False)

ax.axhline(threshold, ls="--", color="red")
ax.plot(test_abnormality, color="gray")

ax.set_title('test_df_abnormality')
ax.set_xlabel('time')
ax.set_ylabel('test_df_abnormality')
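The plot shows where the degree of abnormality crosses the threshold, but it can also be useful to list the flagged time ranges explicitly. The following is a minimal sketch of one way to do that. The helper `anomalous_intervals` and the toy scores are hypothetical and not part of the original article.

```python
import numpy as np

def anomalous_intervals(scores, threshold):
    """Return (start, end) index pairs (end inclusive) of runs where scores > threshold."""
    idx = np.flatnonzero(np.asarray(scores) > threshold)
    if idx.size == 0:
        return []
    # Start a new run wherever consecutive flagged indices are not adjacent.
    breaks = np.flatnonzero(np.diff(idx) > 1) + 1
    return [(int(run[0]), int(run[-1])) for run in np.split(idx, breaks)]

# Toy scores for illustration
toy_scores = np.array([0.1, 0.2, 3.0, 3.5, 0.1, 4.0, 0.2])
print(anomalous_intervals(toy_scores, threshold=1.0))  # [(2, 3), (5, 5)]
```

With the article's variables, the call would be `anomalous_intervals(test_abnormality, threshold)`, giving index ranges within `test_df`.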

Finally, let us visualize the evaluation data and the degree of abnormality together.

# Visualize the evaluation data together with its degree of abnormality
# The threshold is shown as a dashed line
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax1.grid(False)

# Create a second y-axis (ax2) that shares the x-axis with ax1
ax2 = ax1.twinx()

ax2.axhline(threshold, ls="--", color="red")
ax1.plot(test_df)
ax2.plot(test_abnormality, color="gray")

ax1.set_title('test_df_and_abnormality')
ax1.set_ylabel('value')
ax1.set_xlabel('time')
ax2.set_ylabel('abnormality')

The implementation was relatively easy. However, the computational cost is fairly high: the code above re-scores every prefix of the series, so the work grows roughly quadratically with the series length, and the calculation time needs to be considered before using it in practice. In addition, increasing the number of states makes the model more complex, but there is no clear-cut criterion for choosing it, so an appropriate number of states has to be determined for the data at hand.
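One common heuristic for comparing candidate numbers of states is an information criterion such as BIC, which penalizes the log-likelihood by the number of free parameters. This is not something the original article uses. The sketch below assumes a Gaussian HMM with full covariance matrices, and the function names are hypothetical.

```python
import math

def gaussian_hmm_n_params(n_states, n_features=1):
    """Count free parameters of a Gaussian HMM with full covariance matrices."""
    start = n_states - 1                    # start probabilities sum to 1
    trans = n_states * (n_states - 1)       # each transition row sums to 1
    means = n_states * n_features
    covars = n_states * n_features * (n_features + 1) // 2  # symmetric matrices
    return start + trans + means + covars

def bic(log_likelihood, n_states, n_samples, n_features=1):
    """Bayesian information criterion; lower values are preferred."""
    k = gaussian_hmm_n_params(n_states, n_features)
    return -2.0 * log_likelihood + k * math.log(n_samples)

# With 15 states and one feature, as in this article:
print(gaussian_hmm_n_params(15))  # 254
```

To use it, one would fit a model for each candidate number of states, pass `model.score(X)` as `log_likelihood` and `len(X)` as `n_samples`, and compare the resulting values. BIC is only a guide; the final choice should still be checked against how well anomalies are detected on the actual data.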

Conclusion

Thank you for reading to the end. This time, I implemented anomaly detection using a hidden Markov model.

If you find anything that needs correcting, I would appreciate it if you could let me know.