Matrix Profile is said to be revolutionizing time series data mining. However, there still seem to be few Japanese-language articles about it, so I would like to add one. This article gives an overview of Matrix Profile and an example implementation in Python.
In short, it is a similarity join over the subsequences of time series.
Consider a pair of time series (A, B); the two can also be the same series. Each series is split into subsequences of length m. For every subsequence in A, we record the index of the closest subsequence in B and the distance (similarity) to it. This collection of distances and indices is called the Matrix Profile. If A and B are the same series, each subsequence would trivially match itself best, so the region around the subsequence itself is excluded from the search.
Example of seismic data and its Matrix Profile [Yeh et al., ICDM 2016]
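To make the definition concrete, here is a minimal brute-force sketch of a self-join in NumPy. The function names and the half-window exclusion zone are my own assumptions for illustration, not the matrixprofile-ts API, and this is far too slow for real data.

```python
import numpy as np

def znorm(x):
    """z-normalize a 1-D array; a constant window maps to all zeros."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x, dtype=float)

def naive_matrix_profile(ts, m, exclusion=None):
    """Brute-force self-join in O(n^2 * m). For illustration only."""
    ts = np.asarray(ts, dtype=float)
    n_sub = len(ts) - m + 1
    if exclusion is None:
        exclusion = m // 2  # assumed size of the exclusion zone
    # One z-normalized subsequence per row
    subs = np.array([znorm(ts[i:i + m]) for i in range(n_sub)])
    profile = np.full(n_sub, np.inf)
    index = np.full(n_sub, -1, dtype=int)
    for i in range(n_sub):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        # Ignore trivial matches around the subsequence itself
        lo, hi = max(0, i - exclusion), min(n_sub, i + exclusion + 1)
        dists[lo:hi] = np.inf
        j = int(np.argmin(dists))
        profile[i], index[i] = dists[j], j
    return profile, index

# A pure sine wave with period 50: every window has an exact copy elsewhere
t = np.arange(400)
ts = np.sin(2 * np.pi * t / 50)
profile, index = naive_matrix_profile(ts, m=50)
```

For this perfectly periodic input every profile value is essentially zero, and each nearest neighbor sits a whole number of periods away.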
Computed naively, this takes an enormous amount of time, but clever algorithms based on FFT and similar techniques make it feasible in realistic time. It's a revolution, isn't it?
For more information, see https://www.cs.ucr.edu/~eamonn/MatrixProfile.html. The technique comes from the lab of Dr. Keogh at UCR, who is well known for SAX and shapelets. The papers are published as a series with titles of the form "Matrix Profile [number]:", and as of November 2019 the series appears to have reached Matrix Profile XIX.
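As a rough illustration of where the FFT comes in, the sketch below computes the z-normalized distance from one query to every window of a series using an FFT-based sliding dot product. It follows the general idea behind the MASS-style distance profile; the function names are my own, it assumes no window is constant (standard deviation zero), and it is not the code used inside any particular library.

```python
import numpy as np

def sliding_dot_product(query, ts):
    """Dot product of query with every length-m window of ts, via FFT."""
    m, n = len(query), len(ts)
    fft_len = n + m - 1  # length of the full linear convolution
    conv = np.fft.irfft(
        np.fft.rfft(ts, fft_len) * np.fft.rfft(query[::-1], fft_len), fft_len
    )
    return conv[m - 1:n]  # keep only fully overlapping positions

def rolling_mean_std(ts, m):
    """Mean and standard deviation of every length-m window, via cumulative sums."""
    c1 = np.cumsum(np.insert(ts, 0, 0.0))
    c2 = np.cumsum(np.insert(ts ** 2, 0, 0.0))
    mean = (c1[m:] - c1[:-m]) / m
    var = (c2[m:] - c2[:-m]) / m - mean ** 2
    return mean, np.sqrt(np.maximum(var, 0.0))

def distance_profile(query, ts):
    """z-normalized Euclidean distance from query to every window of ts."""
    query = np.asarray(query, dtype=float)
    ts = np.asarray(ts, dtype=float)
    m = len(query)
    qt = sliding_dot_product(query, ts)
    mu, sigma = rolling_mean_std(ts, m)
    # Pearson correlation between the query and each window
    corr = (qt - m * query.mean() * mu) / (m * query.std() * sigma)
    return np.sqrt(np.maximum(2 * m * (1 - corr), 0.0))

rng = np.random.default_rng(0)
ts = rng.standard_normal(300)
query = ts[40:70]
dp = distance_profile(query, ts)
```

One distance profile costs O(n log n) instead of O(n m), which is the kind of saving that makes the full computation practical.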
There is already a very easy-to-use Python library for Matrix Profile called matrixprofile-ts.
pip install matrixprofile-ts
It is also available here. There also seem to be R and C++ implementations. This time, we will try to detect anomalies in ECG data using matrixprofile-ts.
The task is to take an ECG series that contains an abnormal waveform, compute an anomaly score at each point in an unsupervised manner, and identify the abnormal part. Let's move on to the implementation.
We will use an ECG dataset called qtdbsel102 (https://www.cs.ucr.edu/~eamonn/discords/qtdbsel102.txt). This data is also used in Professor Keogh's papers. Since the series is very long, we will use only the first 5000 points. Incidentally, the subsequences are normalized during the Matrix Profile computation, so no preprocessing is required.
from matrixprofile import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Data reading
ECG = pd.read_csv('qtdbsel102.txt', header=None, delimiter='\t')
X = ECG.values[:5000,2]
#ECG visualization
# Note: on matplotlib 3.6 and later this style is named 'seaborn-v0_8'
plt.style.use('seaborn')
plt.figure(figsize=(16, 8))
plt.xlim(0, 5000)
plt.plot(X)
plt.axvspan(4200, 4400, facecolor='r', alpha=0.1)
An abnormal waveform appears around points 4200 to 4400. It is obvious at a glance.
Thanks to the excellent library, the implementation is easy. The only hyperparameter is the subsequence length, which I set to 200 here. For the algorithm, I will use **SCRIMP++**, the newest and fastest method, although its accuracy is slightly lower.
#calculation of matrixProfile
window_size = 200
MP = matrixProfile.scrimp_plus_plus(X,window_size)
In periodic data such as ECG, a normal subsequence should have very similar waveforms elsewhere in the series. Conversely, a subsequence with no similar waveform anywhere, that is, a place where the Matrix Profile distance is large, can be considered abnormal. This is the so-called nearest-neighbor method.
The MP computed above holds both the array of distances and the array of nearest-neighbor indices. Let's take the distance array and visualize it.
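The idea of picking the places with the largest distance can be sketched as a small helper that returns the top-k peaks of a distance array while skipping neighbors of peaks already chosen. The function name and the choice of one full window as the exclusion radius are assumptions for illustration; this is not a function from matrixprofile-ts.

```python
import numpy as np

def top_discords(profile, window_size, k=3):
    """Return up to k indices with the largest profile values,
    with no two chosen indices within window_size of each other."""
    work = np.array(profile, dtype=float)
    work[~np.isfinite(work)] = -np.inf  # never pick inf or NaN entries
    found = []
    for _ in range(k):
        i = int(np.argmax(work))
        if work[i] == -np.inf:
            break  # nothing left to pick
        found.append(i)
        # Mask the neighborhood so the same anomaly is not reported twice
        lo, hi = max(0, i - window_size), min(len(work), i + window_size + 1)
        work[lo:hi] = -np.inf
    return found

toy_profile = np.array([0.1, 0.2, 3.0, 2.9, 0.3, 0.2, 1.5, 0.1])
discord_indices = top_discords(toy_profile, window_size=2, k=2)
print(discord_indices)  # [2, 6]
```

In the toy array the value 2.9 at index 3 is skipped because it belongs to the same peak as index 2.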
#Visualization of Matrix Profile
plt.figure(figsize=(16, 8))
plt.subplot(2,1,1)
plt.xlim(0, 5000)
plt.title('ECG')
plt.plot(X)
x = np.arange(4200,4400)
plt.axvspan(4200, 4400, facecolor='r', alpha=0.1)
plt.subplot(2,1,2)
plt.xlim(0, 5000)
plt.title('MatrixProfile')
plt.plot(MP[0])
plt.axvspan(4200, 4400, facecolor='r', alpha=0.1)
The Matrix Profile rises sharply in the abnormal region, which confirms that the anomaly was detected successfully. Next, let's look at the waveform where the Matrix Profile reaches its maximum, that is, the most abnormal waveform.
# Waveform at the Matrix Profile maximum and its nearest-neighbor waveform
from sklearn.preprocessing import StandardScaler
max_point_index = int(np.argmax(MP[0]))
# Cast to int so the value can be used for slicing even if it is stored as a float
nearest_neighbor_index = int(MP[1][max_point_index])
fig, ax = plt.subplots(1, 3, figsize=(22, 6))
ax[0].plot(X[max_point_index:max_point_index+window_size], linewidth=5, color = "r")
ax[0].title.set_text("mp_peak(anomaly_wave)")
ax[1].plot(X[nearest_neighbor_index:nearest_neighbor_index+window_size], linewidth=5, color = "b")
ax[1].title.set_text("nearest_neighbor")
# z-normalize both windows so that their shapes can be compared directly
S = StandardScaler()
ax[2].title.set_text("comparison")
ax[2].plot(S.fit_transform(X[max_point_index:max_point_index+window_size].reshape(-1,1)).flatten(), linewidth=5, color = "r")
ax[2].plot(S.fit_transform(X[nearest_neighbor_index:nearest_neighbor_index+window_size].reshape(-1,1)).flatten(), linewidth=5, color = "b")
plt.tight_layout()
The left plot is the waveform at the Matrix Profile maximum, the middle plot is its nearest neighbor, and the right plot overlays the two after z-normalization. I'm not an ECG expert, so I can't interpret what the waveform means, but you can see the difference: the abnormal waveform has a larger peak, a smaller dip, and is out of phase to begin with. It's interpretable, which is nice.
"Anomaly detection can be done with other methods. Deep learning would do." Many people may think so.
However, the appeal of Matrix Profile is its simplicity. Although the way it is computed is strikingly novel, the final output is nothing more than a matching between subsequences. In other words, even when an anomaly is flagged, a person can easily check whether it is valid. In terms of reliability and tractability, nothing else comes close.
Furthermore, anomaly detection is just one of the uses of Matrix Profile. There seem to be many other applications, including the discovery of motifs and shapelets. It's a revolution, isn't it?
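As a small illustration of how motif discovery relates to the same output: where anomaly detection looks at the maximum of the distance array, the best-matching pair of subsequences sits at its minimum. The sketch below uses hand-made toy arrays, and the function name is hypothetical.

```python
import numpy as np

def top_motif_pair(profile, index):
    """Return (i, j): the subsequence with the smallest nearest-neighbor
    distance and the position of that nearest neighbor."""
    i = int(np.argmin(profile))
    return i, int(index[i])

# Toy values standing in for a distance array and a nearest-neighbor index array
toy_profile = np.array([2.0, 0.3, 1.8, 0.3, 2.5])
toy_index = np.array([3, 3, 0, 1, 2])
motif_pair = top_motif_pair(toy_profile, toy_index)
print(motif_pair)  # (1, 3)
```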
I introduced Matrix Profile and tried out a library called matrixprofile-ts.