[PYTHON] [Latest method] Visualization of time series data and extraction of frequent patterns using Pan-Matrix Profile

Overview

Suppose we have the following time series data.

[Figure: 合成時系列データ.png (synthetic time series data)]

For a time series like this one, the **Pan-Matrix Profile** looks as follows.

[Figure: 無題.jpg] The heatmap at the bottom is the **Pan-Matrix Profile**. Its horizontal axis is time and its vertical axis is the subsequence length. The peaks marked with ★ show at a glance where each frequent pattern in the time series occurs (horizontal axis) and how long it is (vertical axis).

What is Pan-Matrix Profile?

In a previous post I introduced the **Matrix Profile** as an innovative technique for time series data analysis. Put simply, the **Pan-Matrix Profile (PMP)** is a matrix obtained by computing the **Matrix Profile** for every subsequence length $m$ in a given range and stacking the results.

Let's take a look at the example image above.

[Figure: 無題1.jpg]

Here is what each element of the **Pan-Matrix Profile** matrix means. Take the element (504, 80) as an example. As mentioned earlier, the horizontal axis of this matrix is time and the vertical axis is the subsequence length. First, take the subsequence running from point 504 to point 504 + 80 (the blue subsequence on the left in the upper figure). Then find the subsequence of the same length that is most similar to it (the blue one on the right) and compute the distance between the two. The distance measure used here is the **z-normalized Euclidean distance**. In this case the distance is 0.95, so the element (504, 80) holds 0.95.
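
To make the meaning of a single element concrete, here is a minimal brute-force sketch in NumPy. It is not how the matrixprofile library computes things. The helper names znorm and pmp_element, the toy data, and the exclusion zone of m // 2 around the query are all my own assumptions for illustration.

```python
import numpy as np

def znorm(x):
    """Z-normalize a 1-D array (zero mean, unit standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def pmp_element(ts, i, m):
    """Distance from ts[i:i+m] to its most similar subsequence of length m.

    Hypothetical helper: plain brute force. Candidates starting within
    m // 2 of position i are skipped (an assumed exclusion zone) so the
    query does not simply match itself.
    """
    query = znorm(ts[i:i + m])
    best = np.inf
    for j in range(len(ts) - m + 1):
        if abs(i - j) <= m // 2:
            continue
        d = np.linalg.norm(query - znorm(ts[j:j + m]))
        best = min(best, d)
    return best

# Toy series: the same sine burst planted at two positions in low noise
rng = np.random.default_rng(0)
ts = rng.normal(scale=0.1, size=400)
burst = np.sin(np.linspace(0, 4 * np.pi, 80))
ts[50:130] += burst
ts[250:330] += burst

d_motif = pmp_element(ts, 50, 80)    # query sits on the repeated pattern
d_noise = pmp_element(ts, 150, 80)   # query sits on noise only
print(d_motif, d_noise)
```

The query that sits on the repeated pattern gets a much smaller value than the one that sits on noise, which is what makes motifs stand out in the heatmap.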

Incidentally, building this matrix by brute force takes on the order of $O(n^4)$ time, where $n$ is the length of the time series. Reference [1] applies several speed-up techniques to reduce this to $O(n^2 r)$, where $r$ is the number of candidate subsequence lengths.
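
For reference, the brute-force construction can be sketched as below. For each candidate length $m$ it compares every pair of subsequences, which costs roughly $O(n^2 m)$ arithmetic per length, and summing that over every length up to $n$ gives the $O(n^4)$ figure. The function naive_pmp, the NaN padding, and the exclusion zone of m // 2 are assumptions of mine. This is not the method of reference [1] and not the library's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def naive_pmp(ts, windows):
    """Brute-force Pan-Matrix Profile: one row per candidate length.

    Hypothetical helper for illustration only. Positions where a
    subsequence of that length no longer fits are left as NaN.
    """
    ts = np.asarray(ts, dtype=float)
    n = len(ts)
    pmp = np.full((len(windows), n), np.nan)
    for row, m in enumerate(windows):
        subs = sliding_window_view(ts, m)              # shape (n - m + 1, m)
        subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
        # For z-normalized a, b of length m: ||a - b||^2 = 2m - 2 * (a . b)
        sq = 2.0 * m - 2.0 * (subs @ subs.T)
        dist = np.sqrt(np.clip(sq, 0.0, None))
        # Ignore trivial matches that overlap the query too much
        idx = np.arange(n - m + 1)
        dist[np.abs(idx[:, None] - idx[None, :]) <= m // 2] = np.inf
        pmp[row, : n - m + 1] = dist.min(axis=1)
    return pmp

# Toy series with one pattern of length 40 planted twice
rng = np.random.default_rng(1)
ts = rng.normal(scale=0.1, size=300)
pattern = np.sin(np.linspace(0, 2 * np.pi, 40))
ts[30:70] += pattern
ts[200:240] += pattern

windows = [20, 40, 60]
pmp = naive_pmp(ts, windows)
print(pmp.shape)               # (3, 300)
print(pmp[1, 30], pmp[1, 120])  # on the pattern vs. on noise, for m = 40
```

Even this vectorized version builds an (n - m + 1) × (n - m + 1) distance matrix per length, so it is only practical for short series.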

Library for Pan-Matrix Profile "matrixprofile"

There are several Matrix Profile libraries and it is hard to know which one to pick, but this time I will use the Python library called matrixprofile. (The name says it all.)

pip install matrixprofile

Implementation example (1): Application to artificial data

First, let's try it on synthetic data, using a time series that ships with the library.

from matplotlib import pyplot as plt
import numpy as np
import matrixprofile as mp

# Load a synthetic dataset bundled with the library
dataset = mp.datasets.load('motifs-discords-small')
X = dataset['data']

# Plot the raw time series
plt.figure(figsize=(18.0, 6.0))
plt.plot(np.arange(len(X)), X, color="k")
plt.xlim(0, len(X))
plt.title('Synthetic Time Series')

[Figure: 合成時系列データ.png]

The code to create and visualize the Pan-Matrix Profile looks like this:

# Compute the Pan-Matrix Profile and run the other analyses
profile, figures = mp.analyze(X)

# Show only the PMP figure
figures[0]

[Figure: ダウンロード.png] A single call to mp.analyze(X) is all it takes to create a PMP. Besides the PMP matrix itself, profile contains the matrix of nearest-neighbor subsequence indices (PMPI) that is computed for each subsequence while building the PMP, along with analysis results such as motifs and discords.

figures holds the plots that visualize the analysis. They are all displayed when mp.analyze(X) runs, but figures[0] lets you show only the PMP plot.

Implementation example (2): Application to mitochondrial DNA sequence

Next, let's apply it to real data: the mitochondrial DNA sequence introduced in the original paper [1]. (Strictly speaking, this is not a **time** series...)

The data can be obtained in mat format from here.

# Load the data (part 2): mitochondrial DNA sequence stored in .mat format
import scipy.io as sio

dataset = sio.loadmat('termite_DNA_circular_shift')
X = dataset['t2'].reshape((-1,))

# Plot the raw sequence
plt.figure(figsize=(18.0, 6.0))
plt.plot(np.arange(len(X)), X, color="k")
plt.xlim(0, len(X))
plt.title('Mitochondrial DNA Sequence')

[Figure: ダウンロード (1).png] Let's create the Pan-Matrix Profile. This time it takes a few minutes.

# Compute the Pan-Matrix Profile and run the other analyses (part 2)
profile, figures = mp.analyze(X)

# Show only the PMP figure (part 2)
figures[0]

[Figure: ダウンロード (2).png] The PMP feels needlessly tall. In fact, mp.analyze has a parameter called threshold, and adjusting it changes the upper limit of the subsequence lengths that are searched. So much for the method being parameter-free, although the result does not seem to be as sensitive to this parameter as it is to the subsequence length $m$.
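
To give a feel for how a threshold can bound the range of lengths, here is a rough NumPy sketch of one possible scheme: keep doubling the window while some pair of subsequences still has a Pearson correlation at or above the threshold. I have not verified that the library works exactly this way. The functions best_correlation and upper_window, the starting length of 8, and the exclusion zone are all assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def best_correlation(ts, m):
    """Highest Pearson correlation between two length-m subsequences.

    Hypothetical helper. Pairs whose start positions are within m // 2
    of each other are ignored (an assumed exclusion zone).
    """
    subs = sliding_window_view(np.asarray(ts, dtype=float), m)
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    corr = (subs @ subs.T) / m
    idx = np.arange(len(subs))
    corr[np.abs(idx[:, None] - idx[None, :]) <= m // 2] = -np.inf
    return corr.max()

def upper_window(ts, threshold, start=8):
    """Double the window while the best correlation stays >= threshold."""
    m = start
    while 2 * m <= len(ts) // 2 and best_correlation(ts, 2 * m) >= threshold:
        m *= 2
    return m

# Toy series: a length-128 pattern planted twice in noise
rng = np.random.default_rng(0)
ts = rng.normal(scale=0.3, size=600)
pattern = np.sin(np.linspace(0, 6 * np.pi, 128))
ts[50:178] += pattern
ts[350:478] += pattern

strict = upper_window(ts, threshold=0.98)
loose = upper_window(ts, threshold=0.80)
print(strict, loose)
```

In this sketch a lower threshold can never give a smaller upper limit than a higher one, so loosening the threshold makes the PMP taller and tightening it makes the PMP shorter.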

Implementation example (3): Application to electrooculogram (EOG)

This is also the EOG data used in the original paper [1].

# Load the data (part 3): EOG recording stored in .mat format
dataset = sio.loadmat('eog_multiple_scale_example')
X = dataset['testdata'].reshape((-1,))

# Plot the raw signal
plt.figure(figsize=(18.0, 6.0))
plt.plot(np.arange(len(X)), X, color="k")
plt.xlim(0, len(X))
plt.title('EOG')

[Figure: ダウンロード (3).png] With the default settings, the visualization looks like this.

# Compute the Pan-Matrix Profile and run the other analyses (part 3)
profile, figures = mp.analyze(X)

# Show only the PMP figure (part 3)
figures[0]

[Figure: ダウンロード (4).png] This time the PMP turned out very short vertically. Let's zoom in and pick out motifs by eye. [Figure: 無2題.jpg]

The longest motif looks like this. The motif presented for this data in the original paper should have been longer. So what is going wrong?

First, if you look at the visualization code in the matrixprofile library, every PMP value with a distance of 1 or more is replaced with 1. In other words, all the yellow areas in the PMP image are 1. Is that really appropriate?

Intuitively, the distance between subsequences tends to grow as the subsequence length grows. If so, long motifs cannot be picked out with this approach.
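
This intuition can be made precise. For two z-normalized subsequences of length $m$ whose Pearson correlation is $\rho$, the Euclidean distance is $d = \sqrt{2m(1-\rho)}$, so at a fixed correlation the distance grows like $\sqrt{m}$. The sketch below checks this identity numerically and then evaluates it for a fixed correlation of 0.99. The helper names and the value 0.99 are arbitrary choices of mine.

```python
import numpy as np

def znorm(x):
    """Z-normalize a 1-D array (zero mean, unit standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def znorm_dist(a, b):
    """Z-normalized Euclidean distance between two equal-length arrays."""
    return np.linalg.norm(znorm(a) - znorm(b))

def dist_from_corr(rho, m):
    """Distance implied by Pearson correlation rho at length m."""
    return np.sqrt(2.0 * m * (1.0 - rho))

# Check the identity on random data
rng = np.random.default_rng(42)
a = rng.normal(size=128)
b = a + rng.normal(scale=0.5, size=128)
rho = np.corrcoef(a, b)[0, 1]
print(np.isclose(znorm_dist(a, b), dist_from_corr(rho, len(a))))  # True

# Same correlation, increasing length
for m in (20, 80, 320, 1280):
    print(m, dist_from_corr(0.99, m))
```

Even at a correlation of 0.99 the distance already exceeds 1 at $m = 80$ (about 1.26), so under the clipping described above such a match would be shown at the same saturated value as a poor one.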

Also, as far as I understand the original paper (and I am not sure I do), the top-k motifs are extracted by taking the locations with the smallest PMP values. The motif extraction code in the matrixprofile library does the same. If so, I think shorter motifs end up being overrated.
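
The effect of ranking by the raw PMP value can be illustrated with a toy matrix. The numbers below are made up, and dividing the distance by $\sqrt{m}$ is only one possible length normalization that I am assuming for comparison. It is not something the library or the paper is confirmed to do.

```python
import numpy as np

windows = np.array([20, 80, 320])

# Made-up nearest-neighbor correlations (rows = windows, columns = positions).
# The best match overall is the long one: 0.995 at window 320, position 2.
corr = np.array([
    [0.90, 0.97, 0.92, 0.91],
    [0.91, 0.93, 0.92, 0.90],
    [0.92, 0.91, 0.995, 0.93],
])

# Toy PMP of z-normalized Euclidean distances: d = sqrt(2 m (1 - rho))
pmp = np.sqrt(2.0 * windows[:, None] * (1.0 - corr))

# Ranking by the raw distance
raw_row, raw_col = np.unravel_index(np.argmin(pmp), pmp.shape)

# Ranking by an assumed length-normalized distance d / sqrt(m)
norm = pmp / np.sqrt(windows)[:, None]
norm_row, norm_col = np.unravel_index(np.argmin(norm), norm.shape)

print(int(windows[raw_row]), int(raw_col))    # 20 1
print(int(windows[norm_row]), int(norm_col))  # 320 2
```

With raw distances the short, less similar match wins, while the normalized ranking picks the long match with the higher correlation.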

Summary

So, this time I introduced the **Pan-Matrix Profile**, the latest method for time series data analysis. I am left with some open questions, but I hope to work on resolving them soon.

References
