Suppose we have the following time series. Its **Pan-Matrix Profile** looks like this.
The heatmap at the bottom is the **Pan-Matrix Profile**; the horizontal axis is time and the vertical axis is the length of the subsequence (partial time series). The peaks marked with ★ show at a glance where the recurring patterns in the time series occur (horizontal axis) and how long they are (vertical axis).
In the previous post I introduced the **Matrix Profile** as an innovative technique for time series analysis. The **Pan-Matrix Profile (PMP)** is, simply put, the Matrix Profile computed for every subsequence length $m$ in a given range and stacked into a matrix.
Let's take a look at the example image above.
Here is what each element of the **Pan-Matrix Profile** matrix means. Take the element (504, 80) as an example. As mentioned earlier, the horizontal axis of the matrix is time and the vertical axis is the subsequence length. First, take the subsequence that runs from point 504 to point 504 + 80 of the time series (the blue subsequence on the left in the upper figure) and compute the distance to the subsequence of the same length that is most similar to it (the blue one on the right in the upper figure). The distance measure is the **Z-normalized Euclidean distance**. That distance is 0.95 here, so the element (504, 80) holds 0.95.
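To make the definition concrete, here is a minimal sketch of how a single PMP element could be computed by exhaustive search. It is written in plain Python so that it runs without any library. The function names and the exclusion zone of `m // 2` (used to skip windows that overlap the query so much that they match trivially) are my own assumptions for illustration, not the matrixprofile library's implementation.

```python
import math
from statistics import fmean, pstdev


def znorm(seq):
    """Rescale a sequence to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(seq), pstdev(seq)
    if sigma == 0:
        raise ValueError("a constant subsequence cannot be z-normalized")
    return [(v - mu) / sigma for v in seq]


def znorm_dist(a, b):
    """Z-normalized Euclidean distance between two equal-length sequences."""
    return math.dist(znorm(a), znorm(b))


def pmp_element(ts, i, m):
    """Smallest distance from ts[i:i+m] to any other length-m window of ts."""
    query = ts[i:i + m]
    exclusion = m // 2  # assumption: windows starting this close to i are trivial matches
    best = math.inf
    for j in range(len(ts) - m + 1):
        if abs(i - j) > exclusion:
            best = min(best, znorm_dist(query, ts[j:j + m]))
    return best


# Toy series: a sine wave with period 20, so every window has an exact repeat
ts = [math.sin(2 * math.pi * k / 20) for k in range(100)]
print(pmp_element(ts, 0, 20))  # very close to 0
```

Because both subsequences are z-normalized first, the distance ignores offset and scale: `[1, 2, 3, 4]` and `[10, 20, 30, 40]` are at distance 0 from each other.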
Incidentally, building this matrix by brute force takes on the order of $O(n^4)$ time (where $n$ is the length of the time series), but the speed-up techniques in reference [1] reduce this to $O(n^2 r)$ (where $r$ is the number of candidate subsequence lengths).
There are so many Matrix Profile libraries that it is easy to get lost, but this time I will use the Python library called matrixprofile.
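One way to see where a cost like $O(n^4)$ comes from is to write the brute-force version out: for each candidate length there are $O(n^2)$ pairs of windows, each distance costs $O(m)$, and the number of lengths can itself grow with $n$. The sketch below is only that naive loop, in plain Python; the function name, the dictionary layout, and the `m // 2` exclusion zone are assumptions of mine, and it has nothing to do with the speed-ups in [1].

```python
import math
from statistics import fmean, pstdev


def znorm(seq):
    """Zero mean, unit standard deviation (assumes the window is not constant)."""
    mu, sigma = fmean(seq), pstdev(seq)
    return [(v - mu) / sigma for v in seq]


def brute_force_pmp(ts, lengths):
    """Return {m: [nearest-neighbor distance for every start index]}."""
    n = len(ts)
    pmp = {}
    for m in lengths:                          # r candidate lengths
        windows = [znorm(ts[i:i + m]) for i in range(n - m + 1)]
        exclusion = m // 2                     # assumed trivial-match zone
        row = []
        for i, w in enumerate(windows):        # O(n) query windows
            best = math.inf
            for j, v in enumerate(windows):    # O(n) candidates, O(m) per distance
                if abs(i - j) > exclusion:
                    best = min(best, math.dist(w, v))
            row.append(best)
        pmp[m] = row
    return pmp


# Toy series: deterministic pseudo-noise with the same pattern planted twice
ts = [math.sin(0.7 * k * k) for k in range(60)]
pattern = [0.0, 3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
ts[5:13] = pattern
ts[40:48] = pattern

pmp = brute_force_pmp(ts, lengths=[6, 8])
print(pmp[8][5])  # 0.0: the planted pattern has an exact copy at index 40
```

Even on this 60-point toy series the triple loop is noticeable; on series with thousands of points it quickly becomes impractical, which is why the faster algorithm matters.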
pip install matrixprofile
First, let's try it on a synthetic time series that ships with the library.
from matplotlib import pyplot as plt
import numpy as np
import matrixprofile as mp
#Data reading
dataset = mp.datasets.load('motifs-discords-small')
X = dataset['data']
#Data visualization
plt.figure(figsize=(18.0, 6.0))
plt.plot(np.arange(len(X)), X, color="k")
plt.xlim(0, len(X))
plt.title('Synthetic Time Series')
The code to create and visualize the Pan-Matrix Profile looks like this:
#Pan-Matrix Profile and other analyses
profile, figures = mp.analyze(X)
#PMP display
figures[0]
You can easily create a PMP with `mp.analyze(X)`. In addition to the PMP matrix, `profile` contains the matrix of nearest-neighbor subsequence indices (PMPI) obtained for each subsequence while building the PMP, along with analysis results such as motifs and discords. `figures` holds the plots that visualize the analysis. They are all displayed when `mp.analyze(X)` is executed, but you can display just the PMP part with `figures[0]`.
Next, let's apply it to real data: the mitochondrial DNA sequence featured in the original paper [1]. (Strictly speaking, this is not a **time** series...)
The data can be obtained in mat format from here.
#Data reading part 2
import scipy.io as sio
dataset = sio.loadmat('termite_DNA_circular_shift')
X = dataset['t2'].reshape((-1,))
#Data visualization
plt.figure(figsize=(18.0, 6.0))
plt.plot(np.arange(len(X)), X, color="k")
plt.xlim(0, len(X))
plt.title('Mitochondrial DNA Sequence')
Let's create a Pan-Matrix Profile. This time it will take a few minutes.
#Pan-Matrix Profile and other analysis part 2
profile, figures = mp.analyze(X)
#PMP display part 2
figures[0]
The PMP feels needlessly tall. In fact, `mp.analyze` has a `threshold` parameter, and adjusting it changes the upper limit of the range of subsequence lengths that is searched. That makes me wonder what happened to the "parameter-free" claim, but it does not seem to be as sensitive as the subsequence length $m$ itself.
Next is the EOG data, which was also used in the original paper [1].
#Data reading 3
dataset = sio.loadmat('eog_multiple_scale_example')
X = dataset['testdata'].reshape((-1,))
#Data visualization
plt.figure(figsize=(18.0, 6.0))
plt.plot(np.arange(len(X)), X, color="k")
plt.xlim(0, len(X))
plt.title('EOG')
If you visualize it with the default settings, it will be as follows.
#Pan-Matrix Profile and other analysis # 3
profile, figures = mp.analyze(X)
#PMP display part 3
figures[0]
This time the PMP came out very short vertically. Let's zoom in and pull out some motifs.
Well, the longest motif looks like this. The motif shown for this data in the original paper should have been longer. So what is going wrong?
First, if you look at the visualization code in the `matrixprofile` library, every PMP value of 1 or more is clipped to 1. In other words, all the yellow areas in the PMP image are 1. Is it really okay to do that?
Intuitively, the distance between subsequences tends to grow as the subsequence length grows. If so, long motifs cannot be picked out this way.
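This intuition can be backed up with a known identity: for two z-normalized subsequences of length $m$ with Pearson correlation $\rho$, the Z-normalized Euclidean distance is $d = \sqrt{2m(1-\rho)}$, so it ranges from 0 up to $2\sqrt{m}$. The sketch below checks the identity numerically and asks how strongly two subsequences must be correlated to stay below a distance of 1. The helper names are my own, and I am assuming the PMP holds raw Z-normalized Euclidean distances; I have not verified what the library stores.

```python
import math
from statistics import fmean, pstdev


def znorm(seq):
    """Zero mean, unit (population) standard deviation."""
    mu, sigma = fmean(seq), pstdev(seq)
    return [(v - mu) / sigma for v in seq]


def pearson(a, b):
    """Pearson correlation as the mean product of z-normalized values."""
    return fmean(x * y for x, y in zip(znorm(a), znorm(b)))


def znorm_dist(a, b):
    """Z-normalized Euclidean distance between two equal-length sequences."""
    return math.dist(znorm(a), znorm(b))


def min_correlation_for(dist, m):
    """Correlation that two length-m subsequences need to be exactly `dist` apart."""
    return 1.0 - dist * dist / (2.0 * m)


# Check d = sqrt(2 m (1 - rho)) on two arbitrary sequences of length 80
a = [math.sin(0.3 * k) for k in range(80)]
b = [math.sin(0.3 * k + 0.5) + 0.01 * k for k in range(80)]
print(znorm_dist(a, b), math.sqrt(2 * len(a) * (1 - pearson(a, b))))

# Largest possible distance and the correlation needed to stay below 1
for m in (20, 80, 320):
    print(m, 2 * math.sqrt(m), min_correlation_for(1.0, m))
```

For $m = 80$ the required correlation is $1 - 1/160 = 0.99375$, and it moves closer to 1 as $m$ grows. Under the assumption above, a longer pair has to match almost perfectly before its distance falls under a fixed cutoff of 1.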
Also, according to the original paper, the way to extract the top-k motifs (I am not sure I have understood it correctly) seems to be to take the entries with the smallest PMP values. The motif extraction code in the `matrixprofile` library does the same. In that case, I think shorter motifs end up being favored.
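The toy example below illustrates that suspicion. It is not the library's motif extraction code: `top_k_smallest` is a hypothetical helper that simply takes the smallest entries of a PMP-like table and ignores exclusion zones, and the table is built from made-up correlations using $d = \sqrt{2m(1-\rho)}$. Dividing by $\sqrt{m}$ is shown only as one possible way to compare lengths on equal footing; I am not claiming that [1] or the library does this.

```python
import heapq
import math


def top_k_smallest(pmp, k):
    """Return the k (distance, length, start) triples with the smallest distance.

    `pmp` maps a subsequence length m to its row of nearest-neighbor distances.
    """
    entries = ((d, m, i) for m, row in pmp.items() for i, d in enumerate(row))
    return heapq.nsmallest(k, entries)


def length_normalized(pmp):
    """Divide every distance by sqrt(m) (a hypothetical remedy, see above)."""
    return {m: [d / math.sqrt(m) for d in row] for m, row in pmp.items()}


def dist_from_corr(rho, m):
    """Z-normalized Euclidean distance implied by correlation rho at length m."""
    return math.sqrt(2.0 * m * (1.0 - rho))


# Toy PMP: a short, weaker motif (rho = 0.90, m = 16) and a long,
# stronger one (rho = 0.97, m = 64), plus one poor match per length
toy = {
    16: [dist_from_corr(0.90, 16), dist_from_corr(0.50, 16)],
    64: [dist_from_corr(0.97, 64), dist_from_corr(0.50, 64)],
}

print(top_k_smallest(toy, 1))                     # the m = 16 entry wins
print(top_k_smallest(length_normalized(toy), 1))  # the m = 64 entry wins
```

With raw distances the short motif ranks first even though its correlation is lower; after dividing by $\sqrt{m}$ the ranking follows the correlation instead.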
So, in this post I introduced the **Pan-Matrix Profile**, a recent method for time series analysis. Some questions remain open at the end, but I hope to dig into them soon.