[PYTHON] How to compare time series data-Derivative DTW, DTW-

wrap up

DTW

** How to find DTW **

  1. Create a (cost) matrix that calculates the distance between each point in two time series (absolute value this time, depending on the dimension of the data)

    \begin{align}
    time\ series\ T&:t_0,t_1,...,t_N,\\
    time\ series\ S&:s_0,s_1,...,s_M,\\
    \boldsymbol{W}&:=(\delta(i,j)) ,\\
    where\ \delta(t_i,s_j)&:=|t_i-s_j|,i \in \{0,1,...,N\},j\in\{0,1,...,M\} 
    \end{align}
    
  2. Consider a path that passes from (0,0) to (N, M) on this matrix and satisfies the monotonic increase.

    \begin{align}
    path\ \tau :\tau(0)=(0,0),\tau(1),...,\tau(p)=(p_i,p_j),...,\tau(K),\\
    where\ p_i \in \{0,1,...,N\},p_j\in\{0,1,...,M\}.
    \end{align}
    

In particular, consider something that satisfies the monotonic increase in this $ path \ \ tau $.

```math
\tau(p) \leq \tau(q) \Longleftrightarrow^{def} p_i\leq q_i\ or\ p_j\leq q_j.\\
if\ p\leq q,then\ \tau(p) \leq \tau(q).
```
  1. Let the DTW cost be the smallest value in the sum of the elements of the matrix $ \ boldsymbol {W} $ that this $ path $ passes through.

    \begin{align}
    cost(\tau)&:=\sum_{p}\delta(p_i,p_j)\\
    Dist(T,S)&:=min_{\tau}cost(\tau)
    \end{align}
    

The figure below is from Derivative Dynamic Time Warping by Eamonn image.png

It's different from the official formula, but it looks like this. Test code: Reference 3

#Preparation
pip install fastdtw # <=pip really just needs this one
pip install numpy
pip install scipy

##code##
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw
A = np.sin(np.array(range(T)) / 10)
B = np.sin((np.array(range(T)) / 10 + t * np.pi))
C = np.zeros((T))
distance, path = fastdtw(A, B, dist=euclidean)
print(distance)
plt.plot(A)
plt.plot(B)
for i, j in path:
   plt.plot([i, j], [A[i], B[j]],color='gray', linestyle='dotted')
plt.legend(["sin(θ)", "sin(θ+150*pi)"], fontsize=10, loc=2)
plt.show()

##result##
sin(θ)-sin(θ+150*pi): 0.6639470476737607
sin(θ)-constant: 0.5150026855435942
DTW(sin(θ), sin(θ+150*pi)): 16.720461687388624
DTW(sin(θ), constant): 97.26964355198544

Figure_1.png Figure_2.png Figure_3.png

** Benefits of DTW **

Similarity can be obtained even if the lengths and cycles of time series are different.

** Disadvantages of DTW **

Intuitive alignment is performed for time series data that has locally accelerated and decelerated parts on the time axis: Reference: Derivative Dynamic Time Warping by Eamonn, etc. Clipboard01.jpg For example, when one of them becomes s high, DTW ignores the data shape and aligns, so the data point of s high is associated with the data point before s high.

DDTW

In response to DTW, the degree of similarity is measured by adding information that captures the shape. Actually, after processing the time series data $ time \ series \ T: t_0, t_1, ..., t_N $ as follows, DTW:

T^*:t^*_0,t^*_1,...,t^*_N,\\
where\ t^*_i :=  \frac{t_i-t_{i-1}+\frac{t_{i+1}-t_{i-1}}{2}}{2}=  \frac{t_i-t_{i-1}+\frac{(t_{i+1}-t_i)+(t_i-t_{i-1})}{2}}{2}

This process is the average of the left derivative $ t_i-t_ {i-1} $ and the right derivative $ t_ {i + 1} -t_i $ and the average of the left derivative with respect to the point of interest $ t_i $. .. DDTW is the DTW for this processing.

** Benefits of DDTW **

Similarity can be calculated considering shapes such as uptrends and downtrends.

** Disadvantages of DDTW **

Is the average slope good?

code
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import ThreadPoolExecutor
import time
import numpy as np
import pylab as plt
import seaborn as sns
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean


def get_test_curves(view=False):
    T = 150
    t = .4

    A = np.sin(np.array(range(T)) / 10)
    B = np.sin((np.array(range(T)) / 10 + t * np.pi))
    C = np.zeros((T))
    if view:
        plt.plot(A)
        plt.plot(B)
        plt.plot(C)
        plt.legend(['sin(θ)', 'sin(θ+150*pi)', 'constant'], fontsize=10, loc=2)
        plt.show()

    return {'name': 'sin(θ)', 'data': A}, {'name': 'sin(θ+150*pi)', 'data': B}, {'name': 'constant', 'data': C}


def mse(a, b):
    return ((a-b)**2).mean()


def get_DWT_results(T, S, skip=1, view=False):
    T_data, S_data = T['data'], S['data']
    T_name, S_name = T['name'], S['name']
    distance, path = fastdtw(T_data, S_data, dist=euclidean)
    print("DTW({}, {}):".format(T_name, S_name), distance)
    if view:
        plt.plot(T_data)
        plt.plot(S_data)
        k = -1
        for i, j in path:
            k += 1
            if k % skip == 0:
                plt.plot([i, j], [T_data[i], S_data[j]],
                         color='gray', linestyle='dotted')
        plt.legend([T_name, S_name], fontsize=10, loc=2)
        plt.title('DWT plot result')
        plt.show()

    return distance, path


def get_derivative(T):
    diff = np.diff(T)
    next_diff = np.append(np.delete(diff, 0), 0)
    avg = (next_diff + diff) / 2
    avg += diff
    avg /= 2
    return np.delete(avg, -1)


def get_DDWT_results(T, S, skip=1, view=False):
    T_data, S_data = T['data'], S['data']
    dT_data = get_derivative(T_data)
    dS_data = get_derivative(S_data)
    T_name, S_name = T['name'], S['name']
    distance, path = fastdtw(dT_data, dS_data, dist=euclidean)
    print("DDTW({}, {}):".format(T_name, S_name), distance)
    if view:
        plt.plot(T_data)
        plt.plot(S_data)
        k = -1
        for i, j in path:
            k += 1
            if k % skip == 0:
                plt.plot([i+1, j+1], [T_data[i+1], S_data[j+1]],
                         color='gray', linestyle='dotted')
        plt.legend([T_name, S_name], fontsize=10, loc=2)
        plt.title('DDWT plot result')
        plt.show()

    return distance, path


def get_test_curves_DDTWvsDWT(view=False):
    T = 150
    t = .4
    A = np.zeros((T))
    B = np.zeros((T))
    # A = np.sin(np.array(range(T)) / 10)
    # B = np.sin(np.array(range(T)) / 10+2)+50
    s_i = 50
    e_i = 60
    for i in range(s_i, e_i, 1):
        A[i] = np.sin(np.pi*(i-s_i)/(e_i-s_i))
    #     B[i] = -2.2
    if view:
        plt.plot(A)
        plt.plot(B)
        plt.legend(['sin(θ)', 'sin(θ+150*pi)'], fontsize=10, loc=2)
        plt.show()

    return {'name': 'down', 'data': A}, {'name': 'up', 'data': B}


def main():
    print("=== main ===")
    # A, B, C = get_test_curves()
    A, B = get_test_curves_DDTWvsDWT()

    # A["data"] = np.array([2, 0, 1, 1, 2, 4, 2, 1, 2, 0])
    # B["data"] = np.array([1, 1, 2, 4, 2, 1, 2, 0])

    print("{}-{}:".format(A['name'], B['name']), mse(A['data'], B['data']))
    # print("{}-{}:".format(A['name'], C['name']), mse(A['data'], C['data']))
    #Calculate DTW
    get_DWT_results(A, B, skip=5)
    get_DDWT_results(A, B, skip=5)


if __name__ == "__main__":
    main()


** Applied article **

Clustering of stock price fluctuations of 2,940 Japanese companies in DTW (Dynamic Time Warping) Dynamic Time Warping by Kimiaki Shirahama Compare the difference in results using DTW and DDTW. The data used is "Thomson Reuters Data Stream". We conclude by qualitative evaluation (visually) that DTW + DDTW is the best similarity measure for classifying and analyzing stock price data.

Reference 3 is quite helpful with both code and experiments. Applies to temperature data.

Also, there is something called k-shape. I haven't read this, but I use distance as a cross-correlation and cluster using k-means.

** Impression **

I actually clustered the time series data using DTW / DDTW, but it didn't work. The cause is that I don't understand much. If you have any suggestions, please do not hesitate to contact us.

Recommended Posts

How to compare time series data-Derivative DTW, DTW-
How to handle time series data (implementation)
How to read time series data in PyTorch
How to generate exponential pulse time series data in python
[Python] How to use Pandas Series
How to extract features of time series data with PySpark Basics
<Pandas> How to handle time series data in a pivot table
How to set the server time to Japanese time
matplotlib Write text to time series graph
How to use MkDocs for the first time
How to avoid writing% matplotlib inline every time
How to achieve time wait processing with wxpython
How to use Python Image Library in python3 series
Time Series Decomposition
How to measure execution time with Python Part 1
[wxpython] How to use wx.lib.plot basic & time axis
[Python] How to compare datetime with timezone added
"Then, how does it compare to other methods?"
How to measure execution time with Python Part 2
How to measure processing time in Python or Java
How to measure mp3 file playback time with python
I tried to implement time series prediction with GBDT
Compare how to write processing for lists by language
Python: Time Series Analysis
How to use Python-shell
How to use tf.data
How to use virtualenv
How to use Seaboan
How to use shogun
How to use Pandas 2
How to read PyPI
How to install pip
How to use Virtualenv
How to use numpy.vectorize
How to update easy_install
How to install archlinux
How to use pytest_report_header
How to restart gunicorn
How to install python
How to virtual host
How to debug selenium
How to use partial
How to use Bio.Phylo
How to read JSON
How to use SymPy
How to use x-means
Python time series question
How to use WikiExtractor.py
How to update Spyder
How to use IPython
RNN_LSTM1 Time series analysis
How to install BayesOpt
Time series analysis 1 Basics
How to use virtualenv
How to use Matplotlib
How to calculate the sum or average of time series csv data in an instant
How to use iptables
How to use numpy
How to use TokyoTechFes2015
How to use venv
How to use dictionary {}