Summary of this article

- Try clustering time series data using the K-Shape method

Dataset

- This time, we use the UCR Time Series Classification Archive from the University of California, Riverside.

- ECG data is known to be useful for detecting heart disease.

- The contents of **ECGFiveDays_TRAIN.tsv**, the file used here, are as follows.

`DataFrame`


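import pandas as pd

# The file has no header row; column 0 is the class label, the rest is the series
df = pd.read_table("ECGFiveDays_TRAIN.tsv",header=None)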
df.head(10)

Data summary:

`Summary`


import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset

data_train = np.loadtxt("ECGFiveDays_TRAIN.tsv")
X_train = to_time_series_dataset(data_train[:,1:])
print("Total number of time series data: ",len(data_train))
print("Number of classes: ", len(np.unique(data_train[:,0])))
print("Time series length: ",len(data_train[0,1:]))

----------------------
#Total number of time series data:  23
#Number of classes:  2
#Time series length:  136

The training set contains 23 time series, each with 136 time steps. There are no missing values.

`Visualize`


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
for i in range(0,3):
    if data_train[i,0]==2.0:
        plt.figure(figsize=(18, 8))
        print("Plot",i,"Class",data_train[i,0])
        plt.plot(data_train[i,1:],c='r')  # column 0 is the class label, so plot from column 1
        plt.show()

The labels were assigned by ECG experts, and Class 1.0 and Class 2.0 are almost indistinguishable to the untrained eye. As a side note, according to the Readme.md downloaded with ECGFiveDays_TRAIN.tsv, these electrocardiograms were recorded from a 67-year-old man in 1990.

Preprocessing

Before feeding the data to K-Shape, normalize each time series so that it has a mean of 0 and a standard deviation of 1. Here, we normalize one electrocardiogram, plot it, and compare it with the original.

`Normalization`


from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset
X_train = to_time_series_dataset(data_train[:,1:])
X_train = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_train)
plt.figure(figsize=(18, 8))
plt.plot(X_train[1],c='magenta',alpha=0.8,label="normalized")
plt.plot(data_train[1,1:],c='blue',alpha=0.8,label="original")  # skip the label in column 0
plt.legend(fontsize=17)

The resulting plot overlays the normalized series (magenta) on the original series (blue).
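To make the normalization step concrete, here is a minimal pure-Python sketch of per-series z-normalization. The helper name `z_normalize` and the sample values are hypothetical, and the use of the population standard deviation is an assumption about what the scaler does, not a statement of tslearn's implementation.

```python
from statistics import fmean, pstdev

def z_normalize(series):
    """Rescale one series to mean 0 and standard deviation 1."""
    mu = fmean(series)
    sigma = pstdev(series)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant series has no variation to rescale; return zeros.
        return [0.0 for _ in series]
    return [(v - mu) / sigma for v in series]

# Hypothetical toy series with mean 5 and standard deviation 2
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = z_normalize(raw)
print(scaled)
print(fmean(scaled), pstdev(scaled))
```

Only the offset and scale of the series change; the shape of the curve is preserved, which is what a shape-based method such as K-Shape compares.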

About K-Shape method

Three points summarize the method.

- It is a shape-based clustering method for time series data.
- It is basically the same as k-means, but it measures distance differently.
- The K-Shape paper was published in 2015, so the algorithm itself is relatively new ( https://dl.acm.org/doi/10.1145/2723372.2737793 ).

Now, training

Run with 2 clusters, a maximum of 100 iterations, and 100 initializations (`n_init=100`).

`Train`


from tslearn.clustering import KShape
ks = KShape(n_clusters=2,max_iter=100,n_init=100,verbose=0)
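# fit_predict fits the model and returns the cluster label of each series,
# so a separate ks.fit call is not needed
y_pred = ks.fit_predict(X_train)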
print(y_pred)

---
#[0 1 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1]

Each of the 23 time series now has a predicted cluster label (0 or 1). The cluster numbers are arbitrary and need not correspond to the original class values 1.0 and 2.0. For the hyperparameters of K-Shape, see the following URL: https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KShape.html

Evaluation

The **adjusted Rand index** (ARI) is used here as the evaluation measure for time series clustering. In a nutshell, this metric measures "how well the cluster assignments match between the predicted and the true clustering."

Here, Scikit-learn's adjusted_rand_score is used. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
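As an illustration of what this score computes, the following is a minimal pure-Python sketch of the adjusted Rand index built from pair counts. The function name `adjusted_rand_index` and the toy labels are hypothetical; this is not scikit-learn's code, only a sketch of the standard formula.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement
    # Count pairs of items that share a cell, a true class, or a predicted cluster
    cells = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (index - expected) / (max_index - expected)

# Same grouping with different label names: perfect score
print(adjusted_rand_index([1, 1, 1, 2, 2, 2], [0, 0, 0, 1, 1, 1]))
# Grouping that splits every true class in half: below chance level
print(adjusted_rand_index([1, 1, 2, 2], [0, 1, 0, 1]))
```

Because only pair memberships are compared, the score does not depend on which number a cluster happens to receive, so predicted labels 0/1 can be scored directly against classes 1.0/2.0.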

`Evaluation(For training data)`


from sklearn.metrics import adjusted_rand_score
ars = adjusted_rand_score(data_train[:,0],y_pred)
print("Adjusted Rand index: ",ars)
---
#Adjusted Rand index:  0.668041237113402

Next, compute the adjusted Rand index for the test data.

`Evaluation(For test data)`


import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset

data_test = np.loadtxt("ECGFiveDays_TEST.tsv")
X_test = to_time_series_dataset(data_test[:,1:])
pred_test = ks.predict(X_test)
ars = adjusted_rand_score(data_test[:,0],pred_test)
print("Adjusted Rand index: ",ars)
---
#Adjusted Rand index:  0.06420517898316379

The score is very low.

A score close to 0 is what random assignment would produce, so this cannot be called a good model. The most likely reason for the low score on the test data is that the training set is extremely small. So let's increase the amount of data and try the same procedure.

Training / evaluation using ECG5000 data

The following uses **ECG5000_TRAIN.tsv** and **ECG5000_TEST.tsv**, which are packaged in the UCR Archive.

`ECG5000 dataset`


import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset
from sklearn.model_selection import train_test_split

data_test_ = np.loadtxt("ECG5000_TEST.tsv")
data_train_ = np.loadtxt("ECG5000_TRAIN.tsv")
data_joined = np.concatenate((data_train_,data_test_),axis=0)
data_train_,data_test_ =train_test_split(data_joined,test_size=0.2,random_state=2000) 

X_train_ = to_time_series_dataset(data_train_[:,1:])
X_test_ = to_time_series_dataset(data_test_[:,1:])

print("Total number of time series data: ",len(data_train_))
print("Number of classes: ", len(np.unique(data_train_[:,0])))
print("Time series length: ",len(data_train_[0,1:]))

---
#Total number of time series data:  4000
#Number of classes:  5
#Time series length:  140

Number of series belonging to each class:

`Cluster details`


print("Number of data belonging to class 1.0",len(data_train_[data_train_[:,0]==1.0]))
print("Number of data belonging to class 2.0",len(data_train_[data_train_[:,0]==2.0]))
print("Number of data belonging to class 3.0",len(data_train_[data_train_[:,0]==3.0]))
print("Number of data belonging to class 4.0",len(data_train_[data_train_[:,0]==4.0]))
print("Number of data belonging to class 5.0",len(data_train_[data_train_[:,0]==5.0]))
---
#Number of data belonging to class 1.0 2342
#Number of data belonging to class 2.0 1415
#Number of data belonging to class 3.0 74
#Number of data belonging to class 4.0 153
#Number of data belonging to class 5.0 16

Plotting the mean series of each class looks like this.

`Visualization of each cluster`


%matplotlib inline
import matplotlib.pyplot as plt

for j in np.unique(data_train_[:,0]):
    plt.figure(figsize=(28,2))
    dataPlot = data_train_[data_train_[:,0]==j]
    cnt = len(dataPlot) 
    dataPlot = dataPlot[:,1:].mean(axis=0)
    print(" Class ",j," Count ",cnt)
    plt.plot(dataPlot,c='b')
    plt.show()

We train and evaluate as before.

`Training/Evaluation(For test data)`


from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import KShape
from sklearn.metrics import adjusted_rand_score
X_train_ = to_time_series_dataset(data_train_[:,1:])
X_train_ = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_train_)
X_test_ = to_time_series_dataset(data_test_[:,1:])
X_test_ = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_test_)

ks = KShape(n_clusters=5,max_iter=100,n_init=100,verbose=1,random_state=2000)
ks.fit(X_train_)
preds_test = ks.predict(X_test_)
ars_ = adjusted_rand_score(data_test_[:,0],preds_test)
print("Adjusted Rand index: ",ars_)
---
#Adjusted Rand index:  0.4984050982000773

The adjusted Rand index on the test data was about 0.498. Compared with the previous dataset, this is a world of difference. Increasing the size of the training set from 23 to 4000 series produced a much more accurate model.

Finally, let's look at the test data

`Visualization of classification results`


# Create a single figure first so that all five subplots share it
plt.figure(figsize=(20,35))
for yi in range(5):
    plt.subplot(5, 1, 1 + yi)
    for xx in X_test_[preds_test == yi]:
        plt.plot(xx.ravel(), alpha=0.1,c='blue')
    cnt = len(X_test_[preds_test == yi]) 
    
    plt.plot(ks.cluster_centers_[yi].ravel(), "red")
    print(" Class ",yi," Count ",cnt)
    plt.title("Cluster %d" % (yi + 1) )

plt.tight_layout()
plt.show()

This visualizes the clustering results for the test data (the red line is the centroid of each cluster).

Summary

- This time, I tried time series clustering using the shape-based K-Shape method and electrocardiogram data.
- Increasing the amount of data improved the adjusted Rand index dramatically.
- Besides K-Shape, clustering is also possible with k-means and DBSCAN, so I would like to compare them in the future.

Supplement

K-Shape distance measurement

$$
SBD(\vec{x},\vec{y}) = 1 - \max_{w}\frac{CC_w(\vec{x},\vec{y})}{\sqrt{R_0(\vec{x},\vec{x})}\sqrt{R_0(\vec{y},\vec{y})}}
$$

- $CC_w$ is the cross-correlation, a similarity measure that fits time series data well.
- $SBD$ is the distance between $\vec{x}$ and $\vec{y}$.
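The distance above can be sketched in pure Python as follows. This is a minimal illustration that assumes both series have the same length and evaluates every shift $w$ directly; the function names and the toy series are hypothetical, and this is not how tslearn implements it.

```python
from math import sqrt

def cross_correlation(x, y, shift):
    """R_shift(x, y): dot product of x and y after sliding one by `shift`."""
    if shift >= 0:
        return sum(x[i + shift] * y[i] for i in range(len(x) - shift))
    # Negative shifts are defined by swapping the roles of x and y
    return cross_correlation(y, x, -shift)

def sbd(x, y):
    """Shape-based distance: 1 minus the best normalized cross-correlation."""
    m = len(x)  # assumes len(x) == len(y)
    norm = sqrt(cross_correlation(x, x, 0) * cross_correlation(y, y, 0))
    if norm == 0:
        raise ValueError("SBD is undefined for an all-zero series")
    best = max(cross_correlation(x, y, w) for w in range(-(m - 1), m))
    return 1 - best / norm

# Hypothetical toy data: the same bump, shifted by two time steps
x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 0, 1, 2, 1]
print(sbd(x, y))
```

Because the maximum is taken over all shifts, two series with the same shape at different positions get a distance of 0, whereas a pointwise Euclidean distance would treat them as different.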

What is tslearn?

tslearn is a Python package for time series analysis using machine learning. In addition to K-Shape, various time series analysis algorithms are included.

References

https://www.slideshare.net/kenyanonaka/k-shapes-zemiyomi
https://dl.acm.org/doi/10.1145/2723372.2737793

[PYTHON] I tried clustering ECG data using the K-Shape method
