[PYTHON] I tried clustering ECG data using the K-Shape method

Summary of this article

- Try clustering time series data using the K-Shape method

Dataset

- This time, we will use the UCR Time Series Classification Archive from the University of California, Riverside.

- ECG data is known to make heart disease easier to detect.

- The contents of **ECGFiveDays_TRAIN.tsv**, used below, are as follows.

DataFrame


import pandas as pd

df = pd.read_table("ECGFiveDays_TRAIN.tsv",header=None)
df.head(10)

(Figure: first 10 rows of the DataFrame)

Here is a summary of the data:

Summary


import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset

data_train = np.loadtxt("ECGFiveDays_TRAIN.tsv")
X_train = to_time_series_dataset(data_train[:,1:])
print("Total number of time series data: ",len(data_train))
print("Number of classes: ", len(np.unique(data_train[:,0])))
print("Time series length: ",len(data_train[0,1:]))

---
#Total number of time series data:  23
#Number of classes:  2
#Time series length:  136

A total of 23 time series, each 136 time steps long, are included. There are no missing values.
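A quick way to verify the "no missing values" claim (reusing the data_train array from above):


import numpy as np

# True only if at least one NaN exists anywhere in the array; here it prints False
print(np.isnan(data_train).any())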

Visualize


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
for i in range(0,3):
    if data_train[i,0]==2.0:
        plt.figure(figsize=(18, 8))
        print("Plot",i,"Class",data_train[i,0])
        plt.plot(data_train[i,1:],c='r')   # drop the label in column 0 before plotting
        plt.show()

(Figures: the class 2.0 series among the first three training examples)

These labels were assigned by ECG experts; to the untrained eye, Class 1.0 and Class 2.0 are almost indistinguishable. As a side note, according to the README downloaded with ECGFiveDays_TRAIN.tsv, this electrocardiogram is ECG data from a 67-year-old man, recorded in 1990.
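Since the two classes are hard to tell apart by eye, here is a small sketch (reusing data_train; it simply takes the first series of each class, an arbitrary choice) that overlays one example per class:


import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(18, 8))
for label in np.unique(data_train[:, 0]):
    # First series of this class; column 0 holds the label, so drop it
    series = data_train[data_train[:, 0] == label][0, 1:]
    plt.plot(series, alpha=0.8, label="Class %s" % label)
plt.legend(fontsize=17)
plt.show()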

Preprocessing

Normalize each time series to mean 0 and standard deviation 1 before feeding it to K-Shape. As an example, let's normalize one of the ECG series, plot it, and compare it with the original.

Normalization


from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset
X_train = to_time_series_dataset(data_train[:,1:])
X_train = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_train)
plt.figure(figsize=(18, 8))
plt.plot(X_train[1],c='magenta',alpha=0.8,label="normalized")
plt.plot(data_train[1,1:],c='blue',alpha=0.8,label="original")   # label column dropped
plt.legend(fontsize=17)

(Figure: normalized series in magenta overlaid on the original in blue)

For comparison, a normalized class 2.0 series (in red) looks like the figure below.

(Figure: normalized class 2.0 series)
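Under the hood, this scaling is plain per-series z-normalization. A minimal NumPy sketch of what TimeSeriesScalerMeanVariance(mu=0., std=1.) computes for a single series:


import numpy as np

x = data_train[1, 1:]                  # one series, label column dropped
x_norm = (x - x.mean()) / x.std()      # shift to mean 0, scale to standard deviation 1
print(round(x_norm.mean(), 6), round(x_norm.std(), 6))   # approximately 0.0 and 1.0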

About the K-Shape method

I will briefly summarize with 3 points.

- A shape-based clustering method for time series data
- Basically the same as k-means, but with a different way of measuring distance
- The K-Shape paper (Paparrizos & Gravano, SIGMOD 2015) is relatively new: https://dl.acm.org/doi/10.1145/2723372.2737793
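To make the "different distance" point concrete, here is a minimal NumPy sketch of the shape-based distance (SBD) from the paper, matching the formula in the Supplement below. It is an illustration only; the library's internal implementation is more optimized.


import numpy as np

def sbd(x, y):
    # Cross-correlation CC_w for every possible shift w
    cc = np.correlate(x, y, mode="full")
    # sqrt(R_0(x,x)) is the L2 norm of x, likewise for y
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1.0 - cc.max() / denom

# Distance between the first two z-normalized training series
print(sbd(X_train[0].ravel(), X_train[1].ravel()))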

Now, training

Run with 2 clusters, a maximum of 100 iterations, and 100 random initializations.

Train


from tslearn.clustering import KShape
ks = KShape(n_clusters=2,max_iter=100,n_init=100,verbose=0)
y_pred = ks.fit_predict(X_train)   # fit the model and assign cluster labels in one step
print(y_pred)

---
#[0 1 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1]

Each of the 23 time series has been assigned a predicted label saying which of the two clusters it belongs to. See the following URL for K-Shape's hyperparameters: https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KShape.html

Evaluation

The **adjusted Rand index** is used as the evaluation metric for time series clustering. In a nutshell, this metric measures how well the cluster assignments agree between the predicted and the true clustering.

Here, Scikit-learn's adjusted_rand_score is used. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
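One convenient property: because cluster numbers are arbitrary, the index is invariant to relabeling the clusters. A tiny illustration:


from sklearn.metrics import adjusted_rand_score

# The same partition with the cluster names swapped still scores a perfect 1.0
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0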

Evaluation (for training data)


from sklearn.metrics import adjusted_rand_score
ars = adjusted_rand_score(data_train[:,0],y_pred)
print("Adjusted Rand index: ",ars)
---
#Adjusted Rand index:  0.668041237113402

Next, compute the adjusted Rand index for the test data.

Evaluation (for test data)


import numpy as np
import matplotlib.pyplot as plt
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset

data_test = np.loadtxt("ECGFiveDays_TEST.tsv")
X_test = to_time_series_dataset(data_test[:,1:])
# Normalize the test set the same way as the training set before predicting
X_test = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_test)
pred_test = ks.predict(X_test)
ars = adjusted_rand_score(data_test[:,0],pred_test)
print("Adjusted Rand index: ",ars)
---
#Adjusted Rand index:  0.06420517898316379

I got a very low score.

A score close to 0 is the score you would get from a random assignment.

By that logic, this cannot be called a good model. The test score is probably this low because the amount of training data is overwhelmingly small. So let's increase the amount of data and try the same thing.

Training / evaluation using ECG5000 data

The following uses **ECG5000_TRAIN.tsv** and **ECG5000_TEST.tsv**, also packaged in the UCR Archive.

ECG5000 dataset


import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset
from sklearn.model_selection import train_test_split

data_test_ = np.loadtxt("ECG5000_TEST.tsv")
data_train_ = np.loadtxt("ECG5000_TRAIN.tsv")
data_joined = np.concatenate((data_train_,data_test_),axis=0)
data_train_,data_test_ =train_test_split(data_joined,test_size=0.2,random_state=2000) 

X_train_ = to_time_series_dataset(data_train_[:,1:])
X_test_ = to_time_series_dataset(data_test_[:,1:])

print("Total number of time series data: ",len(data_train_))
print("Number of classes: ", len(np.unique(data_train_[:,0])))
print("Time series length: ",len(data_train_[0,1:]))

---
#Total number of time series data:  4000
#Number of classes:  5
#Time series length:  140

Here is the number of series belonging to each class:

Class breakdown


print("Number of series in class 1.0: ",len(data_train_[data_train_[:,0]==1.0]))
print("Number of series in class 2.0: ",len(data_train_[data_train_[:,0]==2.0]))
print("Number of series in class 3.0: ",len(data_train_[data_train_[:,0]==3.0]))
print("Number of series in class 4.0: ",len(data_train_[data_train_[:,0]==4.0]))
print("Number of series in class 5.0: ",len(data_train_[data_train_[:,0]==5.0]))
---
#Number of series in class 1.0:  2342
#Number of series in class 2.0:  1415
#Number of series in class 3.0:  74
#Number of series in class 4.0:  153
#Number of series in class 5.0:  16
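As an aside, the same counts can be obtained more compactly with np.unique (reusing data_train_ from above):


import numpy as np

# return_counts gives the size of every class in one call; labels are in column 0
labels, counts = np.unique(data_train_[:, 0], return_counts=True)
for label, count in zip(labels, counts):
    print("Number of series in class %s: %d" % (label, count))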

Shown as a graph, it looks like this:

Visualization of each cluster


%matplotlib inline
import matplotlib.pyplot as plt

# Plot the mean waveform of each class
for j in np.unique(data_train_[:,0]):
    plt.figure(figsize=(28,2))
    dataPlot = data_train_[data_train_[:,0]==j]
    cnt = len(dataPlot)
    dataPlot = dataPlot[:,1:].mean(axis=0)   # average over all series in this class
    print(" Class ",j," Count ",cnt)
    plt.plot(dataPlot,c='b')
    plt.show()

(Figure: mean waveform of each of the five classes)

We will train and evaluate as before.

Training/Evaluation (for test data)


from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import KShape
from sklearn.metrics import adjusted_rand_score
X_train_ = to_time_series_dataset(data_train_[:,1:])
X_train_ = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_train_)
X_test_ = to_time_series_dataset(data_test_[:,1:])
X_test_ = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_test_)

ks = KShape(n_clusters=5,max_iter=100,n_init=100,verbose=1,random_state=2000)
ks.fit(X_train_)
preds_test = ks.predict(X_test_)
ars_ = adjusted_rand_score(data_test_[:,0],preds_test)
print("Adjusted Rand index: ",ars_)
---
#Adjusted Rand index:  0.4984050982000773

The adjusted Rand index on the test data came out to about 0.498. Compared with the previous experiment, this is a world of difference. By increasing the training set from 23 to 4,000 series, we obtained a much more accurate model.

Finally, let's look at the test data

Visualization of classification results


plt.figure(figsize=(20,35))
for yi in range(5):
    # One subplot per cluster, all on a single figure
    plt.subplot(5, 1, 1 + yi)
    for xx in X_test_[preds_test == yi]:
        plt.plot(xx.ravel(), alpha=0.1,c='blue')
    cnt = len(X_test_[preds_test == yi])

    plt.plot(ks.cluster_centers_[yi].ravel(), "red")   # cluster centroid
    print(" Class ",yi," Count ",cnt)
    plt.title("Cluster %d" % (yi + 1))

plt.tight_layout()
plt.show()

Here is a visualization of the classification results on the test data (the red line is the centroid).

(Figures: test series in each of the five clusters, with their centroids in red)

Summary

- This time, I tried time series clustering on electrocardiogram data using the shape-based K-Shape method.
- Increasing the amount of data improved the adjusted Rand index dramatically.
- Besides K-Shape, clustering is also possible with k-means and DBSCAN, so I would like to compare them in the future (a starter sketch follows below).
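As a starting point for that comparison, here is a minimal, untested sketch using tslearn's TimeSeriesKMeans on the same normalized ECG5000 arrays (the Euclidean metric is used for speed; metric="dtw" is also available but much slower). No score is reported here.


from tslearn.clustering import TimeSeriesKMeans
from sklearn.metrics import adjusted_rand_score

# k-means on the same z-normalized training set used for K-Shape above
km = TimeSeriesKMeans(n_clusters=5, metric="euclidean", max_iter=100, random_state=2000)
km.fit(X_train_)
preds_km = km.predict(X_test_)
print("Adjusted Rand index (k-means): ", adjusted_rand_score(data_test_[:, 0], preds_km))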

Supplement

K-Shape's distance measure

$$
SBD(\vec{x},\vec{y}) = 1 - \max_w \frac{CC_w(\vec{x},\vec{y})}{\sqrt{R_0(\vec{x},\vec{x})}\sqrt{R_0(\vec{y},\vec{y})}}
$$

- The cross-correlation $CC_w$, a similarity measure well suited to time series data, is introduced
- $SBD$ is the distance between $\vec{x}$ and $\vec{y}$

What is tslearn?

tslearn is a Python package for time series analysis using machine learning. In addition to K-Shape, various time series analysis algorithms are included.

References

- https://www.slideshare.net/kenyanonaka/k-shapes-zemiyomi
- https://dl.acm.org/doi/10.1145/2723372.2737793
