- Try clustering time series data using the K-Shape method
- This time, we will use the UCR Time Series Classification Archive from the University of California, Riverside.
- ECG is known to facilitate the detection of heart disease
- The contents of **ECGFiveDays_TRAIN.tsv**, the file used here, are as follows.
DataFrame
import pandas as pd
df = pd.read_table("ECGFiveDays_TRAIN.tsv",header=None)
df.head(10)
The data can be summarized as follows.
Summary
import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset
data_train = np.loadtxt("ECGFiveDays_TRAIN.tsv")
X_train = to_time_series_dataset(data_train[:,1:])
print("Total number of time series data: ",len(data_train))
print("Number of classes: ", len(np.unique(data_train[:,0])))
print("Time series length: ",len(data_train[0,1:]))
----------------------
#Total number of time series data: 23
#Number of classes: 2
#Time series length: 136
The training set contains 23 time series, each 136 time steps long. There are no missing values.
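Visualize
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Plot the first three series. Class 2.0 is drawn in red; blue for the other class is a choice made here.
for i in range(0,3):
    color = 'r' if data_train[i,0]==2.0 else 'b'
    plt.figure(figsize=(18, 8))
    print("Plot",i,"Class",data_train[i,0])
    # Column 0 holds the class label, so only columns 1 onward are plotted.
    plt.plot(data_train[i,1:],c=color)
    plt.show()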
The labels were assigned by ECG experts, and to the untrained eye Class 1.0 and Class 2.0 are almost indistinguishable. As a side note, according to the Readme.md downloaded with ECGFiveDays_TRAIN.tsv, this electrocardiogram is ECG data from a 67-year-old man, measured in 1990.
Before feeding the data to K-Shape, normalize each time series so that it has a mean of 0 and a standard deviation of 1. Here, let's normalize the electrocardiogram of Plot 0 (Class 1.0), plot it, and compare it with the original.
Normalization
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset
X_train = to_time_series_dataset(data_train[:,1:])
X_train = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_train)
plt.figure(figsize=(18, 8))
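# Compare series 0 (Plot 0, Class 1.0) after and before normalization.
# Column 0 of data_train is the class label, so it is excluded from the original curve.
plt.plot(X_train[0],c='magenta',alpha=0.8,label="normalized")
plt.plot(data_train[0,1:],c='blue',alpha=0.8,label="original")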
plt.legend(fontsize=17)
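As a rough illustration of what this normalization does to a single series, the sketch below z-normalizes a toy array with plain NumPy. The helper name `z_normalize` and the toy values are assumptions made for this example. It is not tslearn's implementation, and `TimeSeriesScalerMeanVariance` remains the tool used in this article.

```python
import numpy as np

def z_normalize(series):
    """Rescale a 1-D series to mean 0 and standard deviation 1 (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    if std == 0:
        # A constant series has no shape to preserve; return zeros instead of dividing by 0.
        return np.zeros_like(series)
    return (series - series.mean()) / std

toy = np.array([2.0, 4.0, 6.0, 8.0])  # toy values, not ECG data
normalized = z_normalize(toy)
print(normalized.mean(), normalized.std())
```

After this step, two series that differ only in offset and amplitude become identical, which is why the normalization suits a method that compares shapes.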
K-Shape can be summarized briefly in three points.
- A shape-based clustering method for time series data
- Basically the same as k-means, but the way the distance is measured is different
- The K-Shape paper is linked here. Written in 2015, the algorithm itself is relatively new (https://dl.acm.org/doi/10.1145/2723372.2737793)
Run K-Shape with 2 clusters, a maximum of 100 iterations, and 100 runs with different initial centroids (n_init=100).
Train
from tslearn.clustering import KShape
ks = KShape(n_clusters=2,max_iter=100,n_init=100,verbose=0)
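# fit_predict fits the model and returns the cluster label of each series,
# so a separate call to fit beforehand is not needed.
y_pred = ks.fit_predict(X_train)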
print(y_pred)
---
#[0 1 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1]
Each of the 23 time series has now been assigned a predicted label indicating which of the two clusters it belongs to. For the hyperparameters of K-Shape, see the following URL. https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KShape.html
The **adjusted Rand index** is used as the evaluation metric for time series clustering. In a nutshell, this metric measures "how well the cluster assignments match between the predicted and the true clustering."
Here, Scikit-learn's adjusted_rand_score is used. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
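To make the metric more concrete, here is a minimal pure-Python sketch that computes the adjusted Rand index from pair counts. The function name `adjusted_rand_index` and the toy labels are assumptions for illustration. It is not scikit-learn's implementation, and `adjusted_rand_score` is still what this article uses for the actual evaluation.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (illustrative sketch only)."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two samples: there are no pairs to disagree on.
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))  # contingency table cells
    rows = Counter(labels_true)                     # sizes of the true classes
    cols = Counter(labels_pred)                     # sizes of the predicted clusters
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total_pairs    # value expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (e.g. one cluster on both sides): treat as perfect agreement.
        return 1.0
    return (index - expected) / (max_index - expected)

truth = [1, 1, 1, 2, 2, 2]
same_grouping = [0, 0, 0, 1, 1, 1]   # identical partition, different label names
mixed_grouping = [0, 1, 0, 1, 0, 1]  # grouping unrelated to the true classes

print(adjusted_rand_index(truth, same_grouping))
print(adjusted_rand_index(truth, mixed_grouping))
```

The first call gives a perfect score even though the cluster numbers differ from the class numbers, because only the grouping matters. The second gives a value slightly below 0, which is what an assignment unrelated to the true classes looks like.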
Evaluation(For training data)
from sklearn.metrics import adjusted_rand_score
ars = adjusted_rand_score(data_train[:,0],y_pred)
print("Adjusted Rand index: ",ars)
---
#Adjusted Rand index: 0.668041237113402
Next, compute the adjusted Rand index for the test data.
Evaluation(For test data)
import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset
data_test = np.loadtxt("ECGFiveDays_TEST.tsv")
X_test = to_time_series_dataset(data_test[:,1:])
pred_test = ks.predict(X_test)
ars = adjusted_rand_score(data_test[:,0],pred_test)
print("Adjusted Rand index: ",ars)
---
#Adjusted Rand index: 0.06420517898316379
The score is very low.
A score close to 0 is the score obtained when labels are assigned at random.
By that logic, this can hardly be called a good model. The likely reason for the low score on the test data is that the amount of training data is overwhelmingly small. So let's increase the amount of data and try the same thing again.
The following uses **ECG5000_TRAIN.tsv** and **ECG5000_TEST.tsv**, which are packaged in the UCR Archive.
ECG5000 dataset
import numpy as np
import matplotlib.pyplot as plt
from tslearn.utils import to_time_series_dataset
from sklearn.model_selection import train_test_split
data_test_ = np.loadtxt("ECG5000_TEST.tsv")
data_train_ = np.loadtxt("ECG5000_TRAIN.tsv")
data_joined = np.concatenate((data_train_,data_test_),axis=0)
data_train_,data_test_ =train_test_split(data_joined,test_size=0.2,random_state=2000)
X_train_ = to_time_series_dataset(data_train_[:,1:])
X_test_ = to_time_series_dataset(data_test_[:,1:])
print("Total number of time series data: ",len(data_train_))
print("Number of classes: ", len(np.unique(data_train_[:,0])))
print("Time series length: ",len(data_train_[0,1:]))
---
#Total number of time series data: 4000
#Number of classes: 5
#Time series length: 140
The number of series belonging to each class is as follows.
Cluster details
print("Number of data belonging to class 1.0:",len(data_train_[data_train_[:,0]==1.0]))
print("Number of data belonging to class 2.0:",len(data_train_[data_train_[:,0]==2.0]))
print("Number of data belonging to class 3.0:",len(data_train_[data_train_[:,0]==3.0]))
print("Number of data belonging to class 4.0:",len(data_train_[data_train_[:,0]==4.0]))
print("Number of data belonging to class 5.0:",len(data_train_[data_train_[:,0]==5.0]))
---
#Number of data belonging to class 1.0: 2342
#Number of data belonging to class 2.0: 1415
#Number of data belonging to class 3.0: 74
#Number of data belonging to class 4.0: 153
#Number of data belonging to class 5.0: 16
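The same per-class counts can be obtained without writing one line per class. The following is a small sketch using `np.unique` with `return_counts=True`. The helper name `class_counts` and the toy labels are assumptions for this example and are not the ECG5000 labels.

```python
import numpy as np

def class_counts(labels):
    """Return {class label: number of series} for a 1-D label array (hypothetical helper)."""
    values, counts = np.unique(labels, return_counts=True)
    return {float(v): int(c) for v, c in zip(values, counts)}

toy_labels = np.array([1.0, 2.0, 1.0, 3.0, 1.0, 2.0])  # toy labels for illustration
print(class_counts(toy_labels))  # {1.0: 3, 2.0: 2, 3.0: 1}
```

With the data from this article, the call would be `class_counts(data_train_[:, 0])`.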
Plotted, the mean waveform of each class looks like this.
Visualization of each cluster
%matplotlib inline
import matplotlib.pyplot as plt
# For each class, plot the mean of all series belonging to it.
for j in np.unique(data_train_[:,0]):
    plt.figure(figsize=(28,2))
    dataPlot = data_train_[data_train_[:,0]==j]
    cnt = len(dataPlot)
    # Drop the label column, then average over the series of this class.
    dataPlot = dataPlot[:,1:].mean(axis=0)
    print(" Class ",j," Count ",cnt)
    plt.plot(dataPlot,c='b')
    plt.show()
We train and evaluate in the same way as before.
Training/Evaluation(For test data)
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import KShape
from sklearn.metrics import adjusted_rand_score
X_train_ = to_time_series_dataset(data_train_[:,1:])
X_train_ = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_train_)
X_test_ = to_time_series_dataset(data_test_[:,1:])
X_test_ = TimeSeriesScalerMeanVariance(mu=0.,std=1.).fit_transform(X_test_)
ks = KShape(n_clusters=5,max_iter=100,n_init=100,verbose=1,random_state=2000)
ks.fit(X_train_)
preds_test = ks.predict(X_test_)
ars_ = adjusted_rand_score(data_test_[:,0],preds_test)
print("Adjusted Rand index: ",ars_)
---
#Adjusted Rand index: 0.4984050982000773
The adjusted Rand index on the test data was about 0.498. Compared with the previous experiment, the difference is like night and day. By increasing the size of the training set from 23 to 4000, a much more accurate model was created.
Finally, let's look at the test data.
Visualization of classification results
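# Create one figure and draw each cluster in its own subplot.
plt.figure(figsize=(20,35))
for yi in range(5):
    plt.subplot(5, 1, 1 + yi)
    # Draw every test series assigned to cluster yi as a faint blue line.
    for xx in X_test_[preds_test == yi]:
        plt.plot(xx.ravel(), alpha=0.1,c='blue')
    cnt = len(X_test_[preds_test == yi])
    # Overlay the cluster centroid in red.
    plt.plot(ks.cluster_centers_[yi].ravel(), "red")
    print(" Class ",yi," Count ",cnt)
    plt.title("Cluster %d" % (yi + 1) )
plt.tight_layout()
plt.show()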
The figure shows the clustering results for the test data (the red line is the centroid).
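The cluster numbers assigned by K-Shape are arbitrary and do not correspond to the class numbers in the data. One way to see which classes end up in which cluster is a cross-tabulation of true labels against cluster ids. The sketch below is only an illustration: the helper name `cross_tabulate` and the toy arrays are assumptions and are not part of tslearn or of the results above.

```python
import numpy as np

def cross_tabulate(true_labels, cluster_ids):
    """Count how many series of each true class fall into each cluster (hypothetical helper)."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_ids)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for row, cls in enumerate(classes):
        for col, clu in enumerate(clusters):
            # Number of series with true class cls that were put in cluster clu.
            table[row, col] = np.sum((true_labels == cls) & (cluster_ids == clu))
    return classes, clusters, table

toy_true = np.array([1.0, 1.0, 2.0, 2.0, 2.0])  # toy class labels
toy_pred = np.array([1, 1, 0, 0, 1])            # toy cluster ids
classes, clusters, table = cross_tabulate(toy_true, toy_pred)
print(table)  # rows are classes, columns are clusters
```

With the data from this article, the call would be `cross_tabulate(data_test_[:, 0], preds_test)`.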
- This time, I tried time series clustering using the shape-based K-Shape method and electrocardiogram data.
- By increasing the amount of data, the adjusted Rand index rose sharply.
- In addition to K-Shape, you can also cluster with k-means and DBSCAN, so I would like to compare them in the future.
$$
SBD(\vec{x},\vec{y})=1-\max_{w}\frac{CC_w(\vec{x},\vec{y})}{\sqrt{R_0(\vec{x},\vec{x})}\sqrt{R_0(\vec{y},\vec{y})}}
$$
- $CC_w$ is the cross-correlation, a similarity measure that fits well with time series data
- $SBD$ is the distance between $\vec{x}$ and $\vec{y}$
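The distance can be sketched in a few lines of NumPy. This is a minimal illustration and not tslearn's implementation: the function name `sbd` and the toy pulses are assumptions, and the cross-correlation is computed directly with `np.correlate` over every shift $w$ rather than with any optimized method.

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance: 1 - max_w CC_w(x, y) / (sqrt(R_0(x, x)) * sqrt(R_0(y, y)))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Cross-correlation of x and y at every possible shift w.
    cc = np.correlate(x, y, mode="full")
    # R_0(x, x) and R_0(y, y) are the zero-shift autocorrelations, i.e. squared norms.
    denom = np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y))
    if denom == 0:
        raise ValueError("sbd is undefined for an all-zero series")
    return 1.0 - cc.max() / denom

pulse = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])  # toy series
shifted = np.roll(pulse, 2)   # same shape, two steps later
flipped = -pulse              # inverted shape

print(sbd(pulse, shifted))               # close to 0: same shape, only shifted
print(sbd(pulse, flipped))               # much larger: the shapes do not match
print(np.linalg.norm(pulse - shifted))   # Euclidean distance for comparison
```

The shifted pulse is at distance 0 under SBD but clearly apart under the Euclidean distance, which is the sense in which K-Shape measures distance differently from k-means.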
tslearn is a Python package for time series analysis using machine learning. In addition to K-Shape, various time series analysis algorithms are included.