**Classifying time series data with k-shape, using electrocardiogram data**
The k-shape method is a widely used clustering method for time series data. In this article, we cluster electrocardiogram (ECG) data with k-shape. The data and code are borrowed from ["Hands-On Unsupervised Learning Using Python"](https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E5%AD%A6%E7%BF%92-%E2%80%95%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%8F%AF%E8%83%BD%E6%80%A7%E3%82%92%E5%BA%83%E3%81%92%E3%82%8B%E3%83%A9%E3%83%99%E3%83%AB%E3%81%AA%E3%81%97%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%88%A9%E7%94%A8-Ankur-Patel/dp/4873119103).
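For background, k-shape compares series with a shape-based distance (SBD) built on normalized cross-correlation, which makes it robust to phase shifts. Below is a minimal illustrative sketch of that distance; this is my own code, not the library's implementation.

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance between two 1-D series (illustrative sketch).

    k-shape's distance is 1 minus the maximum normalized
    cross-correlation over all shifts; tslearn's internal
    implementation is more elaborate than this.
    """
    x = (x - x.mean()) / x.std()  # z-normalize both series
    y = (y - y.mean()) / y.std()
    cc = np.correlate(x, y, mode="full")  # cross-correlation at every shift
    ncc = cc / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - ncc.max()  # 0 means identical shape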
We use the UCR Time Series Classification Archive from the University of California, Riverside: https://www.cs.ucr.edu/~eamonn/time_series_data/
From it, we use the ECG5000 dataset. The archive's password is [attempt to classify].
The code follows the reference book, and some parts of the book are not covered in this article.
If you run this on Colab, the following packages need to be installed first.
```python
!pip install kshape
!pip install tslearn
```

```python
'''Main'''
import numpy as np
import pandas as pd
import os, time, re
import pickle, gzip, datetime
from os import listdir, walk
from os.path import isfile, join
'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
from mpl_toolkits.axes_grid1 import Grid
%matplotlib inline
'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss, accuracy_score
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, mean_squared_error
from keras.utils import to_categorical
from sklearn.metrics import adjusted_rand_score
import random
'''Algos'''
from kshape.core import kshape, zscore
import tslearn
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.clustering import TimeSeriesKMeans
import hdbscan
sns.set_context("talk")
```
Next, read the data. After joining and re-splitting the original files, the training set contains 4,000 time series, each labeled with one of 5 classes.
```python
# Read the data
current_path = os.getcwd()
file = os.path.sep.join(["", 'data', 'datasets', 'ucr_time_series_data', ''])  # Adjust to your own folder layout
data_train = np.loadtxt(current_path + file + "ECG5000/ECG5000_TRAIN",
                        delimiter=",")
data_test = np.loadtxt(current_path + file + "ECG5000/ECG5000_TEST",
                       delimiter=",")
data_joined = np.concatenate((data_train, data_test), axis=0)
data_train, data_test = train_test_split(data_joined,
                                         test_size=0.20, random_state=2019)

X_train = to_time_series_dataset(data_train[:, 1:])
y_train = data_train[:, 0].astype(int)
X_test = to_time_series_dataset(data_test[:, 1:])
y_test = data_test[:, 0].astype(int)
#View data structure
print("Number of time series:", len(data_train))
print("Number of unique classes:", len(np.unique(data_train[:,0])))
print("Time series length:", len(data_train[0,1:]))
# Count the number of time series in each class
for class_i in [1.0, 2.0, 3.0, 4.0, 5.0]:
    print("Number of time series in class {}:".format(class_i),
          len(data_train[data_train[:, 0] == class_i]))
"""
Number of time series: 4000
Number of unique classes: 5
Time series length: 140
Number of time series in class 1.0: 2327
Number of time series in class 2.0: 1423
Number of time series in class 3.0: 75
Number of time series in class 4.0: 156
Number of time series in class 5.0: 19
"""
Let's visualize examples from classes 1 through 5. To an untrained eye, the classes are hard to tell apart.
```python
fig, ax = plt.subplots(5, 5, figsize=[30, 10], sharey=True)
ax_f = ax.flatten()

# Plot five examples from each of classes 1-5
df_train = pd.DataFrame(data_train)
cnt = 0
for class_i in range(1, 6):
    df_train_plot = df_train[df_train[0] == class_i]
    for i in range(0, 5):
        ax_f[cnt].set_title("class: {}".format(class_i))
        ax_f[cnt].plot(df_train_plot.iloc[i][1:])
        cnt += 1
```
Now implement k-shape and evaluate it. For evaluation we use the adjusted Rand index (ARI): the closer it is to 1, the better the clustering agrees with the true labels.
```python
#k-shape
ks = KShape(n_clusters=5,max_iter=100,n_init=100,verbose=0)
ks.fit(X_train)
# Evaluate with the adjusted Rand index (ARI)
# See how well the clusters match the actual labels
# The closer to 1, the better the clustering
preds = ks.predict(X_train)
ars = adjusted_rand_score(data_train[:, 0], preds)
print("train Adjusted Rand Index:", ars)

preds_test = ks.predict(X_test)
ars = adjusted_rand_score(data_test[:, 0], preds_test)
print("test Adjusted Rand Index:", ars)
```
In addition, the class distribution within each cluster is shown below. Each cluster is dominated by a single class, so the clustering works fairly well.
Note, however, that none of the clusters has class 3, 4, or 5 as its largest group.
```python
# Show the class distribution within each cluster
preds_test = preds_test.reshape(1000, 1)
preds_test = np.hstack((preds_test, data_test[:, 0].reshape(1000, 1)))
preds_test = pd.DataFrame(data=preds_test)
preds_test = preds_test.rename(columns={0: 'prediction', 1: 'actual'})

counter = 0
for i in np.sort(preds_test.prediction.unique()):
    print("Predicted Cluster ", i)
    print(preds_test.actual[preds_test.prediction == i].value_counts())
    print()
    cnt = (preds_test.actual[preds_test.prediction == i]
           .value_counts().iloc[1:].sum())
    counter = counter + cnt
print("Count of Non-Primary Points: ", counter)
"""
Predicted Cluster 0.0
2.0 29
4.0 2
1.0 2
3.0 2
5.0 1
Name: actual, dtype: int64
Predicted Cluster 1.0
2.0 270
4.0 14
3.0 8
1.0 2
5.0 1
Name: actual, dtype: int64
Predicted Cluster 2.0
1.0 553
4.0 16
2.0 9
3.0 7
Name: actual, dtype: int64
Predicted Cluster 3.0
2.0 35
1.0 5
4.0 5
5.0 3
3.0 3
Name: actual, dtype: int64
Predicted Cluster 4.0
1.0 30
4.0 1
3.0 1
2.0 1
Name: actual, dtype: int64
Count of Non-Primary Points: 83
"""
In this article, we implemented k-shape, a clustering method for time series data. Using it feels much like running k-means. It should prove useful for signal processing and anomaly detection.
If you found this article helpful, an LGTM would be much appreciated.