Implementation of clustering k-shape method for time series data [Unsupervised learning with python Chapter 13]

What to do in this article

- Implement clustering of time series data with k-shape
- Use electrocardiogram (ECG) data as the dataset

Introduction

The k-shape method is often used to cluster time series data. In this article, we cluster electrocardiogram data with k-shape. The data and code follow the book ["Unsupervised learning with python"](https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E5%AD%A6%E7%BF%92-%E2%80%95%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%8F%AF%E8%83%BD%E6%80%A7%E3%82%92%E5%BA%83%E3%81%92%E3%82%8B%E3%83%A9%E3%83%99%E3%83%AB%E3%81%AA%E3%81%97%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%88%A9%E7%94%A8-Ankur-Patel/dp/4873119103).
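k-shape compares series with a shape-based distance (SBD): one minus the maximum normalized cross-correlation over all time shifts. The following is a minimal NumPy sketch of that distance, not the code used inside the kshape or tslearn libraries. The helper names `zscore_1d` and `sbd` and the toy signals are made up for illustration.

```python
import numpy as np

def zscore_1d(x):
    """Z-normalize a 1-D series (zero mean, unit variance). Illustrative helper."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def sbd(x, y):
    """Shape-based distance: 1 - max normalized cross-correlation over all shifts."""
    x, y = zscore_1d(x), zscore_1d(y)
    # Cross-correlation of x and y at every possible lag
    cc = np.correlate(x, y, mode="full")
    ncc = cc / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - ncc.max()

t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
wave = np.sin(t)
shifted = np.roll(wave, 10)                          # same shape, shifted in time
noise = np.random.default_rng(0).normal(size=100)    # unrelated shape

d_same = sbd(wave, wave)
d_shifted = sbd(wave, shifted)
d_noise = sbd(wave, noise)
print("SBD(wave, wave):   ", d_same)
print("SBD(wave, shifted):", d_shifted)
print("SBD(wave, noise):  ", d_noise)
```

The shifted copy stays close to the original because the distance searches over lags, while the noise series ends up far away. This tolerance to shifts is what separates k-shape from k-means with Euclidean distance.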

Data to handle

The data comes from the UCR Time Series Classification Archive of the University of California, Riverside. https://www.cs.ucr.edu/~eamonn/time_series_data/

This article uses the ECG5000 dataset. The archive is password-protected; the password is "attempttoclassify".

Library import

The imports follow the reference book. Some of them are not used in this article.

When running on Colab, install the libraries first.

python


!pip install kshape
!pip install tslearn

python



'''Main'''
import numpy as np
import pandas as pd
import os, time, re
import pickle, gzip, datetime
from os import listdir, walk
from os.path import isfile, join

'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
from mpl_toolkits.axes_grid1 import Grid

%matplotlib inline

'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import log_loss, accuracy_score
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, mean_squared_error
from keras.utils import to_categorical
from sklearn.metrics import adjusted_rand_score
import random

'''Algos'''
from kshape.core import kshape, zscore
import tslearn
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.clustering import TimeSeriesKMeans
import hdbscan

sns.set("talk")

Data reading

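Next, read the data. ECG5000 ships as separate TRAIN and TEST files; here they are joined and re-split 80/20, which gives 4000 training series and 1000 test series. Each series carries one of 5 class labels.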

python


#Data reading
current_path = os.getcwd()
file = os.path.sep.join(["",'data', 'datasets', 'ucr_time_series_data', '']) #Rewrite according to personal folder
data_train = np.loadtxt(current_path+file+
                        "ECG5000/ECG5000_TRAIN", 
                        delimiter=",")

data_test = np.loadtxt(current_path+file+
                       "ECG5000/ECG5000_TEST", 
                       delimiter=",")

data_joined = np.concatenate((data_train,data_test),axis=0)
data_train, data_test = train_test_split(data_joined, 
                                    test_size=0.20, random_state=2019)

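#Column 0 is the class label, the remaining columns are the series
#np.int was removed in NumPy 1.24, so use the built-in int
X_train = to_time_series_dataset(data_train[:, 1:])
y_train = data_train[:, 0].astype(int)
X_test = to_time_series_dataset(data_test[:, 1:])
y_test = data_test[:, 0].astype(int)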

#View data structure
print("Number of time series:", len(data_train))
print("Number of unique classes:", len(np.unique(data_train[:,0])))
print("Time series length:", len(data_train[0,1:]))

# Calculate number of readings per class
print("Number of time series in class 1.0:", 
      len(data_train[data_train[:,0]==1.0]))
print("Number of time series in class 2.0:", 
      len(data_train[data_train[:,0]==2.0]))
print("Number of time series in class 3.0:", 
      len(data_train[data_train[:,0]==3.0]))
print("Number of time series in class 4.0:", 
      len(data_train[data_train[:,0]==4.0]))
print("Number of time series in class 5.0:", 
      len(data_train[data_train[:,0]==5.0]))

"""
Number of time series: 4000
Number of unique classes: 5
Time series length: 140
Number of time series in class 1.0: 2327
Number of time series in class 2.0: 1423
Number of time series in class 3.0: 75
Number of time series in class 4.0: 156
Number of time series in class 5.0: 19
"""
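The per-class counts printed above can also be obtained in one call with `np.unique(..., return_counts=True)`. This is a small sketch on a made-up array; `demo` is a hypothetical stand-in for `data_train`, where column 0 holds the class label.

```python
import numpy as np

# Hypothetical stand-in for data_train: column 0 is the class label
demo = np.array([[1.0, 0.1, 0.2],
                 [2.0, 0.3, 0.1],
                 [1.0, 0.0, 0.4],
                 [5.0, 0.2, 0.2]])

# Unique labels and how often each one occurs
classes, counts = np.unique(demo[:, 0], return_counts=True)
for c, n in zip(classes, counts):
    print("Number of time series in class {}: {}".format(c, n))
```

This avoids one hand-written print per class and keeps working if the number of classes changes.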

Data visualization

Plot five examples from each of classes 1 to 5. To an untrained eye, the differences between the classes are hard to see.

python



fig, ax = plt.subplots(5,5,figsize=[30,10],sharey=True)

ax_f = ax.flatten()

#Class 1-5 plot
df_train = pd.DataFrame(data_train)

cnt = 0
for class_i in range(1,6):
  df_train_plot = df_train[df_train[0] == class_i]
  for i in range(0,5):
      ax_f[cnt].set_title("class: {}".format(class_i))
      ax_f[cnt].plot(df_train_plot.iloc[i][1:])
      cnt += 1

[Figure: five example series from each of classes 1 to 5]

Classification by k-shape

Next, fit k-shape and evaluate it. The evaluation uses the adjusted Rand index, which measures how well the clusters agree with the true labels: the closer it is to 1, the better the clustering.
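To get a feel for this score before applying it to k-shape, here is a small sketch with scikit-learn's `adjusted_rand_score` on made-up labels. The lists `truth`, `same_partition`, and `mixed` are illustrative only.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # same grouping, different cluster names
mixed = [0, 1, 0, 1]            # every cluster mixes both classes

ari_same = adjusted_rand_score(truth, same_partition)
ari_mixed = adjusted_rand_score(truth, mixed)
print("same grouping, renamed clusters:", ari_same)
print("mixed clusters:", ari_mixed)
```

The score only depends on which points are grouped together, not on the cluster numbers, so the cluster IDs returned by k-shape do not need to match the class labels 1 to 5. A grouping no better than chance scores around 0 or below.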

python


#k-shape
ks = KShape(n_clusters=5,max_iter=100,n_init=100,verbose=0)
ks.fit(X_train)

#Evaluation with the adjusted Rand index
#Measures how well the clusters match the actual labels
#The closer it is to 1, the better the clusters match the labels

preds = ks.predict(X_train)
ars = adjusted_rand_score(data_train[:,0], preds)
print("train Adjusted Rand Index:", ars)

#Score the test set against its own predictions (preds holds the training predictions)
preds_test = ks.predict(X_test)
ars = adjusted_rand_score(data_test[:,0], preds_test)
print("test Adjusted Rand Index:", ars)

Next, display the distribution of true classes inside each predicted cluster. Each cluster is dominated by a single class, so the clustering works fairly well.

Note, however, that no cluster has class 3, 4, or 5 as its largest class. These rare classes are absorbed into clusters dominated by class 1 or class 2.

python



#Visualization of distribution inside the cluster
preds_test = preds_test.reshape(1000,1)
preds_test = np.hstack((preds_test,data_test[:,0].reshape(1000,1)))
preds_test = pd.DataFrame(data=preds_test)
preds_test = preds_test.rename(columns={0: 'prediction', 1: 'actual'})

counter = 0
for i in np.sort(preds_test.prediction.unique()):
    print("Predicted Cluster ", i)
    print(preds_test.actual[preds_test.prediction==i].value_counts())
    print()
    cnt = preds_test.actual[preds_test.prediction==i] \
                        .value_counts().iloc[1:].sum()
    counter = counter + cnt
print("Count of Non-Primary Points: ", counter)

"""
Predicted Cluster  0.0
2.0    29
4.0     2
1.0     2
3.0     2
5.0     1
Name: actual, dtype: int64

Predicted Cluster  1.0
2.0    270
4.0     14
3.0      8
1.0      2
5.0      1
Name: actual, dtype: int64

Predicted Cluster  2.0
1.0    553
4.0     16
2.0      9
3.0      7
Name: actual, dtype: int64

Predicted Cluster  3.0
2.0    35
1.0     5
4.0     5
5.0     3
3.0     3
Name: actual, dtype: int64

Predicted Cluster  4.0
1.0    30
4.0     1
3.0     1
2.0     1
Name: actual, dtype: int64

Count of Non-Primary Points:  83
"""
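The same per-cluster breakdown can be shown as a single table with `pd.crosstab`. This is a sketch on made-up labels; `pred` and `actual` are hypothetical stand-ins for the predicted clusters and the true classes.

```python
import pandas as pd

# Hypothetical predicted clusters and true classes
pred = pd.Series([0, 0, 1, 1, 1, 2], name="prediction")
actual = pd.Series([2, 2, 1, 1, 4, 1], name="actual")

# Rows: predicted cluster, columns: true class, cells: number of series
table = pd.crosstab(pred, actual)
print(table)

# Points that do not belong to the largest class of their cluster
non_primary = int((table.sum(axis=1) - table.max(axis=1)).sum())
print("Count of Non-Primary Points:", non_primary)
```

Subtracting each row's maximum from its total gives the same count as summing `value_counts().iloc[1:]` per cluster, and the table makes it easy to see which classes never dominate a cluster.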

Conclusion

In this article, we implemented k-shape, a clustering method for time series data. The workflow is very similar to k-means. I think it will be useful for signal processing and anomaly detection.

If you found this helpful, an LGTM would be encouraging.
