[PYTHON] tslearn trial memorandum

Overview

tslearn is a powerful package for clustering time series data. This memo records what I did when checking how it works. It looks usable at work, so I am trying it out a little.

- Implementation period: October 2020
- Environment: Ubuntu 18.04 LTS

Creating a Conda virtual environment for operation check

Create a new virtual environment for the operation check, following the procedure in the Miniconda installation memo. Then install the following required packages.

conda install -c conda-forge tslearn
conda install -c conda-forge h5py

As noted in the tslearn installation instructions, the following packages are also required: scikit-learn, numpy, scipy
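To confirm that the environment contains everything listed above, a small check like the following can be run inside the new Conda environment. This is only a sketch: `required` and `find_missing` are names introduced here for illustration, and the list uses import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names of the packages mentioned above (scikit-learn is imported as "sklearn")
required = ["tslearn", "h5py", "sklearn", "numpy", "scipy"]

def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(required)
print("missing packages:", missing if missing else "none")
```

If anything is reported as missing, it can be installed with conda in the same way as tslearn and h5py.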

Preparation

Three clustering methods are implemented in tslearn's clustering module.

Of these, this time I will use the one called K-Shape. An overview of K-Shape can be found in this blog, from which you can also refer to the original paper.
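As a rough illustration of the idea behind K-Shape, the method compares waveforms with a shape-based distance (SBD): one minus the maximum normalized cross-correlation over all time shifts. The following is a minimal NumPy sketch of that distance only, not of the whole algorithm. The function name `sbd` and the toy signals are assumptions made for this example, and tslearn's internal implementation may differ (the original paper computes the cross-correlation with FFT).

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance between two 1-D series of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Cross-correlation at every possible shift (lags -(n-1) .. n-1)
    cc = np.correlate(x, y, mode="full")
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        return 1.0  # convention chosen here for all-zero series (assumption)
    # 1 - maximum normalized cross-correlation; values near 0 mean similar shape
    return 1.0 - cc.max() / denom

t = np.linspace(0, 1, 100)
pulse = np.exp(-((t - 0.3) ** 2) / 0.002)     # narrow bump around t = 0.3
shifted = np.exp(-((t - 0.6) ** 2) / 0.002)   # same bump, later in time

print(sbd(pulse, pulse))     # identical series: essentially 0
print(sbd(pulse, shifted))   # same shape at a different position: still small
print(sbd(pulse, -pulse))    # inverted shape: large
```

Because the distance takes the best alignment over all shifts, two heartbeats with the same shape but a different peak position are treated as close, which is why the method looks suitable for ECG waveforms.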

This time, I try clustering arrhythmia waveforms, based on tslearn's official sample code. The data is the ECG Heartbeat Categorization Dataset from Kaggle. Since no DNN is trained here, only mitbih_train.csv needs to be downloaded. Each row holds 187 points of waveform data sampled at 125 Hz, and the last column is a label from 0 to 4. '0' is the normal waveform, and the other labels seem to correspond to a different waveform for each symptom. There are 87554 rows in total; I shuffle them and use 100. The labels are set aside so that I can check later whether the clustering succeeded.

Code

Import the required libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

Format the data and finally scale it.

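train_df = pd.read_csv('/home/hoge/mitbih_train.csv', header=None)
# Copy into a writable NumPy array so that the in-place shuffle below is safe
trainArr = train_df.to_numpy(copy=True)
np.random.shuffle(trainArr)             # shuffle the rows in place

trainArr_X = trainArr[:100, :187]       # first 100 rows, 187 waveform points each
trainArr_y = trainArr[:100, -1:]        # label column, kept for checking later
# tslearn expects the shape (n_series, n_timestamps, n_dimensions)
trainArr_X = trainArr_X.reshape([100, 187, 1])
print(trainArr_X.shape)         # (100, 187, 1)
print(trainArr_y.shape)         # (100, 1)
sz = trainArr_X.shape[1]        # series length, used for the x-axis range

# For this method to operate properly, prior scaling is required
trainArr_X = TimeSeriesScalerMeanVariance().fit_transform(trainArr_X)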
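
The scaling step above standardizes each series separately. The following NumPy sketch shows roughly what that amounts to, assuming the scaler's default settings (mean 0, standard deviation 1). `z_normalize` and `eps` are names introduced here for illustration, not part of tslearn, and tslearn's own handling of constant series may differ.

```python
import numpy as np

def z_normalize(X, eps=1e-12):
    """Scale each series in X (n_series, n_timestamps, 1) to zero mean and unit std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)   # one mean per series
    std = X.std(axis=1, keepdims=True)     # one standard deviation per series
    std[std < eps] = 1.0                   # constant series stay at zero instead of dividing by 0
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 187, 1))   # toy data, not the ECG set
Xn = z_normalize(X)
print(Xn.mean(axis=1).ravel())   # per-series means after scaling
print(Xn.std(axis=1).ravel())    # per-series standard deviations after scaling
```

After this step only the shape of each waveform matters, not its amplitude or offset, which is what a shape-based method needs.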

Perform clustering and view the results. The number of clusters is set to 5, matching the five label classes.

# kShape clustering
seed = 0  # fixed seed so that the clustering is reproducible
ks = KShape(n_clusters=5, verbose=True, random_state=seed)
y_pred = ks.fit_predict(trainArr_X)

plt.figure(figsize=(10, 10), tight_layout = True)
for yi in range(5):
    plt.subplot(5, 1, 1 + yi)
    
    for xx in trainArr_X[y_pred == yi]:
        plt.plot(xx.ravel(), "b-", alpha=.2)
    plt.plot(ks.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-8, 8)
    plt.title("Cluster %d" % (yi + 1))

plt.show()

Result

The figure below is obtained. The red line is the centroid of each cluster (ks.cluster_centers_), a representative shape computed by K-Shape rather than one of the input waveforms.

[Figure: Screenshot from 2020-10-04 16-18-33.png]

(I really wanted to color-code the 100 lines by disease label using trainArr_y, but with my current Python skills I could not write it right away, so I will add it at a later date.) The label counts in the 100 original samples used here (unrelated to the cluster numbers in the figure above) are as follows. '0': 86, '1': 2, '2': 4, '3': 2, '4': 6
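
One possible way to do the color-coding mentioned above is to flatten the label column to integers and use each label as an index into a list of colors. The sketch below uses random stand-in arrays in place of trainArr_X, trainArr_y and y_pred (an assumption, so that it runs on its own); the names `X`, `y_true`, `labels` and `colors` are introduced here for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

rng = np.random.default_rng(0)
n_clusters, n_labels = 5, 5

# Stand-ins (assumptions) for trainArr_X, trainArr_y and y_pred
X = rng.normal(size=(100, 187, 1))
y_true = rng.integers(0, n_labels, size=(100, 1)).astype(float)
y_pred = rng.integers(0, n_clusters, size=100)

labels = y_true.ravel().astype(int)        # (100, 1) floats -> (100,) ints
colors = ["C0", "C1", "C2", "C3", "C4"]    # one matplotlib color per label

fig, axes = plt.subplots(n_clusters, 1, figsize=(10, 10), tight_layout=True)
for yi, ax in enumerate(axes):
    # Indices of the series assigned to cluster yi
    for i in np.flatnonzero(y_pred == yi):
        ax.plot(X[i].ravel(), color=colors[labels[i]], alpha=0.3)
    ax.set_xlim(0, X.shape[1])
    ax.set_title("Cluster %d" % (yi + 1))

# One shared legend that maps colors back to the original labels
handles = [Line2D([], [], color=colors[k], label="label %d" % k)
           for k in range(n_labels)]
fig.legend(handles=handles, loc="upper right")
```

With the real data, replace the stand-ins with trainArr_X, trainArr_y and y_pred, drop the `matplotlib.use("Agg")` line, and call plt.show() as in the code above.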

The label counts and the sizes of the clusters clearly differ, so it does not seem usable without some adjustment. I will investigate what kind of parameters are available. The raw waveforms are zero-padded in the latter part of the 187 points. I fed them in as they were, but that does not seem to have caused a problem.
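
To make the comparison between labels and clusters concrete, the two can be cross-tabulated. The following is a minimal sketch with a small toy example; `contingency` is a helper name introduced here, and the toy arrays are assumptions standing in for trainArr_y.ravel().astype(int) and y_pred.

```python
import numpy as np

def contingency(labels, clusters, n_labels=5, n_clusters=5):
    """Rows: original labels, columns: cluster indices, cells: sample counts."""
    table = np.zeros((n_labels, n_clusters), dtype=int)
    # Add 1 to cell (label, cluster) for every sample, including repeated pairs
    np.add.at(table, (labels, clusters), 1)
    return table

# Toy stand-ins (assumptions), eight samples only
labels = np.array([0, 0, 0, 0, 1, 2, 2, 4])
clusters = np.array([3, 3, 1, 3, 1, 0, 0, 2])

table = contingency(labels, clusters)
print(table)
print(table.sum(axis=1))   # number of samples per label
print(table.sum(axis=0))   # number of samples per cluster
```

If the clustering matched the labels well, each row would have most of its count in a single column. Cluster numbers are arbitrary, so the columns do not have to line up with the label values.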

Kaggle also has code that classifies this dataset with a CNN, and its accuracy is outstandingly good.
