[Python] tsfresh, a library that automatically extracts features from time series data

This is the fifth-day entry of the Ateam Lifestyle Advent Calendar 2019, written by Kobayashi, an engineer in the CTO Office at Ateam Lifestyle Inc. The company is working on a machine learning project.

Introduction

Recently I have been working with time series data at work, and I used tsfresh, a convenient library that automatically extracts features from time series data, so this post is an introduction to it. The order of time series data carries meaning, but I had not been able to make good use of that meaning, which is why I looked into feature extraction methods. Examples of forecasting with time series data include:

- Forecasting future sales from past product sales
- Predicting the next action from a user's action log
- Detecting anomalies from data captured by sensors

And so on. Treating the data as a sequence whose order carries meaning can be expected to give better accuracy than treating each chronologically arranged value individually. Since tsfresh extracts features from time series data, it seems likely to help improve accuracy.
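As a small illustration of why the order matters, here is a sketch that is not part of the original notebook: the toy signal and the helper `lag1_autocorr` are made up for this example. Shuffling a series leaves an order-free statistic such as the mean unchanged, while an order-dependent one such as the lag-1 autocorrelation changes a lot.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation: how strongly x[t] is related to x[t+1]."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered[:-1] * centered[1:]).sum() / (centered ** 2).sum())

# A smooth toy signal: neighbouring samples are strongly related.
series = np.sin(np.linspace(0, 4 * np.pi, 128))

# The same values in a random order.
rng = np.random.default_rng(0)
shuffled = rng.permutation(series)

print(np.isclose(series.mean(), shuffled.mean()))  # the mean ignores order
print(lag1_autocorr(series))    # close to 1 for the ordered signal
print(lag1_autocorr(shuffled))  # much closer to 0 after shuffling
```

A model that only sees the 128 values as independent columns has to discover this kind of structure by itself, whereas an extracted feature hands it over directly.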

tsfresh has a how-to notebook on GitHub, so I followed it and tried it out on [Google Colaboratory](https://colab.research.google.com/notebooks/welcome.ipynb?hl=ja#scrollTo=xitplqMNk_Hc). Google Colaboratory is an environment where you can use Jupyter Notebook for free. Its use is limited to machine learning research and education, but I recommend it because it is a great service: you can use GPUs and TPUs for free, and Python libraries for machine learning are installed from the start.

Preparation

First, install tsfresh. The following is how to write the command in a Jupyter Notebook.

!pip install tsfresh

Import required packages

%matplotlib inline
import matplotlib.pyplot as plt
from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset, load_har_classes
from tsfresh import extract_features, extract_relevant_features, select_features
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import xgboost as xgb
import pandas as pd
import numpy as np

Download data

# fetch dataset from uci
download_har_dataset()
df = load_har_dataset()
print(df.head())
df.shape
        0         1         2    ...       125       126       127
0  0.000181  0.010139  0.009276  ... -0.001147 -0.000222  0.001576
1  0.001094  0.004550  0.002879  ... -0.004646 -0.002941 -0.001599
2  0.003531  0.002285 -0.000420  ...  0.001246  0.003117  0.002178
3 -0.001772 -0.001311  0.000388  ... -0.000867 -0.001172 -0.000028
4  0.000087 -0.000272  0.001022  ... -0.000698 -0.001223 -0.003328

[5 rows x 128 columns]
(7352, 128)

Each row of this data holds 128 consecutive accelerometer readings. Each row is classified into one of 6 categories (walking, climbing stairs, going down stairs, sitting, standing, lying down). Data source: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Let's plot one row as a graph.

plt.title('accelerometer reading')
plt.plot(df.iloc[0, :])  # .ix was removed from pandas; select the first row by position
plt.show()

[Figure: line plot of the first row, titled "accelerometer reading"]

Feature extraction

Take a sample of the data and reshape it.

N = 500
master_df = pd.DataFrame({0: df[:N].values.flatten(),
                          1: np.arange(N).repeat(df.shape[1])})
master_df.head()
	0	1
0	0.000181	0
1	0.010139	0
2	0.009276	0
3	0.005066	0
4	0.010810	0

The data, which was laid out horizontally in 128 columns, is rearranged vertically into a long format so that tsfresh can handle it. Column 0 holds the values and column 1 holds the id (the row number in the original data). We will use this data to extract features. The id column is specified with column_id.
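To make the reshaping easier to follow, here is the same flatten-and-repeat step on a tiny made-up table (3 series of 4 readings instead of 500 series of 128). The names `wide` and `long_df` are chosen just for this sketch.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the HAR data: 3 series (rows) of 4 readings each.
wide = pd.DataFrame([[0.1, 0.2, 0.3, 0.4],
                     [1.1, 1.2, 1.3, 1.4],
                     [2.1, 2.2, 2.3, 2.4]])

n_series, n_steps = wide.shape

long_df = pd.DataFrame({
    0: wide.values.flatten(),                # values, one source row after another
    1: np.arange(n_series).repeat(n_steps),  # id of the source row, repeated per reading
})

print(long_df.shape)        # (12, 2)
print(long_df[1].tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Within each id the readings stay in their original left-to-right order, so the time order of each series is preserved in the long table.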

X = extract_features(master_df, column_id=1)
Feature Extraction: 100%|██████████| 5/5 [01:02<00:00, 12.48s/it]

It took about 1 minute for 500 rows. When I actually tried it with data of 100,000 rows or more, it took a long time and also needed a lot of memory, so some ingenuity seems to be required when applying it to large-scale data.

X.shape
(500, 754)

As many as 754 features were extracted. I have not checked the contents in detail, but they appear to range from basic statistics such as the mean and median to more elaborate ones such as Fourier transform coefficients.
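To give a feel for what "one row of features per series" means, here is a hand-rolled sketch that computes a few such values per id with NumPy and pandas. It is only an illustration: the toy data and the function `simple_features` are made up, and this is not how tsfresh itself computes its features.

```python
import numpy as np
import pandas as pd

# Made-up long-format data: id 0 is a slow sine wave, id 1 a faster one.
t = np.arange(32)
toy_df = pd.DataFrame({
    0: np.concatenate([np.sin(2 * np.pi * 1 * t / 32),
                       np.sin(2 * np.pi * 4 * t / 32)]),
    1: np.repeat([0, 1], 32),
})

def simple_features(values):
    """Hand-rolled summary of one series (illustration only)."""
    values = np.asarray(values, dtype=float)
    spectrum = np.abs(np.fft.rfft(values))  # magnitude per frequency bin
    return pd.Series({
        "mean": values.mean(),
        "median": np.median(values),
        "std": values.std(),
        # strongest frequency bin, skipping the constant (DC) bin 0
        "dominant_freq_bin": int(spectrum[1:].argmax()) + 1,
    })

# One row of features per series id.
features = pd.DataFrame(
    {sid: simple_features(group) for sid, group in toy_df.groupby(1)[0]}
).T
print(features)
```

Both toy series have practically the same mean and standard deviation, and only the frequency-based column tells them apart, which hints at why order-aware features can help.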

Accuracy verification

Let's try training and prediction using the extracted features. First, prepare the labels (the ground-truth classes).

y = load_har_classes()[:N]
y.hist(bins=12)

[Figure: histogram of the class labels in y]

The classes do not seem to be extremely imbalanced. Split the data into training data and test data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

Let's train and predict with xgboost.

cl = xgb.XGBClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        24
           2       1.00      1.00      1.00        11
           3       1.00      1.00      1.00        12
           4       0.62      0.53      0.57        15
           5       0.70      0.74      0.72        19
           6       0.60      0.63      0.62        19

    accuracy                           0.81       100
   macro avg       0.82      0.82      0.82       100
weighted avg       0.81      0.81      0.81       100

The accuracy is about 81%. Let's also look at the feature importances.

importances = pd.Series(index=X_train.columns, data=cl.feature_importances_)
importances.sort_values(ascending=False).head(10)
variable
0__spkt_welch_density__coeff_8                                   0.054569
0__time_reversal_asymmetry_statistic__lag_3                      0.041737
0__agg_linear_trend__f_agg_"max"__chunk_len_5__attr_"stderr"     0.036145
0__standard_deviation                                            0.035886
0__change_quantiles__f_agg_"var"__isabs_False__qh_0.4__ql_0.0    0.028676
0__spkt_welch_density__coeff_2                                   0.027741
0__augmented_dickey_fuller__attr_"pvalue"                        0.019172
0__autocorrelation__lag_2                                        0.018580
0__linear_trend__attr_"stderr"                                   0.018235
0__cid_ce__normalize_True                                        0.018181
dtype: float32

The variable names seem to tell you what kind of feature each one is. For comparison, let's train on the original dataset.
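The names in the list above are joined with double underscores: the source column, then the feature calculator, then any parameters. A minimal sketch for taking a name apart (the helper `split_feature_name` is made up for this post, and it assumes the source column name itself contains no double underscore):

```python
def split_feature_name(name):
    """Split a tsfresh-style column name on the "__" separator."""
    source_column, calculator, *params = name.split("__")
    return {"column": source_column, "calculator": calculator, "params": params}

for name in ['0__standard_deviation',
             '0__autocorrelation__lag_2',
             '0__agg_linear_trend__f_agg_"max"__chunk_len_5__attr_"stderr"']:
    print(split_feature_name(name))
```

For example, `0__autocorrelation__lag_2` reads as the autocorrelation of column 0 with a lag of 2.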

X_1 = df.iloc[:N, :]  # first N rows; .ix was removed from pandas
X_1.shape
(500, 128)
X_train, X_test, y_train, y_test = train_test_split(X_1, y, test_size=.2)
cl = xgb.XGBClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))
              precision    recall  f1-score   support

           1       0.79      0.83      0.81        36
           2       0.71      0.67      0.69        15
           3       0.58      0.58      0.58        12
           4       0.25      0.43      0.32         7
           5       0.67      0.53      0.59        19
           6       0.44      0.36      0.40        11

    accuracy                           0.64       100
   macro avg       0.57      0.57      0.56       100
weighted avg       0.65      0.64      0.64       100

The accuracy with the original data is about 64%, so the model trained on the features extracted with tsfresh is more accurate.
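Note that the two reports come from two separate random splits (the per-class support differs between them), so part of the gap could be due to the split itself. One way to tighten the comparison is to cut both feature tables at the same rows. The sketch below uses made-up arrays and a hypothetical helper `shared_split`, and assumes both tables share the same row order.

```python
import numpy as np

def shared_split(n_rows, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) positions reusable for several feature tables."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]

# Made-up stand-ins for the two feature tables and the labels (same row order).
n_rows = 10
features_a = np.arange(n_rows * 3).reshape(n_rows, 3)  # e.g. extracted features
features_b = np.arange(n_rows * 2).reshape(n_rows, 2)  # e.g. raw readings
labels = np.arange(n_rows) % 2

train_idx, test_idx = shared_split(n_rows)

# Both tables are cut at the same rows, so the test sets hold the same examples.
a_train, a_test = features_a[train_idx], features_a[test_idx]
b_train, b_test = features_b[train_idx], features_b[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]

print(len(train_idx), len(test_idx))  # 8 2
```

With pandas objects such as X, X_1, and y, the same positions can be applied through `.iloc[train_idx]` and `.iloc[test_idx]`.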

Summary

We were able to improve the accuracy by extracting features from time series data using tsfresh. Before the extraction, the model could not make use of the order of the time series data, so I think the improvement comes from the features capturing that meaning. I have not looked in detail at what kinds of features were extracted, so I would like to investigate them further.

The 7th day of the Ateam Lifestyle Advent Calendar 2019 will be written by @maonem. I'm looking forward to it!

The Ateam Group, which values "challenge," is looking for colleagues with a strong spirit of challenge to work with. If you are interested, please visit the Ateam Group recruitment site. https://www.a-tm.co.jp/recruit/
