[PYTHON] [Kaggle] I tried feature engineering of multidimensional time series data using tsfresh.

0. Introduction

Time series data is difficult to handle, isn't it? And the more variables you have, the more discouraging it gets. Still, I suspect many people think, "If I could just extract features from the time series data, I could handle the rest!"

This time, I will introduce **tsfresh**, a library that looks useful for feature engineering of multidimensional time series data.

I referred to the following articles.
- Library tsfresh that automatically extracts features from time series data
- Easy statistical processing of time series data with tsfresh

1. Install tsfresh

I installed it via pip. Installation failed with an older pip, so upgrade pip first:

pip install --upgrade pip

Then install tsfresh:

pip install tsfresh

Also change the pandas version, like this:

pip install pandas==0.21

Here is the version I'm using.

2. Prepare pseudo time series data

Finding multidimensional time series data was a hassle, so this time I will use a dataset that can be downloaded through tsfresh and reshape it into pseudo data. (If you already have your own data, skip this section.)

This pseudo data is enough to walk through the procedure, but the results it produces are not interesting at all, so if you have your own data, I recommend using it.

The UEA & UCR Time Series Classification Repository seems to have a lot of interesting time series data ...

First, load the data.


import pandas as pd
import numpy as np

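from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset

# Download the dataset first; load_har_dataset() reads the downloaded files.
download_har_dataset()
df = load_har_dataset()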

If you inspect it, you can see that this data has 7352 sample points and 128 variables (128 dimensions).

Next, cut out only 100 sample points and 50 variables.


df = df.iloc[0:100, 0:50]

This time, imagine the following situation: "There are five subjects, and 10 time series variables are acquired from sensors attached to each subject's body in order to classify whether the subject is a child or an adult."

The goal here is only to walk through the flow of feature engineering. The assignment of values to subjects is arbitrary, so of course this data cannot actually be classified.


# 5 ids, 10 variables each

# Assign a block of 10 variables to each individual (subject).
df_s1 = df.iloc[:,0:10].copy()
df_s2 = df.iloc[:,10:20].copy()
df_s3 = df.iloc[:,20:30].copy()
df_s4 = df.iloc[:,30:40].copy()
df_s5 = df.iloc[:,40:50].copy()

# Add a column holding each individual's id.
df_s1['id'] = 'sub1'
df_s2['id'] = 'sub2'
df_s3['id'] = 'sub3'
df_s4['id'] = 'sub4'
df_s5['id'] = 'sub5'

# Give every data frame the same column names.
columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'id']
df_s1.columns = columns
df_s2.columns = columns
df_s3.columns = columns
df_s4.columns = columns
df_s5.columns = columns



There are 5 data frames like this.
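The five blocks above repeat the same three steps, so they can also be written as a loop. The following is a minimal sketch, assuming a random stand-in array in place of the HAR data so that it runs without a download; the names `df_demo` and `frames` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the 100 x 50 slice of the HAR data (random values, an assumption).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(100, 50)))

n_subjects, n_vars = 5, 10
columns = [f"x{i}" for i in range(1, n_vars + 1)]

frames = []
for s in range(n_subjects):
    # Take the s-th block of 10 columns and give it the common variable names.
    block = df_demo.iloc[:, s * n_vars:(s + 1) * n_vars].copy()
    block.columns = columns
    # Tag every row of the block with the subject id.
    block["id"] = f"sub{s + 1}"
    frames.append(block)

print(len(frames), frames[0].shape)
```

The list `frames` plays the role of `df_s1` ... `df_s5`, which keeps the code short if the number of subjects changes.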

3. Transform time series data into a corresponding data frame

According to the official documentation, **extract_features()**, the main function in this article, expects its input in a specified format. The data type is a pandas DataFrame, but three layouts are accepted.

- Flat DataFrame
- Stacked DataFrame (data frames stacked vertically)
- Dictionary of flat DataFrames

This time, I will shape the data into the first format, the flat DataFrame, which looks like this:

id	time	x	y
A	t1	x(A, t1)	y(A, t1)
A	t2	x(A, t2)	y(A, t2)
A	t3	x(A, t3)	y(A, t3)
B	t1	x(B, t1)	y(B, t1)
B	t2	x(B, t2)	y(B, t2)
B	t3	x(B, t3)	y(B, t3)
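To make the three layouts concrete, here is a small pandas-only sketch. The values are made up, and the column names `kind` and `value` are assumptions chosen for illustration (in tsfresh, such columns are identified through the `column_kind` and `column_value` arguments).

```python
import pandas as pd

# Toy flat DataFrame matching the table: one row per (id, time), one column per variable.
flat = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B"],
    "time": [1, 2, 3, 1, 2, 3],
    "x":    [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "y":    [5.0, 5.1, 5.2, 6.0, 6.1, 6.2],
})

# Stacked layout: the variables are stacked vertically as (kind, value) pairs.
stacked = flat.melt(id_vars=["id", "time"], var_name="kind", value_name="value")

# Dictionary of flat DataFrames: one frame per variable.
as_dict = {
    kind: group[["id", "time", "value"]].reset_index(drop=True)
    for kind, group in stacked.groupby("kind")
}

print(stacked.shape)
print(sorted(as_dict))
```

All three hold the same information; the flat layout is the most convenient when every variable is sampled at the same time points, as in this article.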

Continuing from earlier, concatenate the five data frames vertically:


df = pd.concat([df_s1, df_s2, df_s3, df_s4, df_s5], axis=0)

Concatenating them this way gives a single data frame with all subjects stacked vertically. The number of unique ids is 5, so they are concatenated without any problems.
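The concatenated frame has no time column, unlike the format table, so the row order within each id is what defines the time order. If an explicit time step is wanted, it could be added as in the sketch below; it uses two tiny random stand-in subjects, and the column name `time` is an assumption. Such a column can then be named through the `column_sort` argument of extract_features().

```python
import numpy as np
import pandas as pd

# Two tiny stand-in subjects (random values), concatenated like df_s1 ... df_s5.
rng = np.random.default_rng(0)
parts = []
for sid in ["sub1", "sub2"]:
    part = pd.DataFrame(rng.normal(size=(4, 2)), columns=["x1", "x2"])
    part["id"] = sid
    parts.append(part)
df_demo = pd.concat(parts, axis=0)

# After concat the row index repeats (0..3, 0..3), so rebuild it.
df_demo = df_demo.reset_index(drop=True)

# Number the rows within each id to get an explicit time step.
df_demo["time"] = df_demo.groupby("id").cumcount()

print(df_demo[["id", "time"]].values.tolist())
```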

You can now pass it to the extract_features() function.

4. Extract features with extract_features()

Apply it to the data frame from the previous section:


from tsfresh import extract_features

df_features = extract_features(df, column_id='id')

Applying this calculates the features. There are 754 features for each variable. That is quite a lot, but I think it is worth it if it removes the difficulty of handling time series data.

What each feature means is described in the documentation's Overview on extracted features (https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html). Features (statistics) that take parameters appear to be calculated for several parameter settings.
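To get a feel for the shape of the result, here is a hand-rolled sketch of the same idea using only pandas: each id is reduced to one row, and each (variable, statistic) pair becomes one column. This is not how tsfresh is implemented. The data is made up, and only the `variable__feature` naming pattern mirrors what tsfresh produces.

```python
import numpy as np
import pandas as pd

# Tiny flat DataFrame: two ids, two variables (values are made up).
df_demo = pd.DataFrame({
    "id": ["sub1"] * 4 + ["sub2"] * 4,
    "x1": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0],
    "x2": [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 7.0, 8.0],
})

def abs_energy(s):
    """Sum of squared values of the series."""
    return float(np.sum(s.to_numpy() ** 2))

# One row per id; each (variable, statistic) pair becomes one column.
features = df_demo.groupby("id").agg(["mean", "std", "max", abs_energy])
features.columns = [f"{var}__{stat}" for var, stat in features.columns]

print(features.shape)
print(features.loc["sub1", "x1__abs_energy"])
```

With 4 statistics and 2 variables this gives 8 columns; tsfresh does the same kind of reduction with a far larger set of statistics.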

5. Filter features with select_features()

The documentation recommends [feature filtering](https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html) after extracting the features as described above. The function for this is **select_features()**. It uses statistical hypothesis tests to keep only the features that are likely to have a statistically significant relationship with the target.

Before that, run impute() on the extracted features:


from tsfresh.utilities.dataframe_functions import impute

df_features = impute(df_features)

This fills in unmanageable values such as NaN and infinity in the features obtained earlier.
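As I understand the documentation, impute() works column by column: NaN becomes the column median, +inf the column maximum, and -inf the column minimum, and a column with no finite values is filled with zeros. The function below is an approximate pandas re-implementation of that rule for illustration only; `impute_sketch` is a hypothetical name, not part of tsfresh.

```python
import numpy as np
import pandas as pd

def impute_sketch(df):
    """Approximate re-implementation: NaN -> median, +inf -> max, -inf -> min,
    each computed per column from the finite values only."""
    out = df.copy()
    for col in out.columns:
        values = out[col].to_numpy(dtype=float)
        finite = values[np.isfinite(values)]
        if finite.size == 0:
            # No usable value in this column: fall back to zeros.
            out[col] = 0.0
            continue
        values = np.where(np.isnan(values), np.median(finite), values)
        values = np.where(values == np.inf, finite.max(), values)
        values = np.where(values == -np.inf, finite.min(), values)
        out[col] = values
    return out

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [-np.inf, 2.0, 4.0, 6.0],
})
print(impute_sketch(demo))
```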

In addition, select_features() narrows down the features based on the dependent variable y, so prepare pseudo labels for it. As you can see in the source code (https://tsfresh.readthedocs.io/en/latest/_modules/tsfresh/feature_selection/selection.html), this y must be a pandas.Series or a numpy.array. This time,


from tsfresh import select_features

X = df_features
y = [0, 0, 0, 1, 1]
y = np.array(y)

I prepared X, the data frame of extracted features, and y, a numpy array of arbitrarily assigned labels.



X_selected = select_features(X, y)

When you run this, spectacularly nothing comes out. That is only natural, because the labels were assigned arbitrarily ... With proper data, feature selection should work well. (If you tried feature selection on your own data, whether it worked or not, I would be overjoyed if you left a comment.)
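Part of the reason nothing survives may be the sample size rather than the data alone. The sketch below assumes that a rank-based test such as the Mann-Whitney U test is applied to each feature. Which test select_features() actually uses depends on its settings and on the feature and target types, so this is an illustration, not a description of its internals. With 3 samples in one class and 2 in the other there are only 10 possible rank arrangements, so even a hypothetical, perfectly separating feature cannot reach a two-sided p-value below 2/10.

```python
from scipy.stats import mannwhitneyu

# Labels as in the article: three samples of class 0 and two of class 1.
y = [0, 0, 0, 1, 1]

# Hypothetical feature that separates the two classes perfectly.
feature = [1.0, 2.0, 3.0, 10.0, 11.0]

group0 = [f for f, label in zip(feature, y) if label == 0]
group1 = [f for f, label in zip(feature, y) if label == 1]

# Exact two-sided Mann-Whitney U test on the best possible case.
result = mannwhitneyu(group0, group1, alternative="two-sided", method="exact")
p_value = result.pvalue
print(p_value)
```

Since 0.2 is well above common significance levels such as 0.05, no feature could pass such a test with five samples, whatever the data looks like.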

Bonus. I tried to compress the features by principal component analysis (PCA).

Since ending here would be anticlimactic, I tried dimensionality reduction with PCA instead of feature selection by statistical hypothesis testing. The idea is to create an instance called pca, fit it, and then transform the data with that instance.


from sklearn.decomposition import PCA
# Create the PCA instance, fit it to the features, then project them.
pca = PCA(n_components=4)
pca.fit(df_features)
X_PCA = pca.transform(df_features)

Convert the result to a data frame to display it:


X_PCA = pd.DataFrame( X_PCA )


The result has one row per subject and one column per principal component.

Furthermore, referring to "Python: Principal component analysis (PCA) with scikit-learn", I computed the contribution rate (explained variance ratio) and the cumulative contribution rate:


print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))


Contribution rate of each dimension: [0.30121012 0.28833114 0.22187195 0.1885868 ]
Cumulative contribution rate: 0.9999999999999999

So it may be possible to compress the features with PCA instead of selecting them.
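One caveat when reading the cumulative contribution rate: with only 5 samples, the centered data has at most 4 independent directions, so 4 components explain all of the variance regardless of the features. The sketch below shows this on random stand-in data (an assumption; `X_demo` is a hypothetical name). It also standardizes the features first, which is my own addition and not part of the article's code, so that features on large scales do not dominate.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for df_features: 5 subjects x 20 features (random values, an assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(5, 20)),
                      index=[f"sub{i}" for i in range(1, 6)])

# Put the features on a common scale so that no single feature dominates.
X_scaled = StandardScaler().fit_transform(X_demo)

# fit_transform() fits the components and projects the data in one call.
pca = PCA(n_components=4)
X_pca = pd.DataFrame(pca.fit_transform(X_scaled), index=X_demo.index)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_pca.shape)
print(cumulative[-1])
```

A cumulative rate close to 1 here therefore says more about the number of samples than about the quality of the compression; with more subjects, the rate becomes informative.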

Although it is a brute-force approach, this workflow seems to let you escape the hassles particular to time series.

From there, you can do various things, such as building a baseline model as in "[Kaggle] Baseline model construction, Pipeline processing".

Next time, I will try it with real multidimensional time series data.
