0. Introduction

Time series data is hard to handle, isn't it? And the more variables there are, the more likely you are to lose heart. Still, I suspect many people think, "Once the features are extracted from the time series data, I can handle the rest!"

This article introduces **tsfresh**, a library that looks useful for feature engineering of multidimensional time series data.

I referred to the following articles:

- Library tsfresh that automatically extracts features from time series data
- Easy statistical processing of time series data with tsfresh

1. Install tsfresh

I installed it with pip. The installation failed with an old pip, so upgrade pip first:

pip install --upgrade pip

Then install tsfresh:

pip install tsfresh

If you get a warning at this point that your pandas version is not supported, change the pandas version, for example:

pip install pandas==0.21

Here is the version I'm using.

pip: 20.0.2
pandas: 0.21.0
tsfresh: 0.14.1
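To compare your environment with the versions listed above, a small check like the following may help. It is only a sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_versions is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["pip", "pandas", "tsfresh"])
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```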

2. Prepare pseudo time series data

Finding multidimensional time series data was a hassle, so this time we take a dataset that can be downloaded through tsfresh and reshape it into pseudo data. (If you already have your own data, skip this section.)

The pseudo data is enough to grasp the procedure, but the results it produces are not interesting at all, so if you have your own data, I recommend using it.

The UEA & UCR Time Series Classification Repository also seems to have a lot of interesting time series data ...

First, load the data.

`In[1]`



import pandas as pd
import numpy as np

from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset

download_har_dataset()
df = load_har_dataset()
print(df.shape)
df.head()

The output shows that the data frame has 7352 rows (sample points) and 128 columns (variables, i.e. 128 dimensions).

Next, cut out only the first 100 sample points and the first 50 variables.

`In[2]`


df = df.iloc[0:100, 0:50]
print(df.shape)
df.head()

For this walkthrough, imagine the following situation: "There are five subjects, 10 variables of time series data are acquired from sensors attached to each subject's body, and the goal is to classify whether each subject is a child or an adult."

The purpose here is only to walk through the flow of feature engineering. The values do not actually correspond to this scenario, so of course no meaningful classification is possible with this data.

`In[3]`


# 5 ids (subjects), 10 variables each

# Assign a block of 10 variables to each individual (subject).
df_s1 = df.iloc[:,0:10].copy()
df_s2 = df.iloc[:,10:20].copy()
df_s3 = df.iloc[:,20:30].copy()
df_s4 = df.iloc[:,30:40].copy()
df_s5 = df.iloc[:,40:50].copy()

#Create a column whose value is each individual id.
df_s1['id'] = 'sub1'
df_s2['id'] = 'sub2'
df_s3['id'] = 'sub3'
df_s4['id'] = 'sub4'
df_s5['id'] = 'sub5'

#Rewrite the variable name of each column.
columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'id']
df_s1.columns = columns
df_s2.columns = columns
df_s3.columns = columns
df_s4.columns = columns
df_s5.columns = columns

df_s1.head()

This gives five data frames of this form.

A common mistake: if you do not use .copy() when slicing out part of a data frame, the slice may still be tied to the original, so later assignments can change the original values or trigger pandas' SettingWithCopyWarning. The article "Differences in behavior when copying with = in pandas and when using copy()" is a helpful reference.
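The safe pattern can be checked on a toy data frame. This is a minimal sketch with made-up values; it only shows that changes made to an explicit copy stay in the copy.

```python
import pandas as pd

original = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Take an explicit, independent copy of the first column block.
part = original.iloc[:, 0:1].copy()
part["id"] = "sub1"        # the new column exists only in the copy
part.iloc[0, 0] = -999.0   # overwriting a value does not touch the original

print(original.columns.tolist())  # the original still has only 'a' and 'b'
print(original.iloc[0, 0])        # the original value is unchanged
```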

3. Transform time series data into a corresponding data frame

According to the official documentation, **extract_features()**, the main function in this article, expects its input in a specific format. The data is passed as pandas DataFrame objects, and three formats are accepted:

- Flat DataFrame
- Stacked DataFrame (all values stacked vertically in one column)
- Dictionary of flat DataFrames

This time, I will use the first format, the flat DataFrame:

id	time	x	y
A	t1	x(A, t1)	y(A, t1)
A	t2	x(A, t2)	y(A, t2)
A	t3	x(A, t3)	y(A, t3)
B	t1	x(B, t1)	y(B, t1)
B	t2	x(B, t2)	y(B, t2)
B	t3	x(B, t3)	y(B, t3)
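To make the difference between the first two formats concrete, here is a small sketch that builds the flat table above with made-up numbers and reshapes it into a stacked form with pandas' melt. The column names kind and value are my own choice for this illustration; as far as I know, extract_features() is told which column plays which role through its column_id, column_sort, column_kind and column_value arguments.

```python
import pandas as pd

# Flat format: one row per (id, time), one column per variable.
flat = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B"],
    "time": [1, 2, 3, 1, 2, 3],
    "x":    [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "y":    [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Stacked format: one row per (id, time, variable), all values in one column.
stacked = flat.melt(id_vars=["id", "time"], var_name="kind", value_name="value")

print(flat.shape)
print(stacked.shape)
print(stacked.head())
```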

Continuing from the data frames created above, concatenate them vertically:

`In[4]`


df = pd.concat([df_s1, df_s2, df_s3, df_s4, df_s5], axis=0)
print(df['id'].nunique())
df.head()

The result is one long data frame in which the rows of the five subjects are stacked on top of each other. The number of unique ids is 5, so the concatenation worked as intended.

This data frame can now be passed to the extract_features() function.
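The slicing, labelling and concatenation steps can also be written as a loop. The sketch below uses random numbers as a stand-in for the 100 x 50 slice of the downloaded data, and it adds an explicit time column, which the original cells do not have. If you add such a column, it should be passed to extract_features() as column_sort='time' so that it is used for ordering rather than treated as one more variable.

```python
import numpy as np
import pandas as pd

# Stand-in for the 100 x 50 slice of the dataset (random values, an assumption).
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(100, 50)))

n_subjects, n_vars = 5, 10
value_columns = [f"x{i + 1}" for i in range(n_vars)]

frames = []
for s in range(n_subjects):
    # Each subject gets its own block of 10 columns, copied to avoid aliasing.
    block = wide.iloc[:, s * n_vars:(s + 1) * n_vars].copy()
    block.columns = value_columns
    block["id"] = f"sub{s + 1}"
    block["time"] = np.arange(len(block))  # explicit time index per subject
    frames.append(block)

flat = pd.concat(frames, axis=0, ignore_index=True)
print(flat.shape)
print(flat["id"].nunique())
```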

4. Extract features with extract_features ()

Apply extract_features() to the data frame from the previous step:

`In[5]`


from tsfresh import extract_features

df_features = extract_features(df, column_id='id')
df_features.head()

Running this computes the features, one row per id. There are 754 features per variable, which makes 7540 columns for the 10 variables. That is quite a lot, but it is worthwhile if it takes away the difficulty of handling raw time series data.

What each feature means is described in the documentation's [Overview on extracted features](https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html). Features (statistics) that require parameters appear to be calculated for several parameter settings.
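To picture what "one row per id, many columns per variable" means, the following sketch computes three of the simplest statistics by hand with a pandas groupby. It is a hand-rolled stand-in on made-up data, not tsfresh's implementation; the column names merely imitate the variable__feature naming style of the extracted features.

```python
import pandas as pd

flat = pd.DataFrame({
    "id": ["sub1"] * 3 + ["sub2"] * 3,
    "x1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "x2": [0.0, 0.0, 1.0, 5.0, 5.0, 5.0],
})

# One row per id, one column per (variable, statistic) pair.
stats = flat.groupby("id")[["x1", "x2"]].agg(["mean", "max", "min"])
stats.columns = [f"{var}__{stat}" for var, stat in stats.columns]

print(stats.shape)
print(stats)
```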

5. Filter features with select_features ()

After extracting the features as described above, the documentation recommends [filtering the features](https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html). The function for this is **select_features()**. It uses statistical hypothesis tests to keep only those features that are likely to have a statistically significant relationship with the target.

Before that, run:

`In[6]`


from tsfresh.utilities.dataframe_functions import impute

df_features = impute(df_features)

This fills in values that cannot be handled, such as NaN and infinity, in the features obtained earlier.
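As I understand the documentation, impute() works column by column: NaN is replaced by the column median, +inf by the column maximum and -inf by the column minimum, and a column without any finite value is filled with zeros. The sketch below imitates that rule in plain pandas so the effect is visible on a toy frame; impute_sketch is a hypothetical helper, not the library function.

```python
import numpy as np
import pandas as pd


def impute_sketch(df):
    """Column-wise: NaN -> median, +inf -> max, -inf -> min of the finite values."""
    out = df.copy()
    for col in out.columns:
        values = out[col].astype(float)
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0  # nothing to derive a replacement from
            continue
        values = values.replace(np.inf, finite.max())
        values = values.replace(-np.inf, finite.min())
        out[col] = values.fillna(finite.median())
    return out


raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [-np.inf, 2.0, 4.0, 6.0],
})
cleaned = impute_sketch(raw)
print(cleaned)
```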

In addition, select_features() narrows down the features based on the dependent variable y, so we prepare y as dummy data. As the [source code](https://tsfresh.readthedocs.io/en/latest/_modules/tsfresh/feature_selection/selection.html) shows, y must be a pandas.Series or a numpy.array. This time:

`In[7]`


from tsfresh import select_features

X = df_features
y = [0, 0, 0, 1, 1]
y = np.array(y)

This prepares X, the data frame of extracted features, and y, a numpy array of arbitrarily assigned labels.

Then run:

`In[8]`


X_selected = select_features(X, y)
print(X_selected.shape)
X_selected

The result is an empty data frame: not a single feature is selected. That is to be expected, because the labels were assigned arbitrarily. With proper data, feature selection should work well. (If you have managed, or failed, to select features on your own data, I would be delighted to hear about it in the comments.)
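The arbitrary labels are not the only reason for the empty result: with only five samples, a hypothesis test can hardly reach significance at all. The sketch below assumes a rank-based two-sample test (such as the Mann-Whitney U test, which I believe is what tsfresh uses by default for a binary target and real-valued features) and enumerates every possible split of the ranks to find the smallest exact two-sided p-value. The function name is made up for this illustration, and the multiple-testing correction applied on top of the individual tests only makes selection harder.

```python
from itertools import combinations


def min_two_sided_p(n_group0, n_group1):
    """Smallest exact two-sided p-value of a rank-sum test without ties."""
    n = n_group0 + n_group1
    ranks = range(1, n + 1)
    # Rank sum of group 1 for every possible assignment of ranks to that group.
    sums = [sum(c) for c in combinations(ranks, n_group1)]
    center = n_group1 * (n + 1) / 2
    most_extreme = max(abs(s - center) for s in sums)
    # Count the assignments that are as extreme as the most extreme one.
    count = sum(1 for s in sums if abs(s - center) >= most_extreme)
    return count / len(sums)


p_small = min_two_sided_p(3, 2)     # y = [0, 0, 0, 1, 1]
p_larger = min_two_sided_p(10, 10)  # a hypothetical larger study
print(p_small, p_larger)
```

With three samples in one class and two in the other, even a perfectly separating feature cannot get below p = 0.2, so nothing can pass a 0.05 threshold.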

Bonus. I tried to compress the features by principal component analysis (PCA).

Ending there would be unsatisfying, so instead of feature selection by statistical hypothesis testing, I tried dimensionality reduction with PCA. The flow is to create an instance called pca, fit it, and then transform the data with that instance.

n_components must be less than or equal to the smaller of the number of samples (5 this time) and the number of features (7540 this time).

`In[9]`


from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca.fit(df_features)
X_PCA = pca.transform(df_features)

Convert the result into a data frame and display it:

`In[10]`


X_PCA = pd.DataFrame( X_PCA )
X_PCA.head()

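This shows the four principal component scores for each of the five subjects.

Furthermore, following the article "Python: Principal component analysis (PCA) with scikit-learn", the contribution rate (explained variance ratio) and the cumulative contribution rate can be computed as follows: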

`In[11]`


print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))

`out[11]`


Contribution rate of each dimension: [0.30121012 0.28833114 0.22187195 0.1885868 ]
Cumulative contribution rate: 0.9999999999999999

So it may be possible to compress the features with PCA instead of selecting them.
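One caveat when reading the cumulative contribution rate of about 1.0 printed above: with five samples, the centered data matrix has rank at most four, so four components always capture all of the variance, whatever the data looks like. The sketch below checks this on random numbers (an assumption, not the article's features), using numpy's SVD as a stand-in for scikit-learn's PCA, which gives the same variance ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))           # 5 samples, 50 features of pure noise

Xc = X - X.mean(axis=0)                # PCA centers each feature first
singular_values = np.linalg.svd(Xc, compute_uv=False)
variance = singular_values ** 2
ratio = variance / variance.sum()      # contribution rate per component

print(ratio)
print(ratio[:4].sum())
```

Even for pure noise the first four ratios sum to 1 up to rounding, so a cumulative rate near 1.0 here says little about how well the features were compressed.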

It is a brute-force approach, but with this series of steps it looks possible to escape the hassle that is particular to time series data.

From there you can go on to things such as building a baseline model, as in the article "[Kaggle] Baseline model construction, Pipeline processing".

Next, I will try this with actual multidimensional time series data.

[PYTHON] [Kaggle] I tried feature engineering of multidimensional time series data using tsfresh.
