[Python] What not to do in time series data analysis (including some reflection)

**(Addition 1) It was pointed out that my usage of the terms "in-sample" and "out-of-sample" is inappropriate. I have addressed this point in the [second half of this article](http://qiita.com/TomokIshii/items/ac7bde63f2c0e0de47b3).**
**(Addition 2) I have added a further note on the definition of "in-sample", also in the [second half of this article](http://qiita.com/TomokIshii/items/ac7bde63f2c0e0de47b3).**

After posting an article to Qiita, even when I notice an error in the content, I tend to begrudge the time and leave it as it is. That is fine for a trivial mistake, but if it is a theoretical error or misunderstanding, it means I am broadcasting the wrong message, and I have to reflect on that. (Deleting the article would be the quick fix, but if it has been "liked", deleting it would be rude to those readers...)

By the way, in the previous post, Comparison of regression models -- ARMA vs. Random Forest Regression -- Qiita, I introduced how to run a Random Forest regression on time series data using lag (delayed) values as features.

This time the data is again a univariate time series, and I tried a model that estimates the current value from several past values. Specifically, the model is Volume_current ~ Volume_Lag1 + Volume_Lag2 + Volume_Lag3. I wanted to keep the same training/test split as last time (70, 30), but since NaN values appear when the lag values are computed, I dropped the leading rows with dropna() and the data lengths became (67, 30).
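The df_nile DataFrame used below is not constructed in the original text. As a minimal sketch (assuming the Nile river flow dataset shipped with statsmodels, which has a 'volume' column; the original may have loaded the data differently), it could be prepared like this:

```python
# Sketch only: build df_nile from the Nile flow dataset in statsmodels (assumption).
import pandas as pd
import statsmodels.api as sm

nile = sm.datasets.nile.load_pandas().data          # columns: 'year', 'volume'
df_nile = pd.DataFrame({'volume': nile['volume'].values},
                       index=nile['year'].astype(int).values)
print(df_nile.shape)   # 100 annual observations
```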

First, preprocess the data.

# create lagged copies of the flow volume to use as explanatory variables
df_nile['lag1'] = df_nile['volume'].shift(1)
df_nile['lag2'] = df_nile['volume'].shift(2)
df_nile['lag3'] = df_nile['volume'].shift(3)

# the first 3 rows now contain NaN, so drop them
df_nile = df_nile.dropna()

# split into training data (first 67 rows) and test data (remaining 30 rows)
X_train = df_nile[['lag1', 'lag2', 'lag3']][:67].values
X_test = df_nile[['lag1', 'lag2', 'lag3']][67:].values

y_train = df_nile['volume'][:67].values
y_test = df_nile['volume'][67:].values

Use Random Forest Regressor from Scikit-learn.

from sklearn.ensemble import RandomForestRegressor

# fit a Random Forest regressor on the lag features
r_forest = RandomForestRegressor(
            n_estimators=100,
            criterion='mse',    # renamed to 'squared_error' in newer scikit-learn versions
            random_state=1,
            n_jobs=-1
)
r_forest.fit(X_train, y_train)
y_train_pred = r_forest.predict(X_train)
y_test_pred = r_forest.predict(X_test)
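To quantify the apparent fit under this procedure, one might compute error metrics, for example as follows (this snippet is my illustration, not part of the original article):

```python
# Illustration only: apparent accuracy of the naive train/test evaluation.
from sklearn.metrics import mean_squared_error, r2_score

print('train MSE: %.1f, R^2: %.3f' % (
    mean_squared_error(y_train, y_train_pred), r2_score(y_train, y_train_pred)))
print('test  MSE: %.1f, R^2: %.3f' % (
    mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred)))
```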

Following the typical machine learning procedure, the data is split into training data and test data, the model is fitted on the training data, and the model's predictions are then checked against the test data labels. So what is wrong here? The flow of the code above is explained with a figure.

** Fig.1 The incorrect way to predict and validate time series data **

TSA_procedure1.png

Three lag series are generated from the original time series with pandas.DataFrame.shift. The data is then split into training data (in-sample) and test data (out-of-sample) at a chosen point in time. At first glance this looks fine, but with this procedure the test-period portion of the original series (the pink part) gets incorporated into the lag series. Such processing is possible when validating a model after the fact, but in an actual forecast the data after the split point (out-of-sample) is future information that does not yet exist, so this processing cannot be carried out.
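The leak can be seen directly in the arrays built above: the lag-1 feature of the test rows is just the observed test-period series shifted by one step, i.e. information a real forecaster would not have. A small check (my illustration, using the variables defined earlier):

```python
# The lag-1 column of X_test reproduces the observed test values shifted by one step.
import numpy as np

print(np.allclose(X_test[1:, 0], y_test[:-1]))   # expected: True
```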

The correct way is to fill in the lag values needed for each prediction one step at a time, moving forward through the forecast period.

** Fig.2 The correct way to predict and validate time series data **

TSA-procedure2.png

In the figure above, the pink information must be excluded; it is only used as the answer during validation. At time (xn-3), all of the required lag values can still be taken from the blue in-sample data. At the next time (xn-2), the Lag-3 value is missing, so it is filled in with the value computed by the prediction model. At the following time (xn-1), the missing Lag-2 and Lag-3 values are filled in the same way. This is what the code below does.

import numpy as np

def step_forward(rf_fitted, X, in_out_sep=67, pred_len=33):
    # predict step by step, feeding each prediction back in as the newest lag value
    nlags = 3
    idx = in_out_sep - nlags - 1

    # last feature row whose lag values come only from in-sample observations
    lags123 = np.asarray([X[idx, 0],
                          X[idx, 1],
                          X[idx, 2]], dtype=float)

    x_pred_hist = []
    for i in range(nlags + pred_len):
        x_pred = rf_fitted.predict(lags123.reshape(1, -1))[0]
        if i > nlags:
            x_pred_hist.append(x_pred)
        # shift the lag window forward and insert the new prediction
        lags123[0] = lags123[1]
        lags123[1] = lags123[2]
        lags123[2] = x_pred

    x_pred_np = np.asarray(x_pred_hist).squeeze()

    return x_pred_np

In the code above, rf_fitted is a Random Forest model whose regression fit (training) has already been completed. Using it, the future values corresponding to the out-of-sample period are computed step by step.
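For example, the step-by-step forecast could be obtained like this (a usage sketch; X_all and y_test_pred_stepwise are my own names, and the arguments simply mirror the function defaults):

```python
# Recursive out-of-sample forecast, using only information available up to the split point.
X_all = df_nile[['lag1', 'lag2', 'lag3']].values        # all 97 feature rows
y_test_pred_stepwise = step_forward(r_forest, X_all, in_out_sep=67, pred_len=33)
print(y_test_pred_stepwise.shape)
```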

Now let us check how the predicted values differ between the wrong way and the right way.

** Fig.3 Prediction by the incorrect method ** tsa_nile1.png

Focusing on the out-of-sample part drawn in cyan, in the figure above the high-frequency component of the input values in the prediction interval (dotted line), which should not have been available, appears to be reflected in the predicted values (solid cyan line) as well. Consequently, the degree of overlap in this interval looks high.

** Fig.4 Prediction by the correct method ** tsa_nile2.png

In this figure, the high-frequency component of the predicted values in the forecast interval is much smaller, and in the second half of the interval the predictions seem to drift slightly away from the actual values (dotted line). On the other hand, the broad "gradual decrease" trend of the in-sample interval can still be seen in the out-of-sample predictions. (That is my personal impression, at least...)
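The comparison in Fig.3 and Fig.4 could be reproduced with a plot along these lines (a sketch only; the original plotting code is not shown, and y_test_pred_stepwise is the hypothetical variable from the usage sketch above):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df_nile.index, df_nile['volume'], 'k:', label='observed')
ax.plot(df_nile.index[:67], y_train_pred, 'b-', label='in-sample fit')
ax.plot(df_nile.index[67:], y_test_pred, 'c-', label='naive (leaky) prediction')
n = min(len(y_test), len(y_test_pred_stepwise))
ax.plot(df_nile.index[67:67 + n], y_test_pred_stepwise[:n], 'm-', label='step-wise prediction')
ax.legend(loc='best')
plt.show()
```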

Please be aware that pitfalls like this exist in time series data analysis. (For people who handle time series data regularly, this may be the most basic of basics.) I feel I should pick up some time series data from machine learning competition sites such as Kaggle and study the practical side a little more.

I also hope that people who "like" or "stock" my posted articles will keep the reliability of the articles in mind. (This is a heads-up that they may contain mistakes. I would be even happier if you pointed out any errors in the comments.)

References / Web sites

- Comparison of regression models -- ARMA vs. Random Forest Regression -- Qiita (previous article)
- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning, 2nd edition" (distributed free of charge via Stanford University)
- Cross Validated (StackExchange): "What is difference between 'in-sample' and 'out-of-sample' forecasts?"

(Addition 1) Definition of terms "in-sample" and "out-of-sample"

Regarding this article, I received a tweet pointing out that I use "in-sample" as a synonym for "training data", but that is not what the term means; in the case of regression it means that the values of the explanatory variables are the same as in the training data. As pointed out, my understanding of "in-sample" was vague, so please read "in-sample" as "training data" and "out-of-sample" as "test data" throughout this article. I have not fully digested the concept behind these terms at this point, so I will update the article at a later date.

(Addition 2) About the definition of the term "in-sample"

I have since looked into the meaning of "in-sample" as used in this article, so please see the following.

At the time I wrote the article above, I understood the terms "in-sample" / "out-of-sample" as a split at a certain point on the time axis: data up to that point was "in-sample" and data after it was "out-of-sample". This usage is not precise, but it does seem that the terms are often used this way.

For example, the Cross Validated (StackExchange) question "What is difference between "in-sample" and "out-of-sample" forecasts?" has the following explanation as an answer.

if you use data 1990-2013 to fit the model and then you forecast for 2011-2013, it's in-sample forecast. but if you only use 1990-2010 for fitting the model and then you forecast 2011-2013, then its out-of-sample forecast.

I searched the book "The Elements of Statistical Learning, 2nd edition" for the precise usage of the term. This book is a standard textbook on statistical modeling and machine learning, and thankfully it is distributed free of charge by Stanford University. (In my case it was not even "tsundoku" (books piling up unread); I had merely downloaded it...)

"In-sample" was explained in Chapter 7, "Model Assessment and Selection", in the explanation of generalization performance and generalization error.

Training set:

\mathcal{T} = \{ (x_1, y_1), (x_2, y_2), \dots, (x_N, y_N) \}

When new data (test data) is evaluated with the model \hat{f} obtained from this training set, the generalization error is as follows.

Err_{\mathcal{T}} = E_{X^0, Y^0} [L(Y^0,\ \hat{f}(X^0)) \mid \mathcal{T}]

There are various factors that degrade generalization accuracy, and one of them is that the data points (the values of the explanatory variables) differ between the training data and the test data. If the data points coincide, the error is the following "in-sample" error.

Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^0} [L(Y^0_i ,\ \hat{f}(x_i)) \mid \mathcal{T} ]

The fragmentary quotation may be hard to follow, but compare the two formulas above. The point to note is the argument of the model \hat{f}: the general X^0 in the first formula becomes x_i (the x values of the training data) in the second. Data of this kind is the "in-sample" data. This is illustrated in the figures below.

insample1.png

Consider the case where a model (dark line) is obtained from training data consisting of 6 points. (Of course, a sine-like curve cannot really be recovered from 6 points; please take this as an explanatory diagram.)

insample2.png

Data whose x values match those of this training data are the "in-sample" data. Here the training data were at x = 5, 10, 15, ..., so anything that does not match these x values is not "in-sample". Thus the data at x > 20 are different, and in addition the in-between points such as x = 4 and x = 7 are not "in-sample" either. This is why the Cross Validated quotation above is, strictly speaking, not accurate. However, in typical time series analysis the data points usually come at regular intervals (e.g. daily or monthly), and in that case splitting at a certain point on the time axis does separate the data into "in-sample" and everything else, so the split happens to agree with the original definition of the term.
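As a toy illustration of this definition (mine, not from the textbook): with training inputs at x = 5, 10, 15, an evaluation point counts as "in-sample" only if its x value coincides with one of them.

```python
# Toy check of the "in-sample" condition: do the query x values match training x values?
import numpy as np

x_train_vals = np.array([5, 10, 15])        # x values of the training data
x_query = np.array([4, 7, 10, 15, 21])      # candidate evaluation points

print(dict(zip(x_query.tolist(), np.isin(x_query, x_train_vals).tolist())))
# {4: False, 7: False, 10: True, 15: True, 21: False}
```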

Apart from this misunderstanding of terminology, the explanation in the main text and the time series processing itself are, as far as I can tell, correct, so please do check them.

Everyday word usage changes over time, but the lesson this time was "use technical terms correctly".
