[Python] What not to do in time series data analysis (including some reflection)

**(Addition 1) It was pointed out that my usage of the terms "in-sample" and "out-of-sample" is inappropriate. I have addressed this point in the [second half of this article](http://qiita.com/TomokIshii/items/ac7bde63f2c0e0de47b3).**
**(Addition 2) I have added a further note on the definition of "in-sample", also in the [second half of this article](http://qiita.com/TomokIshii/items/ac7bde63f2c0e0de47b3).**

After posting an article to Qiita, even when I notice an error in the content, I tend to begrudge the time and leave it as it is. That is fine for a trivial mistake, but if it is a theoretical error or misunderstanding, it means I am broadcasting the wrong message, and I have to reflect on that. (Deleting the article would be the quick fix, but if it has been "liked", deleting it would be rude to those readers...)

By the way, in the previous post, Comparison of regression models -- ARMA vs. Random Forest Regression -- Qiita, I introduced how to run a Random Forest regression on time series data using lag (delayed) values as features.

This time the data is again a univariate time series, and I tried a model that estimates the current value from several past values. Specifically, the model is Volume_current ~ Volume_Lag1 + Volume_Lag2 + Volume_Lag3. I wanted to keep the same training/test split as last time (70, 30), but since NaN values appear when the lag values are computed, I dropped the leading rows with dropna() and the data lengths became (67, 30).
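The df_nile DataFrame used below is not constructed in the original text. As a minimal sketch (assuming the Nile river flow dataset shipped with statsmodels, which has a 'volume' column; the original may have loaded the data differently), it could be prepared like this:

```python
# Sketch only: build df_nile from the Nile flow dataset in statsmodels (assumption).
import pandas as pd
import statsmodels.api as sm

nile = sm.datasets.nile.load_pandas().data          # columns: 'year', 'volume'
df_nile = pd.DataFrame({'volume': nile['volume'].values},
                       index=nile['year'].astype(int).values)
print(df_nile.shape)   # 100 annual observations
```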

First, preprocess the data.

# create lagged copies of the flow volume to use as explanatory variables
df_nile['lag1'] = df_nile['volume'].shift(1)
df_nile['lag2'] = df_nile['volume'].shift(2)
df_nile['lag3'] = df_nile['volume'].shift(3)

# the first 3 rows now contain NaN, so drop them
df_nile = df_nile.dropna()

# split into training data (first 67 rows) and test data (remaining 30 rows)
X_train = df_nile[['lag1', 'lag2', 'lag3']][:67].values
X_test = df_nile[['lag1', 'lag2', 'lag3']][67:].values

y_train = df_nile['volume'][:67].values
y_test = df_nile['volume'][67:].values

Use Random Forest Regressor from Scikit-learn.

from sklearn.ensemble import RandomForestRegressor

# fit a Random Forest regressor on the lag features
r_forest = RandomForestRegressor(
            n_estimators=100,
            criterion='mse',    # renamed to 'squared_error' in newer scikit-learn versions
            random_state=1,
            n_jobs=-1
)
r_forest.fit(X_train, y_train)
y_train_pred = r_forest.predict(X_train)
y_test_pred = r_forest.predict(X_test)
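To quantify the apparent fit under this procedure, one might compute error metrics, for example as follows (this snippet is my illustration, not part of the original article):

```python
# Illustration only: apparent accuracy of the naive train/test evaluation.
from sklearn.metrics import mean_squared_error, r2_score

print('train MSE: %.1f, R^2: %.3f' % (
    mean_squared_error(y_train, y_train_pred), r2_score(y_train, y_train_pred)))
print('test  MSE: %.1f, R^2: %.3f' % (
    mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred)))
```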

Following the typical machine learning procedure, the data is split into training data and test data, the model is fitted on the training data, and the model's predictions are then checked against the test data labels. So what is wrong here? The flow of the code above is explained with a figure.

** Fig.1 The incorrect way to predict and validate time series data **

TSA_procedure1.png

Three lag series are generated from the original time series with pandas.DataFrame.shift. The data is then split into training data (in-sample) and test data (out-of-sample) at a chosen point in time. At first glance this looks fine, but with this procedure the test-period portion of the original series (the pink part) gets incorporated into the lag series. Such processing is possible when validating a model after the fact, but in an actual forecast the data after the split point (out-of-sample) is future information that does not yet exist, so this processing cannot be carried out.
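The leak can be seen directly in the arrays built above: the lag-1 feature of the test rows is just the observed test-period series shifted by one step, i.e. information a real forecaster would not have. A small check (my illustration, using the variables defined earlier):

```python
# The lag-1 column of X_test reproduces the observed test values shifted by one step.
import numpy as np

print(np.allclose(X_test[1:, 0], y_test[:-1]))   # expected: True
```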

The correct way is to fill in the lag values needed for each prediction one step at a time, moving forward through the forecast period.

** Fig.2 The correct way to predict and validate time series data **

TSA-procedure2.png

In the figure above, the pink information must be excluded; it is only used as the answer during validation. At time (xn-3), all of the required lag values can still be taken from the blue in-sample data. At the next time (xn-2), the Lag-3 value is missing, so it is filled in with the value computed by the prediction model. At the following time (xn-1), the missing Lag-2 and Lag-3 values are filled in the same way. This is what the code below does.

import numpy as np

def step_forward(rf_fitted, X, in_out_sep=67, pred_len=33):
    # predict step by step, feeding each prediction back in as the newest lag value
    nlags = 3
    idx = in_out_sep - nlags - 1

    # last feature row whose lag values come only from in-sample observations
    lags123 = np.asarray([X[idx, 0],
                          X[idx, 1],
                          X[idx, 2]], dtype=float)

    x_pred_hist = []
    for i in range(nlags + pred_len):
        x_pred = rf_fitted.predict(lags123.reshape(1, -1))[0]
        if i > nlags:
            x_pred_hist.append(x_pred)
        # shift the lag window forward and insert the new prediction
        lags123[0] = lags123[1]
        lags123[1] = lags123[2]
        lags123[2] = x_pred

    x_pred_np = np.asarray(x_pred_hist).squeeze()

    return x_pred_np

In the code above, rf_fitted is a Random Forest model whose regression fit (training) has already been completed. Using it, the future values corresponding to the out-of-sample period are computed step by step.
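For example, the step-by-step forecast could be obtained like this (a usage sketch; X_all and y_test_pred_stepwise are my own names, and the arguments simply mirror the function defaults):

```python
# Recursive out-of-sample forecast, using only information available up to the split point.
X_all = df_nile[['lag1', 'lag2', 'lag3']].values        # all 97 feature rows
y_test_pred_stepwise = step_forward(r_forest, X_all, in_out_sep=67, pred_len=33)
print(y_test_pred_stepwise.shape)
```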

Now let us check how the predicted values differ between the wrong way and the right way.

** Fig.3 Prediction by the incorrect method ** tsa_nile1.png

Focusing on the out-of-sample part drawn in cyan, in the figure above the high-frequency component of the input values in the prediction interval (dotted line), which should not have been available, appears to be reflected in the predicted values (solid cyan line) as well. Consequently, the degree of overlap in this interval looks high.

** Fig.4 Prediction by the correct method ** tsa_nile2.png

In this figure, the high-frequency component of the predicted values in the forecast interval is much smaller, and in the second half of the interval the predictions seem to drift slightly away from the actual values (dotted line). On the other hand, the broad "gradual decrease" trend of the in-sample interval can still be seen in the out-of-sample predictions. (That is my personal impression, at least...)
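The comparison in Fig.3 and Fig.4 could be reproduced with a plot along these lines (a sketch only; the original plotting code is not shown, and y_test_pred_stepwise is the hypothetical variable from the usage sketch above):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df_nile.index, df_nile['volume'], 'k:', label='observed')
ax.plot(df_nile.index[:67], y_train_pred, 'b-', label='in-sample fit')
ax.plot(df_nile.index[67:], y_test_pred, 'c-', label='naive (leaky) prediction')
n = min(len(y_test), len(y_test_pred_stepwise))
ax.plot(df_nile.index[67:67 + n], y_test_pred_stepwise[:n], 'm-', label='step-wise prediction')
ax.legend(loc='best')
plt.show()
```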

Please be aware that pitfalls like this exist in time series data analysis. (For people who handle time series data regularly, this may be the most basic of basics.) I feel I should pick up some time series data from machine learning competition sites such as Kaggle and study the practical side a little more.

I also hope that people who "like" or "stock" my posted articles will keep the reliability of the articles in mind. (This is a heads-up that they may contain mistakes. I would be even happier if you pointed out any errors in the comments.)

References / Web sites

- Comparison of regression models -- ARMA vs. Random Forest Regression -- Qiita (previous article)
- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning, 2nd edition" (distributed free of charge via Stanford University)
- Cross Validated (StackExchange): "What is difference between 'in-sample' and 'out-of-sample' forecasts?"

(Addition 1) Definition of terms "in-sample" and "out-of-sample"

Regarding this article, I received a tweet pointing out that I use "in-sample" as a synonym for "training data", but that is not what the term means; in the case of regression it means that the values of the explanatory variables are the same as in the training data. As pointed out, my understanding of "in-sample" was vague, so please read "in-sample" as "training data" and "out-of-sample" as "test data" throughout this article. I have not fully digested the concept behind these terms at this point, so I will update the article at a later date.

(Addition 2) About the definition of the term "in-sample"

I have since looked into the meaning of "in-sample" as used in this article, so please see the following.

At the time I wrote the article above, I understood the terms "in-sample" / "out-of-sample" as a split at a certain point on the time axis: data up to that point was "in-sample" and data after it was "out-of-sample". This usage is not precise, but it does seem that the terms are often used this way.

For example, the Cross Validated (StackExchange) question "What is difference between "in-sample" and "out-of-sample" forecasts?" has the following explanation as an answer.

if you use data 1990-2013 to fit the model and then you forecast for 2011-2013, it's in-sample forecast. but if you only use 1990-2010 for fitting the model and then you forecast 2011-2013, then its out-of-sample forecast.

I searched the book "The Elements of Statistical Learning, 2nd edition" for the precise usage of the term. This book is a standard textbook on statistical modeling and machine learning, and thankfully it is distributed free of charge by Stanford University. (In my case it was not even "tsundoku" (books piling up unread); I had merely downloaded it...)

"In-sample" was explained in Chapter 7, "Model Assessment and Selection", in the explanation of generalization performance and generalization error.

Training set:

\mathcal{T} = \{ (x_1, y_1), (x_2, y_2), \dots, (x_N, y_N) \}

When new data (test data) is evaluated with the model \hat{f} obtained from this training set, the generalization error is as follows.

Err_{\mathcal{T}} = E_{X^0, Y^0} [L(Y^0,\ \hat{f}(X^0)) \mid \mathcal{T}]

There are various factors that degrade generalization accuracy, and one of them is that the data points (the values of the explanatory variables) differ between the training data and the test data. If the data points coincide, the error is the following "in-sample" error.

Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^0} [L(Y^0_i ,\ \hat{f}(x_i)) \mid \mathcal{T} ]

The fragmentary quotation may be hard to follow, but compare the two formulas above. The point to note is the argument of the model \hat{f}: the general X^0 in the first formula becomes x_i (the x values of the training data) in the second. Data of this kind is the "in-sample" data. This is illustrated in the figures below.

insample1.png

Consider the case where a model (dark line) is obtained from training data consisting of 6 points. (Of course, a sine-like curve cannot really be recovered from 6 points; please take this as an explanatory diagram.)

insample2.png

Data whose x values match those of this training data are the "in-sample" data. Here the training data were at x = 5, 10, 15, ..., so anything that does not match these x values is not "in-sample". Thus the data at x > 20 are different, and in addition the in-between points such as x = 4 and x = 7 are not "in-sample" either. This is why the Cross Validated quotation above is, strictly speaking, not accurate. However, in typical time series analysis the data points usually come at regular intervals (e.g. daily or monthly), and in that case splitting at a certain point on the time axis does separate the data into "in-sample" and everything else, so the split happens to agree with the original definition of the term.
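As a toy illustration of this definition (mine, not from the textbook): with training inputs at x = 5, 10, 15, an evaluation point counts as "in-sample" only if its x value coincides with one of them.

```python
# Toy check of the "in-sample" condition: do the query x values match training x values?
import numpy as np

x_train_vals = np.array([5, 10, 15])        # x values of the training data
x_query = np.array([4, 7, 10, 15, 21])      # candidate evaluation points

print(dict(zip(x_query.tolist(), np.isin(x_query, x_train_vals).tolist())))
# {4: False, 7: False, 10: True, 15: True, 21: False}
```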

Apart from this misunderstanding of terminology, the explanation in the main text and the time series processing itself are, as far as I can tell, correct, so please do check them.

Everyday word usage changes over time, but the lesson this time was "use technical terms correctly".
