[Python] Regression model comparison: ARMA vs. Random Forest Regression

(Addendum) I later posted a correction article about the Random Forest Regression part of this post: "What you should not do in time series data analysis (including reflections)". Please see that as well.

When performing time series data analysis in Python, the time series analysis (tsa) module provided by StatsModels for statistical modeling is probably the first option. In addition, the machine learning library Scikit-learn offers Decision Tree Regression and Random Forest Regression, which perform regression based on a different concept, so I wanted to try these as well and compared them.

(Programming environment: Python 2.7.11, Pandas 0.18.0, StatsModels 0.6.1, Scikit-learn 0.17.1.)

A group of traditional autoregressive models

When you open the literature on time series data analysis, the explanations often appear in the following order.

| Abbreviation | Description |
|:---|:---|
| AR | Autoregressive (Auto Regression) model |
| MA | Moving average (Moving Average) model |
| ARMA | Autoregressive moving average model |
| ARIMA | Autoregressive integrated moving average model |

For details of each model, see Wikipedia and similar references. Essentially, **ARMA** is a model that contains **AR** and **MA**, and **ARIMA** is a model that contains **ARMA**, so a library that supports **ARIMA** can cover all four of the models above. StatsModels, however, provides **AR**, **ARMA**, and **ARIMA** each as separate APIs.

http://statsmodels.sourceforge.net/stable/tsa.html
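
For reference, these model classes are all reachable from the top-level namespace (a minimal sketch for the statsmodels 0.6.x API used in this article):

```python
import statsmodels.api as sm

# The autoregressive-family model classes exposed by statsmodels.tsa
sm.tsa.AR     # AR(p) model
sm.tsa.ARMA   # ARMA(p, q) model
sm.tsa.ARIMA  # ARIMA(p, d, q) model
```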

This time, "Nile" was taken up as the data to be analyzed.

Nile River flows at Ashwan 1871-1970 This dataset contains measurements on the annual flow of the Nile as measured at Ashwan for 100 years from 1871-1970. There is an apparent changepoint near 1898.

This is annual flow data for the Nile River in Africa, provided as a sample dataset in the StatsModels library.

Fig. Time series plot of "Nile" data Nile_1.png

In autoregressive modeling, a "stationary" series is often a prerequisite for applying the model. Since the plot above shows neither a monotonic increase nor a monotonic decrease, we decided to apply the **ARMA** model.
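
As a complementary check (not part of the original analysis), statsmodels also provides the augmented Dickey-Fuller test, which tests for a unit root; a minimal sketch:

```python
import statsmodels.api as sm

# Augmented Dickey-Fuller test as a rough stationarity check
volume = sm.datasets.nile.load_pandas().data['volume'].values
result = sm.tsa.stattools.adfuller(volume)
print('ADF statistic: %.3f, p-value: %.3f' % (result[0], result[1]))
# A small p-value (conventionally < 0.05) supports rejecting the unit-root
# hypothesis, i.e. it is consistent with stationarity.
```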

ARMA model order selection

Since it is time series data spanning about 100 years, we allocate the first 70 years as data for fitting (training) and the last 30 years as data for testing the model. We need to select the order (p, q) of the ARMA model, so first let's draw the ACF plot (autocorrelation function plot) and PACF plot (partial autocorrelation plot), also called correlograms.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load the "Nile" sample data and index it by year
def load_data():
    df_read = sm.datasets.nile.load_pandas().data
    s_date = pd.Series(
        [pd.to_datetime(str(int(y_str))) for y_str in df_read['year']]
    )
    df = df_read.set_index(s_date)
    df = df.drop('year', axis=1)
    
    return df

df_nile = load_data()

# Plot Time Series Data
ax = df_nile['volume'].plot(figsize=(8,4), grid=True)
ax.set_xlabel('year')
ax.set_ylabel('volume x 10^8 m^3')
plt.show()

# Data Split (70 : 30)
df_train = df_nile[:70]
df_test = df_nile[70:]

# ACF, PACF plots of the training data
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df_train.values, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df_train.values, lags=40, ax=ax2)

Fig. ACF and PACF plot of "Nile" data Nile_2.png

In ARMA(p, q), the ACF plot (top) suggests q = [0, 1, 2, 3] as candidates, and the PACF plot (bottom) suggests p = [0, 1] (roughly speaking). With ARMA(p = 1, q = 3) in mind, running the order-selection utility produced the following output.

info_criteria = sm.tsa.stattools.arma_order_select_ic(
                    df_train.values, ic=['aic', 'bic']
                )
print(info_criteria.aic_min_order)
print(info_criteria.bic_min_order)

ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
>>> (1, 1)
>>> (1, 0)

Note the warning about convergence. Keeping that warning in mind, AIC (one model selection criterion) points to ARMA(1, 1) and BIC (another model selection criterion) points to ARMA(1, 0). Let us therefore fit these two models to the data.

arma_10 = sm.tsa.ARMA(df_train, (1, 0)).fit()
arma_11 = sm.tsa.ARMA(df_train, (1, 1)).fit()

This ran without any particular error or warning. The return values arma_10 and arma_11 are fitted model objects (of the ARMAResults class). For good measure, I also tried the higher-order models ARMA(1, 2) and ARMA(1, 3).
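
For example, the estimated coefficients and a diagnostic summary can be read off the fitted result objects (a small sketch; the exact output layout depends on the statsmodels version):

```python
# Inspect the fitted ARMA(1, 1) model
print(arma_11.params)     # estimated constant and AR/MA coefficients
print(arma_11.summary())  # full summary (coefficients, AIC, BIC, etc.)
```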

arma_12 = sm.tsa.ARMA(df_train, (1, 2)).fit()
arma_13 = sm.tsa.ARMA(df_train, (1, 3)).fit()

(Traceback information, etc. omitted)

ValueError: The computed initial AR coefficients are not stationary
You should induce stationarity, choose a different model order, or you can
pass your own start_params.

The ValueError above occurred. (Even though the error message mentions stationarity...) there appears to be a convergence problem when computing the parameters, making it difficult to fit ARMA(1, 2) and ARMA(1, 3). The cause may be worth investigating in depth someday, but since the theme this time is ARMA vs. Random Forest Regressor, we proceed using ARMA(1, 0) and ARMA(1, 1), which could be fitted.
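
As the error message suggests, one possible workaround (untested here) is to pass explicit starting values to `fit()`; a sketch with purely hypothetical numbers, where the parameter vector for ARMA(1, 2) with a constant is [const, ar1, ma1, ma2]:

```python
import numpy as np

# Hypothetical starting values: [const, ar1, ma1, ma2] for ARMA(1, 2)
start = np.array([900.0, 0.5, 0.1, 0.1])
arma_12 = sm.tsa.ARMA(df_train, (1, 2)).fit(start_params=start)
```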

ARMA model verification

First, check the value of AIC.

print('ARMA(1,0): AIC = %.2f' % arma_10.aic)
print('ARMA(1,1): AIC = %.2f' % arma_11.aic)
>>> ARMA(1,0): AIC = 910.94
>>> ARMA(1,1): AIC = 908.92

The AIC value is smaller for the ARMA(1, 1) model.
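
BIC is available on the same result objects, so it can be cross-checked in the same way (a sketch; recall that BIC favored ARMA(1, 0) in the order-selection step):

```python
print('ARMA(1,0): BIC = %.2f' % arma_10.bic)
print('ARMA(1,1): BIC = %.2f' % arma_11.bic)
```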

Next, the model estimates for the training interval (70 years) and for the subsequent test interval (30 years) are computed.

# in-sample predict
arma_10_inpred = arma_10.predict(start='1871-01-01', end='1940-01-01')
arma_11_inpred = arma_11.predict(start='1871-01-01', end='1940-01-01')

# out-of-sample predict
arma_10_outpred = arma_10.predict(start='1941-01-01', end='1970-01-01')
arma_11_outpred = arma_11.predict(start='1941-01-01', end='1970-01-01')

Both the AIC value and the estimated values are obtained from the fitted ARMA result object (the ARMAResults class). The original data and the estimated values are plotted together below.

# plot data and predicted values
def plot_ARMA_results(origdata, pred10in, pred10out, pred11in, pred11out):
    px = origdata.index
    py1 = origdata.values
    plt.plot(px, py1, 'b:', label='orig data')
    
    px_in = pred10in.index
    plt.plot(px_in, pred10in.values, 'g')
    plt.plot(px_in, pred11in.values, 'c')
    
    px_out = pred10out.index
    plt.plot(px_out, pred10out.values, 'g', label='ARMA(1,0)')
    plt.plot(px_out, pred11out.values, 'c', label='ARMA(1,1)')
    
    plt.legend()
    plt.grid(True)
    plt.show()

plot_ARMA_results(df_nile, arma_10_inpred, arma_10_outpred, 
    arma_11_inpred, arma_11_outpred)

Fig. Nile data - original data and ARMA estimated data Nile_3.png

The training interval (in-sample) is 1871 to 1940, and the test interval (out-of-sample) is everything after that. It is a little hard to see, but ARMA(1, 1) appears somewhat closer to the actual data. For the model comparison later, we compute the MSE (Mean Squared Error) from the residuals.

# Residuals (MSE for train)
arma_10_mse_tr = (arma_10.resid ** 2).mean()
arma_11_mse_tr = (arma_11.resid ** 2).mean()

# Residuals (MSE for test)
arma_10_mse_te = ((df_test['volume'].values - arma_10_outpred.values) ** 2).mean()
arma_11_mse_te = ((df_test['volume'].values - arma_11_outpred.values) ** 2).mean()

Random Forest Regression

The basis of Random Forest regression is Decision Tree Regression. The decision tree algorithm itself deserves at least a rough understanding, but in this article we skip the explanation and treat the Scikit-learn function as a black box. There is an example in the Scikit-learn documentation; the result of trying it is shown below.
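
The figure below comes from trying (a variant of) that documentation example; a minimal, self-contained sketch of it, fitting trees of two depths to noisy sine data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data, following the Scikit-learn documentation example
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Fit two trees of different depth and predict on a fine grid
tree_2 = DecisionTreeRegressor(max_depth=2).fit(X, y)
tree_5 = DecisionTreeRegressor(max_depth=5).fit(X, y)
X_grid = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]

plt.scatter(X, y, c='k', s=20, label='data')
plt.plot(X_grid, tree_2.predict(X_grid), 'g', label='max_depth=2')
plt.plot(X_grid, tree_5.predict(X_grid), 'r', label='max_depth=5')
plt.legend()
plt.show()
```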

Fig. Decision Tree Regression Example tree_regression_s.png

Its characteristic feature is the stepwise, piecewise-constant fit. Our data is a univariate time series, but I framed it as a model that estimates the current value from several past values; concretely: volume_current ~ volume_lag1 + volume_lag2 + volume_lag3. I wanted to keep the training/test split the same as before (70, 30), but since NaN values appear when computing the lag features, the first rows are removed with dropna(), making the data lengths (67, 30).

First, data preprocessing is performed.

# Create lagged copies of the series as explanatory variables
df_nile['lag1'] = df_nile['volume'].shift(1)
df_nile['lag2'] = df_nile['volume'].shift(2)
df_nile['lag3'] = df_nile['volume'].shift(3)

df_nile = df_nile.dropna()

X_train = df_nile[['lag1', 'lag2', 'lag3']][:67].values
X_test = df_nile[['lag1', 'lag2', 'lag3']][67:].values

y_train = df_nile['volume'][:67].values
y_test = df_nile['volume'][67:].values

Use Random Forest Regressor from Scikit-learn.

from sklearn.ensemble import RandomForestRegressor
r_forest = RandomForestRegressor(
            n_estimators=100,
            criterion='mse',
            random_state=1,
            n_jobs=-1
)
r_forest.fit(X_train, y_train)
y_train_pred = r_forest.predict(X_train)
y_test_pred = r_forest.predict(X_test)
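
As a by-product, the fitted forest reports how much each lag feature contributes (a quick sketch, not part of the original comparison):

```python
# Relative importance of each lag feature (the values sum to 1)
for name, imp in zip(['lag1', 'lag2', 'lag3'], r_forest.feature_importances_):
    print('%s: %.3f' % (name, imp))
```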

The estimated values are as shown in the figure below.

Fig. Nile data - original data and Random Forest Regression results Nile_4.png

Finally, calculate MSE.

# Check residuals (MSE)
RFR_mse_train = ((y_train - y_train_pred) ** 2).mean()
RFR_mse_test = ((y_test - y_test_pred) ** 2).mean()
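
The four ARMA values computed earlier and the two values above can then be printed side by side, for example:

```python
print('ARMA(1,0): train MSE = %.1f, test MSE = %.1f' % (arma_10_mse_tr, arma_10_mse_te))
print('ARMA(1,1): train MSE = %.1f, test MSE = %.1f' % (arma_11_mse_tr, arma_11_mse_te))
print('RF Regr. : train MSE = %.1f, test MSE = %.1f' % (RFR_mse_train, RFR_mse_test))
```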

Model accuracy comparison

Compare the MSE (Mean Squared Error) of each model.

| Model | MSE of in-sample (train) | MSE of out-of-sample (test) |
|:---|---:|---:|
| ARMA(1,0) | 24111.0 | 18684.4 |
| ARMA(1,1) | 22757.3 | 16625.8 |
| Random F. Regressor | 3470.1 | 15400.3 |

(MSE: smaller is better)

Comparing the ARMA models of different orders, ARMA(1, 1) has the smaller MSE and the better accuracy in both the training and test intervals, which is consistent with the AIC comparison made earlier.

Comparing the ARMA model with the Random Forest regression model, the MSE of the Random Forest regression is far smaller in the training interval. It is also smaller in the test interval, but there the difference is only about 7%. Random Forest regression has the advantage in accuracy, but it should be kept in mind that the ARMA model, being a statistical model, can also provide confidence intervals (for example, 95% confidence intervals) in addition to point predictions. The two should be used appropriately depending on the purpose.
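
For reference, the ARMA results API used here exposes such intervals through `forecast()`; a minimal sketch (alpha defaults to 0.05, i.e. 95% intervals):

```python
# 30-step-ahead forecast with 95% confidence intervals
forecast, stderr, conf_int = arma_11.forecast(steps=30)
print(conf_int[:3])  # lower/upper bounds for the first three test years
```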

(Comparing the plots of the ARMA and Random Forest regressions, I was struck again by how different the lines look. Also note that neither model may have had its parameters tuned sufficiently.)
