[PYTHON] I tried time series analysis! (AR model)

Overview

I tried out a time series analysis library (statsmodels.api.tsa). The author is currently studying the theory of time series analysis, and this is a self-contained article written to consolidate that understanding.

Analysis summary

I created data with the AR model below (which has stationarity) and performed parameter estimation with the statsmodels "ARMA" class. I also ran the ADF test to check whether the series has stationarity.

--Created data: daily data (2018/1/1 ~ 2019/12/31) generated with the AR(1) model y_{t}=2+0.8y_{t-1}
--Learning model: AR(1) with constant term (ARMA(1, 0))
--Learning method: maximum likelihood estimation

Premise

--AR model (autoregressive model, Auto-Regression) A model in which the value at time t is expressed as a linear combination of the values at the most recent past time points (t-1) through (t-p), as shown below. $u_{t}$ is the error term, white noise (normally distributed with mean 0). When terms up to (t-p) are included, the model is called AR(p). In this article, the data is created with an AR(1) model, and parameter estimation is also performed with an AR(1) model. y_{t}=c+a_{1}y_{t-1}+a_{2}y_{t-2}+\ldots +a_{p}y_{t-p}+u_{t}
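As a minimal sketch of the model formula above (separate from this article's data-creation code, and with the coefficients, noise scale, and random seed chosen arbitrarily for illustration), an AR(1) series with a white-noise error term can be simulated like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary assumption


def simulate_ar1(c, a1, n, sigma=1.0, y0=0.0):
    """Simulate y_t = c + a1 * y_{t-1} + u_t, with u_t ~ N(0, sigma^2)."""
    y = np.empty(n)
    y[0] = y0
    for t in range(1, n):
        y[t] = c + a1 * y[t - 1] + rng.normal(0.0, sigma)
    return y


y = simulate_ar1(c=2.0, a1=0.8, n=5000)
# For a stationary AR(1), the sample mean settles near c / (1 - a1) = 10
print(round(y[1000:].mean(), 2))
```

Because |a1| < 1, the simulated series fluctuates around its theoretical mean c / (1 - a1) rather than drifting away.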

--Stationarity A time series is said to have stationarity if its mean is constant at all time points and its covariance with the series k time points earlier depends only on k. For the AR(1) model treated in this analysis, the condition for stationarity is that the coefficient $a_{1}$ of $y_{t-1}$ in the model formula for $y_{t}$ satisfies $|a_{1}| < 1$.
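A quick numeric sketch of why $|a_{1}| < 1$ matters (the constant c = 1 and iteration count are arbitrary assumptions): iterating just the deterministic part of an AR(1) recursion settles toward $c / (1 - a_{1})$ when $|a_{1}| < 1$, and blows up otherwise.

```python
# Deterministic part of an AR(1): y <- c + a1 * y, with c = 1 (arbitrary)
def last_value(a1, n=200):
    y = 0.0
    for _ in range(n):
        y = 1.0 + a1 * y
    return y


print(last_value(0.8))  # |a1| < 1: settles toward 1 / (1 - 0.8) = 5
print(last_value(1.1))  # |a1| >= 1: diverges
```

With noise added back in, the same condition separates a series that fluctuates around a fixed mean from one whose level (and variance) never stabilizes.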

--DF test One of the so-called unit root tests. Assuming the target time series is AR(1), it takes "the series has a unit root" as the null hypothesis and "the series is stationary" as the alternative hypothesis. The test statistic follows the Dickey-Fuller distribution (not a standard normal distribution).

--ADF test (augmented DF test) Whereas the DF test applies only to AR(1), the ADF test extends it so that it can also be applied to AR(p). As in the DF test, the alternative hypothesis is "stationary", and the test statistic follows the Dickey-Fuller distribution.

Analysis details (code)

1. Library import

import os
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels.api as sm

2. Data creation

Data: ts_data

# True parameters (AR coefficients must have absolute value < 1 for stationarity)
params_list = [2 , -0.8]
params_num = len(params_list)
print("True parameters (c , a_1)" , ":" , params_list)
print("Number of true parameters" , ":" , params_num)

# Create the time series data
index_date = pd.date_range('2018-01-01' , '2019-12-31' , freq='D')
N = len(index_date)
init_y_num = params_num - 1
init_y_list = [random.randint(-1000 , 1000) for _ in range(init_y_num)]
print("index_date" , ":" , index_date[:6])
print("N" , ":" , N)
print("Initial data ({} values)".format(init_y_num) , ":" , init_y_list)

ts_data_list = list()
for i in range(N):
    if i < init_y_num:
        ts_data_list.append(init_y_list[i])
    else:
        y = params_list[0] + sum([params_list[j+1] * ts_data_list[i-(j+1)] for j in range(params_num - 1)])
        ts_data_list.append(y)
print("ts_data_list" , ":" , ts_data_list[:5])

ts_data = pd.Series(ts_data_list , index=index_date)
print("ts_data" , ":")
print(ts_data)

data.jpg

3. Graphing (line plot) _ data check

#Graph creation
fig = plt.figure(figsize=(15 ,10))

data = ts_data[:10]
ax_1 = fig.add_subplot(221)
ax_1.plot(data.index , data , marker="o")

plt.title("ten days from 2018/01/01")
plt.xlabel("date")
plt.ylabel("value")
plt.xticks(rotation=45)

data = ts_data[-10:]
ax_2 = fig.add_subplot(222)
ax_2.plot(data.index , data , marker="o")

plt.title("ten days to 2019/12/31")
plt.xlabel("date")
plt.ylabel("value")
plt.xticks(rotation=45)

plt.show()

graph.jpg

4. AR model learning

Model learning is performed on the following three data sets, and the results are compared.
① Whole period (2018/1/1 ~ 2019/12/31)
② The first 50 days from 2018/1/1
③ The last 50 days up to 2019/12/31

# AR learning result _ training data ① (whole period)
print("① Whole period" , "-" * 80)
data = ts_data
arma_result = sm.tsa.ARMA(data , order=(1 , 0)).fit(trend='c' , method='mle')
print(arma_result.summary())
print()

# AR learning result _ training data ② (50 days from 2018/1/1)
print("② 50 days from 2018/1/1" , "-" * 80)
data = ts_data[:50]
arma_result = sm.tsa.ARMA(data , order=(1 , 0)).fit(trend='c' , method='mle')
print(arma_result.summary())
print()

# AR learning result _ training data ③ (50 days up to 2019/12/31)
print("③ 50 days up to 2019/12/31" , "-" * 80)
data = ts_data[-50:]
arma_result = sm.tsa.ARMA(data , order=(1 , 0)).fit(trend='c' , method='mle')
print(arma_result.summary())
print()

arma_1.jpg arma_2.jpg arma_3_1.jpg arma_3_2.jpg
Comparing the coefficients (coef) of ① and ②, for both the constant term (const) and the coefficient of $y_{t-1}$ (ar.L1.y), ① is closer to the true model. For ③, the log shows the warning "The target time series does not have stationarity. You should input the stationary time series."

Looking at the results of ① to ③, the model fits the beginning of the series better than the later points. Why is that?

5. ADF test

Perform the ADF test on the same data sets ① to ③ as in section 4, and check the p-values to see at what significance level the null hypothesis is rejected.

# ADF test _ data ① (whole period)
print("① Whole period" , "-" * 80)
data = ts_data
result = sm.tsa.stattools.adfuller(data)
print('value = {:.4f}'.format(result[0]))
print('p-value = {:.4}'.format(result[1]))
print()

# ADF test _ data ② (50 days from 2018/1/1)
print("② 50 days from 2018/1/1" , "-" * 80)
data = ts_data[:50]
result = sm.tsa.stattools.adfuller(data)
print('value = {:.4f}'.format(result[0]))
print('p-value = {:.4}'.format(result[1]))
print()

# ADF test _ data ③ (50 days up to 2019/12/31)
print("③ 50 days up to 2019/12/31" , "-" * 80)
data = ts_data[-50:]
result = sm.tsa.stattools.adfuller(data)
print('value = {:.4f}'.format(result[0]))
print('p-value = {:.4}'.format(result[1]))
print()

adf.jpg

All the test statistics (value) are very large in absolute value, and the p-values (p-value) are 0.0. Even at a significance level of 1%, the null hypothesis was rejected for all of ① to ③.

(Since all p-values are 0.0, this may be a rather meaningless comparison, but) comparing the absolute values of the test statistics gives ① > ② > ③. Since the data is exactly AR(1), the ADF test, too, seemed to evaluate the beginning of the series more correctly.

Summary

Both the model learning results and the ADF test evaluated the beginning of the series more correctly. The learning result for ③ in particular was completely unexpected. The author does not yet understand why such a result was obtained, and hopes to find out through further study.
