[PYTHON] I tried time series analysis! (AR model)

Overview

I tried out a time series analysis library (statsmodels.api.tsa). The author is currently studying the theory of time series analysis, and this is a self-contained article written to consolidate that understanding.

Analysis summary

I created data with the AR model below (which has stationarity) and performed parameter estimation with the statsmodels "ARMA" class. I also ran the ADF test to check whether the series has stationarity.

--Created data: daily data (2018/1/1 ~ 2019/12/31) generated with the AR(1) model y_{t}=2+0.8y_{t-1}
--Learning model: AR(1) with constant term (ARMA(1, 0))
--Learning method: maximum likelihood estimation

Premise

--AR model (autoregressive model, Auto-Regression) A model in which the value at time t is expressed as a linear combination of the values at the most recent past time points (t-1) through (t-p), as shown below. $u_{t}$ is the error term, white noise (normally distributed with mean 0). When terms up to (t-p) are included, the model is called AR(p). In this article, the data is created with an AR(1) model, and parameter estimation is also performed with an AR(1) model. y_{t}=c+a_{1}y_{t-1}+a_{2}y_{t-2}+\ldots +a_{p}y_{t-p}+u_{t}
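As a minimal sketch of the model formula above (separate from this article's data-creation code, and with the coefficients, noise scale, and random seed chosen arbitrarily for illustration), an AR(1) series with a white-noise error term can be simulated like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary assumption


def simulate_ar1(c, a1, n, sigma=1.0, y0=0.0):
    """Simulate y_t = c + a1 * y_{t-1} + u_t, with u_t ~ N(0, sigma^2)."""
    y = np.empty(n)
    y[0] = y0
    for t in range(1, n):
        y[t] = c + a1 * y[t - 1] + rng.normal(0.0, sigma)
    return y


y = simulate_ar1(c=2.0, a1=0.8, n=5000)
# For a stationary AR(1), the sample mean settles near c / (1 - a1) = 10
print(round(y[1000:].mean(), 2))
```

Because |a1| < 1, the simulated series fluctuates around its theoretical mean c / (1 - a1) rather than drifting away.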

--Stationarity A time series is said to have stationarity if its mean is constant at all time points and its covariance with the series k time points earlier depends only on k. For the AR(1) model treated in this analysis, the condition for stationarity is that the coefficient $a_{1}$ of $y_{t-1}$ in the model formula for $y_{t}$ satisfies $|a_{1}| < 1$.
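A quick numeric sketch of why $|a_{1}| < 1$ matters (the constant c = 1 and iteration count are arbitrary assumptions): iterating just the deterministic part of an AR(1) recursion settles toward $c / (1 - a_{1})$ when $|a_{1}| < 1$, and blows up otherwise.

```python
# Deterministic part of an AR(1): y <- c + a1 * y, with c = 1 (arbitrary)
def last_value(a1, n=200):
    y = 0.0
    for _ in range(n):
        y = 1.0 + a1 * y
    return y


print(last_value(0.8))  # |a1| < 1: settles toward 1 / (1 - 0.8) = 5
print(last_value(1.1))  # |a1| >= 1: diverges
```

With noise added back in, the same condition separates a series that fluctuates around a fixed mean from one whose level (and variance) never stabilizes.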

--DF test One of the so-called unit root tests. Assuming the target time series is AR(1), it takes "the series has a unit root" as the null hypothesis and "the series is stationary" as the alternative hypothesis. The test statistic follows the Dickey-Fuller distribution (not a standard normal distribution).

--ADF test (augmented DF test) Whereas the DF test applies only to AR(1), the ADF test extends it so that it can also be applied to AR(p). As in the DF test, the alternative hypothesis is "stationary", and the test statistic follows the Dickey-Fuller distribution.

Analysis details (code)

1. Library import

import os
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels.api as sm

2. Data creation

Data: ts_data

# True parameters (AR coefficients must have absolute value < 1 for stationarity)
params_list = [2 , -0.8]
params_num = len(params_list)
print("True parameters (c , a_1)" , ":" , params_list)
print("Number of true parameters" , ":" , params_num)

# Create the time series data
index_date = pd.date_range('2018-01-01' , '2019-12-31' , freq='D')
N = len(index_date)
init_y_num = params_num - 1
init_y_list = [random.randint(-1000 , 1000) for _ in range(init_y_num)]
print("index_date" , ":" , index_date[:6])
print("N" , ":" , N)
print("Initial data ({} values)".format(init_y_num) , ":" , init_y_list)

ts_data_list = list()
for i in range(N):
    if i < init_y_num:
        ts_data_list.append(init_y_list[i])
    else:
        y = params_list[0] + sum([params_list[j+1] * ts_data_list[i-(j+1)] for j in range(params_num - 1)])
        ts_data_list.append(y)
print("ts_data_list" , ":" , ts_data_list[:5])

ts_data = pd.Series(ts_data_list , index=index_date)
print("ts_data" , ":")
print(ts_data)

data.jpg

3. Graphing (line plot) _ data check

#Graph creation
fig = plt.figure(figsize=(15 ,10))

data = ts_data[:10]
ax_1 = fig.add_subplot(221)
ax_1.plot(data.index , data , marker="o")

plt.title("ten days from 2018/01/01")
plt.xlabel("date")
plt.ylabel("value")
plt.xticks(rotation=45)

data = ts_data[-10:]
ax_2 = fig.add_subplot(222)
ax_2.plot(data.index , data , marker="o")

plt.title("ten days to 2019/12/31")
plt.xlabel("date")
plt.ylabel("value")
plt.xticks(rotation=45)

plt.show()

graph.jpg

4. AR model learning

Model learning is performed on the following three data sets, and the results are compared.
① Whole period (2018/1/1 ~ 2019/12/31)
② The first 50 days from 2018/1/1
③ The last 50 days up to 2019/12/31

# AR learning result _ training data ① (whole period)
print("① Whole period" , "-" * 80)
data = ts_data
arma_result = sm.tsa.ARMA(data , order=(1 , 0)).fit(trend='c' , method='mle')
print(arma_result.summary())
print()

# AR learning result _ training data ② (50 days from 2018/1/1)
print("② 50 days from 2018/1/1" , "-" * 80)
data = ts_data[:50]
arma_result = sm.tsa.ARMA(data , order=(1 , 0)).fit(trend='c' , method='mle')
print(arma_result.summary())
print()

# AR learning result _ training data ③ (50 days up to 2019/12/31)
print("③ 50 days up to 2019/12/31" , "-" * 80)
data = ts_data[-50:]
arma_result = sm.tsa.ARMA(data , order=(1 , 0)).fit(trend='c' , method='mle')
print(arma_result.summary())
print()

arma_1.jpg arma_2.jpg arma_3_1.jpg arma_3_2.jpg
Comparing the coefficients (coef) of ① and ②, for both the constant term (const) and the coefficient of $y_{t-1}$ (ar.L1.y), ① is closer to the true model. For ③, the log shows the warning "The target time series does not have stationarity. You should input the stationary time series."

Looking at the results of ① to ③, the model fits the beginning of the series better than the later points. Why is that?

5. ADF test

Perform the ADF test on the same data sets ① to ③ as in section 4, and check the p-values to see at what significance level the null hypothesis is rejected.

# ADF test _ data ① (whole period)
print("① Whole period" , "-" * 80)
data = ts_data
result = sm.tsa.stattools.adfuller(data)
print('value = {:.4f}'.format(result[0]))
print('p-value = {:.4}'.format(result[1]))
print()

# ADF test _ data ② (50 days from 2018/1/1)
print("② 50 days from 2018/1/1" , "-" * 80)
data = ts_data[:50]
result = sm.tsa.stattools.adfuller(data)
print('value = {:.4f}'.format(result[0]))
print('p-value = {:.4}'.format(result[1]))
print()

# ADF test _ data ③ (50 days up to 2019/12/31)
print("③ 50 days up to 2019/12/31" , "-" * 80)
data = ts_data[-50:]
result = sm.tsa.stattools.adfuller(data)
print('value = {:.4f}'.format(result[0]))
print('p-value = {:.4}'.format(result[1]))
print()

adf.jpg

All the test statistics (value) are very large in absolute value, and the p-values (p-value) are 0.0. Even at a significance level of 1%, the null hypothesis was rejected for all of ① to ③.

(Since all p-values are 0.0, this may be a rather meaningless comparison, but) comparing the absolute values of the test statistics gives ① > ② > ③. Since the data is exactly AR(1), the ADF test, too, seemed to evaluate the beginning of the series more correctly.

Summary

Both the model learning results and the ADF test evaluated the beginning of the series more correctly. The learning result for ③ in particular was completely unexpected. The author does not yet understand why such a result was obtained, and hopes to find out through further study.
