[PYTHON] Time series analysis Part 1 Autocorrelation

Purpose

Time series data, RNN for the time being! , LSTM! !! It was like that, but I learned about ARIMA and SARIMA and wanted to add them to my drawers.
It would be easy to use the library and dive into it, but I bought a book because I wanted to know the theory.
A book called "Measurement Time Series Analysis of Economic and Finance Data" by Professor Tatsuyoshi Okimoto. This is a 2010 book.

data

I decided to use TOPIX historical data for the time being because I wanted to move my hands while reading. https://quotes.wsj.com/index/JP/TOKYO%20EXCHANGE%20(TOPIX)/180460/historical-prices I was able to download it from the WSJ website.

It seems that there are no missing values for the time being. The period is from December 30, 2008 to November 29, 2019. Historical data of 4 values without ex-dividend adjustment. When plotting the closing price, it looks like the figure below. It is a period from the time when the 1,000pt was broken immediately after the Lehman shock to the time after Abenomics. The daily returns are as follows. Since it is daily, it will not make a big difference, but I used a logarithmic return ($ \ Delta \ log {y_t} = \ log {(y_t)}-\ log {(y_ {t-1})} $). The biggest drop was recorded on March 15, 2011, at $ \ Delta \ log {y_t} = -0.0995 $. The cause was the Tohoku-Pacific Ocean Earthquake that occurred about 15 minutes before the closing of the previous day.

Various statistics

average \bar{y} = \frac{1}{T}\displaystyle{\sum_{t=1}^{T}y_t}

tpx_return = np.log(tpx['close'].values)[1:] - np.log(tpx['close'].values)[:-1]
tpx_return.mean()

0.00025531962222667643

** auto-covariance ** \hat{\gamma}_k = \frac{1}{T} \displaystyle{\sum\_{t=k+1}^{T}}(y\_t-\bar{y})(y\_{t-k}-\bar{y}),\quad k = 0,1,2,...

import statsmodels.api as sm
sm.tsa.stattools.acovf(tpx_return, fft=True, nlag=5)

array([ 1.57113176e-04,  1.16917913e-06,  3.48846296e-06, -4.67502133e-06, -5.31500282e-06, -2.82855375e-06])

I'm not used to the statsmodels library, so check it by hand.

# k=0
((tpx_return-tpx_return.mean())**2).sum() / len(tpx_return)

0.00015711317609153836

# k=4
((tpx_return-tpx_return.mean())[4:]*(tpx_return-tpx_return.mean())[:-4]).sum() / len(tpx_return)

-5.315002816332674e-06

** Auto-correlation ** \hat{\rho}_k = \frac{\hat{\gamma}\_k}{\hat{\gamma}\_0},\quad k=1,2,3,...

sm.tsa.stattools.acf(tpx_return)[:5]

array([ 1.        ,  0.00744164,  0.0222035 , -0.02975576, -0.03382913])

If you check using the results of $ k = 0 $ and $ k = 4 $,

-5.315002816332674e-06 / 0.00015711317609153836

-0.03382913482212345

So it seems that the library is doing the calculations as I imagined.

I will also draw a correlogram. A correlogram is a graph of the autocorrelation coefficient.

autocorr = sm.tsa.stattools.acf(tpx_return)
ci = 1.96 / np.sqrt(len(tpx_return))
plt.bar(np.arange(len(autocorr)), autocorr)
plt.hlines([ci,-ci],0,len(autocorr), linestyle='dashed')
plt.title('Correlogram')
plt.ylim(-0.05,0.05)
plt.show()

CI (confidence interval) is a confidence interval, and when data are independent of each other and follow the same distribution, $ \ hat {\ rho} _k $ asymptotically means $ 0 $ on average and variance $ \ frac {1} {T. We calculate 95% on both sides using the property of following the normal distribution of} $.

It looks like this when using the library. Fashionable.

sm.graphics.tsa.plot_acf(tpx_return)
plt.ylim(-0.05,0.05)
plt.show()

Think a little about the numbers you got. \hat{\rho}\_{k=11}=-0.0421, \quad \hat{\rho}_{k=16}=0.0415 It is slightly higher than $ CI = 0.0379 $ twice. In particular, $ \ hat {\ rho} \ _ {k = 12} =-0.0323 $ is also a negative value, suggesting that even if the market price starts to rise or fall once, the momentum will end in about 2 weeks. You may be able to think that you are doing it.

Portmanteau test

A method to test the null hypothesis that multiple autocorrelation coefficients are all 0. H_0 : \rho_1 = \rho_2 =\quad ... \quad= \rho_m = 0 Here, we will perform the test using the statistics of Ljung and Box (1978) introduced in the book. The approach is to compare the $ Q (m) $ below with the 95% point of the chi-square distribution. Q(m) = T(T+2)\displaystyle{\sum\_{k=1}^{m}}\frac{\hat{\rho}^2_k}{T-k} \sim \chi^2(m) A statistic called the P value is also defined, which indicates the probability that a random variable following the chi-square distribution will take a value larger than $ Q (m) $. That is, assuming a significance level of 5%, $ H_0 $ is rejected when the P value is less than 0.05. For the value of $ m $, $ m \ approx \ log {(T)} $ seems to be a guide, but it seems that it is common to test multiple $ m $ and make a comprehensive judgment. Is.

First of all, from the pattern that uses the library of stats models. $ m \ approx \ log {(T)} = 7.89 $ So, for the time being, consider the lag in the range up to 16.

lvalue, pvalue = sm.stats.diagnostic.acorr_ljungbox(tpx_return)

That's it. Very easy.

The P value did not fall below 0.05 regardless of the value of $ m $, and the result was that the daily change rate of TOPIX could not be said to have an autocorrelation. Well, the market price is not that simple.

Finally, try without the library to deepen your understanding.

from scipy.stats import chi2

def Q_func(data, max_m):
    T = len(data)
    auto_corr = sm.tsa.stattools.acf(data)[1:]
    lvalue = T*(T+2)*((auto_corr**2/(T-np.arange(1,len(auto_corr)+1)))[:max_m]).cumsum()
    pvalue = 1 - chi2.cdf(lvalue, np.arange(1,len(lvalue)+1))
    return lvalue, pvalue

Confirm that the same result is obtained.

l_Q_func, p_Q_func = Q_func(tpx_return,max_m=16)
l_sm, p_sm = sm.stats.diagnostic.acorr_ljungbox(tpx_return, lags=16)
((l_Q_func-l_sm)**2).mean(), ((p_Q_func-p_sm)**2).mean()

(0.0, 7.824090399073146e-34)

If you take a closer look at the case of $ m = 8 $,

T = len(tpx_return)
auto_corr = sm.tsa.stattools.acf(tpx_return)[1:]
lvalue = T*(T+2)*((auto_corr**2/(T-np.arange(1,len(auto_corr)+1)))[:8]).sum()
print(lvalue)

8.604732778577853

1-chi2.cdf(lvalue,8)

0.37672860496603844

So it turns out that it is quite difficult to reject the null hypothesis.