[PYTHON] Time series analysis Part 1 Autocorrelation

Purpose

data

I decided to use TOPIX historical data for the time being because I wanted to move my hands while reading. https://quotes.wsj.com/index/JP/TOKYO%20EXCHANGE%20(TOPIX)/180460/historical-prices I was able to download it from the WSJ website.

It seems that there are no missing values for the time being. The period is from December 30, 2008 to November 29, 2019. Historical data of 4 values without ex-dividend adjustment. When plotting the closing price, it looks like the figure below. It is a period from the time when the 1,000pt was broken immediately after the Lehman shock to the time after Abenomics. TOPIX.png The daily returns are as follows. Since it is daily, it will not make a big difference, but I used a logarithmic return ($ \ Delta \ log {y_t} = \ log {(y_t)}-\ log {(y_ {t-1})} $). save.png save.png The biggest drop was recorded on March 15, 2011, at $ \ Delta \ log {y_t} = -0.0995 $. The cause was the Tohoku-Pacific Ocean Earthquake that occurred about 15 minutes before the closing of the previous day.

Various statistics

average \bar{y} = \frac{1}{T}\displaystyle{\sum_{t=1}^{T}y_t}

tpx_return = np.log(tpx['close'].values)[1:] - np.log(tpx['close'].values)[:-1]
tpx_return.mean()
0.00025531962222667643

** auto-covariance ** \hat{\gamma}_k = \frac{1}{T} \displaystyle{\sum\_{t=k+1}^{T}}(y\_t-\bar{y})(y\_{t-k}-\bar{y}),\quad k = 0,1,2,...

import statsmodels.api as sm
sm.tsa.stattools.acovf(tpx_return, fft=True, nlag=5)
array([ 1.57113176e-04,  1.16917913e-06,  3.48846296e-06, -4.67502133e-06, -5.31500282e-06, -2.82855375e-06])

I'm not used to the statsmodels library, so check it by hand.

# k=0
((tpx_return-tpx_return.mean())**2).sum() / len(tpx_return)
0.00015711317609153836
# k=4
((tpx_return-tpx_return.mean())[4:]*(tpx_return-tpx_return.mean())[:-4]).sum() / len(tpx_return)
-5.315002816332674e-06

** Auto-correlation ** \hat{\rho}_k = \frac{\hat{\gamma}\_k}{\hat{\gamma}\_0},\quad k=1,2,3,...

sm.tsa.stattools.acf(tpx_return)[:5]
array([ 1.        ,  0.00744164,  0.0222035 , -0.02975576, -0.03382913])

If you check using the results of $ k = 0 $ and $ k = 4 $,

-5.315002816332674e-06 / 0.00015711317609153836
-0.03382913482212345

So it seems that the library is doing the calculations as I imagined.

I will also draw a correlogram. A correlogram is a graph of the autocorrelation coefficient.

autocorr = sm.tsa.stattools.acf(tpx_return)
ci = 1.96 / np.sqrt(len(tpx_return))
plt.bar(np.arange(len(autocorr)), autocorr)
plt.hlines([ci,-ci],0,len(autocorr), linestyle='dashed')
plt.title('Correlogram')
plt.ylim(-0.05,0.05)
plt.show()

save.png

CI (confidence interval) is a confidence interval, and when data are independent of each other and follow the same distribution, $ \ hat {\ rho} _k $ asymptotically means $ 0 $ on average and variance $ \ frac {1} {T. We calculate 95% on both sides using the property of following the normal distribution of} $.

It looks like this when using the library. Fashionable.

sm.graphics.tsa.plot_acf(tpx_return)
plt.ylim(-0.05,0.05)
plt.show()

save.png

Think a little about the numbers you got. \hat{\rho}\_{k=11}=-0.0421, \quad \hat{\rho}_{k=16}=0.0415 It is slightly higher than $ CI = 0.0379 $ twice. In particular, $ \ hat {\ rho} \ _ {k = 12} =-0.0323 $ is also a negative value, suggesting that even if the market price starts to rise or fall once, the momentum will end in about 2 weeks. You may be able to think that you are doing it.

Portmanteau test

A method to test the null hypothesis that multiple autocorrelation coefficients are all 0. H_0 : \rho_1 = \rho_2 =\quad ... \quad= \rho_m = 0 Here, we will perform the test using the statistics of Ljung and Box (1978) introduced in the book. The approach is to compare the $ Q (m) $ below with the 95% point of the chi-square distribution. Q(m) = T(T+2)\displaystyle{\sum\_{k=1}^{m}}\frac{\hat{\rho}^2_k}{T-k} \sim \chi^2(m) A statistic called the P value is also defined, which indicates the probability that a random variable following the chi-square distribution will take a value larger than $ Q (m) $. That is, assuming a significance level of 5%, $ H_0 $ is rejected when the P value is less than 0.05. For the value of $ m $, $ m \ approx \ log {(T)} $ seems to be a guide, but it seems that it is common to test multiple $ m $ and make a comprehensive judgment. Is.

First of all, from the pattern that uses the library of stats models. $ m \ approx \ log {(T)} = 7.89 $ So, for the time being, consider the lag in the range up to 16.

lvalue, pvalue = sm.stats.diagnostic.acorr_ljungbox(tpx_return)

That's it. Very easy. save.png

The P value did not fall below 0.05 regardless of the value of $ m $, and the result was that the daily change rate of TOPIX could not be said to have an autocorrelation. Well, the market price is not that simple.

Finally, try without the library to deepen your understanding.

from scipy.stats import chi2

def Q_func(data, max_m):
    T = len(data)
    auto_corr = sm.tsa.stattools.acf(data)[1:]
    lvalue = T*(T+2)*((auto_corr**2/(T-np.arange(1,len(auto_corr)+1)))[:max_m]).cumsum()
    pvalue = 1 - chi2.cdf(lvalue, np.arange(1,len(lvalue)+1))
    return lvalue, pvalue

Confirm that the same result is obtained.

l_Q_func, p_Q_func = Q_func(tpx_return,max_m=16)
l_sm, p_sm = sm.stats.diagnostic.acorr_ljungbox(tpx_return, lags=16)
((l_Q_func-l_sm)**2).mean(), ((p_Q_func-p_sm)**2).mean()
(0.0, 7.824090399073146e-34)

If you take a closer look at the case of $ m = 8 $,

T = len(tpx_return)
auto_corr = sm.tsa.stattools.acf(tpx_return)[1:]
lvalue = T*(T+2)*((auto_corr**2/(T-np.arange(1,len(auto_corr)+1)))[:8]).sum()
print(lvalue)
8.604732778577853
1-chi2.cdf(lvalue,8)
0.37672860496603844

save.png

So it turns out that it is quite difficult to reject the null hypothesis.

Recommended Posts

Time series analysis Part 1 Autocorrelation
Time series analysis Part 3 Forecast
Python: Time Series Analysis
Time series analysis Part 2 AR / MA / ARMA
RNN_LSTM1 Time series analysis
Time series analysis 1 Basics
Time series analysis related memo
Python: Time Series Analysis: Preprocessing Time Series Data
Time series analysis practice sales forecast
Time series analysis 3 Preprocessing of time series data
Time series analysis 2 Stationary, ARMA / ARIMA model
I tried time series analysis! (AR model)
Time Series Decomposition
Time series analysis 4 Construction of SARIMA model
Time series analysis # 6 Spurious regression and cointegration
pandas series part 1
Python: Time Series Analysis: Building a SARIMA Model
Python: Time Series Analysis: Stationarity, ARMA / ARIMA Model
Kaggle ~ Housing Analysis ③ ~ Part1
Python time series question
Display TOPIX time series
Time series plot / Matplotlib
Python 3.4 Create Windows7-64bit environment (for financial time series analysis)
Python application: Pandas Part 2: Series
Challenge to future sales forecast: ② Time series analysis using PyFlux
A study method for beginners to learn time series analysis
[Python] Plot time series data
Wrap analysis part1 (data preparation)
Challenge to future sales forecast: ⑤ Time series analysis by Prophet
Challenges for future sales forecasts: (1) What is time series analysis?
[Statistics] [Time series analysis] Plot the ARMA model and grasp the tendency.
Easy time series prediction with Prophet
Time series plot started ~ python edition ~
About time series data and overfitting
Japanese analysis processing using Janome part1
Differentiation of time series data (discrete)
Movement statistics for time series forecasting
LSTM (1) for time series forecasting (for beginners)
Multidimensional data analysis library xarray Part 2
Power of forecasting methods in time series data analysis Semi-optimization (SARIMA) [Memo]
Instantly illustrate the predominant period in time series data using spectrum analysis