[PYTHON] I tried to predict the price of an ETF

Hello. Recently I have been trying to forecast stock prices. I am still learning, but I would like to share the results. This time, I will make a prediction for the Vanguard S&P 500 ETF (VOO). You can download the data at the URL below, but you need to log in to Investing.com. https://www.investing.com/etfs/vanguard-s-p-500-historical-data As a reminder, any decision to buy financial products based on this model is at your own risk.

Preparation


# Import required libraries
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime
import itertools
import statsmodels.api as sm
import matplotlib.dates as mdates

import warnings
warnings.simplefilter('ignore')

Next, read the data and prepare it.


# Load the data and convert the index to datetime with pd.to_datetime
df = pd.read_csv("voo_10.csv")
df.index = pd.to_datetime(df["Date"])
df = df.sort_index()
del df["Date"]

This time we will only look at the ETF's price changes in chronological order. By the way, the loaded data looks like this.


[out]

         Price    Open    High    Low Vol.    Change %
Date                        
2011-01-10  116.20  115.88  116.28  115.48  124.46K -0.09%
2011-01-11  116.58  116.66  116.85  116.12  131.69K 0.33%
2011-01-12  117.62  117.32  117.72  117.10  107.31K 0.89%
2011-01-13  117.46  117.68  117.74  117.16  96.05K  -0.14%
2011-01-14  118.30  117.30  118.34  117.26  193.39K 0.72%
... ... ... ... ... ... ...
2020-12-31  343.69  341.82  344.37  341.23  3.51M   0.54%
2021-01-04  339.03  345.02  345.09  335.37  5.35M   -1.36%
2021-01-05  341.26  338.40  342.39  338.39  3.28M   0.66%
2021-01-06  343.33  339.79  346.50  339.30  4.79M   0.61%
2021-01-07  348.46  345.74  349.20  345.54  3.62M   1.49%

The Price column holds the closing price. Plotting the price history:

price = pd.Series(df["Price"], dtype=float)
plt.ylabel("price, USD")
price.plot()
plt.show()

(Figure: VOO closing price history)

EDA

The loaded data is the original series, that is, the data as observed. Stationarity is important when analyzing time series data, so we need to confirm whether the series is stationary. Stationarity means that statistical properties such as the mean and variance do not change over time; a process assumed to have this property is called a stationary process. If the series is not stationary, the models used here cannot be applied and reliable time series prediction is not possible.
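Before decomposing anything, here is a quick visual check (a minimal sketch; the 60-business-day window is an arbitrary choice of mine, not from the original analysis). For a stationary series, a rolling mean and standard deviation stay roughly flat; for the VOO price they clearly drift upward.

# A quick visual check of (non-)stationarity: for a stationary series,
# the rolling mean and standard deviation stay roughly constant over time.
# The 60-day window is an arbitrary choice for illustration.
window = 60
plt.plot(price, label="Price")
plt.plot(price.rolling(window).mean(), label="Rolling mean")
plt.plot(price.rolling(window).std(), label="Rolling std")
plt.legend()
plt.show()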

Next, let's see how the data breaks down into components.

# Check the components of the data
fig = sm.tsa.seasonal_decompose(price, period=80).plot()
plt.show()

(Figure: seasonal decomposition of the price series into observed, trend, seasonal, and residual components)

The top panel of the image is the original series. The second is the trend, the overall direction of the data. It is mostly rising, though it sometimes turns down. The third is the periodic variation, called seasonality. Even if the variation has nothing to do with the seasons, it counts as seasonal as long as it is periodic. Roughly speaking, the price rises while fluctuating periodically. Because there are many data points, the period is also set large.
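For reference, seasonal_decompose also exposes each component as a Series if you keep the result object instead of plotting it straight away (a small sketch using the same call as above):

# Keep the decomposition result instead of plotting it immediately,
# so that the individual components can be inspected as Series.
decomp = sm.tsa.seasonal_decompose(price, period=80)
trend = decomp.trend        # long-term direction (NaN at both ends)
seasonal = decomp.seasonal  # repeating periodic component
resid = decomp.resid        # what remains after removing trend and seasonality
print(trend.dropna().head())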

As it stands, the data is not a stationary process and time series analysis cannot be performed, so we take differences. For now, the order of differencing is 1.


price_diff = price - price.shift()
len(price_diff)
price_diff.head()
[out]
Date
2011-01-10     NaN
2011-01-11    0.38
2011-01-12    1.04
2011-01-13   -0.16
2011-01-14    0.84
The first row is a missing value because there is nothing earlier to take the difference against.

price_diff = price_diff.dropna()

Drop the row with the missing value.
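As an aside, pandas has a built-in shorthand that is equivalent to the shift-and-subtract plus dropna above:

# Equivalent shorthand: diff() computes x[t] - x[t-1] directly
price_diff = price.diff().dropna()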

Plotting the differenced data:


plt.plot(price_diff)
plt.show()

(Figure: the first-order difference series)

Because we have taken differences, the graph now oscillates within a roughly constant range. A series obtained by differencing the original series is called a difference series.

Now let's check the autocorrelation and partial autocorrelation of the difference series. Autocorrelation is the correlation between the series at two points in time. However, it is also influenced by everything that happens between those two points; partial autocorrelation removes the influence of the intermediate lags.

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
sm.graphics.tsa.plot_acf(price_diff, lags=700, ax=ax1)
sm.graphics.tsa.plot_pacf(price_diff, lags=1095, ax=ax2)
plt.show()

(Figure: autocorrelation of the difference series)

The lags where the blue dots stick out of the shaded band are those where the alternative hypothesis that a correlation exists is adopted in the hypothesis test. The autocorrelation becomes less significant as the lag increases.

(Figure: partial autocorrelation of the difference series)

In the partial autocorrelation, there is a large negative correlation at a lag of about 390 days.
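If you want the numbers behind these plots, statsmodels can also return the ACF and PACF values directly (a minimal sketch; the choice of 10 lags is arbitrary):

# Numeric ACF / PACF values for the first few lags of the difference series
acf_vals = sm.tsa.stattools.acf(price_diff, nlags=10)
pacf_vals = sm.tsa.stattools.pacf(price_diff, nlags=10)
print("ACF :", np.round(acf_vals, 3))
print("PACF:", np.round(pacf_vals, 3))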

Since some lags stick out, we will test whether the series is a stationary process, using the ADF test.

# Confirm stationarity with the ADF test
check = sm.tsa.stattools.adfuller(price_diff)
print("p-value: {:.5f}".format(check[1]))

[out]
p-value: 0.00000

The ADF test checks for the presence of a unit root. A process is called a unit root process when it is non-stationary itself but the difference series computed from it is stationary. Since this is a hypothesis test, the hypotheses are: null hypothesis (the one we want to reject) = the series has a unit root; alternative hypothesis (the one we want to affirm) = the series is stationary. We choose between them based on the p-value. By convention, if the p-value is below 5%, the null hypothesis is rejected and the alternative hypothesis is adopted. The 5% here is the significance level, and the strictness of the test can be adjusted through the choice of significance level.

Since the p-value is below 5%, we reject the null hypothesis and conclude that the difference series is stationary. Equivalently, the original price series is a unit root process: it becomes stationary after one differencing.
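As a cross-check (a sketch; I have not verified the exact value on this data), running the same test on the undifferenced price should fail to reject the null hypothesis:

# ADF test on the raw (undifferenced) price: a trending price series is
# expected to have a unit root, so the p-value should come out large.
check_raw = sm.tsa.stattools.adfuller(price)
print("p-value (raw price): {:.5f}".format(check_raw[1]))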

Forecast

Let's move on to forecasting. Before that, prepare the training data and the test data.

# Split the data into training and test sets
train = price_diff["2011-01-11":"2019-12-31"]
test = price_diff["2018-01-01":"2020-01-07"]

plt.plot(train)
plt.plot(test,color="r")
plt.legend(["Train", "Test"])

plt.show()

(Figure: train/test split of the difference series)

This is how the split ETF data looks. For verification purposes, the training and test sets deliberately share an overlapping period, and we will judge accuracy by how closely the predictions match the test data over that overlap.

Next, we will estimate the optimum parameters for tuning the model.

# A function that automatically searches for the optimal SARIMAX parameters by BIC
def selectparameter(DATA, s):
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], s) for x in list(itertools.product(p, d, q))]
    parameters = []
    BICs = np.array([])
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(DATA,
                                                order=param,
                                                seasonal_order=param_seasonal)
                results = mod.fit()
                parameters.append([param, param_seasonal, results.bic])
                BICs = np.append(BICs, results.bic)
            except Exception:
                continue
    return parameters[np.argmin(BICs)]

best_params = selectparameter(price, 12)

# Train the model and predict
sarimax = sm.tsa.statespace.SARIMAX(train, order=best_params[0],
                                    seasonal_order=best_params[1]).fit()
pred = sarimax.predict("2018-02-01", "2019-11-29")

SARIMA models require parameter tuning, but it is tedious to try the combinations one by one, so we automate the search. We use the Bayesian Information Criterion (BIC) for tuning, which trades goodness of fit against model complexity to avoid overfitting. SARIMA is an ARIMA model extended to take periodic (seasonal) fluctuations into account.

best_params[0], best_params[1]
[out]
((1, 1, 0), (0, 0, 0, 12))

The optimal parameters are shown above.
Because the search takes too long on my PC, the range of candidate values was limited to 0 and 1.
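To see what the selected model actually looks like (coefficients, standard errors, information criteria), you can inspect the fitted results object; a quick sketch using the sarimax model fitted above:

# Inspect the fitted model: coefficient estimates and information criteria
print(sarimax.summary())
print("AIC: {:.1f}, BIC: {:.1f}".format(sarimax.aic, sarimax.bic))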

statsmodels also has a function that estimates the order automatically (arma_order_select_ic), but it can estimate only the AR order and the MA order.
Two criteria, AIC and BIC, can be used for the estimation; here we will build a model estimated with Akaike's Information Criterion (AIC).
The parts that cannot be estimated automatically are entered manually.

# Estimate the AR and MA orders with AIC
# The differencing order and seasonal_order are specified manually
def sarimax_aic(data, sp, sd, sq, s):
    param = sm.tsa.arma_order_select_ic(data, ic="aic")
    model = sm.tsa.statespace.SARIMAX(data, order=(param["aic_min_order"][0], 1, param["aic_min_order"][1]),
                                      seasonal_order=(sp, sd, sq, s)).fit()
    return model, param

# Train and predict with the model estimated by AIC
aic_model, aic_param = sarimax_aic(price_diff, 0, 0, 0, 12)
pred_aic = aic_model.predict("2018-02-01", "2019-11-29")

The optimal order estimated by AIC is 'aic_min_order': (4, 2).

Visualize the predictions to verify the accuracy of the prepared model.

# Visualization
plt.figure(figsize=(8,8))
plt.plot(test, color="r")
plt.plot(pred, color="c")
plt.plot(pred_aic, color="y")
plt.legend(["Test", "Prediction", "AIC Prediction"], loc="lower right")
plt.tight_layout()
plt.show()

(Figure: test data vs. predictions from the BIC-tuned and AIC-tuned models)

To verify accuracy, we compare each model's predictions against the test data that was held out as the correct answer. As mentioned above, the comparison covers the overlapping period. The model tuned with BIC follows the overall trend of the price changes, but with a visible gap. I wanted to tune the AIC model under the same conditions as well, but that takes too long, so the manually specified parts reuse the same values as the BIC model; this model, however, was not very accurate. The seasonal period parameter was set to 12, and setting it to 13 instead, based on the partial autocorrelation correlogram, gave the same result. The result was also unchanged when the input data was restricted to a shorter period.
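To put a number on the gap (a minimal sketch of mine, not from the original analysis; the series are first aligned on the dates they share), we can compute the RMSE of each prediction against the test data:

# Quantify the gap: RMSE of each model over the overlapping period.
# Restrict to the dates that the predictions and the test data share.
common = pred.index.intersection(test.index)
rmse_bic = np.sqrt(((pred.loc[common] - test.loc[common]) ** 2).mean())
rmse_aic = np.sqrt(((pred_aic.loc[common] - test.loc[common]) ** 2).mean())
print("RMSE (BIC model): {:.3f}".format(rmse_bic))
print("RMSE (AIC model): {:.3f}".format(rmse_aic))

With the gap quantified, let's also look at the prediction errors as correlograms: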

# Plot correlograms of the model's prediction errors (residuals)
sm.graphics.tsa.plot_acf(sarimax.resid, lags=365*2)
sm.graphics.tsa.plot_pacf(sarimax.resid, lags=365*2)
plt.show()

(Figure: ACF and PACF of the model residuals)

Autocorrelation and periodicity remain among the errors, which means the model has not been able to predict well...
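This visual impression can be backed up with a Ljung-Box test on the residuals (a sketch; the lag choices of 10 and 30 are arbitrary):

# Ljung-Box test: the null hypothesis is "no autocorrelation in the
# residuals", so small p-values confirm that structure remains.
# Recent statsmodels versions return the result as a DataFrame.
lb = sm.stats.acorr_ljungbox(sarimax.resid, lags=[10, 30])
print(lb)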

It is a very disappointing result, but finally let's make the prediction this post set out to make.

Finally, since we want to know the future price movement, we fit the model with the estimated parameters on the entire data period and let it forecast beyond it.

# Forecast beyond the period given to the model
sarimax = sm.tsa.statespace.SARIMAX(price_diff, order=(1, 1, 0),
                                    seasonal_order=(0, 0, 0, 12)).fit()
forecast = sarimax.forecast(steps=10)

# Create a single subplot for the graph
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1, 1, 1)

# Put the dates of the forecast period on the x-axis and the predicted values on the y-axis
days = mdates.DayLocator()
daysFmt = mdates.DateFormatter('%m-%d')
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
# The forecast starts on the first business day after the last observation
x = pd.date_range(price_diff.index[-1], periods=len(forecast) + 1, freq="B")[1:]
y = forecast
ax.plot(x, y)
ax.set_xlabel("Date")
ax.set_ylabel("USD")

plt.tight_layout()
plt.show()

(Figure: forecast of the difference series for the next 10 business days)

Using forecast(), you can predict beyond the end of the period used for training. We pass steps=10, and since the data are business-day observations, this gives the 10 business days ahead. The horizontal axis shows the dates of the forecast period and the vertical axis the predicted value in US dollars. Note that the predicted values are differences, because the model was fitted to the difference series. The result suggests the price peaks around the current level, then declines and levels off. However, the accuracy of the model itself is not high, so it cannot be used for trading.
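Since the forecast values are day-over-day differences, converting them back to price levels just takes a cumulative sum on top of the last observed closing price (a minimal sketch, assuming the forecast object above):

# Undo the differencing: cumulatively sum the forecast differences and
# add the last observed closing price to recover predicted price levels.
last_price = price.iloc[-1]
price_level_forecast = last_price + forecast.cumsum()
print(price_level_forecast)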

Summary

Tuning the model (especially the computation time) was the biggest challenge. seasonal_order is the parameter for the effect of seasonal fluctuations, and I wanted to tune it as well, but I gave up because setting its components to 1 or more would take too long. I also wanted to widen the range of candidate parameter values, but gave up for the same reason... I struggled with the mathematical side as well, so I bought Professor Okimoto's book, "Econometric Time Series Analysis of Economic and Financial Data". I have only just started reading it, but I recommend it because it clearly explains what you need for time series analysis.

Thank you for reading to the end.
