Python: Time Series Analysis: Building a SARIMA Model

Construction of SARIMA model (1)

Review of ARIMA model

The ARIMA model is a model that is influenced by previous values With an AR model AR (p) that correlates with the last p values A MA model MA (q) that is affected by the previous q values in a model affected by the previous error was synthesized. ARMA (p, q) was adapted to the difference series before the d point.

And What is the SARIMA model? ARIMA model for time series data with seasonal cycle It is a model that can be expanded.

The SARIMA model has (p, d, q) parameters in addition to It also has the parameters (sp, sd, sq, s).

What are sp, sd, sq?

sp :Seasonal autocorrelation
sd :Derivation of seasonality
sq :Seasonal moving average
It's called.
By the way, ARIMA(p,d,q)Of the model

・ P:This is called autocorrelation, and whether the model is predicted using the last p values.

・ D:It is called induction, and the fact that the d-th order difference was necessary to make the time series data steady.

・ Q:The moving average is that the model is affected by the last q values.

It represented.

The basic meanings of sp, sd, and sq do not change. However For sp, sd, sq, the current data is influenced by historical data that has passed one or more seasonal periods.

For example, in the case of data with seasonal fluctuations of 12-month cycle, the parameter of s represents the cycle. It becomes s = 12.

If sq = 1, data for exactly 12 months (1 cycle ago) If sq = 2, it means that the model is affected by the data 12 months ago and the data 24 months ago.

If it's hard to understand, simply replace q with sq.

Parameter determination

SARIMA model parameters, (p, d, q) (sp, sd, sq, s) for Python There is no function that automatically makes it the most appropriate.

Therefore, the information criterion(BIC in this case(Bayesian Information Criterion) )By
You have to write a program to find out which value is most appropriate.

I won't go into the information criterion this time, For BIC, understand that the lower this value, the more appropriate the parameter value.

#Data in chronological order:DATA,Parameters s(period):Enter s to output the best parameters and their BIC.

def selectparameter(DATA,s):
    p = d = q = range(0, 1)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    parameters = []
    BICs = np.array([])
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(DATA,
                                            order=param,
                seasonal_order=param_seasonal)
                results = mod.fit()
                parameters.append([param, param_seasonal, results.bic])
                BICs = np.append(y,results.bic)
            except:
                continue
    return print(parameters[np.argmin(BICs)])

To elaborate on the contents of the function, each parameter is For 0,1 (this time, the upper limit of the parameter is set to 1 for simplicity) The BIC of the SARIMA model is calculated and the case where the BIC is the smallest is displayed.

However, regarding the parameter s, time series data and Let's examine it by visualizing the partial autocorrelation described below.

import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm

#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31","1995-07-31",freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]

#The data is for one year
sales_sparkling = sales_sparkling[:12]

#Function definition
def selectparameter(DATA,s):
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], s) for x in list(itertools.product(p, d, q))]
    parameters = []
    BICs = np.array([])
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(DATA,
                                            order=param,
                                            seasonal_order=param_seasonal)
                results = mod.fit()
                parameters.append([param, param_seasonal, results.bic])
                BICs = np.append(BICs,results.bic)
            except:
                continue
    return parameters[np.argmin(BICs)]

#Put the cycle in the second argument
selectparameter(sales_sparkling,12)

Autocorrelation coefficient / partial autocorrelation coefficient and its visualization

Autocorrelation was the correlation with one's own past data. Another way to analyze time series data

The value of partial autocorrelation is important.

Speaking of k-th order autocorrelation, it represents the correlation between yt and yt−k. The k-order partial autocorrelation is the data between yt and yt−k, that is, The correlation is obtained by removing the influence of yt−k + 1 from yt−1.

I will explain in detail. Suppose you find the autocorrelation coefficient for a 7-day difference. However, if the data of one day correlates with the data of the previous day, 7 days ago → 6 days ago → 5 days ago ... It may be correlated through the data from 1 day ago to today.

The correlation obtained by removing the influence during this period is called partial autocorrelation.

Let's visualize the autocorrelation coefficient and the partial autocorrelation coefficient using Python. Generally, if the monthly data is seasonal, the cycle will be 12.

#The graph of autocorrelation coefficient is
sm.graphics.tsa.plot_acf(DATA)

#The graph of partial autocorrelation is
sm.graphics.tsa.plot_pacf(DATA)

#You can output with.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime


#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]
#Visualization of autocorrelation / partial autocorrelation
fig=plt.figure(figsize=(12, 8))
#Outputs a graph of autocorrelation coefficient
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(sales_sparkling, lags=30, ax=ax1)
#Outputs a graph of partial autocorrelation coefficient
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(sales_sparkling, lags=30, ax=ax2)
plt.show()

image.png

Construction of SARIMA model (2)

Model building

Here, I will summarize the procedure for analyzing time series data using the SARIMA model.

  1. Read data

  2. Data organization

  3. Data visualization

  4. Understanding the data cycle (determining the parameter s)

  5. Determining parameters other than s

  6. Build a model

  7. Forecasting with data and its visualization

It is done in the flow.

Finally, in this section, you will learn from model building to prediction. Once you know the optimal (p, d, q), (sp, sd, sq, s), it's time to build the model. This time, we are using data for 15 years (180 months).

#To build a model
sm.tsa.statespace.SARIMAX(DATA,order=(p, d, q),seasonal_order=(sp, sd, sq, s)).fit()
#Is used.

Let's build a SARIMA model of sparkling wine sales data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime


#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]

#Model fit
SARIMA_sparkring_sales = sm.tsa.statespace.SARIMAX(sales_sparkling,order=(0, 0, 0),seasonal_order=(0, 1, 1, 12)).fit()


#Output the BIC of the constructed SARIMA model
print(SARIMA_sparkring_sales.bic)

Forecast execution and visualization of forecast data

Getting predictive data for the model is easy

Model name.predict("At the start of forecast","At the end of the forecast")Is used.

However, the "start of forecast" must be the time in the original time series data. For example, the sales data for the sparkling wine used this time must be before 1995-07-31.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime


#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
print(sales_sparkling.head())
print(sales_sparkling.tail())
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]

#Model fit
SARIMA_sparkring_sales = sm.tsa.statespace.SARIMAX(sales_sparkling,order=(0, 0, 0),seasonal_order=(0, 1, 1, 12)).fit()

#Substitute prediction data for pred
pred = SARIMA_sparkring_sales.predict("1994-07-31","1997-12-31")

#Visualization of pred data
plt.plot(pred)
plt.show()

Comparison of actual data and forecast data

Next, instead of outputting only forecast data Let's compare the original time series data and forecast data by outputting them at the same time.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime
import numpy as np


#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]

#Model fit
SARIMA_sparkring_sales = sm.tsa.statespace.SARIMAX(sales_sparkling,order=(0, 0, 0),seasonal_order=(0, 1, 1, 12)).fit()

#Substitute prediction data for pred
pred = SARIMA_sparkring_sales.predict("1994-7-31", "1997-12-31")

#Visualization of pred data and original time series data
plt.plot(sales_sparkling)
plt.plot(pred, color="r")
plt.show()

image.png

Recommended Posts

Python: Time Series Analysis: Building a SARIMA Model
Python: Time Series Analysis
Time series analysis 4 Construction of SARIMA model
Python: Time Series Analysis: Stationarity, ARMA / ARIMA Model
Python: Time Series Analysis: Preprocessing Time Series Data
Time series analysis 2 Stationary, ARMA / ARIMA model
I tried time series analysis! (AR model)
Time series analysis using a general Gaussian state-space model using Python [Implementation example considering exogenous and seasonality]
Python time series question
RNN_LSTM1 Time series analysis
Time series analysis 1 Basics
Introduction to Time Series Analysis ~ Seasonal Adjustment Model ~ Implemented in R and Python
Python 3.4 Create Windows7-64bit environment (for financial time series analysis)
Time series analysis related memo
Building a Python virtual environment
Time series analysis part 4 VAR
Time series analysis Part 3 Forecast
[Python] Plot time series data
Time series analysis Part 1 Autocorrelation
Building a Python virtual environment
I made a package to filter time series with python
A study method for beginners to learn time series analysis
Building a Python environment on Mac
Building a Python environment on Ubuntu
Time series analysis practice sales forecast
Building a virtual environment with Python 3
Time series plot started ~ python edition ~
Time series analysis 3 Preprocessing of time series data
[Statistics] [Time series analysis] Plot the ARMA model and grasp the tendency.
[Pyenv] Building a python environment with ubuntu 16.04
Building a Python3 environment with Amazon Linux2
Power of forecasting methods in time series data analysis Semi-optimization (SARIMA) [Memo]
Building a Python 3.6 environment with Windows + PowerShell
Time series analysis Part 2 AR / MA / ARMA
[Mac] Building a virtual environment for Python
[Python] Accelerates loading of time series CSV
Building a Python development environment for AI development
Time series analysis # 6 Spurious regression and cointegration
python: Creating a ramen timer (pyttsx3, time)
Get time series data from k-db.com in Python
Time variation analysis of black holes using python
Building a python environment with virtualenv and direnv
[Python3] A story stuck with time zone conversion
Building a Python environment with WLS2 + Anaconda + PyCharm
A clever way to time processing in Python
Survival time analysis learned in Python 2 -Kaplan-Meier estimator
Data analysis in Python: A note about line_profiler
[Python] Web development preparation (building a virtual environment)
Think about building a Python 3 environment in a Mac environment
Create a simple momentum investment model in Python
Building a Python environment on a Sakura VPS server
A well-prepared record of data analysis in Python
First time python
I implemented "Basics of Time Series Analysis and State Space Model" (Hayamoto) with pystan
[Introduction to element decomposition] Let's arrange time series analysis methods in R and python ♬
Data analysis python
Time Series Decomposition
python time measurement
First time python
Predict from various data in Python using Facebook Prophet, a time series prediction tool
[Python] Implementation of clustering using a mixed Gaussian model