[PYTHON] A beginner who has been programming for 2 months tried to analyze the real GDP of Japan in time series with the SARIMA model.

[Introduction]

Thank you for taking the time to read this article!

Let me start by introducing myself. I am a working adult who enjoys learning Python in my spare time.

Far from knowing how to program, I was barely familiar with PCs, but I started studying Python on September 1.

I started with Progate, PyQ, and Aidemy, so it has been about two months since I began programming.

Having finished Aidemy's data analysis course, I wanted to put what I learned into practice, so I decided to write this article.

Who is this article for?

It is for programming beginners who want to learn programming but cannot set aside much time for it because of work or school. As I wrote above, I am a beginner myself, so I would be happy if you used this article as one sample of how much can be done in about two months.

Environment

Python 3, MacBook Air, Jupyter Notebook

[Main text]

Goal

Create a SARIMA model (a type of time series model) that predicts Japan's GDP, and plot the actual and predicted values in a graph.

Libraries used

import csv
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from datetime import datetime
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools

Data set used

Real GDP from e-Stat (the portal site for Japanese government statistics)

Procedure

__1. Pre-process the data__ Since the goal is clear, we work backward from it and process the raw data accordingly.

#①
df = pd.read_csv('gaku-jg2022 (1).csv',encoding="shift-jis")
df = df.drop(range(0,6)) #Erase unnecessary lines
df = df.drop([110,111,112]) 
df = df.drop(df.columns[range(2, 30)], axis=1) #Erase unnecessary columns
df = df.reset_index(drop=True) #Renumber lines
df = df.rename(columns={'Substantial original series': 'Date'}) #Retitle the column
df = df.rename(columns={'Unnamed: 1': 'RealGDP'})

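#②
#Process the data in the Date column
j = 1994  #year of the first observation
k = 0     #number of quarters processed so far
for i in range(len(df["Date"])):
    df.loc[i,"Date"] = j
    k += 1
    if k%4 == 0:  #move to the next year after every four quarters
        j += 1

index = pd.date_range("1994","2020",freq = "Q")  #quarter-end dates from 1994-03-31 to 2019-12-31
df.index = index
del df["Date"]  #the time information now lives in the index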

#③
#Process Real GDP
i = 0
for x in df["RealGDP"]:
    x = x.replace(',', '')
    df.iloc[i,0] = float(x)
    i += 1

In (1), we drop the columns we do not need from the raw data (such as private final consumption expenditure) and the rows that contain non-numeric values such as variable names and blanks. The data for FY2020 looks odd, so it is dropped as well. In (2), the existing values in the Date column are overwritten as a preliminary step, and the time information is then turned into the index (everything from the line that defines the variable index onward). In (3), the values in the GDP column are strings that contain ",", so they are converted to float for plotting.
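The conversion in (3) can also be written without an explicit loop. The following is a minimal sketch, assuming a small made-up column of comma-formatted strings (the values and the names raw and gdp are illustrative, not taken from the e-Stat file). It uses quarter-start dates (freq="QS") rather than the quarter-end dates used above.

```python
import pandas as pd

# Hypothetical sample: GDP-like figures stored as strings with thousands separators.
raw = pd.Series(["105,234.5", "106,872.1", "104,990.0", "108,311.7", "107,045.2"])

# Remove the commas in one vectorized step, then convert the column to float.
gdp = raw.str.replace(",", "", regex=False).astype(float)

# Attach a quarterly DatetimeIndex that starts in the first quarter of 1994.
gdp.index = pd.date_range(start="1994-01-01", periods=len(gdp), freq="QS")

print(gdp)
```

Passing periods=len(gdp) instead of an end date keeps the index length tied to the data, so the assignment cannot fail because of a length mismatch.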

__2. Display the graph__ The data processed in step 1 can be plotted with the following code.

#Represent the data as a line graph
#Set the title of the graph
plt.title("quarterly-RealGDP_in_Japan")
#Graph x-axis and y-axis naming
plt.xlabel("date")
plt.ylabel("GDP")
#Data plot
plt.plot(df)
plt.show()

GDP.png

The horizontal axis is time and the vertical axis is GDP. The series swings up and down in the short term and tends to rise in the long term. As an exception, GDP drops sharply around 2008 because of the Lehman shock. This is about as far as I can interpret the graph myself, but what pattern will the machine read from this data, and what kind of prediction will it make? I'm looking forward to it!
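One way to make the long-term tendency easier to see is to average each run of four consecutive quarters, which cancels out a pattern that repeats every year. The sketch below uses a synthetic series (a straight-line trend plus a made-up quarterly pattern), not the GDP data.

```python
import numpy as np

# Synthetic quarterly series: linear trend plus a seasonal pattern that sums to zero.
trend = np.arange(16, dtype=float)
seasonal = np.tile([2.0, -1.0, -2.0, 1.0], 4)
series = trend + seasonal

# Mean of every 4 consecutive quarters; mode="valid" keeps only full windows.
window = np.ones(4) / 4
smoothed = np.convolve(series, window, mode="valid")

print(smoothed)
```

On the real data, the same idea should be available as df.rolling(4).mean() once the values have been converted to float.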

__3. Determine the parameters__ The SARIMA model requires seven parameters: one is determined by looking at the graph, and the other six are found with a function.

The first is the seasonal period s, which is the number of observations it takes for a repeating pattern in the data to complete one cycle. Looking at the graph above, the up-and-down movement repeats four times in four years, so one cycle of the pattern takes one year. Since this is quarterly data, four observations correspond to one year, which gives s = 4.

Next, the remaining six are output by the following function.
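As a side check on the choice of s, the period read off the graph can be compared with the autocorrelation of the series. This is a minimal sketch on synthetic data (a linear trend plus a made-up pattern that repeats every four points). The helper autocorr is a hypothetical name introduced for illustration.

```python
import numpy as np

# Synthetic quarterly series: linear trend plus a pattern that repeats every 4 points.
n = 40
series = 0.5 * np.arange(n) + np.tile([2.0, -1.0, -2.0, 1.0], n // 4)

# Difference once to remove the trend before measuring autocorrelation.
diffed = np.diff(series)

def autocorr(x, lag):
    """Pearson correlation between x and a copy of itself shifted by lag."""
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

# The lag with the strongest positive correlation suggests the seasonal period.
for lag in range(1, 9):
    print(lag, round(autocorr(diffed, lag), 3))
```

For this synthetic series the correlation peaks at lags 4 and 8, which is what a period of four quarters would look like. Real data is noisier, so the peaks would be lower.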

#Determine the parameters of the SARIMA model
def selectparameter(DATA,s):
    #Candidate values 0 and 1 for each of p, d, q (and for P, D, Q)
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    #The seasonal period is fixed at 12 here; the argument s is not used
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    parameters = []
    BICs = np.array([])
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(DATA,
                                                order=param,
                                                seasonal_order=param_seasonal)
                results = mod.fit()
                parameters.append([param, param_seasonal, results.bic])
                BICs = np.append(BICs,results.bic)
            except Exception:
                #Skip combinations that fail to fit
                continue
    #Print the combination with the smallest BIC
    print(parameters[np.argmin(BICs)])

#RealGDP was already converted to float in step 1, so it can be passed to the function directly

selectparameter(df["RealGDP"].values.astype(float), 4)

[(0, 1, 0), (1, 1, 0, 12), 1641.6840970980422]
CPU times: user 19.1 s, sys: 8.17 s, total: 27.3 s
Wall time: 14.2 s

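From this output, the parameters are set to order (0, 1, 0) and seasonal order (1, 1, 0, 12). Note that the seasonal period in this result is 12 rather than the s = 4 derived from the graph, because the search function fixes it at 12.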
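The search tries every combination of 0 and 1 for the six parameters. The sketch below only enumerates those candidates and does not fit any model. candidate_orders is a hypothetical helper, and unlike the function above it passes s through to the seasonal order (here s = 4, the value derived from the graph). What the search would select with that period is not shown in this article.

```python
import itertools

def candidate_orders(s):
    """Return every (order, seasonal_order) pair with p, d, q, P, D, Q in {0, 1}."""
    values = range(0, 2)
    pdq = list(itertools.product(values, repeat=3))
    # Use the caller's seasonal period instead of a fixed number.
    seasonal_pdq = [(P, D, Q, s) for P, D, Q in pdq]
    return list(itertools.product(pdq, seasonal_pdq))

candidates = candidate_orders(4)
print(len(candidates))  # number of models a full grid search would fit
print(candidates[0])    # first (order, seasonal_order) pair
```

Each pair could then be passed to SARIMAX as order and seasonal_order, keeping the one with the smallest BIC as in the function above.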

__4. Fit the model and predict__

#Model fit
SARIMA_df = sm.tsa.statespace.SARIMAX(df.astype("float64"),order=(0, 1, 0),seasonal_order=(1, 1, 0, 12)).fit()

#Assign the predictions to pred
pred = SARIMA_df.predict("2015-03-31", "2022-12-31")

#Visualization of pred data and original time series data
plt.plot(df)
plt.plot(pred, color="r")
plt.show()

This predicts GDP from March 31, 2015 to December 31, 2022 and plots the actual values in blue and the predicted values in red. The graph looks like this:

GDPpred.png

Since the blue and red lines overlap quite closely, the prediction can be said to be good.

However, the impact of the new coronavirus on the economy is not taken into account, so the predicted values from that point on can be expected to differ considerably from the actual values. I will wait for the actual figures to come in.
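The visual overlap can be backed up with a number. The following is a minimal sketch, assuming two pandas Series indexed by date. forecast_errors is a hypothetical helper, and the values are made up for illustration, not results from the model. Also, as far as I understand, predict with its default settings returns one-step-ahead in-sample predictions for dates that were part of the fitted data, so a close overlap there says less about forecasting ability than a comparison against held-out data would.

```python
import numpy as np
import pandas as pd

def forecast_errors(actual, predicted):
    """Return (RMSE, MAPE in percent) over the dates both series share."""
    common = actual.index.intersection(predicted.index)
    a = actual.loc[common].astype(float)
    p = predicted.loc[common].astype(float)
    rmse = float(np.sqrt(((a - p) ** 2).mean()))
    mape = float(((a - p).abs() / a.abs()).mean() * 100)
    return rmse, mape

# Made-up example: the prediction extends two quarters beyond the actual data.
idx_actual = pd.date_range("2019-01-01", periods=4, freq="QS")
idx_pred = pd.date_range("2019-01-01", periods=6, freq="QS")
actual = pd.Series([100.0, 102.0, 104.0, 106.0], index=idx_actual)
predicted = pd.Series([101.0, 101.0, 105.0, 105.0, 107.0, 108.0], index=idx_pred)

rmse, mape = forecast_errors(actual, predicted)
print(rmse, mape)
```

Restricting the comparison to the shared dates matters here because the prediction runs past the end of the observed data.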

[Conclusion]

Even though I do not have a rigorous understanding yet, I am a little moved that I can now have a machine learn and make predictions, considering that two months ago I could not even touch-type.

I still have other Aidemy courses left, so I plan to study them and come back here to write up something other than data analysis.

Thank you very much for reading to the end!

References

e-Stat
Aidemy Data Analysis Course
Population Trends in Japan by Machine Learning
[Big data analysis method and "SARIMA model" that predict the future](https://deepage.net/bigdata/2016/10/22/bigdata-analytics.html#sarima%E3%83%A2%E3%83%87%E3%83%AB)
