[PYTHON] Challenge to future sales forecast: ③ PyFlux parameter tuning

Introduction

Last time, in "Challenge to future sales forecast: ② Time series analysis using PyFlux", I built ARIMA and ARIMAX models using PyFlux.

However, the accuracy was not very good. I had been picking parameters such as the AR and MA orders by trial and error, so perhaps that was the problem. Still, being able to say "statistically, this is the right choice!" is a high hurdle for me.

So I looked for something like grid search in scikit-learn. I found that "[Predict the transition of TV Asahi's viewing rate with the SARIMA model](https://qiita.com/mshinoda88/items/749131478bfefc9bf365#sarima%E3%83%A2%E3%83%87%E3%83%AB%E5%AD%A3%E7%AF%80%E8%87%AA%E5%B7%B1%E5%9B%9E%E5%B8%B0%E5%92%8C%E5%88%86%E7%A7%BB%E5%8B%95%E5%B9%B3%E5%9D%87%E3%83%A2%E3%83%87%E3%83%AB)" implements parameter tuning for time series analysis with StatsModels, so I wrote my own version with reference to it.

Analytical environment

Google Colaboratory

Target data

As in the previous article, the data consists of daily sales, with temperature (average, maximum, minimum) as explanatory variables.

| date | Sales amount | Average temperature | Highest temperature | Lowest Temperature |
|---|---|---|---|---|
| 2018-01-01 | 7,400,000 | 4.9 | 7.3 | 2.2 |
| 2018-01-02 | 6,800,000 | 4.0 | 8.0 | 0.0 |
| 2018-01-03 | 5,000,000 | 3.6 | 4.5 | 2.7 |
| 2018-01-04 | 7,800,000 | 5.6 | 10.0 | 2.6 |

Parameter tuning

The following is the program from the previous article. The parameters to tune are ar, ma, and integ.

import pyflux as pf

model = pf.ARIMA(data=df, ar=5, ma=5, integ=1, target='Sales amount', family=pf.Normal())
x = model.fit('MLE')

I call it parameter tuning, but each parameter is simply an integer, so I just loop over the candidate values.


import pandas as pd
import pyflux as pf


def optimisation_arima(df, target):
  # Exclusive upper bounds: each order is tried for 0, 1, 2, 3
  max_p = 4
  max_d = 4
  max_q = 4

  rows = []
  for p in range(0, max_p):
    for d in range(0, max_d):
      for q in range(0, max_q):
        # Fit ARIMA(p, d, q) by maximum likelihood
        model = pf.ARIMA(data=df, ar=p, ma=q, integ=d, target=target, family=pf.Normal())
        x = model.fit('MLE')

        print("AR:",p, " I:",d, " MA:",q, " AIC:", x.aic)
        rows.append({'p': p, 'd': d, 'q': q, 'aic': x.aic})

  # Build the table once at the end (DataFrame.append was removed in pandas 2.0)
  return pd.DataFrame(rows, columns=['p','d','q','aic'])
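
As a side note, the three nested loops can also be expressed with itertools.product from the standard library. The following is only a sketch of the enumeration step, using the same search ranges as the function above; it does not fit any model, and the variable name grid is my own choice for illustration.

```python
from itertools import product

# Same search ranges as in the article: each order runs over 0, 1, 2, 3
max_p, max_d, max_q = 4, 4, 4

# product() yields (p, d, q) tuples in the same order as the nested loops:
# q varies fastest, then d, then p
grid = list(product(range(max_p), range(max_d), range(max_q)))

print(len(grid))   # number of candidate models to fit
print(grid[:5])    # first few (p, d, q) combinations
```

Each tuple in grid could then be passed to pf.ARIMA in a single loop, which keeps the indentation shallow if more parameters are added later.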

Now call it like this:

df_output = optimisation_arima(df, "Sales amount")

The results are printed as follows (excerpt). PyFlux offers several evaluation criteria, but here I use AIC (the smaller the value, the better the model).

AR: 0  I: 0  MA: 0  AIC: 11356.163772323638
AR: 0  I: 0  MA: 1  AIC: 11262.28357561013
AR: 0  I: 0  MA: 2  AIC: 11218.453940684196
AR: 0  I: 0  MA: 3  AIC: 11171.121950637687
AR: 0  I: 1  MA: 0  AIC: 11462.586538415879
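
For reference, here is a small sketch of why a smaller AIC is preferred. It assumes the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; how PyFlux computes x.aic internally is not covered in this article. The function name aic and the numbers below are hypothetical and chosen only to illustrate the trade-off.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion under the standard definition: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical values, not taken from the sales data
simple_model = aic(log_likelihood=-5600.0, n_params=3)
complex_model = aic(log_likelihood=-5599.0, n_params=8)

print(simple_model, complex_model)
# The complex model fits slightly better (higher log-likelihood),
# but its extra parameters are penalized, so its AIC is larger
print(simple_model < complex_model)
```

In other words, AIC rewards a better fit but charges a cost for every additional AR or MA term, which is why it is a reasonable single number to compare across the grid.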

Therefore, the AR / I / MA combination with the lowest AIC can be selected as the optimal set of parameters.

df_output[df_output.aic == df_output.aic.min()]
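
As an alternative sketch, pandas can return the best row directly with idxmin, or the few best candidates with nsmallest. The example below rebuilds only the five rows printed above (AIC rounded to two decimals), so the "best" row here is the best of that excerpt, not of the full 64-model search.

```python
import pandas as pd

# The five rows shown in the printed output (AIC rounded to two decimals)
df_output = pd.DataFrame(
    [
        (0, 0, 0, 11356.16),
        (0, 0, 1, 11262.28),
        (0, 0, 2, 11218.45),
        (0, 0, 3, 11171.12),
        (0, 1, 0, 11462.59),
    ],
    columns=["p", "d", "q", "aic"],
)

# Row with the single lowest AIC
best = df_output.loc[df_output["aic"].idxmin()]

# The three lowest-AIC rows, useful for comparing near-ties
top3 = df_output.nsmallest(3, "aic")

print(best)
print(top3)
```

Looking at the top few rows rather than only the minimum can help when several combinations have almost the same AIC.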

Conclusion

Since the accuracy in the previous article was so poor, I tried brute-force parameter tuning.

However, the accuracy of the resulting forecast did not improve.

I need to think about the next improvement plan.
