[PYTHON] Comparison of time series data predictions between SARIMA and Prophet models

Hello. I'm Sugato from Future Search Brazil.

This topic has probably been rehashed many times already, but today I would like to write about forecasting time series data.

1. Introduction

Time series forecasting has a reputation for not being very usable in practice, so I wanted to see how accurate it really is and whether it can be used in real work.

What I tried, specifically:

**Get the daily follower count on Twitter**
I will do my best with ~~that stingy~~ the API.

**Try to predict my Twitter follower count**

(1) Prediction with the SARIMA model
・ [Combining neural network model with seasonal time series ARIMA model] https://www.sciencedirect.com/science/article/pii/S004016250000113X
・ [Analysis of time series data with SARIMA (prediction of PV number)] https://www.kumilog.net/entry/sarima-pv @xkumiyu

(2) Prediction with the Prophet model
・ [Prophet Official] https://facebook.github.io/prophet/docs/quick_start.html
・ [Time Series Analysis Library Prophet Official Document Translation 1 (Overview & Features)] https://qiita.com/japanesebonobo/items/96868e58d4da42d36807 @japanesebonobo

What this article covers

Predicting a follower count that drops day by day because I never tweet only makes the pain run deeper. To give the conclusion first: the follower count will keep decreasing, and there is no prospect of an increase.

2. Environment

3. Preparation

The daily follower data looks like this. I can hardly bear to look at it. (https://twitter.com/Ndtn_/) http://web.sfc.wide.ad.jp/~nadechin/follower.csv

date        follower
2018/9/6    39,569
2018/9/7    39,570
2018/9/8    39,573
   .           .
   .           .
   .           .
2019/12/10  37,861
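
For reference, a file in this shape could be loaded with pandas roughly as follows. This is only a sketch under assumptions: the first three rows are inlined instead of reading the real follower.csv, the counts are assumed to be quoted strings with a thousands separator (which the `replace(',', '')` call used later for Prophet suggests), and the delimiter of the real file is not confirmed.

```python
import io
import pandas as pd

# Inline sample standing in for follower.csv (assumed layout, first three rows only).
csv_text = 'date,follower\n2018/9/6,"39,569"\n2018/9/7,"39,570"\n2018/9/8,"39,573"\n'

# thousands="," turns the quoted string "39,569" into the integer 39569 while parsing.
df = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Use the date column as a DatetimeIndex so the series can be sliced by date later.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

print(df)
```

With a DatetimeIndex in place, the later steps can refer to the series as `df.follower`.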

4. Processing time series data

Split the data into training data and test data. Either pandas or numpy will do; for now the split is:
・ 2018/09/06 ~ 2019/12/10: original data
・ 2018/09/06 ~ 2019/11/30: training data
・ 2019/12/01 ~ 2019/12/10: test data
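
This split can be sketched with pandas label-based slicing. It is a minimal illustration, not the author's actual code: the names `df`, `train`, and `test` and the placeholder follower values are assumptions, and only the date ranges come from the article.

```python
import pandas as pd

# Placeholder data standing in for follower.csv (values are NOT the real counts).
idx = pd.date_range("2018-09-06", "2019-12-10", freq="D")
df = pd.DataFrame({"follower": range(len(idx))}, index=idx)

# Label-based slicing on a DatetimeIndex includes both endpoints.
train = df.loc["2018-09-06":"2019-11-30", "follower"]
test = df.loc["2019-12-01":"2019-12-10", "follower"]

print(len(df), len(train), len(test))
```

Keeping the test period strictly after the training period matters for time series: a random shuffle would leak future values into training.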

Check the stationarity of the data with the ADF test.
・ [statsmodels.tsa.stattools.adfuller] http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
・ [Null hypothesis, significance level] http://www.gen-info.osaka-u.ac.jp/MEPHAS/express/express11.html

import statsmodels.api as sm
res = sm.tsa.stattools.adfuller(df.follower)  # res[1] is the p-value


The output result is as follows

p-value = 0.9774


⇨p-value  >  0.05

Since the p-value exceeds 0.05, the null hypothesis of a unit root cannot be rejected, so the series cannot be considered stationary. To make it stationary, take the difference and remove the seasonal component.

predict.py


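from plotly.graph_objs import Scatter  # assumed: Scatter is plotly's trace class
data = [Scatter(x=df.index, y=df.follower.diff())]  # first difference of the series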

Next, remove the seasonal component.

predict.py


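res = sm.tsa.seasonal_decompose(df.follower)  # assumed: res here is a seasonal decomposition, not the ADF result
data = [Scatter(x=df.index, y=df.follower - res.seasonal)]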

Run the ADF test again on the processed data.

p-value = 1.109e-25


⇨p-value  <  0.05

Since the p-value is now below 0.05, the processed series can be treated as stationary.

5. Building the models

For the SARIMA model, build a model from the training data as follows.

predict.py


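# coding:utf-8
from statsmodels.tsa.statespace.sarimax import SARIMAX

# train is the training series; p, d, q, sp, sd, sq, s are orders you choose
model = SARIMAX(
    train,
    order=(p, d, q),                 # non-seasonal (AR, differencing, MA) orders
    seasonal_order=(sp, sd, sq, s),  # seasonal orders and the period s
    enforce_stationarity=False,
    enforce_invertibility=False)
result = model.fit()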

Here, order=(p, d, q) gives the parameters of the ARIMA part, and seasonal_order=(sp, sd, sq, s) gives the seasonal parameters, where s is the length of the seasonal period.

See the following:
・ [statsmodels.tsa.statespace.sarimax.SARIMAX] https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
・ [Analysis of time series data with SARIMA (prediction of PV number)] https://www.kumilog.net/entry/sarima-pv @xkumiyu

Next, create a Prophet model.

Prophet builds a model as soon as you feed it the training data. It delivers the experience of "I don't know what I'm doing, but I got something that looks like a forecast." Starting today, anyone can become a data scientist with two seconds of copy and paste.

predict.py


# coding:utf-8
import pandas as pd
import numpy as np
from fbprophet import Prophet  # renamed to `prophet` from version 1.0

data = pd.read_csv('follower.csv')
# Strip the thousands separator and convert to int
data.follower = data.follower.apply(lambda x: int(x.replace(',', '')))
# Prophet requires the columns to be named 'ds' and 'y'
data = data.rename(columns={'date': 'ds', 'follower': 'y'})
model = Prophet()
model.fit(data)

6. Forecasting time series data

・ SARIMA model

Predictions for the test period from the SARIMA model:

2019-12-01  38002.878685
2019-12-02  38001.204647
2019-12-03  37998.080676
2019-12-04  37988.324131
2019-12-05  37981.134367
2019-12-06  37974.569498
2019-12-07  37966.333432
2019-12-08  37958.270232
2019-12-09  37956.258566
2019-12-10  37952.875398

・ Prophet model

Predictions for the test period from the Prophet model:

2019-12-01  37958.337506
2019-12-02  37959.963661
2019-12-03  37957.304699
2019-12-04  37943.272430
2019-12-05  37934.533210
2019-12-06  37920.537811
2019-12-07  37908.529618
2019-12-08  37905.819057
2019-12-09  37907.445213
2019-12-10  37904.786251

Numbers alone look a bit bleak, so let's plot them.

[Overall view] Figure_11.png

[Prediction part] Figure_12.png

[Enlarged view of the predicted part] Figure_13.png

7. What I found

Let's look at the forecast data for the day after the last day of the training data.

date, follower

# Actual data
2019-12-01, 38003.000000

# SARIMA
2019-12-01, 38002.878685

# Prophet
2019-12-01, 37958.337506

As the [Enlarged view of the predicted part] shows, the SARIMA prediction for the day right after the training data almost matches the actual value (38002.88 against 38003). SARIMA appears well suited to predicting the very next time point after the training data.

Prophet, honestly, was underwhelming: it was off by about 45 followers on the same day.
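
The gap can be quantified directly from the numbers listed above. A minimal sketch (the variable names are mine; the values are copied from this section):

```python
# Values for 2019-12-01, copied from the listing above.
actual = 38003.0
predictions = {"SARIMA": 38002.878685, "Prophet": 37958.337506}

# Absolute error of each model for the first day after the training data.
errors = {name: abs(value - actual) for name, value in predictions.items()}

for name, err in errors.items():
    print(f"{name}: absolute error = {err:.2f}")
# SARIMA: absolute error = 0.12
# Prophet: absolute error = 44.66
```

A single day is of course a very small sample; with the actual values for the whole test period, the same loop could report MAE or RMSE instead.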

8. Let's forecast just one day ahead

I suspected that training on data up to 2019/12/09 and predicting only 2019/12/10 might work surprisingly well, so I tried it.

The results are below.
Figure_15.png

date, follower

# Actual data
2019-12-10, 37861.000000

# SARIMA
2019-12-10, 37868.158032

This looks good: the error is only about 7 followers. For a forecast of just one day ahead, SARIMA seems to reach a fairly practical level of accuracy.

As I keep saying, Prophet was honestly underwhelming.
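
To put the benefit of the shorter horizon in numbers, the sketch below compares SARIMA's error for 2019-12-10 when forecasting 10 days ahead (value from section 6) with the one-day-ahead result above. The variable names are illustrative.

```python
actual = 37861.0               # actual follower count on 2019-12-10
sarima_10_step = 37952.875398  # forecast made from data up to 2019-11-30
sarima_1_step = 37868.158032   # forecast made from data up to 2019-12-09

err_10 = abs(sarima_10_step - actual)
err_1 = abs(sarima_1_step - actual)

print(f"10 days ahead: {err_10:.2f}")  # 10 days ahead: 91.88
print(f" 1 day ahead:  {err_1:.2f}")   #  1 day ahead:  7.16
```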

Summary

Prophet is convenient, but it fell short on practicality here. With the SARIMA model, I felt that time series forecasts are usable when predicting one day ahead. I would have liked to compare a few more models at once. See you next time.

And the follower count will keep decreasing.
