Introduction

I am a second-year master's student (M2) in computer science. I usually work on image processing, but I had the opportunity to handle date-time and time series data, so I am leaving this as a memo. I hope it serves as a reference for anyone who wants to process time series data. **Formulas are omitted, so this is aimed at readers who want to get a feel for the topic.** If you find any mistakes, please let me know.

What is time series data?

Time series data is **"a collection of results measured at regular intervals"**. Each observation, such as a temperature, a precipitation amount, or a store's sales figure, is stored together with the time at which it was measured.

Models + terms that can be used for time series data

AR model (autoregressive model)

- The future y is explained by past values of y.
- The series' own past data is used as the explanatory variable.
- The value of interest is expressed as a sum of several past values, each multiplied by a coefficient.
- Assumes a stationary process.
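To make the idea concrete, here is a minimal sketch that simulates an AR(1) series and recovers its coefficient by least squares. The coefficient `phi = 0.7`, the series length, and the seed are arbitrary values I chose for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(1) process: y[t] = phi * y[t-1] + noise (|phi| < 1 keeps it stationary)
phi = 0.7
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + noise[t]

# Estimate phi by least squares: regress y[t] on y[t-1]
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
print(f"true phi = {phi}, estimated phi = {phi_hat:.3f}")
```

The estimate should land close to the true coefficient, which is what "explaining y by its own past" amounts to in the simplest case.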

MA model (moving average model)

- The future y is explained by past errors.
- The forecast is determined by the error between past forecasts and the actual values.
- Example: if this month's sales were higher than expected, next month's sales are forecast to be higher as well.
- The relationship is expressed through a term that the value of interest and the past values have in common.
- Assumes a stationary process.
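As a rough illustration, the sketch below simulates an MA(1) series, where each value shares one error term with its predecessor, and compares the sample autocorrelation with the textbook value. The coefficient `theta = 0.6`, the length, and the helper `autocorr` are my own choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MA(1) process: y[t] = e[t] + theta * e[t-1]
theta = 0.6
n = 5000
e = rng.normal(size=n + 1)
y = e[1:] + theta * e[:-1]

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# Theory: the lag-1 autocorrelation is theta / (1 + theta**2), and it is 0 beyond lag 1
expected_lag1 = theta / (1 + theta**2)
print(f"lag 1: {autocorr(y, 1):.3f} (theory {expected_lag1:.3f})")
print(f"lag 2: {autocorr(y, 2):.3f} (theory 0)")
```

Because neighbouring values share exactly one error term, the correlation is visible at lag 1 and disappears from lag 2 onward.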

ARMA model (autoregressive moving average model)

- A combination of the AR and MA processes; its behavior follows whichever property is stronger.
- As a result, both the autocorrelation and the partial autocorrelation decay as the lag increases.
- The ARMA model estimates and predicts under the assumption that the series is stationary, but real data is often non-stationary.
- Assumes a stationary process.

ARIMA model (autoregressive integrated moving average model)

- The difference from the ARMA model is that it incorporates differencing.
- It adds to ARMA the order d: how many times the series must be differenced to become stationary.
- It is a process whose d-th difference follows a stationary and invertible ARMA(p, q) process.
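The role of d can be seen with a small numeric sketch. The quadratic series below is a made-up example: one difference leaves a linear trend, and two differences leave a constant.

```python
import numpy as np

# Hypothetical series with a quadratic trend: it needs d = 2 differences to flatten
t = np.arange(10)
y = 3 + 2 * t + 0.5 * t**2

d1 = np.diff(y, n=1)  # first difference: still a linear trend
d2 = np.diff(y, n=2)  # second difference: constant

print(d1)
print(d2)

# "Integrated" means the original series can be rebuilt by cumulative sums
rebuilt = np.concatenate([[y[0]], y[0] + np.cumsum(d1)])
print(np.allclose(rebuilt, y))
```

In an ARIMA model the ARMA part is fitted to the differenced series, and forecasts are summed back up in the same way to return to the original scale.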

SARIMA model (seasonal autoregressive integrated moving average model)

- The difference from ARIMA is that seasonality is taken into account.
- In addition to ARIMA(p, d, q) in the time direction, it has ARIMA(P, D, Q) in the seasonal direction and a period s.
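Here is a minimal sketch of what the seasonal difference (D = 1 with s = 12) does, using pandas only. The monthly series is synthetic: a linear trend plus a pattern that repeats every 12 months.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend + a pattern that repeats every 12 months
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
pattern = np.tile(np.arange(12), 4)
y = pd.Series(0.5 * np.arange(48) + pattern, index=idx)

# Seasonal difference (D = 1, s = 12): subtract the value from 12 months earlier
seasonal_diff = y.diff(12).dropna()

# An ordinary first difference (d = 1) then removes what is left of the trend
both = seasonal_diff.diff(1).dropna()

print(seasonal_diff.unique())
print(both.unique())
```

The seasonal difference cancels the repeating pattern and leaves only the yearly increase of the trend; the ordinary difference then removes that as well.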

Unit root process

- The data is built up by summing values.
- A series that has a unit root is called a "unit root process".
- Example: a random walk (the cumulative sum of white noise).
- White noise: pure noise with no autocorrelation, here drawn from a normal distribution.
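The random walk example can be written in a few lines of NumPy. This is only a sketch with an arbitrary seed and length, but it shows that differencing a random walk gives back the white noise it was built from.

```python
import numpy as np

rng = np.random.default_rng(42)

white_noise = rng.normal(size=1000)    # no autocorrelation, constant variance
random_walk = np.cumsum(white_noise)   # unit root process: running sum of the noise

# Differencing the random walk returns the white noise (apart from the first value)
recovered = np.diff(random_walk)
print(np.allclose(recovered, white_noise[1:]))
```

This is why taking a difference is the standard way to turn a unit root process into a stationary one.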

ADF test

- Many time series models assume a stationary process, so it is common to check a series for a unit root first.
- Null hypothesis: **unit root process**; alternative hypothesis: **stationary process**.
- If the p-value is 0.05 or less, the null hypothesis is rejected and the series is treated as stationary.
- In general, taking a difference series or applying a log transform tends to make a series stationary.
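To show what the test looks at, here is a simplified sketch of the underlying Dickey-Fuller regression in plain NumPy. It is not the full ADF test: it has no lagged difference terms and does not compute a p-value, and the function name `dickey_fuller_t` is my own. For real work, a library implementation such as statsmodels' `adfuller` is the better choice.

```python
import numpy as np

def dickey_fuller_t(y):
    """t-statistic of gamma in  diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))

# A strongly negative statistic is evidence against a unit root.
# About -2.86 is the usual asymptotic 5% critical value for this regression.
print("random walk :", dickey_fuller_t(random_walk))
print("differenced :", dickey_fuller_t(np.diff(random_walk)))
```

The differenced series should produce a statistic far below the critical value, which corresponds to rejecting the unit root hypothesis.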

Autocorrelation (ACF: Autocorrelation Function)

- Indicates how strongly past values are related to the current value.
- The number of steps by which the data is shifted is called the lag.
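A sample autocorrelation can be computed by hand by shifting the series against itself. The sketch below uses a synthetic sine wave with a period of 12 steps; the function name `acf` is my own and not a library call.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

# Hypothetical series with a period of 12 steps
t = np.arange(120)
y = np.sin(2 * np.pi * t / 12)

r = acf(y, 12)
print(np.round(r, 2))
```

The autocorrelation is strongly negative at lag 6 (half a period) and strongly positive at lag 12 (a full period), which is the kind of pattern that hints at seasonality.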

Partial Autocorrelation Function (PACF)

- The autocorrelation that remains after removing the influence of the time points in between.
- The relationship between today and two days ago indirectly includes the influence of one day ago.
- Partial autocorrelation lets you examine the relationship between today and two days ago with the effect of one day ago excluded.
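The "today versus two days ago" example can be checked numerically. In this sketch I simulate an AR(1) series (an assumption for illustration, with `phi = 0.8`), where today depends directly only on yesterday, and obtain the lag-2 partial autocorrelation as the coefficient of y[t-2] in a regression that also includes y[t-1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(1) series: today depends directly only on yesterday
phi, n = 0.8, 5000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# Plain autocorrelation at lag 2: clearly non-zero (about phi**2)
acf2 = np.corrcoef(y[2:], y[:-2])[0, 1]

# Partial autocorrelation at lag 2: regress y[t] on y[t-1] and y[t-2]
# and read off the coefficient of y[t-2]
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
coef, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
pacf2 = coef[2]

print(f"ACF(2)  = {acf2:.3f}")
print(f"PACF(2) = {pacf2:.3f}")
```

The plain autocorrelation at lag 2 is large because the effect passes through yesterday, while the partial autocorrelation is close to zero once yesterday is accounted for.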

Correlogram

- A plot of the autocorrelation against the lag.

Analysis

import numpy as np
import pandas as pd

Handling of dates

pd.date_range('2020-1-1', freq='D', periods=3)

'''
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
'''


df = pd.Series(np.arange(3))
df.index = pd.date_range('2020-1-1', freq='D', periods=3)
df

'''
2020-01-01    0
2020-01-02    1
2020-01-03    2
Freq: D, dtype: int64
'''

idx = pd.date_range('2020-1-1',freq='D',periods=365)
df = pd.DataFrame({'Product A' : np.random.randint(100, size=365),
                   'Product B' : np.random.randint(100, size=365)},
                   index=idx)
df     

'''

Product A Product B
2020-01-01	99	23
2020-01-02	73	98
2020-01-03	86	85
2020-01-04	44	37
2020-01-05	67	63
...	...	...
2020-12-26	23	25
2020-12-27	91	35
2020-12-28	3	23
2020-12-29	92	47
2020-12-30	55	84
365 rows × 2 columns
'''

# Get the data for a specific date
df.loc['2020-2-3']

'''
Product A 51
Product B 46
Name: 2020-02-03 00:00:00, dtype: int64
'''

# Get data by slicing
df.loc[:'2020-1-4']

'''
Product A Product B
2020-01-01	99	23
2020-01-02	73	98
2020-01-03	86	85
2020-01-04	44	37
'''

df.loc['2020-1-4':'2020-1-7']
'''
Product A Product B
2020-01-04	44	37
2020-01-05	67	63
2020-01-06	6	94
2020-01-07	47	11
'''

df.loc['2020-1']

'''
(All data for January is displayed; output omitted)
'''

# Get the month
df.index.month

'''
Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
           dtype='int64', length=365)
'''

Simple data analysis

This time we will use the 'AirPassengers' dataset, a well-known example of time series data.

Loading and displaying data

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
data = pd.read_csv('AirPassengers.csv', index_col=0, parse_dates=[0])
plt.plot(data)

Decompose into trend, seasonal, resid using statsmodels

import statsmodels.api as sm 
res = sm.tsa.seasonal_decompose(data)
fig = res.plot()

Display of autocorrelation and partial autocorrelation

fig, axes = plt.subplots(1,2, figsize=(15,5))
sm.tsa.graphics.plot_acf(data, ax=axes[0])
sm.tsa.graphics.plot_pacf(data, ax=axes[1])

Trend removal

plt.figure(figsize=(15,5))
plt.plot(data.diff(1))

ADF test

`adfuller` returns a tuple, and its second element (index 1) is the p-value. The null hypothesis can be rejected if the p-value is 0.05 or less.


# Raw data
sm.tsa.adfuller(data)[1]
# 0.991880243437641

# Log transform
ldata = np.log(data)
sm.tsa.adfuller(ldata)[1]
# 0.42236677477039125

# Log transform + first difference
sm.tsa.adfuller(ldata.diff().dropna())[1]
# 0.0711205481508595 (much lower, but still slightly above 0.05)

SARIMA model estimation

Set the parameters with `order` and `seasonal_order`, and train the model with `fit()`. Use `forecast()` for forecasts beyond the training range and `predict()` for predictions over a range that includes the training data. Parameter tuning has to be done by brute force. (Does statsmodels not have a function that finds the best model automatically?)

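model = sm.tsa.SARIMAX(ldata, order=(1, 1, 1), seasonal_order=(0, 1, 2, 12))
res_model = model.fit()
# Forecast 36 steps (3 years of monthly data) beyond the training range
pred = res_model.forecast(36)
plt.plot(ldata, label='Original')
plt.plot(pred, label='Pred')
plt.legend()  # show the labels set above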
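The brute-force tuning mentioned above could look roughly like the sketch below. The function `grid_search` and the stand-in `toy_score` are hypothetical names of mine; `toy_score` exists only so the example runs on its own and has no statistical meaning. With statsmodels, the score would typically be the AIC of a fitted model, as shown in the comment.

```python
import itertools

def grid_search(score, p_values, d_values, q_values, seasonal_orders):
    """Try every parameter combination and return (best_score, order, seasonal_order).

    `score(order, seasonal_order)` should return a number where lower is better
    (for example AIC) and may raise an exception when a fit fails.
    """
    best = None
    for order in itertools.product(p_values, d_values, q_values):
        for seasonal_order in seasonal_orders:
            try:
                value = score(order, seasonal_order)
            except Exception:
                continue  # skip combinations that cannot be fitted
            if best is None or value < best[0]:
                best = (value, order, seasonal_order)
    return best

# Stand-in score so the sketch runs on its own. With statsmodels it could be e.g.
#   lambda o, so: sm.tsa.SARIMAX(ldata, order=o, seasonal_order=so).fit(disp=False).aic
def toy_score(order, seasonal_order):
    return sum(order) + sum(seasonal_order[:3])

best = grid_search(toy_score, range(3), range(2), range(3),
                   [(0, 1, 1, 12), (0, 1, 2, 12), (1, 1, 1, 12)])
print(best)
```

Fitting every combination can take a while on real data, so it is worth keeping the candidate ranges small.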

Feature creation in time series data

Information that is likely to be useful as a feature for time series data

- Month
- Day of the week
- Week number
- Weekend flag
- Public holiday
- Holiday
- Weather
- Consecutive-holiday flag
- Length of the consecutive holiday, etc.

# Create a simple table
df = pd.DataFrame(np.arange(6).reshape(6, 1), columns=['values'])

# Difference from the previous row
df['diff_1'] = df['values'].diff(1)
# Difference from the value two rows earlier (lag 2, not a second-order difference)
df['diff_2'] = df['values'].diff(2)
# Shift the values by one row
df['shift'] = df['values'].shift(1)
# Rate of change from the previous row
df['ch'] = df['values'].pct_change(1)
# Moving mean and moving max with a window of 2 rows
df['rolling_mean'] = df['values'].rolling(2).mean()
df['rolling_max'] = df['values'].rolling(2).max()
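The calendar-based features from the list (month, day of the week, week number, weekend flag) can be derived directly from a `DatetimeIndex`. This is a minimal sketch on a made-up two-week table; the name `feat` is my own. Public holidays and weather are left out because they need an external calendar or data source.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", freq="D", periods=14)
feat = pd.DataFrame({"values": np.arange(14)}, index=idx)

feat["month"] = feat.index.month
feat["dayofweek"] = feat.index.dayofweek                      # Monday = 0, Sunday = 6
feat["weekofyear"] = feat.index.isocalendar().week.astype(int)  # ISO week number
feat["is_weekend"] = (feat.index.dayofweek >= 5).astype(int)  # Saturday or Sunday

print(feat.head(7))
```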

Other notes

- Features can be generated with the tsfresh library.
- Cross-validation can be done with scikit-learn's `TimeSeriesSplit`.
- Machine learning models seem to presuppose a stationary process, so might a statistical model be the better choice?
- The SARIMA model cannot handle NaN.
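The idea behind time-series cross-validation is that training data must always come before the test data. The sketch below is my own NumPy illustration of an expanding-window split in that spirit; it is not scikit-learn's implementation, and the function name `expanding_window_splits` is hypothetical.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    indices = np.arange(n_samples)
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        yield indices[:test_start], indices[test_start:test_start + test_size]

for train_idx, test_idx in expanding_window_splits(n_samples=12, n_splits=3):
    print("train:", train_idx, "test:", test_idx)
```

Unlike a shuffled K-fold split, no fold ever trains on observations that come after the ones it is tested on.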

At the end

This was only a brief overview, but I have summarized the basics of handling time series data. What I am still unsure about is whether to use a machine learning model or a statistical model. Personally, I feel that statistical models tend to give better results (not on this dataset, but in general).

[PYTHON] How to handle time series data (implementation)
