Introduction

This is the article for day 16 of the CyberAgent20 New graduate Advent Calendar 2019. The moving average is a statistical method that is often used when analyzing time series data. It is especially common for tracking stock price trends, but this time I would like to turn it into a form that can be used as a feature for machine learning. If anything, this article leans toward implementation.

I usually focus on text and sound analysis, so I would appreciate it if you could point out any mistakes.

What is a moving statistic?

By "moving statistic" I mean a moving average or a moving variance. In practice the moving average is used in most cases, and a search for "moving variance" returns almost no hits. Still, when I used it in real work, there were cases where it correlated with y and improved accuracy, which is why I decided to write this post. I will keep the explanation of the moving average itself to a minimum, since a search turns up many articles. In short, when fluctuations in time series data make the underlying movement hard to see, smoothing with a moving average makes the overall trend visible. Mathematically, the moving average over three steps at time *t* of a time series *x* can be written as follows.


\frac{x_{t-2}+x_{t-1}+x_t}{3}

I used 3 steps here. For daily data that corresponds to 3 days, and in general the window is counted in units of the series' time scale. Smoothing is performed by computing the moving average with the formula above while shifting the time *t*.
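As a quick check of the formula, here is a minimal sketch that computes the 3-step moving average both by hand and with pandas. The series values are made up for illustration and are not from the temperature data.

```python
import pandas as pd

# Toy series; the values are made up for illustration.
x = pd.Series([1.0, 2.0, 6.0, 4.0, 5.0])

# Hand-written 3-step average: (x[t-2] + x[t-1] + x[t]) / 3.
# The first two positions have no full window, so they stay empty.
manual = [None, None] + [
    (x.iloc[t - 2] + x.iloc[t - 1] + x.iloc[t]) / 3 for t in range(2, len(x))
]

# The same computation with a pandas rolling window of width 3.
sma3 = x.rolling(3).mean()

print(manual)
print(sma3.tolist())
```

Both results agree from the third element onward; pandas fills the first two positions with NaN because the window is not yet complete there.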

As a concrete picture: given time series data like that in the left figure below, taking the moving average reveals the trend, as in the right figure.

The above is called a simple moving average, and there are other types of moving average as well. Wikipedia lists the following. There are many types, but as noted there, SMA, WMA, and EMA (the first three below) are the ones in general use. Next time I would like to try each of them by hand and look for use cases.

- Simple Moving Average (SMA)
- Weighted Moving Average (WMA)
- Exponential Moving Average (EMA)
- Modified moving average
- Triangular moving average
- Sine-weighted moving average
- Cumulative moving average

I tried the simple moving average (SMA), weighted moving average (WMA), and exponential moving average (EMA) when predicting Bitcoin prices. There were no sharp drops in the data at that time, so I do not know which indicator would work best now, but I remember that SMA worked best around the summer of 2017. The simple moving average is also used in real-world practice.
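For readers who want to compare the three, here is a minimal pandas sketch of SMA, WMA, and EMA. The series, the window width of 3, the linear WMA weights, and the choice of `span=window` for the EMA are all assumptions made for illustration; this is not the setup I used for Bitcoin.

```python
import numpy as np
import pandas as pd

# Toy series; the values are made up for illustration.
x = pd.Series([10.0, 11.0, 13.0, 12.0, 15.0, 14.0])
window = 3  # assumed window width

# SMA: every value in the window has the same weight.
sma = x.rolling(window).mean()

# WMA: linearly increasing weights 1, 2, ..., window (newest value weighted most).
weights = np.arange(1, window + 1)
wma = x.rolling(window).apply(
    lambda v: np.dot(v, weights) / weights.sum(), raw=True
)

# EMA: span=window corresponds to a smoothing factor alpha = 2 / (window + 1).
ema = x.ewm(span=window, adjust=False).mean()

print(pd.DataFrame({"x": x, "sma": sma, "wma": wma, "ema": ema}))
```

Unlike SMA and WMA, the EMA has a value from the first row onward, because it is defined recursively rather than over a fixed window.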

Lag features

When a moving statistic is used as a feature, it must be used as a lag feature. A lag feature uses data from several hours or days before a given time as a feature for that time. A moving statistic aggregates values from a given point in time back over a fixed number of steps. However, if the target to predict is 7 days ahead, data from 6 days ahead is not yet available at prediction time and cannot go into the feature for that target. The moving statistic computed up to the present is therefore used, with a lag, as the feature for the time 7 days ahead. This way, features are built in the same way for the training data and the prediction data.
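The 7-day case described above can be sketched as follows. The column names, the window width of 3, and the synthetic values are assumptions for illustration.

```python
import pandas as pd

# Synthetic daily series, made up for illustration.
dates = pd.date_range("2019-01-01", periods=20, freq="D")
df = pd.DataFrame({"y": [float(i) for i in range(20)]}, index=dates)

window = 3   # length of the moving window (assumed)
horizon = 7  # how many days ahead we want to predict (assumed)

# Rolling mean up to day t, then moved forward by `horizon` rows, so the
# row for day t only holds information available on day t - horizon.
df["y_sma_lag"] = df["y"].rolling(window).mean().shift(horizon)

print(df.head(12))
```

The first `(window - 1) + horizon` rows have no feature value, and every later row depends only on values at least 7 days older than its own date.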

Lag features are also covered in this recently published book, for reference.

Implementation

About data

Because it is easy to understand as data, I used the daily mean temperature published by the Japan Meteorological Agency. https://www.jma.go.jp/jma/index.html

This time, assuming we want to predict the mean temperature in November 2019, we use the September, October, and November data for 2000 through 2018, plus the September and October data for 2019.

Visualization

Let's visualize the actual values in a graph. The upper figure shows the raw mean temperature data, and the lower figure shows the 30-day moving average. This makes it possible to see when the temperature fluctuates more and when it does not. When the moving statistic is used as a lag feature, it is shifted a further month relative to the lower figure, so the amount of training data shrinks. Care is therefore needed when a large lag is required.
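To make the loss of training data concrete, here is a small sketch that counts how many rows end up without a feature. It assumes a single contiguous block of 91 daily rows (the length of one September to November period) with made-up values, not the actual temperature data.

```python
import pandas as pd

n_days = 91  # one September-November block: 30 + 31 + 30 days (assumed)
s = pd.Series([float(i) for i in range(n_days)])

window, lag = 30, 30
feature = s.rolling(window).mean().shift(lag)

# (window - 1) rows are lost to the rolling window and `lag` more to the shift.
n_missing = int(feature.isna().sum())
n_usable = n_days - n_missing

print(n_missing, n_usable)  # 59 32
```

With a 30-day window and a 30-day lag, roughly two thirds of a three-month block has no feature value, which is why the lag should not be made larger than necessary.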

About the environment

For the time being, I pin the versions with the eventual deployment target in mind.

`Pipfile`


[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
pandas = "~=0.25"
matplotlib = "~=3.1"

[requires]
python_version = "3.7"

pandas

pandas provides rolling, which applies a window function over the data. The statistics are computed with mean and var, and the series is moved with shift to turn the result into a lag feature.

import pandas as pd

df = pd.read_csv('')
# The data is assumed to be sorted by date already; sort it first if it is not.
df['Average temperature(℃)'].rolling(30).mean().shift(30)
# Choose the window width for your domain by looking at the trend and the correlation with y.

Since pandas has an aggregate function (agg), it is easy to add statistics other than the moving average.

df['Average temperature(℃)'].rolling(30).agg(['mean', 'var']).shift(30)
# Kurtosis ('kurt') and skewness ('skew') can be computed in the same way.
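As a sketch of how the aggregated statistics might be turned into named feature columns, the example below adds skewness and kurtosis to the list and renames the result. The random data, the series name `temp`, and the `_lag30` suffix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic "temperature" series, made up for illustration.
rng = np.random.RandomState(0)
temp = pd.Series(rng.normal(15.0, 3.0, size=120), name="temp")

# Four moving statistics over a 30-row window, shifted by 30 rows as a lag.
stats = temp.rolling(30).agg(["mean", "var", "skew", "kurt"]).shift(30)

# Give the columns descriptive names so they can be joined to other features.
stats.columns = ["temp_" + c + "_lag30" for c in stats.columns]

print(stats.dropna().head())
```

Each statistic becomes its own column, so checking the correlation of each one with y afterwards is straightforward.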

BigQuery

BigQuery also has convenient window functions, so the moving average with a lag can be calculated easily.

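SELECT
	-- The frame runs from the older bound to the newer one. Rows 59 to 30 back
	-- cover the same 30 rows as rolling(30).shift(30) in pandas.
	-- `date` is an assumed name for the column that orders the rows.
	AVG(`Average temperature(℃)`) OVER(ORDER BY `date` ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING)
FROM
	`project.dataset.table`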

The difference from pandas is obvious: BigQuery is much faster. Calculating kurtosis is not as easy as in pandas, but that is not a problem because it is rarely used as a feature. Window functions are convenient, simple, and easy to understand, because you can specify the grouping and the sort order when applying the function.
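For comparison, the same "group and sort" idea can be approximated in pandas with sort_values and groupby. This is only a sketch: the `station` and `date` columns, the window width of 2, and the lag of 1 are hypothetical and chosen to keep the example small.

```python
import pandas as pd

# Two hypothetical stations with five days each; values are made up.
df = pd.DataFrame({
    "station": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2019-09-01", periods=5)) * 2,
    "temp": [20.0, 21.0, 19.0, 18.0, 17.0, 10.0, 12.0, 11.0, 13.0, 9.0],
})

# Sort within each group first (the ORDER BY part), then apply the window
# and the lag per group (the PARTITION BY part).
df = df.sort_values(["station", "date"])
df["temp_sma2_lag1"] = df.groupby("station")["temp"].transform(
    lambda s: s.rolling(2).mean().shift(1)
)

print(df)
```

Because the window and the shift are applied per group, the first rows of station B stay empty instead of picking up values from station A.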

In closing

It is interesting to search for features while examining the actual data, but I am still a beginner in statistics, so I hope to dig deeper.

[PYTHON] Movement statistics for time series forecasting