[PYTHON] How to use Pandas Rolling

To get a rough picture of data that fluctuates from point to point, such as stock prices or the number of patients testing positive for the new coronavirus, you can average each value together with the values around it. This is a note on rolling, the Pandas function that does this.

Normal usage

import pandas as pd

First, to confirm the basic behavior of rolling, create a Series of ten 1s and name it ones.

ones = pd.Series([1] * 10)
ones

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

Use rolling to add up four values at a time. Since every value is 1, the result is a column of 4s, except at the start, where fewer than four values are available and the result is NaN. The sums are listed as sum-4 next to the original ones column for comparison.

The window argument specifies how many data points to aggregate (the window width). Picture a window four elements wide sliding from the beginning of the Series to the end, summing as it goes. Each sum is recorded at the last position of the window, so the first three values of sum-4 are NaN.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(window=4).sum(),
})

	ones	sum-4
0	1	NaN
1	1	NaN
2	1	NaN
3	1	4.0
4	1	4.0
5	1	4.0
6	1	4.0
7	1	4.0
8	1	4.0
9	1	4.0
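
With a column of ones every window gives the same sum, so it is hard to see which elements each window covers. The sketch below repeats the same call on distinct values (the name values is made up for this illustration) and checks one result by hand.

```python
import pandas as pd

# Hypothetical sample data: 0, 1, ..., 9 (distinct values make each window visible)
values = pd.Series(range(10))

# Trailing window of width 4: the result at position i sums positions i-3 .. i
rolled = values.rolling(window=4).sum()

# Manual check for position 5: 2 + 3 + 4 + 5
manual_at_5 = values.iloc[2:6].sum()

print(rolled)
print(manual_at_5)  # 14, the same as rolled.iloc[5]
```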

If you specify center=True, the aggregated result is recorded at the middle of the window instead of at its end. It is listed here as sum-center. When the window width is even, the result is recorded at the later of the two middle positions. With a width of 4, the first two elements and the last element are therefore NaN.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(window=4).sum(),
    'sum-center': ones.rolling(window=4, center=True).sum(),
})

	ones	sum-4	sum-center
0	1	NaN	NaN
1	1	NaN	NaN
2	1	NaN	4.0
3	1	4.0	4.0
4	1	4.0	4.0
5	1	4.0	4.0
6	1	4.0	4.0
7	1	4.0	4.0
8	1	4.0	4.0
9	1	4.0	NaN
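
The effect of center=True is easier to follow with distinct values. In this sketch (values is again a made-up Series), the centered result for a window of 4 turns out to be the trailing result moved up by one row.

```python
import pandas as pd

# Hypothetical sample data: 0, 1, ..., 9
values = pd.Series(range(10))

trailing = values.rolling(window=4).sum()
# With window=4 and center=True, position i covers positions i-2 .. i+1
centered = values.rolling(window=4, center=True).sum()

print(pd.DataFrame({
    'values': values,
    'trailing': trailing,
    'centered': centered,
    'trailing shifted up 1': trailing.shift(-1),
}))
```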

By default, positions near the edge of the data where the window is not full are not aggregated. If you specify min_periods, the aggregation is performed whenever the window holds at least min_periods values. This example specifies min_periods=2, so results appear from the second row.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(4).sum(),
    'min-periods': ones.rolling(window=4, min_periods=2).sum(),
})

	ones	sum-4	min-periods
0	1	NaN	NaN
1	1	NaN	2.0
2	1	NaN	3.0
3	1	4.0	4.0
4	1	4.0	4.0
5	1	4.0	4.0
6	1	4.0	4.0
7	1	4.0	4.0
8	1	4.0	4.0
9	1	4.0	4.0
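
min_periods matters more for an average than for a sum, because a partial window is averaged over only the values it actually contains. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical sample data
values = pd.Series([10, 20, 30, 40, 50])

# Default: NaN until the window of 3 is full
full_only = values.rolling(window=3).mean()

# min_periods=1: partial windows at the start are averaged over what is available
partial_ok = values.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'values': values, 'full_only': full_only, 'partial_ok': partial_ok}))
```

In the partial_ok column the first two rows are 10.0 (one value) and 15.0 (the mean of 10 and 20).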

Up to this point, every element in the window has counted equally in the aggregation. With win_type you can weight the elements inside the window; this requires scipy to be installed. Because every value here is 1, each result is simply the sum of the window's weights.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(4).sum(),
    'win_type': ones.rolling(window=4, win_type='nuttall').sum(),
})

	ones	sum-4	win_type
0	1	NaN	NaN
1	1	NaN	NaN
2	1	NaN	NaN
3	1	4.0	1.059185
4	1	4.0	1.059185
5	1	4.0	1.059185
6	1	4.0	1.059185
7	1	4.0	1.059185
8	1	4.0	1.059185
9	1	4.0	1.059185
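
To show what a weighted window sum means without depending on scipy, the sketch below does not use win_type at all. It uses rolling(...).apply with NumPy and a made-up weight vector named weights, which is not the shape of any particular scipy window.

```python
import numpy as np
import pandas as pd

ones = pd.Series([1] * 10)

# Hypothetical weights for a window of 4 (illustration only, not a scipy window)
weights = np.array([0.1, 0.4, 0.4, 0.1])

# raw=True passes each full window to the function as a NumPy array
weighted = ones.rolling(window=4).apply(lambda w: np.dot(w, weights), raw=True)

print(weighted)
```

Each full window yields the sum of the weights, which is about 1.0 for this vector, in the same way that the nuttall column above shows the sum of that window's weights.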

Use date for index

So far, window has been a number of elements. When the index of the Series is a date or time, window can instead be a period such as a number of hours or days. Let's create some sample data.

date_ones = pd.Series(
    index=[
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2020-01-02'),
        pd.Timestamp('2020-01-07'),
        pd.Timestamp('2020-01-08'),
        pd.Timestamp('2020-01-09'),
        pd.Timestamp('2020-01-16'),
    ],
    data=[1,1,1,1,1,1]
)
date_ones

2020-01-01    1
2020-01-02    1
2020-01-07    1
2020-01-08    1
2020-01-09    1
2020-01-16    1
dtype: int64

If you specify window='7D' here, each row is aggregated together with the rows whose dates fall within the 7 days ending at that row. The date exactly 7 days earlier is not included.

pd.DataFrame({
    'ones': date_ones,
    'rolling-7d': date_ones.rolling('7D').sum(),
})

	ones	rolling-7d
2020-01-01	1	1.0
2020-01-02	1	2.0
2020-01-07	1	3.0
2020-01-08	1	3.0
2020-01-09	1	3.0
2020-01-16	1	1.0

In this example, the results are as follows:

2020-01-01: 1, because there is nothing before it
2020-01-02: 2, because there is one earlier row within 7 days
2020-01-07: 3, because 2020-01-01, 2020-01-02, and 2020-01-07 are within 7 days
2020-01-08: 3, because 2020-01-02, 2020-01-07, and 2020-01-08 are within 7 days
2020-01-09: 3, because 2020-01-07, 2020-01-08, and 2020-01-09 are within 7 days
2020-01-16: 1, because there is nothing else within 7 days
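
The same counts can be reproduced by hand. This sketch assumes the default behavior for offset windows, in which each window is the interval from just after (t - 7 days) up to and including t; the names dates, window and manual are introduced only for this illustration.

```python
import pandas as pd

dates = pd.to_datetime([
    '2020-01-01', '2020-01-02', '2020-01-07',
    '2020-01-08', '2020-01-09', '2020-01-16',
])
date_ones = pd.Series(1, index=dates)

window = pd.Timedelta('7D')

# For each timestamp t, count the rows with t - 7 days < date <= t
manual = [
    int(((dates > t - window) & (dates <= t)).sum())
    for t in dates
]

rolled = date_ones.rolling('7D').sum()

print(manual)           # [1, 2, 3, 3, 3, 1]
print(rolled.tolist())  # [1.0, 2.0, 3.0, 3.0, 3.0, 1.0]
```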

A specification like '7D' is called an offset. The available strings are described under Offset aliases in the pandas documentation.

When window is an offset, min_periods defaults to 1, so a value is produced as long as the window contains at least one data point, and NaN does not appear on its own. Conversely, if you specify min_periods, the result is NaN wherever the offset period contains fewer data points than that.

pd.DataFrame({
    'ones': date_ones,
    'rolling-7d': date_ones.rolling('7D').sum(),
    'rolling-min-periods': date_ones.rolling('7D', min_periods=3).sum(),
})

	ones	rolling-7d	rolling-min-periods
2020-01-01	1	1.0	NaN
2020-01-02	1	2.0	NaN
2020-01-07	1	3.0	3.0
2020-01-08	1	3.0	3.0
2020-01-09	1	3.0	3.0
2020-01-16	1	1.0	NaN
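
To see why particular rows become NaN, it can help to count how many observations fall inside each window. A small sketch using count(), with the sample data rebuilt so the block runs on its own:

```python
import pandas as pd

date_ones = pd.Series(
    1,
    index=pd.to_datetime([
        '2020-01-01', '2020-01-02', '2020-01-07',
        '2020-01-08', '2020-01-09', '2020-01-16',
    ]),
)

# Number of observations inside each 7-day window
in_window = date_ones.rolling('7D').count()

# min_periods=3 keeps a result only where the window holds at least 3 observations
with_min = date_ones.rolling('7D', min_periods=3).sum()

print(pd.DataFrame({'in_window': in_window, 'with_min': with_min}))
```

The rows where in_window is below 3 are exactly the rows where with_min is NaN.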

Example of moving average

Now, using what we have covered so far, let's plot a moving average of the number of people infected with the new coronavirus in Tokyo.

The data comes from "Details of Announcement of New Coronavirus Positive Patients in Tokyo".

covid_src = pd.read_csv('https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv', parse_dates=['Published_date'])
covid_src

	No	National Local Public Organization Code	Prefecture name	City name	Published_date	day of the week	Onset_date	Patient_Residence	Patient_age	Patient_Gender	Patient_attribute	Patient_Status	Patient_Symptoms	Patient_travel history flag	Remarks	Discharged flag
0	1	130001	Tokyo	NaN	2020-01-24	Fri	NaN	Wuhan City, Hubei Province	40s	male	NaN	NaN	NaN	NaN	NaN	1.0
1	2	130001	Tokyo	NaN	2020-01-25	Sat	NaN	Wuhan City, Hubei Province	30s	female	NaN	NaN	NaN	NaN	NaN	1.0
2	3	130001	Tokyo	NaN	2020-01-30	Thu	NaN	Changsha City, Hunan Province	30s	female	NaN	NaN	NaN	NaN	NaN	1.0
3	4	130001	Tokyo	NaN	2020-02-13	Thu	NaN	Tokyo	70s	male	NaN	NaN	NaN	NaN	NaN	1.0
4	5	130001	Tokyo	NaN	2020-02-14	Fri	NaN	Tokyo	50s	female	NaN	NaN	NaN	NaN	NaN	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
28131	28031	130001	Tokyo	NaN	2020-10-14	Wed	NaN	NaN	Under 10 years old	male	NaN	NaN	NaN	NaN	NaN	NaN
28132	28032	130001	Tokyo	NaN	2020-10-14	Wed	NaN	NaN	20s	male	NaN	NaN	NaN	NaN	NaN	NaN
28133	28033	130001	Tokyo	NaN	2020-10-14	Wed	NaN	NaN	70s	male	NaN	NaN	NaN	NaN	NaN	NaN
28134	28034	130001	Tokyo	NaN	2020-10-14	Wed	NaN	NaN	20s	female	NaN	NaN	NaN	NaN	NaN	NaN
28135	28035	130001	Tokyo	NaN	2020-10-14	Wed	NaN	NaN	40s	female	NaN	NaN	NaN	NaN	NaN	NaN

28136 rows × 16 columns

Since we only need the counts, use groupby and size to get the number of cases per date.

covid_daily = covid_src.groupby("Published_date").size()
covid_daily.index.name = 'date'
covid_daily

date
2020-01-24      1
2020-01-25      1
2020-01-30      1
2020-02-13      1
2020-02-14      2
             ... 
2020-10-10    249
2020-10-11    146
2020-10-12     78
2020-10-13    166
2020-10-14    177
Length: 239, dtype: int64
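
The groupby and size step can be tried on a tiny table first. The frame below is a made-up miniature with one row per reported case; only the column name Published_date is taken from the code above.

```python
import pandas as pd

# Hypothetical miniature of the patient table: one row per reported case
cases = pd.DataFrame({
    'Published_date': pd.to_datetime([
        '2020-01-24', '2020-01-25', '2020-02-14', '2020-02-14',
    ]),
    'Patient_age': ['40s', '30s', '50s', '70s'],
})

# size() counts the rows in each group, giving cases per date
daily = cases.groupby('Published_date').size()
daily.index.name = 'date'

print(daily)
```

Dates with no rows, such as 2020-01-26 here, do not appear in the result at all.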

If you simply plot the number of cases for each date, the line is jagged.

covid_daily.plot()

Set a one-week window to capture the trend in positive patients. Using mean() instead of sum() gives a moving average.

covid_daily.rolling(window='7D').mean().plot()
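
One point to keep in mind: with an offset window, mean() divides by the number of rows present in the window, not by 7. When some dates are missing from the index, as at the start of this data, the two differ. The sketch below uses made-up counts and assumes that a missing date means zero cases, which may or may not hold for the real data.

```python
import pandas as pd

# Hypothetical daily counts with gaps (no row for days without reports)
daily = pd.Series(
    [7, 7, 7],
    index=pd.to_datetime(['2020-01-24', '2020-01-25', '2020-01-30']),
)

# Averages only the rows present in each 7-day window
mean_present = daily.rolling('7D').mean()

# Assumption: a missing date means zero cases, so fill the gaps before averaging
filled = daily.asfreq('D', fill_value=0)
mean_calendar = filled.rolling('7D').mean()

print(mean_present.iloc[-1])   # 7.0: three rows in the window, each 7
print(mean_calendar.iloc[-1])  # 3.0: 21 cases spread over 7 calendar days
```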
