[PYTHON] How to use Pandas Rolling

To get a rough picture of data that fluctuates from day to day, such as stock prices or the number of new coronavirus positive patients, you can average each point together with the values around it. This is a note on rolling, the Pandas function that does this.

Basic usage

import pandas as pd

First, to confirm the basic behavior of rolling, create a Series of ten 1s and name it ones.

ones = pd.Series([1] * 10)
ones
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

Use rolling to add up the ones four at a time. The result is naturally a column of 4s, since each window just adds four 1s. However, the first three rows, where four values are not yet available, are NaN. The result is listed as sum-4 next to the original ones column for comparison.

In this way, window specifies how many values (the window width) are aggregated at a time. Picture a window four rows wide sliding from the beginning of the data to the end, computing a sum at each position. The result is recorded at the last row of the window, which is why the first three values of sum-4 are NaN.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(window=4).sum(),
})
   ones  sum-4
0     1    NaN
1     1    NaN
2     1    NaN
3     1    4.0
4     1    4.0
5     1    4.0
6     1    4.0
7     1    4.0
8     1    4.0
9     1    4.0
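
To see which rows each window covers, it can help to try a series whose values differ. The following is a minimal sketch; the series values and the list manual are made up for illustration and recompute each window by hand for comparison.

```python
import pandas as pd

# Hypothetical data: 0, 1, 2, ..., 9
values = pd.Series(range(10))

rolled = values.rolling(window=4).sum()

# Recompute each window by hand: the result recorded at position i
# covers positions i-3 .. i (four rows ending at i).
manual = [
    float(values.iloc[i - 3:i + 1].sum()) if i >= 3 else float('nan')
    for i in range(len(values))
]

print(pd.DataFrame({'values': values, 'rolling': rolled, 'manual': manual}))
```

From the fourth row onward, the rolling and manual columns should agree.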

If you specify center=True, the result is recorded at the middle of the window instead of at its end. This is shown as sum-center. When the window width is even, the result is recorded at the row just after the midpoint. With window=4, each window therefore covers two rows before the recorded row and one row after it, so the first two rows and the last row are NaN.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(window=4).sum(),
    'sum-center': ones.rolling(window=4, center=True).sum(),
})
   ones  sum-4  sum-center
0     1    NaN         NaN
1     1    NaN         NaN
2     1    NaN         4.0
3     1    4.0         4.0
4     1    4.0         4.0
5     1    4.0         4.0
6     1    4.0         4.0
7     1    4.0         4.0
8     1    4.0         4.0
9     1    4.0         NaN
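
Another way to read the table above is that, for window=4, the centered column is the trailing column moved up by one row. The sketch below checks this on made-up data (values is hypothetical) using shift.

```python
import numpy as np
import pandas as pd

values = pd.Series(range(10))  # hypothetical data

trailing = values.rolling(window=4).sum()
centered = values.rolling(window=4, center=True).sum()

# With window=4, the centered result at row i matches the trailing
# result at row i + 1, so shifting the trailing column up by one row
# should reproduce the centered column.
shifted = trailing.shift(-1)

print(pd.DataFrame({'trailing': trailing, 'centered': centered, 'shifted': shifted}))
print(np.allclose(centered, shifted, equal_nan=True))
```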

By default, no result is produced where the window is not completely filled, as at the edges of the data. If you specify min_periods, a result is produced whenever the window contains at least min_periods values. In this example min_periods=2 is specified, so results appear from the second row.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(4).sum(),
    'min-periods': ones.rolling(window=4, min_periods=2).sum(),
})
   ones  sum-4  min-periods
0     1    NaN          NaN
1     1    NaN          2.0
2     1    NaN          3.0
3     1    4.0          4.0
4     1    4.0          4.0
5     1    4.0          4.0
6     1    4.0          4.0
7     1    4.0          4.0
8     1    4.0          4.0
9     1    4.0          4.0
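
The same option is useful for averages, where a partially filled window still gives a meaningful value. A small sketch with made-up numbers (prices is hypothetical):

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40, 50])  # hypothetical values

# Default: a result only once three values are available
strict = prices.rolling(window=3).mean()

# min_periods=1: average whatever is in the window so far
relaxed = prices.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'prices': prices, 'strict': strict, 'relaxed': relaxed}))
```

Here the first two rows of relaxed are the averages of one and two values respectively, while strict leaves them as NaN.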

Up to this point, every element in the window has contributed equally to the result. With win_type you can apply a window function that weights the elements differently. This requires scipy to be installed.

pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(4).sum(),
    'win_type': ones.rolling(window=4, win_type='nuttall').sum(),
})
   ones  sum-4  win_type
0     1    NaN       NaN
1     1    NaN       NaN
2     1    NaN       NaN
3     1    4.0  1.059185
4     1    4.0  1.059185
5     1    4.0  1.059185
6     1    4.0  1.059185
7     1    4.0  1.059185
8     1    4.0  1.059185
9     1    4.0  1.059185
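
The idea of a weighted window can be sketched without scipy by passing explicit weights to apply. This is only an illustration of the concept, not how win_type is implemented, and the weights below are arbitrary values chosen for the example, not the Nuttall window.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical data
weights = np.array([0.25, 0.75, 0.75, 0.25])    # hypothetical weights

# Each window of four values is multiplied element-wise by the weights
# and summed, so the middle two values count more than the outer two.
weighted = values.rolling(window=4).apply(
    lambda w: np.dot(w, weights), raw=True
)

print(pd.DataFrame({'values': values, 'weighted': weighted}))
```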

Using a date index

So far, window has specified a number of elements. When the index of the Series is a date or time, window can instead specify a period, such as a number of hours or days. Let's create some sample data.

date_ones = pd.Series(
    index=[
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2020-01-02'),
        pd.Timestamp('2020-01-07'),
        pd.Timestamp('2020-01-08'),
        pd.Timestamp('2020-01-09'),
        pd.Timestamp('2020-01-16'),
    ],
    data=[1,1,1,1,1,1]
)
date_ones
2020-01-01    1
2020-01-02    1
2020-01-07    1
2020-01-08    1
2020-01-09    1
2020-01-16    1
dtype: int64

If you specify window='7D' here, each result aggregates the rows that fall within the 7 days ending at that row's date.

pd.DataFrame({
    'ones': date_ones,
    'rolling-7d': date_ones.rolling('7D').sum(),
})
            ones  rolling-7d
2020-01-01     1         1.0
2020-01-02     1         2.0
2020-01-07     1         3.0
2020-01-08     1         3.0
2020-01-09     1         3.0
2020-01-16     1         1.0

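In this example, the result for 2020-01-07 covers 2020-01-01 through 2020-01-07, which contains three rows, so it is 3. The result for 2020-01-16 covers 2020-01-10 through 2020-01-16, which contains only the row for that day, so it is 1.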
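
The rows that fall into each 7-day window can also be selected by hand. The following sketch rebuilds the same sample data and, for each date t, sums the rows whose index is later than t minus 7 days and no later than t. The names span and manual are introduced here only for illustration.

```python
import pandas as pd

dates = pd.to_datetime([
    '2020-01-01', '2020-01-02', '2020-01-07',
    '2020-01-08', '2020-01-09', '2020-01-16',
])
date_ones = pd.Series(1, index=dates)

rolled = date_ones.rolling('7D').sum()

# For each timestamp t, pick the rows with t - 7 days < index <= t by hand.
span = pd.Timedelta('7D')
manual = pd.Series(
    [date_ones[(date_ones.index > t - span) & (date_ones.index <= t)].sum()
     for t in date_ones.index],
    index=date_ones.index,
)

print(pd.DataFrame({'rolling-7d': rolled, 'manual': manual}))
```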

A specification such as '7D' is called an offset. The available strings are described under Offset aliases in the pandas documentation.

When window is an offset, min_periods defaults to 1, so a result is produced as long as the window contains at least one value, and no NaN appears on its own. Conversely, if you specify min_periods, the result is NaN wherever the offset period contains fewer values than specified.

pd.DataFrame({
    'ones': date_ones,
    'rolling-7d': date_ones.rolling('7D').sum(),
    'rolling-min-periods': date_ones.rolling('7D', min_periods=3).sum(),
})
            ones  rolling-7d  rolling-min-periods
2020-01-01     1         1.0                  NaN
2020-01-02     1         2.0                  NaN
2020-01-07     1         3.0                  3.0
2020-01-08     1         3.0                  3.0
2020-01-09     1         3.0                  3.0
2020-01-16     1         1.0                  NaN
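
One way to check which rows min_periods will blank out is to count the values in each window with count. The sketch below rebuilds the sample data and compares the two; counts and masked are names chosen for this example.

```python
import pandas as pd

dates = pd.to_datetime([
    '2020-01-01', '2020-01-02', '2020-01-07',
    '2020-01-08', '2020-01-09', '2020-01-16',
])
date_ones = pd.Series(1, index=dates)

# Number of rows that fall inside each 7-day window
counts = date_ones.rolling('7D').count()

# Sum that is kept only where the window holds at least 3 rows
masked = date_ones.rolling('7D', min_periods=3).sum()

print(pd.DataFrame({'counts': counts, 'masked': masked}))
```

The rows where counts is below 3 are the rows where masked is NaN.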

Example of moving average

Now, let's draw a moving average graph of the number of people infected with the new coronavirus in Tokyo using the knowledge so far.

The data comes from the dataset "Details of Announcement of New Coronavirus Positive Patients in Tokyo".

covid_src = pd.read_csv('https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv', parse_dates=['Published_date'])
covid_src
No National Local Public Organization Code Prefecture name City name Published_date day of the week Onset_date Patient_Residence Patient_age Patient_Gender Patient_attribute Patient_Status Patient_Symptoms Patient_travel history flag Remarks Discharged flag
0 1 130001 Tokyo NaN 2020-01-24 Fri NaN Wuhan City, Hubei Province 40s male NaN NaN NaN NaN NaN 1.0
1 2 130001 Tokyo NaN 2020-01-25 Sat NaN Wuhan City, Hubei Province 30s female NaN NaN NaN NaN NaN 1.0
2 3 130001 Tokyo NaN 2020-01-30 Thu NaN Changsha City, Hunan Province 30s female NaN NaN NaN NaN NaN 1.0
3 4 130001 Tokyo NaN 2020-02-13 Thu NaN Tokyo 70s male NaN NaN NaN NaN NaN 1.0
4 5 130001 Tokyo NaN 2020-02-14 Fri NaN Tokyo 50s female NaN NaN NaN NaN NaN 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28131 28031 130001 Tokyo NaN 2020-10-14 Wed NaN NaN Under 10 years old male NaN NaN NaN NaN NaN NaN
28132 28032 130001 Tokyo NaN 2020-10-14 Wed NaN NaN 20s male NaN NaN NaN NaN NaN NaN
28133 28033 130001 Tokyo NaN 2020-10-14 Wed NaN NaN 70s male NaN NaN NaN NaN NaN NaN
28134 28034 130001 Tokyo NaN 2020-10-14 Wed NaN NaN 20s female NaN NaN NaN NaN NaN NaN
28135 28035 130001 Tokyo NaN 2020-10-14 Wed NaN NaN 40s female NaN NaN NaN NaN NaN NaN

28136 rows × 16 columns

Since we only need the number of cases, use groupby and size to count the rows for each date.

covid_daily = covid_src.groupby("Published_date").size()
covid_daily.index.name = 'date'
covid_daily
date
2020-01-24      1
2020-01-25      1
2020-01-30      1
2020-02-13      1
2020-02-14      2
             ... 
2020-10-10    249
2020-10-11    146
2020-10-12     78
2020-10-13    166
2020-10-14    177
Length: 239, dtype: int64
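
What groupby and size do here is easiest to see on a tiny table. The sketch below uses a made-up patient-level DataFrame (cases, with hypothetical columns reported and age) with one row per case.

```python
import pandas as pd

# Hypothetical patient-level table: one row per reported case
cases = pd.DataFrame({
    'reported': pd.to_datetime(
        ['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-04']
    ),
    'age': ['20s', '40s', '30s', '70s'],
})

# Count the rows for each date
daily = cases.groupby('reported').size()
print(daily)
```

Note that a date with no rows, 2020-03-03 in this made-up table, does not appear in the result at all.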

If you simply plot the number of cases for each date, the line zigzags like this.

covid_daily.plot()

(Figure output_23_1.png: line plot of the daily number of cases)

Set a one-week window to capture the trend in positive patients. Using mean() instead of sum() gives a moving average.

covid_daily.rolling(window='7D').mean().plot()

(Figure output_25_1.png: line plot of the 7-day moving average)
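
One point to keep in mind is that covid_daily has a row only for dates that appear in the source data, so a '7D' window averages over the rows present rather than over seven calendar days. If a missing date should count as zero cases (an assumption about the data, not something the dataset states), the dates can be filled in before rolling. A minimal sketch on made-up numbers, where daily, filled and the values are hypothetical:

```python
import pandas as pd

# Hypothetical daily counts with a gap: no rows for 2020-03-03 and 2020-03-04
daily = pd.Series(
    [2, 4, 6],
    index=pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-05']),
)

# Average over the rows that exist in each 7-day window
mean_present = daily.rolling('7D').mean()

# Assumption: a missing date means zero cases, so insert those dates first
filled = daily.asfreq('D', fill_value=0)
mean_filled = filled.rolling('7D').mean()

print(mean_present)
print(mean_filled)
```

On the last date, the first version divides the total by the three rows present, while the second divides it by five days, so the filled version gives a lower average.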
