To get a rough picture of data that fluctuates from day to day, such as stock prices or the number of people testing positive for the novel coronavirus, you can average each value together with the values around it. This is a note on rolling, the pandas feature that does this.
import pandas as pd
First, to confirm the basic behavior of rolling, create a Series of ten 1s and name it ones.
ones = pd.Series([1] * 10)
ones
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
Use rolling to sum the ones four at a time. Because each window adds up four 1s, the result is a column of 4s, except that the positions where four values are not yet available are NaN. The sums are shown as sum-4 next to the original column for comparison.
The window argument specifies how many values are aggregated (the window width). Picture a window four elements wide sliding from the start of the data to the end, summing as it goes. Each sum is recorded at the last position of its window, so the first three values of sum-4 are NaN.
pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(window=4).sum(),
})
| | ones | sum-4 |
---|---|---|
0 | 1 | NaN |
1 | 1 | NaN |
2 | 1 | NaN |
3 | 1 | 4.0 |
4 | 1 | 4.0 |
5 | 1 | 4.0 |
6 | 1 | 4.0 |
7 | 1 | 4.0 |
8 | 1 | 4.0 |
9 | 1 | 4.0 |
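To make the window arithmetic concrete, here is a minimal sketch that reproduces a trailing four-element sum by hand and compares it with rolling. The helper name manual_rolling_sum and the sample values are made up for illustration; they are not part of pandas.

```python
import numpy as np
import pandas as pd


def manual_rolling_sum(series, window):
    """Hypothetical helper: trailing-window sum written out by hand."""
    values = series.to_numpy(dtype=float)
    out = np.full(len(values), np.nan)
    # The first complete window ends at position window - 1.
    for end in range(window - 1, len(values)):
        out[end] = values[end - window + 1 : end + 1].sum()
    return pd.Series(out, index=series.index)


sample = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])
by_hand = manual_rolling_sum(sample, 4)
built_in = sample.rolling(window=4).sum()

print(pd.DataFrame({"sample": sample, "by hand": by_hand, "rolling": built_in}))
```

Using values other than 1 makes it easier to see which elements each window actually covers.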
If you specify center=True, the aggregated result is recorded at the middle of the window instead of at its end. This is shown as sum-center. When the window width is even, the result is recorded at the position just after the center. With window=4, the first two values and the last value are now NaN.
pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(window=4).sum(),
    'sum-center': ones.rolling(window=4, center=True).sum(),
})
| | ones | sum-4 | sum-center |
---|---|---|---|
0 | 1 | NaN | NaN |
1 | 1 | NaN | NaN |
2 | 1 | NaN | 4.0 |
3 | 1 | 4.0 | 4.0 |
4 | 1 | 4.0 | 4.0 |
5 | 1 | 4.0 | 4.0 |
6 | 1 | 4.0 | 4.0 |
7 | 1 | 4.0 | 4.0 |
8 | 1 | 4.0 | 4.0 |
9 | 1 | 4.0 | NaN |
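The alignment is easier to check with values that differ from row to row. The sketch below assumes a width of 4, as in the example above, and uses made-up values 0 to 9. For this width, the centered result is the trailing result moved up by one row.

```python
import pandas as pd

values = pd.Series(range(10), dtype="float64")
trailing = values.rolling(window=4).sum()
centered = values.rolling(window=4, center=True).sum()

# With window=4 the centered result at position i covers i-2 .. i+1,
# which is the trailing result shifted up by one row.
comparison = pd.DataFrame({
    "values": values,
    "trailing": trailing,
    "centered": centered,
    "trailing shifted": trailing.shift(-1),
})
print(comparison)
```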
By default, positions at the edge of the data where the window is not full are not aggregated. If you specify min_periods, however, a result is produced whenever the window holds at least min_periods values. In this example, min_periods=2 is specified, so results appear from the second row onward.
pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(4).sum(),
    'min-periods': ones.rolling(window=4, min_periods=2).sum(),
})
| | ones | sum-4 | min-periods |
---|---|---|---|
0 | 1 | NaN | NaN |
1 | 1 | NaN | 2.0 |
2 | 1 | NaN | 3.0 |
3 | 1 | 4.0 | 4.0 |
4 | 1 | 4.0 | 4.0 |
5 | 1 | 4.0 | 4.0 |
6 | 1 | 4.0 | 4.0 |
7 | 1 | 4.0 | 4.0 |
8 | 1 | 4.0 | 4.0 |
9 | 1 | 4.0 | 4.0 |
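The same option matters for averages. As a small illustration with made-up numbers, min_periods=1 makes the mean use however many values are available at the start of the data, instead of leaving those rows as NaN.

```python
import pandas as pd

prices = pd.Series([2, 4, 6, 8, 10])  # hypothetical sample values
strict = prices.rolling(window=4).mean()
relaxed = prices.rolling(window=4, min_periods=1).mean()

# strict needs four values per window; relaxed accepts one or more.
print(pd.DataFrame({"prices": prices, "strict": strict, "relaxed": relaxed}))
```

Note that the early rows of the relaxed column are averages over fewer than four values, so they are noisier than the rest.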
Up to this point every element in the window has counted equally, but win_type lets you weight the elements within the window. It requires scipy to be installed.
pd.DataFrame({
    'ones': ones,
    'sum-4': ones.rolling(4).sum(),
    'win_type': ones.rolling(window=4, win_type='nuttall').sum(),
})
| | ones | sum-4 | win_type |
---|---|---|---|
0 | 1 | NaN | NaN |
1 | 1 | NaN | NaN |
2 | 1 | NaN | NaN |
3 | 1 | 4.0 | 1.059185 |
4 | 1 | 4.0 | 1.059185 |
5 | 1 | 4.0 | 1.059185 |
6 | 1 | 4.0 | 1.059185 |
7 | 1 | 4.0 | 1.059185 |
8 | 1 | 4.0 | 1.059185 |
9 | 1 | 4.0 | 1.059185 |
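The value 1.059185 is the total of the four window weights. The sketch below assumes that pandas takes the weights from scipy's symmetric Nuttall window of length 4, which the matching total suggests. It needs scipy installed.

```python
import pandas as pd
from scipy.signal import windows

ones = pd.Series([1] * 10)
weights = windows.nuttall(4)  # symmetric Nuttall window of length 4
weighted = ones.rolling(window=4, win_type="nuttall").sum()

print(weights)           # the four weights applied inside each window
print(weights.sum())     # total weight
print(weighted.iloc[3])  # first complete window of the weighted sum
```

Because every element is 1, each weighted sum is simply the sum of the weights.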
So far, window has been given as a number of elements. When the index of the Series is a date or time, window can instead be given as a period, such as a number of hours or days. Let's make some sample data.
date_ones = pd.Series(
    index=[
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2020-01-02'),
        pd.Timestamp('2020-01-07'),
        pd.Timestamp('2020-01-08'),
        pd.Timestamp('2020-01-09'),
        pd.Timestamp('2020-01-16'),
    ],
    data=[1, 1, 1, 1, 1, 1]
)
date_ones
2020-01-01 1
2020-01-02 1
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-16 1
dtype: int64
If you specify window='7D' here, each result aggregates the elements that fall within the 7 days ending at that row's date.
pd.DataFrame({
    'ones': date_ones,
    'rolling-7d': date_ones.rolling('7D').sum(),
})
| | ones | rolling-7d |
---|---|---|
2020-01-01 | 1 | 1.0 |
2020-01-02 | 1 | 2.0 |
2020-01-07 | 1 | 3.0 |
2020-01-08 | 1 | 3.0 |
2020-01-09 | 1 | 3.0 |
2020-01-16 | 1 | 1.0 |
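In this example, the window for 2020-01-08 covers 2020-01-02 through 2020-01-08, so three elements are summed. The element dated exactly 7 days earlier, 2020-01-01, falls outside the window, which is why the result is 3.0 rather than 4.0.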
A window given as a string such as '7D' is called an offset. The available strings are explained under "Offset aliases" in the pandas documentation.
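As a rough sketch of what the offset window does, the sums can be rebuilt by hand with a date filter. The boundary rule used here (more recent than 7 days before the row's date, up to and including that date) is read off the table above; the variable names are made up for illustration.

```python
import pandas as pd

date_ones = pd.Series(
    [1, 1, 1, 1, 1, 1],
    index=pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-07",
        "2020-01-08", "2020-01-09", "2020-01-16",
    ]),
)

span = pd.Timedelta("7D")
by_hand = pd.Series(
    [
        # Keep rows strictly newer than (t - 7 days) and no later than t.
        date_ones[(date_ones.index > t - span) & (date_ones.index <= t)].sum()
        for t in date_ones.index
    ],
    index=date_ones.index,
    dtype="float64",
)
print(by_hand)
```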
When window is an offset, a result is produced as long as the window contains at least one element, so NaN does not appear just because a row is near the edge of the data. If min_periods is specified, the result is NaN whenever the offset period contains fewer elements than min_periods.
pd.DataFrame({
    'ones': date_ones,
    'rolling-7d': date_ones.rolling('7D').sum(),
    'rolling-min-periods': date_ones.rolling('7D', min_periods=3).sum(),
})
| | ones | rolling-7d | rolling-min-periods |
---|---|---|---|
2020-01-01 | 1 | 1.0 | NaN |
2020-01-02 | 1 | 2.0 | NaN |
2020-01-07 | 1 | 3.0 | 3.0 |
2020-01-08 | 1 | 3.0 | 3.0 |
2020-01-09 | 1 | 3.0 | 3.0 |
2020-01-16 | 1 | 1.0 | NaN |
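One way to see why those rows become NaN is to count the elements in each 7-day window with count(). The sketch below rebuilds the same sample data; the column labels are arbitrary.

```python
import pandas as pd

date_ones = pd.Series(
    [1, 1, 1, 1, 1, 1],
    index=pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-07",
        "2020-01-08", "2020-01-09", "2020-01-16",
    ]),
)

# Number of elements that fall inside each 7-day window.
counts = date_ones.rolling("7D").count()
# Sum that is only reported when the window holds at least 3 elements.
gated = date_ones.rolling("7D", min_periods=3).sum()

print(pd.DataFrame({"count-7d": counts, "sum-min-3": gated}))
```

The rows where the count is below 3 are exactly the rows where the gated sum is NaN.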
Now, let's use what we have covered so far to draw a moving-average graph of the number of people infected with the novel coronavirus in Tokyo.
The data comes from "Details of Announcement of New Coronavirus Positive Patients in Tokyo".
covid_src = pd.read_csv('https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv', parse_dates=['Published_date'])
covid_src
| | No | National Local Public Organization Code | Prefecture name | City name | Published_date | day of the week | Onset_date | Patient_Residence | Patient_age | Patient_Gender | Patient_attribute | Patient_Status | Patient_Symptoms | Patient_travel history flag | Remarks | Discharged flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 130001 | Tokyo | NaN | 2020-01-24 | Fri | NaN | Wuhan City, Hubei Province | 40s | male | NaN | NaN | NaN | NaN | NaN | 1.0 |
1 | 2 | 130001 | Tokyo | NaN | 2020-01-25 | Sat | NaN | Wuhan City, Hubei Province | 30s | female | NaN | NaN | NaN | NaN | NaN | 1.0 |
2 | 3 | 130001 | Tokyo | NaN | 2020-01-30 | Thu | NaN | Changsha City, Hunan Province | 30s | female | NaN | NaN | NaN | NaN | NaN | 1.0 |
3 | 4 | 130001 | Tokyo | NaN | 2020-02-13 | Thu | NaN | Tokyo | 70s | male | NaN | NaN | NaN | NaN | NaN | 1.0 |
4 | 5 | 130001 | Tokyo | NaN | 2020-02-14 | Fri | NaN | Tokyo | 50s | female | NaN | NaN | NaN | NaN | NaN | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
28131 | 28031 | 130001 | Tokyo | NaN | 2020-10-14 | Wed | NaN | NaN | Under 10 years old | male | NaN | NaN | NaN | NaN | NaN | NaN |
28132 | 28032 | 130001 | Tokyo | NaN | 2020-10-14 | Wed | NaN | NaN | 20s | male | NaN | NaN | NaN | NaN | NaN | NaN |
28133 | 28033 | 130001 | Tokyo | NaN | 2020-10-14 | Wed | NaN | NaN | 70s | male | NaN | NaN | NaN | NaN | NaN | NaN |
28134 | 28034 | 130001 | Tokyo | NaN | 2020-10-14 | Wed | NaN | NaN | 20s | female | NaN | NaN | NaN | NaN | NaN | NaN |
28135 | 28035 | 130001 | Tokyo | NaN | 2020-10-14 | Wed | NaN | NaN | 40s | female | NaN | NaN | NaN | NaN | NaN | NaN |
28136 rows × 16 columns
Since only the number of people is needed, use groupby and size to get the number of cases per date.
covid_daily = covid_src.groupby("Published_date").size()
covid_daily.index.name = 'date'
covid_daily
date
2020-01-24 1
2020-01-25 1
2020-01-30 1
2020-02-13 1
2020-02-14 2
...
2020-10-10 249
2020-10-11 146
2020-10-12 78
2020-10-13 166
2020-10-14 177
Length: 239, dtype: int64
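The counting step can be tried on a toy table without downloading anything. The rows below are made up to mirror the first few dates above, with one row per reported case; toy_src and toy_daily are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the patient table: one row per reported case.
toy_src = pd.DataFrame({
    "Published_date": pd.to_datetime([
        "2020-01-24", "2020-01-25", "2020-01-30",
        "2020-02-13", "2020-02-14", "2020-02-14",
    ]),
})

# size() counts the rows in each group, i.e. the cases per date.
toy_daily = toy_src.groupby("Published_date").size()
toy_daily.index.name = "date"
print(toy_daily)

# Dates without any case are simply absent from the result.
print(pd.Timestamp("2020-01-26") in toy_daily.index)
```

Because days with no cases are missing from the index, a window given as a period such as '7D' is a reasonable fit for this data: it follows the calendar rather than the row count.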
If you simply plot the number of cases for each date, the graph zigzags like this.
covid_daily.plot()
Set a one-week window to capture the trend in positive cases. Using mean() instead of sum() gives a moving average.
covid_daily.rolling(window='7D').mean().plot()
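To see why a one-week window suits data with a day-of-week rhythm, here is a small sketch on synthetic counts that repeat the same weekly pattern. The numbers are invented and unrelated to the Tokyo data.

```python
import pandas as pd

days = pd.date_range("2020-06-01", periods=21, freq="D")
# Hypothetical counts that repeat the same weekly pattern three times.
weekly_pattern = [10, 20, 30, 40, 50, 60, 70]
toy_daily = pd.Series(weekly_pattern * 3, index=days)

smoothed = toy_daily.rolling(window="7D").mean()

# Once a full week is inside the window, the weekly swing averages out.
print(smoothed.iloc[6:].unique())
```

Every full 7-day window contains each weekday exactly once, so the within-week variation cancels and only the underlying level remains.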