[PYTHON] Let's write an aggregation process for a certain period using pandas × groupby × Grouper

I wrote this article from the motivation of "I want to write a process to create a feature quantity that aggregates certain columns every N days with pandas, but what should I do?"

The sample code is an excerpt from the Google Colony linked below. GoogleColab --pandas x groupby x Grouper sample

In addition, we use this data (kaggle Datasets --Bitcoin Historical Data).

First of all, it is from preprocessing of data, Timestamp column is converted to datetime type for the previous processing.

load_dataset


import pandas as pd

btc = pd.read_csv("/content/bitstampUSD_1-min_data_2012-01-01_to_2019-08-12.csv")
btc[btc.columns.values] = btc[btc.columns.values].ffill()

# https://stackoverflow.com/questions/19231871/convert-unix-time-to-readable-date-in-pandas-dataframe
btc["Timestamp"] = pd.to_datetime(btc["Timestamp"], unit='s')

btc.head()

"""
Timestamp	Open	High	Low	Close	Volume_(BTC)	Volume_(Currency)	Weighted_Price
0	2011-12-31 07:52:00	4.39	4.39	4.39	4.39	0.455581	2.0	4.39
1	2011-12-31 07:53:00	4.39	4.39	4.39	4.39	0.455581	2.0	4.39
2	2011-12-31 07:54:00	4.39	4.39	4.39	4.39	0.455581	2.0	4.39
3	2011-12-31 07:55:00	4.39	4.39	4.39	4.39	0.455581	2.0	4.39
4	2011-12-31 07:56:00	4.39	4.39	4.39	4.39	0.455581	2.0	4.39
"""

The problem is ** the process of creating features by aggregating certain columns every N days **, but as the title suggests, ** pandas x groupby x Grouper ** is used. "Group by" is often used when performing aggregation processing, but this time it is a miso to use "Grouper" because we want to group ** every N days ** and every fixed cycle / interval. Since it is time to summarize, if you specify the Timestamp column for the Grouper key and specify freq ='D', the process of summarizing every day is completed.

groupby_grouper


# freq='D'By setting it to, it can be summarized on a daily basis.
btc.groupby(pd.Grouper(key='Timestamp', freq='D')).mean()
"""
Open	High	Low	Close	Volume_(BTC)	Volume_(Currency)	Weighted_Price
Timestamp							
2011-12-31	4.472552	4.472624	4.472552	4.472624	7.590527	34.069690	4.472579
2012-01-01	4.680778	4.680778	4.680778	4.680778	5.861697	27.856539	4.680778
2012-01-02	5.000000	5.000000	5.000000	5.000000	11.566478	57.832389	5.000000
2012-01-03	5.145917	5.145917	5.145917	5.145917	13.506125	68.725391	5.145917
2012-01-04	5.176708	5.228729	5.176708	5.228729	17.221039	91.624089	5.195443
...	...	...	...	...	...	...	...
2019-08-08	11802.176188	11809.075389	11794.541660	11802.534583	7.059914	83271.678234	11801.428490
2019-08-09	11820.325465	11826.872431	11813.188132	11820.873674	5.099464	60251.317095	11819.498275
2019-08-10	11582.574299	11587.521007	11576.103201	11582.249007	5.219845	60211.963986	11581.356693
2019-08-11	11378.198736	11382.765819	11373.714118	11379.078882	3.062079	34822.593814	11378.195194
2019-08-12	11527.440000	11551.570000	11520.000000	11520.000000	23.805939	274731.256920	11540.450291
2782 rows × 7 columns
"""

that's all.

reference

Recommended Posts

Let's write an aggregation process for a certain period using pandas × groupby × Grouper
Let's make a module for Python using SWIG
Make an IP address allocation / allocation list for a certain area
Let's create a function for parametrized test using frame object