I want to create features with pandas groupby

When you want to add per-group statistics to each record as a feature, you do not need to build something like a dict with collections, or run groupby and then merge the result back. Simply printing the statistics is easy, but I had a hard time with pandas.DataFrame.groupby when I wanted to attach them to each record as a feature, so I am leaving this as a memo.

What I want to say is that groupby.transform is convenient.
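To make the contrast concrete, here is a minimal sketch that compares the groupby-and-merge route with transform on the same data used in this memo. The variable names (site_mean, merged, transformed) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Route 1: aggregate per group, then merge the result back onto every row.
site_mean = df.groupby("site", as_index=False)["dat"].mean()
site_mean = site_mean.rename(columns={"dat": "site_mean"})
merged = df.merge(site_mean, on="site", how="left")

# Route 2: transform returns a Series aligned with the original index,
# so it can be assigned as a new column in one step.
transformed = df.assign(site_mean=df.groupby("site")["dat"].transform("mean"))

# Both routes should give the same feature column.
print(merged["site_mean"].tolist())
print(transformed["site_mean"].tolist())
```

The merge route needs an intermediate table and a rename, while transform keeps the row order and index of df without any extra steps.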

Sample data

import pandas as pd
df = pd.DataFrame({
     "site":["A","A","A","B","B","C"],
     "dat":[15,30,30,30,10,50]
})

	site	dat
0	A	15
1	A	30
2	A	30
3	B	30
4	B	10
5	C	50

Basic statistics such as mean, maximum, and minimum for each attribute

Features for other statistics can be generated in the same way by changing the argument of transform, for example to the maximum or minimum. The same applies to median, var, etc. The code below calculates the mean of dat for each site.

import numpy as np
# Select the "dat" column so that transform returns a single Series aligned with df
df["site_mean"] = df.groupby("site")["dat"].transform("mean")

	site	dat	site_mean
0	A	15	25
1	A	30	25
2	A	30	25
3	B	30	20
4	B	10	20
5	C	50	50
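As a rough sketch of the other statistics mentioned above, the loop below adds max, min, median, and var per site. The column names site_max and so on are made up for this example; the statistic names are passed to transform as strings.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Group the "dat" column by site once and reuse it for every statistic.
grouped = df.groupby("site")["dat"]

for stat in ["max", "min", "median", "var"]:
    # Each call returns a Series with one value per original row.
    df[f"site_{stat}"] = grouped.transform(stat)

# Note: var uses the sample variance (ddof=1) by default, so a group
# with a single row (site C here) gets NaN.
print(df)
```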

Count Encoding

Count encoding turns the number of occurrences of each (categorical) value in a column into a new feature. Combined with groupby, it can express something like rarity within an attribute. You can do it with collections.Counter, but transform handles this as well.

The code below converts each row to the number of occurrences of its (site, dat) pair. For example, the value 30 appears twice on site A.

# Select a column explicitly so the result does not depend on which other columns exist
df["count_site_dat"] = df.groupby(["site","dat"])["dat"].transform("size")

	site	dat	site_mean	count_site_dat
0	A	15	25	1
1	A	30	25	2
2	A	30	25	2
3	B	30	20	1
4	B	10	20	1
5	C	50	50	1
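For comparison, here is a small sketch of the collections.Counter route next to the transform route. The column names count_counter and count_transform are hypothetical and only serve to show that both give the same counts.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Counter route: count every (site, dat) pair, then look each row up.
pairs = list(zip(df["site"], df["dat"]))
counts = Counter(pairs)
df["count_counter"] = [counts[pair] for pair in pairs]

# transform route: the group size is broadcast back to every row.
df["count_transform"] = df.groupby(["site", "dat"])["dat"].transform("size")

print(df)
```

The Counter version needs an explicit pass over the rows, while transform stays inside pandas and keeps the index alignment for you.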

Ranking

Among the records that share a certain feature value, calculate the rank of each record's value. Here, each row gets the rank of its dat within its site.

df["site_rank"] = df.groupby("site")["dat"].rank(method="dense")

	site	dat	site_mean	count_site_dat	site_rank
0	A	15	25	1	1
1	A	30	25	2	2
2	A	30	25	2	2
3	B	30	20	1	2
4	B	10	20	1	1
5	C	50	50	1	1

Changing the method argument of rank mainly changes how ties (records with the same value) are ranked. For details, refer to a description of the rank method for ranking pandas.DataFrame and Series.
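To see how ties are handled, the sketch below ranks dat within each site once per method. The column names rank_average and so on are only for illustration; the method values themselves are the ones accepted by rank.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

grouped = df.groupby("site")["dat"]

# Site A contains the tie 30, 30, which is where the methods differ:
#   average -> both get the mean of the tied positions
#   min     -> both get the lowest tied position
#   max     -> both get the highest tied position
#   first   -> ties are broken by order of appearance
#   dense   -> like min, but ranks never skip a number
for method in ["average", "min", "max", "first", "dense"]:
    df[f"rank_{method}"] = grouped.rank(method=method)

print(df)
```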
