[PYTHON] Feature generation with pandas group by

I want to create features with pandas group by

When you want to attach per-group statistics to each record as a feature, you don't need to build a dict with collections, or run groupby and then merge the result back. Just printing the statistics is easy, but I had a hard time with pandas.DataFrame.groupby when I wanted to add them to each record as features, so I am leaving this as a memo.

What I want to say is that groupby.transform is convenient.
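
As a rough sketch of the difference, the following compares the groupby-then-merge route with transform, using the same small site/dat table as the sample data in this memo. The variable names (site_mean, merged, transformed) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Route 1: aggregate per group, then merge the result back onto the records.
site_mean = (
    df.groupby("site", as_index=False)["dat"].mean()
      .rename(columns={"dat": "site_mean"})
)
merged = df.merge(site_mean, on="site", how="left")

# Route 2: transform returns a Series aligned with the original index,
# so it can be assigned as a new column in one step.
transformed = df.assign(site_mean=df.groupby("site")["dat"].transform("mean"))

print(merged)
print(transformed)
```

Both routes should give the same site_mean column; transform simply skips the intermediate table and the merge.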

Sample data

import pandas as pd
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50]
})
  site  dat
0    A   15
1    A   30
2    A   30
3    B   30
4    B   10
5    C   50

Basic statistics such as mean, maximum, and minimum for each attribute

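Passing the name of a statistic to transform generates the feature directly. The code below computes the mean of dat for each site; replacing "mean" with "max", "min", "median", "var", and so on works the same way. NumPy functions such as np.mean can also be passed, but recent pandas versions may emit a warning and suggest the string name instead.

import numpy as np
# Select the "dat" column explicitly so that transform returns a Series
df["site_mean"] = df.groupby("site")["dat"].transform("mean")
  site  dat  site_mean
0    A   15       25.0
1    A   30       25.0
2    A   30       25.0
3    B   30       20.0
4    B   10       20.0
5    C   50       50.0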
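
As a small sketch of how several statistics can be added at once, the loop below creates one column per statistic. The column naming scheme (site_max, site_min, ...) is my own assumption for illustration, not something pandas prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# One new feature column per statistic, each aligned with the original rows.
for stat in ["max", "min", "median", "var"]:
    df[f"site_{stat}"] = df.groupby("site")["dat"].transform(stat)

print(df)
```

Note that var is the sample variance by default, so a group with a single record (site C here) gets NaN and may need to be filled before being used as a feature.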

Count Encoding

Count encoding turns the number of occurrences of a categorical value in a column into a new feature. Combined with groupby, it can express something like how rare a value is within a group. You could do this with collections.Counter, but transform handles it as well.

The code below converts each record to the number of occurrences of its (site, dat) pair. For example, dat = 30 appears twice in site A, so those two records get 2.

# Select one column explicitly; "size" counts the rows in each (site, dat) group
df["count_site_dat"] = df.groupby(["site", "dat"])["dat"].transform("size")
  site  dat  site_mean  count_site_dat
0    A   15       25.0               1
1    A   30       25.0               2
2    A   30       25.0               2
3    B   30       20.0               1
4    B   10       20.0               1
5    C   50       50.0               1
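
For comparison, here is a minimal sketch of the collections.Counter route next to transform, plus plain count encoding of a single column. The column names count_counter, count_transform, and count_site are made up for this example.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Counter route: count each (site, dat) pair, then look up the count per row.
pairs = list(zip(df["site"], df["dat"]))
pair_counts = Counter(pairs)
df["count_counter"] = [pair_counts[p] for p in pairs]

# transform route: the group size is broadcast back to every row of the group.
df["count_transform"] = df.groupby(["site", "dat"])["dat"].transform("size")

# Count encoding of a single column: how many records each site has.
df["count_site"] = df.groupby("site")["dat"].transform("size")

print(df)
```

The two pair-count columns should match; the transform version just avoids the intermediate list and dictionary lookup.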

Ranking

Within each group, rank the records by the value of a column. The code below ranks dat within each site, with 1 as the smallest value.

df["site_rank"] = df.groupby("site")["dat"].rank(method="dense")
  site  dat  site_mean  count_site_dat  site_rank
0    A   15       25.0               1        1.0
1    A   30       25.0               2        2.0
2    A   30       25.0               2        2.0
3    B   30       20.0               1        2.0
4    B   10       20.0               1        1.0
5    C   50       50.0               1        1.0

The method argument of rank mainly controls how ties (equal values) are ranked: "average", "min", "max", "first", or "dense". For details, see the pandas documentation for the rank method of DataFrame and Series.
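
To see how ties are handled, the sketch below ranks dat within each site once per method and collects the results side by side. The ranks table and the rank_desc name are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# One column per tie-breaking method, ranked within each site.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({
    method: df.groupby("site")["dat"].rank(method=method)
    for method in methods
})

# ascending=False gives rank 1 to the largest value in each group.
rank_desc = df.groupby("site")["dat"].rank(method="dense", ascending=False)

print(pd.concat([df, ranks], axis=1))
print(rank_desc)
```

The two tied 30s in site A are where the methods differ: "average" gives both 2.5, "min" gives 2, "max" gives 3, "first" gives 2 and 3 in order of appearance, and "dense" gives 2 without leaving a gap.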
