pandas category type aggregation trap

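When the grouping column has the category dtype, groupby may produce a result row for every category defined in the dtype, including categories that do not appear in the data being aggregated.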

import pandas as pd  # version 1.1.2

# Define the DataFrame
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'col2': [1, 2, 1, 2, 1, 2]
})
# Convert col1 to the category dtype
df['col1'] = df['col1'].astype('category')
# Copy the first 3 rows (only 'a' and 'b' remain in col1)
df_sub = df.head(3).copy()
# Group by col1 and take the mean of col2
df_grp = df_sub.groupby('col1')
df_agg = df_grp.agg({'col2': 'mean'}).reset_index()
df_agg.columns = ['col1', 'mean_col2']

df_sub is as follows.

	col1	col2
0	a	1
1	a	2
2	b	1

df_agg is as follows.

	col1	mean_col2
0	a	1.5
1	b	1.0
2	c	NaN

Problem

df_agg contains a row where col1 is c, even though the aggregation was run on df_sub, which has no rows with c. If you check df_grp.groups, it is {'a': [0, 1], 'b': [2], 'c': []}: the category c is still a group, just an empty one, so its mean is NaN.
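The cause can be checked directly. The following is a minimal sketch, assuming the same df and df_sub as above; the variable names categories and present_values are introduced here only for illustration. It shows that slicing with head() does not shrink the list of categories stored in the dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'col2': [1, 2, 1, 2, 1, 2]
})
df['col1'] = df['col1'].astype('category')
df_sub = df.head(3).copy()

# Categories are part of the dtype, so they survive the slice
categories = list(df_sub['col1'].cat.categories)
# Values that actually occur in the sliced data
present_values = sorted(set(df_sub['col1'].tolist()))

print(categories)
print(present_values)
```

Here categories still includes c, while present_values does not, which is why groupby can end up with an empty group for c.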

Countermeasures

Pass observed=True when defining df_grp, so that only the categories that actually appear in the data become groups.

df_grp = df_sub.groupby('col1', observed=True)
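To compare the two behaviors side by side, here is a small sketch using the same data as above. observed=False is passed explicitly so that the comparison does not depend on the library default; agg_all and agg_obs are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'col2': [1, 2, 1, 2, 1, 2]
})
df['col1'] = df['col1'].astype('category')
df_sub = df.head(3).copy()

# observed=False: every category in the dtype becomes a group
agg_all = df_sub.groupby('col1', observed=False).agg({'col2': 'mean'}).reset_index()
agg_all.columns = ['col1', 'mean_col2']

# observed=True: only categories present in df_sub become groups
agg_obs = df_sub.groupby('col1', observed=True).agg({'col2': 'mean'}).reset_index()
agg_obs.columns = ['col1', 'mean_col2']

print(agg_all)
print(agg_obs)
```

agg_all keeps the row for c with NaN, while agg_obs contains only the rows for a and b.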

This countermeasure was revised following a suggestion from @nkay. Thank you very much.

groupby documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
