Here are some useful GroupBy tips I learned when I entered Kaggle's Data Science Bowl 2019.
Use the following fictitious data
import pandas as pd
df = pd.DataFrame({
'name' : ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Fred', 'George'],
'state' : ['NY', 'CA', 'NY', 'CA', 'FL', 'FL', 'NY'],
'score' : [4, 3, 5, 10, 1, 0, 7]
})
name | state | score | |
---|---|---|---|
0 | Alice | NY | 4 |
1 | Bob | CA | 3 |
2 | Charlie | NY | 5 |
3 | David | CA | 10 |
4 | Eve | FL | 1 |
5 | Fred | FL | 0 |
6 | George | NY | 7 |
Use first ()
or head (1)
, nth (0)
head (n)
gets n rows from the beginning of each group, so specify n = 1
if you want the first row
nth (n)
gets the nth line of each group, so if you want the first line, specify n = 0
Note that each method behaves differently when there are missing values.
first ()
gets the first non-NaN value for each column, so if it contains missing values, it may return spliced data from separate rows.
head ()
gets n rows from the beginning, including missing values
The behavior of nth ()
differs depending on the setting value of dropna
(None
, ʻany, ʻall
).
Also, first ()
and nth ()
have the same output format, but head ()
is different.
df.groupby('state').first()
# name score
#state
#CA Bob 3
#FL Eve 1
#NY Alice 4
df.groupby('state').head(1)
# name state score
#0 Alice NY 4
#1 Bob CA 3
#4 Eve FL 1
df.groupby('state').nth(0)
# name score
#state
#CA Bob 3
#FL Eve 1
#NY Alice 4
If there are missing values
#Data creation omitted
print(with_nan_df)
# name state score
#0 Alice NY NaN
#1 Bob CA 3.0
#2 Charlie NY 5.0
#3 David CA 10.0
#4 Eve FL NaN
#5 Fred FL 0.0
#6 George NY 7.0
with_nan_df.groupby('state').first()
# name score
#state
#CA Bob 3.0
#FL Eve 0.0
#NY Alice 5.0
#⇒ Eve and Alice data are spliced with other data!
with_nan_df.groupby('state').head(1)
# name state score
#0 Alice NY NaN
#1 Bob CA 3.0
#4 Eve FL NaN
#⇒ NaN remains as it is!
See the following page for details on the difference in behavior when there are missing values. [Pandas] The story of the first and last functions that find the first and last lines of groupby, the difference from head and nth
Use last ()
or tail (1)
, nth (-1)
tail (n)
gets n rows from the end of each group, so specify n = 1
if you want the last row
nth (n)
gets the nth line of each group, so if you want the last line, specify n = -1
Note that each method behaves differently when there are missing values, just like when the first row was extracted earlier.
last ()
behaves the same as first ()
and tail ()
behaves the same as head ()
df.groupby('state').last()
# name score
#state
#CA David 10
#FL Fred 0
#NY George 7
df.groupby('state').tail(1)
# name state score
#3 David CA 10
#5 Fred FL 0
#6 George NY 7
df.groupby('state').nth(-1)
# name score
#state
#CA David 10
#FL Fred 0
#NY George 7
Use size ()
Similar results can be obtained with value_counts ()
when group by
is done with only one column, but it is convenient when getting the number of data of each pair for multiple columns.
df.groupby('state').size()
#state
#CA 2
#FL 2
#NY 3
#dtype: int64
df['state'].value_counts()
#NY 3
#CA 2
#FL 2
#Name: state, dtype: int64
Get the number of data in each pair for multiple columns
#Data creation omitted
print(team_df)
# name state score team
#0 Alice NY 4 A
#1 Bob CA 3 A
#2 Charlie NY 5 A
#3 David CA 10 A
#4 Eve FL 1 B
#5 Fred FL 0 B
#6 George NY 7 B
team_df.groupby(['state', 'team']).size()
#state team
#CA A 2
#FL B 2
#NY A 2
# B 1
#dtype: int64
You can use shift ()
in the result of groupby
#Sort data by state to make the results easier to see
df.sort_values('state', inplace=True)
print(df)
# name state score
#1 Bob CA 3
#3 David CA 10
#4 Eve FL 1
#5 Fred FL 0
#0 Alice NY 4
#2 Charlie NY 5
#6 George NY 7
df.groupby('state')['score'].shift()
#1 NaN
#3 3.0
#4 NaN
#5 1.0
#0 NaN
#2 4.0
#6 5.0
#Name: score, dtype: float64
Apply cumsum ()
to the result of groupby
using ʻapply`
print(df)
# name state score
#1 Bob CA 3
#3 David CA 10
#4 Eve FL 1
#5 Fred FL 0
#0 Alice NY 4
#2 Charlie NY 5
#6 George NY 7
df.groupby('state').apply(lambda tdf: tdf['score'].cumsum())
#state
#CA 1 3
# 3 13
#FL 4 1
# 5 1
#NY 0 4
# 2 9
# 6 16
#Name: score, dtype: int64
For example, the following program gets how many types of teams there are for each state
print(team_df)
# name state score team
#0 Alice NY 4 A
#1 Bob CA 3 A
#2 Charlie NY 5 A
#3 David CA 10 A
#4 Eve FL 1 B
#5 Fred FL 0 B
#6 George NY 7 B
team_df.groupby('state')['team'].agg(lambda x: len(x.unique()))
#state
#CA 1
#FL 1
#NY 2
#Name: team, dtype: int64
There are mistakes and better ways! I would be grateful if you could tell me
Recommended Posts