[PYTHON] Aggregate event data into one-user, one-line format using pandas

Introduction

This time, the word "event data" is used to mean data such as ** (time, event name (, user) (, quantity)) **. When trying to create a user prediction model, you may want to format such event data so that it is "one user, one line".

Make a note of how pandas handles this area.

Method

First, consider the following event data as an example.

import pandas as pd

data = [
    ['user1', '2015-7-1 10:00:00', 'login'],
    ['user1', '2015-7-1 10:00:10', 'item'],
    ['user1', '2015-7-1 10:00:30', 'item'],
    ['user2', '2015-7-1 10:01:00', 'top'],
    ['user1', '2015-7-1 10:01:30', 'cart'],
    ['user2', '2015-7-1 10:01:50', 'login'],
    ['user2', '2015-7-1 10:02:30', 'logout'],

    ['user1', '2015-7-2 13:00:00', 'login'],
    ['user1', '2015-7-2 13:01:00', 'logout'],
    ['user3', '2015-7-2 13:01:00', 'top'],
    ['user2', '2015-7-2 13:01:50', 'login'],
    ['user2', '2015-7-2 13:02:30', 'item'],
    ['user2', '2015-7-2 13:03:30', 'cart'],
    ['user2', '2015-7-2 13:03:30', 'history'],
]
df = pd.DataFrame(data)
df.columns = ['user_id', 'time', 'event']
df['time'] = pd.to_datetime(df['time'])
df['dummy'] = 1
df
user_id time event dummy
0 user1 2015-07-01 10:00:00 login 1
1 user1 2015-07-01 10:00:10 item 1
2 user1 2015-07-01 10:00:30 item 1
3 user2 2015-07-01 10:01:00 top 1
4 user1 2015-07-01 10:01:30 cart 1
5 user2 2015-07-01 10:01:50 login 1
6 user2 2015-07-01 10:02:30 logout 1
7 user1 2015-07-02 13:00:00 login 1
8 user1 2015-07-02 13:01:00 logout 1
9 user3 2015-07-02 13:01:00 top 1
10 user2 2015-07-02 13:01:50 login 1
11 user2 2015-07-02 13:02:30 item 1
12 user2 2015-07-02 13:03:30 cart 1
13 user2 2015-07-02 13:03:30 history 1

Suppose you want to format this data as "1 user, 1 row" and "the total number of occurrences for each daily event" as a column.

pivot_table and Grouper

You can use the DataFrame class pivot_table () to do something similar to an Excel PivotTable. I'd like to aggregate by day, but if I use Grouper (), I can group the datetime type columns nicely. How convenient ...

res2 = df.pivot_table(index=['user_id', pd.Grouper(freq='d', key='time')], columns='event', values='dummy', aggfunc=len)
res2
event cart history item login logout top
user_id time
user1 2015-07-01 1 NaN 2 1 NaN NaN
2015-07-02 NaN NaN NaN 1 1 NaN
user2 2015-07-01 NaN NaN NaN 1 1 1
2015-07-02 1 1 1 1 NaN NaN
user3 2015-07-02 NaN NaN NaN NaN NaN 1

Since Grouper's freq ='d', it will be aggregated on a daily basis. If you set freq ='M', it will be monthly.

Bring the date to the column

It is good to use unstack ().

res3 = res2.unstack()
res3
event cart history item login logout top
time 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02
user_id
user1 1 NaN NaN NaN 2 NaN 1 1 NaN 1 NaN NaN
user2 NaN 1 NaN 1 NaN 1 1 1 1 NaN 1 NaN
user3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1

I want to bring the date up

Use swaplevel (Index1, which you want to replace). Also, I want to operate on the column (axis number 1), so add ʻaxis = 1`.

res4 = res3.swaplevel(0, 1, axis=1)
res4
time 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02 2015-07-01 2015-07-02
event cart cart history history item item login login logout logout top top
user_id
user1 1 NaN NaN NaN 2 NaN 1 1 NaN 1 NaN NaN
user2 NaN 1 NaN 1 NaN 1 1 1 1 NaN 1 NaN
user3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1

I want the columns to be arranged in chronological order

Use sortlevel ().

res4.sortlevel(0, axis=1)
time 2015-07-01 2015-07-02
event cart history item login logout top cart history item login logout top
user_id
user1 1 NaN 2 1 NaN NaN NaN NaN NaN 1 1 NaN
user2 NaN NaN NaN 1 1 1 1 1 1 1 NaN NaN
user3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1

Now you have the desired shape.

at the end

There will be a column of "number of event types x number of dates", so if there are too many, it cannot be used. I think that it can be used when the number of event types is not large or when it is aggregated on a monthly basis.

Anyway, pandas is convenient.

Recommended Posts

Aggregate event data into one-user, one-line format using pandas
Data analysis using python pandas
Data visualization method using matplotlib (+ pandas) (5)
Parsing CSV format data using SQL
Data visualization method using matplotlib (+ pandas) (3)
Data visualization method using matplotlib (+ pandas) (4)
Try to aggregate doujin music data with pandas
[Pandas] Basics of processing date data using dt
100 language processing knock-20 (using pandas): reading JSON data
Convert json format data to txt (using yolo)
Make holiday data into a data frame with pandas