pandas is one of the most popular data processing libraries among data scientists, but when you handle millions of rows you need to be careful to avoid performance pitfalls. In this article I would like to introduce some techniques I have learned in my work.
DataFrame
Since pandas is a column-oriented data structure, it is good at processing data column by column. It is recommended to create a DataFrame in the "one record per row" format, with each measurable value (humidity, price, coordinates, etc.) in its own column.
However, with a huge amount of data, looping over the rows one by one degrades performance significantly. Below I would like to introduce smarter, performance-conscious ways to work with the data.
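As a rough illustration of the difference, here is a minimal sketch with a made-up price column (not data from this article): the loop iterates in Python, while the vectorized version runs in optimized C.

import pandas as pd
import numpy as np

df_demo = pd.DataFrame({'price': np.random.rand(1_000_000)})

# Slow: Python-level loop over every row
# tax = [p * 0.1 for p in df_demo['price']]

# Fast: one vectorized operation on the whole column
df_demo['tax'] = df_demo['price'] * 0.1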
First, make a sample table.
import pandas as pd

data = {'Team': ['Tiger', 'Tiger', 'Rabbit', 'Rabbit', 'Cat',
                 'Cat', 'Cat', 'Cat', 'Tiger', 'Human', 'Human', 'Tiger'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
print(df)
'''
      Team  Rank  Year  Points
0    Tiger     1  2014     876
1    Tiger     2  2015     789
2   Rabbit     2  2014     863
3   Rabbit     3  2015     673
4      Cat     3  2014     741
5      Cat     4  2015     812
6      Cat     1  2016     756
7      Cat     1  2017     788
8    Tiger     2  2016     694
9    Human     4  2014     701
10   Human     1  2015     804
11   Tiger     2  2017     690
'''
print(df.columns)  # get the column labels
'''
Index(['Team', 'Rank', 'Year', 'Points'], dtype='object')
'''
print(df.index)  # get the row index
'''
RangeIndex(start=0, stop=12, step=1)
'''
A DataFrame has special lists called indexes. In the example above, the column labels ['Team', 'Rank', 'Year', 'Points'] identify the columns, and the row index 0, 1, 2, ... on the left side identifies each row.
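These two indexes are what label-based selection with .loc uses. A quick illustration based on the frame above:

print(df.loc[0, 'Team'])  # row label 0, column label 'Team' -> 'Tiger'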
groupby()
groupby() is used when you want to collect rows that share the same value and apply a common operation to each group. Although it is a staple of data processing, groupby can be quite confusing at first because it has so many uses.
Let's look at an example. If you pass a column name to the groupby() method, it returns a GroupBy object.
# a single label
df_g = df.groupby(by=['Team'])
print(df_g.groups)
'''
{'Cat': Int64Index([4, 5, 6, 7], dtype='int64'),
 'Human': Int64Index([9, 10], dtype='int64'),
 'Rabbit': Int64Index([2, 3], dtype='int64'),
 'Tiger': Int64Index([0, 1, 8, 11], dtype='int64')}
'''
# multiple labels
df_g = df.groupby(by=['Team', 'Year'])
print(df_g.groups)
'''
{('Cat', 2014): Int64Index([4], dtype='int64'),
 ('Cat', 2015): Int64Index([5], dtype='int64'),
 ('Cat', 2016): Int64Index([6], dtype='int64'),
 ('Cat', 2017): Int64Index([7], dtype='int64'),
 ('Human', 2014): Int64Index([9], dtype='int64'),
 ('Human', 2015): Int64Index([10], dtype='int64'),
 ('Rabbit', 2014): Int64Index([2], dtype='int64'),
 ('Rabbit', 2015): Int64Index([3], dtype='int64'),
 ('Tiger', 2014): Int64Index([0], dtype='int64'),
 ('Tiger', 2015): Int64Index([1], dtype='int64'),
 ('Tiger', 2016): Int64Index([8], dtype='int64'),
 ('Tiger', 2017): Int64Index([11], dtype='int64')}
'''
This way you can see which rows belong to which group, in the form {group key: [row index, row index, ...]}. For each key, the row indexes of the records with that value are stored.
By the way, to retrieve the data of one group, pass its group key to get_group().
df_g1 = df.groupby(by=['Team'])  # the single-label grouping; df_g is grouped by two keys
df_oneGroup = df_g1.get_group('Rabbit')
print(df_oneGroup)
'''
     Team  Rank  Year  Points
2  Rabbit     2  2014     863
3  Rabbit     3  2015     673
'''
df_oneGroup = df_g.get_group(('Cat', 2014))
print(df_oneGroup)
'''
  Team  Rank  Year  Points
4  Cat     3  2014     741
'''
Actually, get_group() is rarely used for anything other than debugging, but the GroupBy object lets you run all sorts of operations per group. For example, call the mean() method to average the yearly Rank and Points for each team. Many other methods are provided, such as sum() and median(). By the way, as_index=False resets the group labels to the default [0, 1, 2, ..., n].
df_mean = df.groupby(by=['Team', 'Year']).mean()
print(df_mean)
'''
              Rank  Points
Team   Year              
Cat    2014     3     741
       2015     4     812
       2016     1     756
       2017     1     788
Human  2014     4     701
       2015     1     804
Rabbit 2014     2     863
       2015     3     673
Tiger  2014     1     876
       2015     2     789
       2016     2     694
       2017     2     690
'''
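With the default as_index=True, the group keys become a MultiIndex, so one group's row can be pulled out with .loc. A quick sketch based on the result above:

print(df_mean.loc[('Cat', 2014)])  # Rank 3, Points 741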
df_mean = df.groupby(by=['Team', 'Year'], as_index=False).mean()
print(df_mean)
'''
      Team  Year  Rank  Points
0      Cat  2014     3     741
1      Cat  2015     4     812
2      Cat  2016     1     756
3      Cat  2017     1     788
4    Human  2014     4     701
5    Human  2015     1     804
6   Rabbit  2014     2     863
7   Rabbit  2015     3     673
8    Tiger  2014     1     876
9    Tiger  2015     2     789
10   Tiger  2016     2     694
11   Tiger  2017     2     690
'''
Like GroupBy.mean() above, you can compute one statistic for every column, but if you want to apply a different aggregation to each column, use agg() (aggregation). The functions used for aggregation can be given as strings, NumPy functions, user-defined functions, or lambda expressions. To use agg() this way, define the column-to-function mapping in a dict and pass it as follows.
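The example below uses a separate toy frame. One way to build it, with the values read off the printed table (reusing the name df as the original code does):

import numpy as np

df = pd.DataFrame({'location': list('aabbbcd'),
                   'col1': [True, False, True, True, False, True, False],
                   'col2': [2, 6, 7, 3, 8, 9, 1],
                   'col3': [1, 2, 6, 3, 4, 57, 74],
                   'col4': [4, 6, 3, 4, 6, 8, 9]})
print(df)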
'''
  location   col1  col2  col3  col4
0        a   True     2     1     4
1        a  False     6     2     6
2        b   True     7     6     3
3        b   True     3     3     4
4        b  False     8     4     6
5        c   True     9    57     8
6        d  False     1    74     9
'''
func_dict = {'col1': lambda x: x.any(),       # True if any value in the group is True
             'col2': np.mean,                 # mean
             'col3': np.sum,                  # sum
             'col4': lambda S: S.mode()[0]}   # mode (first value if there are ties)
df_agg = df.groupby('location').agg(func_dict).reset_index()
print(df_agg)
'''
  location   col1  col2  col3  col4
0        a   True     4     3     4
1        b   True     6    13     3
2        c   True     9    57     8
3        d  False     1    74     9
'''
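On pandas 0.25 and later, named aggregation is an alternative to the dict style. A sketch of the same idea (the output column names here are my own choice):

df_named = df.groupby('location').agg(
    col2_mean=('col2', 'mean'),
    col3_sum=('col3', 'sum')).reset_index()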
cut()
Next, let me introduce cut(), which buckets data at the boundary values you specify. For example, you can easily split a whole day into five periods: midnight, morning, noon, afternoon, and night.
prods = pd.DataFrame({'hour': range(0, 24)})
bins = [0, 6, 11, 12, 17, 24]
labels = ['midnight', 'morning', 'noon', 'afternoon', 'night']
prods['period'] = pd.cut(prods['hour'], bins=bins, labels=labels, include_lowest=True)
print(prods)
'''
    hour     period
0      0   midnight
1      1   midnight
2      2   midnight
3      3   midnight
4      4   midnight
5      5   midnight
6      6   midnight
7      7    morning
8      8    morning
9      9    morning
10    10    morning
11    11    morning
12    12       noon
13    13  afternoon
14    14  afternoon
15    15  afternoon
16    16  afternoon
17    17  afternoon
18    18      night
19    19      night
20    20      night
21    21      night
22    22      night
23    23      night
'''
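To check how many hours landed in each bucket, value_counts() works on the resulting categorical column (a small follow-up, not in the original):

print(prods['period'].value_counts(sort=False))  # counts in category order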
resample()
Next, let's say we want to "calculate the cumulative number of rides per hour". We will use cumsum() for this, so first create a num_ride_1h column filled with 1. Then, after grouping the timestamp column into one-hour bins with resample(), calling cumsum() on each group completes the job.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples, n_features)
df_raw = pd.DataFrame(X, columns=[f'var_{i}' for i in range(n_features)])
df_raw['class'] = y
# random_datetimes_or_dates() is a user-defined helper that returns n random timestamps between start and end
df_raw['timestamp'] = random_datetimes_or_dates(start, end, n=n_samples)
df_raw['num_ride_1h'] = 1
print(df_raw)
'''
       var_0     var_1     var_2  class           timestamp  num_ride_1h
0   1.062513 -0.056001  0.761312      0 2020-09-21 00:01:57            1
1  -2.272391  1.307474 -1.276716      0 2020-09-21 00:14:49            1
2  -1.470793  1.245910 -0.708071      2 2020-09-21 00:17:46            1
3  -1.827838  1.365970 -0.933938      0 2020-09-21 00:25:13            1
4  -1.115794 -0.045542 -0.830372      0 2020-09-21 00:31:45            1
..       ...       ...       ...    ...                 ...          ...
95  0.247010  0.903812  0.448323      0 2020-09-21 23:29:25            1
96 -0.665399  1.861112  0.063642      1 2020-09-21 23:32:51            1
97  0.875282  0.028704  0.649306      2 2020-09-21 23:36:21            1
98  2.159065 -1.155290  1.238786      0 2020-09-21 23:37:23            1
99  1.739777 -1.775147  0.748372      2 2020-09-21 23:56:04            1
'''
df_raw['num_ride_1h'] = df_raw.resample('1H', on='timestamp')['num_ride_1h'].cumsum()
'''
       var_0     var_1     var_2  class           timestamp  num_ride_1h
0  -1.331170 -0.274703  0.809738      1 2020-10-11 00:10:54            1
1  -1.373495 -1.067991  1.738302      1 2020-10-11 00:14:24            2
2  -1.471448  0.216404  0.296618      0 2020-10-11 00:43:29            3
3  -2.282394 -1.528916  2.605747      1 2020-10-11 00:48:52            4
4   0.162427  0.524188 -0.663437      2 2020-10-11 00:51:23            5
..       ...       ...       ...    ...                 ...          ...
95  1.197076  0.274294 -0.759543      1 2020-10-11 22:23:50            3
96 -0.459688  0.646523 -0.573518      0 2020-10-11 23:00:20            1
97  0.212496  0.773962 -0.969428      2 2020-10-11 23:11:43            2
98  1.578519  0.496655 -1.156869      1 2020-10-11 23:14:31            3
99  1.318311 -0.324909 -0.114971      0 2020-10-11 23:46:46            4
'''
You can also use pd.Grouper(); both produce the same result.
df_raw['num_ride_1h'] = df_raw.groupby(pd.Grouper(key='timestamp', freq='1h'))['num_ride_1h'].cumsum()
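A third equivalent (my own sketch, not from the original article) is to group on the timestamps truncated to the hour. Note that the flag column must be reset to 1 first, since it was overwritten above:

df_raw['num_ride_1h'] = 1  # reset the per-ride flag before recomputing
df_raw['num_ride_1h'] = df_raw.groupby(df_raw['timestamp'].dt.floor('1H'))['num_ride_1h'].cumsum()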
There are several other examples I would like to introduce, but I will end here for now. If I have another opportunity, I would like to cover data processing centered on time-series data.
**[Kuando Co., Ltd.]** Kuando Co., Ltd., where I work, is working on upgrading local industries. Please take a look: http://quando.jp/