[Python] Example of efficient data processing with pandas

Introduction

pandas is a staple data processing library for data scientists, but when you deal with millions of rows you need to be careful to avoid performance pitfalls. In this article I would like to introduce some techniques I have picked up at work.

Since a pandas DataFrame is a column-oriented data structure, it is good at processing data column by column. It is recommended to build a DataFrame in a "one record per row" format, with each measurable value of a record (humidity, price, coordinates, etc.) stored in its own column.

However, with a huge amount of data, looping over the rows one by one degrades performance significantly. In this article I would like to introduce performance-conscious ways of processing data.
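Before diving in, here is a toy comparison of my own (not from the original article) to make the point concrete: summing a column row by row pays Python-level overhead on every record, while the equivalent column-wise operation runs as a single vectorized call.

import time
import numpy as np
import pandas as pd

df_big = pd.DataFrame({'x': np.random.rand(100_000)})

t0 = time.perf_counter()
total = 0.0
for _, row in df_big.iterrows():   # row-by-row: one Python iteration per record
    total += row['x']
print(f'row loop  : {time.perf_counter() - t0:.3f}s')

t0 = time.perf_counter()
total = df_big['x'].sum()          # column-wise: a single vectorized call
print(f'vectorized: {time.perf_counter() - t0:.3f}s')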

First, make a sample table.

import pandas as pd

data = {'Team': ['Tiger', 'Tiger', 'Rabbit', 'Rabbit', 'Cat',
                 'Cat', 'Cat', 'Cat', 'Tiger', 'Human', 'Human', 'Tiger'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
print(df)
'''
      Team  Rank  Year  Points
0    Tiger     1  2014     876
1    Tiger     2  2015     789
2   Rabbit     2  2014     863
3   Rabbit     3  2015     673
4      Cat     3  2014     741
5      Cat     4  2015     812
6      Cat     1  2016     756
7      Cat     1  2017     788
8    Tiger     2  2016     694
9    Human     4  2014     701
10   Human     1  2015     804
11   Tiger     2  2017     690
'''

print(df.columns)  # get the column labels
'''
Index(['Team', 'Rank', 'Year', 'Points'], dtype='object')
'''

print(df.index)  # get the row index
'''
RangeIndex(start=0, stop=12, step=1)
'''

A DataFrame has special label lists called indexes. In the example above, the column labels ['Team', 'Rank', 'Year', 'Points'] name each column, and the row index 0, 1, 2, ... on the left side identifies each row.
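For example, either kind of label can be used to pull data out (a small illustration of my own):

print(df.loc[2])    # the row whose index label is 2 (the first Rabbit record)
print(df['Team'])   # the column whose label is 'Team'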

groupby()

groupby is used when you want to gather rows that share the same value and apply a common operation to each group. It has become a staple of data processing, but groupby is quite confusing at first because it has so many uses. Let's look at an example. Passing a column name to the groupby() method returns a GroupBy object.


# One label
df_g = df.groupby(by=['Team'])
print(df_g.groups)
'''
{'Cat': Int64Index([4, 5, 6, 7], dtype='int64'),
 'Human': Int64Index([9, 10], dtype='int64'),
 'Rabbit': Int64Index([2, 3], dtype='int64'),
 'Tiger': Int64Index([0, 1, 8, 11], dtype='int64')}
'''

# Multiple labels
df_g = df.groupby(by=['Team', 'Year'])
print(df_g.groups)
'''
{('Cat', 2014): Int64Index([4], dtype='int64'),
 ('Cat', 2015): Int64Index([5], dtype='int64'),
 ('Cat', 2016): Int64Index([6], dtype='int64'),
 ('Cat', 2017): Int64Index([7], dtype='int64'),
 ('Human', 2014): Int64Index([9], dtype='int64'),
 ('Human', 2015): Int64Index([10], dtype='int64'),
 ('Rabbit', 2014): Int64Index([2], dtype='int64'),
 ('Rabbit', 2015): Int64Index([3], dtype='int64'),
 ('Tiger', 2014): Int64Index([0], dtype='int64'),
 ('Tiger', 2015): Int64Index([1], dtype='int64'),
 ('Tiger', 2016): Int64Index([8], dtype='int64'),
 ('Tiger', 2017): Int64Index([11], dtype='int64')}
'''

This way you can see which rows belong to which group, in the form {group key: [row index, row index, ...]}. For each key, a list of the row indexes of the rows sharing that key is stored.

By the way, to get the data of a single group, pass the group key to get_group().

df_g = df.groupby(by=['Team'])  # group by a single label again
df_oneGroup = df_g.get_group('Rabbit')
print(df_oneGroup)
'''
     Team  Rank  Year  Points
2  Rabbit     2  2014     863
3  Rabbit     3  2015     673
'''

df_g = df.groupby(by=['Team', 'Year'])  # with multiple labels, the key is a tuple
df_oneGroup = df_g.get_group(('Cat', 2014))
print(df_oneGroup)
'''
  Team  Rank  Year  Points
4  Cat     3  2014     741
'''

In practice, get_group() isn't used much outside of debugging, but a GroupBy object lets you run all sorts of operations per group. For example, call the mean() method to average each team's yearly Rank and Points. Many other methods are provided, such as sum() and median(). By the way, as_index=False resets the group labels to the default [0, 1, 2, ..., n].

df_mean = df.groupby(by=['Team', 'Year']).mean()
print(df_mean)
'''
             Rank  Points
Team   Year              
Cat    2014   3.0   741.0
       2015   4.0   812.0
       2016   1.0   756.0
       2017   1.0   788.0
Human  2014   4.0   701.0
       2015   1.0   804.0
Rabbit 2014   2.0   863.0
       2015   3.0   673.0
Tiger  2014   1.0   876.0
       2015   2.0   789.0
       2016   2.0   694.0
       2017   2.0   690.0
'''

df_mean = df.groupby(by=['Team', 'Year'], as_index=False).mean()
print(df_mean)
'''
      Team  Year  Rank  Points
0      Cat  2014   3.0   741.0
1      Cat  2015   4.0   812.0
2      Cat  2016   1.0   756.0
3      Cat  2017   1.0   788.0
4    Human  2014   4.0   701.0
5    Human  2015   1.0   804.0
6   Rabbit  2014   2.0   863.0
7   Rabbit  2015   3.0   673.0
8    Tiger  2014   1.0   876.0
9    Tiger  2015   2.0   789.0
10   Tiger  2016   2.0   694.0
11   Tiger  2017   2.0   690.0
'''

Return multiple aggregations per group using agg()

Like GroupBy.mean() above, you can get one value per group, but if you want a different computation for each column, use agg() (aggregation). The functions used for aggregation can be given as strings, NumPy functions, self-made functions, or lambda expressions. To use agg(), define a dict and pass it as follows.

'''
  location   col1  col2  col3  col4
0        a   True     2     1     4
1        a  False     6     2     6
2        b   True     7     6     3
3        b   True     3     3     4
4        b  False     8     4     6
5        c   True     9    57     8
6        d  False     1    74     9
'''
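The original snippet assumes this DataFrame already exists; here is a minimal sketch of my own that reproduces the printed sample, along with the NumPy import the dict below relies on.

import numpy as np

df = pd.DataFrame({'location': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'col1': [True, False, True, True, False, True, False],
                   'col2': [2, 6, 7, 3, 8, 9, 1],
                   'col3': [1, 2, 6, 3, 4, 57, 74],
                   'col4': [4, 6, 3, 4, 6, 8, 9]})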
func_dict = {'col1': lambda x: x.any(),         # is any value in the group True?
             'col2': np.mean,                   # average
             'col3': np.sum,                    # total
             'col4': lambda S: S.mode()[0]}     # mode (most frequent value)
df_agg = df.groupby('location').agg(func_dict).reset_index()
print(df_agg)
'''
  location   col1  col2  col3  col4
0        a   True   4.0     3     4
1        b   True   6.0    13     3
2        c   True   9.0    57     8
3        d  False   1.0    74     9
'''

Bin continuous values into categories at arbitrary boundary values using cut()

Let me introduce cut(), which buckets data at specified boundary values. For example, you can easily divide the hours of a day into five categories: midnight, morning, noon, afternoon, and night.

prods = pd.DataFrame({'hour':range(0, 24)})
b = [0, 6, 11, 12, 17, 24]
l = ['Midnight', 'a.m.','noon', 'afternoon', 'Night']
prods['period'] = pd.cut(prods['hour'], bins=b, labels=l, include_lowest=True)
print(prods)

'''
    hour     period
0      0   Midnight
1      1   Midnight
2      2   Midnight
3      3   Midnight
4      4   Midnight
5      5   Midnight
6      6   Midnight
7      7       a.m.
8      8       a.m.
9      9       a.m.
10    10       a.m.
11    11       a.m.
12    12       noon
13    13  afternoon
14    14  afternoon
15    15  afternoon
16    16  afternoon
17    17  afternoon
18    18      Night
19    19      Night
20    20      Night
21    21      Night
22    22      Night
23    23      Night
'''
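As a quick sanity check of my own, counting the rows per bin confirms where the boundaries fall (note that cut() returns a Categorical column, so the bins print in label order):

print(prods['period'].value_counts(sort=False))
# Midnight     7
# a.m.         5
# noon         1
# afternoon    5
# Night        6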

Group and process time series data using resample ()

This time, let's say we want to "compute a cumulative count per hour". cumsum() comes to mind here, so first create a num_ride_1h column and fill it with 1. Then group on the timestamp column with resample() at one-hour intervals; calling the cumsum() method on each group completes the job.

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3)  # returns a (features, labels) tuple
df_raw = pd.DataFrame(X, columns=['var_0', 'var_1', 'var_2'])
df_raw['class'] = y
df_raw['timestamp'] = random_datetimes_or_dates(start, end, n=100)  # self-made helper for random timestamps
df_raw['num_ride_1h'] = 1
print(df_raw)
'''
       var_0     var_1     var_2  class           timestamp  num_ride_1h
0   1.062513 -0.056001  0.761312      0 2020-09-21 00:01:57            1
1  -2.272391  1.307474 -1.276716      0 2020-09-21 00:14:49            1
2  -1.470793  1.245910 -0.708071      2 2020-09-21 00:17:46            1
3  -1.827838  1.365970 -0.933938      0 2020-09-21 00:25:13            1
4  -1.115794 -0.045542 -0.830372      0 2020-09-21 00:31:45            1
..       ...       ...       ...    ...                 ...          ...
95  0.247010  0.903812  0.448323      0 2020-09-21 23:29:25            1
96 -0.665399  1.861112  0.063642      1 2020-09-21 23:32:51            1
97  0.875282  0.028704  0.649306      2 2020-09-21 23:36:21            1
98  2.159065 -1.155290  1.238786      0 2020-09-21 23:37:23            1
99  1.739777 -1.775147  0.748372      2 2020-09-21 23:56:04            1
'''

df_raw['num_ride_1h'] = df_raw.resample('1H', on='timestamp')['num_ride_1h'].cumsum()

'''
       var_0     var_1     var_2  class           timestamp  num_ride_1h
0  -1.331170 -0.274703  0.809738      1 2020-10-11 00:10:54            1
1  -1.373495 -1.067991  1.738302      1 2020-10-11 00:14:24            2
2  -1.471448  0.216404  0.296618      0 2020-10-11 00:43:29            3
3  -2.282394 -1.528916  2.605747      1 2020-10-11 00:48:52            4
4   0.162427  0.524188 -0.663437      2 2020-10-11 00:51:23            5
..       ...       ...       ...    ...                 ...          ...
95  1.197076  0.274294 -0.759543      1 2020-10-11 22:23:50            3
96 -0.459688  0.646523 -0.573518      0 2020-10-11 23:00:20            1
97  0.212496  0.773962 -0.969428      2 2020-10-11 23:11:43            2
98  1.578519  0.496655 -1.156869      1 2020-10-11 23:14:31            3
99  1.318311 -0.324909 -0.114971      0 2020-10-11 23:46:46            4

'''

You can also use pd.Grouper(); both produce the same result.

df_raw['num_ride_1h'] = df_raw.groupby(pd.Grouper(key='timestamp', freq='1h'))['num_ride_1h'].cumsum()
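One nice property of pd.Grouper() over resample() is that it can be mixed with ordinary column keys. A small sketch of my own (num_ride_1h_by_class is a hypothetical column; a fresh column of ones is needed because num_ride_1h was overwritten above):

df_raw['ones'] = 1  # fresh counter column
df_raw['num_ride_1h_by_class'] = df_raw.groupby(
    ['class', pd.Grouper(key='timestamp', freq='1h')])['ones'].cumsum()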

In closing

There are a few more examples I would like to introduce, but I will stop here for now. Given the next opportunity, I would like to cover data processing centered on time series data.

**[Kuando Co., Ltd.]** Kuando Co., Ltd., where I work, is working on upgrading local industries. Please take a look: http://quando.jp/
