[PYTHON] Ready-to-use pandas acceleration techniques

When processing a few gigabytes of data with pandas, a job often takes tens of minutes to hours, or even days if written carelessly. Slow processing blocks everything downstream, so this article notes how you can speed things up with simple source-code changes.

Compute DataFrame sum and mean on numeric columns only

index   name    height  weight
0       Tanaka  160.1   60.1
1       Suzuki  172.4   75.0
2       Saitou  155.8   42.2
...
999998  Morita  167.9   94.07
999999  Satou   177.7   80.3

For example, given a DataFrame like the one above, computing the mean height and weight is often written as follows.
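To make the snippets below self-contained, here is a minimal sketch that builds a DataFrame of this shape (random values; the repeated name is a placeholder):


import numpy as np
import pandas as pd

# Hypothetical stand-in for the table above: one string column, two numeric columns
N = 1_000_000
df = pd.DataFrame(
    {
        'name': ['Tanaka'] * N,
        'height': np.random.normal(165, 8, N),
        'weight': np.random.normal(65, 12, N)
    }
)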

Bad pattern


import pandas as pd

# Aggregate over the whole DataFrame, string column included
sr_means = df.mean()
mean_height = sr_means['height']
mean_weight = sr_means['weight']

However, because the name column contains strings, df.mean() tries to handle that column as well, and the code above takes a very long time to run.

Changing it as shown below makes the processing orders of magnitude faster.

Good pattern


import pandas as pd

# Select only the numeric columns before aggregating
sr_means = df[['height', 'weight']].mean()
mean_height = sr_means['height']
mean_weight = sr_means['weight']

Postscript: after writing this, I got curious and investigated; it turns out there is an option that restricts the calculation to numeric values.

Postscript: Good pattern


import pandas as pd

# numeric_only=True makes the aggregation skip non-numeric columns
sr_means = df.mean(numeric_only=True)
mean_height = sr_means['height']
mean_weight = sr_means['weight']
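Note: in recent pandas versions (2.0 and later), calling df.mean() on a frame with string columns raises a TypeError instead of silently attempting them, so numeric_only=True or explicit column selection becomes mandatory rather than merely faster.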

Let's actually measure the time

Actual measurement


import pandas as pd
import numpy as np

N = 100000
df_test = pd.DataFrame(
    {
        'name': ['abc'] * N,
        'weight': np.random.normal(60, 5, N),
        'height': np.random.normal(160, 5, N)
    }
)

print("df_test.mean()")
%time df_test.mean()

print("df_test[['height', 'weight']].mean()")
%time df_test[['height', 'weight']].mean()

The results are below. Even allowing for one fewer column being aggregated, the latter is roughly three orders of magnitude faster (3.06 s versus 4 ms).

result


df_test.mean()
Wall time: 3.06 s

df_test[['height', 'weight']].mean()
Wall time: 4 ms
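For completeness, the numeric_only variant from the postscript can be timed the same way; one would expect it to land near the column-selection version, since both skip the string column (exact numbers vary by environment):


print("df_test.mean(numeric_only=True)")
%time df_test.mean(numeric_only=True)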

Let's use a higher-order function (map)

For example, suppose you use the round function to round the weight column to integers. If you are not used to modern Python idioms, you tend to reach for a for statement; use the higher-order function map instead. (What a higher-order function is falls outside the scope of this article.)

Bad pattern


# Apply the round function to each element with a for loop
for idx in range(len(df_test['weight'].index)):
    df_test['weight'][idx] = round(df_test['weight'][idx])

Rewritten below using map.

Good pattern


# Apply the round function to each element via map
df_test['weight'] = df_test['weight'].map(round)
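As an aside, pandas also has a vectorized Series.round, which avoids the per-element Python call entirely. A sketch; note that it returns floats, so cast only if an integer dtype is needed:


# Vectorized alternative to map(round); .astype(int) only if integers are wanted
df_test['weight'] = df_test['weight'].round().astype(int)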

Let's measure the time here as well. Since the for statement is far too slow, the number of rows is reduced.

Actual measurement


def func(sr):
    # Bad pattern wrapped in a function: element-wise rounding with a for loop
    for idx in range(len(sr.index)):
        sr[idx] = round(sr[idx])
    return sr


N = 1000
df_test = pd.DataFrame(
    {
        'name': ['abc'] * N,
        'weight': np.random.normal(60, 5, N),
        'height': np.random.normal(160, 5, N)
    }
)

print("For for")
%time df_test['weight'] = func(df_test['weight'])
print("For map")
%time df_test['weight'] = df_test['weight'].map(round)

The result is below. What map processes almost instantly becomes absurdly slow with a for statement. And this is only 1,000 rows; it's scary to imagine handling tens of millions.

result


for loop
Wall time: 22.1 s

map
Wall time: 0 ns
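"Wall time: 0 ns" just means the run finished below the timer's resolution. For a measurable figure, IPython's %timeit magic averages many runs:


# %timeit repeats the statement many times and reports a mean and deviation
%timeit df_test['weight'].map(round)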

Just by improving the two patterns above, data processing that used to take a day now finishes in a few minutes. I want to keep improving steadily.
