[PYTHON] How to speed up the Pandas apply method with just one sentence (with benchmark verification)

Conclusion

Just insert swifter before the Pandas apply method: df['col'].swifter.apply(...) instead of df['col'].apply(...).

Concrete example

import pandas as pd
import numpy as np
import swifter

# Create a DataFrame with 10 million rows
df = pd.DataFrame({'col': np.random.normal(size=10000000)})

# Add the swifter accessor before the apply method.
%time df['col2'] = df['col'].swifter.apply(lambda x: x**2)
# Wall time: 50 ms

# For comparison: the plain Pandas apply method
%time df['col2'] = df['col'].apply(lambda x: x**2)
# Wall time: 3.48 s

Installation method

For pip


$ pip install -U pandas # upgrade pandas
$ pip install swifter

For conda


$ conda update pandas # upgrade pandas
$ conda install -c conda-forge swifter
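
To quickly confirm that the installation worked, importing the package is enough (a minimal check; printing pandas' version just confirms the upgrade):


$ python -c "import swifter, pandas; print(pandas.__version__)"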

What swifter does

Pandas apply is slow

The complexity of Pandas' apply method is O(N). This hardly matters for a DataFrame with around 10,000 rows, but processing a large DataFrame can be quite painful. Fortunately, there are several ways to speed Pandas up.

Pandas acceleration methods

  1. Vectorization (see "What is vectorization?" in the bonus section below)
  2. Using Cython or Numba [^1] (a minimal Numba sketch follows below)
  3. Parallel processing with Dask [^2]

For example, when vectorization is not possible, you can fall back to parallel processing with Dask. However, parallelizing a DataFrame that does not have many rows may actually slow things down. For people like me who find it cumbersome to choose the best acceleration method case by case, **swifter** is the best choice.
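
As a taste of method 2 above, here is a minimal Numba sketch (my own illustration, assuming numba is installed; it is not part of swifter, and sum_of_squares is a made-up name):

Numba example


import numpy as np
import numba

# numba.njit compiles this loop to machine code on the first call
@numba.njit
def sum_of_squares(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

arr = np.random.normal(size=1000000)
sum_of_squares(arr)         # first call includes compilation time
%time sum_of_squares(arr)   # later calls run at compiled speed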

According to the official documentation [^3], swifter does the following:

  1. If the applied function can be vectorized, vectorize it.
  2. If vectorization is not possible, automatically select whichever is faster: Dask parallel processing or the plain Pandas apply.

Automatically selecting the best method is very convenient. As shown in the verification below, swifter is faster than Pandas apply in many cases, so why not always use swifter?
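
To make those two rules concrete, the following is a conceptual sketch of the selection logic (my own simplified illustration, not swifter's actual implementation; swifter_like_apply and _apply_rows are made-up names):

Conceptual sketch of swifter's selection logic


import multiprocessing
import timeit

import dask.dataframe as dd

# Wrapper so Dask's map_partitions can run apply on each partition
def _apply_rows(partition, func):
    return partition.apply(func, axis=1)

def swifter_like_apply(df, func):
    # 1. Try the vectorized path: call func on the whole DataFrame at once.
    #    (Real swifter also validates the output; omitted here.)
    try:
        return func(df)
    except Exception:
        pass  # e.g. row-wise branching raises "truth value of a Series is ambiguous"

    # 2. Time both fallbacks on a small sample and pick the faster one.
    sample = df.head(1000)
    npartitions = multiprocessing.cpu_count()
    pandas_t = timeit.timeit(lambda: sample.apply(func, axis=1), number=1)
    dask_t = timeit.timeit(
        lambda: dd.from_pandas(sample, npartitions=npartitions)
                  .map_partitions(_apply_rows, func)
                  .compute(scheduler='processes'),
        number=1,
    )
    if dask_t < pandas_t:
        ddf = dd.from_pandas(df, npartitions=npartitions)
        return ddf.map_partitions(_apply_rows, func).compute(scheduler='processes')
    return df.apply(func, axis=1)

With the functions from the verification below, swifter_like_apply(df, multiple_func) takes the vectorized branch, while swifter_like_apply(df, compare_func) falls through to the timing comparison.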

Verification

Below, I verify how fast swifter is compared to Dask, plain Pandas, and so on. Since swifter behaves differently depending on whether or not the applied function can be vectorized, each case is verified separately. The test machine has an Intel Core i5-8350U @ 1.70GHz and 16GB of memory.

If vectorizable

Since swifter vectorizes whenever vectorization is possible, its elapsed time should be about equal to that of plain vectorization. Let's check this.

If vectorizable


import pandas as pd
import numpy as np
import dask.dataframe as dd
import swifter
import multiprocessing
import gc

pandas_time_list = []
dask_time_list = []
vector_time_list = []
swifter_time_list = []

# A function that can be vectorized
def multiple_func(df):
    return df['col1']*df['col2']

# Wrapper so Dask's map_partitions can run apply on each partition
def apply_func_to_df(df):
    return df.apply(multiple_func, axis=1)

# DataFrame sizes from 10^2 to 10^7 rows on a log scale
for num in np.logspace(2, 7, num=7-2+1, base=10, dtype='int'):
    df = pd.DataFrame()
    df['col1'] = np.random.normal(size=num)
    df['col2'] = np.random.normal(size=num)
    ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())

    pandas_time = %timeit -n2 -r1 -o -q df.apply(multiple_func, axis=1)
    dask_time = %timeit -n2 -r1 -o -q ddf.map_partitions(apply_func_to_df).compute(scheduler='processes')
    vector_time = %timeit -n2 -r1 -o -q df['col1']*df['col2']
    swifter_time = %timeit -n2 -r1 -o -q df.swifter.apply(multiple_func, axis=1)
    
    pandas_time_list.append(pandas_time.average)
    dask_time_list.append(dask_time.average)
    vector_time_list.append(vector_time.average)
    swifter_time_list.append(swifter_time.average)

    del df, ddf
    gc.collect()

[Figure vect.png: elapsed time vs. number of rows for Pandas, Dask, vectorization, and swifter (vectorizable case)]

The horizontal axis is the number of rows in the DataFrame, and the vertical axis is the elapsed time; note that both axes are logarithmic.

The elapsed time of swifter is close to that of vectorization, so you can see that swifter is indeed vectorizing here.

For DataFrames with fewer than about 100,000 rows, single-core Pandas is faster than Dask's parallel processing. Since Dask's elapsed time is roughly constant below 100,000 rows, we can infer that this floor comes from parallelization overhead such as copying data between processes (the function's computation time is smaller than the time spent copying data for sharing).

When vectorization is not possible

Next, let's look at the case where vectorization is not possible. Here, swifter should choose whichever is faster: Dask parallel processing or the plain single-core apply.

When vectorization is not possible


pandas_time_list_non_vectorize = []
dask_time_list_non_vectorize = []
swifter_time_list_non_vectorize = []

# A function that cannot be vectorized (branches on each row)
def compare_func(df):
    if df['col1'] > df['col2']:
        return 1
    else:
        return -1

# Wrapper so Dask's map_partitions can run apply on each partition
def apply_func_to_df(df):
    return df.apply(compare_func, axis=1)

for num in np.logspace(2, 7, num=7-2+1, base=10, dtype='int'):
    df = pd.DataFrame()
    df['col1'] = np.random.normal(size=num)
    df['col2'] = np.random.normal(size=num)
    ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())

    pandas_time = %timeit -n2 -r1 -o -q df.apply(compare_func, axis=1)
    dask_time = %timeit -n2 -r1 -o -q ddf.map_partitions(apply_func_to_df).compute(scheduler='processes')
    swifter_time = %timeit -n2 -r1 -o -q df.swifter.apply(compare_func, axis=1)
    
    pandas_time_list_non_vectorize.append(pandas_time.average)
    dask_time_list_non_vectorize.append(dask_time.average)
    swifter_time_list_non_vectorize.append(swifter_time.average)

    del df, ddf
    gc.collect()

[Figure non_vect.png: elapsed time vs. number of rows for Pandas, Dask, and swifter (non-vectorizable case)]

You can see that swifter runs on a single core where parallel processing gives no advantage, and switches to parallel processing where it beats single-core Pandas.

Summary

swifter is an excellent module that automatically selects the optimal acceleration method for the situation. To avoid wasting valuable time, use swifter whenever you reach for Pandas' apply method.

Bonus

What is vectorization?

A vectorized function is one that operates on all elements at once, without an explicit for loop. An example makes this easier to understand.

If not vectorized


array_sample = np.random.normal(size=1000000)

def non_vectorize(array_sample):
    result = []
    for i in array_sample:
        result.append(i*i)
    return np.array(result)

%time non_vectorize_result = non_vectorize(array_sample)
# Wall time: 350 ms

When vectorized


def vectorize(array_sample):
    return array_sample*array_sample

%time vectorize_result = vectorize(array_sample)
# Wall time: 4.09 ms

Vectorization makes it about 80 times faster. Let's check that the two results match.

Check if they match


np.allclose(non_vectorize_result, vectorize_result)
# True
