Just add the swifter accessor before Pandas' apply method:
import pandas as pd
import numpy as np
import swifter
# Create a sample DataFrame
df = pd.DataFrame({'col': np.random.normal(size=10000000)})
# Add the swifter accessor before the apply method.
%time df['col2'] = df['col'].swifter.apply(lambda x: x**2)
# Wall time: 50 ms
#For comparison (normal pandas apply method)
%time df['col2'] = df['col'].apply(lambda x: x**2)
# Wall time: 3.48 s
For pip
$ pip install -U pandas # upgrade pandas
$ pip install swifter
For conda
$ conda update pandas # upgrade pandas
$ conda install -c conda-forge swifter
Pandas' apply method has O(N) complexity. That is no problem for a DataFrame with around 10,000 rows, but processing large DataFrames can be quite painful. Fortunately, there are several ways to speed up Pandas.
For example, you can vectorize, and when vectorization is not possible, fall back to parallel processing with Dask. However, parallel processing on a DataFrame with few rows can actually slow things down. For people like me who find it cumbersome to choose the best acceleration method case by case, **swifter** is the best choice.
According to the official documentation [^3], swifter works as follows: if the applied function can be vectorized, it vectorizes it; otherwise, it estimates whether Dask's parallel processing or Pandas' plain apply will be faster and chooses accordingly.
Having the best method selected automatically is very convenient. As I'll show later, swifter is faster than Pandas' apply in many cases, so why not always use swifter?
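This selection strategy can be sketched in a few lines. The following is a toy illustration only, not swifter's actual code: the sampling scheme and the 1-second threshold are made up for the example.

```python
import time
import numpy as np
import pandas as pd

def choose_apply_strategy(df, func, sample_size=1000):
    """Rough sketch of a swifter-like strategy (illustrative, not swifter's code)."""
    # 1. Try vectorization: call func on the whole frame at once.
    try:
        result = func(df)
        if hasattr(result, '__len__') and len(result) == len(df):
            return 'vectorized'
    except Exception:
        pass
    # 2. Fall back: time a row-wise apply on a small sample and extrapolate.
    sample = df.head(sample_size)
    start = time.perf_counter()
    sample.apply(func, axis=1)
    est_total = (time.perf_counter() - start) * len(df) / len(sample)
    # Pay the parallel overhead only if the estimated serial time is large
    # enough to justify it (the 1-second threshold here is arbitrary).
    return 'parallel' if est_total > 1.0 else 'pandas-apply'

df = pd.DataFrame({'col1': np.random.normal(size=1000),
                   'col2': np.random.normal(size=1000)})
print(choose_apply_strategy(df, lambda d: d['col1'] * d['col2']))
# prints 'vectorized'
```

A function containing a per-row `if` raises an exception when called on whole columns, so it falls through to the timing-based choice.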
Below, I verify how fast swifter is compared to Dask, plain Pandas, and so on. Since swifter behaves differently depending on whether the function can be vectorized, we test each case separately. The test machine is an Intel Core i5-8350U @ 1.70GHz with 16GB of memory.
Since swifter vectorizes whenever possible, its elapsed time should be about equal to that of plain vectorization. Let's check this.
If vectorizable
import pandas as pd
import numpy as np
import dask.dataframe as dd
import swifter
import multiprocessing
import gc
pandas_time_list = []
dask_time_list = []
vector_time_list = []
swifter_time_list = []
# Vectorizable function
def multiple_func(df):
    return df['col1'] * df['col2']

def apply_func_to_df(df):
    return df.apply(multiple_func, axis=1)

for num in np.logspace(2, 7, num=7-2+1, base=10, dtype='int'):
    df = pd.DataFrame()
    df['col1'] = np.random.normal(size=num)
    df['col2'] = np.random.normal(size=num)
    ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
    pandas_time = %timeit -n2 -r1 -o -q df.apply(multiple_func, axis=1)
    dask_time = %timeit -n2 -r1 -o -q ddf.map_partitions(apply_func_to_df).compute(scheduler='processes')
    vector_time = %timeit -n2 -r1 -o -q df['col1']*df['col2']
    swifter_time = %timeit -n2 -r1 -o -q df.swifter.apply(multiple_func, axis=1)
    pandas_time_list.append(pandas_time.average)
    dask_time_list.append(dask_time.average)
    vector_time_list.append(vector_time.average)
    swifter_time_list.append(swifter_time.average)
    del df, ddf
    gc.collect()
The horizontal axis of the figure is the number of DataFrame rows, and the vertical axis is the elapsed time. Note that it is a log-log plot.
The elapsed time of swifter is close to that of vectorization, so you can see that it is indeed vectorized.
For DataFrames with fewer than about 100,000 rows, single-core Pandas is faster than Dask's parallel processing. Since Dask's elapsed time is roughly constant below 100,000 rows, we can infer that this is overhead from parallel processing, such as copying data between processes (the function's compute time is smaller than the cost of sharing the data).
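That fixed cost of parallelism is easy to reproduce with the standard library alone. This toy illustration, unrelated to Dask itself, squares 10,000 numbers: the work per element is so small that a process pool's data-shipping overhead dominates, and the serial loop wins.

```python
import multiprocessing as mp
import time

def square(x):
    return x * x

if __name__ == '__main__':
    data = list(range(10_000))

    # Serial: a plain list comprehension.
    start = time.perf_counter()
    serial = [square(x) for x in data]
    serial_time = time.perf_counter() - start

    # Parallel: the same work through a process pool.
    start = time.perf_counter()
    with mp.Pool(processes=mp.cpu_count()) as pool:
        parallel = pool.map(square, data)
    parallel_time = time.perf_counter() - start

    assert serial == parallel  # same result either way
    print(f'serial: {serial_time:.4f}s  pool: {parallel_time:.4f}s')
```

On a task this cheap, the pool is typically orders of magnitude slower than the loop, for exactly the reason inferred above.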
Next, let's look at the case where vectorization is not possible. In that case, swifter should choose whichever of parallel processing and single-core processing is faster.
When vectorization is not possible
pandas_time_list_non_vectorize = []
dask_time_list_non_vectorize = []
swifter_time_list_non_vectorize = []
# Function that cannot be vectorized
def compare_func(df):
    if df['col1'] > df['col2']:
        return 1
    else:
        return -1

def apply_func_to_df(df):
    return df.apply(compare_func, axis=1)

for num in np.logspace(2, 7, num=7-2+1, base=10, dtype='int'):
    df = pd.DataFrame()
    df['col1'] = np.random.normal(size=num)
    df['col2'] = np.random.normal(size=num)
    ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
    pandas_time = %timeit -n2 -r1 -o -q df.apply(compare_func, axis=1)
    dask_time = %timeit -n2 -r1 -o -q ddf.map_partitions(apply_func_to_df).compute(scheduler='processes')
    swifter_time = %timeit -n2 -r1 -o -q df.swifter.apply(compare_func, axis=1)
    pandas_time_list_non_vectorize.append(pandas_time.average)
    dask_time_list_non_vectorize.append(dask_time.average)
    swifter_time_list_non_vectorize.append(swifter_time.average)
    del df, ddf
    gc.collect()
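As an aside, compare_func resists vectorization only in its row-wise if/else form; NumPy's np.where evaluates the same branch on whole columns at once, which is why element-wise conditionals are usually rewritten this way when speed matters. A quick check (the DataFrame here is a fresh example, not the benchmark data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.normal(size=1000),
                   'col2': np.random.normal(size=1000)})

# Row-wise version, as used in the benchmark above.
apply_result = df.apply(lambda row: 1 if row['col1'] > row['col2'] else -1, axis=1)

# Vectorized equivalent: evaluate the condition on whole columns at once.
where_result = np.where(df['col1'] > df['col2'], 1, -1)

assert (apply_result.to_numpy() == where_result).all()
```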
When parallel processing offers no gain, swifter processes on a single core; when parallel processing beats single-core, you can see that parallel processing is selected.
swifter is an excellent module that automatically selects the optimal acceleration method for the situation. To avoid wasting valuable time, use swifter whenever you use Pandas' apply method.
A vectorized function is one that operates on all elements at once, without an explicit for loop. An example should make this clearer.
If not vectorized
array_sample = np.random.normal(size=1000000)
def non_vectorize(array_sample):
    result = []
    for i in array_sample:
        result.append(i * i)
    return np.array(result)
%time non_vectorize_result = non_vectorize(array_sample)
# Wall time: 350 ms
When vectorized
def vectorize(array_sample):
    return array_sample * array_sample
%time vectorize_result = vectorize(array_sample)
# Wall time: 4.09 ms
Vectorization makes it about 80 times faster. Check that the two results match.
Check if they match
np.allclose(non_vectorize_result, vectorize_result)
# True