[PYTHON] pandas idxmax is slow

Background

When processing a large amount of data at work, I noticed that the processing was slow. While hunting for the culprit, I found that __idxmax()__ in the pandas library is slow.

Of course, unlike max, it also has to return the index of the maximum value, so it is natural that it is slower. I verified how much slower it actually is.

Prerequisite knowledge

The behavior of pandas __max()__ and __idxmax()__ is as follows.

import pandas as pd
import numpy as np

# Generate a 10x5 DataFrame of random integers in the range 0-99
data = np.random.randint(0,100,[10,5])
df = pd.DataFrame(data,
                  index=['A','B','C','D','E','F','G','H','I','J'],
                  columns=['a','b','c','d','e'])

print(df)
print(df.max())
print(df.idxmax())

__max()__ is a function that returns the maximum value of each column, and __idxmax()__ is a function that returns the index label of the maximum value of each column.
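For concreteness, here is a tiny fixed example (the values are arbitrary and chosen only for illustration):

import pandas as pd

# A small fixed DataFrame so the outputs are predictable
sample = pd.DataFrame({'a': [3, 7, 5], 'b': [9, 1, 2]},
                      index=['A', 'B', 'C'])

print(sample.max())     # a: 7, b: 9  (maximum value of each column)
print(sample.idxmax())  # a: B, b: A  (index label where that maximum occurs)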

Now, let's enlarge the data frame and measure the processing time.

Processing time measurement of max() and idxmax()

import pandas as pd
import numpy as np
import time

arr = np.random.randint(0,100,[10**5,10**4],dtype='int8')
df = pd.DataFrame(arr, dtype='int8')
df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 100000 entries, 0 to 99999
#Columns: 10000 entries, 0 to 9999
#dtypes: int8(10000)
#memory usage: 953.7 MB

ts = time.time()
df.max()
te = time.time()
print('max()_time:', te - ts)
#max()_time: 10.67

ts = time.time()
df.idxmax()
te = time.time()
print('idxmax()_time:', te - ts)
#idxmax()_time: 19.08
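As an aside, time.perf_counter() is generally preferred over time.time() for measuring elapsed time; the measurements above could also be wrapped in a small helper like this sketch (reusing the df created above):

import time

def measure(label, fn):
    # perf_counter() is a monotonic clock intended for timing code
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start)

measure('max()_time:', lambda: df.max())
measure('idxmax()_time:', lambda: df.idxmax())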

The above is the result of measuring the processing time of __max()__ and __idxmax()__ on a DataFrame of roughly 1 GB.

I found that __idxmax()__ was 19.08 ÷ 10.67 = __1.78 times__ slower.

By the way, the machine is a 2018 MacBook Pro (processor: 2.3 GHz Intel Core i5, memory: 8 GB 2133 MHz LPDDR3). On my company's Windows PC the difference was about 6-fold.

Let's take a look at the contents of the idxmax() function to see if it can be made faster.

Source code for idxmax()
import inspect
print(inspect.getsource(pd.DataFrame.idxmax))

After executing this, the source code returned is as follows.

def idxmax(self, axis=0, skipna=True):
    """
    Return index of first occurrence of maximum over requested axis.
    NA/null values are excluded.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        0 or 'index' for row-wise, 1 or 'columns' for column-wise
    skipna : boolean, default True
        Exclude NA/null values. If an entire row/column is NA, the result
        will be NA.

    Returns
    -------
    idxmax : Series

    Raises
    ------
    ValueError
        * If the row/column is empty

    See Also
    --------
    Series.idxmax

    Notes
    -----
    This method is the DataFrame version of ``ndarray.argmax``.
    """
    axis = self._get_axis_number(axis)
    indices = nanops.nanargmax(self.values, axis=axis, skipna=skipna)
    index = self._get_axis(axis)
    result = [index[i] if i >= 0 else np.nan for i in indices]
    return Series(result, index=self._get_agg_axis(axis))

The docstring describes it as the DataFrame version of ndarray's __argmax()__. This was as expected, so next I compare the processing time of NumPy's __max()__ and __argmax()__.

Processing time measurement of max() and argmax()

ts = time.time()
_max = np.max(arr,axis=0)
te = time.time()
print('max()_time:', te - ts)
#max()_time: 0.85

ts = time.time()
_argmax = np.argmax(arr,axis=0)
te = time.time()
print('argmax()_time:', te - ts)
#argmax()_time: 13.70

The result: __argmax()__ was 13.70 ÷ 0.85 = __16.11 times__ slower.

Consideration

From this I found that:

- Both operations are faster on an ndarray than on a DataFrame.
- On an ndarray, max is overwhelmingly faster than argmax.

As for the cause, I suspect that because argmax (and idxmax) returns only the index of the first occurrence when the same maximum value appears more than once, that extra bookkeeping adds to the processing time.
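The first-occurrence behavior itself is easy to confirm with a small check (unrelated to the timing above):

import numpy as np

ties = np.array([1, 5, 5, 3])
print(ties.max())     # 5
print(ties.argmax())  # 1 -- position of the *first* occurrence of the maximum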

If you change the range and size of the random numbers in the DataFrame, this ratio changes considerably, so it is hard to make a quantitative claim. One thing I can say is that if you do not need DataFrame-specific processing (such as groupby), it is better not to turn the data into a DataFrame unnecessarily.
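As a rough sketch of what that looks like: if the index labels of the maxima are still needed, np.argmax() can be run on the underlying ndarray and the positions mapped back to labels at the end. This assumes the df defined earlier and mirrors what the quoted idxmax() source does, minus the skipna handling (NaN values are not excluded here):

import numpy as np
import pandas as pd

values = df.values                      # the ndarray backing the DataFrame
positions = np.argmax(values, axis=0)   # integer row position of each column's maximum
labels = df.index[positions]            # map positions back to index labels
idxmax_like = pd.Series(labels, index=df.columns)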