[PYTHON] That's why I quit pandas [Three ways to do groupby.mean() with NumPy alone]

What is this article about?

If even pandas is too heavy for your low-spec PC (sigh), you may end up wrangling data with NumPy alone. But NumPy has no counterpart to pd.DataFrame.groupby, so some of you may find yourselves weeping and crawling back to pandas. Yet it would be absurd if NumPy couldn't do what pandas can do: flip through the NumPy docs, study its many functions (for loops are off-limits, though), and yes, you can.

For those at a loss where to start, a wise predecessor once wrote "How to aggregate by group with NumPy alone (no for loops, no pandas)" (/numpy-grouping-no-pandas/) (a godsend). But science and technology have advanced since then, and sensing there might be better ways, I studied NumPy's functions and experimented from various angles. This article is dedicated to the record of that battle.

What we'll do

This time, we want the mean and standard deviation for each group; in other words, we want to compute pd.DataFrame.groupby.mean() and pd.DataFrame.groupby.std() with NumPy alone.
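
For reference, here is a minimal sketch of the pandas calls we are replacing (using the df loaded in the next section; ddof=0 is passed to std() because pandas defaults to ddof=1, while this article, like np.std(), computes the population standard deviation):

df.groupby('Name')['PetalWidth'].mean()          # per-group means
df.groupby('Name')['PetalWidth'].std(ddof=0)     # per-group (population) standard deviations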

Data to be used

The familiar iris dataset, with the rows shuffled.

import numpy as np
import pandas as pd

df = (pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/'
                  'master/pandas/tests/data/iris.csv')
      .sample(frac=1, random_state=1))

df.head()
SepalLength SepalWidth PetalLength PetalWidth Name
14 5.8 4 1.2 0.2 Iris-setosa
98 5.1 2.5 3 1.1 Iris-versicolor
75 6.6 3 4.4 1.4 Iris-versicolor
16 5.4 3.9 1.3 0.4 Iris-setosa
131 7.9 3.8 6.4 2 Iris-virginica

A note before converting to a NumPy object: to handle data efficiently in NumPy, it is essential to convert anything non-numeric (strings, object dtype, and so on) into a numeric type. (If you actually want to manipulate strings, a plain Python list is the better tool.) The "Name" column here merely represents a category, and the string itself carries no meaning, so we apply so-called label encoding to turn it into integers. This can be done with np.unique(array, return_inverse=True).

unique_names, df['Name'] = np.unique(df['Name'].to_numpy(), return_inverse=True)
# (array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object),
#  array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
#         ...,
#         1, 0, 1, 1, 1, 1, 2, 0, 0, 2, 1, 2, 1, 2, 2, 1, 2, 0], dtype=int64))

# This also works
df['Name'], unique_names = pd.factorize(df['Name'], sort=True)
Digression

Incidentally, pd.factorize() is known to be faster than np.unique() once the data exceeds a certain size.

import perfplot

test_df = pd.DataFrame(['a'])
perfplot.show(
    setup=lambda n: pd.concat([test_df]*n,
                              ignore_index=True, axis=0).iloc[:, 0],
    n_range=[2 ** k for k in range(20)],
    kernels=[
        lambda series: np.unique(series.to_numpy(), return_inverse=True),
        lambda series: pd.factorize(series, sort=True),
    ],
    labels=["np.unique", "pd.factorize"],
    equality_check=None,
    xlabel="Number of rows of data", logx=True, logy=True, time_unit="s", automatic_order=False
)

(Figure: pd_to_np_1.png -- timing of np.unique vs. pd.factorize)

Apparently, when converting a Series (as opposed to an ndarray) with more than about 1000 rows, pd.factorize() is the one to use.

Now convert to a NumPy array (a record array here, for ease of explanation). This is the moment we break up with pandas.

arr = df.to_records(index=False)
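
A record array keeps name-based column access. A quick look (the values match the head() output above, with Name already label-encoded):

arr['Name'][:5]
# array([0, 1, 1, 0, 2])
arr['PetalWidth'][:3]
# array([0.2, 1.1, 1.4])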

How to do groupby

Part 1 (onehot)

First, the method from the article mentioned above.

One-hot encode the iris varieties, multiply by the vector of petal widths we want to aggregate, and take the column-wise mean. If we averaged the one-hot matrix as-is, the zeros would distort the result, so replace them with NaN beforehand.

Part 1


onehot = np.eye(arr['Name'].max()+1)[arr['Name']]
onehot[onehot == 0] = np.nan
onehot_value = onehot * arr['PetalWidth'][:, None]

result_mean = np.nanmean(onehot_value, 0)
# [0.244 1.326 2.026]
result_std = np.nanstd(onehot_value, 0)
# [0.10613199 0.19576517 0.27188968]

The problem with this method is that np.nanmean() and np.nanstd() are known to be slow. The formula for the mean is

\mathrm{mean} = \frac{1}{N} \sum_{i=1}^{N} x_i

so rewriting it in terms of np.nansum() gives the following.

python


result_mean = np.nansum(onehot_value, 0)/(~np.isnan(onehot_value)).sum(0)
result_std = np.sqrt(np.nansum(onehot_value**2, 0)
                     / (~np.isnan(onehot_value)).sum(0) - result_mean**2)

Furthermore, the NaNs in onehot_value stand in for zeros, but since adding any number of zeros never changes a sum, there is no need to replace anything with NaN in the first place; plain np.sum() works as-is.

And not having to assign np.nan to onehot means it doesn't need to be a float array, so onehot can be built faster as follows.

%timeit np.eye(arr['Name'].max()+1)[arr['Name']]
# 21.2 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 1., 0.],
#        ...,
#        [1., 0., 0.]])


%timeit np.arange(arr['Name'].max()+1) == arr['Name'][:, None]
# 20.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# array([[ True, False, False],
#        [False,  True, False],
#        [False,  True, False],
#        ...,
#        [ True, False, False]])

Therefore,

Part 1 (improved)


onehot = np.arange(arr['Name'].max()+1) == arr['Name'][:, None]
onehot_value = onehot * arr['PetalWidth'][:, None]
sums = onehot_value.sum(0)
counts = onehot.sum(0)

result_mean = sums/counts
result_std = np.sqrt((onehot_value**2).sum(0)/counts - result_mean**2)

Part 2 (reduceat)

The crux of groupby is how to compute something for each group. In Part 1, we reproduced groupby.mean() by moving each group's values into a column of its own and averaging column by column.

Part 1 way of thinking


g |  A  B  B  C  A  B  A  C  C
                 v
A |  1  0  0  0  1  0  1  0  0 |    mean_A
B |  0  1  1  0  0  1  0  0  0 | -> mean_B
C |  0  0  0  1  0  0  0  1  1 |    mean_C

But if the elements of the original array are arranged so that each group is contiguous, another approach becomes possible.

Part 2 way of thinking


g |  A  A  A  B  B  B  C  C  C
    ----A--- ----B--- ----C---
                 v
     mean_A   mean_B   mean_C

So, to line up the elements by group, sort the rows on the Name column.

sorter_index = np.argsort(arr['Name'])
name_s = arr['Name'][sorter_index]
# array([0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2], dtype=int64)
pwidth_s = arr['PetalWidth'][sorter_index]

Next, find the group boundaries. This can be done by comparing adjacent values in the sorted Name column. For example:

ex_array = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
#                    0------  1------  2------  3------← I want the position of this boundary

ex_array[1:] != ex_array[:-1]
"""Way of thinking
 0 [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]    # ex_array[1:]
   [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3] 3  # ex_array[:-1]
   [F, F, T, F, F, T, F, F, T, F, F]    # ex_array[1:] != ex_array[1:]
(If the top two match, F,If they do not match, T)

[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]    # ex_array
[T, F, F, T, F, F, T, F, F, T, F, F, T] #The array I really wanted
 0------  1------  2------  3------
# ex_array[1:] != ex_array[1:]If you add one True before and after,
"T-before the next T" becomes a group.
"""

#Complementary version before and after
np.concatenate(([True], ex_array[1:] != ex_array[:-1], [True]))
# array([ True, False, False, False, False, False, False, False, False, False, False, False, True])

Taking the positions of the True values with np.ndarray.nonzero() gives the boundary position of each group.

cut_index = np.concatenate(([True], ex_array[1:] != ex_array[:-1], [True])).nonzero()[0]
# array([ 0,  3,  6,  9, 12], dtype=int64)
"""Way of thinking
 0  1  2  3  4  5  6  7  8  9 10 11 12   # ex_array index
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]     # ex_array
[0,       3,       6,       9,      12]  # cut_index:Boundary (starting point) of each group
"""

Once again, the mean is "sum ÷ count", so we compute both. The number of elements in each group is the difference between adjacent values of cut_index, which np.diff() or np.ediff1d() can find in one shot.

counts = np.ediff1d(cut_index)
# array([3, 3, 3, 3], dtype=int64)
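
np.diff() gives the identical result here:

np.diff(cut_index)
# array([3, 3, 3, 3], dtype=int64)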

Next, the sum of the elements within each group. Now, there is a NumPy universal function called np.add(x1, x2) that adds two numbers or two arrays.

np.add(1, 2)
# 3
np.add([0, 1], [0, 2])
# array([0, 3])

By itself, np.add() is just +, so you will rarely call this function directly. What matters are the methods every universal function carries. For example, np.ufunc.reduce(array) applies the ufunc to the elements of array one after another, reducing them to a single value. What does that mean? For example:

ex_array = np.arange(10)

result = ex_array[0]
for value in ex_array[1:]:
    result = np.add(result, value)
print(result)
# 45

np.add.reduce(ex_array)
# 45

Furthermore, np.ufunc.reduceat(array, indices) returns the array of np.ufunc.reduce(array[indices[i]:indices[i+1]]) for each i.

ex_array = np.arange(10)
indices = [0, 1, 6, len(ex_array)]

results = []
for i in range(len(indices) - 1):
    result = np.add.reduce(ex_array[indices[i]:indices[i+1]])
    results.append(result)
print(results)
# [0, 15, 30]

indices = [0, 1, 6]
np.add.reduceat(ex_array, indices)
# array([0, 15, 30], dtype=int32)

And indices here is exactly the "group boundaries" array we found earlier. In other words, putting it all together:

Part 2


sorter_index = np.argsort(arr['Name'])
name_s = arr['Name'][sorter_index]
pwidth_s = arr['PetalWidth'][sorter_index]
cut_index, = np.concatenate(([True], name_s[1:] != name_s[:-1], [True])).nonzero()

sums = np.add.reduceat(pwidth_s, cut_index[:-1])
counts = np.ediff1d(cut_index)

result_mean = sums/counts
# [0.244 1.326 2.026]
result_std = np.array([np.std(pwidth_s[s:e], ddof=0)
                       for s, e in zip(cut_index[:-1], cut_index[1:])])
# [0.10613199 0.19576517 0.27188968]

The sad part of this technique is that np.ufunc.reduceat() doesn't adapt well to the standard deviation, so we fall back on a for loop (on the plus side, np.std() can be used directly).
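
That said, the sum-of-squares identity from the improved Part 1 could also be combined with np.add.reduceat() to avoid the loop. A minimal sketch of that variation (my own addition, not the original article's code):

# Variation (assumption): reuse sqrt(E[x^2] - mean^2) with reduceat
sq_sums = np.add.reduceat(pwidth_s**2, cut_index[:-1])
result_std = np.sqrt(sq_sums/counts - result_mean**2)
# [0.10613199 0.19576517 0.27188968]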

Part 3 (bincount)

The simplest method, one that makes both Part 1 and Part 2 look ridiculous.

Since the Name column is the result of label encoding, it is an array filled with values from {0, 1, 2}. These can all be counted at once with np.bincount(). What does that mean? For example:

ex_array = np.array([3, 4, 3, 3, 2, 2, 4, 4, 1, 3])

results = []
for value in range(ex_array.max()+1):
    result = sum(ex_array == value)
    results.append(result)
print(results)
# [0, 1, 2, 4, 3]
# In ex_array, 0 occurs 0 times, 1 once, 2 twice, 3 four times, and 4 three times.

np.bincount(ex_array)
# array([0, 1, 2, 4, 3], dtype=int64)

np.bincount() also takes a weights option for weighted counting. So, for example:

ex_array = np.array([1, 2, 2, 3, 3, 3, 3, 4, 4, 4])
weights = np.array([2, 1, 1, 1, 1, 1, 1, 2, 3, 4])

results = []
for value in range(ex_array.max()+1):
    idx = np.nonzero(ex_array == value)[0]
    result = sum(weights[idx])
    results.append(result)
print(results)
# [0, 2, 2, 4, 9]

np.bincount(ex_array, weights)
# array([0., 2., 2., 4., 9.])

Using these weights, the sum of the elements in each group can be computed. Therefore:

Part 3


counts = np.bincount(arr['Name'])
sums = np.bincount(arr['Name'], arr['PetalWidth'])

result_mean = sums/counts
# [0.244 1.326 2.026]

deviation_array = result_mean[arr['Name']] - arr['PetalWidth']
result_std = np.sqrt(np.bincount(arr['Name'], deviation_array**2) / counts)
# [0.10613199 0.19576517 0.27188968]
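
As a quick sanity check (a sketch, assuming the earlier df is still in scope; its Name column now holds the integer codes), the results can be compared against pandas:

pd_mean = df.groupby('Name')['PetalWidth'].mean().to_numpy()
pd_std = df.groupby('Name')['PetalWidth'].std(ddof=0).to_numpy()
print(np.allclose(result_mean, pd_mean), np.allclose(result_std, pd_std))
# True True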

Summary (those hooked by the title can read just this part)

numpy_groupby.py


import numpy as np
import pandas as pd
import perfplot


def type1(value, group):
    onehot = np.eye(group.max()+1)[group]
    onehot[onehot == 0] = np.nan
    onehot_value = onehot * value[:, None]

    result_mean = np.nanmean(onehot_value, 0)
    result_std = np.nanstd(onehot_value, 0)
    return result_mean, result_std


def type1rev(value, group):
    onehot = np.arange(group.max()+1) == group[:, None]
    onehot_value = onehot * value[:, None]
    sums = onehot_value.sum(0)
    counts = onehot.sum(0)

    result_mean = sums/counts
    result_std = np.sqrt((onehot_value**2).sum(0)/counts - result_mean**2)
    return result_mean, result_std


def type2(value, group):
    sorter_index = np.argsort(group)
    group_s = group[sorter_index]
    value_s = value[sorter_index]
    cut_index, = np.concatenate(([True], group_s[1:] != group_s[:-1], [True])).nonzero()

    sums = np.add.reduceat(value_s, cut_index[:-1])
    counts = np.ediff1d(cut_index)

    result_mean = sums/counts
    result_std = np.array([np.std(value_s[s:e], ddof=0)
                           for s, e in zip(cut_index[:-1], cut_index[1:])])
    return result_mean, result_std


def type3(value, group):
    counts = np.bincount(group)
    sums = np.bincount(group, value)

    result_mean = sums/counts
    result_std = np.sqrt(
        np.bincount(group, (result_mean[group] - value)**2) / counts)
    return result_mean, result_std


df = (pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/'
                  'master/pandas/tests/data/iris.csv')
      .sample(frac=1, random_state=1))
unique_names, df['Name'] = np.unique(df['Name'].to_numpy(), return_inverse=True)
arr = df.to_records(index=False)[:5]

# Comparison
perfplot.show(
    setup=lambda n: np.concatenate([arr]*n),
    n_range=[2 ** k for k in range(21)],
    kernels=[
        lambda arr: type1(arr['PetalWidth'], arr['Name']),
        lambda arr: type1rev(arr['PetalWidth'], arr['Name']),
        lambda arr: type2(arr['PetalWidth'], arr['Name']),
        lambda arr: type3(arr['PetalWidth'], arr['Name']),
    ],
    labels=["type1", "type1rev", "type2", "type3"],
    equality_check=None,
    xlabel="Number of data rows (× 5)", logx=True, logy=True,
    time_unit="s", automatic_order=False
)

(Figure: pd_to_np_2.png -- timing of type1, type1rev, type2, and type3)

Part 3 wins (though the margin can depend on the number of groups).

If I've made a mistake anywhere, please leave a comment.


Bonus: Functions that appeared this time

Let's remember them!

np.nansum() -- Returns the sum of array elements over a given axis, treating NaNs as zero.
np.nanmean() -- Computes the arithmetic mean along the specified axis, ignoring NaNs.
np.nanstd() -- Computes the standard deviation along the specified axis, ignoring NaNs.

np.add() -- Adds arguments element-wise.
np.ufunc.reduce() -- Applies the ufunc along one axis, reducing the array's dimension by one.
np.ufunc.reduceat() -- Performs a (local) reduce over the specified slices along a single axis.

np.argsort() -- Returns the indices that would sort an array.
np.nonzero() -- Returns the indices of the non-zero elements.

np.bincount() -- Counts the number of occurrences of each value in an array of non-negative integers.
np.unique() -- Finds the unique elements of an array.

np.ediff1d() -- The differences between consecutive elements of an array.

np.concatenate() -- Joins a sequence of arrays along an existing axis.
