People who grumble that pandas is already too heavy on a low-spec PC (sigh) end up wrangling their data with NumPy instead. But NumPy has no equivalent of pd.DataFrame.groupby, so some of them may end up missing pandas in tears. Still, if you assume that what pandas can do cannot be done with NumPy, you just need to dig through the NumPy docs and its many functions (for loops are off-limits, though). Yes, you can.
For those at a loss, a wise person once wrote "How to aggregate by group with NumPy only (no for loops, no pandas)" (numpy-grouping-no-pandas), which was a godsend. However, science and technology have advanced since then, and I felt there might be a better way, so I experimented with various approaches while studying NumPy functions. This article is dedicated to the record of that battle, originally a rambling memo written in Japanese.
This time, I want to compute the mean and standard deviation for each group, that is, to reproduce pd.DataFrame.groupby.mean() and pd.DataFrame.groupby.std() using only NumPy.
The familiar iris data. The rows have been shuffled.
import numpy as np
import pandas as pd
df = (pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/'
'master/pandas/tests/data/iris.csv')
.sample(frac=1, random_state=1))
df.head()
| | SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
|---|---|---|---|---|---|
| 14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | Iris-versicolor |
| 75 | 6.6 | 3.0 | 4.4 | 1.4 | Iris-versicolor |
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
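For reference, this is what we are trying to reproduce in pandas. A note of my own (not in the original article): groupby().std() uses ddof=1 by default, while all the NumPy versions below compute the population standard deviation, so pass ddof=0 when comparing results.
# pandas baseline (my addition, for comparison only)
df.groupby('Name')['PetalWidth'].mean()
df.groupby('Name')['PetalWidth'].std(ddof=0)  # ddof=0 matches the NumPy results below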
A note before converting to a NumPy object: to handle data efficiently in NumPy, anything that is not a numeric type (strings, object dtype, etc.) must be converted to a numeric type. If you really want to manipulate strings, you are better off with a plain Python list.
The "Name" column here only represents the category; the string itself carries no meaning in the data, so we apply so-called label encoding to convert it to numbers. This can be done with np.unique(array, return_inverse=True).
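Here is a minimal toy example of my own showing what return_inverse does: it returns, for each element, its index within the sorted unique values.
names = np.array(['b', 'a', 'b', 'c'])
uniq, codes = np.unique(names, return_inverse=True)
# uniq  -> array(['a', 'b', 'c'], dtype='<U1')
# codes -> array([1, 0, 1, 2])
Applied to the Name column: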
unique_names, df['Name'] = np.unique(df['Name'].to_numpy(), return_inverse=True)
# (array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object),
# array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
# ...,
# 1, 0, 1, 1, 1, 1, 2, 0, 0, 2, 1, 2, 1, 2, 2, 1, 2, 0], dtype=int64))
#This is also possible
df['Name'], unique_names = pd.factorize(df['Name'], sort=True)
Incidentally, pd.factorize() is known to be faster than np.unique() once the data exceeds a certain size.
import perfplot
test_df = pd.DataFrame(['a'])
perfplot.show(
    setup=lambda n: pd.concat([test_df]*n,
                              ignore_index=True, axis=0).iloc[:, 0],
    n_range=[2 ** k for k in range(20)],
    kernels=[
        lambda series: np.unique(series.to_numpy(), return_inverse=True),
        lambda series: pd.factorize(series, sort=True),
    ],
    labels=["np.unique", "pd.factorize"],
    equality_check=None,
    xlabel="Number of rows of data", logx=True, logy=True,
    time_unit="s", automatic_order=False
)
Apparently pd.factorize() is the better choice when converting a Series (as opposed to a plain ndarray) with more than about 1000 rows.
Now convert to a NumPy array (here a record array, for convenience of explanation). This is the moment we part ways with pandas.
arr = df.to_records(index=False)
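A quick note of my own: the record array keeps the DataFrame's column names as field names, so columns are still accessed by name.
arr.dtype.names        # ('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name')
arr['Name'][:3]        # label-encoded species codes
arr['PetalWidth'][:3]  # the values we will aggregate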
First, let me introduce the method from the article mentioned above.
One-hot encode the iris species. Multiply this by the vector of petal widths you want to aggregate, and take the column-wise mean. If the one-hot vectors are averaged as they are, the zeros drag the result down, so replace them with NaN beforehand.
Part 1
onehot = np.eye(arr['Name'].max()+1)[arr['Name']]
onehot[onehot == 0] = np.nan
onehot_value = onehot * arr['PetalWidth'][:, None]
result_mean = np.nanmean(onehot_value, 0)
# [0.244 1.326 2.026]
result_std = np.nanstd(onehot_value, 0)
# [0.10613199 0.19576517 0.27188968]
The problem with this method is that np.nanmean() and np.nanstd() are known to be slow. Since the formula for the mean is
mean = \frac{1}{N} \sum_{i=1}^{N} x_i
rewriting it with np.nansum() gives the following.
result_mean = np.nansum(onehot_value, 0)/(~np.isnan(onehot_value)).sum(0)
result_std = np.sqrt(np.nansum(onehot_value**2, 0)
/ (~np.isnan(onehot_value)).sum(0) - result_mean**2)
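The standard deviation line uses the identity for the population standard deviation (my own note, matching the code above):
std = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2 - mean^2}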
Furthermore, we replaced the 0s in onehot_value with np.nan, but a sum is unchanged no matter how many 0s are added, so there is actually no need to replace anything with np.nan; plain np.sum() can be used as is.
Not having to assign np.nan to onehot also means it does not need to be a float array, so onehot itself can be obtained faster by the following method.
%timeit np.eye(arr['Name'].max()+1)[arr['Name']]
# 21.2 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 1., 0.],
# ...,
# [1., 0., 0.]])
%timeit np.arange(arr['Name'].max()+1) == arr['Name'][:, None]
# 20.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# array([[ True, False, False],
# [False, True, False],
# [False, True, False],
# ...,
# [ True, False, False]])
Therefore,
Part 1 (revised)
onehot = np.arange(arr['Name'].max()+1) == arr['Name'][:, None]
onehot_value = onehot * arr['PetalWidth'][:, None]
sums = onehot_value.sum(0)
counts = onehot.sum(0)
result_mean = sums/counts
result_std = np.sqrt((onehot_value**2).sum(0)/counts - result_mean**2)
The whole point of groupby is how to perform a computation per group.
In Part 1, we reproduced groupby.mean() by spreading the values into a separate column for each group and averaging each column.
Part 1 way of thinking
g | A B B C A B A C C
v
A | 1 0 0 0 1 0 1 0 0 | mean_A
B | 0 1 1 0 0 1 0 0 0 | -> mean_B
C | 0 0 0 1 0 0 0 1 1 | mean_C
However, if the elements of the original array are lined up group by group, the following approach also becomes possible.
Part 2 way of thinking
g | A A A B B B C C C
----A--- ----B--- ----C---
v
mean_A mean_B mean_C
So, to line the rows up by group, we sort on the Name column.
sorter_index = np.argsort(arr['Name'])
name_s = arr['Name'][sorter_index]
# array([0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2], dtype=int64)
pwidth_s = arr['PetalWidth'][sorter_index]
Next we find the group boundaries. This can be done by comparing adjacent values in the sorted Name column. For example:
ex_array = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
# 0------ 1------ 2------ 3------← I want the position of this boundary
ex_array[1:] != ex_array[:-1]
"""Way of thinking
 0 [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]    # ex_array[1:]
   [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3] 3  # ex_array[:-1]
   [F, F, T, F, F, T, F, F, T, F, F]    # ex_array[1:] != ex_array[:-1]
   (F where the two rows above match, T where they differ)
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]     # ex_array
[T, F, F, T, F, F, T, F, F, T, F, F, T]  # the array we actually want
 0------ 1------ 2------ 3------
# Pad ex_array[1:] != ex_array[:-1] with one True at each end,
# and each "True up to the next True" becomes one group.
"""
# Version padded with True at both ends
np.concatenate(([True], ex_array[1:] != ex_array[:-1], [True]))
# array([ True, False, False, False, False, False, False, False, False, False, False, False, True])
Taking the positions of the True values in this array with np.ndarray.nonzero() gives the boundary position of each group.
cut_index = np.concatenate(([True], ex_array[1:] != ex_array[:-1], [True])).nonzero()[0]
# array([ 0, 3, 6, 9, 12], dtype=int64)
"""Way of thinking
0 1 2 3 4 5 6 7 8 9 10 11 12 # ex_array index
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3] # ex_array
[0, 3, 6, 9, 12] # cut_index:Boundary (starting point) of each group
"""
Once again, the mean is "sum ÷ count", so we compute each of those. The number of elements in each group is the difference between adjacent values of cut_index, which np.diff() or np.ediff1d() gives in one shot.
counts = np.ediff1d(cut_index)
# array([3, 3, 3, 3], dtype=int64)
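For completeness (my own note), np.diff() gives the same group sizes:
np.diff(cut_index)
# array([3, 3, 3, 3], dtype=int64)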
Next, the sum of the elements in each group. Now, there is a **NumPy universal function** called np.add(x1, x2) that adds two numbers or arrays element-wise.
np.add(1, 2)
# 3
np.add([0, 1], [0, 2])
# array([0, 3])
By itself, np.add() is just +, so you will rarely call the function directly. What matters are the *methods* of universal functions. For example, np.ufunc.reduce(array) takes the elements of array one after another and applies the ufunc cumulatively. What does that mean? For example:
ex_array = np.arange(10)
result = ex_array[0]
for value in ex_array[1:]:
    result = np.add(result, value)
print(result)
# 45
np.add.reduce(ex_array)
# 45
Furthermore, np.ufunc.reduceat(array, indices) gives an array of np.ufunc.reduce(a[indices[i]:indices[i+1]]) for each i.
ex_array = np.arange(10)
indices = [0, 1, 6, len(ex_array)]
results = []
for i in range(len(indices) - 1):
    result = np.add.reduce(ex_array[indices[i]:indices[i+1]])
    results.append(result)
print(results)
# [0, 15, 30]
indices = [0, 1, 6]
np.add.reduceat(ex_array, indices)
# array([0, 15, 30], dtype=int32)
Here, indices is exactly the "group boundaries" we found earlier. Note that the last slice of reduceat runs to the end of the array, so the final endpoint does not need to be passed (which is why cut_index[:-1] appears below). Putting it all together:
Part 2
sorter_index = np.argsort(arr['Name'])
name_s = arr['Name'][sorter_index]
pwidth_s = arr['PetalWidth'][sorter_index]
cut_index, = np.concatenate(([True], name_s[1:] != name_s[:-1], [True])).nonzero()
sums = np.add.reduceat(pwidth_s, cut_index[:-1])
counts = np.ediff1d(cut_index)
result_mean = sums/counts
# [0.244 1.326 2.026]
result_std = np.array([np.std(pwidth_s[s:e], ddof=0)
for s, e in zip(cut_index[:-1], cut_index[1:])])
# [0.10613199 0.19576517 0.27188968]
The sad thing about this technique is that np.ufunc.reduceat() cannot easily be used for the standard deviation, so a for loop is needed there (on the plus side, np.std() can then be used directly).
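As an aside of my own (not in the original article), the loop can also be avoided by applying np.add.reduceat to the squared values and reusing the identity var = E[x^2] - mean^2 from Part 1:
# Sketch (my addition): per-group std without the for loop
sq_sums = np.add.reduceat(pwidth_s**2, cut_index[:-1])
result_std = np.sqrt(sq_sums/counts - result_mean**2)
# matches the values above up to floating-point rounding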
Finally, the simplest method of all, one that makes Part 1 and Part 2 look like overkill.
Since the Name column was produced by label encoding, it contains only the values {0, 1, 2}. These can all be counted at once with np.bincount(). What does that mean? For example:
ex_array = np.array([3, 4, 3, 3, 2, 2, 4, 4, 1, 3])
results = []
for value in range(ex_array.max()+1):
    result = sum(ex_array == value)
    results.append(result)
print(results)
# [0, 1, 2, 4, 3]
# In ex_array, 0 appears 0 times, 1 once, 2 twice, 3 four times, and 4 three times
np.bincount(ex_array)
# array([0, 1, 2, 4, 3], dtype=int64)
np.bincount() also has a weights option for weighted counting. So, for example:
ex_array = np.array([1, 2, 2, 3, 3, 3, 3, 4, 4, 4])
weights = np.array([2, 1, 1, 1, 1, 1, 1, 2, 3, 4])
results = []
for value in range(ex_array.max()+1):
    idx = np.nonzero(ex_array == value)[0]
    result = sum(weights[idx])
    results.append(result)
print(results)
# [0, 2, 2, 4, 9]
np.bincount(ex_array, weights)
# array([0., 2., 2., 4., 9.])
Using these weights, the per-group sums can also be computed. Therefore:
Part 3
counts = np.bincount(arr['Name'])
sums = np.bincount(arr['Name'], arr['PetalWidth'])
result_mean = sums/counts
# [0.244 1.326 2.026]
deviation_array = result_mean[arr['Name']] - arr['PetalWidth']
result_std = np.sqrt(np.bincount(arr['Name'], deviation_array**2) / counts)
# [0.10613199 0.19576517 0.27188968]
Finally, a speed comparison of all the approaches.
numpy_groupby.py
import numpy as np
import pandas as pd
import perfplot
def type1(value, group):
    onehot = np.eye(group.max()+1)[group]
    onehot[onehot == 0] = np.nan
    onehot_value = onehot * value[:, None]
    result_mean = np.nanmean(onehot_value, 0)
    result_std = np.nanstd(onehot_value, 0)
    return result_mean, result_std

def type1rev(value, group):
    onehot = np.arange(group.max()+1) == group[:, None]
    onehot_value = onehot * value[:, None]
    sums = onehot_value.sum(0)
    counts = onehot.sum(0)
    result_mean = sums/counts
    result_std = np.sqrt((onehot_value**2).sum(0)/counts - result_mean**2)
    return result_mean, result_std

def type2(value, group):
    sorter_index = np.argsort(group)
    group_s = group[sorter_index]
    value_s = value[sorter_index]
    cut_index, = np.concatenate(([True], group_s[1:] != group_s[:-1], [True])).nonzero()
    sums = np.add.reduceat(value_s, cut_index[:-1])
    counts = np.ediff1d(cut_index)
    result_mean = sums/counts
    result_std = np.array([np.std(value_s[s:e], ddof=0)
                           for s, e in zip(cut_index[:-1], cut_index[1:])])
    return result_mean, result_std

def type3(value, group):
    counts = np.bincount(group)
    sums = np.bincount(group, value)
    result_mean = sums/counts
    result_std = np.sqrt(
        np.bincount(group, (result_mean[group] - value)**2) / counts)
    return result_mean, result_std
df = (pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/'
'master/pandas/tests/data/iris.csv')
.sample(frac=1, random_state=1))
unique_names, df['Name'] = np.unique(df['Name'].to_numpy(), return_inverse=True)
arr = df.to_records(index=False)[:5]
#Comparison
perfplot.show(
    setup=lambda n: np.concatenate([arr]*n),
    n_range=[2 ** k for k in range(21)],
    kernels=[
        lambda arr: type1(arr['PetalWidth'], arr['Name']),
        lambda arr: type1rev(arr['PetalWidth'], arr['Name']),
        lambda arr: type2(arr['PetalWidth'], arr['Name']),
        lambda arr: type3(arr['PetalWidth'], arr['Name']),
    ],
    labels=["type1", "type1rev", "type2", "type3"],
    equality_check=None,
    xlabel="Number of data rows (× 5)", logx=True, logy=True,
    time_unit="s", automatic_order=False
)
Part 3 wins (though it depends on the number of groups).
If I have made a mistake somewhere, please let me know in the comments.
Let's remember these! (See the official documentation for details.)
np.nansum() -- Returns the sum of array elements over a given axis, treating NaNs as zero.
np.nanmean() -- Computes the arithmetic mean along the given axis, ignoring NaNs.
np.nanstd() -- Computes the standard deviation along the given axis, ignoring NaNs.
np.add() -- Adds arguments element-wise.
np.ufunc.reduce() -- Applies the ufunc along one axis, reducing the array's dimension by one.
np.ufunc.reduceat() -- Performs a (local) reduce with specified slices over a single axis.
np.argsort() -- Returns the indices that would sort an array.
np.nonzero() -- Returns the indices of the non-zero elements.
np.bincount() -- Counts the number of occurrences of each value in an array of non-negative integers.
np.unique() -- Finds the unique elements of an array.
np.ediff1d() -- Differences between consecutive elements of an array.
np.concatenate() -- Joins a sequence of arrays along an existing axis.