When working with tabular data in pandas, you may want to shift rows within each category (for example, shifting time-series data by one period per user). The straightforward way to transform data by group is `groupby().transform()`. However, the grouping step takes a lot of time, so let's measure how long several approaches take.
- OS: Microsoft Windows 10 Home
- Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz, 2712 MHz, 2 cores, 4 logical processors
- Installed physical memory (RAM): 8.00 GB
I prepared 10 million rows. There are 7 columns: 5 hold arbitrary numbers, and 2 are categorical variables (Y and Z) with 10 and 4 categories, respectively.
```python
import numpy as np
import pandas as pd

x = np.arange(10_000_000)
y = np.tile(np.arange(10), int(len(x) / 10))
z = np.tile(np.arange(4), int(len(x) / 4))
df = pd.DataFrame({"a": x, "b": x, "c": x, "d": x, "e": x, "Y": y, "Z": z})
```
This time I grouped by the two categorical variables.

The straightforward way below is the baseline for comparison.
```python
%%timeit -n 1 -r 10
s = df.groupby(["Y", "Z"])["a"].transform(lambda x: x.shift(1))
# 3.25 s ± 107 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
This method combines the grouping variables into a single key in advance. The grouping itself becomes faster, but building the key takes much longer, so it only pays off when you group by the same key many times.
```python
dg = df.copy()
dg["YZ"] = dg["Y"].astype("str") + dg["Z"].astype("str")
# 13.7 s ± 964 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```

```python
%%timeit -n 1 -r 10
s = dg.groupby(["YZ"])["a"].transform(lambda x: x.shift(1))
# 2.62 s ± 25.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
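One caveat with string concatenation as a key: once a variable can have multi-character values, different key pairs can collapse into the same string. Here Y (0–9) and Z (0–3) are single digits, so it is safe, but adding a separator is a cheap safeguard. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame where naive concatenation collides: ("1", "11") and ("11", "1")
df = pd.DataFrame({"Y": ["1", "11"], "Z": ["11", "1"], "a": [10, 20]})

naive = df["Y"] + df["Z"]        # both rows become "111" -> groups merge
safe = df["Y"] + "_" + df["Z"]   # "1_11" vs "11_1" stay distinct

print(naive.nunique())  # 1
print(safe.nunique())   # 2
```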
This method defines a NumPy shift function and uses it in place of the pandas shift method. The speed doesn't change much.

Reference: Shift elements in a numpy array (Stack Overflow)
```python
def shift2(arr, num):
    result = np.empty(arr.shape[0])
    if num > 0:
        result[:num] = np.nan
        result[num:] = arr[:-num]
    elif num < 0:
        result[num:] = np.nan  # fill the tail (not the head) for negative shifts
        result[:num] = arr[-num:]
    else:
        result = arr
    return result
```
```python
%%timeit -n 1 -r 10
s = df.groupby(["Y", "Z"])["a"].transform(lambda x: shift2(x, 1))
# 3.2 s ± 15.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
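As a quick sanity check that the helper behaves like `Series.shift` (the helper is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def shift2(arr, num):
    # Standalone copy of the helper above (Stack Overflow version).
    result = np.empty(arr.shape[0])
    if num > 0:
        result[:num] = np.nan
        result[num:] = arr[:-num]
    elif num < 0:
        result[num:] = np.nan
        result[:num] = arr[-num:]
    else:
        result = arr
    return result

a = np.array([1.0, 2.0, 3.0, 4.0])
print(shift2(a, 1))   # [nan  1.  2.  3.]
print(shift2(a, -1))  # [ 2.  3.  4. nan]

# Agrees with pandas' Series.shift:
assert np.allclose(shift2(a, 1), pd.Series(a).shift(1), equal_nan=True)
```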
This method iterates over the groups and processes each one in turn. The speed is about the same, but it allows much more flexible per-group processing, so it's a favorite of mine.
```python
%%timeit -n 1 -r 10
l = [group["a"].shift(1) for _, group in df.groupby(["Y", "Z"])]
dh = pd.concat(l, axis=0).sort_index()
# 3.12 s ± 14.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
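The `sort_index()` call matters here: groupby iteration yields the groups in key order, so the concatenated result must be re-sorted to restore the original row order. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Y": [0, 1, 0, 1], "a": [10, 20, 30, 40]})

parts = [group["a"].shift(1) for _, group in df.groupby("Y")]
out = pd.concat(parts, axis=0).sort_index()

# Without sort_index the rows would come back grouped (indices 0, 2, 1, 3).
print(out.tolist())  # [nan, nan, 10.0, 20.0]
```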
Actually, `transform` is unnecessary: groupby objects have their own `shift` method. This is the fastest.
```python
%%timeit -n 1 -r 10
s = df.groupby(["Y", "Z"])["a"].shift(1)
# 983 ms ± 10.9 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
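On a small frame one can check that `groupby(...).shift(1)` returns exactly what the `transform` version does, just without a Python-level call per group. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Y": [0, 0, 1, 1, 0], "a": [1, 2, 3, 4, 5]})

via_transform = df.groupby("Y")["a"].transform(lambda s: s.shift(1))
direct = df.groupby("Y")["a"].shift(1)

# Both are aligned to the original index and hold the same values.
assert via_transform.equals(direct)
print(direct.tolist())  # [nan, 1.0, nan, 3.0, 2.0]
```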
Method | Description | Time (per loop) |
---|---|---|
Method 1 | Standard transform | 3.25 s ± 0.107 s |
Method 2 | Pre-joined key | 2.62 s ± 0.0251 s |
Method 3 | NumPy shift | 3.2 s ± 0.0151 s |
Method 4 | Iteration | 3.12 s ± 0.0144 s |
Method 5 | No transform | 0.983 s ± 0.0109 s |
Don't use `transform` (commandment).