When working with tabular data in pandas, you may want to shift rows within each category (for example, shifting time-series data by one period per user). The straightforward way to transform data by group is `groupby().transform()`. However, the grouping step takes a lot of time, so let's measure how long several approaches take.
- OS: Microsoft Windows 10 Home
- Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz, 2712 MHz, 2 cores, 4 logical processors
- Installed physical memory (RAM): 8.00 GB
I prepared 10 million rows. There are 7 columns: 5 hold arbitrary numbers, and 2 are categorical variables (Y and Z) with 10 and 4 categories, respectively.
```python
import numpy as np
import pandas as pd

x = np.arange(10_000_000)
y = np.tile(np.arange(10), int(len(x) / 10))
z = np.tile(np.arange(4), int(len(x) / 4))
df = pd.DataFrame({"a": x, "b": x, "c": x, "d": x, "e": x, "Y": y, "Z": z})
```
This time I grouped by the two categorical variables.

The straightforward way below is the baseline for comparison.
```python
%%timeit -n 1 -r 10
s = df.groupby(["Y", "Z"])["a"].transform(lambda x: x.shift(1))
# 3.25 s ± 107 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
This method combines the grouping variables into a single key in advance. The grouping itself becomes faster, but building the key takes much longer, so it only pays off when you group by the same key many times.
```python
dg = df.copy()
dg["YZ"] = dg["Y"].astype("str") + dg["Z"].astype("str")
# 13.7 s ± 964 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```

```python
%%timeit -n 1 -r 10
s = dg.groupby(["YZ"])["a"].transform(lambda x: x.shift(1))
# 2.62 s ± 25.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
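One caveat with string concatenation as a key: once a variable can have multi-character values, different key pairs can collapse into the same string. Here Y (0–9) and Z (0–3) are single digits, so it is safe, but adding a separator is a cheap safeguard. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame where naive concatenation collides: ("1", "11") and ("11", "1")
df = pd.DataFrame({"Y": ["1", "11"], "Z": ["11", "1"], "a": [10, 20]})

naive = df["Y"] + df["Z"]        # both rows become "111" -> groups merge
safe = df["Y"] + "_" + df["Z"]   # "1_11" vs "11_1" stay distinct

print(naive.nunique())  # 1
print(safe.nunique())   # 2
```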
This method defines a NumPy shift function and uses it in place of the pandas shift method. The speed doesn't change much.

Reference: Shift elements in a numpy array (Stack Overflow)
```python
def shift2(arr, num):
    result = np.empty(arr.shape[0])
    if num > 0:
        result[:num] = np.nan
        result[num:] = arr[:-num]
    elif num < 0:
        result[num:] = np.nan  # fill the tail (not the head) for negative shifts
        result[:num] = arr[-num:]
    else:
        result = arr
    return result
```
```python
%%timeit -n 1 -r 10
s = df.groupby(["Y", "Z"])["a"].transform(lambda x: shift2(x, 1))
# 3.2 s ± 15.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
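As a quick sanity check that the helper behaves like `Series.shift` (the helper is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def shift2(arr, num):
    # Standalone copy of the helper above (Stack Overflow version).
    result = np.empty(arr.shape[0])
    if num > 0:
        result[:num] = np.nan
        result[num:] = arr[:-num]
    elif num < 0:
        result[num:] = np.nan
        result[:num] = arr[-num:]
    else:
        result = arr
    return result

a = np.array([1.0, 2.0, 3.0, 4.0])
print(shift2(a, 1))   # [nan  1.  2.  3.]
print(shift2(a, -1))  # [ 2.  3.  4. nan]

# Agrees with pandas' Series.shift:
assert np.allclose(shift2(a, 1), pd.Series(a).shift(1), equal_nan=True)
```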
This method iterates over the groups and processes each one in turn. The speed is about the same, but it allows much more flexible per-group processing, so it's a favorite of mine.
```python
%%timeit -n 1 -r 10
l = [group["a"].shift(1) for _, group in df.groupby(["Y", "Z"])]
dh = pd.concat(l, axis=0).sort_index()
# 3.12 s ± 14.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
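The `sort_index()` call matters here: groupby iteration yields the groups in key order, so the concatenated result must be re-sorted to restore the original row order. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Y": [0, 1, 0, 1], "a": [10, 20, 30, 40]})

parts = [group["a"].shift(1) for _, group in df.groupby("Y")]
out = pd.concat(parts, axis=0).sort_index()

# Without sort_index the rows would come back grouped (indices 0, 2, 1, 3).
print(out.tolist())  # [nan, nan, 10.0, 20.0]
```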
Actually, `transform` is unnecessary: groupby objects have their own `shift` method. This is the fastest.
```python
%%timeit -n 1 -r 10
s = df.groupby(["Y", "Z"])["a"].shift(1)
# 983 ms ± 10.9 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
```
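On a small frame one can check that `groupby(...).shift(1)` returns exactly what the `transform` version does, just without a Python-level call per group. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Y": [0, 0, 1, 1, 0], "a": [1, 2, 3, 4, 5]})

via_transform = df.groupby("Y")["a"].transform(lambda s: s.shift(1))
direct = df.groupby("Y")["a"].shift(1)

# Both are aligned to the original index and hold the same values.
assert via_transform.equals(direct)
print(direct.tolist())  # [nan, 1.0, nan, 3.0, 2.0]
```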
Method | Description | Time (per loop) |
---|---|---|
Method 1 | Standard transform | 3.25 s ± 0.107 s |
Method 2 | Pre-joined key | 2.62 s ± 0.0251 s |
Method 3 | NumPy shift | 3.2 s ± 0.0151 s |
Method 4 | Iteration | 3.12 s ± 0.0144 s |
Method 5 | No transform | 0.983 s ± 0.0109 s |
Don't use `transform` (commandment).