[PYTHON] Speed comparison when searching with multiple keys in pandas (MultiIndex vs Others)

Introduction

MultiIndex is useful for extracting rows by multiple keys, but I had been searching with multiple keys without knowing it existed, so I compared the execution speed with and without MultiIndex.

Execution environment

Code was executed and timed in Jupyter Notebook. The code below assumes that

import pandas as pd

has already been run.

Data preparation

sample.csv


city,year,number,dividion
Tokyo,2019,1,A
Tokyo,2019,2,B
Tokyo,2020,1,C
Tokyo,2020,2,D
Tokyo,2018,1,E
Tokyo,2018,2,F
Kyoto,2019,1,G
Kyoto,2019,2,H
Kyoto,2020,1,I
Kyoto,2020,2,J
Kyoto,2018,1,K
Kyoto,2018,2,L

Load sample.csv into a DataFrame.

df1 = pd.read_csv("sample.csv")

It is read in as follows.

city year number dividion
0 Tokyo 2019 1 A
1 Tokyo 2019 2 B
2 Tokyo 2020 1 C
3 Tokyo 2020 2 D
4 Tokyo 2018 1 E
5 Tokyo 2018 2 F
6 Kyoto 2019 1 G
7 Kyoto 2019 2 H
8 Kyoto 2020 1 I
9 Kyoto 2020 2 J
10 Kyoto 2018 1 K
11 Kyoto 2018 2 L
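A quick way to confirm that the (city, year, number) triple uniquely identifies each row is `DataFrame.duplicated` with a column subset; a minimal sketch (the data is built inline here so the snippet runs without sample.csv):

```python
import pandas as pd

# Inline copy of sample.csv so the check runs without the file
df = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# No duplicated (city, year, number) triple means the three columns form a unique key
assert not df.duplicated(subset=["city", "year", "number"]).any()
```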
In this table, dividion is uniquely determined by the three keys: city, year, and number.

Method 1: Specify multiple conditions without any special preparation

Processing speed including csv reading
%%timeit
df1 = pd.read_csv("sample.csv")
df1[(df1["city"] == "Tokyo")&(df1["year"] == 2019)&(df1["number"] == 1)].iloc[0]

Execution result: 2.65 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When csv is read in advance

%timeit df1[(df1["city"] == "Tokyo")&(df1["year"] == 2019)&(df1["number"] == 1)].iloc[0]

Execution result: 1.44 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 1 took a considerable amount of time when processing large amounts of data, so I had been using Method 2 instead.
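As an aside, the multi-condition mask in Method 1 can also be written with `DataFrame.query`, which some find more readable; a sketch using inline data in place of sample.csv (not part of the original measurements):

```python
import pandas as pd

# Inline copy of sample.csv so the snippet runs without the file
df1 = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# Method 1 as a boolean mask
row = df1[(df1["city"] == "Tokyo") & (df1["year"] == 2019) & (df1["number"] == 1)].iloc[0]

# The same filter written with query(); the result is identical
row_q = df1.query("city == 'Tokyo' and year == 2019 and number == 1").iloc[0]

assert row["dividion"] == row_q["dividion"] == "A"
```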

Method 2: Add and search compound key columns

Create a unique key by concatenating the three values city, year, and number as strings.

df2 = pd.read_csv("sample.csv")
# Extract the city, year, and number columns as lists (each element converted to str)
cities = list(map(str, df2["city"].values.tolist()))
years = list(map(str, df2["year"].values.tolist()))
numbers = list(map(str, df2["number"].values.tolist()))
# Concatenate the three strings to generate a unique key
keys = [city+year+number for city, year, number in zip(cities, years, numbers)]
df2["key"] = keys

This produces the following DataFrame.

city year number dividion key
0 Tokyo 2019 1 A Tokyo20191
1 Tokyo 2019 2 B Tokyo20192
2 Tokyo 2020 1 C Tokyo20201
3 Tokyo 2020 2 D Tokyo20202
4 Tokyo 2018 1 E Tokyo20181
5 Tokyo 2018 2 F Tokyo20182
6 Kyoto 2019 1 G Kyoto20191
7 Kyoto 2019 2 H Kyoto20192
8 Kyoto 2020 1 I Kyoto20201
9 Kyoto 2020 2 J Kyoto20202
10 Kyoto 2018 1 K Kyoto20181
11 Kyoto 2018 2 L Kyoto20182
The speed is measured in the same way as in Method 1.

Processing speed including csv reading

%%timeit
df2 = pd.read_csv("sample.csv")
cities = list(map(str, df2["city"].values.tolist()))
years = list(map(str, df2["year"].values.tolist()))
numbers = list(map(str, df2["number"].values.tolist()))
keys = [city+year+number for city, year, number in zip(cities, years, numbers)]
df2["key"] = keys
df2[df2["key"] == "Tokyo20191"].iloc[0]

Execution result: 2.5 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When csv is read and preprocessed

%timeit df2[df2["key"] == "Tokyo20191"].iloc[0]

Execution result: 569 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can see that the execution time including csv reading barely changes, but the search-only execution time improves to the microsecond order.
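The key column in Method 2 can also be built with pandas' vectorized string operations instead of the list comprehensions above; a sketch with inline data in place of sample.csv (the resulting keys are identical):

```python
import pandas as pd

# Inline copy of sample.csv so the snippet runs without the file
df2 = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# Vectorized concatenation: astype(str) plus + avoids the Python-level loop
df2["key"] = df2["city"].astype(str) + df2["year"].astype(str) + df2["number"].astype(str)

print(df2[df2["key"] == "Tokyo20191"].iloc[0]["dividion"])  # prints A
```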

Method 3: When using MultiIndex

Having just learned about MultiIndex, let's check whether it is faster than Method 2. For more information on MultiIndex, see the official documentation.

If you specify city, year, and number as the index, the file is read as follows.

df3 = pd.read_csv("sample.csv", index_col=["city","year","number"])

dividion
city year number
Tokyo 2019 1 A
2 B
2020 1 C
2 D
2018 1 E
2 F
Kyoto 2019 1 G
2 H
2020 1 I
2 J
2018 1 K
2 L
Perform a search and measure the speed.

Processing speed including csv reading

%%timeit
df3 = pd.read_csv("sample.csv", index_col=["city","year","number"])
df3.loc[("Tokyo",2019,1)]

Execution result: 2.64 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When csv is read in advance

%timeit df3.loc[("Tokyo",2019,1)]

Execution result: 151 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Although there is no big difference in the execution time including csv reading, the search-only execution is about 3.8 times faster than Method 2.
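One caveat with MultiIndex lookups: full-key lookups like `df3.loc[("Tokyo",2019,1)]` work on an unsorted index, but partial indexing and slicing need a lexsorted index to be efficient, so sorting once after loading is cheap insurance. A sketch with inline data in place of sample.csv:

```python
import pandas as pd

# Inline copy of sample.csv, indexed by the three key columns
df3 = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
}).set_index(["city", "year", "number"])

# sort_index() lexsorts the MultiIndex; full-key lookups work either way,
# but partial selections and slices rely on a sorted index
df3 = df3.sort_index()

row = df3.loc[("Tokyo", 2019, 1)]   # full-key lookup, as in the article
partial = df3.loc["Tokyo"]          # partial indexing: all Tokyo rows

assert row["dividion"] == "A"
assert len(partial) == 6
```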

Summary

The results so far are summarized in the table below.

                       Method 1   Method 2   Method 3
Including csv reading  2.65 ms    2.5 ms     2.64 ms
Search only            1.44 ms    569 µs     151 µs

Method 2, which I had been using, was slightly faster when csv reading is included, but once the data is loaded, MultiIndex is the fastest. Moreover, Method 2 makes the code longer and adds an extra column to the DataFrame, so it seems better to simply use MultiIndex.
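To reproduce the search-only comparison outside Jupyter, the three lookups can be timed with the standard timeit module; a sketch with inline data in place of sample.csv (absolute numbers will differ by machine):

```python
import timeit
import pandas as pd

# Inline copy of sample.csv
df = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# Method 2 preprocessing: compound string key
df2 = df.copy()
df2["key"] = df2["city"].astype(str) + df2["year"].astype(str) + df2["number"].astype(str)

# Method 3 preprocessing: MultiIndex, sorted once
df3 = df.set_index(["city", "year", "number"]).sort_index()

methods = {
    "method1": lambda: df[(df["city"] == "Tokyo") & (df["year"] == 2019) & (df["number"] == 1)].iloc[0],
    "method2": lambda: df2[df2["key"] == "Tokyo20191"].iloc[0],
    "method3": lambda: df3.loc[("Tokyo", 2019, 1)],
}

for name, fn in methods.items():
    total = timeit.timeit(fn, number=1000)       # total seconds for 1000 calls
    print(f"{name}: {total * 1000:.3f} µs per call")
```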

reference

About MultiIndex of pandas
Measure code speed with Jupyter