[PYTHON] Speed comparison when searching with multiple keys in pandas (MultiIndex vs Others)

Introduction

MultiIndex is useful for extracting rows by multiple keys, but I had been searching with multiple keys without knowing it existed, so I compared the execution speed with and without MultiIndex.

Execution environment

Code was executed and timed in Jupyter Notebook. The code below assumes that

import pandas as pd

has already been run.

Data preparation

sample.csv


city,year,number,dividion
Tokyo,2019,1,A
Tokyo,2019,2,B
Tokyo,2020,1,C
Tokyo,2020,2,D
Tokyo,2018,1,E
Tokyo,2018,2,F
Kyoto,2019,1,G
Kyoto,2019,2,H
Kyoto,2020,1,I
Kyoto,2020,2,J
Kyoto,2018,1,K
Kyoto,2018,2,L

Load sample.csv into a DataFrame.

df1 = pd.read_csv("sample.csv")

It is read in as follows.

city year number dividion
0 Tokyo 2019 1 A
1 Tokyo 2019 2 B
2 Tokyo 2020 1 C
3 Tokyo 2020 2 D
4 Tokyo 2018 1 E
5 Tokyo 2018 2 F
6 Kyoto 2019 1 G
7 Kyoto 2019 2 H
8 Kyoto 2020 1 I
9 Kyoto 2020 2 J
10 Kyoto 2018 1 K
11 Kyoto 2018 2 L
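A quick way to confirm that the (city, year, number) triple uniquely identifies each row is `DataFrame.duplicated` with a column subset; a minimal sketch (the data is built inline here so the snippet runs without sample.csv):

```python
import pandas as pd

# Inline copy of sample.csv so the check runs without the file
df = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# No duplicated (city, year, number) triple means the three columns form a unique key
assert not df.duplicated(subset=["city", "year", "number"]).any()
```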
In this table, dividion is uniquely determined by the three keys: city, year, and number.

Method 1: Specify multiple conditions without any special preparation

Processing speed including csv reading
%%timeit
df1 = pd.read_csv("sample.csv")
df1[(df1["city"] == "Tokyo")&(df1["year"] == 2019)&(df1["number"] == 1)].iloc[0]

Execution result: 2.65 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When csv is read in advance

%timeit df1[(df1["city"] == "Tokyo")&(df1["year"] == 2019)&(df1["number"] == 1)].iloc[0]

Execution result: 1.44 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 1 took a considerable amount of time when processing large amounts of data, so I had been using Method 2 instead.
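As an aside, the multi-condition mask in Method 1 can also be written with `DataFrame.query`, which some find more readable; a sketch using inline data in place of sample.csv (not part of the original measurements):

```python
import pandas as pd

# Inline copy of sample.csv so the snippet runs without the file
df1 = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# Method 1 as a boolean mask
row = df1[(df1["city"] == "Tokyo") & (df1["year"] == 2019) & (df1["number"] == 1)].iloc[0]

# The same filter written with query(); the result is identical
row_q = df1.query("city == 'Tokyo' and year == 2019 and number == 1").iloc[0]

assert row["dividion"] == row_q["dividion"] == "A"
```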

Method 2: Add and search compound key columns

Create a unique key by concatenating the three values city, year, and number as strings.

df2 = pd.read_csv("sample.csv")
# Extract the city, year, and number columns as lists (each element converted to str)
cities = list(map(str, df2["city"].values.tolist()))
years = list(map(str, df2["year"].values.tolist()))
numbers = list(map(str, df2["number"].values.tolist()))
# Concatenate the three strings to generate a unique key
keys = [city+year+number for city, year, number in zip(cities, years, numbers)]
df2["key"] = keys

This produces the following DataFrame.

city year number dividion key
0 Tokyo 2019 1 A Tokyo20191
1 Tokyo 2019 2 B Tokyo20192
2 Tokyo 2020 1 C Tokyo20201
3 Tokyo 2020 2 D Tokyo20202
4 Tokyo 2018 1 E Tokyo20181
5 Tokyo 2018 2 F Tokyo20182
6 Kyoto 2019 1 G Kyoto20191
7 Kyoto 2019 2 H Kyoto20192
8 Kyoto 2020 1 I Kyoto20201
9 Kyoto 2020 2 J Kyoto20202
10 Kyoto 2018 1 K Kyoto20181
11 Kyoto 2018 2 L Kyoto20182
The speed is measured in the same way as in Method 1.

Processing speed including csv reading

%%timeit
df2 = pd.read_csv("sample.csv")
cities = list(map(str, df2["city"].values.tolist()))
years = list(map(str, df2["year"].values.tolist()))
numbers = list(map(str, df2["number"].values.tolist()))
keys = [city+year+number for city, year, number in zip(cities, years, numbers)]
df2["key"] = keys
df2[df2["key"] == "Tokyo20191"].iloc[0]

Execution result: 2.5 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When csv is read and preprocessed

%timeit df2[df2["key"] == "Tokyo20191"].iloc[0]

Execution result: 569 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can see that the execution time including csv reading barely changes, but the search-only execution time improves to the microsecond order.
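The key column in Method 2 can also be built with pandas' vectorized string operations instead of the list comprehensions above; a sketch with inline data in place of sample.csv (the resulting keys are identical):

```python
import pandas as pd

# Inline copy of sample.csv so the snippet runs without the file
df2 = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# Vectorized concatenation: astype(str) plus + avoids the Python-level loop
df2["key"] = df2["city"].astype(str) + df2["year"].astype(str) + df2["number"].astype(str)

print(df2[df2["key"] == "Tokyo20191"].iloc[0]["dividion"])  # prints A
```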

Method 3: When using MultiIndex

Having just learned about MultiIndex, let's check whether it is faster than Method 2. For more information on MultiIndex, see the official documentation.

If you specify city, year, and number as the index, the file is read as follows.

df3 = pd.read_csv("sample.csv", index_col=["city","year","number"])

dividion
city year number
Tokyo 2019 1 A
2 B
2020 1 C
2 D
2018 1 E
2 F
Kyoto 2019 1 G
2 H
2020 1 I
2 J
2018 1 K
2 L
Perform a search and measure the speed.

Processing speed including csv reading

%%timeit
df3 = pd.read_csv("sample.csv", index_col=["city","year","number"])
df3.loc[("Tokyo",2019,1)]

Execution result: 2.64 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When csv is read in advance

%timeit df3.loc[("Tokyo",2019,1)]

Execution result: 151 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Although there is no big difference in the execution time including csv reading, the search-only execution is about 3.8 times faster than Method 2.
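One caveat with MultiIndex lookups: full-key lookups like `df3.loc[("Tokyo",2019,1)]` work on an unsorted index, but partial indexing and slicing need a lexsorted index to be efficient, so sorting once after loading is cheap insurance. A sketch with inline data in place of sample.csv:

```python
import pandas as pd

# Inline copy of sample.csv, indexed by the three key columns
df3 = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
}).set_index(["city", "year", "number"])

# sort_index() lexsorts the MultiIndex; full-key lookups work either way,
# but partial selections and slices rely on a sorted index
df3 = df3.sort_index()

row = df3.loc[("Tokyo", 2019, 1)]   # full-key lookup, as in the article
partial = df3.loc["Tokyo"]          # partial indexing: all Tokyo rows

assert row["dividion"] == "A"
assert len(partial) == 6
```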

Summary

The results so far are summarized in the table below.

                       Method 1   Method 2   Method 3
Including csv reading  2.65 ms    2.5 ms     2.64 ms
Search only            1.44 ms    569 µs     151 µs

Method 2, which I had been using, was slightly faster when csv reading is included, but once the data is loaded, MultiIndex is the fastest. Moreover, Method 2 makes the code longer and adds an extra column to the DataFrame, so it seems better to simply use MultiIndex.
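To reproduce the search-only comparison outside Jupyter, the three lookups can be timed with the standard timeit module; a sketch with inline data in place of sample.csv (absolute numbers will differ by machine):

```python
import timeit
import pandas as pd

# Inline copy of sample.csv
df = pd.DataFrame({
    "city": ["Tokyo"] * 6 + ["Kyoto"] * 6,
    "year": [2019, 2019, 2020, 2020, 2018, 2018] * 2,
    "number": [1, 2] * 6,
    "dividion": list("ABCDEFGHIJKL"),
})

# Method 2 preprocessing: compound string key
df2 = df.copy()
df2["key"] = df2["city"].astype(str) + df2["year"].astype(str) + df2["number"].astype(str)

# Method 3 preprocessing: MultiIndex, sorted once
df3 = df.set_index(["city", "year", "number"]).sort_index()

methods = {
    "method1": lambda: df[(df["city"] == "Tokyo") & (df["year"] == 2019) & (df["number"] == 1)].iloc[0],
    "method2": lambda: df2[df2["key"] == "Tokyo20191"].iloc[0],
    "method3": lambda: df3.loc[("Tokyo", 2019, 1)],
}

for name, fn in methods.items():
    total = timeit.timeit(fn, number=1000)       # total seconds for 1000 calls
    print(f"{name}: {total * 1000:.3f} µs per call")
```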

reference

About MultiIndex of pandas
Measure code speed with Jupyter