MultiIndex is useful for extracting rows by multiple keys, but I had been searching by multiple keys without knowing it existed, so let's compare execution speed with and without MultiIndex.
Code execution and speed measurement were done in Jupyter Notebook. All of the code below assumes that

import pandas as pd

has already been run.
sample.csv
city,year,number,dividion
Tokyo,2019,1,A
Tokyo,2019,2,B
Tokyo,2020,1,C
Tokyo,2020,2,D
Tokyo,2018,1,E
Tokyo,2018,2,F
Kyoto,2019,1,G
Kyoto,2019,2,H
Kyoto,2020,1,I
Kyoto,2020,2,J
Kyoto,2018,1,K
Kyoto,2018,2,L
First, read sample.csv into a DataFrame.
df1 = pd.read_csv("sample.csv")
It is read in as follows.
 | city | year | number | dividion |
---|---|---|---|---|
0 | Tokyo | 2019 | 1 | A |
1 | Tokyo | 2019 | 2 | B |
2 | Tokyo | 2020 | 1 | C |
3 | Tokyo | 2020 | 2 | D |
4 | Tokyo | 2018 | 1 | E |
5 | Tokyo | 2018 | 2 | F |
6 | Kyoto | 2019 | 1 | G |
7 | Kyoto | 2019 | 2 | H |
8 | Kyoto | 2020 | 1 | I |
9 | Kyoto | 2020 | 2 | J |
10 | Kyoto | 2018 | 1 | K |
11 | Kyoto | 2018 | 2 | L |
Method 1 filters with boolean conditions on the three columns. Measuring it including the CSV read:

%%timeit
df1 = pd.read_csv("sample.csv")
df1[(df1["city"] == "Tokyo")&(df1["year"] == 2019)&(df1["number"] == 1)].iloc[0]

**Execution result** 2.65 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Measuring the search alone, with the CSV already loaded:

%timeit df1[(df1["city"] == "Tokyo")&(df1["year"] == 2019)&(df1["number"] == 1)].iloc[0]

**Execution result** 1.44 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
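Method 1's boolean filter can be sketched self-contained; here a small inline DataFrame stands in for sample.csv (an assumption made only so the snippet runs on its own):

```python
import pandas as pd

# Inline stand-in for a few rows of sample.csv
df1 = pd.DataFrame({
    "city": ["Tokyo", "Tokyo", "Kyoto"],
    "year": [2019, 2020, 2019],
    "number": [1, 1, 1],
    "dividion": ["A", "C", "G"],
})
# Each comparison yields a boolean Series; & combines them element-wise.
# Parentheses around each condition are required because & binds tighter than ==.
match = df1[(df1["city"] == "Tokyo") & (df1["year"] == 2019) & (df1["number"] == 1)]
print(match.iloc[0]["dividion"])  # → A
```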
Method 1 took a considerable amount of time when processing large data, so I had been using Method 2 instead: create a unique key by concatenating the three elements city, year, and number as strings.
df2 = pd.read_csv("sample.csv")
# Extract the city, year, and number columns as lists (elements converted to str)
cities = list(map(str, df2["city"].values.tolist()))
years = list(map(str, df2["year"].values.tolist()))
numbers = list(map(str, df2["number"].values.tolist()))
# Concatenate the three strings to generate a unique key
keys = [city+year+number for city, year, number in zip(cities, years, numbers)]
df2["key"] = keys
This produces the following DataFrame.
 | city | year | number | dividion | key |
---|---|---|---|---|---|
0 | Tokyo | 2019 | 1 | A | Tokyo20191 |
1 | Tokyo | 2019 | 2 | B | Tokyo20192 |
2 | Tokyo | 2020 | 1 | C | Tokyo20201 |
3 | Tokyo | 2020 | 2 | D | Tokyo20202 |
4 | Tokyo | 2018 | 1 | E | Tokyo20181 |
5 | Tokyo | 2018 | 2 | F | Tokyo20182 |
6 | Kyoto | 2019 | 1 | G | Kyoto20191 |
7 | Kyoto | 2019 | 2 | H | Kyoto20192 |
8 | Kyoto | 2020 | 1 | I | Kyoto20201 |
9 | Kyoto | 2020 | 2 | J | Kyoto20202 |
10 | Kyoto | 2018 | 1 | K | Kyoto20181 |
11 | Kyoto | 2018 | 2 | L | Kyoto20182 |
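As an aside, the same key column can also be built without Python-level loops, using pandas' vectorized string operations (a sketch; a small inline DataFrame stands in for sample.csv):

```python
import pandas as pd

# Inline stand-in for a few rows of sample.csv
df2 = pd.DataFrame({
    "city": ["Tokyo", "Tokyo", "Kyoto"],
    "year": [2019, 2019, 2019],
    "number": [1, 2, 1],
    "dividion": ["A", "B", "G"],
})
# astype(str) converts the numeric columns; + concatenates element-wise
df2["key"] = df2["city"] + df2["year"].astype(str) + df2["number"].astype(str)
print(df2["key"].tolist())  # → ['Tokyo20191', 'Tokyo20192', 'Kyoto20191']
```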
Measuring Method 2 including the CSV read:

%%timeit
df2 = pd.read_csv("sample.csv")
cities = list(map(str, df2["city"].values.tolist()))
years = list(map(str, df2["year"].values.tolist()))
numbers = list(map(str, df2["number"].values.tolist()))
keys = [city+year+number for city, year, number in zip(cities, years, numbers)]
df2["key"] = keys
df2[df2["key"] == "Tokyo20191"].iloc[0]
**Execution result** 2.5 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df2[df2["key"] == "Tokyo20191"].iloc[0]

**Execution result** 569 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can see that the time including the CSV read barely changes, but the search-only time drops to the microsecond order.
Method 3 uses MultiIndex, which I have only just learned about, so let's examine whether it is faster than Method 2. For details on MultiIndex, see the official documentation.
If you specify city, year, and number as the index when reading, the data is loaded as follows.
df3 = pd.read_csv("sample.csv",index_col=["city","year","number"])
 | | | dividion |
---|---|---|---|
city | year | number | |
Tokyo | 2019 | 1 | A |
 | | 2 | B |
 | 2020 | 1 | C |
 | | 2 | D |
 | 2018 | 1 | E |
 | | 2 | F |
Kyoto | 2019 | 1 | G |
 | | 2 | H |
 | 2020 | 1 | I |
 | | 2 | J |
 | 2018 | 1 | K |
 | | 2 | L |
Measuring Method 3 including the CSV read:

%%timeit
df3 = pd.read_csv("sample.csv",index_col=["city","year","number"])
df3.loc[("Tokyo",2019,1)]

**Execution result** 2.64 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df3.loc[("Tokyo",2019,1)]

**Execution result** 151 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Again there is no big difference when the CSV read is included, but with the data already loaded, Method 3 is about 3.8 times faster than Method 2.
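Beyond speed, MultiIndex also supports partial lookups using only the leading levels of the key. A sketch (inline data standing in for sample.csv; the index is sorted, which pandas recommends for efficient MultiIndex lookups):

```python
import pandas as pd

# Inline stand-in for part of sample.csv, with the three-level index set
df3 = pd.DataFrame({
    "city": ["Tokyo", "Tokyo", "Kyoto", "Kyoto"],
    "year": [2019, 2019, 2019, 2019],
    "number": [1, 2, 1, 2],
    "dividion": ["A", "B", "G", "H"],
}).set_index(["city", "year", "number"]).sort_index()

# Full-key lookup: tuple of all three levels plus the column name
print(df3.loc[("Tokyo", 2019, 1), "dividion"])  # → A

# Partial lookup with the leading levels returns all matching rows
print(df3.loc[("Tokyo", 2019)])
```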
The results so far are summarized in the table below.
 | Method 1 | Method 2 | Method 3 |
---|---|---|---|
Including csv reading | 2.65 ms | 2.5 ms | 2.64 ms |
Search only | 1.44 ms | 569 µs | 151 µs |
Method 2, which I had been using, was slightly faster when the CSV read is included, but once the data is loaded, MultiIndex is faster. Moreover, Method 2 makes the code longer and adds an extra column to the DataFrame, so in general it seems better to use MultiIndex.
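If you want to run the same comparison outside Jupyter, where `%timeit` is unavailable, the standard `timeit` module works too. A minimal sketch with synthetic data (the data shape and loop count are assumptions, not the original measurement):

```python
import timeit

import pandas as pd

# Synthetic data: 100 rows, each with a unique (city, year, number) combination
df = pd.DataFrame({
    "city": ["Tokyo", "Kyoto"] * 50,
    "year": [2019, 2020] * 50,
    "number": list(range(100)),
})
df["key"] = df["city"] + df["year"].astype(str) + df["number"].astype(str)
dfm = df.set_index(["city", "year", "number"]).sort_index()

# Time Method 2 (string key) vs Method 3 (MultiIndex), search only
t_key = timeit.timeit(lambda: df[df["key"] == "Tokyo20190"].iloc[0], number=1000)
t_multi = timeit.timeit(lambda: dfm.loc[("Tokyo", 2019, 0)], number=1000)
print(f"string key: {t_key:.4f}s, MultiIndex: {t_multi:.4f}s")
```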
References
About MultiIndex of pandas
Measure code speed with jupyter