[PYTHON] About MultiIndex of pandas

This article is the 19th day article of Python Part 2 Advent Calendar 2015.

I tried to register for Advent Calendar, but recently I haven't written such Python & even if I write it, I can't write it because I've done too many special things in business, so I'm a little reminded now, but it's a small story of pandas. I will write about pandas, which is familiar in data preprocessing and aggregation, and Multiindex, which I know but is a little difficult to understand.

Environment at the time of writing

The code that appears below is

import pandas as pd

It is written on the premise that it is done. Also, the pasted table is a copy of the to_html version, so it is slightly different from the display on the jupyter notebook.

What is MultiIndex

MultiIndex / Advanced Indexing A function to index multiple columns with DataFrame of pandas. This is useful when data can be uniquely identified by using multiple values as keys.

Data preparation and review of ordinary indexes

Ordinary indexes are referred to in this article as Single Index for convenience in contrast to Multi Index. (There is no such type)

Data preparation

Prepare the following csv for explanation

sample.csv


name,year,product,expected,actual
Taro,2013,A,116,119
Taro,2013,B,131,149
Taro,2014,A,151,124
Taro,2014,B,125,107
Taro,2015,A,135,109
Taro,2015,B,142,148
Hanako,2013,A,170,130
Hanako,2013,B,113,190
Hanako,2014,A,102,142
Hanako,2014,B,183,125
Hanako,2015,A,169,120
Hanako,2015,B,134,199

Read this with pandas.

df1 = pd.read_csv('sample.csv')
name year product expected actual
0 Taro 2013 A 116 119
1 Taro 2013 B 131 149
2 Taro 2014 A 151 124
3 Taro 2014 B 125 107
4 Taro 2015 A 135 109
5 Taro 2015 B 142 148
6 Hanako 2013 A 170 130
7 Hanako 2013 B 113 190
8 Hanako 2014 A 102 142
9 Hanako 2014 B 183 125
10 Hanako 2015 A 169 120
11 Hanako 2015 B 134 199

Which of the row labels in this table, 0,1,2 ..., is data.index and the column labels"name", "year", ...is data.columns? Also has a data type of pandas.Index.

Because pandas.Index has a member called name

df1.index.name = "id"
df1.columns.name = "header"

Then

header name year product expected actual
id
0 Taro 2013 A 116 119
1 Taro 2013 B 131 149
2 Taro 2014 A 151 124
3 Taro 2014 B 125 107
4 Taro 2015 A 135 109
5 Taro 2015 B 142 148
6 Hanako 2013 A 170 130
7 Hanako 2013 B 113 190
8 Hanako 2014 A 102 142
9 Hanako 2014 B 183 125
10 Hanako 2015 A 169 120
11 Hanako 2015 B 134 199
It will be such a table.

Data selection

df1["name"]

output


id
0       Taro
1       Taro
2       Taro
3       Taro
4       Taro
5       Taro
6     Hanako
7     Hanako
8     Hanako
9     Hanako
10    Hanako
11    Hanako
Name: name, dtype: object
df1.loc[3]

output


header
name        Taro
year        2014
product        B
expected     125
actual       107
Name: 3, dtype: object
df1.iloc[3]

This time index is a row index, so the output is the same.

df1.ix[3:5,"product":"actual"]

output

product expected actual
3 B 125 107
4 A 135 109
5 B 142 148
df1[(df1["name"] == "Hanako")&(df1["product"] == "A")]

output

header name year product expected actual
id
6 Hanako 2013 A 170 130
8 Hanako 2014 A 102 142
10 Hanako 2015 A 169 120

See other articles and books for details.

MultiIndex settings

The main subject is from here. As you can see, the data read earlier

In such a case, use MultiIndex to organize the data neatly.

df2 = pd.read_csv("sample.csv",index_col=["name","year","product"])
expected actual
name year product
Taro 2013 A 116 119
B 131 149
2014 A 151 124
B 125 107
2015 A 135 109
B 142 148
Hanako 2013 A 170 130
B 113 190
2014 A 102 142
B 183 125
2015 A 169 120
B 134 199

After reading the data, or when applying MultiIndex to the DataFrame created by various processing 2016/3/18 postscript </ font> If you use the set_index function, the specified column will be indexed. (I didn't know this because it wasn't listed on the MultiIndex page of the official document ...)

df2 = df1.set_index(["name","year","product"], drop=True)

If drop = True, the indexed column will be deleted. When returning

df1 = df2.reset_index()

To End of addition </ font>

pd.MultiIndex.from_arrays()
pd.MultiIndex.from_product()
pd.MultiIndex.from_tuples()

If you use around, it will be manageable. By the way, pd.MultiIndex.from_product creates an index for nested loops when passing multiple collections (same as itertools.product). For example, to create the same index as df2

pd.MultiIndex.from_product((["Taro","Hanako"],[2013,2014,2015],["A","B"]),\
    names=["name","year","product"])

You can do it.

Data selection

When selecting a row, pass the tuple of the index.

df2.loc[("Taro",2013,"A")]

output


expected    116
actual      119
Name: (Taro, 2013, A), dtype: int64

If you pass a tuple halfway, the table will be returned, but since PerformanceWarning will be issued for the specification up to the level in the middle, it is better not to use it except when you want to look inside temporarily. There is no problem if you specify the first level (name in this case) or all levels.

df2.loc[("Taro",2013)]

Output (PerformanceWarning)

expected actual
product
A 116 119
B 131 149

Use xs to specify up to the middle level or only in the middle

df2.xs(["Taro","A"],level=[0,2])

output

expected actual
year
2013 116 119
2014 151 124
2015 135 109

Warning is no longer displayed, but even if you write the same process as above, this one is a little slower ...

Usage of MultiIndex

Since pandas.Index is not pandas.Series, there are quite a few things that can't be used like a normal row, and it's sober and stressful. Then I would like to say that it is okay to keep Single Index and Series, but since there is a lot of effort, think about where to use it.

Search for uniquely determined data using multiple columns as keys

As I wrote earlier, this is probably the main use.

df2.loc[("Taro",2013,"A")]

If you try to do the same with Single Index,

df1[(df1["name"] == "Taro")&(df1["year"] == 2013)&(df1["product"] == "A")].iloc[0]

It feels like it's dirty (maybe there is a better way to write it). Of course, there is also a difference in speed. When I tried using about 50,000 data, I found that there was only a 2-3 times difference in one trial. I tried to check with %% timeit, but `` df2` seems to be cached and I could not compare.

Also, if you are not sure if there is data corresponding to the specified key, MultiIndex

("Taro",2013,"A") in df2.index

But with Single Index

df1[(df1["name"] == "Taro")&(df1["year"] == 2013)&(df1["product"] == "A")].shape[0]>0

I have to do it (maybe there is a better way to write it)

For display

Since the output of MultiIndex with nested key indexes is highly visible to humans, it is possible to use MultiIndex only for output. DataFrame can output in many formats such as html, excel, latex, etc., and even if it is Excel, it will combine cells properly according to MultiIndex, so it is convenient. By the way, with df.to_csv, even if it is MultiIndex, it will output all lines properly.

Other operation methods, etc.

  • Use df.index.get_level_values to retrieve one level of index
  • Use df.reset_index () to return MultiIndex to SingleIndex. A column for each level is added to create a DataFrame whose serial number is index.
  • In addition, there are slices etc., but I do not know how to use it, so I will take another opportunity

Recommended Posts