This article is the 19th day article of Python Part 2 Advent Calendar 2015.
I tried to register for Advent Calendar, but recently I haven't written such Python & even if I write it, I can't write it because I've done too many special things in business, so I'm a little reminded now, but it's a small story of pandas. I will write about pandas, which is familiar in data preprocessing and aggregation, and Multiindex, which I know but is a little difficult to understand.
The code that appears below is
import pandas as pd
It is written on the premise that it is done. Also, the pasted table is a copy of the to_html
version, so it is slightly different from the display on the jupyter notebook.
MultiIndex / Advanced Indexing A function to index multiple columns with DataFrame of pandas. This is useful when data can be uniquely identified by using multiple values as keys.
Ordinary indexes are referred to in this article as Single Index for convenience in contrast to Multi Index. (There is no such type)
Prepare the following csv for explanation
sample.csv
name,year,product,expected,actual
Taro,2013,A,116,119
Taro,2013,B,131,149
Taro,2014,A,151,124
Taro,2014,B,125,107
Taro,2015,A,135,109
Taro,2015,B,142,148
Hanako,2013,A,170,130
Hanako,2013,B,113,190
Hanako,2014,A,102,142
Hanako,2014,B,183,125
Hanako,2015,A,169,120
Hanako,2015,B,134,199
Read this with pandas.
df1 = pd.read_csv('sample.csv')
name | year | product | expected | actual | |
---|---|---|---|---|---|
0 | Taro | 2013 | A | 116 | 119 |
1 | Taro | 2013 | B | 131 | 149 |
2 | Taro | 2014 | A | 151 | 124 |
3 | Taro | 2014 | B | 125 | 107 |
4 | Taro | 2015 | A | 135 | 109 |
5 | Taro | 2015 | B | 142 | 148 |
6 | Hanako | 2013 | A | 170 | 130 |
7 | Hanako | 2013 | B | 113 | 190 |
8 | Hanako | 2014 | A | 102 | 142 |
9 | Hanako | 2014 | B | 183 | 125 |
10 | Hanako | 2015 | A | 169 | 120 |
11 | Hanako | 2015 | B | 134 | 199 |
Which of the row labels in this table, 0,1,2 ...
, is data.index
and the column labels"name", "year", ...
is data.columns
? Also has a data type of pandas.Index
.
Because pandas.Index
has a member called name
df1.index.name = "id"
df1.columns.name = "header"
Then
header | name | year | product | expected | actual |
---|---|---|---|---|---|
id | |||||
0 | Taro | 2013 | A | 116 | 119 |
1 | Taro | 2013 | B | 131 | 149 |
2 | Taro | 2014 | A | 151 | 124 |
3 | Taro | 2014 | B | 125 | 107 |
4 | Taro | 2015 | A | 135 | 109 |
5 | Taro | 2015 | B | 142 | 148 |
6 | Hanako | 2013 | A | 170 | 130 |
7 | Hanako | 2013 | B | 113 | 190 |
8 | Hanako | 2014 | A | 102 | 142 |
9 | Hanako | 2014 | B | 183 | 125 |
10 | Hanako | 2015 | A | 169 | 120 |
11 | Hanako | 2015 | B | 134 | 199 |
df1["name"]
output
id
0 Taro
1 Taro
2 Taro
3 Taro
4 Taro
5 Taro
6 Hanako
7 Hanako
8 Hanako
9 Hanako
10 Hanako
11 Hanako
Name: name, dtype: object
df1.loc[3]
output
header
name Taro
year 2014
product B
expected 125
actual 107
Name: 3, dtype: object
df1.iloc[3]
This time index is a row index, so the output is the same.
df1.ix[3:5,"product":"actual"]
output
product | expected | actual | |
---|---|---|---|
3 | B | 125 | 107 |
4 | A | 135 | 109 |
5 | B | 142 | 148 |
df1[(df1["name"] == "Hanako")&(df1["product"] == "A")]
output
header | name | year | product | expected | actual |
---|---|---|---|---|---|
id | |||||
6 | Hanako | 2013 | A | 170 | 130 |
8 | Hanako | 2014 | A | 102 | 142 |
10 | Hanako | 2015 | A | 169 | 120 |
See other articles and books for details.
The main subject is from here. As you can see, the data read earlier
In such a case, use MultiIndex to organize the data neatly.
df2 = pd.read_csv("sample.csv",index_col=["name","year","product"])
expected | actual | |||
---|---|---|---|---|
name | year | product | ||
Taro | 2013 | A | 116 | 119 |
B | 131 | 149 | ||
2014 | A | 151 | 124 | |
B | 125 | 107 | ||
2015 | A | 135 | 109 | |
B | 142 | 148 | ||
Hanako | 2013 | A | 170 | 130 |
B | 113 | 190 | ||
2014 | A | 102 | 142 | |
B | 183 | 125 | ||
2015 | A | 169 | 120 | |
B | 134 | 199 |
After reading the data, or when applying MultiIndex to the DataFrame created by various processing
2016/3/18 postscript </ font>
If you use the set_index
function, the specified column will be indexed. (I didn't know this because it wasn't listed on the MultiIndex page of the official document ...)
df2 = df1.set_index(["name","year","product"], drop=True)
If drop = True
, the indexed column will be deleted.
When returning
df1 = df2.reset_index()
To End of addition </ font>
pd.MultiIndex.from_arrays()
pd.MultiIndex.from_product()
pd.MultiIndex.from_tuples()
If you use around, it will be manageable. By the way, pd.MultiIndex.from_product
creates an index for nested loops when passing multiple collections (same as itertools.product). For example, to create the same index as df2
pd.MultiIndex.from_product((["Taro","Hanako"],[2013,2014,2015],["A","B"]),\
names=["name","year","product"])
You can do it.
When selecting a row, pass the tuple of the index.
df2.loc[("Taro",2013,"A")]
output
expected 116
actual 119
Name: (Taro, 2013, A), dtype: int64
If you pass a tuple halfway, the table will be returned, but since PerformanceWarning will be issued for the specification up to the level in the middle, it is better not to use it except when you want to look inside temporarily. There is no problem if you specify the first level (name in this case) or all levels.
df2.loc[("Taro",2013)]
Output (PerformanceWarning)
expected | actual | |
---|---|---|
product | ||
A | 116 | 119 |
B | 131 | 149 |
Use xs
to specify up to the middle level or only in the middle
df2.xs(["Taro","A"],level=[0,2])
output
expected | actual | |
---|---|---|
year | ||
2013 | 116 | 119 |
2014 | 151 | 124 |
2015 | 135 | 109 |
Warning is no longer displayed, but even if you write the same process as above, this one is a little slower ...
Since pandas.Index is not pandas.Series, there are quite a few things that can't be used like a normal row, and it's sober and stressful. Then I would like to say that it is okay to keep Single Index and Series, but since there is a lot of effort, think about where to use it.
As I wrote earlier, this is probably the main use.
df2.loc[("Taro",2013,"A")]
If you try to do the same with Single Index,
df1[(df1["name"] == "Taro")&(df1["year"] == 2013)&(df1["product"] == "A")].iloc[0]
It feels like it's dirty (maybe there is a better way to write it).
Of course, there is also a difference in speed. When I tried using about 50,000 data, I found that there was only a 2-3 times difference in one trial. I tried to check with %% timeit
, but `` df2` seems to be cached and I could not compare.
Also, if you are not sure if there is data corresponding to the specified key, MultiIndex
("Taro",2013,"A") in df2.index
But with Single Index
df1[(df1["name"] == "Taro")&(df1["year"] == 2013)&(df1["product"] == "A")].shape[0]>0
I have to do it (maybe there is a better way to write it)
Since the output of MultiIndex with nested key indexes is highly visible to humans, it is possible to use MultiIndex only for output.
DataFrame can output in many formats such as html, excel, latex, etc., and even if it is Excel, it will combine cells properly according to MultiIndex, so it is convenient.
By the way, with df.to_csv
, even if it is MultiIndex, it will output all lines properly.
Recommended Posts