Click here for details ↓ https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html According to the document, it is a hierarchical index and can be used when manipulating high-dimensional data (roughly).
Panel that handles 3D data structures in pandas has been abolished, so check how to handle it here. Since it is a personal memorandum, it is not exhaustive at all. Therefore, if you have any questions, you have to investigate it as appropriate.
python 3.7.7 pandas 1.1.1 Execution environment: jupyter notebook
As expected, research data cannot be used, so sample data will be created appropriately.
#pandas import
import pandas as pd
#Data creation
array =[['cat','cat','cat','cat','cat','dog','dog','dog','bird','bird'],
        ['black','white','white','brown','brown','black','white','brown','white','yellow'],
        ['male','male','female','male','female','female','female','male','male','female'],
        [3,5,5,5,4,7,8,10,1,2], 
        [20,25,26,30,22,77,55,90,10,15], 
        [3.6,4.5,4.0,5.2,3.0,15.6,10.5,20.1,0.52,0.89]]
#Convert to DataFrame.Transpose rows and columns with T
df = pd.DataFrame(array).T
#column settings
df.columns = ['animal','color','male or female','age','height','weight']
Like this. Since the content is created implicitly, there seems to be a very obese cat.
| animal | color | male or female | age | height | weight | |
|---|---|---|---|---|---|---|
| 0 | cat | black | male | 3 | 20 | 3.6 | 
| 1 | cat | white | male | 5 | 25 | 4.5 | 
| 2 | cat | white | female | 5 | 26 | 4 | 
| 3 | cat | brown | male | 5 | 30 | 5.2 | 
| 4 | cat | brown | female | 4 | 22 | 3 | 
| 5 | dog | black | female | 7 | 77 | 15.6 | 
| 6 | dog | white | female | 8 | 55 | 10.5 | 
| 7 | dog | brown | male | 10 | 90 | 20.1 | 
| 8 | bird | white | male | 1 | 10 | 0.52 | 
| 9 | bird | yellow | female | 2 | 15 | 0.89 | 
Digression: You can use .to_markdown () to create a markdown notation table. It seems that it was implemented from 1.0.0 or later of pandas? However, in the end, .to_html () is indebted.
index How to set and reset the index
The Column name specified by set_index is the index of the level starting from 0 from animal. level→0 : animal, 1 : color, 2 : male or female
df2 = df.set_index(['animal','color','male or female'])
| age | height | weight | |||
|---|---|---|---|---|---|
| animal | color | male or female | |||
| cat | black | male | 3 | 20 | 3.6 | 
| white | male | 5 | 25 | 4.5 | |
| female | 5 | 26 | 4 | ||
| brown | male | 5 | 30 | 5.2 | |
| female | 4 | 22 | 3 | ||
| dog | black | female | 7 | 77 | 15.6 | 
| white | female | 8 | 55 | 10.5 | |
| brown | male | 10 | 90 | 20.1 | |
| bird | white | male | 1 | 10 | 0.52 | 
| yellow | female | 2 | 15 | 0.89 | 
By the way, the index levels can be set in any order you like.
df3 = df.set_index(['animal','male or female','color'])
| age | height | weight | |||
|---|---|---|---|---|---|
| animal | male or female | color | |||
| cat | male | black | 3 | 20 | 3.6 | 
| white | 5 | 25 | 4.5 | ||
| female | white | 5 | 26 | 4 | |
| male | brown | 5 | 30 | 5.2 | |
| female | brown | 4 | 22 | 3 | |
| dog | female | black | 7 | 77 | 15.6 | 
| white | 8 | 55 | 10.5 | ||
| male | brown | 10 | 90 | 20.1 | |
| bird | male | white | 1 | 10 | 0.52 | 
| female | yellow | 2 | 15 | 0.89 | 
Comparing with the above table, it can be confirmed that the values of the first and second levels are interchanged.
As a side note, .set_index () has a parameter drop, which is True by default. This specifies whether to delete the data of the column used as Index. Therefore, when drop = False, the data of the column specified in Index remains as it is as shown below.
df4 = df.set_index(['animal','color','male or female'], drop=False)
| animal | color | male or female | age | height | weight | |||
|---|---|---|---|---|---|---|---|---|
| animal | color | male or female | ||||||
| cat | black | male | cat | black | male | 3 | 20 | 3.6 | 
| white | male | cat | white | male | 5 | 25 | 4.5 | |
| female | cat | white | female | 5 | 26 | 4 | ||
| brown | male | cat | brown | male | 5 | 30 | 5.2 | |
| female | cat | brown | female | 4 | 22 | 3 | ||
| dog | black | female | dog | black | female | 7 | 77 | 15.6 | 
| white | female | dog | white | female | 8 | 55 | 10.5 | |
| brown | male | dog | brown | male | 10 | 90 | 20.1 | |
| bird | white | male | bird | white | male | 1 | 10 | 0.52 | 
| yellow | female | bird | yellow | female | 2 | 15 | 0.89 | 
Like this. Basically, it seems that there is no problem with the default, so you do not have to worry about it.
To reset the Index, use .reset_index ().
df1 = df2.reset_index()
If you compare df and df1, you can see that the table is the same.
Confirmation that the column name specified in Index is likely to be included in names.
df2.index
>
MultiIndex([( 'cat',  'black',   'male'),
            ( 'cat',  'white',   'male'),
            ( 'cat',  'white', 'female'),
            ( 'cat',  'brown',   'male'),
            ( 'cat',  'brown', 'female'),
            ( 'dog',  'black', 'female'),
            ( 'dog',  'white', 'female'),
            ( 'dog',  'brown',   'male'),
            ('bird',  'white',   'male'),
            ('bird', 'yellow', 'female')],
           names=['animal', 'color', 'male or female'])
I wonder if it can be used when checking the column name in Index.
df2.index.names
>
FrozenList(['animal', 'color', 'male or female'])
From here, we will use df2 to see various things.
Use .index.get_level_values (). This will return a specific level of label as a vector.
#Specified by a specific level column name
df2.index.get_level_values('animal')
> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')
Same result as above ↓
#Can also be specified by level number
df2.index.get_level_values(0)
> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')
Avoid duplication with set, convert to list, and sort it to create a new list. If you want to sort in descending order, set the argument reverse to True. (I don't think it makes much sense here.)
#Get label without duplication
a = sorted(list(set(df2.index.get_level_values('animal'))))
a
> ['bird', 'cat', 'dog']
By using .xs (), you can specify the index column name and its value (label) for selection or extraction.
Let's use the list created earlier to check the operation of xs (). The contents are ['bird','cat','dog'] ← Using the values in this list as keys, find the average height for each male and female animal.
M = {}
F = {}
for s in a:
    m = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
    M[s] = m
    f = df2['height'].xs([s, 'female'], level=['animal', 'male or female']).mean()
    F[s] = f
#result
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
F
> {'bird': 15.0, 'cat': 24.0, 'dog': 66.0}
In addition to .mean (), .min (), .max (), .std (), .count (), etc. can be used according to the purpose.
For the time being, commentary
First, specify the column name of the value you want to find (this time I want to know the average of'height') df2['height']
Next, use .xs () to make detailed specifications. df2['height'].xs([s, 'male'], level=['animal', 'male or female'])
First, about [s,'male']. This time, in order to calculate the average height of each animal for each male (male) and female (female), specify a label corresponding to the information of "which animal and which gender" in the first argument of .xs ().
Next, about level = ['animal','male or female']. This seems to indicate what level the [s,'male'] you specified earlier is. So, it would be nice if you could correspond to the feeling of s in'animal'and'male' in'male or female'. Since the level can be specified by a number, the result is the same even if level = [0,2].
Finally, I want to find the average, so add .mean ().
The contents of the dictionary look like this ↓
#In the program above.mean()Remove and run
m = df2['height'].xs([s, 'male'], level=['animal', 'male or female'])
#result
M
> 
{'bird': color
         white    10
 Name: height, dtype: object,
 'cat': color
         black    20
         white    25
         brown    30
 Name: height, dtype: object,
 'dog': color
         brown    90
 Name: height, dtype: object}
For each animal, the value of'height'corresponding to the male (male) can be obtained.
In order, I felt that I should check from here first. I think the usage of .loc is the same as for a normal DataFrame that is not Multiindex.
Data review
df2
>
                             age height weight
animal color  male or female                  
cat    black  male             3     20    3.6
       white  male             5     25    4.5
              female           5     26      4
       brown  male             5     30    5.2
              female           4     22      3
dog    black  female           7     77   15.6
       white  female           8     55   10.5
       brown  male            10     90   20.1
bird   white  male             1     10   0.52
       yellow female           2     15   0.89
For Index, specify'cat'of'animal'. Specify all column names of Column.
df2.loc['cat']
#Or
# df2.loc['cat', :]
>
                     age height weight
color male or female                  
black male             3     20    3.6
white male             5     25    4.5
      female           5     26      4
brown male             5     30    5.2
      female           4     22      3
You can specify only the column names of Column that you want to narrow down.
df2.loc['cat', ['height', 'weight']]
>
                     height weight
color male or female              
black male               20    3.6
white male               25    4.5
      female             26      4
brown male               30    5.2
      female             22      3
The first argument is the label of a specific level of Index, and the second argument is the column name of the Column you want to narrow down.
df2.loc['cat', 'height']
>
color  male or female
black  male              20
white  male              25
       female            26
brown  male              30
       female            22
Name: height, dtype: object
Specify multiple labels within the same level of Index.
df2.loc[['cat', 'bird'], 'height']
>
animal  color   male or female
cat     black   male              20
        white   male              25
                female            26
        brown   male              30
                female            22
bird    white   male              10
        yellow  female            15
Name: height, dtype: object
This time, how to cross the index level. Narrow down in order of Index level (from 0 in terms of number). If nothing is specified in (), the result is the same as df2.loc [:,'height'].
# 'cat'What is.
df2.loc[('cat'), 'height']
>
color  male or female
black  male              20
white  male              25
       female            26
brown  male              30
       female            22
Name: height, dtype: object
# 'cat'Of'white'What is.
df2.loc[('cat','white'), 'height']
>
male or female
male      25
female    26
Name: height, dtype: object
# 'cat'Of'white'Is'male'What is.
df2.loc[('cat','white','male'), 'height']
> 25
If you narrow down to this point, you can rewrite the value.
df3 = df2.copy()
#Substitute 30
df3.loc[('cat','white','male'), 'height'] = 30
df3.loc[('cat','white','male'), 'height']
> 30
It seems that slices can also be used, but if you do it without thinking, the following error will occur
df2.loc[('cat','brown','male'):('dog','black','female')]
> UnsortedIndexError: 'Key length (3) was greater than MultiIndex lexsort depth (0)'
Solution Sort using .sort_index ().
df5 = df2.sort_index(level=0)
df5
>
                             age height weight
animal color  male or female                  
bird   white  male             1     10   0.52
       yellow female           2     15   0.89
cat    black  male             3     20    3.6
       brown  female           4     22      3
              male             5     30    5.2
       white  female           5     26      4
              male             5     25    4.5
dog    black  female           7     77   15.6
       brown  male            10     90   20.1
       white  female           8     55   10.5
Specify level = number as an argument. Since level = 0 was specified this time, it is sorted based on the animal label.
df5.loc[('cat','brown','male'):('dog','black','female')]
>
                            age height weight
animal color male or female                  
cat    brown male             5     30    5.2
       white female           5     26      4
             male             5     25    4.5
dog    black female           7     77   15.6
('cat','brown'): ('dog','white') can be narrowed down.
pd.indexSlice
idx = pd.IndexSlice
df5.loc[idx[:, :, ['male']], 'height']
>
 animal  color  male or female
bird    white  male              10
cat     black  male              20
        brown  male              30
        white  male              25
dog     brown  male              90
Name: height, dtype: object
Looking only at the output result, the male'height'for each animal can be extracted as in the case of using .xs (). So if you want to get the same result as using .xs (), do this. a = ['bird', 'cat', 'dog']
# pd.Use Index Slice
M_idx = {}
for s in a:
    M_idx[s] = df5.loc[idx[[s], :, ['male']], 'height'].mean()
M_idx
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
Comparison with when using .xs ()
# .xs()use
M = {}
for s in a:
    M[s] = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
Which one to use seems to be personal preference
I feel that Multiindex can be used in various ways once I get used to it, but I'm still not sure how to use it. Later, there may be typographical errors or typographical errors that I overlooked, so I will correct them.
MultiIndex / advanced indexing
Recommended Posts