[PYTHON] Pandas User Guide "Multi-Index / Advanced Index" (Official document Japanese translation)

This article is a machine translation of part of the official pandas documentation, the User Guide - MultiIndex / Advanced Indexing (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), with unnatural sentences revised by hand. At the time of writing, the latest release of pandas is 0.25.3, but with the future in mind, the text of this article is based on the documentation for the development version 1.0.0.

If you find mistranslations, have alternative translations, or have questions, please use the comments section or submit an edit request.

Multi-index / advanced index

This chapter covers indexing with a MultiIndex (see Hierarchical index (multi-index) below) and other advanced indexing features.

For documentation on basic indexes, see Indexing and selecting data (https://qiita.com/nkay/items/d322ed9d9a14bdbf14cb).

:warning: **Warning** Whether a setting operation returns a copy or a reference may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy (https://qiita.com/nkay/items/d322ed9d9a14bdbf14cb#Returning a view or a copy).

See also the cookbook (https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-selection) for more advanced operations.

Hierarchical index (multi-index)

Hierarchical and multi-level indexes are very useful for advanced data analysis and manipulation, especially when dealing with high-dimensional data. In essence, you can store and manipulate any number of dimensions in a low-dimensional data structure such as Series (1d) or DataFrame (2d).

In this section, we will show what a "hierarchical" index means and how it integrates with all of the pandas indexing features described above and in earlier chapters. Later, when we discuss grouping (https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby) and pivoting and reshaping data (https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping), we will introduce important applications that show how hierarchical indexing helps you structure your data for analysis.

See also the cookbook (https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-selection) for more advanced ways.

_Changed in version 0.24.0_: MultiIndex.labels has been renamed to MultiIndex.codes, and MultiIndex.set_labels to MultiIndex.set_codes.

Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the Index object, which is typically used to store the axis labels of pandas objects. You can think of a MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor attempts to return a MultiIndex when it is passed a list of tuples. Below are various ways to initialize a MultiIndex.

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
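
As mentioned above, the plain Index constructor attempts to return a MultiIndex when it is given a list of tuples. A minimal sketch reusing the tuples list defined in In [2] (this check is not part of the original example sequence):

# the Index constructor detects the list of tuples and returns a MultiIndex
# (equivalent to pd.MultiIndex.from_tuples(tuples), but without level names)
pd.Index(tuples)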

When you want every pairwise combination of the elements of two iterables (their Cartesian product), it is convenient to use the MultiIndex.from_product() method.

In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

You can also construct a MultiIndex directly from a DataFrame using the MultiIndex.from_frame() method. This is the complement of MultiIndex.to_frame().

_New in version 0.24.0_

In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
   ....:                    ['foo', 'one'], ['foo', 'two']],
   ....:                   columns=['first', 'second'])
   ....:

In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
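
Going the other way, MultiIndex.to_frame() turns the levels of a MultiIndex back into the columns of a DataFrame. A minimal sketch using the index just created (the round trip is my illustration, not part of the original examples):

# round trip: DataFrame -> MultiIndex -> DataFrame of the level values
mi = pd.MultiIndex.from_frame(df)
mi.to_frame(index=False)   # a DataFrame with columns 'first' and 'second'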

You can also automatically create a MultiIndex by passing the list of arrays directly to the Series or DataFrame, as shown below.

In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
   ....:           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
   ....:

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]:
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

As a convenience, every MultiIndex constructor accepts a names argument that stores string names for the levels themselves. If no names are provided, None will be assigned.

In [17]: df.index.names
Out[17]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of **levels** of the index is up to you.

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [19]: df
Out[19]:
first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We have "sparsed" higher levels of indexes to make the console output easier to see. You can control how the index is displayed using the multi_sparse option ofpandas.set_options ().

In [21]: with pd.option_context('display.multi_sparse', False):
   ....:     df
   ....:
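
The option_context block above changes the setting only for the duration of the block. If you prefer to change it for the whole session, the same option can be set globally; a minimal sketch (not part of the original example sequence):

# turn off the "sparse" display of repeated index labels for the whole session
pd.set_option('display.multi_sparse', False)
df   # every row now repeats the labels of all index levels
pd.set_option('display.multi_sparse', True)   # restore the default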

It is worth keeping in mind that nothing prevents you from using tuples as atomic (indivisible) labels on an axis.

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

The reason MultiIndex matters is that it lets you perform grouping, selection, and reshaping operations, as described below and in subsequent chapters. As you will see in later sections, you may find yourself working with hierarchically indexed data without having explicitly created a MultiIndex yourself. However, when loading data from a file, you may wish to generate your own MultiIndex while preparing the dataset.
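
For example, a dataset loaded with flat columns can be given a MultiIndex by promoting some of its columns with set_index(). A minimal sketch, assuming a hypothetical CSV file with columns first, second, and value:

# read a flat file and promote two of its columns to a MultiIndex
# ('data.csv' and its column names are hypothetical)
raw = pd.read_csv('data.csv')              # columns: first, second, value
indexed = raw.set_index(['first', 'second'])
indexed.index                              # a MultiIndex with levels 'first' and 'second'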

Rebuilding level labels

The get_level_values() method returns a vector of the labels at each location for a particular level.

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values('second')
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing of axes using MultiIndex

One of the important features of a hierarchical index is that you can select data using a "partial" label that identifies a subgroup in the data. **Partial** selection "drops" levels of the resulting hierarchical index in exactly the same way as selecting a column in a regular DataFrame.

In [25]: df['bar']
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df['bar', 'one']
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df['bar']['one']
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s['qux']
Out[28]:
one   -1.039575
two    0.271860
dtype: float64

See Cross-section with hierarchical index (below) for how to select on deeper levels.

Defined levels

The MultiIndex keeps all of the defined levels of an index, even if they are not actually used. You may notice this when slicing the index. For example:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[['foo','qux']].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid recomputing the levels in order to keep slicing highly performant. If you want to see only the levels that are actually used, you can use the get_level_values() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_level_values.html#pandas.MultiIndex.get_level_values) method.

In [31]: df[['foo', 'qux']].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, you can use the remove_unused_levels() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.remove_unused_levels.html#pandas.MultiIndex.remove_unused_levels) method.

In [33]: new_mi = df[['foo', 'qux']].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

Data alignment and use of reindex

Operations between differently indexed objects that have a MultiIndex on their axes work as you would expect; data alignment works the same as for an index of tuples.

In [35]: s + s[:-2]
Out[35]:
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]:
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

The reindex() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex) method of Series/DataFrame can also be called with another MultiIndex, or even a list or array of tuples.

In [37]: s.reindex(index[:3])
Out[37]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[38]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64

Advanced indexing with hierarchical indexes

Syntactically integrating MultiIndex with advanced indexing via .loc is a bit challenging, but we have made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect.

In [39]: df = df.T

In [40]: df
Out[40]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[('bar', 'two')]
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

In this example, df.loc['bar', 'two'] would also work, but be aware that this shorthand can lead to ambiguity in general.
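
The ambiguity arises because, when .loc receives a second argument, it is interpreted as a column selector rather than as a deeper level of the row index. A minimal sketch against the df defined above (my illustration, not part of the original examples):

# the second argument selects the column 'A', not the second level of the row index
df.loc['bar', 'A']   # the 'A' column for all rows under first-level label 'bar'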

If you also want to index a specific column with .loc, you must use a tuple, like this:

In [42]: df.loc[('bar', 'two'), 'A']
Out[42]: 0.8052440253863785

You don't have to specify all levels of the MultiIndex; you can pass only the first elements of the tuple. For example, you can use "partial" indexing to get all elements with bar in the first level as follows:

df.loc['bar']

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

The "Partial" slice also works very well.

In [43]: df.loc['baz':'foo']
Out[43]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice by the "range" of values by passing a slice of tuple.

In [44]: df.loc[('baz', 'two'):('qux', 'one')]
Out[44]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [45]: df.loc[('baz', 'two'):'foo']
Out[45]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

As with reindexing, you can also pass a list of labels or tuples.

In [46]: df.loc[[('bar', 'two'), ('qux', 'one')]]
Out[46]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

:ballot_box_with_check: **Note** It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. A tuple is interpreted as one multi-level key, whereas a list specifies several keys. In other words, tuples go horizontally (traversing levels), while lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level.

In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
   ....:               index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
   ....:

In [48]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[48]:
A  c    1
B  d    5
dtype: int64

In [49]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[49]:
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

Using slicers

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors described in Selection by label (https://qiita.com/nkay/items/d322ed9d9a14bdbf14cb#Select by Label), such as slices, lists of labels, labels, and boolean arrays, in exactly the same way.

You can use slice(None) to select all the contents of *that* level. You do not need to specify all of the *deeper* levels; they are implied as slice(None).

As usual, this is label-based indexing, so **both sides** of the slicer are included.

:warning: **Warning** You should specify all axes in .loc (both the **index** and the **columns**). There are some ambiguous cases in which a passed indexer could be misinterpreted as indexing *both* axes rather than, say, a MultiIndex for the rows. Write it like this:

df.loc[(slice('A1', 'A3'), ...), :]             # noqa: E999

Do not write it like this:

df.loc[(slice('A1', 'A3'), ...)]                # noqa: E999
In [50]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [51]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
   ....:                                       mklbl('B', 2),
   ....:                                       mklbl('C', 4),
   ....:                                       mklbl('D', 2)])
   ....:

In [52]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                        ('b', 'foo'), ('b', 'bah')],
   ....:                                       names=['lvl0', 'lvl1'])
   ....:

In [53]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
   ....:                       .reshape((len(miindex), len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)
   ....:

In [54]: dfmi
Out[54]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [55]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[55]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use pandas.IndexSlice to get a more natural syntax using : rather than slice(None).

In [56]: idx = pd.IndexSlice

In [57]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

You can use this method to make very complex selections on multiple axes at the same time.

In [58]: dfmi.loc['A1', (slice(None), 'foo')]
Out[58]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [59]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[59]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer, you can make selections based on *values*.

In [60]: mask = dfmi[('a', 'foo')] > 200

In [61]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[61]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers as applying to a single axis.

In [62]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
Out[62]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore, you can *set* values using the following methods.

In [63]: df2 = dfmi.copy()

In [64]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [65]: df2
Out[65]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can also use an alignable object as the right-hand side.

In [66]: df2 = dfmi.copy()

In [67]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [68]: df2
Out[68]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Cross-section

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [69]: df
Out[69]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [70]: df.xs('one', level='second')
Out[70]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466
# using slicers
In [71]: df.loc[(slice(None), 'one'), :]
Out[71]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs by providing the axis argument.

In [72]: df = df.T

In [73]: df.xs('one', level='second', axis=1)
Out[73]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
# using slicers
In [74]: df.loc[:, (slice(None), 'one')]
Out[74]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

With xs, you can also select using multiple keys.

In [75]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[75]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681
# using slicers
In [76]: df.loc[:, ('bar', 'one')]
Out[76]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [77]: df.xs('one', level='second', axis=1, drop_level=False)
Out[77]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result of using drop_level=True (the default).

In [78]: df.xs('one', level='second', axis=1, drop_level=True)
Out[78]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

Advanced reindexing and alignment

Using the level argument of the reindex() and align() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.align.html#pandas.DataFrame.align) methods of pandas objects is useful for broadcasting values across a level. For example:

In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
   ....:                      codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
   ....:

In [80]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [81]: df
Out[81]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [82]: df2 = df.mean(level=0)

In [83]: df2
Out[83]:
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [84]: df2.reindex(df.index, level=0)
Out[84]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [85]: df_aligned, df2_aligned = df.align(df2, level=0)

In [86]: df_aligned
Out[86]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [87]: df2_aligned
Out[87]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

Swapping levels with swaplevel

The swaplevel() method can switch the order of two levels.

In [88]: df[:5]
Out[88]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [89]: df[:5].swaplevel(0, 1, axis=0)
Out[89]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the levels of a hierarchical index in one step.

In [90]: df[:5].reorder_levels([1, 0], axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Renaming names of an Index or MultiIndex

The rename() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html#pandas.DataFrame.rename) method, normally used to rename the columns of a DataFrame, can also rename MultiIndex labels. The columns argument of rename accepts a dictionary that includes only the columns you wish to rename.

In [91]: df.rename(columns={0: "col0", 1: "col1"})
Out[91]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

This method can also be used to rename a particular label in the DataFrame's main index.

In [92]: df.rename(index={"one": "two", "y": "z"})
Out[92]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, you can specify names for the levels of a MultiIndex, which is useful later when you use reset_index() to move the values from the MultiIndex into regular columns.

In [93]: df.rename_axis(index=['abc', 'def'])
Out[93]:
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520
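
As noted above, naming the index levels pays off when you later move them into regular columns with reset_index(). A minimal sketch continuing from the example above (my illustration, not part of the original examples):

# name the levels, then promote them to ordinary columns
df.rename_axis(index=['abc', 'def']).reset_index()   # columns: 'abc', 'def', 0, 1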

Note that the columns of a DataFrame are themselves an index, so using rename_axis with the columns argument will change the name of that index.

In [94]: df.rename_axis(columns="Cols").columns
Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series, or mapping function to map labels/names to new values.
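
A mapping function works the same way as a dictionary. A minimal sketch with a small hypothetical DataFrame that has string column labels (the data is my own, not from the original document):

# rename columns by applying a function to each label
small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
small.rename(columns=str.upper)   # columns become 'A' and 'B'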

When working with an Index object directly, rather than via a DataFrame, you can use Index.set_names() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.set_names.html#pandas.Index.set_names) to change the names.

In [95]: mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])

In [96]: mi.names
Out[96]: FrozenList(['x', 'y'])

In [97]: mi2 = mi.rename("new name", level=0)

In [98]: mi2
Out[98]:
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

:warning: **Warning** Prior to pandas 1.0.0, it was also possible to set the names of a MultiIndex by updating the name of a level.

>>> mi.levels[0].name = 'name via level'
>>> mi.names[0]  # only works for older pandas
'name via level'

As of pandas 1.0, this will silently fail to update the names of the MultiIndex. Use Index.set_names() instead.
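
For the pattern shown in the warning, the supported replacement is Index.set_names() with a level argument. A minimal sketch using the mi object from above (my illustration, not part of the original examples):

# set the name of level 0 without mutating the levels directly
mi.set_names('name via set_names', level=0)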

Sorting of MultiIndex

MultiIndexed objects need to be sorted in order to be indexed and sliced effectively. As with any index, you can use sort_index().

In [99]: import random

In [100]: random.shuffle(tuples)

In [101]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [102]: s
Out[102]:
qux  one    0.206053
foo  two   -0.251905
bar  two   -2.213588
     one    1.063327
qux  two    1.266143
baz  one    0.299368
foo  one   -0.863838
baz  two    0.408204
dtype: float64

In [103]: s.sort_index()
Out[103]:
bar  one    1.063327
     two   -2.213588
baz  one    0.299368
     two    0.408204
foo  one   -0.863838
     two   -0.251905
qux  one    0.206053
     two    1.266143
dtype: float64

In [104]: s.sort_index(level=0)
Out[104]:
bar  one    1.063327
     two   -2.213588
baz  one    0.299368
     two    0.408204
foo  one   -0.863838
     two   -0.251905
qux  one    0.206053
     two    1.266143
dtype: float64

In [105]: s.sort_index(level=1)
Out[105]:
bar  one    1.063327
baz  one    0.299368
foo  one   -0.863838
qux  one    0.206053
bar  two   -2.213588
baz  two    0.408204
foo  two   -0.251905
qux  two    1.266143
dtype: float64

You can also pass the level name to sort_index if the level of MultiIndex is named.

In [106]: s.index.set_names(['L1', 'L2'], inplace=True)

In [107]: s.sort_index(level='L1')
Out[107]:
L1   L2
bar  one    1.063327
     two   -2.213588
baz  one    0.299368
     two    0.408204
foo  one   -0.863838
     two   -0.251905
qux  one    0.206053
     two    1.266143
dtype: float64

In [108]: s.sort_index(level='L2')
Out[108]:
L1   L2
bar  one    1.063327
baz  one    0.299368
foo  one   -0.863838
qux  one    0.206053
bar  two   -2.213588
baz  two    0.408204
foo  two   -0.251905
qux  two    1.266143
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex.

In [109]: df.T.sort_index(level=1, axis=1)
Out[109]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

Indexing works even if the data is not sorted, but it will be rather inefficient (and you will see a PerformanceWarning). It will also return a copy of the data rather than a view.

In [110]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
   .....:                     'joe': ['x', 'x', 'z', 'y'],
   .....:                     'jolie': np.random.rand(4)})
   .....:

In [111]: dfm = dfm.set_index(['jim', 'joe'])

In [112]: dfm
Out[112]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

Out[4]:
           jolie
jim joe
1   z    0.64094

Furthermore, indexing data that is not fully sorted can raise an error like the following:

In [5]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_lexsorted() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.is_lexsorted.html#pandas.MultiIndex.is_lexsorted) method of a MultiIndex indicates whether the index is sorted, and the lexsort_depth property returns the sort depth.

In [113]: dfm.index.is_lexsorted()
Out[113]: False

In [114]: dfm.index.lexsort_depth
Out[114]: 1
In [115]: dfm = dfm.sort_index()

In [116]: dfm
Out[116]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [117]: dfm.index.is_lexsorted()
Out[117]: True

In [118]: dfm.index.lexsort_depth
Out[118]: 2

Selection now works as expected.

In [119]: dfm.loc[(0, 'y'):(1, 'z')]
Out[119]:
            jolie
jim joe
1   y    0.110968
    z    0.537020

take method

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame provide a take() method that retrieves elements along a given axis at the given positions. The given indices must be either a list or an ndarray of integer index positions. take can also accept negative integers as relative positions from the end of the object.

In [120]: index = pd.Index(np.random.randint(0, 1000, 10))

In [121]: index
Out[121]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [122]: positions = [0, 9, 3]

In [123]: index[positions]
Out[123]: Int64Index([214, 329, 567], dtype='int64')

In [124]: index.take(positions)
Out[124]: Int64Index([214, 329, 567], dtype='int64')

In [125]: ser = pd.Series(np.random.randn(10))

In [126]: ser.iloc[positions]
Out[126]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [127]: ser.take(positions)
Out[127]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

For DataFrames, the given indices should be a one-dimensional list or ndarray that specifies row or column positions.

In [128]: frm = pd.DataFrame(np.random.randn(5, 3))

In [129]: frm.take([1, 4, 3])
Out[129]:
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [130]: frm.take([0, 2], axis=1)
Out[130]:
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

Note that the take method of the pandas object is not intended to work with Boolean indexes and can return unexpected results.

In [131]: arr = np.random.randn(10)

In [132]: arr.take([False, False, True, True])
Out[132]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [133]: arr[[0, 1]]
Out[133]: array([-1.1935,  0.6775])

In [134]: ser = pd.Series(np.random.randn(10))

In [135]: ser.take([False, False, True, True])
Out[135]:
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [136]: ser.iloc[[0, 1]]
Out[136]:
0    0.233141
1   -0.223540
dtype: float64

Finally, a small note on performance: because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [137]: arr = np.random.randn(10000, 5)

In [138]: indexer = np.arange(10000)

In [139]: random.shuffle(indexer)

In [140]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....:
219 us +- 1.23 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)
72.3 us +- 727 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
In [141]: ser = pd.Series(arr[:, 0])

In [142]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....:
179 us +- 1.54 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
162 us +- 1.6 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)

Index type

We have covered MultiIndex fairly extensively so far. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex can be found here (https://dev.pandas.io/docs/user_guide/timedeltas.html#timedeltas-index).

The following subsections highlight some other index types.

CategoricalIndex

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [143]: from pandas.api.types import CategoricalDtype

In [144]: df = pd.DataFrame({'A': np.arange(6),
   .....:                    'B': list('aabbca')})
   .....:

In [145]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

In [146]: df
Out[146]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [147]: df.dtypes
Out[147]:
A       int64
B    category
dtype: object

In [148]: df['B'].cat.categories
Out[148]: Index(['c', 'a', 'b'], dtype='object')

Setting the index creates a CategoricalIndex.

In [149]: df2 = df.set_index('B')

In [150]: df2.index
Out[150]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must **be in the categories**; otherwise, the operation will raise a KeyError.

In [151]: df2.loc['a']
Out[151]:
   A
B
a  0
a  1
a  5

The CategoricalIndex is **preserved** after indexing.

In [152]: df2.loc['a'].index
Out[152]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
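
Conversely, as stated above, an indexer that is not among the categories raises a KeyError. A minimal sketch (the label 'e' is simply an example of a value not in the categories):

# 'e' is not one of the categories ('c', 'a', 'b'), so this raises KeyError
df2.loc['e']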

Sorting the index sorts by the order of the categories (because we created the index with CategoricalDtype(list('cab')), the sort order is c, a, b).

In [153]: df2.sort_index()
Out[153]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index preserve the nature of the index as well.

In [154]: df2.groupby(level=0).sum()
Out[154]:
   A
B
c  4
a  6
b  5

In [155]: df2.groupby(level=0).sum().index
Out[155]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Reindexing operations return a resulting index based on the type of the passed indexer. Passing a list returns a plain old Index; indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the passed **Categorical** dtype. This allows you to arbitrarily index even with values **not in the categories**, similarly to how you can reindex any pandas index.

In [156]: df3 = pd.DataFrame({'A': np.arange(3),
   .....:                     'B': pd.Series(list('abc')).astype('category')})
   .....:

In [157]: df3 = df3.set_index('B')

In [158]: df3
Out[158]:
   A
B
a  0
b  1
c  2
In [159]: df3.reindex(['a', 'e'])
Out[159]:
     A
B
a  0.0
e  NaN

In [160]: df3.reindex(['a', 'e']).index
Out[160]: Index(['a', 'e'], dtype='object', name='B')

In [161]: df3.reindex(pd.Categorical(['a', 'e'], categories=list('abe')))
Out[161]:
     A
B
a  0.0
e  NaN

In [162]: df3.reindex(pd.Categorical(['a', 'e'], categories=list('abe'))).index
Out[162]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, name='B', dtype='category')

:warning: **Warning** Reshaping and comparison operations on a CategoricalIndex must have the same categories, or a TypeError will be raised.

In [163]: df4 = pd.DataFrame({'A': np.arange(2),
   .....:                     'B': list('ba')})
   .....:

In [164]: df4['B'] = df4['B'].astype(CategoricalDtype(list('ab')))

In [165]: df4 = df4.set_index('B')

In [166]: df4.index
Out[166]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, name='B', dtype='category')

In [167]: df5 = pd.DataFrame({'A': np.arange(2),
   .....:                     'B': list('bc')})
   .....:

In [168]: df5['B'] = df5['B'].astype(CategoricalDtype(list('bc')))

In [169]: df5 = df5.set_index('B')

In [170]: df5.index
Out[170]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, name='B', dtype='category')
In [1]: pd.concat([df4, df5])
TypeError: categories must match existing categories when appending

Int64Index and RangeIndex

Int64Index is a fundamental basic index in pandas. It is an immutable array implementing an ordered, sliceable set.

RangeIndex is a subclass of Int64Index that provides the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. It is analogous to Python's range type.
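
The default index is easy to see on any freshly constructed object. A minimal sketch (my illustration; note that the concrete index classes differ in later pandas versions):

# a Series built without an explicit index gets a RangeIndex by default
pd.Series(['a', 'b', 'c']).index   # RangeIndex(start=0, stop=3, step=1)

# an explicit list of integers produces an Int64Index in this pandas version
pd.Index([10, 20, 30])             # Int64Index([10, 20, 30], dtype='int64')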

Float64Index

By default, a Float64Index is automatically created when you pass floating-point values, or a mix of integers and floats, when creating an index. This enables a pure label-based slicing paradigm in which scalar indexing and slicing with [], ix, and loc work exactly the same.

In [171]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [172]: indexf
Out[172]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [173]: sf = pd.Series(range(5), index=indexf)

In [174]: sf
Out[174]:
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selections for [] and .loc are always label-based. An integer specification matches an equal float index (for example, 3 is equivalent to 3.0).

In [175]: sf[3]
Out[175]: 2

In [176]: sf[3.0]
Out[176]: 2

In [177]: sf.loc[3]
Out[177]: 2

In [178]: sf.loc[3.0]
Out[178]: 2

The only positional indexing is via iloc.

In [179]: sf.iloc[3]
Out[179]: 3

A scalar index that is not found will raise a KeyError. Slicing is primarily based on the values of the index when using [], ix, and loc, and **always** based on position when using iloc. The exception is when the slice is boolean, in which case it is always positional.

In [180]: sf[2:4]
Out[180]:
2.0    1
3.0    2
dtype: int64

In [181]: sf.loc[2:4]
Out[181]:
2.0    1
3.0    2
dtype: int64

In [182]: sf.iloc[2:4]
Out[182]:
3.0    2
4.5    3
dtype: int64

The float index allows you to use slices with floating point numbers.

In [183]: sf[2.1:4.6]
Out[183]:
3.0    2
4.5    3
dtype: int64

In [184]: sf.loc[2.1:4.6]
Out[184]:
3.0    2
4.5    3
dtype: int64

If it is not a float index, slices using floats will raise a TypeError.

In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

A typical use case for this type of index is the following: imagine an irregular timedelta-like indexing scheme in which the data is recorded as floats, for example millisecond offsets.

In [185]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
   .....:                                index=np.arange(5) * 250.0,
   .....:                                columns=list('AB')),
   .....:                   pd.DataFrame(np.random.randn(6, 2),
   .....:                                index=np.arange(4, 10) * 250.1,
   .....:                                columns=list('AB'))])
   .....:

In [186]: dfir
Out[186]:
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
1000.4 -0.179734  0.993962
1250.5 -0.212673  0.909872
1500.6 -0.733333 -0.349893
1750.7  0.456434 -0.306735
2000.8  0.553396  0.166221
2250.9 -0.101684 -0.734907

Selection operations always work on a value basis for all selection operators.

In [187]: dfir[0:1000.4]
Out[187]:
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
1000.4 -0.179734  0.993962

In [188]: dfir.loc[0:1001, 'A']
Out[188]:
0.0      -0.435772
250.0    -0.808286
500.0    -1.815703
750.0    -0.243487
1000.0    1.162969
1000.4   -0.179734
Name: A, dtype: float64

In [189]: dfir.loc[1000.4]
Out[189]:
A   -0.179734
B    0.993962
Name: 1000.4, dtype: float64

You can get the first second (1000 milliseconds) of the data as follows:

In [190]: dfir[0:1000]
Out[190]:
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725

If you need integer, position-based selection, use iloc.

In [191]: dfir.iloc[0:5]
Out[191]:
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725

IntervalIndex

IntervalIndex (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex), together with its own dtype IntervalDtype and the Interval scalar type, provides first-class support for interval notation in pandas.

IntervalIndex allows some unique indexing and is also used as the return type for the categories of cut() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) and qcut() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html#pandas.qcut).

Indexing with an IntervalIndex

An IntervalIndex can be used as the index in Series and DataFrame.

In [192]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
   .....:                   index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
   .....:

In [193]: df
Out[193]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label-based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [194]: df.loc[2]
Out[194]:
A    2
Name: (1, 2], dtype: int64

In [195]: df.loc[[2, 3]]
Out[195]:
        A
(1, 2]  2
(2, 3]  3

If you select a label *contained* within an interval, this will also select the interval.

In [196]: df.loc[2.5]
Out[196]:
A    3
Name: (2, 3], dtype: int64

In [197]: df.loc[[2.5, 3.5]]
Out[197]:
        A
(2, 3]  3
(3, 4]  4

Selecting using an Interval only returns exact matches (pandas 0.25.0 and later).

In [198]: df.loc[pd.Interval(1, 2)]
Out[198]:
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [7]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError: Interval(0.5, 2.5, closed='right')

To select all Intervals that overlap a given Interval, create a boolean indexer using the overlaps() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.overlaps.html#pandas.IntervalIndex.overlaps) method.

In [199]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [200]: idxr
Out[200]: array([ True,  True,  True, False])

In [201]: df[idxr]
Out[201]:
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

Binning data using cut and qcut

cut() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) and qcut() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html#pandas.qcut) both return Categorical objects, and the bins they create are stored as an IntervalIndex in their .categories attribute.

In [202]: c = pd.cut(range(4), bins=2)

In [203]: c
Out[203]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [204]: c.categories
Out[204]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

cut() also accepts an IntervalIndex as its bins argument, which enables a useful pandas idiom. First, call cut() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) with some data and bins set to a fixed number to create the bins. Then pass the value of its .categories as the bins argument in subsequent calls to cut(), and the new data will be binned into the same bins.

In [205]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[205]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Values outside all bins are assigned the NaN value.

Creating a range of intervals

If you need intervals at a regular frequency, you can use the interval_range() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.interval_range.html#pandas.interval_range) function to create an IntervalIndex from various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals and one calendar day for datetime-like intervals.

In [206]: pd.interval_range(start=0, end=5)
Out[206]:
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')

In [207]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
Out[207]:
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [208]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
Out[208]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

The freq argument can be used to specify a non-default frequency, and for datetime-like intervals it can make use of the various frequency aliases (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases).

In [209]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[209]:
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
              closed='right',
              dtype='interval[float64]')

In [210]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
Out[210]:
IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [211]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
Out[211]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

Additionally, the closed argument can be used to specify on which side(s) the intervals are closed. By default, intervals are closed on the right.

In [212]: pd.interval_range(start=0, end=4, closed='both')
Out[212]:
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
              closed='both',
              dtype='interval[int64]')

In [213]: pd.interval_range(start=0, end=4, closed='neither')
Out[213]:
IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
              closed='neither',
              dtype='interval[int64]')

_New in version 0.23.0_

Specifying start, end, and periods generates a range of evenly spaced intervals from start to end inclusively, with periods elements in the resulting IntervalIndex.

In [214]: pd.interval_range(start=0, end=6, periods=4)
Out[214]:
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              closed='right',
              dtype='interval[float64]')

In [215]: pd.interval_range(pd.Timestamp('2018-01-01'),
   .....:                   pd.Timestamp('2018-02-28'), periods=3)
   .....:
Out[215]:
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')

Other indexing FAQs

Integer index

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer positions. Therefore, with an integer axis index, standard tools like .loc allow *only* label-based indexing. The following code raises an exception:

In [216]: s = pd.Series(range(5))

In [217]: s[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-217-76c3dce40054> in <module>
----> 1 s[-1]

~/work/1/s/pandas/core/series.py in __getitem__(self, key)
   1076         key = com.apply_if_callable(key, self)
   1077         try:
-> 1078             result = self.index.get_value(self, key)
   1079
   1080             if not is_scalar(result):

~/work/1/s/pandas/core/indexes/base.py in get_value(self, series, key)
   4623         k = self._convert_scalar_indexer(k, kind="getitem")
   4624         try:
-> 4625             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4626         except KeyError as e1:
   4627             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

~/work/1/s/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

~/work/1/s/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

~/work/1/s/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/work/1/s/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

~/work/1/s/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: -1

In [218]: df = pd.DataFrame(np.random.randn(5, 4))

In [219]: df
Out[219]:
          0         1         2         3
0 -0.130121 -0.476046  0.759104  0.213379
1 -0.082641  0.448008  0.656420 -1.051443
2  0.594956 -0.151360 -0.069303  1.221431
3 -0.182832  0.791235  0.042745  2.069775
4  1.446552  0.019814 -1.389212 -0.702312

In [220]: df.loc[-2:]
Out[220]:
          0         1         2         3
0 -0.130121 -0.476046  0.759104  0.213379
1 -0.082641  0.448008  0.656420 -1.051443
2  0.594956 -0.151360 -0.069303  1.221431
3 -0.182832  0.791235  0.042745  2.069775
4  1.446552  0.019814 -1.389212 -0.702312

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" on position-based indexing).

Exact match required for non-monotonic indexes

If the index of a Series or DataFrame is monotonically increasing or decreasing, then it is possible for the bounds of a label-based slice to lie outside the range of the index, much like slice indexing of a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_increasing.html#pandas.Index.is_monotonic_increasing) and is_monotonic_decreasing() attributes.

In [221]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))

In [222]: df.index.is_monotonic_increasing
Out[222]: True

# rows 0 and 1 do not exist, but rows 2, 3 (both of them), and 4 are returned
In [223]: df.loc[0:4, :]
Out[223]:
   data
2     0
3     1
3     2
4     3

# the slice is outside the index, so an empty DataFrame is returned
In [224]: df.loc[13:15, :]
Out[224]:
Empty DataFrame
Columns: [data]
Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be *unique* members of the index.

In [225]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5],
   .....:                   columns=['data'], data=list(range(6)))
   .....:

In [226]: df.index.is_monotonic_increasing
Out[226]: False

# no problem, because both 2 and 4 are in the index
In [227]: df.loc[2:4, :]
Out[227]:
   data
2     0
3     1
1     2
4     3
# 0 is not in the index
In [9]: df.loc[0:4, :]
KeyError: 0

# 3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, combine one of them with the is_unique (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_unique.html#pandas.Index.is_unique) attribute.

In [228]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])

In [229]: weakly_monotonic
Out[229]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [230]: weakly_monotonic.is_monotonic_increasing
Out[230]: True

In [231]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[231]: False

End point is included

Unlike standard Python sequence slicing, in which the endpoint is not included, label-based slicing in pandas is inclusive. The primary reason is that it is often not possible to easily determine the "successor", or next element, after a particular label in an index. For example, consider the following Series:

In [232]: s = pd.Series(np.random.randn(6), index=list('abcdef'))

In [233]: s
Out[233]:
a    0.301379
b    1.240445
c   -0.846068
d   -0.043312
e   -1.658747
f   -0.819549
dtype: float64

Suppose we wished to slice from c to e. Using integers, this would be done as follows:

In [234]: s[2:5]
Out[234]:
c   -0.846068
d   -0.043312
e   -1.658747
dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

s.loc['c':'e' + 1]

A very common use case is to limit a time series to start and end on two specific dates. To enable this, we made the design choice to have label-based slicing include both endpoints.

In [235]: s.loc['c':'e']
Out[235]:
c   -0.846068
d   -0.043312
e   -1.658747
dtype: float64

This is arguably "more practical than pure", but be careful if you expect label-based slices to behave exactly like standard Python integer slices.

Indexing that implicitly changes the dtype of Series

Different indexing operations can change the dtype of Series.

In [236]: series1 = pd.Series([1, 2, 3])

In [237]: series1.dtype
Out[237]: dtype('int64')

In [238]: res = series1.reindex([0, 4])

In [239]: res.dtype
Out[239]: dtype('float64')

In [240]: res
Out[240]:
0    1.0
4    NaN
dtype: float64
In [241]: series2 = pd.Series([True])

In [242]: series2.dtype
Out[242]: dtype('bool')

In [243]: res = series2.reindex_like(series1)

In [244]: res.dtype
Out[244]: dtype('O')

In [245]: res
Out[245]:
0    True
1     NaN
2     NaN
dtype: object

This is because the (re) indexing operation above implicitly inserts NaNs and changes the dtype accordingly. This can cause problems when using numpy ufuncs such as numpy.logical_and.

See this Past Issue (https://github.com/pydata/pandas/issues/2388) for more information.
