pandas 1.2.0: What's new

With the release of pandas 1.2.0, here is a look at the new features worth noting.

I have skipped minor changes and content I am not personally interested in, so see the official release notes for the full list of changes.

A setting to disallow duplicate row and column labels

Background

A pandas Index can have duplicate labels. That is, the row and column labels of a DataFrame or Series do not have to be unique.

For example, the following DataFrame df_dlabel has two rows labeled 'a' and two columns labeled 'B'.

import numpy as np
import pandas as pd

df_dlabel = pd.DataFrame(np.arange(9).reshape(3, 3),
                         index=list('aab'), columns=list('ABB'))
df_dlabel
   A  B  B
a  0  1  2
a  3  4  5
b  6  7  8

For people who regularly wrangle data, it can seem unnatural that this is allowed; SQL primary keys, for example, generally do not permit duplicates.

Examples where non-unique labels are inconvenient

In a pandas DataFrame, selecting a column by name with [] or .loc etc. usually returns a Series. However, if the column name is duplicated, a DataFrame is returned instead. Having the return type ambiguous like this is not good.

df_dlabel['A']

df_dlabel['B']
# df_dlabel['A'] -> pd.Series
a    0
a    3
b    6
Name: A, dtype: int32

# df_dlabel['B'] -> pd.DataFrame
   B  B
a  1  2
a  4  5
b  7  8

Similarly, specifying a row and a column label at the same time with .loc etc., or using .at, usually returns a scalar value, but if the labels are duplicated, a Series or a DataFrame is returned instead.

df_dlabel.at['b', 'A']

df_dlabel.at['b', 'B']

df_dlabel.at['a', 'B']
# df_dlabel.at['b', 'A'] -> int
6

# df_dlabel.at['b', 'B'] -> pd.Series
B    7
B    8
Name: b, dtype: int32

# df_dlabel.at['a', 'B'] -> pd.DataFrame
   B  B
a  1  2
a  4  5

Some pandas functions and methods assume that row and column labels are unique; if you run into one of them, you will get an error (or, if you are unlucky, a bug).

The following is an example where an error occurs when a duplicated column name is passed as an argument.

df_dlabel.merge(df_dlabel, on='A')

df_dlabel.merge(df_dlabel, on='B')
# df_dlabel.merge(df_dlabel, on='A') works fine
   A  B_x  B_x  B_y  B_y
0  0    1    2    1    2
1  3    4    5    4    5
2  6    7    8    7    8

# df_dlabel.merge(df_dlabel, on='B') fails
ValueError: The column label 'B' is not unique.

The following is an example where, if the DataFrame has duplicated labels, an error occurs even when the duplicated label is not itself passed as an argument.

df_dlabel.reindex(list('AC'))
ValueError: cannot reindex from a duplicate axis

The following is an example where, with duplicated column names, some of the data is silently dropped and only a warning (not an error) is issued.

df_dlabel.to_dict()
UserWarning: DataFrame columns are not unique, some columns will be omitted.
{'A': {'a': 3, 'b': 6}, 'B': {'a': 5, 'b': 8}}

New behavior

An allows_duplicate_labels flag has been added to the DataFrame flags, which can be inspected via the .flags attribute (the .flags attribute itself is also new in this release).

df_dlabel = (pd.DataFrame(np.arange(9).reshape(3, 3),
                          index=list('abc'), columns=list('ABC'))
             .set_flags(allows_duplicate_labels=True))
df_dlabel.flags.allows_duplicate_labels


df_ulabel = (pd.DataFrame(np.arange(9).reshape(3, 3),
                          index=list('abc'), columns=list('ABC'))
             .set_flags(allows_duplicate_labels=False))
df_ulabel.flags.allows_duplicate_labels
# df_dlabel
True

# df_ulabel
False

A DataFrame with allows_duplicate_labels set to True behaves as before, while one set to False is guaranteed to have unique row and column labels and does not allow duplicates. If you attempt an operation that would produce duplicate row or column labels on such a DataFrame, a DuplicateLabelError is raised (that is the intended behavior, although in some cases a different error may be raised, or the check may be silently missed).

df_dlabel.reindex(list('aab'))

df_ulabel.reindex(list('aab'))
# df_dlabel: allowed
   A  B  C
a  0  1  2
a  0  1  2
b  3  4  5

# df_ulabel: not allowed
DuplicateLabelError: Index has duplicates.

DataFrame flags can be changed with the .set_flags() method, as shown above. Note that allows_duplicate_labels cannot be set to False on a DataFrame that already has duplicated row or column labels; that, too, raises DuplicateLabelError.

(pd.DataFrame(np.arange(9).reshape(3, 3),
             index=list('aab'), columns=list('ABB'))
 .set_flags(allows_duplicate_labels=False))
DuplicateLabelError: Index has duplicates.
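
Incidentally, the flag appears to propagate to objects derived from a flagged DataFrame (the documentation describes the propagation as best-effort), so the guarantee is not lost after ordinary operations. A minimal check under that assumption:

# Assumption: allows_duplicate_labels propagates (best-effort) to derived objects
df_ulabel[['A', 'B']].flags.allows_duplicate_labels
False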

Nullable floating-point data types

In pandas, NaN (np.nan) has long been treated as a stand-in for missing values. Strictly, though, NaN means "not a number, an exceptional non-real value", and there has long been an argument that a dedicated "no data, missing value" marker is needed, one that can be used consistently with non-float data types as well. For that reason, version 1.0.0 introduced NA (pd.NA) to represent missing values.

Following the nullable integer data types (Int64Dtype etc.) that were introduced somewhat earlier, this release adds nullable floating-point data types (Float64Dtype etc.).
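
For reference, the existing nullable integer type already works like this (a minimal illustration):

pd.Series([0, 1, pd.NA], dtype='Int64')
0       0
1       1
2    <NA>
dtype: Int64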

Specifying 'Float64' in the dtype argument gives the nullable floating-point data type ('Float32' etc. also exist). As with 'Int64', the first letter is capitalized to distinguish it from the ordinary floating-point dtypes.

The nullable floating-point data types use NA instead of NaN.

s_float = pd.Series([0, 1, np.nan, np.nan], dtype='float64')

s_nfloat = pd.Series([0, 1, np.nan, pd.NA], dtype='Float64')
# float64
0    0.0
1    1.0
2    NaN
3    NaN
dtype: float64

# Float64
0     0.0
1     1.0
2    <NA>
3    <NA>
dtype: Float64

Comparison operations give different results for NaN and NA.

s_float < 0

s_nfloat < 0
# float64
0    False
1    False
2    False
3    False
dtype: bool

# Float64
0    False
1    False
2     <NA>
3     <NA>
dtype: boolean
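
The scalar values themselves also behave differently: NaN compares unequal to everything, including itself, whereas comparisons involving NA propagate NA.

np.nan == np.nan
False

pd.NA == pd.NA
<NA>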

Index names are now preserved when concatenating DataFrames, etc.

An Index object has a name (the name attribute). It is generally perceived as the "index column name" (especially when a particular column has been made the index with .set_index()).

The following example creates an index with the name 'idx'.

df_n = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list('AB'),
                    index=pd.Index(list('abc'), name='idx'))
df_n
     A  B
idx
a    0  1
b    2  3
c    4  5

When concatenating DataFrames, it seems the name was dropped whenever indexes that did not line up exactly were combined, even if the index names were the same.

Previous behavior


ct = pd.concat([df_n.iloc[:2, :1], df_n.iloc[1:, 1:]], axis=1)
ct
     A    B
a  0.0  NaN
b  2.0  3.0
c  NaN  5.0

This has been fixed so that the name is preserved whenever possible. (It is obviously a bug fix, but for some reason it is introduced as a new feature.)
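
New behavior

The same concatenation on 1.2.0 keeps the name (the output below is a sketch based on the fix described above, not a captured session):

ct = pd.concat([df_n.iloc[:2, :1], df_n.iloc[1:, 1:]], axis=1)
ct
       A    B
idx
a    0.0  NaN
b    2.0  3.0
c    NaN  5.0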

Added an option to ignore missing values in applymap()

pd.DataFrame.applymap() is a method that applies a given function to every element of a DataFrame, element by element.

df_str = pd.DataFrame([['where', 'why'], ['when', 'who']], dtype='string')
df_str.applymap(len)
   0  1
0  5  3
1  4  3

Sometimes you want a missing value (NA) to propagate as-is, rather than having the function applied to it.

df_str = pd.DataFrame([['where', 'why'], ['when', pd.NA]], dtype='string')
df_str.applymap(len)
TypeError: object of type 'NAType' has no len()

Until now, you had to either build a conditional branch such as "return NA if the input is NA" into the function itself, or apply pd.Series.map(), which has an option to propagate missing values, to each Series.

df_str.apply(lambda s: s.map(len, na_action='ignore'))
   0     1
0  5     3
1  4  <NA>

A na_action argument has now been added to applymap() as well; passing 'ignore' propagates missing values.

df_str.applymap(len, na_action='ignore')
   0     1
0  5     3
1  4  <NA>

Multiplication and division on object-dtype Index

As with Series, a pandas Index can be added to or subtracted from without problems even if its dtype is 'object', as long as every element is numeric.

pd.Series([0, 1, 2], dtype='object') + 2

pd.Index([0, 1, 2], dtype='object') + 2
# pd.Series([0, 1, 2], dtype='object') + 2
0    2
1    3
2    4
dtype: object

# pd.Index([0, 1, 2], dtype='object') + 2
Int64Index([2, 3, 4], dtype='int64')

However, multiplication and division raised an error. This behavior differed from that of Series.

Previous behavior


pd.Series([0, 1, 2], dtype='object') * 2

pd.Index([0, 1, 2], dtype='object') * 2
# pd.Series([0, 1, 2], dtype='object') * 2
0    0
1    2
2    4
dtype: object

# pd.Index([0, 1, 2], dtype='object') * 2
TypeError: cannot perform __mul__ with this index type: Index

This has been fixed.

pd.Index([0, 1, 2], dtype='object') * 2
Int64Index([0, 2, 4], dtype='int64')
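
Division should now behave the same way, with the result dtype inferred from the values; a quick sketch of what I would expect (not taken from the release notes):

pd.Index([0, 1, 2], dtype='object') / 2
Float64Index([0.0, 0.5, 1.0], dtype='float64')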

The .explode() method now supports sets

pd.DataFrame.explode() and pd.Series.explode() are methods that unpack list and tuple elements into separate rows.

s_pack = pd.Series(['abc', list('abc'), tuple('abc')])
s_pack
0          abc
1    [a, b, c]
2    (a, b, c)
dtype: object
s_pack.explode()
0    abc
1      a
1      b
1      c
2      a
2      b
2      c
dtype: object

Until now, sets were not supported (they were left as-is even when .explode() was applied), but sets are now unpacked too. Note that, since sets are unordered, the order of the resulting elements is not guaranteed (see the output below).

pd.Series(['abc', list('abc'), set('abc')]).explode()
0    abc
1      a
1      b
1      c
2      c
2      b
2      a
dtype: object
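
pd.DataFrame.explode() works the same way; here is a minimal sketch (the column names are my own) that unpacks a single column while the other columns are repeated. As above, the order of the set elements is not guaranteed.

df_pack = pd.DataFrame({'packed': [list('ab'), set('cd')], 'other': [0, 1]})
df_pack.explode('packed')
  packed  other
0      a      0
0      b      0
1      c      1
1      d      1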

The .lookup() method is deprecated

What is pd.DataFrame.lookup()? It is easiest to show by example:

np.random.seed(0)
df_map = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))

np.random.seed(0)
df_table = (pd.DataFrame({'col': list('ABC')}).iloc[np.random.randint(0, 3, 7)]
            .assign(idx=np.random.randint(0, 10, 7)).reset_index(drop=True))

df_map
df_table

↓ Given a correspondence table like this,

           A          B          C
0    1.76405   0.400157   0.978738
1    2.24089    1.86756  -0.977278
2   0.950088  -0.151357  -0.103219
3   0.410599   0.144044    1.45427
4   0.761038   0.121675   0.443863
5   0.333674    1.49408  -0.205158
6   0.313068  -0.854096   -2.55299
7   0.653619   0.864436  -0.742165
8    2.26975   -1.45437  0.0457585
9  -0.187184    1.53278    1.46936

↓ and data like this,

  col  idx
0   A    7
1   B    6
2   A    8
3   B    8
4   B    1
5   C    6
6   A    7

↓ .lookup() is the method that picks out the value of df_map at each (idx, col) pair:

df_map.lookup(df_table['idx'], df_table['col'])
array([ 0.6536186 , -0.85409574,  2.26975462, -1.45436567,  1.86755799,
       -2.55298982,  0.6536186 ])

It is apparently slated for removal on the grounds that a dedicated method is not really needed. Without .lookup(), the above can be done, for example, in the following ways.

df_map.unstack()[df_table.set_index(['col', 'idx']).index]
df_map.unstack()[zip(df_table['col'], df_table['idx'])]
df_map.unstack()[df_table.to_records(index=False).tolist()]
df_map.unstack()[df_table.itertuples(index=False)]
# -> pd.Series

df_map.to_numpy()[df_table['idx'], df_map.columns.get_indexer(df_table['col'])]
# -> np.ndarray
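
If you want the looked-up values alongside the keys, one option (the column name 'val' is my own choice) is:

df_table.assign(val=df_map.to_numpy()[df_table['idx'],
                                      df_map.columns.get_indexer(df_table['col'])])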
