pandas 1.2.0 has been released, so here are notes on the new features I care about.
I have skipped the content I am not interested in, along with minor changes, so see the official release notes for the full list of changes.
A pandas `Index` can have duplicate labels; that is, the row and column names of data frames and series do not have to be unique. For example, the following data frame `df_dlabel` has two rows named `'a'` and two columns named `'B'`.
df_dlabel = pd.DataFrame(np.arange(9).reshape(3, 3),
index=list('aab'), columns=list('ABB'))
df_dlabel
|   | A | B | B |
|---|---|---|---|
| a | 0 | 1 | 2 |
| a | 3 | 4 | 5 |
| b | 6 | 7 | 8 |
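Whether an axis already contains duplicates can be checked up front; a minimal sketch (the checks below are standard `Index` attributes, not new in 1.2.0):

```python
import numpy as np
import pandas as pd

df_dlabel = pd.DataFrame(np.arange(9).reshape(3, 3),
                         index=list('aab'), columns=list('ABB'))

# is_unique is True only when no label repeats on that axis
print(df_dlabel.index.is_unique)       # False
print(df_dlabel.columns.is_unique)     # False

# duplicated() flags the second and later occurrences of each label
print(df_dlabel.columns.duplicated())  # [False False  True]
```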
To people who work with data it can feel unnatural that this is allowed; SQL primary keys, for example, basically forbid duplicates.
Selecting a column of a data frame by name (with `[]`, `.loc`, etc.) usually returns a series, but a duplicated column name returns a data frame. Having the return type be ambiguous like this is not good.
df_dlabel['A']
df_dlabel['B']
# df_dlabel['A'] -> pd.Series
# a    0
# a    3
# b    6
# Name: A, dtype: int32

# df_dlabel['B'] -> pd.DataFrame
#    B  B
# a  1  2
# a  4  5
# b  7  8
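If both `'B'` columns are needed individually, positional selection with `.iloc` stays unambiguous even under duplicate labels; a minimal sketch:

```python
import numpy as np
import pandas as pd

df_dlabel = pd.DataFrame(np.arange(9).reshape(3, 3),
                         index=list('aab'), columns=list('ABB'))

# column positions are unique even when labels are not
first_b = df_dlabel.iloc[:, 1]   # always a Series: [1, 4, 7]
second_b = df_dlabel.iloc[:, 2]  # always a Series: [2, 5, 8]
```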
Similarly, specifying a row name and a column name at the same time with `.loc` etc., or using `.at`, usually returns a scalar value, but when the names are duplicated a series or a data frame comes back.
df_dlabel.at['b', 'A']
df_dlabel.at['b', 'B']
df_dlabel.at['a', 'B']
# df_dlabel.at['b', 'A'] -> int
# 6

# df_dlabel.at['b', 'B'] -> pd.Series
# B    7
# B    8
# Name: b, dtype: int32

# df_dlabel.at['a', 'B'] -> pd.DataFrame
#    B  B
# a  1  2
# a  4  5
Some pandas functions and methods assume that row and column names are unique; hit one of them with duplicates and you get an error (or, if you are unlucky, a bug).
The following is an example where an error occurs when a duplicated column name is passed as an argument.
df_dlabel.merge(df_dlabel, on='A')
df_dlabel.merge(df_dlabel, on='B')
# df_dlabel.merge(df_dlabel, on='A') works:
#    A  B_x  B_x  B_y  B_y
# 0  0    1    2    1    2
# 1  3    4    5    4    5
# 2  6    7    8    7    8

# df_dlabel.merge(df_dlabel, on='B') fails:
# ValueError: The column label 'B' is not unique.
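One workaround is to make the labels unique before calling such methods, e.g. by suffixing repeated names with a counter (the recipe below is my own, not a pandas API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=list('aab'), columns=list('ABB'))

# number repeated column labels: A, B, B -> A, B, B_1
counts = df.columns.to_series().groupby(level=0).cumcount()
df.columns = [f'{c}_{i}' if i else c for c, i in zip(df.columns, counts)]
print(list(df.columns))  # ['A', 'B', 'B_1']
```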
The following is an example where an error occurs merely because the data frame itself has duplicate labels, even without passing one as an argument.
df_dlabel.reindex(list('AC'))
ValueError: cannot reindex from a duplicate axis
And the following is an example where, given a duplicated column name, the result is silently trimmed and a warning is issued instead of an error.
df_dlabel.to_dict()
UserWarning: DataFrame columns are not unique, some columns will be omitted.
{'A': {'a': 3, 'b': 6}, 'B': {'a': 5, 'b': 8}}
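Rather than relying on that implicit behavior, the duplicates can be dropped explicitly first (using a unique row index here so the dictionary keys stay distinct). Note that the boolean-mask recipe below keeps the first copy of each label, whereas `to_dict()` above kept the last:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=list('abc'), columns=list('ABB'))

# keep only the first occurrence of each column label
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['A', 'B']
```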
An `allows_duplicate_labels` flag has been added to the data frame flags, which can be inspected through the `.flags` attribute (the `.flags` attribute itself is also new in this release).
df_dlabel = (pd.DataFrame(np.arange(9).reshape(3, 3),
index=list('abc'), columns=list('ABC'))
.set_flags(allows_duplicate_labels=True))
df_dlabel.flags.allows_duplicate_labels
df_ulabel = (pd.DataFrame(np.arange(9).reshape(3, 3),
index=list('abc'), columns=list('ABC'))
.set_flags(allows_duplicate_labels=False))
df_ulabel.flags.allows_duplicate_labels
# df_dlabel: True
# df_ulabel: False
A data frame with `allows_duplicate_labels` set to `True` behaves as before, but a data frame with `False` is guaranteed to have unique row and column names and does not allow duplication. If you try to perform an operation that would create duplicate row or column names on such a data frame, a `DuplicateLabelError` is raised (at least in principle; apparently a different error may be raised instead, or the operation may silently fail).
df_dlabel.reindex(list('aab'))
df_ulabel.reindex(list('aab'))
# df_dlabel: allowed
#    A  B  C
# a  0  1  2
# a  0  1  2
# b  3  4  5

# df_ulabel: not allowed
# DuplicateLabelError: Index has duplicates.
The data frame flags can be changed with the `.set_flags()` method as shown above. Note that for a data frame that already has duplicate row or column names, `allows_duplicate_labels` cannot be set to `False`; that also raises a `DuplicateLabelError`.
(pd.DataFrame(np.arange(9).reshape(3, 3),
index=list('aab'), columns=list('ABB'))
.set_flags(allows_duplicate_labels=False))
DuplicateLabelError: Index has duplicates.
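The new exception can also be caught programmatically; under pandas 1.2+ it is exposed as `pd.errors.DuplicateLabelError` (a minimal sketch):

```python
import numpy as np
import pandas as pd

df = (pd.DataFrame(np.arange(4).reshape(2, 2),
                   index=list('ab'), columns=list('AB'))
      .set_flags(allows_duplicate_labels=False))

try:
    df.reindex(list('aab'))  # would introduce a duplicate row label
except pd.errors.DuplicateLabelError:
    result = 'rejected'
```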
In pandas, `NaN` (`np.nan`) has traditionally been treated as something like a missing value. However, `NaN` is "not-a-number", an exceptional *numeric* value, and there has been a push for a genuine "no data, missing value" marker that works consistently across non-numeric data types as well. To that end, `NA` (`pd.NA`), which represents a missing value, was introduced in version 1.0.0.
In line with this, nullable floating-point data types (`Float64Dtype` etc.) have now been added, following the nullable integer data types (`Int64Dtype` etc.) that appeared a little earlier.
Specifying `'Float64'` for the `dtype` argument gives the nullable floating-point data type (`'Float32'` etc. also exist). As with `'Int64'`, the capitalized first letter distinguishes it from the ordinary floating-point dtypes.
Nullable floating-point data types use `NA` instead of `NaN`.
s_float = pd.Series([0, 1, np.nan, np.nan], dtype='float64')
s_nfloat = pd.Series([0, 1, np.nan, pd.NA], dtype='Float64')
# float64
# 0    0.0
# 1    1.0
# 2    NaN
# 3    NaN
# dtype: float64

# Float64
# 0     0.0
# 1     1.0
# 2    <NA>
# 3    <NA>
# dtype: Float64
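Existing `float64` data can be moved to the nullable type with `astype`, which converts `NaN` to `<NA>`; a minimal sketch:

```python
import numpy as np
import pandas as pd

s_float = pd.Series([0, 1, np.nan, np.nan], dtype='float64')

# NaN entries become <NA> under the nullable dtype
s_nfloat = s_float.astype('Float64')
print(s_nfloat.dtype)         # Float64
print(s_nfloat.isna().sum())  # 2
```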
Comparison operations also give different results for `NaN` and `NA`.
s_float < 0
s_nfloat < 0
# float64
# 0    False
# 1    False
# 2    False
# 3    False
# dtype: bool

# Float64
# 0    False
# 1    False
# 2     <NA>
# 3     <NA>
# dtype: boolean
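Because of those `<NA>` entries, the nullable comparison result has to be resolved before it can serve as a filter; `.fillna()` makes the choice explicit (a minimal sketch):

```python
import numpy as np
import pandas as pd

s_nfloat = pd.Series([0, 1, np.nan, pd.NA], dtype='Float64')

# decide that 'unknown' should not pass the filter
mask = (s_nfloat < 1).fillna(False)
filtered = s_nfloat[mask]
print(len(filtered))  # 1
```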
An `Index` object has a name (the `name` attribute). It is generally thought of as the "name of the index column" (especially when a specific column has been made the index with `.set_index()`).
The following example creates an index named `'idx'`.
df_n = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list('AB'),
index=pd.Index(list('abc'), name='idx'))
df_n
idx | A | B |
---|---|---|
a | 0 | 1 |
b | 2 | 3 |
c | 4 | 5 |
When concatenating data frames, the index name used to disappear whenever the indexes did not line up exactly, even though the names themselves matched.
Previous behavior:
ct = pd.concat([df_n.iloc[:2, :1], df_n.iloc[1:, 1:]], axis=1)
ct
|   | A | B |
|---|---|---|
| a | 0 | NaN |
| b | 2 | 3 |
| c | NaN | 5 |
This has been fixed so that the name is preserved whenever possible (it is obviously a bug fix, but for some reason it is presented as a new feature).
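Under pandas 1.2 and later, the same concatenation keeps the shared name; a quick check:

```python
import numpy as np
import pandas as pd

df_n = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list('AB'),
                    index=pd.Index(list('abc'), name='idx'))

# both pieces carry the index name 'idx', so the union index keeps it
ct = pd.concat([df_n.iloc[:2, :1], df_n.iloc[1:, 1:]], axis=1)
print(ct.index.name)  # idx
```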
`applymap()` gains an `na_action` argument
`pd.DataFrame.applymap()` is a method that applies a given function to every element of a data frame, one element at a time.
df_str = pd.DataFrame([['where', 'why'], ['when', 'who']], dtype='string')
df_str.applymap(len)
|   | 0 | 1 |
|---|---|---|
| 0 | 5 | 3 |
| 1 | 4 | 3 |
You may want a missing value `NA` to propagate as it is, without the function being applied to it.
df_str = pd.DataFrame([['where', 'why'], ['when', pd.NA]], dtype='string')
df_str.applymap(len)
TypeError: object of type 'NAType' has no len()
Until now, you had to either build a conditional branch like "if it is `NA`, return `NA`" into the function itself, or apply `pd.Series.map()`, which has an option to propagate missing values, series by series.
df_str.apply(lambda s: s.map(len, na_action='ignore'))
|   | 0 | 1 |
|---|---|---|
| 0 | 5 | 3 |
| 1 | 4 | <NA> |
An `na_action` argument has now been added to `applymap()` as well, and missing values are propagated by passing `'ignore'`.
df_str.applymap(len, na_action='ignore')
|   | 0 | 1 |
|---|---|---|
| 0 | 5 | 3 |
| 1 | 4 | <NA> |
Arithmetic on object-dtype `Index`
As with series, a pandas index can be added to and subtracted from without problems even when its dtype is `'object'`, as long as every element is a number.
pd.Series([0, 1, 2], dtype='object') + 2
pd.Index([0, 1, 2], dtype='object') + 2
# pd.Series([0, 1, 2], dtype='object') + 2
# 0    2
# 1    3
# 2    4
# dtype: object

# pd.Index([0, 1, 2], dtype='object') + 2
# Int64Index([2, 3, 4], dtype='int64')
However, multiplication and division raised an error, behaving differently from series.
Previous behavior
pd.Series([0, 1, 2], dtype='object') * 2
pd.Index([0, 1, 2], dtype='object') * 2
# pd.Series([0, 1, 2], dtype='object') * 2
# 0    0
# 1    2
# 2    4
# dtype: object

# pd.Index([0, 1, 2], dtype='object') * 2
# TypeError: cannot perform __mul__ with this index type: Index
This has been fixed.
pd.Index([0, 1, 2], dtype='object') * 2
Int64Index([0, 2, 4], dtype='int64')
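A quick check that the index and series results now agree (pandas 1.2+):

```python
import pandas as pd

idx = pd.Index([0, 1, 2], dtype='object')
ser = pd.Series([0, 1, 2], dtype='object')

# multiplication on an object-dtype Index now matches the Series result
doubled_idx = idx * 2
doubled_ser = ser * 2
assert list(doubled_idx) == list(doubled_ser) == [0, 2, 4]
```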
`.explode()` now supports sets
`pd.DataFrame.explode()` and `pd.Series.explode()` are methods that unpack list and tuple elements into separate rows.
s_pack = pd.Series(['abc', list('abc'), tuple('abc')])
s_pack
0          abc
1    [a, b, c]
2    (a, b, c)
dtype: object
s_pack.explode()
0    abc
1      a
1      b
1      c
2      a
2      b
2      c
dtype: object
Until now sets were not supported (they were left untouched even when `.explode()` was applied), but sets are now unpacked as well.
pd.Series(['abc', list('abc'), set('abc')]).explode()
0    abc
1      a
1      b
1      c
2      c
2      b
2      a
dtype: object
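The same applies column-wise on a data frame; a minimal sketch with a hypothetical `tags` column of sets:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'tags': [{'x', 'y'}, {'z'}]})

# one output row per set element; the other columns are repeated
exploded = df.explode('tags')
print(len(exploded))  # 3
```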
The `.lookup()` method is deprecated
What does `pd.DataFrame.lookup()` do?
np.random.seed(0)
df_map = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
np.random.seed(0)
df_table = (pd.DataFrame({'col': list('ABC')}).iloc[np.random.randint(0, 3, 7)]
.assign(idx=np.random.randint(0, 10, 7)).reset_index(drop=True))
df_map
df_table
↓ Given a table of values like this,
|   | A | B | C |
|---|---|---|---|
| 0 | 1.76405 | 0.400157 | 0.978738 |
| 1 | 2.24089 | 1.86756 | -0.977278 |
| 2 | 0.950088 | -0.151357 | -0.103219 |
| 3 | 0.410599 | 0.144044 | 1.45427 |
| 4 | 0.761038 | 0.121675 | 0.443863 |
| 5 | 0.333674 | 1.49408 | -0.205158 |
| 6 | 0.313068 | -0.854096 | -2.55299 |
| 7 | 0.653619 | 0.864436 | -0.742165 |
| 8 | 2.26975 | -1.45437 | 0.0457585 |
| 9 | -0.187184 | 1.53278 | 1.46936 |
↓ and a list of positions like this,
|   | col | idx |
|---|---|---|
| 0 | A | 7 |
| 1 | B | 6 |
| 2 | A | 8 |
| 3 | B | 8 |
| 4 | B | 1 |
| 5 | C | 6 |
| 6 | A | 7 |
↓ it is a method that picks out the value at each (idx, col) position in one call:
df_map.lookup(df_table['idx'], df_table['col'])
array([ 0.6536186 , -0.85409574, 2.26975462, -1.45436567, 1.86755799, -2.55298982, 0.6536186 ])
It is apparently slated for removal because such a method is rarely needed. Without `.lookup()`, the example above can be written, for instance, as follows.
df_map.unstack()[df_table.set_index(['col', 'idx']).index]
df_map.unstack()[zip(df_table['col'], df_table['idx'])]
df_map.unstack()[df_table.to_records(index=False).tolist()]
df_map.unstack()[df_table.itertuples(index=False)]
# -> pd.Series
df_map.to_numpy()[df_table['idx'], df_map.columns.get_indexer(df_table['col'])]
# -> np.ndarray
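The ndarray approach in the last line generalizes to arbitrary row labels via `get_indexer`; wrapped as a small helper (the name `lookup` here is my own, and it assumes unique row and column labels):

```python
import numpy as np
import pandas as pd

def lookup(df, row_labels, col_labels):
    """Per-(row, col) value lookup, a stand-in for the deprecated
    DataFrame.lookup(); assumes unique row and column labels."""
    return df.to_numpy()[df.index.get_indexer(row_labels),
                         df.columns.get_indexer(col_labels)]

df = pd.DataFrame([[10, 20], [30, 40]], index=list('ab'), columns=list('XY'))
vals = lookup(df, ['b', 'a', 'b'], ['X', 'Y', 'Y'])
print(vals)  # [30 20 40]
```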