[PYTHON] Features of pd.NA in pandas 1.0.0 (rc0)

Update 1 (2020-01-25): Added a note that the behavior that looked like a bug was in fact a bug.

pandas 1.0.0rc0 was released as of 2020-01-13, and one of its major features is the introduction of pd.NA as a missing value. This post summarizes its properties and how to use it.

Disclaimer: Everything here was verified with pandas 1.0.0rc0, and it may well change in later releases.

The verification environment is described at the end: [Verification Environment](#Verification Environment).

Summary

- `pd.NA` is introduced to represent a missing value.
- `pd.NA` can be used with IntegerArray, BooleanArray, and StringArray.
- With `pd.NA`, missing values can be represented in integer data as well (no more silent conversion to float).
- `pd.NA` is a singleton object and is used consistently across all data types.
- Every comparison operator applied to `pd.NA` returns `pd.NA` (the same behavior as Julia's `missing` and R's `NA`).
- Logical operators follow so-called three-valued logic.
- In `pd.read_csv()`, NA is recognized when `Int64`, `string`, or `boolean` is specified as the dtype (`boolean` does not work in rc0; the issue is being addressed). ~~`Int64` and `string` can be specified, but `boolean` results in an error. It is unclear whether this behavior is a bug or intended (probably intended).~~

data type

pandas introduces a new class called NAType. Its purpose is to mark a value as missing.

>>> import pandas as pd
>>> pd.NA
<NA>

>>> type(pd.NA)
<class 'pandas._libs.missing.NAType'>
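A few basic properties of `pd.NA` can be checked directly. The following is a minimal sketch, assuming pandas 1.0 or later is installed; the variable names are illustrative only and do not come from the pandas documentation.

```python
import pandas as pd

# pd.NA is recognised as missing by the usual detection helper.
is_missing = pd.isna(pd.NA)

# Comparisons do not return True/False; they propagate NA.
eq_result = pd.NA == 1
self_eq_result = pd.NA == pd.NA

# The truth value of NA is undefined, so using it in a boolean
# context raises TypeError instead of silently picking a side.
try:
    bool(pd.NA)
    truthiness_error = None
except TypeError as exc:
    truthiness_error = exc

print(is_missing, eq_result, self_eq_result, type(truthiness_error).__name__)
```

Because `pd.NA == pd.NA` is itself NA, missing values should be detected with `pd.isna()` rather than with `==`.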

In pd.Series and pd.DataFrame, pd.NA is stored with the object dtype if no dtype is specified, and with the given dtype if one is specified. `Int64Dtype` is the nullable integer dtype (an ExtensionDtype for int64 integer data; an array of integer values with optional missing values) introduced in pandas 0.24. Note that the dtype must be written with an uppercase I, as `Int64` rather than `int64`. Technically, it was the introduction of pandas Extension Arrays that made ExtensionDtype usable.

>>> pd.Series([pd.NA]).dtype
dtype('O') # O means Object

# The dtype can be given either as a string alias or as the dtype object itself. Here it is given as a string.
>>> pd.Series([pd.NA], dtype="Int64").dtype
Int64Dtype()

>>> pd.Series([pd.NA], dtype="boolean").dtype
BooleanDtype

>>> pd.Series([pd.NA], dtype="string").dtype
StringDtype
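The practical effect of the nullable integer dtype is easiest to see next to the default behavior. The sketch below assumes pandas 1.0 or later; `s_default`, `s_alias`, and `s_instance` are names made up for this illustration.

```python
import pandas as pd

# Without a dtype, a missing entry forces the integers to float64.
s_default = pd.Series([1, 2, None])

# With the nullable dtype (capital "I"), the integers stay integers
# and the missing entry is stored as pd.NA.
s_alias = pd.Series([1, 2, None], dtype="Int64")

# The dtype can equally be passed as an instance instead of a string alias.
s_instance = pd.Series([1, 2, None], dtype=pd.Int64Dtype())

print(s_default.dtype)
print(s_alias.dtype)
print(s_alias.dtype == s_instance.dtype)
```

The string alias and the dtype instance are interchangeable, so which one to use is a matter of taste.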

The implementation of NAType can be found here:

https://github.com/pandas-dev/pandas/blob/493363ef60dd9045888336b5c801b2a3d00e976d/pandas/_libs/missing.pyx#L335-L485

Interestingly, the hash value is defined as 2 ** 61 - 1 == 2305843009213693951. This causes no problem, because the integer with that value does not collide with pd.NA as a dictionary key. This is not specific to pd.NA, but the hash of a Python integer wraps around at 2 ** 61 - 1 (on a 64-bit build), so the integer 2305843009213693951 itself hashes to 0.

>>> hash(pd.NA) == 2 ** 61 -1
True

>>> {pd.NA: "a", 2305843009213693951: "b"}
{<NA>: 'a', 2305843009213693951: 'b'}

>>> (hash(2**61 - 2), hash(2**61 - 1), hash(2**61))
(2305843009213693950, 0, 1)
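The wrap-around of integer hashes can be checked with the standard library alone. This is a small sketch that reads the modulus from `sys.hash_info` rather than hard-coding 2 ** 61 - 1, so that it also holds on builds with a different modulus; the variable names are illustrative.

```python
import sys

# CPython reduces integer hashes modulo a fixed prime;
# on 64-bit builds this prime is 2 ** 61 - 1.
modulus = sys.hash_info.modulus

# Non-negative integers below the modulus hash to themselves, while the
# modulus itself wraps around to 0 and modulus + 1 wraps to 1.
wrapped = (hash(modulus - 1), hash(modulus), hash(modulus + 1))

print(modulus, wrapped)
```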

Types in pd.Series, pd.DataFrame

To get the nullable integer type, the dtype must be specified as `Int64` (uppercase I), not `int64`.

>>> pd.Series([1, 2]) + pd.Series([pd.NA, pd.NA])
0    <NA>
1    <NA>
dtype: object

>>> pd.Series([1, 2]) + pd.Series([pd.NA, pd.NA], dtype="Int64")
0    <NA>
1    <NA>
dtype: Int64

Specifying `int64` results in an error.

>>> pd.Series([pd.NA], dtype="int64").dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/pandas/core/series.py", line 304, in __init__
    data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/construction.py", line 438, in sanitize_array
    subarr = _try_cast(data, dtype, copy, raise_cast_failure)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/construction.py", line 535, in _try_cast
    subarr = maybe_cast_to_integer_array(arr, dtype)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1502, in maybe_cast_to_integer_array
    casted = np.array(arr, dtype=dtype, copy=copy)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'

Logical operations with booleans behave the same way as Julia's missing and R's NA.

>>> pd.Series([True, False, pd.NA]) & True
0     True
1    False
2       NA
dtype: bool

>>> pd.Series([True, False, pd.NA]) | True
0     True
1     True
2     True
dtype: bool

>>> pd.NA & True
<NA>

>>> pd.NA & False
False

>>> pd.NA | True
True

>>> pd.NA | False
<NA>

>>> pd.Series([1, 2, pd.NA], dtype="Int64")
0       1
1       2
2    <NA>
dtype: Int64
>>> pd.Series([True, False, pd.NA], dtype="boolean")
0     True
1    False
2     <NA>
dtype: boolean
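The three-valued logic can be laid out as a small truth table. The sketch below assumes pandas 1.0 or later and uses the nullable `boolean` dtype explicitly; the column labels are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([True, False, pd.NA], dtype="boolean")

# AND: False decides the result on its own; otherwise NA propagates.
and_true = s & True
and_false = s & False

# OR: True decides the result on its own; otherwise NA propagates.
or_true = s | True
or_false = s | False

table = pd.DataFrame(
    {
        "x": s,
        "x & True": and_true,
        "x & False": and_false,
        "x | True": or_true,
        "x | False": or_false,
    }
)
print(table)
```

The rule of thumb is that the result is NA only when the known operand cannot decide the outcome by itself.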

In a sum, NA propagates, but `pd.Series.sum()` called with no arguments skips NA (treating it as 0) and does not propagate it. To propagate NA, you need to specify `sum(skipna=False)`. However, with the `Int64` dtype, np.nan was returned instead of NA. ~~I searched the issues to see whether this was the expected behavior or a bug, but could not tell, so I boldly filed an issue.~~ The issue I filed was accepted, and the fix appears to be included in rc1.

>>> sum([1, pd.NA])
<NA>
# pd.Series object
>>> pd.Series([1, pd.NA])
0       1
1    <NA>
dtype: object

>>> pd.Series([1, pd.NA]).sum()
1
>>> pd.Series([1, pd.NA]).sum(skipna=False)
<NA>
# pd.Series Int64
>>> pd.Series([1, pd.NA], dtype='Int64')
0       1
1    <NA>
dtype: Int64

>>> pd.Series([1, pd.NA], dtype='Int64').sum()
1
>>> pd.Series([1, pd.NA], dtype='Int64').sum(skipna=False)
nan
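The two modes of `sum()` can be compared in a way that does not depend on whether a given pandas version returns np.nan or pd.NA for the propagated result. This is a minimal sketch under that assumption; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, pd.NA], dtype="Int64")

# Default (skipna=True): the missing entry is ignored.
total_skipping = s.sum()

# skipna=False: the missing entry propagates to the result.
total_propagating = s.sum(skipna=False)

# pd.isna() is true for both pd.NA and np.nan, so this check works
# regardless of which of the two is returned.
print(total_skipping, pd.isna(total_propagating))

# The built-in sum() has no skipna option and always propagates NA.
builtin_total = sum([1, pd.NA])
print(builtin_total)
```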

pow function

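The handling of exponentiation is consistent with R's NA_integer_. Julia behaves differently, returning missing in every case; I do not know the reason. Note that in Python, R, and Julia alike, exponentiation binds more tightly than unary minus, so `-1 ** pd.NA` is evaluated as `-(1 ** pd.NA)`.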

>>> pd.NA ** 0
1
>>> 1 ** pd.NA
1
>>> -1 ** pd.NA
-1

> R.version.string
[1] "R version 3.6.1 (2019-07-05)"

> NA_integer_ ^ 0L
[1] 1
> 1L ^ NA_integer_
[1] 1
> -1L ^ NA_integer_
[1] -1

julia> VERSION
v"1.3.1"

julia> missing ^ 0
missing

julia> 1 ^ missing
missing

julia> -1 ^ missing
missing
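In Python, `**` has higher precedence than unary minus, which affects how the third expression is read. The sketch below assumes pandas 1.0 or later; the result of the parenthesized form is not shown in the original post and may differ between pandas versions, so no expected value is given for it.

```python
import pandas as pd

# Anything to the power 0 is 1, and 1 to any power is 1,
# so these results are known even though one operand is missing.
zero_power = pd.NA ** 0
one_base = 1 ** pd.NA

# ** binds more tightly than unary minus, so this is -(1 ** pd.NA).
unparenthesized = -1 ** pd.NA

# Parenthesizing the base asks a different question: (-1) ** NA.
# The outcome is deliberately not asserted here (version dependent).
parenthesized = (-1) ** pd.NA

print(zero_power, one_base, unparenthesized, parenthesized)
```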

Specifying the dtype in read_csv

The experiments below use the following CSV file (test.csv).

X_int,X_bool,X_string
1,True,"a"
2,False,"b"
NA,NA,"NA"

If dtype is not specified, the behavior is the same as pandas 0.25.3.

>>> df1 = pd.read_csv("test.csv")
>>> df1
   X_int X_bool X_string
0    1.0   True        a
1    2.0  False        b
2    NaN    NaN      NaN
>>> df1.dtypes
X_int       float64
X_bool       object
X_string     object
dtype: object

`Int64` and `string` can be specified as the dtype.

# The dtype can also be given as dtype objects instead of string aliases:
# df2 = pd.read_csv("test.csv", dtype={'X_int': pd.Int64Dtype(), 'X_string': pd.StringDtype()})
>>> df2 = pd.read_csv("test.csv", dtype={'X_int': 'Int64', 'X_string': 'string'})
>>> df2
   X_int X_bool X_string
0      1   True        a
1      2  False        b
2   <NA>    NaN     <NA>
>>> df2.dtypes
X_int        Int64
X_bool      object
X_string    string
dtype: object
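The same read can be reproduced without a file on disk. The sketch below assumes pandas 1.0 or later and feeds the contents of test.csv through `io.StringIO`; `csv_text` is a name introduced only for this illustration.

```python
import io

import pandas as pd

# Same content as test.csv, held in memory so the snippet is self-contained.
csv_text = 'X_int,X_bool,X_string\n1,True,"a"\n2,False,"b"\nNA,NA,"NA"\n'

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"X_int": "Int64", "X_string": "string"},
)

# X_int keeps integer values and marks the last row as missing,
# instead of being converted to float64.
print(df.dtypes)
print(df["X_int"].isna().tolist())
```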

On the other hand, reading NA as boolean fails whether `'boolean'` or `pd.BooleanDtype()` is specified. Naturally, specifying `'bool'` is also an error. I reported this as an issue and the fix was accepted. It seems to work fine in rc1.

>>> df3 = pd.read_csv("test.csv", dtype={'X_bool': 'boolean'})
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/usr/local/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 232, in _from_sequence_of_strings
    raise AbstractMethodError(cls)
pandas.errors.AbstractMethodError: This method must be defined in the concrete class type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
NotImplementedError: Extension Array: <class 'pandas.core.arrays.boolean.BooleanArray'> must implement _from_sequence_of_strings in order to be used in parser methods
>>> df3 = pd.read_csv("test.csv", dtype={'X_bool': pd.BooleanDtype()})
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/usr/local/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 232, in _from_sequence_of_strings
    raise AbstractMethodError(cls)
pandas.errors.AbstractMethodError: This method must be defined in the concrete class type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
NotImplementedError: Extension Array: <class 'pandas.core.arrays.boolean.BooleanArray'> must implement _from_sequence_of_strings in order to be used in parser methods
>>> df3 = pd.read_csv("test.csv", dtype={'X_bool': 'bool'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1231, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 1
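Whether the nullable boolean read works therefore depends on the installed version. The following is a hedged sketch that tries the read and reports the outcome either way: on 1.0.0rc0 it should hit the NotImplementedError shown in the tracebacks above, and on versions that include the fix it is expected to return a `boolean` column. The names `csv_text` and `outcome` are illustrative.

```python
import io

import pandas as pd

csv_text = "X_bool\nTrue\nFalse\nNA\n"

try:
    df = pd.read_csv(io.StringIO(csv_text), dtype={"X_bool": "boolean"})
    # Reached only when the parser supports the nullable boolean dtype.
    outcome = str(df["X_bool"].dtype)
except NotImplementedError as exc:
    # The failure mode reported for pandas 1.0.0rc0.
    outcome = f"not supported: {exc}"

print(outcome)
```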

(Impression) The root cause of all this is that NumPy has no missing value of its own. Introducing one at the pandas layer therefore leads to various inconsistencies. The background is discussed here: https://dev.pandas.io/docs/user_guide/gotchas.html#why-not-make-numpy-like-r

How I learned about this in the first place

I noticed it through someone's retweet on Twitter.

https://mobile.twitter.com/jorisvdbossche/status/1208476049690046465

Summary

- `pd.NA` is introduced to represent a missing value.
- `pd.NA` can be used with IntegerArray, BooleanArray, and StringArray.
- With `pd.NA`, missing values can be represented in integer data as well (no more silent conversion to float).
- `pd.NA` is a singleton object and is used consistently across all data types.
- Every comparison operator applied to `pd.NA` returns `pd.NA` (the same behavior as Julia's `missing` and R's `NA`).
- Logical operators follow so-called three-valued logic.
- In `pd.read_csv()`, `Int64` and `string` can be specified as the dtype, but `boolean` raises an error in rc0. This was reported as an issue and appears to work in rc1.

Finally

If you enjoy this kind of geeky deep dive, please come visit us at justInCase. https://www.wantedly.com/companies/justincase


Verification environment

I verified everything in Docker.

FROM python:3.7.6
WORKDIR /home
RUN pip install pandas==1.0.0rc0
CMD ["/bin/bash"]

$ docker build -t pdna .
$ docker run -it --rm -v $(pwd):/home/ pdna

Inside Docker

root@286578c2496b:/home# cat /etc/issue
Debian GNU/Linux 10 \n \l
root@286578c2496b:/home# uname -a
Linux 286578c2496b 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 GNU/Linux
root@286578c2496b:/home# python -c "import pandas as pd; pd.show_versions()"

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.9.184-linuxkit
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.0rc0
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 44.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
