A little scrutiny of pandas 1.0 and dask

background

Pandas 1.0.0 was released on January 29, 2020 (congratulations!). As of 2020-02-14, it is at 1.0.1.

Personally, I think the following changes are the important ones:

- pandas' own `NA` value
- the experimental `string` dtype
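As a minimal illustration of these two features (plain pandas 1.0, nothing specific to this article's setup):

>>> import pandas as pd
>>> # pd.NA propagates through comparisons instead of silently becoming False
>>> pd.Series([1, 2, None], dtype='Int64') > 1
0    False
1     True
2     <NA>
dtype: boolean

>>> # the experimental string dtype keeps pd.NA through string operations
>>> pd.Series(['a', None], dtype='string').str.upper()
0       A
1    <NA>
dtype: string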

Well.

When analyzing, I often use libraries such as dask and intake together with pandas.

In particular, I want to sort out dask's support status for pandas 1.0 and some of its detailed behavior. The dask version is 2.10.1 as of 2020-02-14.

As for intake, I assume there is no problem as long as dask supports pandas 1.0. (Besides, there are times when waiting for dask to finish processing leaves me with free time.)


What I'm interested in

- Can dask handle `pandas.NA` properly? (related to 1.0)
- Can dask handle `dtype: string` properly? (related to 1.0)
- Can I/O, especially fastparquet, read and write `pandas.NA` and `dtype: string` properly? (related to 1.0)
- And, while I'm at it, can dask handle `dtype: categorical` properly? (not 1.0-specific)

Tom Augspurger seems to be working furiously on pandas 1.0 support, so my expectations are high.

result

For those who just want the results:

- dask can do arithmetic and string operations even when `pandas.NA` is present.
- dask cannot `set_index` on the extension types such as `Int64` and `string`.
- Neither pandas nor dask can index-filter on a `boolean` column that contains `pandas.NA`.
- In dask, `apply(meta='string')` still yields an `object` column, but it can be recovered with `astype('string')`.
- When using `pandas.Categorical` in dask, filtering and aggregation seem to work fine.
- To add a new Categorical column to a dask DataFrame, you apparently need to use `astype`.
- dask cannot `to_parquet` the `Int64` and `string` types (with engine=fastparquet).

Setting up the verification environment

For the time being, prepare a clean verification environment. The OS is macOS Catalina 10.15.2.

As for the Python version, pandas only specifies a minimum, and dask appears to support Python 3.8, so 3.7.4 should be fine.

For dependencies, I include pandas' optional dependencies at their minimum supported versions. However, fastparquet and PyArrow have a known problem coexisting on macOS, so I leave PyArrow out just in case; I don't use it here anyway.

Verification work is done in JupyterLab.

pyenv virtualenv 3.7.4 pandas100
pyenv shell pandas100
pip install -r requirements.txt

requirements.txt


pandas==1.0.1
dask[complete]==2.10.1
fastparquet==0.3.3

jupyterlab==1.2.6

numpy==1.18.1
pytz==2019.3
python-dateutil==2.8.1
numexpr==2.7.1
beautifulsoup4==4.8.2
gcsfs==0.6.0
lxml==4.5.0
matplotlib==3.1.3
numba==0.48.0
openpyxl==3.0.3
pymysql==0.9.3
tables==3.6.1
s3fs==0.4.0
scipy==1.4.1
sqlalchemy==1.3.13
xarray==0.15.0
xlrd==1.2.0
xlsxwriter==1.2.7
xlwt==1.3.0
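Before starting, a quick sanity check that the pinned versions were actually installed (a minimal sketch; the expected values are just the pins above):

>>> import pandas, dask, fastparquet
>>> pandas.__version__, dask.__version__, fastparquet.__version__
('1.0.1', '2.10.1', '0.3.3')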

dask vs pandas 1.0

First, check pd.NA.

Check the behavior of pd.NA for each dtype

| s = ... | type(s.loc[3]) |
| --- | --- |
| `pandas.Series([1,2,3,None], dtype='int')` | TypeError |
| `pandas.Series([1,2,3,pandas.NA], dtype='int')` | TypeError |
| `pandas.Series([1,2,3,None], dtype='Int64')` | pandas._libs.missing.NAType |
| `pandas.Series([1,2,3,None], dtype='float')` | numpy.float64 |
| `pandas.Series([1,2,3,pandas.NA], dtype='float')` | TypeError |
| `pandas.Series([1,2,3,None], dtype='Int64').astype('float')` | numpy.float64 |
| `pandas.Series(['a', 'b', 'c', None], dtype='string')` | pandas._libs.missing.NAType |
| `pandas.Series(['a', 'b', 'c', None], dtype='object').astype('string')` | pandas._libs.missing.NAType |
| `pandas.Series([True, False, True, None], dtype='boolean')` | pandas._libs.missing.NAType |
| `pandas.Series([1, 0, 1, None], dtype='float').astype('boolean')` | pandas._libs.missing.NAType |
| `pandas.Series(pandas.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', None]))` | pandas._libs.tslibs.nattype.NaTType |
| `pandas.Series(pandas.to_timedelta(['00:00:01', '00:00:02', '00:00:03', None]))` | pandas._libs.tslibs.nattype.NaTType |
| `pandas.Series([object(), object(), object(), None], dtype='object')` | NoneType |
| `pandas.Series([object(), object(), object(), pandas.NA], dtype='object')` | pandas._libs.missing.NAType |

In summary:

- dtype `int` cannot hold pandas.NA (you just get a TypeError)
- dtypes `Int64`, `string`, and `boolean` hold pandas.NA
- dtype `float` gets numpy.NaN
- dtypes `datetime64` and `timedelta64` get NaT
- dtype `object` does not automatically convert None to pandas.NA
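To reproduce the table above, a small probe helper like the following can be used (a sketch; `probe` is just a name made up here):

>>> import pandas
>>> def probe(make_series):
...     """Build the Series and report the type of the slot holding the missing value."""
...     try:
...         return type(make_series().loc[3]).__name__
...     except TypeError:
...         return 'TypeError'
>>> probe(lambda: pandas.Series([1, 2, 3, None], dtype='Int64'))
'NAType'
>>> probe(lambda: pandas.Series([1, 2, 3, None], dtype='float'))
'float64'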

Now let's investigate what happens when these go through dask.dataframe.from_pandas.

>>> import pandas
>>> import dask.dataframe

>>> df = pandas.DataFrame({'i': [1,2,3,4], 
...                        'i64': pandas.Series([1,2,3,None], dtype='Int64'), 
...                        's': pandas.Series(['a', 'b', 'c' ,None], dtype='string'), 
...                        'f': pandas.Series([1,2,3,None], dtype='Int64').astype('float')})

>>> ddf = dask.dataframe.from_pandas(df, npartitions=1)
>>> df
	i	i64	s	f
0	1	1	a	1.0
1	2	2	b	2.0
2	3	3	c	3.0
3	4	<NA>	<NA>	NaN

>>> ddf
Dask DataFrame Structure:
	i	i64	s	f
npartitions=1				
0	int64	Int64	string	float64
3	...	...	...	...

Indeed, `Int64` stays `Int64` on the dask side. The same is true for `string`.

>>> # integer arithmetic on an Int64 column
>>> df.i64 * 2
0       2
1       4
2       6
3    <NA>
Name: i64, dtype: Int64

>>> (ddf.i64 * 2).compute()
0       2
1       4
2       6
3    <NA>
Name: i64, dtype: Int64

`Int64` -> `Int64` processing works fine.

>>> # floating-point arithmetic with an Int64 column
>>> df.i64 - df.f
0    0.0
1    0.0
2    0.0
3    NaN
dtype: float64

>>> (ddf.i64 - ddf.f).compute()
0    0.0
1    0.0
2    0.0
3    NaN
dtype: float64

`Int64` -> `float64` processing also works properly.

>>> # set_index on an Int64 column containing pandas.NA
>>> df.set_index('i64')
	i	s	f	i64_result	i64-f
i64					
1	1	a	1.0	2	0.0
2	2	b	2.0	4	0.0
3	3	c	3.0	6	0.0
NaN	4	<NA>	NaN	<NA>	NaN

>>> ddf.set_index('i64').compute()
TypeError: data type not understood

>>> # what happens without pandas.NA?
>>> ddf['i64_nonnull'] = ddf.i64.fillna(1)
... ddf.set_index('i64_nonnull').compute()
TypeError: data type not understood

What!? dask cannot set_index on an `Int64` column! pandas can do it, of course.

>>> # set_index on a string column containing pandas.NA
>>> df.set_index('s')
	i	i64	f
s			
a	1	1	1.0
b	2	2	2.0
c	3	3	3.0
NaN	4	<NA>	NaN

>>> ddf.set_index('s').compute()
TypeError: Cannot perform reduction 'max' with string dtype

>>> # what happens without pandas.NA?
>>> ddf['s_nonnull'] = ddf.s.fillna('a')
... ddf.set_index('s_nonnull')
TypeError: Cannot perform reduction 'max' with string dtype

`string` doesn't work either. As it stands, this is unusable for my purposes (a possible workaround is sketched below).
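If you can live without the extension dtype on the index, one workaround (a sketch, assuming the missing values are filled first) is to cast back to a plain numpy/object dtype before set_index:

>>> # assumption: NA has been filled, so a plain numpy dtype is safe
>>> ddf['i64_np'] = ddf.i64.fillna(1).astype('int64')
>>> ddf.set_index('i64_np').compute()  # int64 is a classic dtype, so this should work

>>> # same idea for strings: fall back to the old object dtype
>>> ddf['s_obj'] = ddf.s.fillna('a').astype('object')
>>> ddf.set_index('s_obj').compute()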

>>> # try the .str accessor
>>> df.s.str.startswith('a')
0     True
1    False
2    False
3     <NA>
Name: s, dtype: boolean

>>> ddf.s.str.startswith('a').compute()
0     True
1    False
2    False
3     <NA>
Name: s, dtype: boolean

Hmmm, this works.

>>> # filter with a boolean column containing pandas.NA
>>> df[df.s.str.startswith('a')]
ValueError: cannot mask with array containing NA / NaN values

>>> # is pandas.NA the problem?
>>> df['s_nonnull'] = df.s.fillna('a')
... df[df.s_nonnull.str.startswith('a')]
	i	i64	s	f	i64_nonnull	s_nonnull
0	1	1	a	1.0	1	a
3	4	<NA>	<NA>	NaN	1	a

>>> ddf[ddf.s.str.startswith('a')].compute()
ValueError: cannot mask with array containing NA / NaN values

>>> ddf['s_nonnull'] = ddf.s.fillna('a')
... ddf[ddf.s_nonnull.str.startswith('a')].compute()
	i	i64	s	f	i64_nonnull	s_nonnull
0	1	1	a	1.0	1	a
3	4	<NA>	<NA>	NaN	1	a

What!? You can't filter when pandas.NA is present? This is no good!
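The usual workaround (assuming it is acceptable to treat NA as "no match") is to fill the mask before using it, a minimal sketch:

>>> # fill the NA slots in the boolean mask so indexing is unambiguous
>>> df[df.s.str.startswith('a').fillna(False)]
>>> # the same idea should work on the dask side
>>> ddf[ddf.s.str.startswith('a').fillna(False)].compute()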

>>> # try specifying meta='Int64' in apply
>>> ddf['i10'] = ddf.i.apply(lambda v: v * 10, meta='Int64')
>>> ddf
Dask DataFrame Structure:
	i	i64	s	f	i64_nonnull	s_nonnull	i10
npartitions=1							
0	int64	Int64	string	float64	Int64	string	int64
3	...	...	...	...	...	...	...

>>> # try specifying meta='string' in apply
>>> ddf['s_double'] = ddf.s.apply(lambda v: v+v, meta='string')
>>> ddf
Dask DataFrame Structure:
	i	i64	s	f	i64_nonnull	s_nonnull	i10	s_double
npartitions=1								
0	int64	Int64	string	float64	Int64	string	int64	object
3	...	...	...	...	...	...	...	...

>>> # try astype('string')
>>> ddf['s_double'] = ddf['s_double'].astype('string')
>>> ddf
Dask DataFrame Structure:
	i	i64	s	f	i64_nonnull	s_nonnull	i10	s_double
npartitions=1								
0	int64	Int64	string	float64	Int64	string	int64	string
3	...	...	...	...	...	...	...	...

So specifying it via meta= just isn't reflected? It can be recovered with astype, but that's a hassle...
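In practice, the recovery can be chained right onto the apply (same behavior as the two steps above, just in one line):

>>> # apply and immediately restore the intended extension dtype
>>> ddf['s_double'] = ddf.s.apply(lambda v: v + v, meta='string').astype('string')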

result

- Arithmetic is OK.
- In dask, these columns cannot be used as an index (the pandas.NA-capable types cannot be set as the index in the first place).
- Filtering on NA-containing boolean columns fails in both pandas and dask!
- `.apply(meta='string')` and the like are ignored; you have to astype afterwards.

dask vs pandas.Categorical

To investigate Categorical in pandas, this time I'll use the CategoricalDtype approach. The basic usage of CategoricalDtype is:

  1. Instantiate it, specifying categories and ordered
  2. Create a pandas.Series with dtype= set to the CategoricalDtype instance

Sample code below.

>>> # first, create a CategoricalDtype
>>> int_category = pandas.CategoricalDtype(categories=[1,2,3,4,5], 
...                                        ordered=True)
>>> int_category
CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

>>> int_category.categories
Int64Index([1, 2, 3, 4, 5], dtype='int64')

>>> # then make a pandas.Series with it
>>> int_series = pandas.Series([1,2,3], dtype=int_category)
>>> int_series
0    1
1    2
2    3
dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

>>> # at construction time, values not in the categories become NaN
>>> int_series = pandas.Series([1,2,3,6], dtype=int_category)
>>> int_series
0      1
1      2
2      3
3    NaN
dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

>>> # after construction, assigning an out-of-category value raises
>>> int_series.loc[3] = 10
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
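As the error message suggests, registering the category first makes the assignment legal (a minimal sketch using the standard `.cat` accessor):

>>> # register the new category first, as the error message advises
>>> int_series = int_series.cat.add_categories([10])
>>> int_series.loc[3] = 10  # now the assignment is accepted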

Next, try using Categorical on dask.

>>> import pandas
>>> import dask.dataframe
>>> # create a pandas.DataFrame
>>> df = pandas.DataFrame({'a': pandas.Series([1, 2, 3, 1, 2, 3], dtype=int_category), 
...                        'b': pandas.Series([1, 2, 3, 1, 2, 3], dtype='int64')})
>>> df
	a	b
0	1	1
1	2	2
2	3	3
3	1	1
4	2	2
5	3	3

>>> # convert to a dask.dataframe.DataFrame
>>> ddf = dask.dataframe.from_pandas(df, npartitions=1)
>>> ddf
Dask DataFrame Structure:
	a	b
npartitions=1		
0	category[known]	int64
5	...	...

So far so good: the DataFrame converts to dask with the categorical dtype intact.

# adding a new category value is properly rejected in pandas
>>> df.loc[2, 'a'] = 30
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

# in dask, item assignment via .loc isn't supported at all, Categorical or not
>>> ddf.loc['a', 3] = 10
TypeError: '_LocIndexer' object does not support item assignment

# in pandas, arithmetic on category values is also rejected
>>> df.a * 2
TypeError: unsupported operand type(s) for *: 'Categorical' and 'int'

# in dask too, arithmetic on category values is rejected
>>> ddf.a * 2
TypeError: unsupported operand type(s) for *: 'Categorical' and 'int'

# try passing the CategoricalDtype as meta in dask's apply
>>> ddf['c'] = ddf.a.apply(lambda v: v, meta=int_category)
Dont know how to create metadata from category

# will dask's apply cope if we pass meta='category'?
>>> ddf['c'] = ddf.a.apply(lambda v: v, meta='category')
>>> ddf.dtypes
a    category
b       int64
c      object
dtype: object

>>> # check whether the metadata matches the computed contents
>>> ddf.compute().dtypes
a    category
b       int64
c    category
dtype: object

>>> #try astype
>>> ddf['c'] = ddf.c.astype(int_category)
>>> ddf
Dask DataFrame Structure:
	a	b	c
npartitions=1			
0	category[known]	int64	category[known]
5	...	...	...

I see. The category constraints are preserved, but .apply(meta=...) throws dask's dtype bookkeeping out of sync. It can be recovered with astype, but that's a hassle... Maybe it's at least usable for filtering?
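For the record, filtering is not tested above; a simple equality filter on the category column would look like this (a sketch only, not verified here):

>>> # filter rows by a category value
>>> ddf[ddf.a == 1].compute()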

# try aggregating
>>> ddf.groupby('a').b.mean().compute()
a
1    1.0
2    2.0
3    3.0
4    NaN
5    NaN
Name: b, dtype: float64

# is the dtype broken by being used as the index?
>>> ddf.groupby('a').b.mean().reset_index()
Dask DataFrame Structure:
a	b
npartitions=1		
category[known]	float64
...	...
Dask Name: reset_index, 34 tasks

Hmm, it seems aggregation is supported.

result

- When using pandas.Categorical with dask, filtering and aggregation seem to be fine.
- To add a new Categorical column to a dask DataFrame, use `astype`.

to_parquet vs pandas 1.0

>>> # first, create a pandas.DataFrame
>>> df = pandas.DataFrame(
    {
        'i64': pandas.Series([1, 2, 3,None], dtype='Int64'),
        'i64_nonnull': pandas.Series([1, 2, 3, 4], dtype='Int64'),
        's': pandas.Series(['a', 'b', 'c',None], dtype='string'),
        's_nonnull': pandas.Series(['a', 'b', 'c', 'd'], dtype='string'),
    }
)
>>> df
	i64	i64_nonnull	s	s_nonnull
0	1	1	a	a
1	2	2	b	b
2	3	3	c	c
3	<NA>	4	<NA>	d

>>> # convert to a dask.dataframe.DataFrame
>>> ddf = dask.dataframe.from_pandas(df, npartitions=1)
>>> ddf
Dask DataFrame Structure:
	i64	i64_nonnull	s	s_nonnull
npartitions=1				
0	Int64	Int64	string	string
3	...	...	...	...

For the time being, try to_parquet.

>>> ddf.to_parquet('test1', engine='fastparquet')
ValueError: Dont know how to convert data type: Int64

Seriously...? I had my hopes up... Even if `Int64` is no good, maybe `string` will work...

>>> ddf.to_parquet('test2', engine='fastparquet')
ValueError: Dont know how to convert data type: string

No good either.
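One possible workaround, assuming the classic dtypes are acceptable on disk, is to downcast before writing (a sketch; I have not verified the round trip):

>>> # downcast extension dtypes to parquet-friendly ones:
>>> # Int64 with NA -> float64, Int64 without NA -> int64, string -> object
>>> ddf_compat = ddf.astype({'i64': 'float64', 'i64_nonnull': 'int64',
...                          's': 'object', 's_nonnull': 'object'})
>>> ddf_compat.to_parquet('test3', engine='fastparquet')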

result

- `Int64` and `string` columns cannot be written with to_parquet (engine=fastparquet).

in conclusion

How was it? Probably no one has read this article all the way to the end. Maybe I should have split it into separate posts?

I hope this helps people who are considering pandas 1.0.

See you soon.
