pandas 1.0.0 has been released.
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html
Among the new features:

- pd.NA, which handles different kinds of missing values in a unified way, has been introduced on an experimental basis
- An option to speed up rolling.apply() with Numba
- A to_markdown() method that outputs a DataFrame as a Markdown table
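For context, here is a quick sketch of what these look like (this assumes pandas 1.0.0, with numba installed for engine='numba' and tabulate installed for to_markdown()):

```python
import numpy as np
import pandas as pd

# pd.NA: the new unified missing-value marker, used by the nullable dtypes
pd.Series([1, None, 3], dtype="Int64")
# >> 0       1
# >> 1    <NA>
# >> 2       3
# >> dtype: Int64

# rolling.apply() can now delegate to Numba (requires numba and raw=True)
s = pd.Series(np.arange(10, dtype=float))
s.rolling(3).apply(lambda x: x.mean(), engine='numba', raw=True)

# to_markdown() renders a DataFrame as a Markdown table (requires tabulate)
print(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_markdown())
```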
However, what personally caught my eye was a description casually tucked away at the very end of the 'Other enhancements' section of the release notes:
DataFrame.to_pickle() and read_pickle() now accept URL (GH30163)
This means that data pickled with pandas can now be saved to and read from cloud storage directly!

So I tried it out right away.
First, create a simple DataFrame like the one below.

```python
import pandas as pd

df = pd.DataFrame({'hoge': [1, 2, 3], 'fuga': [4, 5, 6], 'piyo': [7, 8, 9]})
```
Running this gives the following table: [^1]

|    | hoge | fuga | piyo |
|---:|-----:|-----:|-----:|
|  0 |    1 |    4 |    7 |
|  1 |    2 |    5 |    8 |
|  2 |    3 |    6 |    9 |
[^1]: By the way, this table was output with df.to_markdown(), which was also added in pandas 1.0.0. Convenient.
Next, save this to the AWS S3 bucket s3://tatamiya-test/ created in advance, and read it back. [^2] [^3]

[^2]: I will omit how to set up IAM and credential keys.

[^3]: I haven't confirmed it, but GCS should work as well.
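As an aside, for s3:// URLs pandas relies on the s3fs package under the hood, so it needs to be installed and able to find credentials through the usual botocore chain (environment variables, ~/.aws/credentials, or an IAM role). A minimal sanity check, assuming the bucket used in this article:

```python
import s3fs

fs = s3fs.S3FileSystem()   # picks up credentials automatically
fs.ls('tatamiya-test')     # list the bucket contents to confirm access
```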
Saving and reading in .csv format was already possible with the previous version.
```python
pd.__version__
# >> '0.25.3'

df.to_csv('s3://tatamiya-test/test.csv', index=None)
pd.read_csv('s3://tatamiya-test/test.csv')
# >> hoge fuga piyo
# >> 0 1 4 7
# >> 1 2 5 8
# >> 2 3 6 9
```
However, with pickle, the same kind of path was not recognized.
```python
pd.__version__
# >> '0.25.3'

df.to_pickle('s3://tatamiya-test/test.pkl')
# >> FileNotFoundError: [Errno 2] No such file or directory: 's3://tatamiya-test/test.pkl'

pd.read_pickle('s3://tatamiya-test/test.pkl')
# >> FileNotFoundError: [Errno 2] No such file or directory: 's3://tatamiya-test/test.pkl'
```
Now, let's try with the latest version 1.0.0.
```python
pd.__version__
# >> '1.0.0'

df.to_pickle('s3://tatamiya-test/test.pkl')
pd.read_pickle('s3://tatamiya-test/test.pkl')
# >> hoge fuga piyo
# >> 0 1 4 7
# >> 1 2 5 8
# >> 2 3 6 9
```
I was able to confirm that it can be saved and read properly!
When processing data with pandas in the cloud, if the data source on S3 or GCS is a CSV file, it was already possible to read and write it by specifying the URL directly with read_csv() / to_csv().

However, when saving intermediate data after cleaning and transformation, it is often better to serialize the DataFrame or Series directly as a Python object (a byte string), because:

- the files are smaller
- reloading is faster
- there is no need to re-specify data types when reading back

to_pickle() and read_pickle() were useful precisely because of these merits (the last point is illustrated in the sketch below).
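To illustrate the dtype point, here is a minimal local sketch (the file names are made up for illustration; the same applies to S3 paths): a CSV round trip loses dtype information that pickle keeps.

```python
import pandas as pd

df_tmp = pd.DataFrame({
    'ts': pd.date_range('2020-01-01', periods=3),   # datetime64[ns]
    'cat': pd.Categorical(['a', 'b', 'a']),          # category
})

# CSV round trip: both columns come back as plain object dtype
df_tmp.to_csv('tmp.csv', index=None)
pd.read_csv('tmp.csv').dtypes
# >> ts     object
# >> cat    object
# >> dtype: object

# pickle round trip: the original dtypes are preserved
df_tmp.to_pickle('tmp.pkl')
pd.read_pickle('tmp.pkl').dtypes
# >> ts     datetime64[ns]
# >> cat          category
# >> dtype: object
```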
However, as mentioned above, it used to be impossible to specify an S3 / GCS URL as the save destination, so you had to resort to workarounds such as:

- using a client library (sketched below)
- saving locally and then uploading with a command-line tool
- mounting the target bucket on the VM instance in advance

all of which take extra time and effort.
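For reference, the client-library route looked roughly like this (a sketch assuming boto3 with credentials already configured, and the same hypothetical bucket and key as above):

```python
import boto3
import pandas as pd

# save locally first, then upload with the boto3 client
df.to_pickle('/tmp/test.pkl')
s3 = boto3.client('s3')
s3.upload_file('/tmp/test.pkl', 'tatamiya-test', 'test.pkl')

# reading back is the reverse: download first, then read_pickle() locally
s3.download_file('tatamiya-test', 'test.pkl', '/tmp/test.pkl')
df_loaded = pd.read_pickle('/tmp/test.pkl')
```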
That's why this update, modest as it may look, is something I'm personally very grateful for!

(That said, since backward compatibility is not guaranteed, existing code can't necessarily be moved to 1.0.0 as-is...)