pickle is the standard-library module that saves Python objects as binary data. https://docs.python.org/ja/3/library/pickle.html
Loading is fast: the data is already stored in binary form, so no text parsing is needed. Trained models can also be pickled and reused.
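As a minimal sketch of the "pickle it and reuse it" idea, the round trip below uses only the standard library. The `model` dict is a hypothetical stand-in for a trained model's fitted parameters, and the temporary file path is only for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: just its fitted parameters
model = {"weights": [0.4, -1.2], "bias": 0.1}

# Temporary location for the demo; any writable path would do
path = os.path.join(tempfile.mkdtemp(), "model.pickle")

# 'wb' = write binary
with open(path, "wb") as f:
    pickle.dump(model, f)

# 'rb' = read binary; the object comes back with the same types and values
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The restored object is a separate copy with equal contents, so it can be used without repeating whatever work produced the original.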
This verification article is excellent: "Python: I investigated the persistence format of pandas"
For now, let's pickle train.csv. This is all the code you need:
# pickle is part of the standard library, so no install is required
import pickle
import pandas as pd
train = pd.read_csv('../input/titanic/train.csv')
# Open the file in 'wb' (write binary) mode
with open('train.pickle', 'wb') as f:
    pickle.dump(train, f)
First, commit the notebook.

When the green Complete appears in the upper left, click Open Version.

Scroll down to the Output section.

If you can see train.pickle there, click New Dataset.

Enter a Dataset title of your choice and click Create.

The Dataset is now created.

Create a new notebook and click + Add Data.

Filter by Your Datasets.

Add the Dataset you just created.

If it is displayed here, you are done.

Loading takes only this much code:
import pickle
# Open the file in 'rb' (read binary) mode
with open('../input/titanicdatasetpickles/train.pickle', 'rb') as f:
    train = pickle.load(f)
It is properly loaded as a DataFrame.
train.shape
# (891, 12)
!ls ../input
# titanicdatasetpickles
Let's turn the dump process into a reusable script.
dump_pickles.py
import pickle
import pandas as pd
# Switch the input path between Kaggle and other environments
# (_dh is the directory history that IPython/Jupyter defines, so run this in a notebook)
if '/kaggle/working' in _dh:
    input_path = '../input'
else:
    input_path = './input'
# Rewrite only this part for each competition
data_sets = {
    'train': f'{input_path}/titanic/train.csv',
    'test': f'{input_path}/titanic/test.csv',
    'gender_submission': f'{input_path}/titanic/gender_submission.csv'
}
for name, path in data_sets.items():
    df = pd.read_csv(path)
    with open(f'{name}.pickle', 'wb') as f:
        pickle.dump(df, f)
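`_dh` exists only inside an IPython/Jupyter session, so the check above raises NameError if dump_pickles.py is run as a plain Python script. One possible alternative is sketched below: it tests whether the Kaggle working directory exists instead. The function name `resolve_input_path` and the idea that the directory's presence is a good enough signal are assumptions for illustration, not part of the original script.

```python
import os


def resolve_input_path(kaggle_working_dir="/kaggle/working"):
    """Return '../input' on Kaggle and './input' elsewhere.

    Assumption: if the Kaggle working directory exists, the code is
    running on Kaggle.
    """
    if os.path.isdir(kaggle_working_dir):
        return "../input"
    return "./input"


input_path = resolve_input_path()
```

Because the directory to check is a parameter, the same function can be pointed at another path when the layout differs.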
# This:
with open('./train.pickle', 'wb') as f:
    pickle.dump(train, f)
# can be written like this with pandas:
train.to_pickle('./train.pickle')
# And this:
with open('../input/titanicdatasetpickles/train.pickle', 'rb') as f:
    train = pickle.load(f)
# can be written like this:
train = pd.read_pickle('../input/titanicdatasetpickles/train.pickle')
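dump_pickles.py writes one .pickle file per dataset, so a matching loader can read them all back in one call. The following is a minimal sketch using only the standard library. `load_pickles` is a hypothetical helper name, and the demo data are plain lists standing in for DataFrames.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickles(directory):
    """Load every *.pickle file in `directory` into a dict keyed by file stem."""
    loaded = {}
    for path in sorted(Path(directory).glob("*.pickle")):
        with open(path, "rb") as f:
            loaded[path.stem] = pickle.load(f)
    return loaded


# Demo with placeholder objects written to a temporary directory
demo_dir = tempfile.mkdtemp()
for name, obj in {"train": [1, 2, 3], "test": [4, 5]}.items():
    with open(Path(demo_dir) / f"{name}.pickle", "wb") as f:
        pickle.dump(obj, f)

data = load_pickles(demo_dir)
```

On Kaggle the call would presumably look like `data = load_pickles('../input/titanicdatasetpickles')`, after which `data['train']` holds the DataFrame.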
ModuleNotFoundError: No module named 'pandas.core.internals.managers'; 'pandas.core.internals' is not a package
This appears to be caused by a pandas version mismatch between the environment that wrote the pickle and the one reading it. Upgrading pandas solved it:
pip install -U pandas
This article saved me: "Inconsistency between pickle and pandas"
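When upgrading pandas is not an option, one way to cope with an unreadable pickle is to fall back to rebuilding the object from its original source. The sketch below is only an illustration: `load_or_rebuild` is a hypothetical helper, and the list of caught exceptions is an assumption about which failures are worth recovering from (it includes the ModuleNotFoundError shown above).

```python
import pickle


def load_or_rebuild(pickle_path, rebuild):
    """Try the pickle first; call `rebuild()` if it cannot be loaded."""
    try:
        with open(pickle_path, "rb") as f:
            return pickle.load(f)
    except (FileNotFoundError, ModuleNotFoundError,
            AttributeError, pickle.UnpicklingError):
        # Assumption: these failures mean the pickle is missing or was
        # written by an incompatible library version
        return rebuild()
```

With pandas imported as `pd`, a call such as `load_or_rebuild('../input/titanicdatasetpickles/train.pickle', lambda: pd.read_csv('../input/titanic/train.csv'))` would return the pickled DataFrame when it loads and re-read the CSV otherwise.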
Thank you for reading to the end