Compare read/write speed and file size of csv, pickle, joblib, and parquet in Python

What I did

These formats are frequently used for temporarily saving 2D array data in Python:

  1. pickle.dump
  2. joblib.dump
  3. Convert to pyarrow and save as parquet
  4. pd.DataFrame.to_csv

We compared the read/write speed and storage size of each.

Conclusion

- For compression ratio and speed: pickle with protocol=4
- If you repeatedly read or write only part of the data: parquet via pyarrow

These look like the best choices.

Trial environment

CPU: Xeon E5-2630 x 2, RAM: 128GB, Windows 8 64bit, Python 3.6

Data used for comparison

Tried with feature data for machine learning:

  - pandas.DataFrame, 536 rows x 178886 columns, 0.77GB
  - pandas.DataFrame, 4803 rows x 178886 columns, 6.87GB

Comparison result

0.77GB DataFrame

(charts comparing read/write speed and file size)

6.87GB DataFrame

(charts comparing read/write speed and file size)

About the result

pickle: Looking only at compression ratio and speed, pickle with protocol=3 or later performs well, and protocol=4 is especially convenient because it also supports objects larger than 4GB. However, on Ubuntu with Python 3.6 or earlier, protocol=4 did not work reliably for reading and writing in my tests. It works normally on Python 3.7 or later, so pickle seems like a good choice if you can secure that environment or if the data is small.
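A minimal round trip with protocol=4 looks like this (the file name and the small example DataFrame are placeholders):

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": range(3), "b": [0.1, 0.2, 0.3]})

# protocol=4 (available since Python 3.4) supports objects larger than 4 GB
with open("features.pkl", "wb") as f:
    pickle.dump(df, f, protocol=4)

with open("features.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.equals(df))  # True
```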

joblib: Compared to pickle, its compression ratio and read/write speed are somewhat middling, but since it can read and write objects larger than 4GB even on Python 3.6, it may be a good option for those who cannot upgrade Python for package-compatibility reasons.
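For reference, a joblib round trip with compression enabled (file name and data are placeholders; `compress=0`, the default, is fastest, while higher values trade speed for smaller files):

```python
import joblib
import pandas as pd

df = pd.DataFrame({"a": range(3), "b": [0.1, 0.2, 0.3]})

# compress takes 0-9; 3 is a common middle ground between speed and size
joblib.dump(df, "features.joblib", compress=3)

restored = joblib.load("features.joblib")
print(restored.equals(df))  # True
```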

pyarrow => parquet: While matching joblib with compress=0 in compression ratio and read/write speed, it also lets you read and write specified rows and columns, which is attractive. Writing is fast in particular, so this seems like a good fit when partial reads and writes arrive in a random pattern.

Environmental impact

The effect of the environment seems to be very large: when I ran the same experiment on another machine with a different OS, a gap that had been 20x shrank to only 4x. It is better to test in the environment where you will actually be running.
