Compare read/write speed and file size of csv, pickle, joblib, and parquet in Python

What I did

These formats are frequently used for temporarily saving 2D array data in Python:

  1. pickle.dump
  2. joblib.dump
  3. Convert to pyarrow and save as parquet
  4. pd.DataFrame.to_csv

We compared the read/write speed and storage size of each.

Conclusion

- For compression ratio and speed: pickle with protocol=4
- If you repeatedly read or write only part of the data: parquet via pyarrow

These look like the best choices.

Trial environment

CPU: Xeon E5-2630 x 2, RAM: 128GB, Windows 8 64bit, Python 3.6

Data used for comparison

Tried with feature data for machine learning:

  - pandas.DataFrame, 536 rows x 178886 columns, 0.77GB
  - pandas.DataFrame, 4803 rows x 178886 columns, 6.87GB

Comparison result

0.77GB DataFrame

(charts comparing read/write speed and file size)

6.87GB DataFrame

(charts comparing read/write speed and file size)

About the result

pickle: Looking only at compression ratio and speed, pickle with protocol=3 or later performs well, and protocol=4 is especially convenient because it also supports objects larger than 4GB. However, on Ubuntu with Python 3.6 or earlier, protocol=4 did not work reliably for reading and writing in my tests. It works normally on Python 3.7 or later, so pickle seems like a good choice if you can secure that environment or if the data is small.
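A minimal round trip with protocol=4 looks like this (the file name and the small example DataFrame are placeholders):

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": range(3), "b": [0.1, 0.2, 0.3]})

# protocol=4 (available since Python 3.4) supports objects larger than 4 GB
with open("features.pkl", "wb") as f:
    pickle.dump(df, f, protocol=4)

with open("features.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.equals(df))  # True
```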

joblib: Compared to pickle, its compression ratio and read/write speed are somewhat middling, but since it can read and write objects larger than 4GB even on Python 3.6, it may be a good option for those who cannot upgrade Python for package-compatibility reasons.
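For reference, a joblib round trip with compression enabled (file name and data are placeholders; `compress=0`, the default, is fastest, while higher values trade speed for smaller files):

```python
import joblib
import pandas as pd

df = pd.DataFrame({"a": range(3), "b": [0.1, 0.2, 0.3]})

# compress takes 0-9; 3 is a common middle ground between speed and size
joblib.dump(df, "features.joblib", compress=3)

restored = joblib.load("features.joblib")
print(restored.equals(df))  # True
```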

pyarrow => parquet: While matching joblib with compress=0 in compression ratio and read/write speed, it also lets you read and write specified rows and columns, which is attractive. Writing is fast in particular, so this seems like a good fit when partial reads and writes arrive in a random pattern.

Environmental impact

The effect of the environment seems to be very large: when I ran the same experiment on another machine with a different OS, a gap that had been 20x shrank to only 4x. It is better to test in the environment where you will actually be running.
