Speed evaluation of CSV file output in Python

Background

A while ago, I posted an article, "I want to perform summation processing of array elements at high speed with Google Apps Script". At that time I had no plans to use it elsewhere, so I created the library only for GAS. Recently, however, I have also been handling large arrays in Python, so I looked into methods for outputting array data to a csv file.

While researching, I came across Okadate's article, which says that the csv module and the pandas module can both be used to output a csv file. Since the amount of data I handle is large, I was concerned about processing speed, so I decided to check it before putting this into regular operation.

Therefore, I evaluated the processing speed of csv output with the csv module and the pandas module. For reference, I also measured a standard method using the "+" operator and souwapy, a Python port of the GAS summation library.

Evaluation method

I used the following modules to evaluate the speed of csv file output. The computer used for the measurement has a Core i5-3210M CPU, 8 GB of memory, and Windows 10 (x64) (v1607). The Python version is 3.5.2.

| Module name | Remarks |
|---|---|
| csv | Included in the Python standard library |
| pandas | Python data analysis module, version 0.19.0 |
| souwapy | Self-made, version 1.1.1 |
| standard algorithm | General method of adding array elements in order |

The script used for speed evaluation is as follows.

python


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import time
import csv
import pandas as pd
from souwapy import SOUWA  # installed with "pip install souwapy"


def measure_csv(ar):
    start = time.time()
    with open('csvmod.csv', 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerows(ar)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def measure_pandas(ar):
    start = time.time()
    df = pd.DataFrame(ar)
    df.to_csv('pandastest.csv', header=False, index=False)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def measure_souwapy(ar):
    start = time.time()
    s = SOUWA.sou()
    result = s.getcsvdata(ar, ",", "\n")
    with open('souwa.csv', 'w') as f:
        f.write(result)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def measure_standard(ar):
    start = time.time()
    result = ''
    for dat in ar:
        result += ",".join(dat) + "\n"
    with open('standard.csv', 'w') as f:
        f.write(result)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def MakeArray(row):
    # Build `row` rows: a 9-digit zero-padded number followed by 'a' to 'e'.
    return [[str(i + 1).zfill(9), 'a', 'b', 'c', 'd', 'e'] for i in range(row)]


ar = MakeArray(10)

measure = 1

if measure == 1:
    measure_csv(ar)
elif measure == 2:
    measure_pandas(ar)
elif measure == 3:
    measure_souwapy(ar)
elif measure == 4:
    measure_standard(ar)

The data is an array of rows, each a 6-element one-dimensional array consisting of a 9-digit zero-padded numeric string and the letters a to e. This matches the data I want to write to a csv file in actual operation. The numbers are zero-padded so that every row has the same size, and each letter is a single character. With the number of rows set to 10, the data in the csv file is as follows.

000000001,a,b,c,d,e
000000002,a,b,c,d,e
000000003,a,b,c,d,e
000000004,a,b,c,d,e
000000005,a,b,c,d,e
000000006,a,b,c,d,e
000000007,a,b,c,d,e
000000008,a,b,c,d,e
000000009,a,b,c,d,e
000000010,a,b,c,d,e

"," is used as the delimiter and "\n" as the line feed code. Including these, each line is 20 bytes. I also confirmed that the csv module, pandas module, souwapy module, and standard algorithm all produce the same data. The measured time covers everything from the array up to the completed output of the csv file.

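The 20-byte line size and the equality of the outputs can be reproduced in memory with a short sketch. It assumes the same row layout as MakeArray and compares only the csv module against a plain join; the helper names make_rows, csv_text, and join_text are illustrative and not part of the original script.

```python
import csv
import io


def make_rows(n):
    # Same layout as MakeArray: a 9-digit zero-padded number plus 'a' to 'e'.
    return [[str(i + 1).zfill(9), 'a', 'b', 'c', 'd', 'e'] for i in range(n)]


def csv_text(rows):
    # Render with the csv module into an in-memory buffer.
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerows(rows)
    return buf.getvalue()


def join_text(rows):
    # Render by joining fields with "," and ending each row with "\n".
    return ''.join(','.join(row) + '\n' for row in rows)


rows = make_rows(10)
text = csv_text(rows)
same = (text == join_text(rows))
line_lengths = {len(line) for line in text.splitlines(keepends=True)}
bytes_per_line = len(text.encode('utf-8')) // len(rows)
print(same, line_lengths, bytes_per_line)
```

Note that this checks the text before it reaches a file. When a file is opened in text mode on Windows without newline='', Python translates "\n" to "\r\n" on write, so the size on disk can differ from the in-memory count.
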
Evaluation results

fig.png

The results are shown in the figure above. The horizontal axis is the number of array elements, and the vertical axis is the time required to complete the csv file output. Blue, red, orange, and green correspond to the standard algorithm, pandas module, csv module, and souwapy module, respectively. The processing time to output the array data to a csv file gets shorter in the order standard algorithm, pandas module, csv module, souwapy module, so souwapy is the fastest. On average, the csv module was 1.4 times faster than the pandas module, the souwapy module was 2.3 times faster than the csv module, and the souwapy module was 3.1 times faster than the pandas module.
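
A curve like the one in the figure could be collected with a small harness such as the following. This is a sketch, not the script used for the measurements: time_once, write_csv, write_plus, the chosen sizes, and the use of time.perf_counter with a temporary directory are all assumptions made for illustration, and only the csv module and the "+" method are covered.

```python
import csv
import os
import tempfile
import time


def make_rows(n):
    # Same layout as MakeArray in the evaluation script.
    return [[str(i + 1).zfill(9), 'a', 'b', 'c', 'd', 'e'] for i in range(n)]


def write_csv(rows, path):
    # newline='' stops Python from translating "\n" on Windows.
    with open(path, 'w', newline='') as f:
        csv.writer(f, lineterminator='\n').writerows(rows)


def write_plus(rows, path):
    # Standard method: append each line with the "+" operator.
    result = ''
    for row in rows:
        result += ','.join(row) + '\n'
    with open(path, 'w', newline='') as f:
        f.write(result)


def time_once(func, rows, path):
    # Wall-clock time for one call, from the array to the finished file.
    start = time.perf_counter()
    func(rows, path)
    return time.perf_counter() - start


results = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'out.csv')
    for n in (1000, 10000):
        rows = make_rows(n)
        results[n] = {
            'csv': time_once(write_csv, rows, path),
            'plus': time_once(write_plus, rows, path),
            'bytes': os.path.getsize(path),
        }
        print(n, results[n])
```

A single run per size is noisy, so repeating each measurement and taking the minimum or the average would give a steadier curve.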

Looking more closely, the processing time of the standard algorithm is proportional to the square of the number of elements. [It is known that, in the standard method of adding arrays in order using the "+" operator, the total amount of data moved during processing increases in proportion to the square of the number of array elements](http://qiita.com/tanaike/items/17c88c69a0aa0b8b18d7). With each module, on the other hand, the processing time is linearly proportional to the number of elements. From this, it can be inferred that the csv module and pandas module apply some kind of optimization when converting to csv data. I tried to find out what algorithm csv and pandas use to convert an array to a csv file, but unfortunately I could not track it down.
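
The quadratic growth can be illustrated with a simple cost model. The sketch below assumes that every "+" concatenation copies the whole accumulated string, which is a simplification (CPython can sometimes extend a string in place), and counts the characters moved for 20-character rows. The function names are illustrative.

```python
ROW_LEN = 20  # characters per csv line in this article's data


def copied_by_plus(n_rows, row_len=ROW_LEN):
    # Model: each "+=" builds a new string holding every row so far,
    # so the k-th step moves k * row_len characters.
    total = 0
    length = 0
    for _ in range(n_rows):
        length += row_len
        total += length
    return total


def copied_by_join(n_rows, row_len=ROW_LEN):
    # Model: the result is sized once and each row is copied once.
    return n_rows * row_len


for n in (1000, 2000, 4000):
    print(n, copied_by_plus(n), copied_by_join(n))
```

Under this model, doubling the number of rows roughly quadruples the characters moved by "+", while the single-pass count only doubles.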

When the number of elements is small, there is no big difference in processing time between the modules. The difference appears as the number of elements increases. The souwapy module is fast because it uses an algorithm specialized for converting array data to csv data, but so far this is its only function. I think it would work well combined with a module that has more advanced functions, using souwapy only for the final csv file output.

Bonus

The souwapy module is a port of the GAS library. It seemed to be effective when the number of elements is large, so I uploaded it to PyPI in case it is useful. The installation and usage are as follows.

So far, it only has the function of summing an array. I would like to add other functions when they become needed.

- How to install

$ pip install souwapy

python


from souwapy import SOUWA

s = SOUWA.sou()
result = s.getcsvdata(array, ",", "\n")

Here, array is the array to convert. Change the delimiter and line feed code as needed. See below for details.
