[Python] Accelerating the loading of time series CSVs

When analyzing time series data, you often read data from CSV/TSV files.

If the file is only a few tens of MB you may not mind much, but a file of several hundred MB can take anywhere from a few seconds to tens of seconds just to read, and that wait on every run of your code becomes stressful.

Here, I will introduce a way to dramatically speed up loading with a little ingenuity.

Execution environment

Execution times are measured on the MacBook Pro at hand.

MacBook Pro (Retina, 13-inch, Mid 2014)

The measurement code runs on the latest release of the Python 2.7 series.

$ python --version
Python 2.7.10

Use the pandas module to read the csv file.

>>> import pandas
>>> pandas.__version__
0.18.1

The CSV file to read holds USD/JPY exchange rate data in 1-minute intervals from 2005 to 2015 that happened to be at hand: 5 columns (time, open, high, low, close) × 3,687,221 rows.

The contents look like this:

time open high low close
2005-01-03 03:00:00 102.360 102.390 102.360 102.390

$ wc -c < usdjpy-1m.csv
204760049

It's a CSV file of about 204 MB.
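For readers who don't have this dataset, here is a minimal sketch that generates a few rows in the same five-column layout. The filename usdjpy-sample.csv and the comma delimiter are assumptions for illustration; the real file is of course needed to reproduce the timings below.

```python
from datetime import datetime, timedelta

# Write a tiny file in the same layout: time, open, high, low, close
start = datetime(2005, 1, 3, 3, 0)
with open('usdjpy-sample.csv', 'w') as f:
    f.write('time,open,high,low,close\n')
    for i in range(5):
        t = start + timedelta(minutes=i)
        f.write('{},102.360,102.390,102.360,102.390\n'.format(
            t.strftime('%Y-%m-%d %H:%M:%S')))
```
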

0. Preparation

I use the time function from the standard time module to measure elapsed time, and prepared the following code for that.

The calctime function is a simple helper that executes an arbitrary callable and returns both its result and the difference between the start and end times.

calctime.py


#!/usr/bin/python
# -*- encoding: utf-8 -*-

from time import time

def calctime(func):
	start = time()
	r = func()
	return {'value': r, 'time': time()-start}
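For example, wrapping any expression in a lambda and passing it to calctime returns both the computed value and the elapsed seconds; the summation here is just a stand-in workload.

```python
from time import time

def calctime(func):
    start = time()
    r = func()
    return {'value': r, 'time': time() - start}

# Measure an arbitrary computation
result = calctime(lambda: sum(range(1000000)))
print(result['value'])  # the computed result
print(result['time'])   # elapsed seconds as a float
```
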

1. Read normally

First, to establish a baseline, let's measure the time of a plain read_csv call.

calctime.py


import pandas

csv = calctime(lambda: pandas.read_csv('usdjpy-1m.csv'))
print('Plain read_csv took {} seconds'.format(csv['time']))
Plain read_csv took 7.56608009338 seconds

It took about 7.5 seconds. Also, since this file is time series data, let's parse the time column into datetime values and use it as the index.

calctime.py


import pandas

csv = calctime(lambda: pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time'))
print('read_csv with time parsing took {} seconds'.format(csv['time']))
read_csv with time parsing took 40.0371601582 seconds

Now it takes 40 seconds. Waiting 40 seconds on every run is out of the question, so let's speed this up.

2. Serialize the object

Python ships with a standard module called pickle that serializes in-memory objects into byte strings.

As the name suggests, it converts short-lived in-memory objects into a form that can be written out to a file and preserved in storage, just as raw vegetables can be pickled and kept for a long time.

The code looks like this.

The object returned by pandas.read_csv is saved to usdjpy-1m.csv.pkl using pickle.dump, and then the time to read the dumped byte string back with pickle.load is measured.

calctime.py


import pandas
import pickle

#loading csv
csv = pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time')

#Dump bytes to file
pickle.dump(csv, open('usdjpy-1m.csv.pkl', 'wb'))

#Reload the dumped file
pickled = calctime(lambda: pickle.load(open('usdjpy-1m.csv.pkl', 'rb')))
print('Loading with pickle took {} seconds'.format(pickled['time']))
Loading with pickle took 5.65008401871 seconds

It has been greatly reduced from 40 seconds to 5.6 seconds.

It seems to work because the datetime parsing no longer has to happen on every load.

Once dumped with pickle, every subsequent load gives you the pandas object directly, which is convenient. That's good progress.

But 5.6 seconds is still long enough to be a drag on every run. Let's make this pickle even faster.

3. Specify the pickle protocol

As described at http://docs.python.jp/2/library/pickle.html#pickle.dump , pickle lets you specify the protocol to use when dumping.

The protocol has versions 0 through 2: 0 is the oldest and slowest, and 2 is the newest and fastest. Version 2 can only be used in Python 2.3 or later, so it is not backward compatible with older interpreters.

The default is version 0, so if you call pickle.dump without specifying anything, you get the slow protocol in exchange for that backward compatibility.
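A quick way to see the difference is to pickle the same object under each protocol; the sample dict below is just an illustration, and the relative sizes and speeds depend on the data.

```python
import pickle

data = {'time': ['2005-01-03 03:00:00'], 'close': [102.390] * 100}

p0 = pickle.dumps(data, protocol=0)  # ASCII-based text protocol, the old default
p2 = pickle.dumps(data, protocol=2)  # compact binary protocol

# Both protocols restore an identical object; they differ in format and speed
print(pickle.loads(p0) == data)
print(pickle.loads(p2) == data)
```
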

Let's specify the protocol in the code above and re-execute.

calctime.py


import pandas
import pickle

#loading csv
csv = pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time')

#Dump bytes to a file using version 2 protocol
pickle.dump(csv, open('usdjpy-1m.csv.pkl', 'wb'), protocol=2)

#Reload the dumped file
pickled = calctime(lambda: pickle.load(open('usdjpy-1m.csv.pkl', 'rb')))
print('With protocol version 2 it took {} seconds'.format(pickled['time']))
With protocol version 2 it took 1.07643604279 seconds

5.6 seconds has been cut to about 1 second. Reads are now pleasantly fast.

But I'm still not satisfied. Waiting 1 second every time is still stressful.

Finally, let's add one more trick to make loading truly fast.

4. Save pickle in redis

If the byte string dumped by pickle is saved as a file on storage, reading it still takes time.

In other words, if access time to storage is the bottleneck, we should eliminate it.

Here, let's change the save destination to redis, which keeps data in memory instead of on storage.

redis is a key-value store database that operates in memory, and because the data never leaves memory it is extremely fast.

Detailed installation instructions are omitted here, so please look them up if you are interested.

Once installed, you can check that redis is running properly:

$ redis-cli ping
PONG

To access redis from Python, install the redis module with pip.

$ pip install redis

Let's modify the previous code slightly: dump the loaded CSV to a byte string, save it in redis, and then measure the time to fetch it from redis and restore it.

This time, neither the dump nor the load goes through a file; since we want the byte string in a variable, we use pickle.dumps and pickle.loads instead.
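The in-memory variants work like this: dumps returns the pickled bytes directly and loads rebuilds the object from them. A plain dict stands in for the DataFrame here.

```python
import pickle

obj = {'open': 102.360, 'close': 102.390}

# dumps returns the pickled bytes instead of writing them to a file
dumped = pickle.dumps(obj, protocol=2)

# loads restores the object from the byte string
restored = pickle.loads(dumped)
print(restored == obj)  # True
```
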

calctime.py


import pandas
import pickle
import redis

#loading csv
csv = pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time')

#Dump bytes using version 2 protocol
dumped = pickle.dumps(csv, protocol=2)

#Save the dumped bytes in redis
r = redis.Redis(host='localhost', port=6379, db=0)
r.set('usdjpy-1m.csv', dumped)

#Read bytes from redis and restore with pickle
csv = calctime(lambda: pickle.loads(r.get('usdjpy-1m.csv')))
print('With redis it took {} seconds'.format(csv['time']))
With redis it took 0.332698106766 seconds

It has been reduced from 1 second to 0.33 seconds.

With this, you can run the code without any stress! Compared to the initial 40-second load, it is about 121 times faster.

Bonus

Here is the set of code used in the experiment.

calctime.py


#!/usr/bin/python
# -*- encoding: utf-8 -*-

import pandas
import pickle
import redis
from time import time

#Time measurement method
def calctime(func):
	start = time()
	r = func()
	return {'value': r, 'time': time()-start}

#csv file path
csv_filepath = 'usdjpy-1m.csv'

#Read normally
read_csv = calctime(lambda: pandas.read_csv(csv_filepath, parse_dates=['time'], index_col='time'))
print('Plain read_csv took {} seconds'.format(read_csv['time']))

#Read using pickle protocol version 0
pickle.dump(read_csv['value'], open('csv0.pkl', 'wb'))
pickle_load_0 = calctime(lambda: pickle.load(open('csv0.pkl', 'rb')))
print('pickle with protocol version 0 took {} seconds'.format(pickle_load_0['time']))

#Read using pickle protocol version 2
pickle.dump(read_csv['value'], open('csv2.pkl', 'wb'), protocol=2)
pickle_load_2 = calctime(lambda: pickle.load(open('csv2.pkl', 'rb')))
print('pickle with protocol version 2 took {} seconds'.format(pickle_load_2['time']))

#Read using redis
r = redis.Redis(host='localhost', port=6379, db=0)
r.set(csv_filepath, pickle.dumps(read_csv['value'], protocol=2))
redis_get = calctime(lambda: pickle.loads(r.get(csv_filepath)))
print('With redis it took {} seconds'.format(redis_get['time']))

Finally

When doing data analysis, there are many situations where you have to run your code over and over through trial and error. I hope this helps you turn those iterations around as efficiently as possible.

I post technical information on Twitter every day. I would be grateful if you could follow me.

https://twitter.com/YuhsakInoue
