Example of aggregating a large amount of time series data with Python at a reasonable speed on a machine with little memory

Introduction

I had to read one year of data sampled every 30 seconds at multiple locations and calculate the total value for each timestamp. The straightforward approach could not handle it at all, so here is a note on the small change that made it work.

Data format

Each location has its own CSV file containing a date and time column and a value column.

Date and time        value
2018-10-01 00:00:00  4
2018-10-01 00:00:30  1
2018-10-01 00:01:00  2
2018-10-01 00:01:30  6
2018-10-01 00:02:00  7
2018-10-01 00:02:30  7
...
2019-09-30 23:59:30  7

This data is collected from more than 100 locations. The aggregation is not done in real time; it is run once a certain period of data has accumulated.
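As a quick check of the format, here is a minimal sketch that parses the first rows of the sample. It assumes the files are comma-separated with the header "Date and time,value"; the article only shows the rendered table, so the separator is an assumption, and the in-memory text stands in for a real file.

```python
import io
import pandas as pd

# First rows of the sample above, written as comma-separated text
# (assumed separator; the real files are read from disk).
sample = io.StringIO(
    "Date and time,value\n"
    "2018-10-01 00:00:00,4\n"
    "2018-10-01 00:00:30,1\n"
    "2018-10-01 00:01:00,2\n"
    "2018-10-01 00:01:30,6\n"
    "2018-10-01 00:02:00,7\n"
    "2018-10-01 00:02:30,7\n"
)

# parse_dates turns the timestamp column into datetime64 values
df = pd.read_csv(sample, parse_dates=["Date and time"])

# Gap between consecutive rows; each should be exactly 30 seconds
step = df["Date and time"].diff().dropna()
is_30s_cycle = bool((step == pd.Timedelta(seconds=30)).all())

print(df.dtypes)
print("30-second cycle:", is_30s_cycle)
```

Parsing the timestamps is optional for the aggregation itself, since grouping on the raw strings gives the same totals, but it makes it easy to confirm that a file really follows the 30-second cycle.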

A simple calculation gives 1,051,200 rows per location for one year (2 readings per minute x 60 x 24 x 365), so 100 locations add up to 105,120,000 rows.

...Over a hundred million (-_-;)
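The row count can be reproduced with a few lines of arithmetic. This is just a sketch of the calculation above, using a 365-day year and exactly 100 locations as in the text; the variable names are my own.

```python
# One reading every 30 seconds means two readings per minute
SAMPLES_PER_MINUTE = 2

# Rows per location for a 365-day year
rows_per_location = SAMPLES_PER_MINUTE * 60 * 24 * 365

# The text uses 100 locations for the estimate
locations = 100
total_rows = rows_per_location * locations

print(f"{rows_per_location:,} rows per location")
print(f"{total_rows:,} rows in total")
```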

What I tried first

Read all the files at once, group by date and time, and take the sum!

python


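from glob import glob
import pandas as pd

# Collect the CSV file of every location
files = glob("data/*.csv")

df = pd.DataFrame()

# Append every file to one DataFrame; all rows stay in memory at the same time
for file in files:
    df = pd.concat([df, pd.read_csv(file)])

# Only after everything is loaded, sum the values per timestamp
df = df.groupby("Date and time").sum()

df.to_csv("Total value.csv")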

... RAM usage kept climbing and spilled into the swap area, and after half a day the process ended with an error once it exceeded 80 GB.

I also tried reducing the number of files and calculating the total a little at a time, but that did not seem to work either.

What I ended up doing

I read the files one at a time and recalculated the total after each file.

python


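from glob import glob
import pandas as pd

# Collect the CSV file of every location
files = glob("data/*.csv")

df = pd.DataFrame()

for file in files:
    # Add one more file to the running total ...
    df = pd.concat([df, pd.read_csv(file)])
    # ... and immediately collapse it back to one row per timestamp
    df = df.groupby("Date and time").sum().reset_index()

df.to_csv("Total value.csv")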

As a result, it finished in only a few minutes without exhausting the RAM. Because the DataFrame is collapsed back to one row per timestamp after every file, it never holds much more than about two files' worth of rows at once.

This may be obvious, and once you understand it there is nothing to it, but I lost some time on it, so I wrote it down in the hope that it spares at least one person the same trouble.
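The effect can be checked on a tiny scale. The following is a minimal sketch under stated assumptions, not the exact script above: it assumes comma-separated input with the header "Date and time,value", uses three made-up in-memory inputs instead of real files, and wraps the loop in a hypothetical helper named aggregate_incrementally that also records the largest intermediate frame.

```python
import io
import pandas as pd

# Made-up miniature inputs: three locations, two timestamps each
LOCATION_CSVS = [
    "Date and time,value\n2018-10-01 00:00:00,4\n2018-10-01 00:00:30,1\n",
    "Date and time,value\n2018-10-01 00:00:00,2\n2018-10-01 00:00:30,6\n",
    "Date and time,value\n2018-10-01 00:00:00,7\n2018-10-01 00:00:30,7\n",
]


def aggregate_incrementally(sources):
    """Sum 'value' per timestamp, folding in one source at a time."""
    total = None
    max_rows = 0  # largest number of rows held at once
    for src in sources:
        part = pd.read_csv(src)
        # Running total plus the newly read file
        combined = part if total is None else pd.concat([total, part])
        max_rows = max(max_rows, len(combined))
        # Collapse back to one row per timestamp before reading the next file
        total = combined.groupby("Date and time", as_index=False)["value"].sum()
    return total, max_rows


total, max_rows = aggregate_incrementally(io.StringIO(t) for t in LOCATION_CSVS)
print(total)
print("largest intermediate frame:", max_rows, "rows")
```

With three inputs of two rows each, the intermediate frame peaks at four rows (the running total plus one new file) instead of six. Scaled up to the real data, that is the difference between holding roughly two files and holding all one hundred.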

By the way, my real work starts from here. Data analysis, I'll do my best ... (-_-;)
