[PYTHON] A story about struggling to loop over 3 million IDs' worth of data

■ Purpose

One month's worth of data is prepared for each of 3 million IDs. Each record holds one explanatory variable and one objective variable, so the table has three columns: ID, explanatory variable x, and objective variable y. The number of records is 3 million × 30 days ≈ 90 million.

For each of the 3 million IDs, I run a simple linear regression of the objective variable on the explanatory variable over the 30 days, and I want to store the correlation coefficient, slope, and p-value for each ID as output.
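To make the setup concrete, here is a minimal sketch of a toy version of the dataset described above. The column names (`id`, `x`, `y`) match the article, but the data itself, the variable names, and the scale (100 IDs instead of 3 million) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_days = 100, 30  # scaled down from the 3 million IDs in the article

# One row per (ID, day): id / explanatory variable x / objective variable y
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_days),
    "x": rng.normal(size=n_ids * n_days),
})
df["y"] = 2.0 * df.x + rng.normal(scale=0.5, size=len(df))

print(df.shape)  # (3000, 3)
```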

■ Policy

Regression is performed in a for loop over the 3 million IDs, and the results are stored in a list. Finally, the list is combined into a data frame. See here for the speed of this method.
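The policy above can be sketched as follows. This is a hedged, scaled-down illustration, not the author's actual script: the toy data, the use of `scipy.stats.linregress` (which conveniently returns the slope, correlation coefficient, and p-value at once), and the output column names are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

rng = np.random.default_rng(0)
n_ids, n_days = 50, 30  # toy scale; the article uses 3 million IDs
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_days),
    "x": rng.normal(size=n_ids * n_days),
})
df["y"] = 2.0 * df.x + rng.normal(scale=0.5, size=len(df))

# Loop over IDs, regress y on x, collect results in a list
results = []
for i in df.id.unique():
    sub = df[df.id == i]
    res = linregress(sub.x, sub.y)
    results.append((i, res.rvalue, res.slope, res.pvalue))

# Finally combine the list into a data frame
out = pd.DataFrame(results, columns=["id", "r", "slope", "pvalue"])
print(out.shape)  # (50, 4)
```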

■ Environment

--EC2 instance (Ubuntu, r5d.4xlarge)

■ Challenges

Simply filtering the data frame to extract the records for each ID takes too long (about 13 seconds per ID).

code1.py


# Each boolean mask scans all ~90 million rows, and it is built twice per ID
for id in id_list:
    tmp_x = df[df.id == id].x
    tmp_y = df[df.id == id].y

■ Solution

--Speed up by setting id as the index and extracting with df.loc[] (about 3.9 seconds per ID)

code2.py


# Index lookups via .loc are faster than boolean-mask filtering
df.index = df.id
for id in id_list:
    tmp_x = df.loc[id].x
    tmp_y = df.loc[id].y

--In combination with the above, use a dask dataframe instead of a pandas dataframe (about 1.7 seconds per ID). *What is dask?

code3.py


import dask.dataframe as dd
import multiprocessing

df.index = df.id
# multiprocessing.cpu_count() is 32 in this environment
ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
for id in id_list:
    tmp_x = ddf.loc[id].x.compute()
    tmp_y = ddf.loc[id].y.compute()

■ Conclusion

It is still too slow. At this rate, processing all the data would take about two months...

■ Future plans

Currently, 30 records are stored for each ID, but by storing the 30 days' worth of data as a list in a single cell, each ID becomes one record. That way a list comprehension can replace the loop, so there is a chance the processing speed can be improved. (But how long would the conversion from 30 records to 1 record take in the first place...? I'd like to do it with pivot_table.)
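The reshaping step described above can be sketched as follows. This is only an assumed illustration of the idea: it uses `groupby(...).agg(list)` rather than `pivot_table` (which the author mentions wanting to use), and the toy data and column names are made up.

```python
import numpy as np
import pandas as pd

# Toy data: 2 IDs × 3 daily records (the article has 3 million IDs × 30 days)
df = pd.DataFrame({
    "id": np.repeat([1, 2], 3),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],
})

# Collapse the per-day records into one row per ID, with each cell
# holding the whole series as a list
wide = df.groupby("id").agg({"x": list, "y": list}).reset_index()

print(len(wide))        # 2  -> one record per ID
print(wide.loc[0, "x"]) # [1.0, 2.0, 3.0]
```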
