[PYTHON] A story about struggling to loop over 3 million IDs' worth of data

■ Purpose

One month's worth of data is prepared for each of 3 million IDs. Each record holds one explanatory variable and one objective variable, so the table has three columns: ID, explanatory variable x, and objective variable y. The number of records is 3 million × 30 days ≈ 90 million.

For each of the 3 million IDs, I run a simple linear regression of the objective variable on the explanatory variable over the 30 days, and I want to store the correlation coefficient, slope, and p-value for each ID as output.
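To make the setup concrete, here is a minimal sketch of a toy version of the dataset described above. The column names (`id`, `x`, `y`) match the article, but the data itself, the variable names, and the scale (100 IDs instead of 3 million) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_days = 100, 30  # scaled down from the 3 million IDs in the article

# One row per (ID, day): id / explanatory variable x / objective variable y
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_days),
    "x": rng.normal(size=n_ids * n_days),
})
df["y"] = 2.0 * df.x + rng.normal(scale=0.5, size=len(df))

print(df.shape)  # (3000, 3)
```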

■ Policy

Regression is performed in a for loop over the 3 million IDs, and the results are stored in a list. Finally, the list is combined into a data frame. See here for the speed of this method.
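The policy above can be sketched as follows. This is a hedged, scaled-down illustration, not the author's actual script: the toy data, the use of `scipy.stats.linregress` (which conveniently returns the slope, correlation coefficient, and p-value at once), and the output column names are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

rng = np.random.default_rng(0)
n_ids, n_days = 50, 30  # toy scale; the article uses 3 million IDs
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_days),
    "x": rng.normal(size=n_ids * n_days),
})
df["y"] = 2.0 * df.x + rng.normal(scale=0.5, size=len(df))

# Loop over IDs, regress y on x, collect results in a list
results = []
for i in df.id.unique():
    sub = df[df.id == i]
    res = linregress(sub.x, sub.y)
    results.append((i, res.rvalue, res.slope, res.pvalue))

# Finally combine the list into a data frame
out = pd.DataFrame(results, columns=["id", "r", "slope", "pvalue"])
print(out.shape)  # (50, 4)
```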

■ Environment

--EC2 instance (Ubuntu, r5d.4xlarge)

■ Challenges

Simply filtering the data frame to extract the records for each ID takes too long (about 13 seconds per ID).

code1.py


# Each boolean mask scans all ~90 million rows, and it is built twice per ID
for id in id_list:
    tmp_x = df[df.id == id].x
    tmp_y = df[df.id == id].y

■ Solution

--Speed up by setting id as the index and extracting with df.loc[] (about 3.9 seconds per ID)

code2.py


# Index lookups via .loc are faster than boolean-mask filtering
df.index = df.id
for id in id_list:
    tmp_x = df.loc[id].x
    tmp_y = df.loc[id].y

--In combination with the above, use a dask dataframe instead of a pandas dataframe (about 1.7 seconds per ID). *What is dask?

code3.py


import dask.dataframe as dd
import multiprocessing

df.index = df.id
# multiprocessing.cpu_count() is 32 in this environment
ddf = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
for id in id_list:
    tmp_x = ddf.loc[id].x.compute()
    tmp_y = ddf.loc[id].y.compute()

■ Conclusion

It is still too slow. At this rate, processing all the data would take about two months...

■ Future plans

Currently, 30 records are stored for each ID, but by storing the 30 days' worth of data as a list in a single cell, each ID becomes one record. That way a list comprehension can replace the loop, so there is a chance the processing speed can be improved. (But how long would the conversion from 30 records to 1 record take in the first place...? I'd like to do it with pivot_table.)
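The reshaping step described above can be sketched as follows. This is only an assumed illustration of the idea: it uses `groupby(...).agg(list)` rather than `pivot_table` (which the author mentions wanting to use), and the toy data and column names are made up.

```python
import numpy as np
import pandas as pd

# Toy data: 2 IDs × 3 daily records (the article has 3 million IDs × 30 days)
df = pd.DataFrame({
    "id": np.repeat([1, 2], 3),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],
})

# Collapse the per-day records into one row per ID, with each cell
# holding the whole series as a list
wide = df.groupby("id").agg({"x": list, "y": list}).reset_index()

print(len(wide))        # 2  -> one record per ID
print(wide.loc[0, "x"]) # [1.0, 2.0, 3.0]
```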
