A while back I had to analyze a fairly large dataset, and processing it took a very long time. This post summarizes the method I used to speed it up.
The approach uses Python's multiprocessing module.
multi.py
from multiprocessing import Pool
from multiprocessing import Process
A basic usage pattern looks like this:
multi.py
def function(hoge):
    # the thing you want to do goes here
    return x  # x is the result computed above

def multi(n):
    p = Pool(10)  # maximum number of worker processes: 10
    result = p.map(function, range(n))
    p.close()  # no more tasks will be submitted
    p.join()   # wait for the workers to finish
    return result

def main():
    data = multi(20)
    for i in data:
        print(i)

if __name__ == '__main__':
    main()
In this case, function is executed 20 times, once for each of the arguments 0, 1, 2, ..., 19, spread across the worker processes. Pool.map collects the return values of function into result as a list, which main then receives and prints to standard output.
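As an aside, on Python 3 the pool can also be used as a context manager, which shuts the workers down cleanly even if something raises an exception. Here is a minimal self-contained sketch (square is just a stand-in for the real work, not part of the original code):

from multiprocessing import Pool

def square(hoge):
    # stand-in for the real per-item work
    return hoge * hoge

if __name__ == '__main__':
    with Pool(10) as p:  # the pool is terminated when the block exits
        result = p.map(square, range(20))
    print(result)  # [0, 1, 4, 9, ..., 361]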
Also, my environment gives me up to 12 logical cores (6 physical cores with 12 threads, to be exact), so I set the maximum number of processes to 10. If you max out every core, even opening a browser becomes a struggle, so it is safer to leave some headroom.
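If you want the script to adapt to whatever machine it runs on instead of hard-coding 10, you can size the pool from multiprocessing.cpu_count(). A small sketch (leaving two logical cores free is my own rule of thumb, not something from the original post):

import multiprocessing as mp

# keep a couple of logical cores free so the machine stays responsive
n_workers = max(1, mp.cpu_count() - 2)

def f(hoge):
    return hoge

if __name__ == '__main__':
    with mp.Pool(n_workers) as p:
        print(p.map(f, range(5)))  # [0, 1, 2, 3, 4]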
I also watched the CPU usage during the parallel run; you can see that the work really is being performed on multiple cores in parallel.
You can also get the ID of the process handling each task:
multi.py
import os

def function(hoge):
    # the thing you want to do goes here
    print('process id:' + str(os.getpid()))
    return x  # x is the result computed above

# (rest is the same as above)
If you print this, you can see that each call really is being executed in a different process, which is interesting.
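The output looks something like the following (these PIDs are made-up examples; the actual values differ on every run):

process id:11340
process id:11341
process id:11342
...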
A job that used to take about 35 hours finished in just over 4 hours, roughly a tenth of the original processing time, which is plenty good enough for me.
Of course, each individual process runs no faster than before, so to get the full benefit you have to spread the work evenly across the workers; one way to handle uneven workloads is shown in the sketch below. Analysis jobs often have exactly this embarrassingly parallel shape, so I find the technique very useful.
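If the per-item workload is uneven, one option is to let the pool hand items to idle workers on demand rather than pre-partitioning them. A minimal sketch using Pool.imap_unordered (the sleep is just a stand-in for work whose duration varies):

import time
from multiprocessing import Pool

def uneven_work(i):
    time.sleep(i % 3)  # stand-in for work that takes a variable amount of time
    return i

if __name__ == '__main__':
    with Pool(10) as p:
        # chunksize=1: an idle worker pulls the next item as soon as it finishes,
        # so a few slow items do not leave the other workers waiting
        for res in p.imap_unordered(uneven_work, range(20), chunksize=1):
            print(res)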