Many people seem to have the impression that Python is slow.
Libraries such as Cython have been released to dispel that image, but this time I will introduce distributed processing as one way to speed up Python.
The best-known distributed processing frameworks are:
・Hadoop
・Spark
At first I simply wanted to use Spark from Python, but the article below points out that conversions between JVM and Python data structures happen many times, increasing latency, so it would not be very fast.
http://codezine.jp/article/detail/8484
Looking at the architecture diagram in that article, I get the impression that data is piped to and from the Spark workers in many places, and that this could become a bottleneck when processing is distributed.
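As a rough sketch of where that piping happens (my own illustration, not from the article; it assumes a local PySpark installation), every record touched by a Python function has to cross the JVM/Python boundary:

from pyspark import SparkContext

sc = SparkContext("local[2]", "py-jvm-overhead")

# Each element is serialized in the JVM, piped to a Python worker process,
# deserialized, run through the lambda, then serialized back again.
rdd = sc.parallelize(range(1000000))
print rdd.map(lambda v: v * 2).sum()

sc.stop()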
Spartan
https://github.com/spartan-array/spartan
One way to speed up data processing in Python is to use NumPy's matrix data structures, so this time I decided to use the library from the Spartan project, which attempts to distribute NumPy matrices in the same way Spark distributes RDDs.
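As a rough sketch of the idea (my own minimal example, built only from the API calls that appear in the program later in this article), a Spartan array is created and manipulated much like a NumPy array, but it is partitioned across workers, and glom() pulls the result back as a local NumPy array:

import spartan as sp

sp.initialize()                  # start the Spartan workers

a = sp.ones((1000, 1000))        # distributed arrays, partitioned across workers
b = sp.rand(1000, 1000)

c = sp.dot(a, b)                 # operations are distributed, RDD-style
print c.glom()                   # gather the result back as a local NumPy array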
Because of its dependencies, it only worked for me with the Python 2 series. Python 3 support would be welcome.
If you normally use Python 3, please prepare a Python 2 environment with virtualenv or pyenv.
Next, the environment setup procedure. (I have only tried it on a Mac.)
Prepare requirement.txt as shown below and install the packages with pip install -r requirement.txt.
numpy
chainer==1.1.2
ipython==4.0.0
notebook==4.0.4
jinja2==2.8
pyzmq==14.7.0
tornado==4.1
scipy
dsltools
cython
parakeet
scikit-learn
traits
psutil
Then clone and install Spartan itself:
git clone https://github.com/spartan-array/spartan.git
cd spartan
python setup.py develop
That completes the installation.
However, further changes were required to use it on a Mac.
You need to modify the following Python file:
spartan/worker.py
In the default state, the two attributes
psutil.TOTAL_PHYMEM
psutil.NUM_CPUS
are not set and cause an error, so add the following:
ret = psutil.virtual_memory()        # virtual memory statistics
num_cpus = psutil.cpu_percent()      # current CPU utilization (percent)
psutil.TOTAL_PHYMEM = ret.total      # total physical memory in bytes
psutil.NUM_CPUS = num_cpus
Add the lines above just before the line of the program shown below. What is being set is how much virtual memory is available and how heavily the CPU is being used. psutil is a library for retrieving and managing memory and CPU usage, so if you want to know more, see:
https://github.com/giampaolo/psutil
Also, if you want to configure the environment to use only a single core, see the following site:
http://jesperrasmussen.com/2013/03/07/limiting-cpu-cores-on-the-fly-in-os-x/
self.worker_status = core.WorkerStatus(psutil.TOTAL_PHYMEM,
psutil.NUM_CPUS,
psutil.virtual_memory().percent,
psutil.cpu_percent(),
time.time(),
[], [])
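For reference, here is a minimal stand-alone psutil check (my own example, independent of Spartan) that shows the kind of values worker.py reads:

import psutil

mem = psutil.virtual_memory()
print mem.total                       # total physical memory in bytes
print mem.percent                     # percentage of memory currently in use
print psutil.cpu_percent(interval=1)  # CPU utilization over one second (percent)
print psutil.cpu_count()              # number of logical CPUs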
Now let's actually run it.
Save the following linear regression program as lreg.py.
import spartan as sp
sp.initialize()
N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6
x = 100 * sp.ones((N_EXAMPLES, N_DIM)) + sp.rand(N_EXAMPLES, N_DIM)
y = sp.ones((N_EXAMPLES, 1))
# put weights on one server
w = sp.rand(N_DIM, 1)
for i in range(50):
    yp = sp.dot(x, w)                                 # predictions
    diff = x * (yp - y)                               # per-example gradient contributions
    grad = sp.sum(diff, axis=0).reshape((N_DIM, 1))   # gradient of the squared error
    w = w - (grad / N_EXAMPLES * EPSILON)             # gradient-descent update
    print grad.sum().glom()
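For comparison, here is the same gradient-descent loop in plain single-machine NumPy (my own sketch, not part of Spartan); the Spartan version essentially replaces np with sp and adds glom() to collect results:

import numpy as np

N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6

x = 100 * np.ones((N_EXAMPLES, N_DIM)) + np.random.rand(N_EXAMPLES, N_DIM)
y = np.ones((N_EXAMPLES, 1))
w = np.random.rand(N_DIM, 1)

for i in range(50):
    yp = np.dot(x, w)                                        # predictions
    grad = np.sum(x * (yp - y), axis=0).reshape((N_DIM, 1))  # gradient of the squared error
    w = w - (grad / N_EXAMPLES * EPSILON)                    # gradient-descent update
    print grad.sum()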
Run the Spartan version (lreg.py) with the following command.
python lreg.py --log_level=WARN
When I ran it, it consumed CPU and memory at full capacity and my machine froze. Forcing a single PC to do something this spartan is not a good idea.
I have not yet gotten as far as running it on a proper cluster, so I plan to try that in the future.
Use it with a plan!!
The repository for this article is here:
https://github.com/SnowMasaya/Spartan-Study
http://codezine.jp/article/detail/8484
https://github.com/spartan-array/spartan
https://www.cs.nyu.edu/web/Research/Theses/power_russell.pdf