Many people seem to have the impression that Python is slow.
Libraries such as Cython have been released to dispel that image, but this time I will introduce distributed processing as one way to speed up Python.
The best-known distributed processing frameworks are:
・Hadoop
・Spark
At first I simply wanted to use Spark from Python, but the article below points out that conversions between JVM and Python data structures happen many times, increasing latency, so it would not be very fast.
http://codezine.jp/article/detail/8484
Looking at the architecture diagram in that article, I get the impression that data is piped to and from the Spark workers in many places, and that this could become a bottleneck when processing is distributed.
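As a rough sketch of where that piping happens (my own illustration, not from the article; it assumes a local PySpark installation), every record touched by a Python function has to cross the JVM/Python boundary:

from pyspark import SparkContext

sc = SparkContext("local[2]", "py-jvm-overhead")

# Each element is serialized in the JVM, piped to a Python worker process,
# deserialized, run through the lambda, then serialized back again.
rdd = sc.parallelize(range(1000000))
print rdd.map(lambda v: v * 2).sum()

sc.stop()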
Spartan
https://github.com/spartan-array/spartan
One way to speed up data processing in Python is to use NumPy's matrix data structures, so this time I decided to use the library from the Spartan project, which attempts to distribute NumPy matrices in the same way Spark distributes RDDs.
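As a rough sketch of the idea (my own minimal example, built only from the API calls that appear in the program later in this article), a Spartan array is created and manipulated much like a NumPy array, but it is partitioned across workers, and glom() pulls the result back as a local NumPy array:

import spartan as sp

sp.initialize()                  # start the Spartan workers

a = sp.ones((1000, 1000))        # distributed arrays, partitioned across workers
b = sp.rand(1000, 1000)

c = sp.dot(a, b)                 # operations are distributed, RDD-style
print c.glom()                   # gather the result back as a local NumPy array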
Because of its dependencies, it only worked for me with the Python 2 series. Python 3 support would be welcome.
If you normally use Python 3, please prepare a Python 2 environment with virtualenv or pyenv.
Next, the environment setup procedure. (I have only tried it on a Mac.)
Prepare requirement.txt as shown below and install the packages with pip install -r requirement.txt.
numpy
chainer==1.1.2
ipython==4.0.0
notebook==4.0.4
jinja2==2.8
pyzmq==14.7.0
tornado==4.1
scipy
dsltools
cython
parakeet
scikit-learn
traits
psutil
Then clone and install Spartan itself:
git clone https://github.com/spartan-array/spartan.git
cd spartan
python setup.py develop
That completes the installation.
However, further changes were required to use it on a Mac.
You need to modify the following Python file:
spartan/worker.py
In the default state, the two attributes
psutil.TOTAL_PHYMEM
psutil.NUM_CPUS
are not set and cause an error, so add the following:
ret = psutil.virtual_memory()        # virtual memory statistics
num_cpus = psutil.cpu_percent()      # current CPU utilization (percent)
psutil.TOTAL_PHYMEM = ret.total      # total physical memory in bytes
psutil.NUM_CPUS = num_cpus
Add the lines above just before the line of the program shown below. What is being set is how much virtual memory is available and how heavily the CPU is being used. psutil is a library for retrieving and managing memory and CPU usage, so if you want to know more, see:
https://github.com/giampaolo/psutil
Also, if you want to configure the environment to use only a single core, see the following site:
http://jesperrasmussen.com/2013/03/07/limiting-cpu-cores-on-the-fly-in-os-x/
self.worker_status = core.WorkerStatus(psutil.TOTAL_PHYMEM,
psutil.NUM_CPUS,
psutil.virtual_memory().percent,
psutil.cpu_percent(),
time.time(),
[], [])
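For reference, here is a minimal stand-alone psutil check (my own example, independent of Spartan) that shows the kind of values worker.py reads:

import psutil

mem = psutil.virtual_memory()
print mem.total                       # total physical memory in bytes
print mem.percent                     # percentage of memory currently in use
print psutil.cpu_percent(interval=1)  # CPU utilization over one second (percent)
print psutil.cpu_count()              # number of logical CPUs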
Now let's actually run it.
Save the following linear regression program as lreg.py.
import spartan as sp
sp.initialize()
N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6
x = 100 * sp.ones((N_EXAMPLES, N_DIM)) + sp.rand(N_EXAMPLES, N_DIM)
y = sp.ones((N_EXAMPLES, 1))
# put weights on one server
w = sp.rand(N_DIM, 1)
for i in range(50):
    yp = sp.dot(x, w)                                 # predictions
    diff = x * (yp - y)                               # per-example gradient contributions
    grad = sp.sum(diff, axis=0).reshape((N_DIM, 1))   # gradient of the squared error
    w = w - (grad / N_EXAMPLES * EPSILON)             # gradient-descent update
    print grad.sum().glom()
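For comparison, here is the same gradient-descent loop in plain single-machine NumPy (my own sketch, not part of Spartan); the Spartan version essentially replaces np with sp and adds glom() to collect results:

import numpy as np

N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6

x = 100 * np.ones((N_EXAMPLES, N_DIM)) + np.random.rand(N_EXAMPLES, N_DIM)
y = np.ones((N_EXAMPLES, 1))
w = np.random.rand(N_DIM, 1)

for i in range(50):
    yp = np.dot(x, w)                                        # predictions
    grad = np.sum(x * (yp - y), axis=0).reshape((N_DIM, 1))  # gradient of the squared error
    w = w - (grad / N_EXAMPLES * EPSILON)                    # gradient-descent update
    print grad.sum()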
Run the Spartan version (lreg.py) with the following command.
python lreg.py --log_level=WARN
When I ran it, it consumed CPU and memory at full capacity and my machine froze. Forcing a single PC to do something this spartan is not a good idea.
I have not yet gotten as far as running it on a proper cluster, so I plan to try that in the future.
Use it with a plan!!
The repository for this article is here:
https://github.com/SnowMasaya/Spartan-Study
http://codezine.jp/article/detail/8484
https://github.com/spartan-array/spartan
https://www.cs.nyu.edu/web/Research/Theses/power_russell.pdf