Distributed processing in Python with Spartan

This article is for those who want to do distributed processing with Python.

I think many people have the impression that Python is slow.

Libraries such as Cython have been released to dispel that image, but this time I will introduce distributed processing as one way to speed up Python.

The representative distributed processing frameworks are:

・ Hadoop
・ Spark

At first I simply wanted to use Spark from Python, but the article below points out that data structures get converted between the JVM and Python many times, which increases latency, so it does not end up being very fast.

http://codezine.jp/article/detail/8484

(Figure: PySpark architecture — python spark.png)

Looking at the structure in the figure above, you can see there are many places where data is piped between the Spark (JVM) workers and Python processes, and that can become a bottleneck when processing is distributed.
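
As a rough illustration of where that overhead comes from, here is a minimal PySpark sketch (my own example, not taken from the referenced article; it assumes PySpark is installed and simply runs in local mode with a made-up app name). Every element handed to the Python lambda has to be serialized from the JVM to a Python worker process and back.

from pyspark import SparkContext

sc = SparkContext("local[*]", "pyspark-overhead-demo")

# The lambda below runs in separate Python worker processes.
# Each element is serialized from the JVM to Python and the result
# is serialized back, which is where the extra latency comes from.
rdd = sc.parallelize(range(1000000))
print rdd.map(lambda x: x * 2).sum()

sc.stop()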

Spartan

https://github.com/spartan-array/spartan

Therefore, this time I decided to use Spartan, a library from a project that attempts to distribute NumPy matrices in the same way as Spark's RDDs. Since data processing in Python can already be accelerated with NumPy's matrix data structures, distributing those matrices directly looks like a promising approach.
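
To get a feel for the API before installing, here is a minimal sketch pieced together from the calls used in the linear regression sample later in this article (sp.initialize, sp.ones, sp.rand, sp.dot, glom()); the shapes and the glom() call on the dot result are my own illustration, not official documentation.

import spartan as sp
sp.initialize()

# Distributed matrices are created with a NumPy-like API
a = sp.ones((1000, 1000))
b = sp.rand(1000, 1000)

# Operations are distributed across the workers
c = sp.dot(a, b)

# glom() gathers the distributed result into a local NumPy array
print c.glom()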

Due to its dependencies, it only worked properly with the Python 2 series. Python 3 support would be very welcome.

If you normally use Python 3, please prepare a Python 2 environment with virtualenv or pyenv.

Introducing the library

Now for the environment setup procedure. (I have only tried this on a Mac.)

Prepare requirement.txt as shown below and install it with pip install -r requirement.txt.

numpy
chainer==1.1.2
ipython==4.0.0
notebook==4.0.4
jinja2==2.8
pyzmq==14.7.0
tornado==4.1
scipy
dsltools
cython
parakeet
scikit-learn
traits
psutil

Installation

git clone https://github.com/spartan-array/spartan.git
cd spartan
python setup.py develop

That completes the installation.

However, further settings were required to use it on a Mac.

spartan/worker.py

You need to modify the above python file.

In the default state, the two attributes

psutil.TOTAL_PHYMEM
psutil.NUM_CPUS

are not set, which causes an error, so

    # Reconstruct the missing psutil attributes before they are used
    ret = psutil.virtual_memory()
    num_cpus = psutil.cpu_count()  # number of logical CPUs
    psutil.TOTAL_PHYMEM = ret.total
    psutil.NUM_CPUS = num_cpus

Add the lines above just before the line of the program shown below. They fill in the total physical memory and the number of CPUs that the worker reports. psutil is a library for retrieving and managing memory and CPU usage, so if you want to know more, see below.

https://github.com/giampaolo/psutil
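
As a small illustration of the kind of information psutil exposes (this is just example usage of the library, separate from worker.py):

import psutil

mem = psutil.virtual_memory()
print mem.total                        # total physical memory in bytes
print mem.percent                      # percentage of memory currently in use
print psutil.cpu_count()               # number of logical CPUs
print psutil.cpu_percent(interval=1)   # CPU utilization measured over one second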

Also, if you want to restrict your machine to run on only a single core, you can configure that as described on the following page.

http://jesperrasmussen.com/2013/03/07/limiting-cpu-cores-on-the-fly-in-os-x/

    self.worker_status = core.WorkerStatus(psutil.TOTAL_PHYMEM, 
                                           psutil.NUM_CPUS,
                                           psutil.virtual_memory().percent,
                                           psutil.cpu_percent(),
                                           time.time(),
                                           [], [])

Now let's actually run it.

Write the following linear regression program with the name lreg.py.

import spartan as sp
sp.initialize()

N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6

# Input data: a distributed (N_EXAMPLES, N_DIM) matrix and a target vector
x = 100 * sp.ones((N_EXAMPLES, N_DIM)) + sp.rand(N_EXAMPLES, N_DIM)
y = sp.ones((N_EXAMPLES, 1))

# put weights on one server
w = sp.rand(N_DIM, 1)

# Plain gradient descent on the least-squares objective
for i in range(50):
    yp = sp.dot(x, w)                                # predictions
    diff = x * (yp - y)                              # per-example gradient contributions
    grad = sp.sum(diff, axis=0).reshape((N_DIM, 1))  # gradient = x^T (yp - y)
    w = w - (grad / N_EXAMPLES * EPSILON)            # update step
    print grad.sum().glom()                          # gather and print the gradient sum
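
For reference, each loop iteration computes the full-batch least-squares gradient grad = xᵀ(x·w − y) and applies the update w ← w − (ε / N) · grad, i.e. plain gradient descent over all one million examples.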

Operate with the following command.

python lreg.py --log_level=WARN

When I ran it, it maxed out the CPU and memory and my machine froze. Forcing a single PC to do something this spartan is not a good idea.

I haven't yet gotten it running on an actual cluster, which is where it really matters, so I plan to try that in the future.

Please use it responsibly!!

The repository for this article is here:

https://github.com/SnowMasaya/Spartan-Study

Reference material

http://codezine.jp/article/detail/8484

https://github.com/spartan-array/spartan

https://www.cs.nyu.edu/web/Research/Theses/power_russell.pdf
