I tried launching a minimal ipython cluster on AWS

An ipython cluster starts up locally as soon as ipyparallel is installed, but that is not very interesting, so I placed the distributed workers on AWS. This post is my working notes.

Components

- Cluster

Schematic diagram

(Figure: ipcluster_image.png)

This article uses the configuration shown above. The hosts talk to each other over SSH, so connection settings for each host need to be prepared in advance. Honestly, it would be easier to go to the trouble of preparing an AMI that works as an Engine node the moment it boots.

Create hosts for Engine and Controller

Create an instance on Amazon Linux as appropriate.

Here, I used `Amazon Linux AMI 2015.09 (HVM)` and created t2.micro instances in the Tokyo region.

Once the instances are up, install `ipython` and `ipyparallel`.

# Build tools (some dependencies, such as pyzmq, compile C extensions)
sudo yum groupinstall -y 'Development Tools'
sudo pip install ipython ipyparallel

The Engines are started over SSH, so make sure you can ssh from the Controller to each Engine server as ec2-user without a password.
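One minimal way to set this up (a sketch; the Engine host is a placeholder, and on EC2 you could instead reuse the instance key pair or SSH agent forwarding) is to create a key on the Controller and register it on each Engine:

ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
cat ~/.ssh/id_rsa.pub | ssh ec2-user@<ENGINE_HOST1> 'cat >> ~/.ssh/authorized_keys'
ssh ec2-user@<ENGINE_HOST1> hostname  # should print the hostname with no password prompt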

Create a new ipython profile on the Controller

ipython profile create --parallel --profile=myprofile

This generates files like the following.

tree .ipython/profile_myprofile/

~/.ipython/profile_myprofile/
├── ipcluster_config.py
├── ipcontroller_config.py
├── ipengine_config.py
├── ipython_config.py
├── ipython_kernel_config.py
├── log
├── pid
├── security
└── startup
    └── README

Now let's configure the profile for cluster operation.

ipcluster_config.py

Let's add the following near the top of the file.

ipcluster_config.py


c = get_config()

# Launch the engines over SSH
c.IPClusterEngines.engine_launcher_class = 'SSHEngineSetLauncher'

# Engine hosts and the number of engines to start on each
c.SSHEngineSetLauncher.engines = {
    '<ENGINE_HOST1>' : 3,
    '<ENGINE_HOST2>' : 3
}

ipcontroller_config.py

Find the commented-out lines inside and rewrite the relevant settings.

ipcontroller_config.py



...(snip)...

c.IPControllerApp.reuse_files = True

...(snip)...

# Listen on all interfaces so that engines on other hosts can register
c.RegistrationFactory.ip = u'*'

...(snip)...

If `reuse_files` is set to True, the files under `.ipython/profile_<PROFILE_NAME>/security/` are reused. In practice this means the cluster key is not regenerated, so you do not have to update the connection files every time you restart the controller.

Start Controller

Execute the following command on the controller server.

ipcluster start --profile=myprofile

If you want more detailed output, use the `--debug` option. Log files are written under `.ipython/profile_myprofile/log/`.

If it works, after the controller starts you should see log lines indicating that `ipcontroller-client.json` and `ipcontroller-engine.json` were transferred to each node. If the controller fails to start, the `ipcontroller-*` files are not generated at all. (In that case the log only casually mentions that the controller stopped, and then complains that `ipcontroller-client.json` cannot be transferred to each Engine node, which is unfriendly.)

2015-10-04 11:15:21.568 [IPClusterStart] Engines appear to have started successfully

When a log like this appears, the startup is complete.

If you want to run it as a daemon, add the `--daemonize` option.
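For example, with the options mentioned above:

ipcluster start --profile=myprofile --debug      # verbose output
ipcluster start --profile=myprofile --daemonize  # run in the background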

Connect from Client

From here on, the work is done from the local machine.

Get ipcontroller-client.json from Controller

This file is created when the Controller starts.

Copy this file down to the local Client with scp or similar.

scp ec2-user@<CONTROLLER_HOST>:./.ipython/profile_myprofile/security/ipcontroller-client.json ./

Creating a connection confirmation script

I put together the following check script, referring to various sources.

cluster_check.py


from ipyparallel import Client

# Connect to the controller over an SSH tunnel using the copied connection file
rc = Client("ipcontroller-client.json", sshserver="ec2-user@<CONTROLLER_HOST>")
print rc.ids  # IDs of the engines registered with the controller

dview = rc[:]  # a DirectView over all engines

@dview.remote(block=True)
def gethostname():
    # imports must happen on the engine itself
    import socket
    return socket.getfqdn()

print gethostname()

Running it should produce output like this.

$ python cluster_check.py

[0, 1, 2, 3, 4, 5]
[<HOSTNAME>, <HOSTNAME>, <HOSTNAME>, <HOSTNAME2>, <HOSTNAME2>, <HOSTNAME2>]

With the settings above, three engines should start on each of the two hosts.

The ipython cluster can now distribute processing across multiple nodes.

Programming

Do imports inside the function

As in the check script above, code that runs on the engines must do any necessary imports inside the function itself.
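As an alternative, a minimal sketch reusing `dview` from the check script: `sync_imports` is a DirectView helper that runs the import on every engine.

with dview.sync_imports():
    import socket  # executed locally and on each engine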

Use the `@require` and `@depend` decorators to make sure code runs only on engines that can actually execute it (see the ipyparallel documentation on dependencies).
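A minimal sketch of `@require` (the function and its numpy dependency are invented for illustration); submitted through a LoadBalancedView, the task only runs on engines where the requirement can be imported:

from ipyparallel import require

@require('numpy')
def mean_of_random(n):
    import numpy
    return numpy.random.rand(n).mean()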

It can be used like a task queue

Use a LoadBalancedView if you want a stream of tasks processed by whichever engines are free.

Get a `load_balanced_view` and apply an asynchronous map with `map_async`:

In [1]: from ipyparallel import Client
In [3]: a = Client("ipcontroller-client.json", sshserver="ec2-user@<CONTROLLER_HOST>")
In [4]: views = a.load_balanced_view()
In [5]: result = views.map_async(lambda x: x*x , (x for x in xrange(10000)))
In [6]: result
Out[6]: <AsyncMapResult: <lambda>>

This example is actually wasteful: for each element of the list a value is sent to the controller, the lambda is shipped along, the work is handed to an engine, and the result is collected back, one element at a time. Still, it is just an example.
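One way to cut that per-element overhead (a sketch; `chunksize` batches elements into larger tasks) is:

result = views.map_async(lambda x: x*x, xrange(10000), chunksize=100)  # 100 elements per task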

After execution, you can check the state with the AsyncMapResult's `progress` attribute and `ready()` method.

In [8]: result.progress
Out[8]: 8617
In [9]: result.progress
Out[9]: 9557

In [11]: result.progress
Out[11]: 10000

In [12]: result.ready()
Out[12]: True

In [13]: result
Out[13]: <AsyncMapResult: finished>

The mapped results are stored in the `.result` attribute.

In [14]: len(result.result)
Out[14]: 10000
In [15]: type(result.result)
Out[15]: list

Note that accessing `.result` carelessly will block until the distributed processing completes.
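To stay non-blocking, a sketch: check `ready()` first, or wait with a timeout.

if result.ready():
    values = result.result  # safe, the work is already finished
else:
    result.wait(timeout=1)  # returns after at most one second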

Summary

With that, we can now build a cluster and run distributed processing on it.

I had thought of ipython as just a convenient REPL environment, but this is impressive.

Dynamically allocating nodes or scaling the number of engines with the load looks difficult, but above all it is a dream that code developed against a local cluster can be moved onto a large-scale distributed setup as-is. Conversely, it should also be fun that code running in a large-scale distributed setting can be run unchanged in local mode. As the scale grows, cluster management will probably emerge as the next challenge.

Reference

I relied heavily on this article:

http://qiita.com/chokkan/items/750cc12fb19314636eb7
