I tried to make deep learning scalable with Spark × Keras × Docker 2: Multi-host edition

Previously, I put Dist-Keras on Docker to build a scalable deep learning environment. http://qiita.com/cvusk/items/3e6c3bade8c0e1c0d9bf

My takeaway at the time was that performance was poor, but on closer review it turned out the parameter settings were wrong. So, with that lesson learned, I tried various configurations.

Recap of the previous post

For a description of Dist-Keras itself, please refer to the previous post; in short, it is Keras running on a Spark cluster. I made it into a Docker image to make it easier to scale out.

The Dockerfile is available on GitHub. https://github.com/shibuiwilliam/distkeras-docker

What to do this time

This time, I would like to verify Dist-Keras on Docker on both a single host and multiple hosts to improve performance. Last time, I launched multiple containers on a single host to create a Spark cluster. This time I will try more patterns.

The workload is MNIST. The MNIST training program is customized from the one provided by dist-keras.

Verified configuration

I validate on a single host and on multiple hosts. Both are configured as Spark master + workers, and I adjust the number of workers and the worker specifications. In the multi-host case, there are two servers. The hosts are AWS EC2 CentOS 7.3 m4.xlarge instances.

no. | hosts     | workers | resources
--- | --------- | ------- | ---------------------
1   | single    | 1       | 1 processor, 2GB RAM
2   | single    | 2       | 2 processors, 5GB RAM
3   | single    | 3       | 1 processor, 3GB RAM
4   | multihost | 2       | 2 processors, 5GB RAM
5   | multihost | 2       | 3 processors, 8GB RAM
6   | multihost | 4       | 2 processors, 5GB RAM

How to set up a single host

The single-host configuration looks like this. The number of Docker containers varies with the verification conditions.

(Figure: single-host configuration diagram)

In the case of a single host, start multiple Docker containers on the same host.

# docker dist-keras for spark master and slave
docker run -it -p 18080:8080 -p 17077:7077 -p 18888:8888 -p 18081:8081 -p 14040:4040 -p 17001:7001 -p 17002:7002 \
 -p 17003:7003 -p 17004:7004 -p 17005:7005 -p 17006:7006 --name spmaster -h spmaster distkeras /bin/bash

# docker dist-keras for spark slave1
docker run -it --link spmaster:master -p 28080:8080 -p 27077:7077 -p 28888:8888 -p 28081:8081 -p 24040:4040 -p 27001:7001 \
-p 27002:7002 -p 27003:7003 -p 27004:7004 -p 27005:7005 -p 27006:7006 --name spslave1 -h spslave1 distkeras /bin/bash

# docker dist-keras for spark slave2
docker run -it --link spmaster:master -p 38080:8080 -p 37077:7077 -p 38888:8888 -p 38081:8081 -p 34040:4040 -p 37001:7001 \
-p 37002:7002 -p 37003:7003 -p 37004:7004 -p 37005:7005 -p 37006:7006 --name spslave2 -h spslave2 distkeras /bin/bash


On the Spark master container, start both the Spark master and a worker; on the slave containers, start a worker only. Adjust the worker's -c (cores) and -m (memory) options to match each verification condition.

# for spark master
${SPARK_HOME}/sbin/start-master.sh

# for spark worker
${SPARK_HOME}/sbin/start-slave.sh -c 1 -m 3G spark://spmaster:${SPARK_MASTER_PORT}
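
To sanity-check that the workers have registered, the Spark standalone master also serves its UI state as JSON. A minimal sketch, assuming it runs from a container that can resolve spmaster and that the UI is on the default port 8080:

# check the workers registered with the standalone master
import json
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

state = json.loads(urlopen("http://spmaster:8080/json/").read())
for worker in state["workers"]:
    print("%(host)s cores=%(cores)s memory=%(memory)s state=%(state)s" % worker)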

Then customize the MNIST program on the Spark master. The MNIST sample code is provided by Dist-Keras. The directory is /opt/dist-keras/examples, which contains the following sample data and programs.

[root@spm examples]# tree
.
|-- cifar-10-preprocessing.ipynb
|-- data
|   |-- atlas_higgs.csv
|   |-- mnist.csv
|   |-- mnist.zip
|   |-- mnist_test.csv
|   `-- mnist_train.csv
|-- example_0_data_preprocessing.ipynb
|-- example_1_analysis.ipynb
|-- kafka_producer.py
|-- kafka_spark_high_throughput_ml_pipeline.ipynb
|-- mnist.ipynb
|-- mnist.py
|-- mnist_analysis.ipynb
|-- mnist_preprocessing.ipynb
|-- spark-warehouse
`-- workflow.ipynb

Back up the original file, then apply the following changes.

cp mnist.py mnist.py.bk

Change 1: Import SparkSession

Add the following at the beginning.

from pyspark.sql import SparkSession

Change 2: Parameter settings

Change the Spark parameters for this environment. The intent of the changes is as follows:
- Use Spark 2
- Use the local environment
- Define the master URL for the local environment
- Change the number of processors according to the verification conditions
- Change the number of workers according to the verification conditions

# Modify these variables according to your needs.
application_name = "Distributed Keras MNIST"
using_spark_2 = True  # changed from False
local = True  # changed from False
path_train = "data/mnist_train.csv"
path_test = "data/mnist_test.csv"
if local:
    # Tell master to use local resources.
    # master = "local[*]"  # original value, commented out
    master = "spark://spm:7077"  # added: the master's hostname (spmaster in the single-host example above)
    num_processes = 1  # change to the number of processors per worker
    num_executors = 3  # change to the number of workers
else:
    # Tell master to use YARN.
    master = "yarn-client"
    num_executors = 20
    num_processes = 1
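
For context, the bundled dist-keras examples derive the total training parallelism from these two values; a sketch of the relationship (the name num_workers follows the bundled examples and may differ in mnist.py):

# total parallel trainers = executors x processes per executor
num_workers = num_executors * num_processes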

Change 3: Worker memory

Change the worker's memory to match the verification conditions.

conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", master)
conf.set("spark.executor.cores", str(num_processes))  # SparkConf values should be strings
conf.set("spark.executor.instances", str(num_executors))
conf.set("spark.executor.memory", "4g")  # change the RAM size
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

You can then run it on the Spark master with python mnist.py.
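
Since the verification below measures wall-clock time to completion, one simple way to time a run (my own wrapper, not part of the dist-keras samples):

# crude wall-clock timing of one training run
import subprocess
import time

start = time.time()
subprocess.call(["python", "mnist.py"])
print("elapsed: %.2f seconds" % (time.time() - start))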

How to set up multi-host

The multi-host configuration looks like this.

(Figure: multi-host configuration diagram)

Multi-host requires the Docker containers to be connected via an overlay network. Please refer to the following for details on how to build a Docker network across multiple hosts. http://knowledge.sakura.ad.jp/knowledge/4786/ http://christina04.hatenablog.com/entry/2016/05/16/065853

I will write only my procedure here. Prepare host1 and host2 on EC2, then install etcd on host1 and start it.

yum -y install etcd

vi /etc/etcd/etcd.conf
systemctl enable etcd
systemctl start etcd
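
The edits to etcd.conf are omitted above; as an assumption, the essential point is to make etcd listen on a client URL that host2 can reach, along these lines (adjust to your environment):

# /etc/etcd/etcd.conf (sketch: only the client URL settings)
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://<host1>:2379"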

Next, add docker-network settings to both host1 and host2.

# edit the docker-network file
vi /etc/sysconfig/docker-network

# for host1
DOCKER_NETWORK_OPTIONS='--cluster-store=etcd://<host1>:2379 --cluster-advertise=<host1>:2376'

# for host2
DOCKER_NETWORK_OPTIONS='--cluster-store=etcd://<host1>:2379 --cluster-advertise=<host2>:2376'

# restart the Docker daemon on both hosts so the options take effect
# (implied by the procedure; adjust to your init system)
systemctl restart docker

# from host2, confirm that the etcd on host1 is reachable
curl -L http://<host1>:2379/version
{"etcdserver":"3.1.3","etcdcluster":"3.1.0"}

Now that the Docker daemons can reach each other through etcd, create a Docker network on host1. Here, I create a Docker network called test1 on subnet 10.0.1.0/24.


# for host1
docker network create --subnet=10.0.1.0/24 -d overlay test1

Finally, run docker network ls, and it's OK if the test1 network has been added.


NETWORK ID          NAME                DRIVER              SCOPE
feb90a5a5901        bridge              bridge              local
de3c98c59ba6        docker_gwbridge     bridge              local
d7bd500d1822        host                host                local
d09ac0b6fed4        none                null                local
9d4c66170ea0        test1               overlay             global

Then attach Docker containers to the test1 network. Let's deploy a Docker container on each of host1 and host2.

# for host1 as spark master
docker run -it --net=test1 --ip=10.0.1.10 -p 18080:8080 -p 17077:7077 -p 18888:8888 -p 18081:8081 -p 14040:4040 -p 17001:7001 -p 17002:7002 \
-p 17003:7003 -p 17004:7004 -p 17005:7005 -p 17006:7006 --name spm -h spm distkeras /bin/bash

# for host2 as spark slave
docker run -it --net=test1 --ip=10.0.1.20 --link=spm:master -p 28080:8080 -p 27077:7077 -p 28888:8888 -p 28081:8081 -p 24040:4040 -p 27001:7001 \
-p 27002:7002 -p 27003:7003 -p 27004:7004 -p 27005:7005 -p 27006:7006 --name sps1 -h sps1 distkeras /bin/bash

You have now deployed two Docker containers on the test1 network across multiple hosts. After that, start the Spark master and Spark workers with the same procedure as on a single host, edit mnist.py, and run python mnist.py.

Performance verification result

This is the result of verifying the performance of each configuration. This time, I measured the time (in seconds) required to complete training.

no. | hosts     | workers | resources             | time in seconds
--- | --------- | ------- | --------------------- | ---------------
1   | single    | 1       | 1 processor, 2GB RAM  | 1615.63757
2   | single    | 2       | 2 processors, 5GB RAM | 1418.56935
3   | single    | 3       | 1 processor, 3GB RAM  | 1475.84212
4   | multihost | 2       | 2 processors, 5GB RAM | 805.382518
5   | multihost | 2       | 3 processors, 8GB RAM | 734.290324
6   | multihost | 4       | 2 processors, 5GB RAM | 723.878466

Performance is clearly better on multiple hosts. Configurations 2 and 4 use the same Docker container setup and the same worker resources, yet there is still a gap of about 600 seconds (1418.57 vs. 805.38, roughly a 1.8x speedup). I suspect the difference simply comes from the amount of free resources left on the host. On the other hand, comparing configurations 1 and 2, or configurations 4, 5, and 6, the number of Spark workers and the amount of resources by themselves do not seem to make a big difference. If you want a significant performance improvement, just go multi-host.

[2017/05/26 postscript] I clustered it with Kubernetes. http://qiita.com/cvusk/items/42a5ffd4e3228963234d
