[PYTHON] Run TensorFlow on a GPU instance on AWS

I tried TensorFlow in the following two environments, and this article is a record of the installation steps.

- Mac OS X (CPU only)
- Amazon Web Services (AWS) (with GPU)

Note that, at the time of writing, TensorFlow does not run on Windows. Bazel, the Google build tool that TensorFlow uses, supports only Linux and Mac. If you don't have a Mac or Linux machine at hand, setting up an Ubuntu environment on AWS is an easy alternative.

Mac OS X

I installed the CPU-only package, which is the easiest option.

Install the package with pip, following the official documentation.

$ sudo easy_install --upgrade six
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.6.0-py2-none-any.whl

As a test, let's train a model on the CIFAR-10 dataset.

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow/tensorflow/models/image/cifar10/
$ python cifar10_train.py

When the script runs, it downloads the dataset and starts training. Progress is printed to the terminal. Around step 100, once training had stabilized, one batch took 0.540 seconds.

2015-12-31 15:00:08.397460: step 100, loss = 4.49 (237.0 examples/sec; 0.540 sec/batch)
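The throughput figures can be pulled out of progress lines like the one above with a small script. This is a minimal sketch: the regular expression is written against the log line quoted in this article, and the names `LOG_PATTERN` and `parse_progress` are hypothetical helpers, not part of TensorFlow.

```python
import re

# Matches progress lines of the form quoted in this article, e.g.
# "... step 100, loss = 4.49 (237.0 examples/sec; 0.540 sec/batch)"
LOG_PATTERN = re.compile(
    r"step (?P<step>\d+), loss = (?P<loss>[\d.]+) "
    r"\((?P<examples_per_sec>[\d.]+) examples/sec; "
    r"(?P<sec_per_batch>[\d.]+) sec/batch\)"
)


def parse_progress(line):
    """Return step, loss, and throughput from one progress line, or None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "step": int(match.group("step")),
        "loss": float(match.group("loss")),
        "examples_per_sec": float(match.group("examples_per_sec")),
        "sec_per_batch": float(match.group("sec_per_batch")),
    }


line = ("2015-12-31 15:00:08.397460: step 100, loss = 4.49 "
        "(237.0 examples/sec; 0.540 sec/batch)")
print(parse_progress(line))
```

Lines that do not match the pattern return None, so the function can be applied to every line of a saved log.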

Amazon Web Services (AWS)

Build the environment on an AWS EC2 G2 instance.

Note: TensorFlow requires special handling when the GPU's Cuda compute capability is lower than 3.5. The Cuda compute capability is roughly the architecture generation of the GPU, and it is fixed for each GPU model. The GRID K520 in the AWS G2 instances has Cuda compute capability 3.0, so TensorFlow cannot run GPU computations on it as is. Support for 3.0 has been discussed, and the build steps later in this article use the resulting workaround.

I used an instance in the Oregon (US) region, which was cheaper. The prices and information below are as of December 30, 2015.

Model       GPU            vCPU  Memory (GiB)  SSD storage (GB)  Price, Oregon (US)
g2.2xlarge  GRID K520 x 1  8     15            1 x 60            $0.65 per hour
g2.8xlarge  GRID K520 x 4  32    60            2 x 120           $2.60 per hour
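To compare the two models, it can help to work out the price per GPU and the cost of a run. The following is a small sketch using only the figures in the table above. The names `INSTANCES`, `price_per_gpu_hour`, and `cost` are made up for illustration, and the prices are the December 30, 2015 values, so treat the output as an example rather than current pricing.

```python
# (GPU count, price in USD per hour) copied from the table in this article
# (Oregon region, as of December 30, 2015).
INSTANCES = {
    "g2.2xlarge": (1, 0.65),
    "g2.8xlarge": (4, 2.60),
}


def price_per_gpu_hour(model):
    """Hourly price divided by the number of GPUs in the instance."""
    gpus, price = INSTANCES[model]
    return price / gpus


def cost(model, hours):
    """Total on-demand cost in USD for running one instance for `hours`."""
    return INSTANCES[model][1] * hours


for model in sorted(INSTANCES):
    print("{}: ${:.2f} per GPU-hour, ${:.2f} for 10 hours".format(
        model, price_per_gpu_hour(model), cost(model, 10)))
```

With these prices, both models come out to the same price per GPU, so the choice depends on how well training scales across four GPUs.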

Installation

See here for how to connect to a Linux instance. I followed the procedure described here. First, install the required software.

$ sudo apt-get update
$ sudo apt-get upgrade -y # Select "install the package maintainer's version" if prompted
$ sudo apt-get install -y build-essential python-pip python-dev git python-numpy swig default-jdk zip zlib1g-dev ipython

Blacklist the Nouveau driver to avoid conflicts with the NVIDIA driver.

$ echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$ sudo update-initramfs -u
$ sudo reboot
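To confirm that the blacklist took effect once the instance is back up, you can check whether the nouveau kernel module is still loaded. The sketch below reads text in the format of /proc/modules, where the first field of each line is the module name. The helper names `loaded_modules` and `nouveau_is_loaded` are hypothetical, and the sample text is invented for the demonstration.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from text in /proc/modules format."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def nouveau_is_loaded(path="/proc/modules"):
    """True if the nouveau module appears in the given modules file."""
    with open(path) as f:
        return "nouveau" in loaded_modules(f.read())


# Invented sample in the /proc/modules format, for demonstration only.
sample = "nvidia 8370760 0 - Live 0x0000000000000000\n"
print("nouveau" in loaded_modules(sample))
```

On the instance itself, calling `nouveau_is_loaded()` with no arguments reads the real /proc/modules.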

After the reboot, log in again and run the following. I am not sure why this step is needed.

$ sudo apt-get install -y linux-image-extra-virtual
$ sudo reboot
# Install latest Linux headers
$ sudo apt-get install -y linux-source linux-headers-`uname -r`

Next, install CUDA and cuDNN. Please also refer to the official documentation as you proceed. The versions to install must be the ones used below (CUDA 7.0 and cuDNN 6.5 v2).

First, install CUDA.

# Install CUDA 7.0
$ wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ chmod +x cuda_7.0.28_linux.run
$ ./cuda_7.0.28_linux.run -extract=`pwd`/nvidia_installers
$ cd nvidia_installers
$ sudo ./NVIDIA-Linux-x86_64-346.46.run
$ sudo modprobe nvidia
$ sudo ./cuda-linux64-rel-7.0.28-19326674.run

Then install cuDNN, a library specialized in accelerating deep neural network training on the GPU. This article will be helpful. Getting cuDNN requires registering for an NVIDIA developer account. Because it cannot be fetched with wget, first download it to your local machine, then transfer the file to the instance via SCP. The following is an example of a transfer from Linux. xxxxx.amazonaws.com is the public DNS of the instance.

# Run on the local machine
# Transfer the file to the instance via SCP
$ scp -i /path/my-key-pair.pem cudnn-6.5-linux-x64-v2.tgz ubuntu@xxxxx.amazonaws.com:~

After the transfer is complete, extract the archive and copy the files into the CUDA directory.

# Run on the instance
$ cd
$ tar -xzf cudnn-6.5-linux-x64-v2.tgz
$ sudo cp cudnn-6.5-linux-x64-v2/libcudnn* /usr/local/cuda/lib64
$ sudo cp cudnn-6.5-linux-x64-v2/cudnn.h /usr/local/cuda/include/

Set the library path.

$ vi .bashrc # Use vi or nano to add the following two lines to .bashrc
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
$ source .bashrc # Apply the .bashrc settings
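A quick way to confirm that the two settings are visible to Python is to inspect the environment. This is a minimal sketch. `check_cuda_env` is a hypothetical helper, and the two paths are the ones used in the .bashrc lines above.

```python
import os

# Paths taken from the .bashrc lines in this article.
CUDA_LIB_DIR = "/usr/local/cuda/lib64"
CUDA_HOME = "/usr/local/cuda"


def check_cuda_env(environ):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    # Empty entries (e.g. from a leading ':') are ignored.
    lib_dirs = [p for p in environ.get("LD_LIBRARY_PATH", "").split(":") if p]
    if CUDA_LIB_DIR not in lib_dirs:
        problems.append("LD_LIBRARY_PATH does not contain " + CUDA_LIB_DIR)
    if environ.get("CUDA_HOME") != CUDA_HOME:
        problems.append("CUDA_HOME is not set to " + CUDA_HOME)
    return problems


print(check_cuda_env(os.environ) or "CUDA environment variables look fine")
```

The function takes the environment as an argument, so it can be checked against a plain dictionary as well as `os.environ`.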

The disk space available on the AMI's root volume is small. Building with Bazel and downloading the training data later need considerably more. The ephemeral storage allocated when the instance is created (under /mnt) has plenty of room, so create a symbolic link to it. You can also delete nvidia_installers and cudnn-6.5-linux-x64-v2.tgz, which are no longer needed.

$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
udev             7687184      12   7687172   1% /dev
tmpfs            1540096     344   1539752   1% /run
/dev/xvda1       8115168 5874536   1805356  77% /
none                   4       0         4   0% /sys/fs/cgroup
none                5120       0      5120   0% /run/lock
none             7700472       0   7700472   0% /run/shm
none              102400       0    102400   0% /run/user
/dev/xvdb       66946696   53144  63486192   1% /mnt
# Create a symbolic link from /tmp to /mnt/tmp
$ sudo mkdir /mnt/tmp
$ sudo chmod 777 /mnt/tmp
$ sudo rm -rf /tmp
$ sudo ln -s /mnt/tmp /tmp
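The result of the steps above can be checked from Python as well: where /tmp now points, and how much space is free there. The sketch below uses `os.statvfs`, which is available on Linux and Mac. The function names `free_gib` and `describe_tmp` are hypothetical.

```python
import os


def free_gib(path):
    """Free space available to non-root users on path's filesystem, in GiB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(2 ** 30)


def describe_tmp(tmp="/tmp"):
    """Report where tmp resolves to and how much space is free there."""
    target = os.path.realpath(tmp)
    return "{} -> {} ({:.1f} GiB free)".format(tmp, target, free_gib(target))


if __name__ == "__main__":
    print(describe_tmp())
```

If the symbolic link is in place, the resolved path should be /mnt/tmp and the free space should be close to the /mnt figure in the df output.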

Note: When the instance is stopped, everything under /mnt is deleted. Do not put data that must persist in the AMI image under /tmp. For the public AMI described later, I prepared a shell script (create_tmp_on_ephemeral_storage.sh) that creates tmp on the ephemeral storage when an instance is created or restarted.

Install the build tool Bazel.

$ cd /mnt/tmp
$ git clone https://github.com/bazelbuild/bazel.git
$ cd bazel
$ git checkout tags/0.1.0
$ ./compile.sh
$ sudo cp output/bazel /usr/bin

Then install TensorFlow. Run ./configure with an environment variable set, as in "TF_UNOFFICIAL_SETTING=1 ./configure", as suggested in the discussion of 3.0 support. This enables an unofficial setting that also supports Cuda compute capability 3.0.

$ cd /mnt/tmp
$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow
$ cd tensorflow
$ TF_UNOFFICIAL_SETTING=1 ./configure

During configuration, the following question is asked. By default, only Cuda compute capabilities 3.5 and 5.2 are supported, so add 3.0 as shown below.

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.0,3.5,5.2   # Add 3.0
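The answer typed at the prompt is simply the default list plus the capability of your own GPU. As an illustration, the sketch below builds that string. `parse_capability` and `capabilities_to_build` are hypothetical helpers, and the default list is the one shown in the prompt above.

```python
def parse_capability(text):
    """Turn "3.0" into (3, 0) so that versions compare numerically."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def capabilities_to_build(gpu_capability, defaults=("3.5", "5.2")):
    """Return the comma-separated list to enter at the ./configure prompt."""
    wanted = set(defaults) | {gpu_capability}
    return ",".join(sorted(wanted, key=parse_capability))


# GRID K520 (AWS G2) has compute capability 3.0 according to this article.
print(capabilities_to_build("3.0"))
```

As the prompt itself warns, each extra capability increases build time and binary size, so add only the one your GPU needs.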

Build TensorFlow.

$ bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer

Then build and install the TensorFlow Python package. Replace "/tmp/tensorflow_pkg/tensorflow-0.6.0-cp27-none-linux_x86_64.whl" with the name of the file that was actually generated.

$ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
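Because the generated file name depends on the version that was built, it can be looked up rather than typed by hand. The following sketch searches the output directory used above for a TensorFlow wheel. `find_wheel` is a hypothetical helper, and the `tensorflow-*.whl` pattern is an assumption based on the file name shown in this article.

```python
import glob
import os


def find_wheel(pkg_dir="/tmp/tensorflow_pkg"):
    """Return the path of the newest tensorflow wheel in pkg_dir, or None."""
    wheels = glob.glob(os.path.join(pkg_dir, "tensorflow-*.whl"))
    if not wheels:
        return None
    # If several builds are present, pick the most recently modified one.
    return max(wheels, key=os.path.getmtime)


wheel = find_wheel()
print("sudo pip install " + wheel if wheel else "no wheel found yet")
```

The script only prints the install command, so nothing is installed until you run that command yourself.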

This completes the installation.

Test

First try with a g2.2xlarge instance.

$ cd tensorflow/models/image/cifar10/
$ python cifar10_multi_gpu_train.py

The results are as follows.

2016-01-01 09:08:55.345446: step 100, loss = 4.49 (285.9 examples/sec; 0.448 sec/batch)

Next, try a g2.8xlarge instance. Set the number of GPUs to 4 around line 63 of cifar10_multi_gpu_train.py, as shown below. Without this change, all four GPUs appear to be in use but training is no faster, probably because the work is not actually parallelized.

tf.app.flags.DEFINE_integer('num_gpus', 4,
                            """How many GPUs to use.""")

The result is as follows. It was considerably faster.

2016-01-01 09:33:24.481037: step 100, loss = 4.49 (718.2 examples/sec; 0.178 sec/batch)
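To put the three measurements side by side, the throughput at step 100 can be compared directly. This is a rough sketch using only the examples/sec values from the log lines in this article. The names `THROUGHPUT` and `speedup` are made up for illustration. Note that the Mac run used cifar10_train.py while the AWS runs used cifar10_multi_gpu_train.py, so the comparison is only approximate.

```python
# examples/sec at step 100, copied from the three log lines in this article.
THROUGHPUT = {
    "Mac OS X (CPU)": 237.0,
    "g2.2xlarge (1 GPU)": 285.9,
    "g2.8xlarge (4 GPUs)": 718.2,
}


def speedup(name, baseline="Mac OS X (CPU)"):
    """Throughput of `name` relative to `baseline`."""
    return THROUGHPUT[name] / THROUGHPUT[baseline]


for name in sorted(THROUGHPUT, key=THROUGHPUT.get):
    print("{}: {:.2f}x".format(name, speedup(name)))

print("4 GPUs vs 1 GPU: {:.2f}x".format(
    speedup("g2.8xlarge (4 GPUs)", "g2.2xlarge (1 GPU)")))
```

By these numbers, four GPUs give roughly 2.5 times the throughput of one GPU, so the scaling is less than linear.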

Community AMI

The image built here is published as a community AMI in the AWS Oregon (US) region.

ubuntu14.04_tensorflow0.6.0_gpu - ami-69475f08

After creating an instance, run create_tmp_on_ephemeral_storage.sh to create the /tmp directory on the ephemeral storage.

$ ./create_tmp_on_ephemeral_storage.sh
