[PYTHON] Building a machine learning environment with Tellus GPU server (Sakura high-power computing)

Introduction

Verification environment

Item      Version
OS        Ubuntu 18.04
OpenSSH   7.6p1

Apply for GPU server (via Tellus)

Tellus

About GPU server

Item     Spec
OS       Ubuntu 18.04 (64bit)
GPU      NVIDIA Tesla V100 (32GB) ×1
CPU      Xeon 4Core 3.7GHz 1CPU
Disk     MLC SSD 480GB ×2
Memory   64GB

Application flow

  1. Register as a Tellus member (free of charge), then apply for a development environment.
  2. Choose a usage period: 1 month, 3 months, or longer (consultation required).
  3. Some time after you apply, the operations team contacts you with your login ID.

Environment construction (GPU)

The steps below basically follow the "CUDA Toolkit / GPU card driver installation procedure" guide.

Server information

Check the development environment page on the Tellus account dashboard.

Item               Corresponding item
Server IP          Environment host name / IP
Login ID           Sent by email from the operations team
Initial password   Token information / SSHPW information

tellus_dashboard.png

Connect to server

~/.ssh/config


Host tellus
     HostName [Environment host name / IP]
     User [Login ID]
     IdentityFile ~/.ssh/id_rsa

Package update and installation

Preparation before installing GPU driver

sudo apt update
sudo apt upgrade
sudo apt install build-essential
sudo apt install dkms

CUDA Toolkit

wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
chmod +x cuda_10.2.89_440.33.01_linux.run
# Run the installer once, with the options below (toolkit and samples only, no OpenGL libraries)
sudo ./cuda_10.2.89_440.33.01_linux.run --toolkit --samples --samplespath=/usr/local/cuda-samples --no-opengl-libs

/etc/profile.d/cuda.sh


export CUDA_HOME="/usr/local/cuda" 
export PATH="$CUDA_HOME/bin:$PATH" 
export LD_LIBRARY_PATH="/usr/local/lib:$CUDA_HOME/lib64:$LD_LIBRARY_PATH" 
export CPATH="/usr/local/include:$CUDA_HOME/include:$CPATH" 
export INCLUDE_PATH="$CUDA_HOME/include" 

/etc/profile.d/cuda.csh


setenv CUDA_HOME "/usr/local/cuda"
setenv PATH "${CUDA_HOME}/bin:${PATH}"
if ( ! $?LD_LIBRARY_PATH ) setenv LD_LIBRARY_PATH ""
setenv LD_LIBRARY_PATH "/usr/local/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
if ( ! $?CPATH ) setenv CPATH ""
setenv CPATH "/usr/local/include:${CUDA_HOME}/include:${CPATH}"
setenv INCLUDE_PATH "${CUDA_HOME}/include"
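As a quick sanity check of the variables set in these profile scripts, here is a minimal Python sketch. It is not part of the original procedure. The function name check_cuda_env is hypothetical; it only inspects CUDA_HOME, PATH, and LD_LIBRARY_PATH as defined above, and it assumes a new login shell has already loaded the scripts.

```python
import os
import shutil
from typing import List, Mapping


def check_cuda_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
        return problems

    # PATH should contain $CUDA_HOME/bin so that nvcc can be found.
    path_entries = env.get("PATH", "").split(os.pathsep)
    if os.path.join(cuda_home, "bin") not in path_entries:
        problems.append("PATH does not contain CUDA_HOME/bin")

    # LD_LIBRARY_PATH should contain $CUDA_HOME/lib64 for the runtime libraries.
    lib_entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    if os.path.join(cuda_home, "lib64") not in lib_entries:
        problems.append("LD_LIBRARY_PATH does not contain CUDA_HOME/lib64")

    return problems


if __name__ == "__main__":
    issues = check_cuda_env(os.environ)
    if shutil.which("nvcc") is None:
        issues.append("nvcc was not found on PATH")
    print("OK" if not issues else "\n".join(issues))
```

Passing the environment in as a mapping keeps the check easy to try against a hand-written dictionary before running it on the server.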

CUDA Driver

wget https://us.download.nvidia.com/tesla/440.95.01/NVIDIA-Linux-x86_64-440.95.01.run
chmod +x NVIDIA-Linux-x86_64-440.95.01.run
sudo ./NVIDIA-Linux-x86_64-440.95.01.run --no-opengl-files --no-libglx-indirect --dkms

cuDNN

client


scp -r cudnn-10.2-linux-x64-v8.0.3.33.tgz tellus:~/

server


tar xvzf cudnn-10.2-linux-x64-v8.0.3.33.tgz
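# cuDNN 8 splits its API across several headers (cudnn_version.h, cudnn_ops_infer.h, ...), so move all of them
sudo mv cuda/include/cudnn*.h /usr/local/cuda/include/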
sudo mv cuda/lib64/* /usr/local/cuda/lib64/
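To confirm which cuDNN version ended up under /usr/local/cuda, the version macros can be read from the header. The sketch below is an illustration only: it assumes the cuDNN 8 layout, where CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL live in cudnn_version.h, and it assumes that file sits in /usr/local/cuda/include. The function name parse_cudnn_version is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed location: the include directory used in the steps above.
CUDNN_VERSION_HEADER = Path("/usr/local/cuda/include/cudnn_version.h")


def parse_cudnn_version(header_text: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the text of a cuDNN version header."""
    values = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None  # macro missing: not a version header we understand
        values[name] = int(match.group(1))
    return values["CUDNN_MAJOR"], values["CUDNN_MINOR"], values["CUDNN_PATCHLEVEL"]


if __name__ == "__main__":
    if CUDNN_VERSION_HEADER.exists():
        print(parse_cudnn_version(CUDNN_VERSION_HEADER.read_text()))
    else:
        print(f"{CUDNN_VERSION_HEADER} not found")
```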

Installation confirmation

nvidia-smi.png
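The same confirmation can be scripted. The following is a minimal sketch that calls nvidia-smi with its --query-gpu and --format=csv,noheader options and parses the result. The helper names parse_gpu_rows and query_gpus are hypothetical, and the selection of queried fields is an assumption made for illustration.

```python
import shutil
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi; chosen here only as an example.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_rows(csv_text: str) -> List[Dict[str, str]]:
    """Turn `nvidia-smi --format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi if available; return an empty list when it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU reported (is the driver installed?)")
    for gpu in gpus:
        print(gpu)
```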

Environment construction (Python)

Anaconda

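wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh
bash Anaconda3-2020.07-Linux-x86_64.sh
conda update -n base conda
# The py38 environment is inferred from the PYTHONPATH in .bashrc below
conda create -n py38 python=3.8
conda activate py38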

.bashrc


export PYTHONPATH="/home/[Login ID]/anaconda3/envs/py38/lib/python3.8:/home/[Login ID]/anaconda3/envs/py38/lib/python3.8/site-packages:$PYTHONPATH"

PyTorch

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
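After the install, it can help to check that PyTorch actually sees the GPU before running a longer job. The sketch below is one possible check, not the author's procedure. The function name describe_torch_cuda is hypothetical, and PyTorch is imported lazily so that the script still runs (and reports the absence) where PyTorch is not installed.

```python
import importlib.util
from typing import Any, Dict


def describe_torch_cuda() -> Dict[str, Any]:
    """Collect basic CUDA facts from PyTorch, or report that PyTorch is absent."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # lazy import: only reached when PyTorch is installed

    info: Dict[str, Any] = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If cuda_available is False on the GPU server, the driver installation and the cudatoolkit version passed to conda are the first things to re-check.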

MLFlow

conda install -c conda-forge mlflow

~/.ssh/config


Host tellus
     HostName [Environment host name / IP]
     User [Login ID]
     IdentityFile ~/.ssh/id_rsa
     LocalForward [Client side port number] localhost:5000
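With this LocalForward entry, the MLflow UI on the server (port 5000) should be reachable on the client at the chosen local port while an SSH session is open. The sketch below is a small client-side check under that assumption. The function name port_is_open is hypothetical, and the port value in the example is a placeholder for whatever client-side port number is configured.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder: replace with the client-side port number from LocalForward.
    local_port = 5000
    state = "reachable" if port_is_open("localhost", local_port) else "not reachable"
    print(f"localhost:{local_port} is {state}")
```

A successful connection only shows that the tunnel is listening; the MLflow UI itself must also be running on the server for a browser request to succeed.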

QGIS

conda install -c conda-forge qgis=3.10.8

Operation check

GPU training

cifar10.py


import os

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
from tqdm import tqdm


batch = 1024
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def dataloader(is_train: bool, transform: transforms.Compose) -> DataLoader:
    dataset = CIFAR10(root='./data', train=is_train, download=True, transform=transform)
    return DataLoader(dataset, batch_size=batch, shuffle=is_train, num_workers=os.cpu_count())


def model() -> nn.Module:
    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(512, 10)
    return model.to(device)


def training(net: nn.Module, trainloader: DataLoader, epochs: int) -> None:
    # loss function & optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        bar = tqdm(trainloader, desc="training model [epoch:{:02d}]".format(epoch), total=len(trainloader))
        for step, data in enumerate(bar, start=1):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data[0].to(device), data[1].to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # show the mean loss over the batches processed so far in this epoch
            running_loss += loss.item()
            bar.set_postfix(device=device, batch=batch, loss=(running_loss / step))

    print('Finished Training')


transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainloader = dataloader(True, transform)
net = model()
training(net, trainloader, 3)

CPU results: cpu_batch_1024.png, cpu_batch_1024_smi.png

GPU results: gpu_batch_1024.png, gpu_batch_1024_smi.png

MLFlow

record_sin.py


from math import pi, sin

import mlflow

mlflow.set_experiment('test')
amplitude = 2.0

with mlflow.start_run():
    mlflow.log_param('amplitude', amplitude)
    for i in range(360):
        sin_val = amplitude * sin(i * pi / 180.)
        mlflow.log_metric('sin wave', sin_val, step=i)

~/test_code/


python record_sin.py
mlflow ui

Result images: mlflow_localforward.png, mlflow_test.png, mlflow_sinwave.png

QGIS

ssh -X tellus
qgis

Using VS Code

conda install -c conda-forge ipykernel

Conclusion

Reference page

Tellus FAQ
Bamboo shoot blog: Building a PyTorch environment from the Tellus GPU server
