[LINUX] nvidia cuda setup

This is a note on how to install the NVIDIA driver (version 450), the CUDA 11 compatible CUDA Toolkit, and the cuDNN SDK 8.0.4 on Ubuntu 20.04 LTS. The goal is to run TensorFlow.

https://www.tensorflow.org/install/gpu

Blacklist `nouveau`

Immediately after the OS is installed, the open-source nouveau driver is loaded.

$ lsmod | grep nouveau
nouveau              1949696  1
mxm_wmi                16384  1 nouveau
video                  49152  1 nouveau
ttm                   106496  2 drm_vram_helper,nouveau
drm_kms_helper        184320  4 ast,nouveau
i2c_algo_bit           16384  2 ast,nouveau
drm                   491520  8 drm_kms_helper,drm_vram_helper,ast,ttm,nouveau
wmi                    32768  2 mxm_wmi,nouveau

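The NVIDIA driver is required to use CUDA, so add nouveau to the blacklist and regenerate the initramfs so that nouveau is no longer included in it. After rebooting, make sure nouveau is not loaded. The file is written with `sudo tee -a` because a redirection such as `sudo echo ... >> file` is performed by the unprivileged shell and fails with a permission error.

$ echo "blacklist nouveau" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
$ echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf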
$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-5.8.0-36-generic
$ sudo reboot
$ lsmod | grep nouveau
$
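
If the blacklist step is run more than once, the same lines pile up in the file. The following is a minimal sketch of one way to avoid that, not part of the original procedure: `append_once` is a hypothetical helper name, and it assumes it is run as root when the target file is under /etc.

```shell
# Hypothetical helper: append a line ($1) to a file ($2) only if the file
# does not already contain that exact line. Creates the file if missing.
append_once() {
    touch "$2" || return 1
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Example (as root; path taken from the steps above):
#   conf=/etc/modprobe.d/blacklist-nouveau.conf
#   append_once "blacklist nouveau" "$conf"
#   append_once "options nouveau modeset=0" "$conf"
```

Running the example twice leaves the file with the same two lines.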

Check the version of the driver to install

Next, check which driver versions Ubuntu distributes.

$ ubuntu-drivers devices
== /sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0 ==
modalias : pci:v000010DEd00001DB4sv000010DEsd00001214bc03sc02i00
vendor   : NVIDIA Corporation
model    : GV100GL [Tesla V100 PCIe 16GB]
driver   : nvidia-driver-450 - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-460 - distro non-free recommended
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-440-server - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
$
$
$ sudo apt info nvidia-driver-450 | grep -i version

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Version: 450.102.04-0ubuntu0.20.04.1
$
$
$ sudo apt info nvidia-driver-450-server | grep -i version

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Version: 450.80.02-0ubuntu0.20.04.3
$

The NVIDIA website lists 450.80.02 as the distributed version, so I decided to install nvidia-driver-450-server, which provides exactly that version.
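
The version check above can also be scripted. This is only a sketch under a few assumptions: `recommended_driver` and `candidate_version` are hypothetical helper names, the first expects the output of `ubuntu-drivers devices` on stdin, and the second expects the output of `apt-cache policy <package>`, which avoids the unstable-CLI warning that `apt` prints when piped.

```shell
# Print the package name from the "driver" line marked "recommended".
# Expects `ubuntu-drivers devices` output on stdin.
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3 }'
}

# Print the candidate version of a package.
# Expects `apt-cache policy <package>` output on stdin.
candidate_version() {
    awk '$1 == "Candidate:" { print $2 }'
}

# Example:
#   ubuntu-drivers devices | recommended_driver
#   apt-cache policy nvidia-driver-450-server | candidate_version
```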

Driver installation and startup confirmation

Install the driver, reboot, and then confirm with nvidia-smi that the driver is running.

$ sudo apt install nvidia-driver-450-server
$ sudo reboot
$ 
$ lsmod | grep nvidia
nvidia_uvm           1003520  0
nvidia_drm             49152  0
nvidia_modeset       1183744  1 nvidia_drm
nvidia              19718144  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        217088  5 drm_vram_helper,ast,nvidia_drm
drm                   552960  7 drm_kms_helper,drm_vram_helper,ast,drm_ttm_helper,nvidia_drm,ttm
$ 
$ nvidia-smi 
Fri Jan  8 16:11:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   33C    P0    37W / 250W |      0MiB / 16160MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$
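
Instead of reading the table by eye, the driver version can be checked in a script with the `--query-gpu` option of nvidia-smi. The sketch below is an illustration only: `driver_matches` is a hypothetical helper name, and it assumes the "name, driver_version" CSV format produced by the command shown in the comment.

```shell
# Succeed only if every GPU line on stdin reports the expected driver version ($1).
# Expects CSV lines of the form "name, driver_version" on stdin.
driver_matches() {
    awk -F', ' -v want="$1" '
        { seen = 1; if ($2 != want) bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }'
}

# Example:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
#       | driver_matches 450.80.02 && echo "driver OK"
```

The helper fails on empty input as well, so a machine where no GPU is reported does not pass the check by accident.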

Install `CUDA Toolkit`

Next, install the CUDA Toolkit. The Ubuntu repositories do not distribute version 11.0 yet.

$ sudo apt info nvidia-cuda-toolkit | grep -i version

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Version: 10.1.243-3
$

Go to the NVIDIA website, select Ubuntu 20.04 and follow the installation steps that appear.

The install command must name the version explicitly, such as cuda-11-0. When I ran it without the version, CUDA 11.2 was installed.

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
$ sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
$ sudo apt-get update
$ sudo apt-get install cuda-11-0
$ sudo reboot

After rebooting, check that the CUDA Toolkit is installed correctly with the nvcc -V command. The shell reports that nvcc is not found, but that is only because /usr/local/cuda/bin is not on the PATH, so I add it.

$ nvcc -V

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

$ /usr/local/cuda/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
$
$ echo 'export PATH="/usr/local/cuda/bin:$PATH"' | sudo tee -a /etc/bash.bashrc
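
As a rough sketch of how the PATH step and the version check could be scripted: `path_prepend` and `nvcc_release` are hypothetical helper names, and `nvcc_release` assumes the "release X.Y," wording seen in the nvcc -V output above.

```shell
# Prepend a directory ($1) to PATH unless it is already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

# Print the toolkit release (for example 11.0) from `nvcc -V` output on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Example:
#   path_prepend /usr/local/cuda/bin
#   nvcc -V | nvcc_release
```

Checking before prepending keeps PATH from growing each time the startup file is sourced.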

Install `cuDNN`

Download the cuDNN SDK 8.0.4 from the NVIDIA website. An NVIDIA developer account (free registration) is required to download.

There is no download specifically for Ubuntu 20.04, so download "cuDNN Library for Linux (x86_64)". Extracting the archive yields two folders containing the header files and libraries, but nothing indicates where to copy them. The txt file is the license agreement.

$ ls
include  lib64  NVIDIA_SLA_cuDNN_Support.txt
$
$ ls include/
cudnn_adv_infer.h  cudnn_cnn_infer.h  cudnn_ops_infer.h
cudnn_adv_train.h  cudnn_cnn_train.h  cudnn_ops_train.h
cudnn_backend.h    cudnn.h            cudnn_version.h
$
$ ls lib64/
libcudnn_adv_infer.so        libcudnn_cnn_train.so.8.0.4
libcudnn_adv_infer.so.8      libcudnn_ops_infer.so
libcudnn_adv_infer.so.8.0.4  libcudnn_ops_infer.so.8
libcudnn_adv_train.so        libcudnn_ops_infer.so.8.0.4
libcudnn_adv_train.so.8      libcudnn_ops_train.so
libcudnn_adv_train.so.8.0.4  libcudnn_ops_train.so.8
libcudnn_cnn_infer.so        libcudnn_ops_train.so.8.0.4
libcudnn_cnn_infer.so.8      libcudnn.so
libcudnn_cnn_infer.so.8.0.4  libcudnn.so.8
libcudnn_cnn_train.so        libcudnn.so.8.0.4
libcudnn_cnn_train.so.8      libcudnn_static.a
$

A quick search turned up the official documentation, which specifies the copy destinations and says to make the files readable. The -P option keeps the library symlinks as symlinks when copying. I also add the library directory to the library search path. Only lib64 needs to be on LD_LIBRARY_PATH, since a header directory has no effect there.

$ sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
$ sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64

$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
$ echo 'export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"' | sudo tee -a /etc/bash.bashrc
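
Before building any sample, a quick way to confirm that the headers landed in the right place is to read the version macros from cudnn_version.h. This is a minimal sketch, assuming the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integers; `cudnn_header_version` is a hypothetical helper name, and the default path is the copy destination used above.

```shell
# Print the cuDNN version (MAJOR.MINOR.PATCHLEVEL) found in a cudnn_version.h file.
# $1: path to the header (defaults to the copy destination used above).
cudnn_header_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            if (major == "") exit 1    # macros not found
            print major "." minor "." patch
        }
    ' "${1:-/usr/local/cuda/include/cudnn_version.h}"
}

# Example:
#   cudnn_header_version
```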

To check that the installation succeeded, compile and run the mnistCUDNN sample. The necessary files are in the "cuDNN Code Samples and User Guide" package, so download the deb file for Ubuntu 18.04 and extract it.

$ mkdir libcudnn8-samples
$ dpkg-deb -x libcudnn8-samples_8.0.4.30-1+cuda11.0_amd64.deb libcudnn8-samples
$
$ cd libcudnn8-samples/usr/src/cudnn_samples_v8/mnistCUDNN
$ make clean && make
$ ./mnistCUDNN
mnistCUDNN 
Executing: mnistCUDNN
cudnnGetVersion() : 8004 , CUDNN_VERSION from cudnn.h : 8004 (8.0.4)
Host compiler version : GCC 9.3.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 80  Capabilities 7.0, SmClock 1380.0 Mhz, MemSize (Mb) 16160, MemClock 877.0 Mhz, Ecc=1, boardGroupID=0
Using device 0
...
...
0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!
$

The sample compiled and the test passed. This completes the CUDA setup.
