
GPU Operator Install on Ubuntu

by 여행을 떠나자! 2021. 9. 21.

2021.6.9

 

1. Environments

- Software

   ✓ Ubuntu 18.04.5 LTS (Bionic Beaver), Kubernetes 1.16.15, Docker 19.03.15

   ✓ NVIDIA Driver 460.73.01, cuda-libraries-11-2, libcudnn8_8.2.1.32 - installed separately

   ✓ GPU Operator 1.7.0

          NVIDIA k8s device plugin 0.9.0

          NVIDIA container toolkit 1.7.0

          NVIDIA DCGM-exporter 2.1.8-2.4.0

          Node Feature Discovery 0.6.0

          GPU Feature Discovery 0.4.1

- GPU Card

   ✓ NVIDIA Tesla V100

 

 

2. NVIDIA Driver & CUDA Toolkit (library) Install

- Reasons for installing the driver separately

   When the NVIDIA driver is installed via the GPU Operator, the GPU is usable only from within Kubernetes.

   Installing the driver through the GPU Operator also requires external network access, but the target environment is air-gapped.

- Reference

   https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html

 

- Preliminary checks

$ lspci | grep -i nvidia
3b:00.0 3D controller: NVIDIA Corporation Device 1df6 (rev a1)
d8:00.0 3D controller: NVIDIA Corporation Device 1df6 (rev a1)
$ lsmod | egrep 'i2c_core|ipmi_msghandler'
ipmi_msghandler       102400  3 ipmi_devintf,ipmi_si,ipmi_ssif
$ ubuntu-drivers devices
== /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 ==
modalias : pci:v000010DEd00001DF6sv000010DEsd000013D6bc03sc02i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-465 - distro non-free recommended
driver   : nvidia-driver-460-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
$
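The pre-installation checks above can be wrapped in a small script. This is a sketch only: `nouveau_loaded` is a hypothetical helper (not a standard tool) that inspects `lsmod` output to decide whether the nouveau blacklist step in the next section is still needed.

```shell
# Sketch of the pre-installation checks above. nouveau_loaded is a
# hypothetical helper: it succeeds if the nouveau module shows up in
# the lsmod output passed to it.
nouveau_loaded() {
  printf '%s\n' "$1" | grep -q '^nouveau'
}

# On a real host:
#   lspci | grep -i nvidia || echo "no NVIDIA GPU found"
#   if nouveau_loaded "$(lsmod)"; then echo "nouveau active - blacklist it"; fi
```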

- https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau

# vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
# update-initramfs -u
update-initramfs: Generating /boot/initrd.img-5.4.0-73-generic
# reboot
...

# lsmod | grep nouveau
# apt-get install gcc make
...
# gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
...
#
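The blacklist procedure above can be condensed into one script. `emit_blacklist_conf` is a helper name chosen here; writing the file under `/etc/modprobe.d` and rebuilding the initramfs require root, so those commands are left commented.

```shell
# Sketch of the nouveau blacklist procedure above. emit_blacklist_conf
# (a name chosen here, not a standard tool) prints the two modprobe
# directives shown in the section above.
emit_blacklist_conf() {
  printf 'blacklist nouveau\noptions nouveau modeset=0\n'
}

# As root:
#   emit_blacklist_conf > /etc/modprobe.d/blacklist-nouveau.conf
#   update-initramfs -u
#   reboot
```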

 

- NVIDIA Driver Install

   NVIDIA Driver Download

       https://www.nvidia.co.kr/Download/index.aspx?lang=kr

# sh ./NVIDIA-Linux-x86_64-460.73.01.run --info
  Identification    : NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.73.01
  Target directory  : NVIDIA-Linux-x86_64-460.73.01
  Uncompressed size : 623288 KB
  Compression       : xz
  Date of packaging : Thu Apr  1 22:17:32 UTC 2021
  Application run after extraction : ./nvidia-installer
#
# sh NVIDIA-Linux-x86_64-460.73.01.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.73.01.......................

The distribution-provided pre-install script failed!  Are you sure you want to continue?
(select "Continue installation")

WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries. Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; if you wish to install them, re-run the installation and set a valid directory with the --compat32-libdir option.
(select "OK")

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd
           development libraries installed, or specify a path with --glvnd-egl-config-path.

(select "OK")
...
# nvidia-smi
Tue Jun  8 16:51:09 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    37W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   29C    P0    36W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
#
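For scripting, the same information `nvidia-smi` prints above can be read in CSV form; `--query-gpu` and `--format=csv` are standard `nvidia-smi` options. `driver_version_of` is a hypothetical parsing helper added here for illustration.

```shell
# Machine-readable check of the installed driver. driver_version_of is
# a hypothetical helper that pulls the second CSV field out of one line
# of `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`.
driver_version_of() {
  printf '%s\n' "$1" | awk -F', ' '{ print $2 }'
}

# On the GPU node:
#   line=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | head -n1)
#   driver_version_of "$line"
```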

 

- CUDA Toolkit install

   The NVIDIA® CUDA® Toolkit provides a development environment for creating high performance GPU-accelerated applications.

   https://developer.nvidia.com/cuda-11.2.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda-repo-ubuntu1804-11-2-local_11.2.2-460.32.03-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1804-11-2-local_11.2.2-460.32.03-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-ubuntu1804-11-2-local/7fa2af80.pub
$ sudo apt-get update

## Install package
#   cuda-libraries-11-2: CUDA Libraries 11.2 meta-package
#   cuda-cupti-11-2: CUDA profiling tools runtime libs.
$ sudo apt-get -y install cuda-libraries-11-2 cuda-cupti-11-2

$ vi .bash_profile
...
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64
$
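After re-login, it is worth confirming the path change actually took effect. A sketch, with `has_cuda_lib_path` as a helper name invented here; it does an exact colon-separated component match rather than a substring grep.

```shell
# Verify that LD_LIBRARY_PATH contains the CUDA 11.2 library directory.
# has_cuda_lib_path is a hypothetical helper.
has_cuda_lib_path() {
  case ":$1:" in
    *:/usr/local/cuda-11.2/lib64:*) return 0 ;;
    *) return 1 ;;
  esac
}

# After sourcing .bash_profile:
#   has_cuda_lib_path "$LD_LIBRARY_PATH" && echo "CUDA lib path OK"
# The loader cache is an alternative check: ldconfig -p | grep libcudart
```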

 

- cuDNN Install

   The NVIDIA® CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.

   https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.2.1.32-1+cuda11.3_amd64.deb
$ sudo dpkg -i libcudnn8_8.2.1.32-1+cuda11.3_amd64.deb
$ vi .bash_profile
...
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64
$

 

- Log: cuda_installation.txt (attachment, 0.02 MB)

 

 

3. GPU Operator Install

- Notes

   Because the target environment has no external network access, instead of installing with helm directly, the YAML manifests were generated with helm in an internet-connected environment and then applied to the cluster.

- Reference

   https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator

 

- Install

yoosungjeon@ysjeon-Dev ~ % helm repo list | egrep 'NAME|nvidia'
NAME                    URL
nvidia                  https://helm.ngc.nvidia.com/nvidia
yoosungjeon@ysjeon-Dev ~ % helm install nvidia/gpu-operator -n gpu-operator --create-namespace  --generate-name --version 1.7.0  --set driver.enabled=false --dry-run  > gpu-operator-1.7.0_driver-disable.yaml
yoosungjeon@ysjeon-Dev ~ % wget \
   https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.7.0/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
yoosungjeon@ysjeon-Dev ~ %

## GPU Node
$ k create -f nvidia.com_clusterpolicies_crd.yaml
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com created
$ k create ns gpu-operator
namespace/gpu-operator created
$ k apply -f gpu-operator-1.7.0_driver-disable.yaml -n gpu-operator
serviceaccount/node-feature-discovery created
serviceaccount/gpu-operator created
configmap/gpu-operator-1623115703-node-feature-discovery created
clusterrole.rbac.authorization.k8s.io/gpu-operator-1623115703-node-feature-discovery-master created
clusterrole.rbac.authorization.k8s.io/gpu-operator created
clusterrolebinding.rbac.authorization.k8s.io/gpu-operator-1623115703-node-feature-discovery-master created
clusterrolebinding.rbac.authorization.k8s.io/gpu-operator created
service/gpu-operator-1623115703-node-feature-discovery created
daemonset.apps/gpu-operator-1623115703-node-feature-discovery-worker created
deployment.apps/gpu-operator-1623115703-node-feature-discovery-master created
deployment.apps/gpu-operator created
clusterpolicy.nvidia.com/cluster-policy created
$

 

- Troubleshooting

  Comment out the priorityClassName configured on the GPU Operator workloads (Deployments, DaemonSets)

$ k describe replicaset.apps gpu-operator-xxxx -n gpu-operator | grep Events -A10
Events:
  Type     Reason        Age                     From                   Message
  ----     ------        ----                    ----                   -------
 Warning  FailedCreate  2m33s (x16 over 5m17s)  replicaset-controller  Error creating: pods "gpu-operator-57458b9d86-" is forbidden: pods with
 system-node-critical priorityClass is not permitted in gpu-operator namespace
$ k edit deployments.apps gpu-operator -n gpu-operator
...
#      priorityClassName: system-node-critical
      restartPolicy: Always
...
$ k get daemonset.apps -n gpu-operator-resources
NAME                                DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR                                    AGE
gpu-feature-discovery               0       0       0     0          0         nvidia.com/gpu.deploy.gpu-feature-discovery=true 4m44s
nvidia-container-toolkit-daemonset  0       0       0     0          0         nvidia.com/gpu.deploy.container-toolkit=true     4m44s
nvidia-dcgm-exporter                0       0       0     0          0         nvidia.com/gpu.deploy.dcgm-exporter=true         4m44s
nvidia-device-plugin-daemonset      0       0       0     0          0         nvidia.com/gpu.deploy.device-plugin=true         4m44s
nvidia-driver-daemonset             0       0       0     0          0         nvidia.com/gpu.deploy.driver=true                4m44s
nvidia-mig-manager                  0       0       0     0          0         nvidia.com/gpu.deploy.mig-manager=true           4m44s
nvidia-operator-validator           0       0       0     0          0         nvidia.com/gpu.deploy.operator-validator=true    4m44s
$ k edit daemonset.apps xxxxx -n gpu-operator-resources
...
$
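Instead of hand-editing each object, the same change can be made with a JSON patch; the patch path below assumes `priorityClassName` sits at `spec.template.spec`, as in the manifests above. `json_remove_patch` is a helper name of my choosing.

```shell
# Build a JSON patch that removes a field; json_remove_patch is a
# hypothetical helper wrapping the patch document used below.
json_remove_patch() {
  printf '[{"op":"remove","path":"%s"}]' "$1"
}

# Equivalent to commenting out priorityClassName by hand:
#   kubectl -n gpu-operator patch deployment gpu-operator --type=json \
#     -p "$(json_remove_patch /spec/template/spec/priorityClassName)"
```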

 

- Results

$ k get pod -n gpu-operator -o wide
NAME                                                            READY STATUS  RESTARTS AGE IP           NODE
gpu-operator-1623138978-node-feature-discovery-master-65f8jjggh 1/1   Running 0        24m 10.244.1.17  acp-master03
gpu-operator-1623138978-node-feature-discovery-worker-5s2s4     1/1   Running 0        24m 10.244.5.174 acp-worker03
gpu-operator-1623138978-node-feature-discovery-worker-67zmz     1/1   Running 0        24m 10.244.4.157 acp-worker02
gpu-operator-1623138978-node-feature-discovery-worker-8d9fs     1/1   Running 0        24m 10.244.2.24  acp-master02
gpu-operator-1623138978-node-feature-discovery-worker-fnsbl     1/1   Running 0        24m 10.244.0.13  acp-master01
gpu-operator-1623138978-node-feature-discovery-worker-j2qqp     1/1   Running 0        24m 10.244.1.18  acp-master03
gpu-operator-1623138978-node-feature-discovery-worker-k2bm7     1/1   Running 0        24m 10.244.7.38  acp-worker01
gpu-operator-6945d47bdd-dfwkz                                   1/1   Running 0        23m 10.244.0.14  acp-master01
$ k get pod -n gpu-operator-resources -o wide
NAME                                     READY STATUS    RESTARTS AGE   IP           NODE
gpu-feature-discovery-gszb8              1/1   Running   0        19m   10.244.7.40  acp-worker01
gpu-feature-discovery-r6pbk              1/1   Running   0        19m   10.244.5.176 acp-worker03
gpu-feature-discovery-zjpdf              1/1   Running   0        19m   10.244.4.159 acp-worker02
nvidia-container-toolkit-daemonset-69qnc 1/1   Running   0        22m   10.244.4.158 acp-worker02
nvidia-container-toolkit-daemonset-77k2r 1/1   Running   0        22m   10.244.7.39  acp-worker01
nvidia-container-toolkit-daemonset-z2xf9 1/1   Running   0        22m   10.244.5.175 acp-worker03
nvidia-cuda-validator-dc2nv              0/1   Completed 0        4m46s 10.244.4.163 acp-worker02
nvidia-cuda-validator-rczsd              0/1   Completed 0        58s   10.244.5.180 acp-worker03
nvidia-cuda-validator-ssjm8              0/1   Completed 0        17m   10.244.7.42  acp-worker01
nvidia-dcgm-exporter-dzkvd               1/1   Running   0        16m   10.244.7.45  acp-worker01
nvidia-dcgm-exporter-ll5rf               1/1   Running   0        16m   10.244.5.179 acp-worker03
nvidia-dcgm-exporter-t2csn               1/1   Running   0        16m   10.244.4.162 acp-worker02
nvidia-device-plugin-daemonset-kgrhl     1/1   Running   0        17m   10.244.7.44  acp-worker01
nvidia-device-plugin-daemonset-mqmbq     1/1   Running   0        17m   10.244.4.161 acp-worker02
nvidia-device-plugin-daemonset-vmcgr     1/1   Running   0        17m   10.244.5.178 acp-worker03
nvidia-device-plugin-validator-g22dm     0/1   Completed 0        51s   10.244.5.182 acp-worker03
nvidia-device-plugin-validator-m4vp5     0/1   Completed 0        15m   10.244.7.46  acp-worker01
nvidia-device-plugin-validator-r42gp     0/1   Completed 0        4m39s 10.244.4.165 acp-worker02
nvidia-operator-validator-klw4f          1/1   Running   0        17m   10.244.4.160 acp-worker02
nvidia-operator-validator-qk9f7          1/1   Running   0        17m   10.244.7.41  acp-worker01
nvidia-operator-validator-qzlpw          1/1   Running   0        17m   10.244.5.177 acp-worker03
$
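Once the device-plugin pods above are Running, each GPU node should advertise `nvidia.com/gpu` in its allocatable resources. The `custom-columns` query is standard kubectl; `count_gpu_nodes` is a hypothetical helper for parsing its output.

```shell
# count_gpu_nodes (hypothetical helper) counts lines of the kubectl
# custom-columns output below whose GPU column is not "<none>".
count_gpu_nodes() {
  printf '%s\n' "$1" | awk 'NR > 1 && $2 != "<none>" { n++ } END { print n + 0 }'
}

# Against the cluster:
#   out=$(kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu')
#   count_gpu_nodes "$out"   # the three acp-worker nodes in this setup
```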

 

4. Test

$ cat cuda-vectoradd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
         nvidia.com/gpu: 1
$ k apply -f cuda-vectoradd.yaml
pod/cuda-vectoradd created
acp@acp-master01:~/k8s-oss/gpu-operator$ k logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
$

$ cat cuda-load-generator.yaml
apiVersion: v1
kind: Pod
metadata:
   name: dcgmproftester
spec:
   restartPolicy: OnFailure
   containers:
   - name: dcgmproftester11
     image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
     args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
     resources:
      limits:
         nvidia.com/gpu: 1
     securityContext:
      capabilities:
         add: ["SYS_ADMIN"]
acp@acp-master01:~$ k apply -f cuda-load-generator.yaml
pod/dcgmproftester created
acp@acp-master01:~/k8s-oss/gpu-operator$ k get pod -o wide | egrep 'NAME|dcgmproftester'
NAME             READY   STATUS      RESTARTS   AGE    IP             NODE           NOMINATED NODE   READINESS GATES
dcgmproftester   1/1     Running     0          24s    10.244.5.183   acp-worker03   <none>           <none>
$

## GPU Node
$ nvidia-smi
Tue Jun  8 19:02:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   58C    P0   207W / 250W |    493MiB / 32510MiB |     70%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   28C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     49407      C   /usr/bin/dcgmproftester11         489MiB |
+-----------------------------------------------------------------------------+
$
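While the load generator runs, GPU utilisation can also be read from DCGM-exporter, which serves Prometheus text-format metrics (9400 is its default port). `metric_value` is a hypothetical helper that extracts the sample value from one metrics line.

```shell
# metric_value (hypothetical helper) returns the last field of a
# Prometheus text-format line, i.e. the sample value.
metric_value() {
  printf '%s\n' "$1" | awk '{ print $NF }'
}

# Against the cluster (9400 is the DCGM-exporter default):
#   kubectl -n gpu-operator-resources port-forward ds/nvidia-dcgm-exporter 9400:9400 &
#   curl -s localhost:9400/metrics | grep '^DCGM_FI_DEV_GPU_UTIL'
```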

 
