
GPU Operator on CentOS

by 여행을 떠나자! 2021. 9. 21.

2020.12.23

 

1. NVIDIA GPU Operator

https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/

- Simplifying GPU Management in Kubernetes

- To provision GPU worker nodes in a Kubernetes cluster, the following NVIDIA software components are required – the driver, container runtime, device plugin and monitoring. The GPU Operator simplifies both the initial deployment and management of the components by containerizing all the components and using standard Kubernetes APIs for automating and managing these components including versioning and upgrades.

- The GPU operator should run on nodes that are equipped with GPUs. To determine which nodes have GPUs, the operator relies on Node Feature Discovery (NFD) within Kubernetes.

- The NFD worker detects various hardware features on the node – for example, PCIe device ids, kernel versions, memory and other attributes. It then advertises these features to Kubernetes using node labels. The GPU operator then uses these node labels to determine if NVIDIA software components should be provisioned on the node.
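
- For example, once NFD is running, the labels it advertises can be checked directly on a node (a minimal check, assuming kubectl access; 10de is NVIDIA's PCI vendor ID and <gpu-node-name> is a placeholder):

$ kubectl get node <gpu-node-name> --show-labels | tr ',' '\n' | grep pci-10de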

 

 

2. Environments

- Software

  ✓ CentOS 7.8, 3.10.0-1127.el7.x86_64, Kubernetes 1.16.15, Docker 19.03.13

  ✓ GPU Operator 1.4.0

          NVIDIA Driver 450.80.02

          CUDA 11.0

          NVIDIA k8s device plugin 0.7.1

          NVIDIA container toolkit 1.4.0

          NVIDIA DCGM-exporter 2.0.13-2.1.2

          Node Feature Discovery 0.6.0

          GPU Feature Discovery 0.2.2

- GPU Card

  ✓ NVIDIA Tesla V100

 

 

3. Running the GPU Operator

a. Prerequisite

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#prerequisites

- Nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).

- On the nodes equipped with GPUs:

# lspci | grep -i nvidia
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
# modprobe -a i2c_core ipmi_msghandler
# lsmod | egrep "i2c_core|ipmi_msghandler"
ipmi_msghandler        56728  2 ipmi_ssif,ipmi_si
#
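
- Before installing, it is worth confirming that no NVIDIA components are already present on the GPU nodes (a quick check; see Troubleshooting #3 below for what a pre-configured node looks like). All three commands should return nothing on a clean node:

# lsmod | grep nvidia
# rpm -qa | grep -i nvidia
# docker info 2>/dev/null | grep -i nvidia
#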

 

b. Deploy the GPU Operator

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-helm
https://github.com/NVIDIA/gpu-operator/tags
https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
"nvidia" has been added to your repositories
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
…
Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
$

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/release-notes.html#id1
   Due to a known limitation with the GPU Operator's default values on CentOS, install the operator on CentOS 7/8 with the toolkit.version override shown below.
       This issue is expected to be fixed in the next release.

$ helm install nvidia/gpu-operator -n gpu-operator --generate-name --wait --version 1.4.0 --set toolkit.version=1.4.0-ubi8 # --dry-run > gpu-operator-1.4.0.yaml
NAME: gpu-operator-1606980427
LAST DEPLOYED: Thu Dec  3 16:27:09 2020
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ helm list -n gpu-operator
NAME                    NAMESPACE    REVISION UPDATED                                 STATUS   CHART              APP VERSION
gpu-operator-1608268482 gpu-operator 1        2020-12-18 14:14:44.091117517 +0900 KST deployed gpu-operator-1.4.0 1.4.0
$ # helm pull nvidia/gpu-operator --create-namespace -n gpu-operator --generate-name  --version 1.3.0 --untar
$
$ k get pod -n gpu-operator -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE     IP              NODE    NOMINATED NODE   READINESS GATES
gpu-operator-1606963460-node-feature-discovery-master-6cb6p9db7   1/1     Running   0          2m51s   10.244.9.250    iap11   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-4792b       1/1     Running   0          2m50s   14.52.244.214   iap11   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-85d5v       1/1     Running   0          2m49s   14.52.244.208   iap04   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-h89rh       1/1     Running   0          2m49s   14.52.244.141   iap09   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-knr6c       1/1     Running   0          2m50s   14.52.244.140   iap08   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-ph5xv       1/1     Running   0          2m48s   14.52.244.213   iap10   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-vf42q       1/1     Running   0          2m51s   14.52.244.210   iap05   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-vxph8       1/1     Running   0          2m49s   14.52.244.211   iap06   <none>           <none>
gpu-operator-1606963460-node-feature-discovery-worker-xz4w5       1/1     Running   0          2m49s   14.52.244.139   iap07   <none>           <none>
gpu-operator-7659485bf7-fnljx                                     1/1     Running   0          2m51s   10.244.9.251    iap11   <none>           <none>
$
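
- To follow what the operator itself is doing while the operands roll out, its log can be tailed (a sketch; the pod name is taken from the listing above):

$ k logs gpu-operator-7659485bf7-fnljx -n gpu-operator -f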

 

- Troubleshooting

   ▷ Problem

$ k describe pod gpu-operator-b9fb68b4b-br7dt -n gpu-operator | grep Event -A10
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
…
  Warning  Failed     19s (x5 over 104s)  kubelet, iap10     Error: failed to start container "gpu-operator": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
$

   ▷ Solution

       Remove the leftover /etc/docker/daemon.json and restart Docker (see below). When the gpu-operator pod starts up normally, daemon.json is generated automatically, and it is deleted again when the pod is removed.

       The same symptom occurred on nvidia-container-toolkit-daemonset-qsm9g and was handled the same way.

# cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
# rm /etc/docker/daemon.json
# systemctl restart docker
#

- After a few minutes, the GPU operator would have deployed all the NVIDIA software components. The output also shows the validation containers run as part of the GPU operator state machine.
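
- The rollout of these components can be watched in the gpu-operator-resources namespace (a sketch; -w streams updates until interrupted):

$ k get pod -n gpu-operator-resources -w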

 

c. Check NFD (Node Feature Discovery)

- NVIDIA GPU Feature Discovery for Kubernetes is a software component that automatically generates labels for the set of GPUs available on a node.

https://github.com/NVIDIA/gpu-feature-discovery

- Labels added on the GPU nodes

   ✓ GPU Operator v1.3.0, v1.4.0

       feature.node.kubernetes.io/pci-0302_10de.present:true
       nvidia.com/gpu.present:true
       nvidia.com/gpu.product=Tesla-V100-PCIE-32GB

   ✓ GPU Operator v1.2.0

       feature.node.kubernetes.io/pci-10de.present:true

   ✓ PCI vendor ID 10de is NVIDIA Corporation

$ k get nodes iap10 -o jsonpath='{.metadata.labels}' | sed 's/ /\n/g' | grep pci
feature.node.kubernetes.io/pci-102b.present:true
feature.node.kubernetes.io/pci-10de.present:true
$ k get nodes iap11 -o jsonpath='{.metadata.labels}' | sed 's/ /\n/g' | grep pci
feature.node.kubernetes.io/pci-102b.present:true
feature.node.kubernetes.io/pci-10de.present:true
# V1.3.0, V1.4.0
$ k get nodes iap10 -o jsonpath='{.metadata.labels}' | sed 's/ /\n/g' | grep "gpu.present"
nvidia.com/gpu.present:true]
$

 

d. Check NVIDIA GPU Driver

- The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers. This provides a few benefits over a standard driver installation, for example:

   ✓ Ease of deployment

   ✓ Fast installation

   ✓ Reproducibility

- The driver container can be used in standard Linux distributions and container operating system distributions such as Red Hat CoreOS, Flatcar or other distributions. The driver container is also deployed as part of the NVIDIA GPU Operator.

- GPU Driver image release (https://ngc.nvidia.com/catalog/containers/nvidia:driver/tags)

   ✓ 450.80.02-centos7 : 12/05/2020

   ✓ 450.80.02-centos8 : 12/04/2020

   ✓ 450.80.02-rhel7.9 : 11/05/2020

   ✓ 450.80.02-1.0.1-rhel7: 10/29/2020

$ k edit daemonset.apps nvidia-driver-daemonset -n gpu-operator-resources
…
    spec:
      containers:
      # Before
      # - args:
      #   - init
      #   command:
      #   - nvidia-driver
      - args:
        - ln -s /lib/modules/3.10.0-1160.6.1.el7.x86_64 /lib/modules/3.10.0-1127.el7.x86_64 && nvidia-driver init
        command:
        - /bin/sh
        - -c
…
$ k get pod -n gpu-operator-resources | egrep "NAME|nvidia-driver"
NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-driver-daemonset-rfpgh              1/1     Running            162        2m
nvidia-driver-daemonset-rxwxk              1/1     Running            162        2m
nvidia-driver-validation                   0/1     Completed           0          6s
$ k logs nvidia-driver-daemonset-rfpgh -n gpu-operator-resources
========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 3.10.0-1127.el7.x86_64
…
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 450.80.02) is now complete.

Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
$ k logs nvidia-driver-validation -n gpu-operator-resources
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
$

 

- Troubleshooting #1

   ▷ Problem: "Could not resolve Linux kernel version"

       The same symptom occurs with GPU Operator v1.3.0 and v1.4.0.

$ k logs nvidia-driver-daemonset-rxwxk -n gpu-operator-resources
========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 3.10.0-1127.el7.x86_64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Unable to open the file '/lib/modules/3.10.0-1127.el7.x86_64/proc/version' (No such file or directory).Could not resolve Linux kernel version
Resolving Linux kernel version...
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
$

   ▷ Cause

$ k exec nvidia-driver-daemonset-rxwxk -n gpu-operator-resources -it -- sh
sh-4.2# uname -r
3.10.0-1127.el7.x86_64
sh-4.2# ls /lib/modules/3.10.0-1127.el7.x86_64
ls: cannot access /lib/modules/3.10.0-1127.el7.x86_64: No such file or directory
sh-4.2# cat /lib/modules/3.10.0-1160.6.1.el7.x86_64/proc/version
Linux version 3.10.0-1160.6.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Nov 17 13:59:11 UTC 2020
sh-4.2#

   ▷ Solution

$ k edit daemonset.apps nvidia-driver-daemonset -n gpu-operator-resources
…
    spec:
      containers:
      # Before
      # - args:
      #   - init
      #   command:
      #   - nvidia-driver
      - args:
        - ln -s /lib/modules/3.10.0-1160.6.1.el7.x86_64 /lib/modules/3.10.0-1127.el7.x86_64 && nvidia-driver init
        command:
        - /bin/sh
        - -c
…
$

- Troubleshooting #2

   ▷ Problem: 

       Occurs only with GPU Operator v1.3.0; not observed with v1.4.0.

$ k describe pod nvidia-driver-validation -n gpu-operator-resources | grep Events -A10
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  <unknown>            default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-validation to iap10
  Normal   Pulled     10m (x5 over 11m)    kubelet, iap10     Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
  Normal   Created    10m (x5 over 11m)    kubelet, iap10     Created container cuda-vector-add
  Warning  Failed     10m (x5 over 11m)    kubelet, iap10     Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: open failed: /run/nvidia/driver/sbin/ldconfig.real: no such file or directory\\\\n\\\"\"": unknown
  Warning  BackOff    101s (x46 over 11m)  kubelet, iap10     Back-off restarting failed container
$

   ▷ Solution

# ln -s /run/nvidia/driver/sbin/ldconfig /run/nvidia/driver/sbin/ldconfig.real

- Troubleshooting #3

   ▷ Problem: 

$ k logs nvidia-driver-daemonset-rfpgh -n gpu-operator-resources
… Could not unload NVIDIA driver kernel modules, driver is in use
$

   ▷ Cause: the NVIDIA driver is already installed on the host OS

# lsmod | egrep "Module|nvidia"
Module                  Size  Used by
nvidia_uvm            966382  0
nvidia_drm             48606  0
nvidia_modeset       1176938  1 nvidia_drm
nvidia              19636654  23 nvidia_modeset,nvidia_uvm
drm_kms_helper        186531  3 mgag200,nouveau,nvidia_drm
drm                   456166  6 ttm,drm_kms_helper,mgag200,nouveau,nvidia_drm
#
## The third column ('Used by') shows how many instances of the module are currently in use.

# rpm -qa | grep nvidia | sort
libnvidia-container-tools-1.3.0-1.x86_64
libnvidia-container1-1.3.0-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
nvidia-container-toolkit-1.3.0-2.x86_64
nvidia-docker2-2.5.0-1.noarch
# docker info | grep -i nvidia
 Runtimes: nvidia runc
 Default Runtime: nvidia
#

   ▷ Solution

# yum remove nvidia-docker2 nvidia-container-runtime nvidia-container-toolkit
# cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
# rm /etc/docker/daemon.json
# systemctl restart docker
# docker info | grep -i nvidia
#

## How to Uninstall Manually Installed Nvidia Drivers in Linux (https://computingforgeeks.com/how-to-uninstall-manually-installed-nvidia-drivers-in-linux/)
# sh NVIDIA-Linux-x86_64-450.80.02.run --uninstall

Verifying archive integrity... OK

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 450.80.02...........

# systemctl restart kubelet 
# lsmod | egrep "Module|nvidia"
Module                  Size  Used by
#

 

e. Check nvidia-container-toolkit-daemonset

- The NVIDIA Container Toolkit allows users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.

$ k get pod -n gpu-operator-resources | egrep "NAME|nvidia-container-toolkit-daemonset"
NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-container-toolkit-daemonset-6pcqq   1/1     Running            0          15m
nvidia-container-toolkit-daemonset-nfnxj   1/1     Running            0          14m
$ k logs nvidia-container-toolkit-daemonset-6pcqq -n gpu-operator-resources | egrep "^\["
[INFO] _init
[INFO] =================Starting the NVIDIA Container Toolkit=================
[INFO] toolkit::remove /usr/local/nvidia/toolkit
[INFO] toolkit::install
[INFO] toolkit::setup::config /usr/local/nvidia/toolkit
[INFO] toolkit::setup::cli_binary /usr/local/nvidia/toolkit
[INFO] toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit
[INFO] toolkit::setup::runtime_binary /usr/local/nvidia/toolkit
[INFO] Refreshing the docker daemon configuration
[INFO] =================Done, Now Waiting for signal=================
$

## GPU Node
# ls -l /usr/local/nvidia/toolkit/
-rwxr-xr-x 1 root root     166  5월 17 19:06 nvidia-container-runtime
-rwxr-xr-x 1 root root     195  5월 17 19:06 nvidia-container-toolkit
...
# cat /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}
#
# ps -ef | grep 2947 | head -n 1
root   2947      1 11  5월17 ?   1-13:17:49 /usr/bin/containerd
[root@iap10 toolkit]# ps -ef | grep 87639 | grep -v grep
root  87639   2947  0  5월25 ?   00:13:49 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8c2ada4978993ba088244ad0372062be816c4937e8868dccfb7ef1230a013611 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-nvidia
# ps -ef | grep 132113 | grep -v grep
root 132113  87639  0 10:24 ?   00:00:00 /bin/sh /usr/local/nvidia/toolkit/nvidia-container-runtime --root /var/run/docker/runtime-nvidia/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/8c2ada4978993ba088244ad0372062be816c4937e8868dccfb7ef1230a013611/log.json --log-format json exec --process /tmp/runc-process237715488 --detach --pid-file /run/containerd/io.containerd.runtime.v1.linux/moby/8c2ada4978993ba088244ad0372062be816c4937e8868dccfb7ef1230a013611/3c86331cedd36dcf42180e63fab16c8622dab10bd056f11d7396da24edbf41a2.pid 8c2ada4978993ba088244ad0372062be816c4937e8868dccfb7ef1230a013611
# ps -ef | grep 132113 | grep -v grep
#
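
- As a quick sanity check of this Docker-level configuration, a GPU container can be run directly on the node; with "default-runtime": "nvidia" in daemon.json, nvidia-smi inside the container should list the GPUs (a sketch, assuming the nvidia/cuda:11.0-base image can be pulled):

# docker run --rm nvidia/cuda:11.0-base nvidia-smi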

 

f. Check Device plugin

- The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically

   ✓ Expose the number of GPUs on each node of your cluster

   ✓ Keep track of the health of your GPUs

   ✓ Run GPU enabled containers in your Kubernetes cluster.

$ k get pod -n gpu-operator-resources | egrep "NAME|nvidia-device-plugin"
NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-device-plugin-daemonset-kq46z       1/1     Running            0          17m
nvidia-device-plugin-daemonset-qwfhf       1/1     Running            0          17m
nvidia-device-plugin-validation            0/1     Completed          0          9m40s
$ k describe nodes iap10
…
Capacity:
 cpu:                80
…
 nvidia.com/gpu:     2
…
$
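
- The allocatable GPU count advertised by the device plugin can also be read per node with jsonpath (a sketch; the dots in the resource name must be escaped). For iap10 this should print 2, matching the capacity above:

$ k get node iap10 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'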

 

g. Check dcgm-exporter

- DCGM-Exporter (https://github.com/NVIDIA/gpu-monitoring-tools)

- It exposes GPU metrics for Prometheus, leveraging NVIDIA DCGM.

$ k get pod -n gpu-operator-resources | egrep "NAME|nvidia-dcgm-exporter”
NAME                                       READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-426f4                 1/1     Running     0          4d23h
nvidia-dcgm-exporter-wxkq6                 1/1     Running     0          4d23h
$
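
- The exporter serves Prometheus metrics on port 9400 under /metrics; they can be inspected through a port-forward (a sketch; the pod name is taken from the listing above, and DCGM_FI_DEV_GPU_UTIL is one of the default GPU utilization metrics):

$ k port-forward -n gpu-operator-resources nvidia-dcgm-exporter-426f4 9400:9400 &
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL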

- Troubleshooting 

   ▷ Problem #1

$ k describe pod nvidia-dcgm-exporter-56qjq -n gpu-operator-resources | grep Events -A15
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  <unknown>               default-scheduler  Successfully assigned gpu-operator-resources/nvidia-dcgm-exporter-56qjq to iap10
  Warning  Failed     4m21s                   kubelet, iap10     Error: failed to start container "nvidia-dcgm-exporter": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/f7a84d034eb655d20d188dc6d74e4d3e3c620ce1c66aae08257062f8f2cf8973/merged/dev/nvidia-uvm: no such device\\\\n\\\"\"": unknown
…

   ▷ Problem #2

$ k logs nvidia-dcgm-exporter-cp4fm -n gpu-operator-resources
time="2020-12-16T06:22:43Z" level=info msg="Starting dcgm-exporter"
time="2020-12-16T06:22:43Z" level=info msg="DCGM successfully initialized!"
time="2020-12-16T06:22:43Z" level=fatal msg="Error watching fields: The third-party Profiling module returned an unrecoverable error"
$

   ▷ Solution: upgrade the GPU Operator from v1.3.0 to v1.4.0

 

 

4. Running Sample GPU Applications

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#running-sample-gpu-applications

 

a. CUDA FP16 Matrix multiply

$  cat << EOF | kubectl create -f -
   apiVersion: v1
   kind: Pod
   metadata:
      name: dcgmproftester
   spec:
      restartPolicy: OnFailure
      containers:
      - name: dcgmproftester11
        image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
        args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
$ k logs dcgmproftester -f
Skipping CreateDcgmGroups() since DCGM validation is disabled
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 2048
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 80
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 98304
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 0
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 4096
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 877000
Max Memory bandwidth: 898048000000 bytes (898.05 GiB)
CudaInit completed successfully.

Skipping WatchFields() since DCGM validation is disabled
TensorEngineActive: generated ???, dcgm 0.000 (72284.4 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (67266.7 gflops)
…
TensorEngineActive: generated ???, dcgm 0.000 (54648.1 gflops)
Skipping UnwatchFields() since DCGM validation is disabled
$ watch kubectl exec nvidia-driver-daemonset-kl4gp -n gpu-operator-resources -it -- nvidia-smi
Mon May 31 01:02:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   55C    P0   189W / 250W |    493MiB / 32510MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   26C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16354      C   /usr/bin/dcgmproftester11         489MiB |
+-----------------------------------------------------------------------------+
...
$ k exec dcgmproftester -it -- nvidia-smi
Mon May 31 01:05:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   61C    P0   188W / 250W |    493MiB / 32510MiB |     66%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
$

 

b. Jupyter Notebook

$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml
service/tf-notebook created
pod/tf-notebook created
$ k get pod tf-notebook -o wide
NAME          READY   STATUS    RESTARTS   AGE     IP              NODE    NOMINATED NODE   READINESS GATES
tf-notebook   1/1     Running   0          5m38s   10.244.10.132   iap10   <none>           <none>
$ k get svc tf-notebook
NAME          TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
tf-notebook   NodePort   10.111.206.73   <none>        80:30001/TCP   5m51s
$ k logs tf-notebook
[I 00:29:30.727 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 00:29:30.918 NotebookApp] Serving notebooks from local directory: /tf
[I 00:29:30.918 NotebookApp] Jupyter Notebook 6.1.5 is running at:
[I 00:29:30.918 NotebookApp] http://tf-notebook:8888/?token=736793c504d7935ca45282bd40163bc4a1897ffec7509037
[I 00:29:30.918 NotebookApp]  or http://127.0.0.1:8888/?token=736793c504d7935ca45282bd40163bc4a1897ffec7509037
[I 00:29:30.918 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 00:29:30.922 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://tf-notebook:8888/?token=736793c504d7935ca45282bd40163bc4a1897ffec7509037
     or http://127.0.0.1:8888/?token=736793c504d7935ca45282bd40163bc4a1897ffec7509037
[I 00:30:43.032 NotebookApp] 302 GET / (10.244.0.0) 1.57ms
[I 00:30:43.084 NotebookApp] 302 GET /tree? (10.244.0.0) 1.16ms
$

- http://14.52.244.136:30001/?token=736793c504d7935ca45282bd40163bc4a1897ffec7509037
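
- To confirm the notebook pod actually sees a GPU, the devices visible inside the pod can be listed (a quick check; nvidia-smi -L prints one line per GPU):

$ k exec tf-notebook -- nvidia-smi -L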
