
Distributed training case #3 (In Jupyter)

by 여행을 떠나자! 2021. 9. 26.

2021.06.25


1. Distributed training with TensorFlow in a Kubeflow Jupyter (GPU-allocated) environment
- Environments
   Remote - development environment
       Kubeflow Jupyter (GPU)
   Remote - training environment
       Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
       Harbor 2.2.1 (Private docker registry)
       Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0

       CentOS 7.8
- Related technology
    - Tensorflow MirroredStrategy (Data parallelism)
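To illustrate the data-parallel idea behind MirroredStrategy, here is a plain-NumPy sketch (not the TensorFlow API): each replica receives an equal shard of the global batch, computes a local result, and the results are averaged as in an all-reduce step. The numbers and the "gradient" computation are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of synchronous data parallelism:
# a global batch is split evenly across replicas (GPUs),
# each replica computes a local gradient, and the gradients
# are averaged (all-reduce) into one synchronized update.
num_replicas = 2
global_batch = np.arange(64, dtype=np.float32).reshape(64, 1)

# Split the global batch into per-replica shards
shards = np.split(global_batch, num_replicas)

# Each replica computes a "gradient" (here just the shard mean, for illustration)
local_grads = [shard.mean() for shard in shards]

# All-reduce: average the local gradients
synced_grad = sum(local_grads) / num_replicas
print(len(shards), shards[0].shape, synced_grad)
```

MirroredStrategy performs this synchronization per training step with NCCL all-reduce, so all GPU copies of the model stay identical.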

 

 

2. Prerequisites
a. Remote (Kubernetes cluster)
- Create a Kubeflow account & set resource quotas (GPU, CPU, Memory, Disk)
- Disable istio-injection

$ kubectl label namespace yoosung-jeon istio-injection=disabled --overwrite

 

b. Remote (Kubeflow Jupyter)

- Create a Kubeflow Jupyter notebook

   i. Log in to the Kubeflow dashboard (http://kf.acp.kt.co.kr)

   ii. Create a notebook server

       Menu (top left) > Notebook Servers > 'New Server'

          Name: study-gpu

          Image: repo.acp.kt.co.kr/kubeflow/kubeflow-images-private/tensorflow-2.5.0-notebook-gpu:1.0.1

          CPU: 0.5

          Memory: 8Gi

          Num of GPUs: 2

          GPU Vendor: NVIDIA

   iii. Connect to Jupyter

- Notes

   The maximum number of GPUs that can be allocated is the largest number of currently unused GPUs on any single worker node in the Kubernetes cluster.

   From the moment the Jupyter notebook is created, the allocated GPUs are held exclusively and cannot be used by other applications.

$ k get pod -n yoosung-jeon -o wide | egrep 'NAME|study-gpu'
NAME          READY  STATUS   RESTARTS  AGE  IP           NODE   NOMINATED NODE  READINESS GATES
study-gpu-0   1/1    Running  0         52m  10.244.4.74  iap10  <none>          <none>
$ k exec `k get pod -n gpu-operator-resources -o wide | grep nvidia-driver-daemonset | grep iap10 | awk '{print $1}'` -n gpu-operator-resources -it -- nvidia-smi
Fri Jun 25 06:56:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0    38W / 250W |  31895MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   28C    P0    35W / 250W |  31895MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    291829      C   /usr/bin/python3                31891MiB |
|    1   N/A  N/A    291829      C   /usr/bin/python3                31891MiB |
+-----------------------------------------------------------------------------+
$
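Inside the notebook itself, the allocation can also be confirmed from TensorFlow. This is a minimal check, not from the original post; the device count reported depends on what was actually allocated to the notebook server (two V100s in the setup above).

```python
import tensorflow as tf

# List the GPUs TensorFlow can see inside the notebook pod;
# with the server configured above this should report two devices.
gpus = tf.config.list_physical_devices('GPU')
print(len(gpus), [g.name for g in gpus])

# MirroredStrategy uses all visible GPUs by default;
# on a machine without GPUs it falls back to a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)
```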

 

 

3. Distributed training code
- Source
   3.jupyter-gpu/keras-multiWorkerMirroredStrategy.ipynb
- Code snippet

import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU (synchronous data parallelism)
strategy = tf.distribute.MirroredStrategy()

def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])
  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model

# Variables created inside the scope are mirrored across the GPUs
with strategy.scope():
  multi_gpu_model = build_and_compile_cnn_model()

 

4. Running the distributed training
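This step is not shown in the post, so the following is a hedged sketch of how the compiled model could be trained under the strategy. The synthetic MNIST-shaped data, the per-replica batch size of 64, and the epoch count are assumptions for illustration, not taken from the notebook; scaling the global batch size by `num_replicas_in_sync` keeps the per-GPU batch constant as GPUs are added.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Scale the batch size with the replica count so each GPU sees
# a constant per-replica batch (64 here is an assumption)
per_replica_batch = 64
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Synthetic MNIST-shaped data stands in for the real dataset
x = np.random.rand(512, 28, 28, 1).astype('float32')
y = np.random.randint(0, 10, size=(512,))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(512).batch(global_batch)

# Same model as in section 3, built inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])

# model.fit automatically distributes each batch across the replicas
history = model.fit(dataset, epochs=2, verbose=2)
print(history.history['loss'])
```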

 

5. Source files

keras-mirroredStrategy.ipynb
keras-mirroredStrategy.py

 
