
Distributed training case #3 (In Jupyter)

by 여행을 떠나자! 2021. 9. 26.

2021.06.25


1. Distributed training with TensorFlow in a Kubeflow Jupyter (GPU-allocated) environment
- Environments
   Remote - development environment
       Kubeflow Jupyter (GPU)
   Remote - training environment
       Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
       Harbor 2.2.1 (Private docker registry)
       Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0

       CentOS 7.8
- Related technology
    - Tensorflow MirroredStrategy (Data parallelism)
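To illustrate the data-parallel idea behind MirroredStrategy, here is a plain-NumPy sketch (not the TensorFlow API): each replica receives an equal shard of the global batch, computes a local result, and the results are averaged as in an all-reduce step. The numbers and the "gradient" computation are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of synchronous data parallelism:
# a global batch is split evenly across replicas (GPUs),
# each replica computes a local gradient, and the gradients
# are averaged (all-reduce) into one synchronized update.
num_replicas = 2
global_batch = np.arange(64, dtype=np.float32).reshape(64, 1)

# Split the global batch into per-replica shards
shards = np.split(global_batch, num_replicas)

# Each replica computes a "gradient" (here just the shard mean, for illustration)
local_grads = [shard.mean() for shard in shards]

# All-reduce: average the local gradients
synced_grad = sum(local_grads) / num_replicas
print(len(shards), shards[0].shape, synced_grad)
```

MirroredStrategy performs this synchronization per training step with NCCL all-reduce, so all GPU copies of the model stay identical.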

 

 

2. Prerequisites
a. Remote (Kubernetes cluster)
- Create a Kubeflow account & set resource quotas (GPU, CPU, Memory, Disk)
- Disable istio-injection

$ kubectl label namespace yoosung-jeon istio-injection=disabled --overwrite

 

b. Remote (Kubeflow Jupyter)

- Create a Kubeflow Jupyter notebook

   i. Log in to the Kubeflow dashboard (http://kf.acp.kt.co.kr)

   ii. Create a notebook server

       Menu (top left) > Notebook Servers > 'New Server'

          Name: study-gpu

          Image: repo.acp.kt.co.kr/kubeflow/kubeflow-images-private/tensorflow-2.5.0-notebook-gpu:1.0.1

          CPU: 0.5

          Memory: 8Gi

          Num of GPUs: 2

          GPU Vendor: NVIDIA

   iii. Connect to Jupyter

- Notes

   The maximum number of GPUs that can be allocated is the largest number of currently unused GPUs on any single worker node in the Kubernetes cluster.

   From the moment the Jupyter notebook is created, the allocated GPUs are held exclusively and cannot be used by other applications.

$ k get pod -n yoosung-jeon -o wide | egrep 'NAME|study-gpu'
NAME          READY  STATUS   RESTARTS  AGE  IP           NODE   NOMINATED NODE  READINESS GATES
study-gpu-0   1/1    Running  0         52m  10.244.4.74  iap10  <none>          <none>
$ k exec `k get pod -n gpu-operator-resources -o wide | grep nvidia-driver-daemonset | grep iap10 | awk '{print $1}'` -n gpu-operator-resources -it -- nvidia-smi
Fri Jun 25 06:56:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0    38W / 250W |  31895MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   28C    P0    35W / 250W |  31895MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    291829      C   /usr/bin/python3                31891MiB |
|    1   N/A  N/A    291829      C   /usr/bin/python3                31891MiB |
+-----------------------------------------------------------------------------+
$
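Inside the notebook itself, the allocation can also be confirmed from TensorFlow. This is a minimal check, not from the original post; the device count reported depends on what was actually allocated to the notebook server (two V100s in the setup above).

```python
import tensorflow as tf

# List the GPUs TensorFlow can see inside the notebook pod;
# with the server configured above this should report two devices.
gpus = tf.config.list_physical_devices('GPU')
print(len(gpus), [g.name for g in gpus])

# MirroredStrategy uses all visible GPUs by default;
# on a machine without GPUs it falls back to a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)
```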

 

 

3. Distributed training code
- Source
   3.jupyter-gpu/keras-multiWorkerMirroredStrategy.ipynb
- Code snippet

import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU (synchronous data parallelism)
strategy = tf.distribute.MirroredStrategy()

def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])
  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model

# Variables created inside the scope are mirrored across the GPUs
with strategy.scope():
  multi_gpu_model = build_and_compile_cnn_model()

 

4. Running the distributed training
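This step is not shown in the post, so the following is a hedged sketch of how the compiled model could be trained under the strategy. The synthetic MNIST-shaped data, the per-replica batch size of 64, and the epoch count are assumptions for illustration, not taken from the notebook; scaling the global batch size by `num_replicas_in_sync` keeps the per-GPU batch constant as GPUs are added.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Scale the batch size with the replica count so each GPU sees
# a constant per-replica batch (64 here is an assumption)
per_replica_batch = 64
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Synthetic MNIST-shaped data stands in for the real dataset
x = np.random.rand(512, 28, 28, 1).astype('float32')
y = np.random.randint(0, 10, size=(512,))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(512).batch(global_batch)

# Same model as in section 3, built inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])

# model.fit automatically distributes each batch across the replicas
history = model.fit(dataset, epochs=2, verbose=2)
print(history.history['loss'])
```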

 

5. Source files

keras-mirroredStrategy.ipynb
keras-mirroredStrategy.py

 
