
Distributed training example #2 (From KF Jupyter, Tensorflow)

by 여행을 떠나자! 2021. 9. 26.

2021.06.24

 

1. Distributed training in a Kubeflow Jupyter environment (Tensorflow)
Environments
   ✓ Remote - development environment
       Kubeflow Jupyter
   ✓ Remote - training environment
       Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
       Nexus (Private docker registry)
       NVIDIA V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0
       CentOS 7.8
Flow (Tensorflow)
   ✓ Remote (Kubeflow Jupyter)
       a. Docker Image build
       b. Docker Image Push
       c. Kubeflow TFJob deployment
   ✓ Remote (Kubeflow / Kubernetes)
       d. Kubernetes Pod creation (Chief Pod, Worker #n Pod)
           Docker image pull & container creation, GPU / CPU / memory resource allocation
       e. Kubernetes Pod execution
           Tensorflow distributed training
- Related technologies
  ✓ Kubeflow fairing
  ✓ Tensorflow MultiWorkerMirroredStrategy (Data parallelism)


2. Prerequisites
a. Remote (Kubernetes cluster)
- Create a Kubeflow account & set resource quotas (GPU, CPU, Memory, Disk)
- Create the private Docker registry credential
   The private Docker registry must serve a TLS certificate (as of Kubeflow 1.2, self-signed certificates are not supported)

$ k create secret docker-registry chelsea-agp-reg-cred -n yoosung-jeon \
  --docker-server=repo.chelsea.kt.co.kr --docker-username=agp --docker-password=*****
$ k patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"chelsea-agp-reg-cred\"}]}"

- Create a Kubernetes PV (Persistent volume) through a PersistentVolumeClaim

   This is the disk shared by the distributed training applications

$ vi mnist-mwms-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-mwms-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f mnist-mwms-pvc.yaml
…
$ cp -r tensorflow_datasets `k get pvc -n yoosung-jeon | grep -w mnist-mwms-pvc | awk '{printf("/nfs_01/yoosung-jeon-%s-%s", $1, $3)}'`
$
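
The `cp` command above derives the target NFS directory from the `kubectl get pvc` output: the awk script joins the namespace, the PVC name (column 1), and the bound volume name (column 3) under `/nfs_01`. The same path construction can be sketched in Python as follows. The `/nfs_01` root and the `<namespace>-<pvc>-<volume>` layout are taken from the command above and are specific to this cluster's NFS setup; the sample row and its volume name are hypothetical.

```python
def nfs_dir_for_pvc(pvc_line: str, namespace: str, nfs_root: str = "/nfs_01") -> str:
    """Build the NFS directory for a PVC from one row of `kubectl get pvc` output."""
    columns = pvc_line.split()
    pvc_name, volume_name = columns[0], columns[2]  # same fields as awk's $1 and $3
    return f"{nfs_root}/{namespace}-{pvc_name}-{volume_name}"


# Hypothetical row; the volume name (pvc-0000) is made up for illustration.
row = "mnist-mwms-pvc Bound pvc-0000 1Gi RWX nfs-sc-iap 5m"
print(nfs_dir_for_pvc(row, "yoosung-jeon"))
```

Copying `tensorflow_datasets` into that directory makes the dataset visible to every pod that mounts the PVC.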

 

b. Remote (Kubeflow Jupyter)

- Create a Kubeflow Jupyter notebook

   i. Kubeflow dashboard login (http://kf.acp.kt.co.kr)

   ii. Create a notebook server

       Menu (top left) > Notebook Servers > ‘New Server’

            Name: Study

            Image: repo.acp.kt.co.kr/kubeflow/kubeflow-images-private/tensorflow-2.5.0-notebook-cpu:1.0.1

   iii. Connect to Jupyter

   iv. Open a terminal

   v. Create the private Docker registry credential

$ mkdir .docker && vi .docker/config.json
{
        "auths": {
                "repo.chelsea.kt.co.kr": {"auth": "YWdwOm5ldzEy****"}
        }
}
$
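
The `auth` value in `config.json` is the base64 encoding of `username:password`. As a minimal sketch, the file contents could be generated like this; the `docker_config` helper and the credentials are placeholders for illustration, not the ones used in this post.

```python
import base64
import json


def docker_config(registry: str, username: str, password: str) -> str:
    """Return the contents of a Docker config.json holding one registry login."""
    # Docker stores the login as base64("username:password") under "auth".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": token}}}, indent=2)


# Placeholder credentials, for illustration only.
print(docker_config("repo.chelsea.kt.co.kr", "user", "secret"))
```

Because base64 is an encoding and not encryption, the file should be treated as a secret.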

 

 

3. Coding the distributed training

a. Base image 생성

    Add the Python packages used by the distributed training application to the base image. Skip this step if there are no packages to add.

yoosungjeon@ysjeon-Dev create-base-docker-image % vi Dockerfile
FROM tensorflow/tensorflow:2.5.0-gpu

COPY requirements.txt .
RUN pip install -r requirements.txt
yoosungjeon@ysjeon-Dev create-base-docker-image % vi requirements.txt
tensorflow-datasets==4.3.0
yoosungjeon@ysjeon-Dev create-base-docker-image % docker build -t repo.chelsea.kt.co.kr/agp/tensorflow-custom:2.5.0-gpu -f Dockerfile .
…
yoosungjeon@ysjeon-Dev create-base-docker-image % docker push repo.chelsea.kt.co.kr/agp/tensorflow-custom:2.5.0-gpu
yoosungjeon@ysjeon-Dev create-base-docker-image %

 

b. Fairing & Distributed training code
- Source
   2.jupyter-cpu-fairing/keras-multiWorkerMirroredStrategy.ipynb
- Code snippet

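def main():
    # Synchronous data-parallel strategy; the cluster layout is read from TF_CONFIG
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    ...
    # Build the model inside the scope so its variables are mirrored on every worker
    with strategy.scope():
        multi_worker_model = build_and_compile_cnn_model()
    ...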

def fairing_run():
    base_image = 'repo.chelsea.kt.co.kr/agp/tensorflow-custom:2.5.0-gpu'
    ...
    fairing.config.set_builder(name='append', registry=docker_registry,
                               base_image=base_image, image_name=image_name)
    fairing.config.set_deployer(name='tfjob', namespace=k8s_namespace, stream_log=False, job_name=tfjob_name,
                            chief_count=num_chief, worker_count=num_workers,
                            pod_spec_mutators=[
                              volume_mounts(volume_type='pvc', volume_name=k8s_pvc_name, mount_path=mount_dir),
                              get_resource_mutator(gpu=gpus_per_worker, gpu_vendor='nvidia')]
                            )
    fairing.config.run()

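if __name__ == '__main__':
    ...
    # FAIRING_RUNTIME is unset in the notebook and expected to be set in the image built by fairing
    if os.getenv('FAIRING_RUNTIME', None) is None:
        fairing_run()   # notebook: build the image and submit the TFJob
    else:
        main()          # TFJob pod: run the training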
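
With a synchronous data-parallel strategy such as MultiWorkerMirroredStrategy, each global batch is split across all replicas, so the global batch size is usually scaled with the number of replicas. The sketch below illustrates that arithmetic using the parameter names from the snippet above (`num_chief`, `num_workers`, `gpus_per_worker`). The `global_batch_size` helper and the per-replica batch size of 64 are assumptions for illustration and are not taken from the notebook.

```python
def global_batch_size(per_replica_batch: int, num_chief: int, num_workers: int,
                      gpus_per_worker: int) -> int:
    """Global batch size for synchronous data-parallel training."""
    pods = num_chief + num_workers              # the chief also runs training steps
    replicas = pods * max(gpus_per_worker, 1)   # a CPU-only pod counts as one replica
    return per_replica_batch * replicas


# 1 chief + 2 workers with 1 GPU each, 64 samples per replica (hypothetical values).
print(global_batch_size(64, num_chief=1, num_workers=2, gpus_per_worker=1))
```

Keeping the per-replica batch fixed and letting the global batch grow is one common choice; the learning rate may then need retuning.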

 

 

4. Running the distributed training

a. Remote (Kubeflow Jupyter)

 

b. Remote (Kubernetes cluster)

$ kubectl get pod -n yoosung-jeon -o wide | egrep 'NAME|mnist-mwms-training-4edc'
NAME                               READY  STATUS   RESTARTS  AGE  IP            NODE
mnist-mwms-training-4edc-chief-0   1/1    Running  0         6s   10.244.3.3    iap11
mnist-mwms-training-4edc-worker-0  1/1    Running  0         6s   10.244.4.41   iap10
mnist-mwms-training-4edc-worker-1  1/1    Running  0         6s   10.244.4.43   iap10
$
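
The TFJob operator gives each of the pods listed above a `TF_CONFIG` environment variable, which is how MultiWorkerMirroredStrategy discovers its peers. The sketch below builds what that variable might look like for this job. The `tf_config` helper is hypothetical, and the `<pod>.<namespace>.svc:2222` address format and port are assumptions about the operator's defaults, not output captured from this cluster.

```python
import json


def tf_config(job: str, namespace: str, num_workers: int,
              task_type: str, task_index: int, port: int = 2222) -> str:
    """Build a TF_CONFIG value for one pod of a chief + N workers job."""
    def address(role: str, index: int) -> str:
        # Assumed address format: one service per pod, named after the pod.
        return f"{job}-{role}-{index}.{namespace}.svc:{port}"

    cluster = {
        "chief": [address("chief", 0)],
        "worker": [address("worker", i) for i in range(num_workers)],
    }
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})


# TF_CONFIG as worker-1 of the job shown above might see it.
print(tf_config("mnist-mwms-training-4edc", "yoosung-jeon", 2, "worker", 1))
```

Every pod gets the same `cluster` section and differs only in `task`, which tells the process which entry it is.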

 

 

5. Source

keras-multiWorkerMirroredStrategy.ipynb
keras-multiWorkerMirroredStrategy.py
Dockerfile
