Distributed training case #1 (from macOS)

by 여행을 떠나자! 2021. 9. 26.

2021.06.24

1. Distributed training from a local environment (TensorFlow)
Environments
   ✓ Local - development environment
       Python 3.8.5, Jupyter / PyCharm (optional)
   ✓ Remote - training environment
       Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
           Master nodes: 3, Worker nodes: 4
       Harbor 2.2.1 (private Docker registry)
       NVIDIA V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0
       CentOS 7.8

Flow (for TensorFlow)
   ✓ Local (Mac, Windows, Linux)
       a. Docker image build
       b. Docker image push
       c. Kubeflow TFJob deployment
   ✓ Remote (Kubeflow / Kubernetes)
       d. Kubernetes Pod creation (chief Pod, worker #n Pods)
           Docker image pull and container creation; GPU / CPU / memory resource allocation
       e. Kubernetes Pod execution
           TensorFlow distributed training
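
For steps d and e, the TFJob operator sets a TF_CONFIG environment variable in every Pod so that TensorFlow can find its peers. The sketch below shows roughly what that value looks like for one chief and two workers. The host-name pattern and port 2222 are illustrative assumptions rather than values read from this cluster, and build_tf_config is a hypothetical helper.

```python
import json


def build_tf_config(job_name, namespace, num_workers, task_type, task_index, port=2222):
    """Return a TF_CONFIG-style dict for one Pod of a chief + N worker job."""

    def host(role, index):
        # Assumed naming pattern: <job>-<role>-<index>.<namespace>.svc:<port>
        return f"{job_name}-{role}-{index}.{namespace}.svc:{port}"

    cluster = {
        "chief": [host("chief", 0)],
        "worker": [host("worker", i) for i in range(num_workers)],
    }
    # "cluster" is identical in every Pod; "task" identifies the Pod itself.
    return {"cluster": cluster, "task": {"type": task_type, "index": task_index}}


if __name__ == "__main__":
    cfg = build_tf_config("mnist-mwms-training-4edc", "yoosung-jeon", 2, "worker", 1)
    print(json.dumps(cfg, indent=2))
```

Every Pod receives the same cluster section and a different task section, which is how each process learns its own role.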

- Related technologies
   ✓ Kubeflow Fairing
   ✓ TensorFlow MultiWorkerMirroredStrategy (data parallelism)
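
MultiWorkerMirroredStrategy performs synchronous data parallelism: each replica computes gradients on its own slice of the batch, the gradients are combined across replicas, and every replica applies the same update. The toy sketch below illustrates that principle in plain Python for a one-parameter model. It is not TensorFlow's implementation, all names are hypothetical, and it assumes the batch divides evenly across replicas.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_replicas):
    """Split the batch evenly across replicas and average the per-replica gradients."""
    shard = len(xs) // num_replicas  # assumes len(xs) is divisible by num_replicas
    grads = []
    for r in range(num_replicas):
        lo, hi = r * shard, (r + 1) * shard
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    return sum(grads) / num_replicas


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 13)]  # 12 samples, 4 per replica
    ys = [3.0 * x for x in xs]
    full = grad_mse(0.5, xs, ys)
    sharded = data_parallel_grad(0.5, xs, ys, num_replicas=3)
    print(full, sharded)
```

With equally sized shards, the averaged gradient matches the full-batch gradient up to floating-point rounding, which is why the replicas stay in sync.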

 

 

2. Prerequisites

a. Remote (Kubernetes cluster)

- Create a Kubernetes namespace and set resource quotas (GPU, CPU, memory, disk)

$ kubectl create namespace yoosung-jeon
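
The command above only creates the namespace; the quota itself is not shown. As a hedged sketch, a ResourceQuota manifest for this namespace could be generated as follows. The quota name and every limit value are placeholder assumptions, not the settings used on this cluster.

```python
import json


def resource_quota(namespace, gpus, cpu, memory, storage, name="training-quota"):
    """Build a ResourceQuota manifest as a plain dict (all values are placeholders)."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hard": {
                "requests.nvidia.com/gpu": str(gpus),  # GPUs requestable in the namespace
                "requests.cpu": cpu,
                "requests.memory": memory,
                "requests.storage": storage,  # sum of all PVC storage requests
            }
        },
    }


if __name__ == "__main__":
    quota = resource_quota("yoosung-jeon", gpus=4, cpu="16", memory="64Gi", storage="10Gi")
    print(json.dumps(quota, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest can be saved to a file and applied with kubectl apply -f.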

- Create the private Docker registry credential

$ kubectl create secret docker-registry acp-agp-reg-cred -n yoosung-jeon \
  --docker-server=repo.acp.kt.co.kr --docker-username=agp --docker-password=*****
$ kubectl patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"acp-agp-reg-cred\"}]}"
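
For reference, kubectl create secret docker-registry stores the credentials in a Secret of type kubernetes.io/dockerconfigjson. The sketch below rebuilds that payload in Python to show what the Secret holds. The password is a placeholder and docker_config_json is a hypothetical helper.

```python
import base64
import json


def docker_config_json(server, username, password):
    """Return the JSON document stored under the Secret's .dockerconfigjson key."""
    # "auth" is the base64 encoding of "username:password".
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps(
        {"auths": {server: {"username": username, "password": password, "auth": auth}}}
    )


if __name__ == "__main__":
    # PLACEHOLDER stands in for the real registry password.
    print(docker_config_json("repo.acp.kt.co.kr", "agp", "PLACEHOLDER"))
```

Patching the default service account with imagePullSecrets, as done above, lets every Pod in the namespace pull from the registry without naming the Secret in each Pod spec.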

- Create a Kubernetes PV (persistent volume)

   A disk shared by the distributed training applications; the PVC below requests it from the nfs-sc-iap storage class.

$ vi mnist-mwms-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-mwms-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f mnist-mwms-pvc.yaml
…
$ cp -r tensorflow_datasets `k get pvc -n yoosung-jeon | grep -w mnist-mwms-pvc | awk '{printf("/nfs_01/yoosung-jeon-%s-%s", $1, $3)}'`
$
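
The cp command above copies the dataset into the NFS directory that backs the PVC. awk builds that path from the PVC name (column 1) and the bound volume name (column 3) of the kubectl get pvc output, prefixed with /nfs_01/yoosung-jeon-. The same derivation in Python, as a sketch; the sample row and its volume name are made up for illustration.

```python
def nfs_dir_for_pvc(pvc_line, namespace, nfs_root="/nfs_01"):
    """Derive the backing NFS directory from one row of `kubectl get pvc` output."""
    cols = pvc_line.split()
    pvc_name, volume_name = cols[0], cols[2]  # NAME and VOLUME columns
    return f"{nfs_root}/{namespace}-{pvc_name}-{volume_name}"


if __name__ == "__main__":
    # Hypothetical row; the volume name is a placeholder, not a real PV.
    sample = "mnist-mwms-pvc  Bound  pvc-00000000-aaaa  1Gi  RWX  nfs-sc-iap  1m"
    print(nfs_dir_for_pvc(sample, "yoosung-jeon"))
```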

 

b. Local

- Log in to the private Docker registry

   If the private Docker registry uses a self-signed certificate, add its CA certificate first

yoosungjeon@ysjeon-Dev ~ % vi .docker/certs.d/repo.acp.kt.co.kr/ca.crt
-----BEGIN CERTIFICATE-----
MIIC9TCCAd2gAwIBAgIRAJfPc9ZtIltNnaskJyio3u0wDQYJKoZIhvcNAQELBQAw
...
-----END CERTIFICATE-----
yoosungjeon@ysjeon-Dev ~ % docker login repo.acp.kt.co.kr
Username: agp
Password:
Login Succeeded
yoosungjeon@ysjeon-Dev ~ %

- Create the Kubernetes cluster configuration

yoosungjeon@ysjeon-Dev ~ % vi .kube/config 
…
yoosungjeon@ysjeon-Dev ~ %

 

 

3. Coding the distributed training
a. Fairing code
- Source
   1.local-dev-fairing/fairing.py
- Code snippet

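from kubeflow import fairing
from kubeflow.fairing.kubernetes.utils import get_resource_mutator, volume_mounts

# Preprocessor: package the files listed in output_map and set the container command.
fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
# Builder: build the image from the local Dockerfile and push it to the private registry.
fairing.config.set_builder(name='docker', registry=docker_registry, base_image="",
                           image_name=project_name, dockerfile_path="Dockerfile")
# Deployer: submit a TFJob with chief and worker replicas, mount the shared PVC, request GPUs.
fairing.config.set_deployer(name='tfjob', namespace=k8s_namespace, stream_log=False, job_name=tfjob_name,
                            chief_count=num_chief, worker_count=num_workers,
                            pod_spec_mutators=[
                              volume_mounts(volume_type='pvc', volume_name=k8s_pvc_name, mount_path=mount_dir),
                              get_resource_mutator(gpu=gpus_per_worker, gpu_vendor='nvidia')]
                            )
# Run the preprocess, build, and deploy steps configured above.
fairing.config.run()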
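
The snippet refers to variables that are defined elsewhere in fairing.py. Below is a sketch of plausible definitions, using values that appear in the run log in section 4 where possible (registry, image name, container command, namespace, PVC name, one chief and two workers). The output_map contents, gpus_per_worker, and the job-name suffix scheme are assumptions.

```python
import uuid

docker_registry = "repo.acp.kt.co.kr/agp"
project_name = "mnist-mwms"
k8s_namespace = "yoosung-jeon"
k8s_pvc_name = "mnist-mwms-pvc"
mount_dir = "/mnt"
num_chief = 1
num_workers = 2
gpus_per_worker = 1  # assumption: one GPU per Pod
global_batch_size = 200

# Assumed scheme: a random 4-character suffix so repeated runs get distinct TFJob names.
tfjob_name = f"{project_name}-training-{uuid.uuid4().hex[:4]}"

# Command executed inside each container, as printed in the build log.
command = [
    "python",
    "/opt/mnist-mwms.py",
    f"--tf-mount-dir={mount_dir}",
    f"--tf-global-batch-size={global_batch_size}",
]

# Assumption: local files copied into the Docker build context (source -> path in context).
output_map = {"Dockerfile": "Dockerfile", "mnist-mwms.py": "mnist-mwms.py"}

if __name__ == "__main__":
    print(tfjob_name)
    print(" ".join(command))
```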

 

b. Distributed training code
- Source
  1.local-dev-fairing/Dockerfile
  1.local-dev-fairing/mnist-mwms.py
- Code snippet
   ✓ Dockerfile

FROM tensorflow/tensorflow:2.5.0-gpu
ADD mnist-mwms.py /opt/mnist-mwms.py
RUN pip install tensorflow-datasets

   ✓ mnist-mwms.py

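import tensorflow as tf

# Synchronous data-parallel training across all Pods; the cluster layout is read from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Build and compile the model inside the scope so its variables are mirrored on every worker.
with strategy.scope():
  multi_worker_model = build_and_compile_cnn_model()

multi_worker_model.fit(x=train_datasets, epochs=10)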
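
The build log in section 4 shows the script being started with --tf-mount-dir and --tf-global-batch-size. The sketch below shows one way a training script could read those flags and detect whether it is running as the chief. The defaults and helper names are assumptions, and the actual mnist-mwms.py may differ.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Parse the two flags that appear in the container command."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tf-mount-dir", default="/mnt", help="mount path of the shared PVC")
    parser.add_argument("--tf-global-batch-size", type=int, default=200)
    return parser.parse_args(argv)


def is_chief(tf_config_json):
    """True when the task type is chief, or when TF_CONFIG is absent (single process)."""
    if not tf_config_json:
        return True
    task = json.loads(tf_config_json).get("task", {})
    return task.get("type") == "chief"


if __name__ == "__main__":
    args = parse_args()
    chief = is_chief(os.environ.get("TF_CONFIG"))
    print(args.tf_mount_dir, args.tf_global_batch_size, chief)
```

A typical use of the chief check is to let only the chief write the final model to the shared mount directory while the workers write to a temporary location.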

 

 

4. Running the distributed training

a. Local

yoosungjeon@ysjeon-Dev 1.local-dev-fairing % python fairing.py
[I 210624 14:18:15 config:134] Using preprocessor: <kubeflow.fairing.preprocessors.base.BasePreProcessor object at 0x7fbb8c4c5dc0>
[I 210624 14:18:15 config:136] Using builder: <kubeflow.fairing.builders.docker.docker.DockerBuilder object at 0x7fbb8c4c5e20>
[I 210624 14:18:15 config:138] Using deployer: <kubeflow.fairing.deployers.tfjob.tfjob.TfJob object at 0x7fbb8c4c5e50>
[I 210624 14:18:15 docker:32] Building image using docker
[W 210624 14:18:15 docker:41] Docker command: ['python', '/opt/mnist-mwms.py', '--tf-mount-dir=/mnt', '--tf-global-batch-size=200']
[I 210624 14:18:15 base:107] Creating docker context: /tmp/fairing_context_n0cgilcf
[W 210624 14:18:15 docker:56] Building docker image repo.acp.kt.co.kr/agp/mnist-mwms:C675727...
[I 210624 14:18:16 docker:103] Build output: Step 1/6 : FROM tensorflow/tensorflow:2.5.0-gpu
...
[I 210624 14:18:16 docker:103] Build output: Successfully tagged repo.acp.kt.co.kr/agp/mnist-mwms:C675727
[W 210624 14:18:16 docker:70] Publishing image repo.acp.kt.co.kr/agp/mnist-mwms:C675727...
[I 210624 14:18:16 docker:103] Push output: The push refers to repository [repo.acp.kt.co.kr/agp/mnist-mwms] None
...
[I 210624 14:18:18 docker:103] Push finished: {'Tag': 'C675727', 'Digest': 'sha256:3e0c938c70e00d79e2c265d15b59f28939e3caed754fdec46e065ad2f88a0d6e', 'Size': 4304}
[W 210624 14:18:18 job:101] The tfjob mnist-mwms-training-4edc launched.
yoosungjeon@ysjeon-Dev 1.local-dev-fairing %

 

b. Remote (Kubernetes cluster)

$ kubectl get pod -n yoosung-jeon -o wide | egrep 'NAME|mnist-mwms-training-4edc'
NAME                               READY  STATUS   RESTARTS  AGE  IP            NODE
mnist-mwms-training-4edc-chief-0   1/1    Running  0         6s   10.244.3.3    iap11
mnist-mwms-training-4edc-worker-0  1/1    Running  0         6s   10.244.4.41   iap10
mnist-mwms-training-4edc-worker-1  1/1    Running  0         6s   10.244.4.43   iap10
$
$ iap10 lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
$ iap11 lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
$
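
The listing shows the chief on iap11 and both workers on iap10, and each node reports two V100 cards. A small sketch that groups the Pods by node, using the rows of the kubectl get pod -o wide output above; pods_by_node is a hypothetical helper.

```python
from collections import defaultdict

# Rows copied from the `kubectl get pod -o wide` output in this section.
POD_LISTING = """\
mnist-mwms-training-4edc-chief-0   1/1    Running  0         6s   10.244.3.3    iap11
mnist-mwms-training-4edc-worker-0  1/1    Running  0         6s   10.244.4.41   iap10
mnist-mwms-training-4edc-worker-1  1/1    Running  0         6s   10.244.4.43   iap10
"""


def pods_by_node(listing):
    """Group Pod names by the node they are scheduled on."""
    groups = defaultdict(list)
    for line in listing.strip().splitlines():
        cols = line.split()
        groups[cols[-1]].append(cols[0])  # NODE is the last column, NAME the first
    return dict(groups)


if __name__ == "__main__":
    for node, pods in sorted(pods_by_node(POD_LISTING).items()):
        print(node, pods)
```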

 

 

5. Source files

Dockerfile
fairing.py
mnist-mwms.py
serviceaccount-default.yaml
mnist-pvc.yaml

 
