2021.06.24
1. Distributed training from a local environment (TensorFlow)
- Environments
✓ Local - development environment
Python 3.8.5, Jupyter / PyCharm (optional)
✓ Remote - training environment
Kubeflow 1.2 (the machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
Master nodes: 3, Worker nodes: 4
Harbor 2.2.1 (private Docker registry)
NVIDIA V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0
CentOS 7.8
- Flow (TensorFlow)
✓ Local (Mac, Windows, Linux)
a. Build the Docker image
b. Push the Docker image
c. Deploy a Kubeflow TFJob
✓ Remote (Kubeflow / Kubernetes)
d. Create Kubernetes pods (one chief pod, n worker pods)
Pull the Docker image, create the containers, and allocate GPU / CPU / memory resources
e. Run the Kubernetes pods
TensorFlow distributed training
- Related technologies
✓ Kubeflow Fairing
✓ TensorFlow MultiWorkerMirroredStrategy (data parallelism)
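✓ Note: TFJob wires the pods together by injecting a TF_CONFIG environment variable into each one, and MultiWorkerMirroredStrategy parses it at construction time to discover its peers. A minimal sketch of what a worker pod receives (host names are illustrative, modeled on the pod names in section 4; port 2222 is the TFJob default):
import json
import os

# Illustrative TF_CONFIG for worker 0 of this job (not copied from a real pod):
example = {
    "cluster": {
        "chief":  ["mnist-mwms-training-4edc-chief-0:2222"],
        "worker": ["mnist-mwms-training-4edc-worker-0:2222",
                   "mnist-mwms-training-4edc-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},  # differs per pod
}

# Inside a pod, the real value can be inspected like this:
tf_config = json.loads(os.environ.get("TF_CONFIG", json.dumps(example)))
print(tf_config["cluster"], tf_config["task"])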
2. Preliminary work
a. Remote (Kubernetes cluster)
- Create a Kubernetes namespace & set resource quotas (GPU, CPU, memory, disk)
$ kubectl create namespace yoosung-jeon
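The quota setup itself isn't shown here; one way to do it is via the kubernetes Python client. A sketch with placeholder limits (the quota name and values are assumptions, not taken from this cluster):
from kubernetes import client, config

config.load_kube_config()  # uses the ~/.kube/config created in 2.b

# Placeholder limits: adjust to the cluster's actual capacity.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="yoosung-jeon-quota",
                                 namespace="yoosung-jeon"),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.cpu": "16",
        "requests.memory": "64Gi",
        "requests.nvidia.com/gpu": "4",
        "requests.storage": "100Gi",
    }),
)
client.CoreV1Api().create_namespaced_resource_quota(namespace="yoosung-jeon",
                                                    body=quota)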
- Create a private Docker registry credential
$ kubectl create secret docker-registry acp-agp-reg-cred -n yoosung-jeon \
--docker-server=repo.acp.kt.co.kr --docker-username=agp --docker-password=*****
$ kubectl patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"acp-agp-reg-cred\"}]}"
- Create Kubernetes storage: a PVC (PersistentVolumeClaim) backed by a PV (Persistent Volume)
A disk shared across the distributed training applications
$ vi mnist-mwms-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-mwms-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f mnist-mwms-pvc.yaml
…
$ cp -r tensorflow_datasets `k get pvc -n yoosung-jeon | grep -w mnist-mwms-pvc | awk '{printf("/nfs_01/yoosung-jeon-%s-%s", $1, $3)}'`
$
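The cp above seeds the shared volume with a pre-downloaded copy of the dataset so the training pods don't need internet access. The local tensorflow_datasets directory can be produced beforehand with a one-liner like this (a sketch; assumes mnist-mwms.py loads MNIST via tensorflow_datasets):
import tensorflow_datasets as tfds

# Download MNIST once on the dev machine into ./tensorflow_datasets;
# that directory is then copied onto the PVC with the cp command above.
tfds.load(name="mnist", data_dir="tensorflow_datasets", download=True)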
b. Local
- Log in to the private Docker registry
If the private Docker registry uses a self-signed certificate:
yoosungjeon@ysjeon-Dev ~ % vi .docker/certs.d/repo.acp.kt.co.kr/ca.crt
-----BEGIN CERTIFICATE-----
MIIC9TCCAd2gAwIBAgIRAJfPc9ZtIltNnaskJyio3u0wDQYJKoZIhvcNAQELBQAw
...
-----END CERTIFICATE-----
yoosungjeon@ysjeon-Dev ~ % docker login repo.acp.kt.co.kr
Username: agp
Password:
Login Succeeded
yoosungjeon@ysjeon-Dev ~ %
- Create the Kubernetes cluster configuration (kubeconfig)
yoosungjeon@ysjeon-Dev ~ % vi .kube/config
…
yoosungjeon@ysjeon-Dev ~ %
3. Coding the distributed training
a. Fairing code
- Source
1.local-dev-fairing/fairing.py
- Code snippet
# Package mnist-mwms.py and its run command into the Docker build context.
fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
# Build the image with the local Docker daemon and push it to the private registry.
fairing.config.set_builder(name='docker', registry=docker_registry, base_image="",
                           image_name=project_name, dockerfile_path="Dockerfile")
# Submit a TFJob: one chief plus num_workers workers, each mounting the
# shared PVC and requesting NVIDIA GPUs.
fairing.config.set_deployer(name='tfjob', namespace=k8s_namespace, stream_log=False, job_name=tfjob_name,
                            chief_count=num_chief, worker_count=num_workers,
                            pod_spec_mutators=[
                                volume_mounts(volume_type='pvc', volume_name=k8s_pvc_name, mount_path=mount_dir),
                                get_resource_mutator(gpu=gpus_per_worker, gpu_vendor='nvidia')]
                            )
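For reference, the snippet assumes several variables defined earlier in fairing.py. A sketch of plausible values (illustrative, reconstructed from the run log in section 4 rather than copied from the source), ending with the call that triggers the build, push, and deploy:
from kubeflow import fairing
from kubeflow.fairing.kubernetes.utils import get_resource_mutator, volume_mounts

docker_registry = "repo.acp.kt.co.kr/agp"       # private Harbor registry
project_name    = "mnist-mwms"                  # becomes the image name
tfjob_name      = f"{project_name}-training"    # fairing appends a random suffix
k8s_namespace   = "yoosung-jeon"
k8s_pvc_name    = "mnist-mwms-pvc"
mount_dir       = "/mnt"
num_chief, num_workers, gpus_per_worker = 1, 2, 1
command    = ["python", "/opt/mnist-mwms.py",
              f"--tf-mount-dir={mount_dir}", "--tf-global-batch-size=200"]
output_map = {"Dockerfile": "Dockerfile", "mnist-mwms.py": "mnist-mwms.py"}

# ... fairing.config.set_preprocessor / set_builder / set_deployer as above ...
fairing.config.run()  # build the image, push it, and submit the TFJob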
b. Distributed training code
- Source
1.local-dev-fairing/Dockerfile
1.local-dev-fairing/mnist.py
- Code snippet
✓ Dockerfile
FROM tensorflow/tensorflow:2.5.0-gpu
ADD mnist-mwms.py /opt/mnist-mwms.py
RUN pip install tensorflow-datasets
✓ mnist.py
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    # Build/compile inside the strategy scope so the model variables are
    # created as synchronized replicas across the workers.
    multi_worker_model = build_and_compile_cnn_model()
multi_worker_model.fit(x=train_datasets, epochs=10)
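build_and_compile_cnn_model() and train_datasets are defined elsewhere in mnist.py; they follow the standard TensorFlow multi-worker MNIST tutorial. A sketch of the model builder under that assumption (layer sizes from the tutorial, not the original source):
import tensorflow as tf

def build_and_compile_cnn_model():
    # Simple CNN from the standard multi-worker MNIST tutorial.
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model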
4. Running the distributed training
a. Local
yoosungjeon@ysjeon-Dev 1.local-dev-fairing % python fairing.py
[I 210624 14:18:15 config:134] Using preprocessor: <kubeflow.fairing.preprocessors.base.BasePreProcessor object at 0x7fbb8c4c5dc0>
[I 210624 14:18:15 config:136] Using builder: <kubeflow.fairing.builders.docker.docker.DockerBuilder object at 0x7fbb8c4c5e20>
[I 210624 14:18:15 config:138] Using deployer: <kubeflow.fairing.deployers.tfjob.tfjob.TfJob object at 0x7fbb8c4c5e50>
[I 210624 14:18:15 docker:32] Building image using docker
[W 210624 14:18:15 docker:41] Docker command: ['python', '/opt/mnist-mwms.py', '--tf-mount-dir=/mnt', '--tf-global-batch-size=200']
[I 210624 14:18:15 base:107] Creating docker context: /tmp/fairing_context_n0cgilcf
[W 210624 14:18:15 docker:56] Building docker image repo.acp.kt.co.kr/agp/mnist-mwms:C675727...
[I 210624 14:18:16 docker:103] Build output: Step 1/6 : FROM tensorflow/tensorflow:2.5.0-gpu
...
[I 210624 14:18:16 docker:103] Build output: Successfully tagged repo.acp.kt.co.kr/agp/mnist-mwms:C675727
[W 210624 14:18:16 docker:70] Publishing image repo.acp.kt.co.kr/agp/mnist-mwms:C675727...
[I 210624 14:18:16 docker:103] Push output: The push refers to repository [repo.acp.kt.co.kr/agp/mnist-mwms] None
...
[I 210624 14:18:18 docker:103] Push finished: {'Tag': 'C675727', 'Digest': 'sha256:3e0c938c70e00d79e2c265d15b59f28939e3caed754fdec46e065ad2f88a0d6e', 'Size': 4304}
[W 210624 14:18:18 job:101] The tfjob mnist-mwms-training-4edc launched.
yoosungjeon@ysjeon-Dev 1.local-dev-fairing %
b. Remote (Kubernetes cluster)
$ kubectl get pod -n yoosung-jeon -o wide | egrep 'NAME|mnist-mwms-training-4edc'
NAME READY STATUS RESTARTS AGE IP NODE
mnist-mwms-training-4edc-chief-0 1/1 Running 0 6s 10.244.3.3 iap11
mnist-mwms-training-4edc-worker-0 1/1 Running 0 6s 10.244.4.41 iap10
mnist-mwms-training-4edc-worker-1 1/1 Running 0 6s 10.244.4.43 iap10
$
$ iap10 lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
$ iap11 lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
$
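The lspci output confirms each node carries two V100s. To confirm that TensorFlow inside a pod actually sees the GPU it was allocated, a quick check like this can be run via kubectl exec (a sketch, not part of the original run):
import tensorflow as tf

# With one GPU allocated per worker, this should print a single
# /physical_device:GPU:0 entry.
print(tf.config.list_physical_devices("GPU"))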
5. Source