Distributed training overview

2021.6.28

1. What is distributed training?

a. Types of distributed training (https://ettrends.etri.re.kr/ettrends/172/0905172001/)

- Data Parallelism

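✓ Splits a large dataset across multiple machines, each of which trains on its own portion of the data

- Model Parallelism

✓ Splits the model itself across machines when a deep learning model has grown too large to be processed on a single machine

▷ Layer partitioning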

https://ettrends.etri.re.kr/ettrends/172/0905172001/images_1/2018/v33n4/ETRI_J003_2018_v33n4_1_f002.jpg

▷ Training feature partitioning

https://ettrends.etri.re.kr/ettrends/172/0905172001/images_1/2018/v33n4/ETRI_J003_2018_v33n4_1_f003.jpg
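The data-parallel idea above can be illustrated with a small, framework-free sketch. It assumes a toy one-parameter model y_hat = w * x with a mean-squared-error loss and a batch size that divides evenly by the number of workers; the function names are hypothetical and are not taken from TensorFlow or PyTorch. Under those assumptions, averaging the per-shard gradients gives the same result as computing the gradient over the whole batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split_evenly(batch: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Give each worker an equally sized contiguous shard of the batch.

    Assumes len(batch) is divisible by num_workers.
    """
    shard_size = len(batch) // num_workers
    return [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]


def data_parallel_gradient(w: float, batch: List[Sample], num_workers: int) -> float:
    """Each worker computes a gradient on its shard; the results are averaged."""
    shards = split_evenly(batch, num_workers)
    # In a real cluster these computations run on different machines.
    local_grads = [gradient(w, shard) for shard in shards]
    return sum(local_grads) / num_workers


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = gradient(0.0, batch)
parallel = data_parallel_gradient(0.0, batch, num_workers=2)
print(full, parallel)
```

Model parallelism would instead split the computation of `gradient` itself (for example by layer) across machines, which this sketch does not attempt to show.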

b. What distributed training requires

- TensorFlow/PyTorch distributed training APIs

- A distributed training environment based on Kubeflow/Kubernetes

2. Introduction to TensorFlow distributed training

a. tf.distribute.Strategy API

A TensorFlow API for distributing training across multiple GPUs, multiple machines, or multiple TPUs

Used together with TensorFlow's high-level APIs, tf.keras and tf.estimator

b. Types of strategies (https://www.tensorflow.org/guide/distributed_training)

- Synchronous data parallelism

✓ MirroredStrategy

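Synchronous distributed training on multiple GPUs in a single machine. One replica is created per GPU device, and every model variable is mirrored across the replicas.

✓ MultiWorkerMirroredStrategy

Very similar to MirroredStrategy. Synchronous distributed training across multiple workers, each of which can use multiple GPUs.

✓ TPUStrategy

Strategy for training on TPUs (Tensor Processing Units). TPUStrategy has the same structure as MirroredStrategy.

✓ CentralStorageStrategy

One replica is created per GPU device. Variables are not mirrored; they are managed on the CPU.

- Asynchronous data parallelism

✓ ParameterServerStrategy

Uses parameter servers. Each model variable is assigned to one parameter server, and training is asynchronous.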

- TensorFlow processes (MultiWorkerMirroredStrategy, ParameterServerStrategy)

✓ Chief: The chief ('master') is responsible for orchestrating the training and performing supplementary tasks, such as initializing the graph, checkpointing, saving logs for TensorBoard, and saving the model. It also manages failures and restarts. If the chief itself fails, the training is restarted from the last available checkpoint.

✓ Worker: The workers do the actual work of training the model. In some cases, worker 0 might also act as the chief.

✓ PS: The ps are parameter servers; these servers provide a distributed data store for the model parameters.

✓ Evaluator: The evaluators can be used to compute evaluation metrics as the model is trained.
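As a rough illustration of the roles above, the sketch below reads a TF_CONFIG-style JSON string and works out which role a process plays. The helper name `task_role` is hypothetical and not part of TensorFlow; it only encodes the convention described above, that worker 0 acts as the chief when the cluster has no dedicated chief.

```python
import json
from typing import Tuple


def task_role(tf_config_json: str) -> Tuple[str, int, bool]:
    """Return (task type, task index, is_chief) for one process."""
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type", "worker")
    task_index = int(task.get("index", 0))
    if "chief" in cluster:
        # A dedicated chief exists, so only that task does the chief's work.
        is_chief = task_type == "chief"
    else:
        # Without a dedicated chief, worker 0 takes over the chief's duties.
        is_chief = task_type == "worker" and task_index == 0
    return task_type, task_index, is_chief


two_workers = (
    '{"cluster": {"worker": ["10.10.10.2:12346", "10.10.10.3:12346"]},'
    ' "task": {"type": "worker", "index": 0}}'
)
role = task_role(two_workers)
print(role)
```

A training script could use the `is_chief` flag to decide, for example, which process writes checkpoints and TensorBoard logs.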

c. MultiWorkerMirroredStrategy Example

- Code snippet

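import tensorflow as tf

def main():
    # Reads the cluster layout from the TF_CONFIG environment variable.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    ...
    # Variables created inside the scope are mirrored across all workers.
    with strategy.scope():
        multi_worker_model = build_and_compile_cnn_model()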

- Launch script

# On the 1st node (10.10.10.2)
$ export TF_CONFIG='{"cluster": {"worker": ["10.10.10.2:12346", "10.10.10.3:12346"]}, "task": {"index": 0, "type": "worker"}}'
$ export CUDA_VISIBLE_DEVICES=0,1
$ python worker.py

# On the 2nd node (10.10.10.3)
$ export TF_CONFIG='{"cluster": {"worker": ["10.10.10.2:12346", "10.10.10.3:12346"]}, "task": {"index": 1, "type": "worker"}}'
$ export CUDA_VISIBLE_DEVICES=2,3
$ python worker.py
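The two TF_CONFIG values in the launch script differ only in the task index, so they can be generated rather than typed by hand. The following is a minimal sketch under that assumption; `build_tf_config` is a hypothetical helper, and only the worker-only cluster layout from the script above is covered.

```python
import json
from typing import List


def build_tf_config(worker_hosts: List[str], index: int) -> str:
    """Build the TF_CONFIG JSON for the worker at the given index."""
    if not 0 <= index < len(worker_hosts):
        raise ValueError("index must refer to one of the worker hosts")
    config = {
        # Every worker sees the same cluster description...
        "cluster": {"worker": worker_hosts},
        # ...and only the task entry differs from node to node.
        "task": {"type": "worker", "index": index},
    }
    return json.dumps(config)


hosts = ["10.10.10.2:12346", "10.10.10.3:12346"]
for i in range(len(hosts)):
    print(f"export TF_CONFIG='{build_tf_config(hosts, i)}'")
```

Each printed line can be pasted into the shell of the corresponding node before starting worker.py.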

3. Introduction to PyTorch distributed training

a. Data parallel training (https://pytorch.org/tutorials/beginner/dist_overview.html)

- DataParallel

The DataParallel package enables single-machine multi-GPU parallelism with the lowest coding hurdle.

- DistributedDataParallel

DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines.
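With DDP, every process holds a full copy of the model and trains on a different part of the dataset. The sketch below shows one simple way such a split could be made, a strided partition by rank. The function `shard_indices` is hypothetical; PyTorch's own `torch.utils.data.distributed.DistributedSampler` follows a similar idea but additionally handles shuffling and uneven dataset sizes, which are left out here.

```python
from typing import List


def shard_indices(dataset_size: int, world_size: int, rank: int) -> List[int]:
    """Indices of the samples that the process with this rank trains on."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided split: rank r takes samples r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, dataset_size, world_size))


shards = [shard_indices(10, world_size=3, rank=r) for r in range(3)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Note that the shards are disjoint and together cover the whole dataset, but in this simplified version they may differ in length by one sample.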

b. Backends - communication methods (https://pytorch.org/docs/stable/distributed.html)

- torch.distributed supports three built-in backends

✓ Gloo

✓ NCCL(NVIDIA Collective Communication Library)

✓ MPI(Message Passing Interface)

- Rule of thumb

✓ Use the NCCL backend for distributed GPU training

✓ Use the Gloo backend for distributed CPU training
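The rule of thumb above fits in a single function. This is only a sketch: `choose_backend` is a hypothetical helper, and it deliberately ignores the MPI backend.

```python
def choose_backend(use_cuda: bool) -> str:
    """Apply the rule of thumb: NCCL for GPU training, Gloo for CPU training."""
    return "nccl" if use_cuda else "gloo"


print(choose_backend(True), choose_backend(False))
```

In a training script, `use_cuda` would typically come from `torch.cuda.is_available()`, and the returned string could be passed as the `backend` argument of `torch.distributed.init_process_group`.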

c. DDP example

- Code snippet

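import logging
import os

import torch
import torch.distributed as dist
import torch.nn as nn

# Number of participating processes; defaults to 1 (no distribution) when unset.
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))

def main():
    backend = dist.Backend.NCCL
    if dist.is_available() and WORLD_SIZE > 1:
        logging.debug("Using distributed PyTorch with {} backend".format(backend))
        # With the default env:// init method, MASTER_ADDR, MASTER_PORT,
        # WORLD_SIZE, and RANK are read from the environment.
        dist.init_process_group(backend=backend)
    ...

    model = Net().to(device)
    if dist.is_available() and dist.is_initialized():
        if use_cuda:
            torch.cuda.set_device(torch.cuda.current_device())
        ddp_model = nn.parallel.DistributedDataParallel(model)
    ...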

- Launch script

# On the 1st node (10.10.10.2)
$ export MASTER_ADDR=localhost
$ export MASTER_PORT=23456
$ export WORLD_SIZE=2
$ export RANK=0
$ python pytorch-ddp.py

# On the 2nd node (10.10.10.3)
$ export MASTER_ADDR=10.10.10.2
$ export MASTER_PORT=23456
$ export WORLD_SIZE=2
$ export RANK=1
$ python pytorch-ddp.py
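The four variables exported above are all the script needs to join the process group, so it can be useful to validate them before training starts. The sketch below does this with the standard library only; `DistEnv` and `read_dist_env` are hypothetical names, not part of PyTorch.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Read and sanity-check the rendezvous variables set by the launch script."""
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [key for key in required if key not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    env = DistEnv(
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
        world_size=int(environ["WORLD_SIZE"]),
        rank=int(environ["RANK"]),
    )
    # Ranks are numbered 0 .. world_size - 1, with rank 0 on the master node.
    if not 0 <= env.rank < env.world_size:
        raise ValueError("RANK must be in [0, WORLD_SIZE)")
    return env


# Values exported on the 2nd node in the launch script above.
node2 = read_dist_env({
    "MASTER_ADDR": "10.10.10.2",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "2",
    "RANK": "1",
})
print(node2)
```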

4. Kubeflow TFJob / PyTorchJob features

a. TensorFlow Training (TFJob)

- Overview

TFJob is a Kubernetes custom resource for training a machine learning model with TensorFlow. A TFJob is defined by a YAML representation.

https://www.kubeflow.org/docs/components/training/tftraining/

- MultiWorkerMirroredStrategy Example

$ vi tfjob-test.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-mwms-training-8c00
  ...
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - python
            - /app/keras-multiWorkerMirroredStrategy.py
            image: repo.chelsea.kt.co.kr/agp/mnist-mwms:69BA2319
            resources:
              limits:
                nvidia.com/gpu: 1
          ...
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - command:
            - python
            - /app/keras-multiWorkerMirroredStrategy.py
            image: repo.chelsea.kt.co.kr/agp/mnist-mwms:69BA2319
            resources:
              limits:
                nvidia.com/gpu: 1
          ...
$ k apply -f tfjob-test.yaml -n yoosung-jeon
$ k get tfjobs.kubeflow.org -n yoosung-jeon
NAME                  STATE     AGE
mnist-training-8c00   Running   12m
$ k get pod -n yoosung-jeon | egrep 'NAME|mnist-training-8c00'
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE
mnist-training-8c00-chief-0    1/1     Running   0          12m   10.244.4.63    iap10
mnist-training-8c00-worker-0   1/1     Running   0          12m   10.244.4.131   iap11
mnist-training-8c00-worker-1   1/1     Running   0          12m   10.244.4.132   iap11
$ k describe pod mnist-mwms-training-8c00-chief-0 -n yoosung-jeon | grep Environment -A 1
   Environment:
     TF_CONFIG:        {"cluster":{"chief":["mnist-mwms-training-8c00-chief-0.yoosung-jeon.svc:2222"],"worker":["mnist-mwms-training-8c00-worker-0.yoosung-jeon.svc:2222","mnist-mwms-training-8c00-worker-1.yoosung-jeon.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
$
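When the number of workers changes often, a manifest like the one above can also be generated from code. The sketch below builds the same structure as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The helper names, the image name, and the job name are placeholders. The container name `tensorflow` is an assumption not shown in the excerpt above, based on the name the TFJob operator expects for the training container.

```python
import json
from typing import Dict, List


def replica_spec(replicas: int, image: str, command: List[str], gpus: int) -> Dict:
    """One entry of tfReplicaSpecs (for example Chief or Worker)."""
    return {
        "replicas": replicas,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",  # assumed container name
                        "image": image,
                        "command": command,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }
                ]
            }
        },
    }


def build_tfjob(name: str, namespace: str, image: str,
                command: List[str], workers: int) -> Dict:
    """TFJob with one chief and the requested number of workers, one GPU each."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Chief": replica_spec(1, image, command, gpus=1),
                "Worker": replica_spec(workers, image, command, gpus=1),
            }
        },
    }


manifest = build_tfjob(
    name="mnist-example",                       # placeholder job name
    namespace="my-namespace",                   # placeholder namespace
    image="registry.example.com/mnist:latest",  # placeholder image
    command=["python", "/app/keras-multiWorkerMirroredStrategy.py"],
    workers=2,
)
print(json.dumps(manifest, indent=2))
```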

b. PyTorch Training (PyTorchJob)

- Overview

PyTorchJob is a Kubernetes custom resource for training a machine learning model with PyTorch. A PyTorchJob is defined by a YAML representation.

https://www.kubeflow.org/docs/components/training/pytorch/

- Distributed Data Parallel Example

$ vi ddp-exam.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-pytorch-training-1687
  ...
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - python
            - /app/pytorch-mnist.py
            image: repo.chelsea.kt.co.kr/agp/mnist-pytorch:48403B0A
            resources:
              limits:
                nvidia.com/gpu: 1
        ...
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - command:
            - python
            - /app/pytorch-mnist.py
            image: repo.chelsea.kt.co.kr/agp/mnist-pytorch:48403B0A
            resources:
              limits:
                nvidia.com/gpu: 1
        ...
$ k apply -f ddp-exam.yaml
$ k get pytorchjobs.kubeflow.org -n yoosung-jeon
NAME                          STATE       AGE
mnist-pytorch-training-1687   Succeeded   6h29m
$ k get pod -n yoosung-jeon -o wide | egrep 'NAME|mnist-pytorch-training-1687'
NAME                                    READY   STATUS      RESTARTS   AGE     IP             NODE
mnist-pytorch-training-1687-master-0    0/1     Completed   0          6h35m   10.244.4.130   iap10
mnist-pytorch-training-1687-worker-0    0/1     Completed   3          6h35m   10.244.4.129   iap10
mnist-pytorch-training-1687-worker-1    0/1     Completed   1          6h35m   10.244.3.45    iap11
$ k describe pod mnist-pytorch-training-1687-master-0 -n yoosung-jeon | grep Environment -A 6
   Environment:
     FAIRING_RUNTIME:   1
     MASTER_PORT:       23456
     MASTER_ADDR:       localhost
     WORLD_SIZE:        3
     RANK:              0
     PYTHONUNBUFFERED:  0
$
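The environment shown above (WORLD_SIZE 3, RANK 0, and MASTER_ADDR localhost on the master pod) follows from the replica counts in the manifest: one master plus two workers. The sketch below reproduces that arithmetic. The function name is hypothetical, and both the worker rank numbering (worker i gets rank i + 1) and the master host name used for the workers are assumptions about how the operator fills in these values, not something shown in the output above.

```python
from typing import Dict, List


def pytorchjob_env(job_name: str, num_workers: int,
                   master_port: int = 23456) -> List[Dict[str, str]]:
    """Environment for the master (rank 0) followed by each worker (rank i + 1)."""
    world_size = 1 + num_workers  # one master plus the workers
    master_host = f"{job_name}-master-0"  # assumed host name of the master pod
    envs = []
    for rank in range(world_size):
        envs.append({
            # The master reaches itself locally; workers connect to the master.
            "MASTER_ADDR": "localhost" if rank == 0 else master_host,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank),
        })
    return envs


envs = pytorchjob_env("mnist-pytorch-training-1687", num_workers=2)
for env in envs:
    print(env)
```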

5. Kubeflow Fairing features

a. Kubeflow Fairing

- Overview

Python SDK to build, train, and deploy ML models remotely

https://www.kubeflow.org/docs/external-add-ons/fairing/fairing-overview/

https://kangwoo.kr/2020/03/14/kubeflow-fairing/

- Features and advantages

✓ Easy and fast distributed training - Kubeflow Fairing

✓ Higher GPU utilization - GPUs are occupied only while training is running

✓ Per-user resource (GPU, CPU, memory) quota management - Kubernetes

- Code snippet

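from kubeflow import fairing
from kubeflow.fairing.kubernetes.utils import get_resource_mutator, mounting_pvc

# Files to copy into the build context: local path -> path in the context.
output_map = {
    "Dockerfile": "Dockerfile",
    f"{project_name}.py": f"{project_name}.py"
}

# Preprocessor: which files go into the image and which command is run.
fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
# Builder: how the image is built and where it is pushed.
fairing.config.set_builder(name='docker', registry=DOCKER_REGISTRY, image_name="mnist",
                           base_image="", dockerfile_path="Dockerfile")
# Deployer: run the image as a TFJob with the given replica counts.
fairing.config.set_deployer(name='tfjob', namespace=my_namespace, job_name=tfjob_name, stream_log=False,
                            chief_count=num_chief, worker_count=num_workers, ps_count=num_ps,
                            pod_spec_mutators=[mounting_pvc(pvc_name=pvc_name, pvc_mount_path=model_dir),
                                               get_resource_mutator(cpu=90, memory=600),
                                               get_resource_mutator(gpu=1, gpu_vendor='nvidia')]
                            )
fairing.config.run()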

- Preprocessor API

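Defines the information Kubeflow Fairing needs in order to create the container image used for the training job

✓ Preprocessor type

• python: copies the input files into the container image

• notebook: converts the notebook into a runnable Python file, removing the non-Python parts of the notebook code

• full_notebook: runs the whole notebook as-is, including the non-Python parts; papermill is used to run the notebook

• function: FunctionPreProcessor preprocesses a single function and uses function_shim.py to call the function directly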

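- Builder API

Defines how Kubeflow Fairing builds the container image used for the training job and where the container registry is located

✓ Builder type

• docker: builds the container image with the local Docker daemon and pushes it to the container image registry

• append: adds the code as a new layer on top of an existing container image

The image is not built by pulling the base image; only the added part is pushed to the container image registry

Image creation is fast, and no Docker daemon is needed because the Python library containerregistry is used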

fairing.config.set_builder('append', registry=CONTAINER_REGISTRY, image_name="mnist", base_image="tensorflow/tensorflow:2.2.2-gpu-py3")
or
fairing.config.set_builder('append', registry=CONTAINER_REGISTRY, image_name="mnist", base_image="tensorflow/tensorflow:2.2.2-py3")

• cluster: builds the container image for the training job inside the Kubernetes cluster and pushes it to the container image registry

- Deployer API

Defines where Kubeflow Fairing deploys and runs the container image used for the training job

• TfJob: starts a TensorFlow training job using Kubeflow's TFJob component

• PyTorchJob: starts a PyTorch training job using Kubeflow's PyTorchJob component

• Job: starts a training job using a Kubernetes Job resource

• GCPJob: submits the training job to GCP

• Serving: serves a prediction endpoint using a Kubernetes Deployment and Service

• KFServing: serves a prediction endpoint using KFServing

- References

✓ A look at Kubeflow Fairing (in Korean): https://kangwoo.kr/2020/03/14/kubeflow-fairing/

✓ Tutorial for Kubeflow Fairing: https://docs.d2iq.com/dkp/kaptain/1.2.0-1.0.0/tutorials/fairing/

✓ Kubeflow Fairing SDK API reference: https://kubeflow-fairing.readthedocs.io/en/latest/

✓ kubernetes-python-client's documentation: https://kubernetes.readthedocs.io/en/latest/

6. Kubeflow

- Kubeflow is the ML toolkit for Kubernetes

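7. Introduction to KT ACP (AI Core Platform)

- What is ACP?

✓ Targets the construction of new platforms that require ML or large-scale data processing

✓ Intended to serve as a reference architecture model or as a common development environment

✓ A platform built by integrating various open-source projects on top of Kubernetes

- ACP layered architecture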
