Kubeflow/Distributed training

Distributed training 사례 #4 (From KF Jupyter, PyTorch)

by 여행을 떠나자! 2021. 9. 27.

2021.06.30


1. Kubeflow Jupyter 환경에서 Distributed training (PyTorch)
Environments
   ✓ Remote - development environment
       Kubeflow Jupyter (CPU)
   ✓ Remote - training environment
        Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
        Nexus (Private docker registry)
        NVIDIA V100 / driver 450.80, CUDA 11.2, cuDNN 8.1.0
Flow (PyTorch)
   ✓ Remote (Kubeflow Jupyter)
       a. Docker Image build
       b. Docker Image Push
       c. Kubeflow PyTorchJob deployment
   ✓ Remote (Kubeflow / Kubernetes)
       d. Create Kubernetes Pods (Chief Pod, Worker #n Pods)
           Pull the Docker image, create containers, and allocate GPU / CPU / memory resources
       e. Run the Kubernetes Pods
           PyTorch distributed training
- Related technologies
   ✓ Kubeflow fairing
   ✓ PyTorch distributed (Data parallelism)
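
In step e, the PyTorchJob controller injects the rendezvous settings (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) into each chief/worker pod, and `dist.init_process_group` reads them. A minimal sketch of reading those variables, with placeholder defaults for running outside the cluster:

```python
import os

# The PyTorchJob operator injects these into every chief/worker pod;
# the defaults here are placeholders for running outside the cluster.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "23456"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))
print(f"rank {rank}/{world_size} rendezvous at {master_addr}:{master_port}")
```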

 

 

2. Prerequisites

a. Remote (Kubernetes cluster)

- Create a Kubeflow account & set resource quotas (GPU, CPU, memory, disk)

- Create a Docker registry credential

   The private Docker registry must serve a TLS certificate (as of Kubeflow 1.2, self-signed certificates are not supported)

$ k create secret docker-registry chelsea-agp-reg-cred -n yoosung-jeon \
  --docker-server=repo.chelsea.kt.co.kr --docker-username=agp --docker-password=*****
$ k patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"chelsea-agp-reg-cred\"}]}"

- Create a Kubernetes PVC (PersistentVolumeClaim)

   A shared disk used by the distributed training applications

$ vi mnist-pytorch-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-pytorch-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f mnist-pytorch-pvc.yaml
…
$ cp -r data `k get pvc -n yoosung-jeon | grep -w mnist-pytorch-pvc | awk '{printf("/nfs_01/yoosung-jeon-%s-%s", $1, $3)}'`
$
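
The `awk` one-liner above derives the NFS directory that the storage provisioner created for the PVC. A plain-Python sketch of the same path construction, assuming (as the command implies) that the `nfs-sc-iap` provisioner names directories `<namespace>-<pvc-name>-<volume-name>` under `/nfs_01`; the sample `kubectl` row below is illustrative:

```python
# Sketch: rebuild the NFS directory path for a PVC from a `kubectl get pvc` row.
# Assumes the NFS provisioner names directories <namespace>-<pvc>-<volume>
# under /nfs_01, as the awk one-liner above implies.

def nfs_path_for_pvc(kubectl_line: str, namespace: str) -> str:
    # A `kubectl get pvc` row: NAME STATUS VOLUME CAPACITY ACCESS-MODES ...
    fields = kubectl_line.split()
    name, volume = fields[0], fields[2]
    return f"/nfs_01/{namespace}-{name}-{volume}"

line = "mnist-pytorch-pvc  Bound  pvc-1234abcd  1Gi  RWX  nfs-sc-iap  5m"
print(nfs_path_for_pvc(line, "yoosung-jeon"))
# → /nfs_01/yoosung-jeon-mnist-pytorch-pvc-pvc-1234abcd
```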

 

b. Remote (Kubeflow Jupyter)

- Create a Kubeflow Jupyter notebook

   i. Log in to the Kubeflow dashboard (http://kf.acp.kt.co.kr)

   ii. Create a notebook server

       Menu (top left) > Notebook Servers > 'New Server'

          Name: Study-pytorch

          Image: repo.acp.kt.co.kr/kubeflow/kubeflow-images-private/pytorch-1.9.0-cuda11.1-notebook:1.0.0

   iii. Connect to Jupyter

   iv. Open a terminal

   v. Create a Docker registry credential

$ mkdir .docker && vi .docker/config.json
{
        "auths": {
                "repo.chelsea.kt.co.kr": {"auth": "YWdwOm5ldzEy****"}
        }
}
$
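
The `auth` value in `config.json` is simply `username:password` Base64-encoded. A quick way to generate it (the credentials below are placeholders, not the real ones):

```python
import base64

# The docker config "auth" field is base64("username:password").
# "agp:password" is a placeholder for the real registry credentials.
auth = base64.b64encode(b"agp:password").decode()
print(auth)  # → YWdwOnBhc3N3b3Jk
```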

 

 

3. Distributed training code

a. Create the base image

- Add the Python packages used by the distributed training application to the base image; skip this step if there are no packages to add

yoosungjeon@ysjeon-Dev create-base-docker-image % vi Dockerfile
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime

COPY requirements.txt .
RUN pip install -r requirements.txt
yoosungjeon@ysjeon-Dev create-base-docker-image % vi requirements.txt
...
yoosungjeon@ysjeon-Dev create-base-docker-image % docker build -t repo.chelsea.kt.co.kr/agp/pytorch-custom:1.9.0-cuda11.1-cudnn8-runtime -f Dockerfile .
…
yoosungjeon@ysjeon-Dev create-base-docker-image % docker push repo.chelsea.kt.co.kr/agp/pytorch-custom:1.9.0-cuda11.1-cudnn8-runtime
yoosungjeon@ysjeon-Dev create-base-docker-image %

 

b. Fairing & Distributed training 코드

- Source

   4.jupyter-cpu-fairing-pytorch/pytorch-mnist.ipynb

 

- Code snippet

def train_and_test(...):
    # create default process group
    WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))
    backend = dist.Backend.NCCL
    if dist.is_available() and WORLD_SIZE > 1:
        logging.debug("Using distributed PyTorch with {} backend".format(backend))
        dist.init_process_group(backend=backend)
    ...

    # construct DDP (Distributed Data Parallel) model
    if dist.is_available() and dist.is_initialized():
        if use_cuda:
            torch.cuda.set_device(torch.cuda.current_device())
        model = nn.parallel.DistributedDataParallel(model)
...
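
DDP replicates the model in every process and averages gradients, so each rank must also see a disjoint shard of the dataset; in PyTorch that is what `torch.utils.data.DistributedSampler` provides. A plain-Python sketch of the strided sharding it performs (the rank/world-size values are illustrative):

```python
# Sketch of the strided index sharding DistributedSampler performs:
# rank r of world_size w takes every w-th sample starting at offset r.
def shard_indices(num_samples: int, rank: int, world_size: int) -> list:
    return list(range(rank, num_samples, world_size))

# With 10 samples across 2 workers, the shards are disjoint and cover all data:
print(shard_indices(10, 0, 2))  # → [0, 2, 4, 6, 8]
print(shard_indices(10, 1, 2))  # → [1, 3, 5, 7, 9]
```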

def fairing_run():
    ...
    fairing.config.set_builder(name='append',
                               registry=DOCKER_REGISTRY,
                               base_image='pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime',
                               image_name=project_name)
    fairing.config.set_deployer(name="pytorchjob",
                                job_name=pytorchjob_name,
                                master_count=1,
                                worker_count=worker_count,
                                pod_spec_mutators=[
                                  volume_mounts(volume_type='pvc', volume_name=k8s_pvc_name, mount_path=mount_name),
                                  get_resource_mutator(gpu=gpus_per_worker, gpu_vendor='nvidia')],
                                stream_log=False)
    fairing.config.run()

if __name__ == '__main__':
    ...
    if os.getenv('FAIRING_RUNTIME', None) is None:
        fairing_run()
    else:
        main()
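
The dispatch in `__main__` hinges on the `FAIRING_RUNTIME` environment variable, which fairing sets inside the container it launches in the cluster: when it is absent we are in the notebook and should submit the job, otherwise we are a training pod and should train. A minimal sketch of the same pattern (the function name and return values are illustrative):

```python
# Sketch of the FAIRING_RUNTIME dispatch pattern used in __main__ above.
def dispatch(env: dict) -> str:
    # No FAIRING_RUNTIME -> we are in the notebook, so submit the PyTorchJob;
    # otherwise we are inside a training pod, so run the training loop.
    return "submit-job" if env.get("FAIRING_RUNTIME") is None else "train"

print(dispatch({}))                        # → submit-job
print(dispatch({"FAIRING_RUNTIME": "1"}))  # → train
```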

 

 

4. Run the distributed training

a. Remote (Kubeflow Jupyter)

 

 

5. Source

pytorch-mnist.ipynb
0.01MB
pytorch-mnist.py
0.01MB
