
Kubeflow 1.0 Features #4 (PyTorch Training)

by 여행을 떠나자! 2021. 9. 25.

2020.03.09

 

1. What is a PyTorchJob?

- A Kubernetes custom resource used to run PyTorch training jobs on Kubeflow

- https://v1-0-branch.kubeflow.org/docs/reference/pytorchjob/v1/pytorch/

 

 

2. Running a PyTorch training job

https://v1-0-branch.kubeflow.org/docs/components/training/pytorch/

 

a. Start Cloud Shell

 

b. Verify that PyTorch support is included in your Kubeflow deployment

$ kubectl get crd | head -n 1 && kubectl get crd | grep pytorchjobs.kubeflow.org
NAME                                                 CREATED AT
pytorchjobs.kubeflow.org                             2020-02-25T02:15:17Z
$

 

c. Build and push the Docker image

$ git clone https://github.com/kubeflow/pytorch-operator.git
$ cd pytorch-operator/examples/mnist
$ echo $PROJECT
my-kubeflow-269301
$ gcloud builds submit --tag gcr.io/${PROJECT}/pytorch_dist_mnist:1.0
…
$

- mnist.py (the training script used in this test): https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/mnist.py
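mnist.py follows the standard torch.distributed launch pattern: the PyTorch operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into each Master/Worker pod as environment variables, and the script's `dist.init_process_group()` call reads them to rendezvous. A stdlib-only sketch of that handshake (the helper name `read_dist_env` is mine, not part of the operator or torch; the torch call appears only in a comment):

```python
import os

def read_dist_env():
    """Read the rendezvous settings that the PyTorch operator injects
    into every Master/Worker pod as environment variables."""
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "23456")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

# In mnist.py these values feed torch.distributed, roughly:
#   dist.init_process_group(backend=args.backend)  # reads the env vars itself

if __name__ == "__main__":
    # Simulate the env the operator would set for the worker of a 2-pod job.
    os.environ.update({"MASTER_ADDR": "pytorch-dist-mnist-gloo-master-0",
                       "MASTER_PORT": "23456",
                       "WORLD_SIZE": "2", "RANK": "1"})
    print(read_dist_env())
```

The Master pod always gets RANK=0, which is why the example's checkpointing and logging happen on the master replica.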

 

d. Creating a PyTorch Job

$ cd v1/
$ ls
pytorch_job_mnist_gloo.yaml  pytorch_job_mnist_mpi.yaml  pytorch_job_mnist_nccl.yaml
$ vi pytorch_job_mnist_gloo.yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/my-kubeflow-269301/pytorch_dist_mnist:1.0
              args: ["--backend", "gloo"]
              # Uncomment the resources block below to train on a GPU.
              # resources:
              #   limits:
              #     nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/my-kubeflow-269301/pytorch_dist_mnist:1.0
              args: ["--backend", "gloo"]
              # Uncomment the resources block below to train on a GPU.
              # resources:
              #   limits:
              #     nvidia.com/gpu: 1
$
$ kubectl apply -f pytorch_job_mnist_gloo.yaml
pytorchjob.kubeflow.org/pytorch-dist-mnist-gloo created
$
$ kubectl get pods | head -n 1 && kubectl get pods --show-labels | grep pytorch-dist-mnist-gloo
NAME                                 READY   STATUS             RESTARTS   AGE
pytorch-dist-mnist-gloo-master-0     0/1     ImagePullBackOff   0          7m5s   controller-name=pytorch-operator,group-name=kubeflow.org,job-name=pytorch-dist-mnist-gloo,job-role=master,pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-index=0,pytorch-replica-type=master
pytorch-dist-mnist-gloo-worker-0     0/1     Init:0/1           0          7m5s   controller-name=pytorch-operator,group-name=kubeflow.org,job-name=pytorch-dist-mnist-gloo,pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-index=0,pytorch-replica-type=worker
$ kubectl logs -f pytorch-dist-mnist-gloo-master-0
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Train Epoch: 1 [0/60000 (0%)]   loss=2.3000
…
Train Epoch: 1 [59520/60000 (99%)]      loss=0.0638
accuracy=0.9665
$

- DISTRIBUTED COMMUNICATION PACKAGE - TORCH.DISTRIBUTED

   https://pytorch.org/docs/stable/distributed.html

   Rule of thumb

     • Use the Gloo backend for distributed CPU training.

     • Use the NCCL backend for distributed GPU training.
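That rule of thumb reduces to a one-line choice in code. A trivial illustrative helper (`pick_backend` is a hypothetical name, not a torch API; the manifest above passes the result via container args):

```python
def pick_backend(gpu_available: bool) -> str:
    """Rule of thumb from the torch.distributed docs:
    NCCL for distributed GPU training, Gloo for distributed CPU training."""
    return "nccl" if gpu_available else "gloo"

# The PyTorchJob manifest forwards the choice to mnist.py, e.g.
#   args: ["--backend", "gloo"]
assert pick_backend(False) == "gloo"
assert pick_backend(True) == "nccl"
```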

 

e. Monitoring a PyTorch Job

$ kubectl get pytorchjobs pytorch-dist-mnist-gloo -o yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
…
spec:
…
status:
  completionTime: "2020-03-05T08:59:30Z"
  conditions:
  - lastTransitionTime: "2020-03-05T08:54:54Z"
    lastUpdateTime: "2020-03-05T08:54:54Z"
    message: PyTorchJob pytorch-dist-mnist-gloo is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2020-03-05T08:55:39Z"
    lastUpdateTime: "2020-03-05T08:55:39Z"
    message: PyTorchJob pytorch-dist-mnist-gloo is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2020-03-05T08:59:30Z"
    lastUpdateTime: "2020-03-05T08:59:30Z"
    message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master:
      succeeded: 1
    Worker:
      succeeded: 1
  startTime: "2020-03-05T08:54:54Z"
$
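The conditions list in that status can be reduced to a single phase by taking the most recent condition whose status is "True" (Created, then Running, then Succeeded or Failed). A small sketch of reading it (`job_phase` is an illustrative helper, not part of kubectl or the operator):

```python
def job_phase(status: dict) -> str:
    """Return the type of the latest condition whose status is "True".
    Conditions are appended in order, so the last true one wins."""
    phase = "Unknown"
    for cond in status.get("conditions", []):
        if cond.get("status") == "True":
            phase = cond["type"]
    return phase

# Status shaped like the kubectl output above (trimmed to the key fields):
status = {
    "conditions": [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ],
    "replicaStatuses": {"Master": {"succeeded": 1}, "Worker": {"succeeded": 1}},
}
print(job_phase(status))  # -> Succeeded
```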

 

 

3. Testing scale-up

- The Master always has exactly one replica; Worker replicas can be scaled as required.

- A Master can be declared without any Workers; in that case the job runs in a single container.

- Scale-up test

$ vi pytorch_job_mnist_gloo.yaml
…
    Worker:
      replicas: 5
…
$ kubectl delete -f pytorch_job_mnist_gloo.yaml && kubectl apply -f pytorch_job_mnist_gloo.yaml
$ kubectl get pods | head -n 1 && kubectl get pods --show-labels | grep pytorch-dist-mnist-gloo | cut -c-120
NAME                                                           READY   STATUS      RESTARTS   AGE
pytorch-dist-mnist-gloo-master-0                               1/1     Running     0          4m39s
pytorch-dist-mnist-gloo-worker-0                               1/1     Running     0          4m39s
pytorch-dist-mnist-gloo-worker-1                               1/1     Running     0          4m39s
pytorch-dist-mnist-gloo-worker-2                               1/1     Running     0          4m38s
pytorch-dist-mnist-gloo-worker-3                               1/1     Running     0          4m37s
pytorch-dist-mnist-gloo-worker-4                               1/1     Running     0          4m36s
$ kubectl delete pytorchjobs pytorch-dist-mnist-gloo

 

- Test results

   With 5 workers the run took 10:19 (3/6 09:28:33~09:38:52); with 1 worker it took 3:48 (3/5 17:55:40~17:59:28).

   ✓ pytorch-dist-mnist-gloo-master-0 log

   ✓ pytorch-dist-mnist-gloo-worker-3 log
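One plausible reason the 5-worker run was slower, not faster: mnist.py shards the data with torch's DistributedSampler, so each rank processes only N/world_size samples, but every step adds gloo all-reduce traffic over CPU networking, and on a model this small the communication cost can dominate. A sketch of the sampler's split (`shard_indices` emulates DistributedSampler's default behavior; it is not a torch API):

```python
import math

def shard_indices(dataset_len: int, world_size: int, rank: int):
    """Emulate torch.utils.data.DistributedSampler's default split:
    each rank gets ceil(N / world_size) indices, round-robin by rank."""
    per_rank = math.ceil(dataset_len / world_size)
    indices = list(range(dataset_len))
    # The sampler pads the index list so every rank gets the same count.
    indices += indices[: per_rank * world_size - dataset_len]
    return indices[rank::world_size]

# MNIST train set: 60,000 samples.
print(len(shard_indices(60000, 2, 0)))  # 30000 per rank with master + 1 worker
print(len(shard_indices(60000, 6, 3)))  # 10000 per rank with master + 5 workers
```

Note that world_size counts the Master plus all Workers, so "replicas: 5" under Worker gives a world size of 6.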
