2020.03.09
1. What is a PyTorchJob?
- A Kubernetes custom resource used to run PyTorch training jobs on Kubeflow
- https://v1-0-branch.kubeflow.org/docs/reference/pytorchjob/v1/pytorch/
2. Running PyTorch training
- https://v1-0-branch.kubeflow.org/docs/components/training/pytorch/
a. Start Cloud Shell
b. Verify that PyTorch support is included in your Kubeflow deployment
$ kubectl get crd | head -n 1 && kubectl get crd | grep pytorchjobs.kubeflow.org
NAME CREATED AT
pytorchjobs.kubeflow.org 2020-02-25T02:15:17Z
$
c. Build and push the Docker image
$ git clone https://github.com/kubeflow/pytorch-operator.git
$ cd pytorch-operator/examples/mnist
$ echo $PROJECT
my-kubeflow-269301
$ gcloud builds submit --tag gcr.io/${PROJECT}/pytorch_dist_mnist:1.0
…
$
- mnist.py (used in this test): https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/mnist.py
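- For reference, a minimal sketch of the distributed setup a script like mnist.py performs (simplified, not the actual file; the pytorch-operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into every replica container):

import argparse

import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("--backend", default="gloo", choices=["gloo", "nccl", "mpi"])
args = parser.parse_args()

# init_process_group reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
# from the environment, which the pytorch-operator sets on each replica.
dist.init_process_group(backend=args.backend)

model = nn.Linear(784, 10)                          # stand-in for the real CNN
model = nn.parallel.DistributedDataParallel(model)  # averages gradients across replicas
print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")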
d. Creating a PyTorch Job
$ cd v1/
$ ls
pytorch_job_mnist_gloo.yaml pytorch_job_mnist_mpi.yaml pytorch_job_mnist_nccl.yaml
$ vi pytorch_job_mnist_gloo.yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/my-kubeflow-269301/pytorch_dist_mnist:1.0
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              # resources:
              #   limits:
              #     nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/my-kubeflow-269301/pytorch_dist_mnist:1.0
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              # resources:
              #   limits:
              #     nvidia.com/gpu: 1
$
$ kubectl apply -f pytorch_job_mnist_gloo.yaml
pytorchjob.kubeflow.org/pytorch-dist-mnist-gloo created
$
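- The same create step can also be done from Python; a minimal sketch using the kubernetes client (the kubeflow namespace is an assumption):

import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Load the manifest edited above and create it as a custom object.
with open("pytorch_job_mnist_gloo.yaml") as f:
    manifest = yaml.safe_load(f)

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="pytorchjobs", body=manifest)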
$ kubectl get pods | head -n 1 && kubectl get pods --show-labels | grep pytorch-dist-mnist-gloo
NAME READY STATUS RESTARTS AGE
pytorch-dist-mnist-gloo-master-0 0/1 ImagePullBackOff 0 7m5s controller-name=pytorch-operator,group-name=kubeflow.org,job-name=pytorch-dist-mnist-gloo,job-role=master,pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-index=0,pytorch-replica-type=master
pytorch-dist-mnist-gloo-worker-0 0/1 Init:0/1 0 7m5s controller-name=pytorch-operator,group-name=kubeflow.org,job-name=pytorch-dist-mnist-gloo,pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-index=0,pytorch-replica-type=worker
$ kubectl logs -f pytorch-dist-mnist-gloo-master-0
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
…
Train Epoch: 1 [59520/60000 (99%)] loss=0.0638
accuracy=0.9665
$
- Distributed communication package (torch.distributed)
  https://pytorch.org/docs/stable/distributed.html
  Rule of thumb (see the sketch below):
  • Use the Gloo backend for distributed CPU training.
  • Use the NCCL backend for distributed GPU training.
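In code, the rule of thumb is a one-liner; a sketch:

import torch
import torch.distributed as dist

# Gloo for CPU-only replicas, NCCL when a GPU is available.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)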
e. Monitoring a PyTorch Job
$ kubectl get pytorchjobs pytorch-dist-mnist-gloo -o yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  …
spec:
  …
status:
  completionTime: "2020-03-05T08:59:30Z"
  conditions:
  - lastTransitionTime: "2020-03-05T08:54:54Z"
    lastUpdateTime: "2020-03-05T08:54:54Z"
    message: PyTorchJob pytorch-dist-mnist-gloo is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2020-03-05T08:55:39Z"
    lastUpdateTime: "2020-03-05T08:55:39Z"
    message: PyTorchJob pytorch-dist-mnist-gloo is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2020-03-05T08:59:30Z"
    lastUpdateTime: "2020-03-05T08:59:30Z"
    message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master:
      succeeded: 1
    Worker:
      succeeded: 1
  startTime: "2020-03-05T08:54:54Z"
$
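- The status block above can also be polled programmatically; a minimal sketch with the kubernetes Python client (namespace kubeflow is an assumption):

from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

# PyTorchJob is a custom resource, so fetch it as a raw custom object.
job = api.get_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="pytorchjobs", name="pytorch-dist-mnist-gloo")

# Walk the condition history shown above (Created / Running / Succeeded).
for cond in job.get("status", {}).get("conditions", []):
    print(cond["type"], cond["status"], cond["reason"])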
3. Testing scale-up
- The Master always runs exactly 1 replica; only the number of Workers can be scaled as needed.
- A job may declare only a Master with no Worker section; in that case a single container runs.
- Scale-up test
$ vi pytorch_job_mnist_gloo.yaml
…
  Worker:
    replicas: 5
…
$ kubectl delete -f pytorch_job_mnist_gloo.yaml && kubectl apply -f pytorch_job_mnist_gloo.yaml
$ kubectl get pods | head -n 1 && kubectl get pods --show-labels | grep pytorch-dist-mnist-gloo | cut -c-120
NAME READY STATUS RESTARTS AGE
pytorch-dist-mnist-gloo-master-0 1/1 Running 0 4m39s
pytorch-dist-mnist-gloo-worker-0 1/1 Running 0 4m39s
pytorch-dist-mnist-gloo-worker-1 1/1 Running 0 4m39s
pytorch-dist-mnist-gloo-worker-2 1/1 Running 0 4m38s
pytorch-dist-mnist-gloo-worker-3 1/1 Running 0 4m37s
pytorch-dist-mnist-gloo-worker-4 1/1 Running 0 4m36s
$ kubectl delete pytorchjobs pytorch-dist-mnist-gloo
- Test results
With 5 workers the job took 10:19 (3/6 09:28:33~09:38:52); with 1 worker it took 3:48 (3/5 17:55:40~17:59:28). Scaling out was actually slower here, presumably because the gloo CPU backend's gradient-synchronization overhead across 5 workers outweighs any parallelism gain on a model as small as MNIST (see the sketch below).
✓ pytorch-dist-mnist-gloo-master-0 log
✓ pytorch-dist-mnist-gloo-worker-3 log
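- A plausible explanation for the slowdown: if every replica iterates the full 60,000 samples (as the epoch counters in the master log suggest), extra workers only add synchronization traffic without reducing per-worker work. Sharding the data with DistributedSampler is the standard fix; this sketch is illustrative, not taken from mnist.py:

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

dist.init_process_group(backend="gloo")  # env vars injected by the operator

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
# Each replica now sees roughly len(train_set) / WORLD_SIZE samples per epoch.
sampler = DistributedSampler(train_set, num_replicas=dist.get_world_size(),
                             rank=dist.get_rank())
loader = DataLoader(train_set, batch_size=64, sampler=sampler)

for epoch in range(1, 4):
    sampler.set_epoch(epoch)  # reshuffles the shard assignment each epoch
    for data, target in loader:
        pass  # forward/backward with the DDP-wrapped model goes here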