
Running the MNIST example using distributed training

by 여행을 떠나자! 2021. 9. 24.

2021.5.28

 

1. Running the MNIST on-prem Jupyter notebook

- The MNIST on-prem notebook builds a Docker image, launches a TFJob to train a model, and creates an InferenceService (KFServing) to deploy the trained model.

- https://v1-2-branch.kubeflow.org/docs/started/workstation/minikube-linux/#running-the-mnist-on-prem-jupyter-notebook

 

a. Prerequisites

- Step 1: Set up a Python environment on macOS

yoosungjeon@ysjeon-Dev ~ % brew install anaconda
yoosungjeon@ysjeon-Dev ~ % conda init zsh
...
modified      /Users/yoosungjeon/.zshrc
==> For changes to take effect, close and re-open your current shell. <==
yoosungjeon@ysjeon-Dev ~ % . ~/.zshrc
(base) yoosungjeon@ysjeon-Dev ~ % conda -V
conda 4.10.1
(base) yoosungjeon@ysjeon-Dev ~ % conda create --name mlpipeline python=3.7
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##
environment location: /usr/local/anaconda3/envs/mlpipeline
...
(base) yoosungjeon@ysjeon-Dev ~ % conda activate mlpipeline
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % conda info

     active environment : mlpipeline
    active env location : /usr/local/anaconda3/envs/mlpipeline
       user config file : /Users/yoosungjeon/.condarc
 populated config files : /Users/yoosungjeon/.condarc
...
(mlpipeline) yoosungjeon@ysjeon-Dev ~ %

 

- Step 2: Install Jupyter Notebooks

(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip install --upgrade pip
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip install jupyter
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip install kubeflow-fairing
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip list | egrep "Package|jupyter|kubeflow|kfserving|kubernetes"
Package                            Version
jupyter                            1.0.0
jupyter-client                     6.1.7
jupyter-console                    6.2.0
jupyter-core                       4.6.3
jupyterlab                         2.2.6
jupyterlab-pygments                0.1.2
jupyterlab-server                  1.2.0
kfserving                          0.4.1
kubeflow-fairing                   1.0.2
kubeflow-pytorchjob                0.1.3
kubeflow-tfjob                     0.1.3
kubernetes                         10.0.1
(mlpipeline) yoosungjeon@ysjeon-Dev ~ %

 

- Step 3: Verify the namespace that will run the MNIST on-prem notebook (it must carry the serving.kubeflow.org/inferenceservice=enabled label)

$ k get ns yoosung-jeon --show-labels | egrep serving.kubeflow.org/inferenceservice=enabled
yoosung-jeon   Active   2d5h   ...,istio-injection=disabled,...,serving.kubeflow.org/inferenceservice=enabled
$

 

- Step 4: Download the MNIST on-prem notebook

(mlpipeline) yoosungjeon@ysjeon-Dev ~ % mkdir ~/Private/k8s-oss/kf-exam-mnist && cd ~/Private/k8s-oss/kf-exam-mnist
(mlpipeline) yoosungjeon@ysjeon-Dev kf-exam-mnist % git clone https://github.com/kubeflow/fairing.git
…
(mlpipeline) yoosungjeon@ysjeon-Dev kf-exam-mnist %

 

- Step 5: Configure Kubernetes

    ✓ Configure the Kubernetes client environment so that TFJob resources can be created

    ✓ Context: to limit permissions, create and use a yoosung-jeon context instead of kubernetes-admin

    ✓ Related API: fairing.config.set_deployer(name='tfjob', ...)

(mlpipeline) yoosungjeon@ysjeon-Dev ~ % vi ~/.kube/config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0F...
    server: https://14.52.244.136:7443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: default
    namespace: yoosung-jeon
  name: yoosung-jeon-context
current-context: yoosung-jeon-context
kind: Config
preferences: {}
users:
- name: default
  user:
    token: eyJhbGciOiJSUzI1NiIsImtpZCI6Inl5dzc5RHpNZHJ5T3hrWHhsV1VoZm5...
(base) yoosungjeon@ysjeon-Dev ~ % k config get-contexts
CURRENT   NAME                   CLUSTER      AUTHINFO     NAMESPACE
*         yoosung-jeon-context   kubernetes   default      yoosung-jeon
(base) yoosungjeon@ysjeon-Dev ~ %

 

- Step 6: Create PVCs and a Docker registry credential in Kubernetes

   ✓ Create the credential for the private Docker registry that Kubernetes will use when pulling Pod images

$ k create secret docker-registry agp-reg-cred -n yoosung-jeon \
    --docker-server=repo.acp.kt.co.kr --docker-username=agp --docker-password=*****
$ k patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"agp-reg-cred\"}]}"
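
The `k patch serviceaccount` call above applies a merge patch that attaches the registry secret to the default ServiceAccount, so every Pod in the namespace can pull from the private registry. A minimal sketch of that merge in plain Python (simplified: Kubernetes strategic merge patches have per-field semantics; here the key is simply overwritten):

```python
import json

def patch_service_account(sa: dict, patch_json: str) -> dict:
    """Apply a (simplified) merge patch to a ServiceAccount manifest."""
    patched = dict(sa)
    patched.update(json.loads(patch_json))  # overwrite top-level keys from the patch
    return patched

sa = {"apiVersion": "v1", "kind": "ServiceAccount",
      "metadata": {"name": "default", "namespace": "yoosung-jeon"}}
patch = '{"imagePullSecrets": [{"name": "agp-reg-cred"}]}'
patched = patch_service_account(sa, patch)
```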

   ✓ Create the PersistentVolumeClaims used by the MNIST application and stage the training/test data

$ cat mnist-pvc-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ cat mnist-data-pvc-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-data-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f mnist-data-pvc-nfs.yaml
$ k apply -f mnist-pvc-nfs.yaml
$ k get pvc -n yoosung-jeon | egrep 'NAME|mnist'
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mnist-data-pvc  Bound    pvc-b6a55474-36f9-42d4-8293-23bd4c6390af   1Gi        RWX            nfs-sc-iap     19s
mnist-pvc       Bound    pvc-9acc419f-dcb8-4c10-bc3c-f62fd4d57860   1Gi        RWX            nfs-sc-iap     25s
$ cd /nfs_01/yoosung-jeon-mnist-data-pvc-pvc-b6a55474-36f9-42d4-8293-23bd4c6390af/
$ wget https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz &&
  wget https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz &&
  wget https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz &&
  wget https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz
$
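
The four files fetched above are gzipped IDX files (big-endian 32-bit magic number, then one 32-bit integer per dimension). A quick sanity check of a download can parse that header; the file names are the ones downloaded above:

```python
import gzip
import struct

def read_idx_header(data: bytes):
    """Return (magic, dims) parsed from the first bytes of an IDX file."""
    magic = struct.unpack(">I", data[:4])[0]
    ndims = magic & 0xFF  # the low byte of the magic number encodes the dimension count
    dims = struct.unpack(">" + "I" * ndims, data[4:4 + 4 * ndims])
    return magic, dims

# Example: train-images-idx3-ubyte has magic 2051 (0x00000803) and 3 dimensions.
# with gzip.open("train-images-idx3-ubyte.gz", "rb") as f:
#     magic, dims = read_idx_header(f.read(16))
#     assert magic == 2051 and dims == (60000, 28, 28)
```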

 

b. Launch Jupyter Notebook

(mlpipeline) yoosungjeon@ysjeon-Dev kf_exam-mnist % docker login repo.acp.kt.co.kr
Username: agp
Password:
Login Succeeded
(mlpipeline) yoosungjeon@ysjeon-Dev kf_exam-mnist % jupyter notebook --allow-root
...

 

2. Execute MNIST on-prem notebook

https://v1-2-branch.kubeflow.org/docs/started/workstation/minikube-linux/#execute-mnist-on-prem-notebook

 

a. Modify the Dockerfile

- File: /fairing/examples/mnist/Dockerfile

- Changes

   #FROM tensorflow/tensorflow:1.15.2-py3

   FROM tensorflow/tensorflow:1.15.2-gpu-py3

 

b. Modify the Kubeflow Fairing part (Jupyter Notebook)

- File: /fairing/examples/mnist/mnist_e2e_on_prem.ipynb

- Changes

DOCKER_REGISTRY = 'repo.acp.kt.co.kr/agp'
my_namespace = 'yoosung-jeon'
num_workers = 2           # number of Worker in TFJob
pvc_name = 'mnist-pvc'
pvc_data_name = 'mnist-data-pvc'
data_dir = '/data'
train_steps = "10000"     # Default: 1000

command=["python",
         "/opt/mnist.py",
         "--tf-model-dir=" + model_dir,
         "--tf-data-dir=" + data_dir,
         "--tf-export-dir=" + export_path,
         "--tf-batch-size=" + batch_size,
         "--tf-train-steps=" + train_steps,
         "--tf-learning-rate=" + learning_rate]
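
The `command` list above passes these flags to /opt/mnist.py inside the container; the script presumably parses them with argparse along these lines (flag names taken from the command list; the defaults shown are illustrative, not the script's actual values):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags passed in `command`."""
    parser = argparse.ArgumentParser(description="MNIST distributed training")
    parser.add_argument("--tf-model-dir", type=str)
    parser.add_argument("--tf-data-dir", type=str, default="/data")
    parser.add_argument("--tf-export-dir", type=str)
    parser.add_argument("--tf-batch-size", type=int, default=100)
    parser.add_argument("--tf-train-steps", type=int, default=1000)
    parser.add_argument("--tf-learning-rate", type=float, default=0.01)
    return parser

# argparse maps --tf-train-steps to args.tf_train_steps, etc.
args = build_parser().parse_args(["--tf-train-steps=10000", "--tf-data-dir=/data"])
```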

from kubeflow import fairing
from kubeflow.fairing.kubernetes.utils import mounting_pvc
from kubeflow.fairing.kubernetes.utils import get_resource_mutator

fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
fairing.config.set_builder(name='docker', registry=DOCKER_REGISTRY, base_image="",
                           image_name="mnist", dockerfile_path="Dockerfile")
fairing.config.set_deployer(name='tfjob', namespace=my_namespace, stream_log=False, job_name=tfjob_name,
                            chief_count=num_chief, worker_count=num_workers, ps_count=num_ps,
                            pod_spec_mutators=[mounting_pvc(pvc_name=pvc_name, pvc_mount_path=model_dir),
                                               mounting_pvc(pvc_name=pvc_data_name, pvc_mount_path=data_dir),
                                               get_resource_mutator(gpu=1, gpu_vendor='nvidia')])
fairing.config.run()
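
The `pod_spec_mutators` passed to `set_deployer` are callables that Fairing applies to the generated pod spec before submitting the TFJob. Conceptually, `mounting_pvc` adds a PVC-backed volume plus a matching volumeMount; a dict-based sketch of that idea (illustrative only — the real mutator operates on Kubernetes client objects and has a different signature):

```python
def mounting_pvc_sketch(pvc_name: str, pvc_mount_path: str):
    """Return a mutator that mounts the named PVC into every container."""
    def mutate(pod_spec: dict) -> dict:
        vol_name = pvc_name + "-volume"
        # Register the PVC as a pod-level volume...
        pod_spec.setdefault("volumes", []).append(
            {"name": vol_name,
             "persistentVolumeClaim": {"claimName": pvc_name}})
        # ...and mount it into each container at the requested path.
        for container in pod_spec.get("containers", []):
            container.setdefault("volumeMounts", []).append(
                {"name": vol_name, "mountPath": pvc_mount_path})
        return pod_spec
    return mutate

spec = {"containers": [{"name": "tensorflow"}]}
spec = mounting_pvc_sketch("mnist-pvc", "/mnt")(spec)
```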
...

 

c. Kubeflow Fairing results

- Docker image build result

- TFJob deployment result

$ k get tfjobs.kubeflow.org -n yoosung-jeon
NAME                  STATE     AGE
mnist-training-f856   Running   16s
$ k get pod -l job-name=mnist-training-f856 -n yoosung-jeon
NAME                           READY   STATUS    RESTARTS   AGE
mnist-training-f856-chief-0    1/1     Running   0          80s
mnist-training-f856-ps-0       1/1     Running   0          81s
mnist-training-f856-worker-0   1/1     Running   0          81s
mnist-training-f856-worker-1   1/1     Running   0          81s
$

- Resource check

$ k get pod mnist-training-f856-worker-0 -n yoosung-jeon -o yaml | grep 'resources:' -A4
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
$ check-gpu.sh
Node   Available(GPUs)  Used(GPUs)
iap10  2                1
iap11  2                2

Node   Namespace     POD                           Used(GPUs)
iap10  yoosung-jeon  mnist-training-f856-chief-0   1
iap11  yoosung-jeon  mnist-training-f856-worker-0  1
iap11  yoosung-jeon  mnist-training-f856-worker-1  1
$ k get pod -n gpu-operator-resources -o wide | egrep 'NAME|nvidia-driver-daemonset'
NAME                           READY  STATUS   RESTARTS  AGE  IP             NODE   NOMINATED NODE  READINESS GATES
nvidia-driver-daemonset-kl4gp  1/1    Running  3         18d  14.52.244.214  iap11  <none>          <none>
nvidia-driver-daemonset-txb8j  1/1    Running  2         14d  14.52.244.213  iap10  <none>          <none>
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-txb8j -- nvidia-smi
Fri May 28 04:47:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   24C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   27C    P0    41W / 250W |  31437MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    182410      C   python                          31431MiB |
+-----------------------------------------------------------------------------+
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-kl4gp -- nvidia-smi
Fri May 28 04:56:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0    43W / 250W |  31437MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   29C    P0    36W / 250W |  31437MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    286795      C   python                          31431MiB |
|    1   N/A  N/A    286929      C   python                          31431MiB |
+-----------------------------------------------------------------------------+
$ k top pod -l job-name=mnist-training-f856 -n yoosung-jeon
NAME                           CPU(cores)   MEMORY(bytes)
mnist-training-f856-chief-0    71m          4518Mi
mnist-training-f856-ps-0       347m         3690Mi
mnist-training-f856-worker-0   161m         4483Mi
mnist-training-f856-worker-1   226m         4495Mi
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-txb8j -- ps -ef | egrep 'NAME|mnist.py'
root     182410 182337 17 04:40 ?  00:01:43 python /opt/mnist.py --tf-mode
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-kl4gp -- ps -ef | egrep 'NAME|mnist.py'
root     286795 286767 14 04:40 ?  00:01:29 python /opt/mnist.py --tf-mode
root     286929 286901 16 04:40 ?  00:01:39 python /opt/mnist.py --tf-mode
$
[root@iap11 ~]# top
top - 13:52:40 up 10 days, 18:49,  1 user,  load average: 33.58, 31.76, 27.04
Tasks: 1187 total,   4 running, 1181 sleeping,   0 stopped,   2 zombie
%Cpu(s):  7.8 us,  1.5 sy,  0.0 ni, 81.2 id,  8.8 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 65383860 total,  2330036 free, 38326408 used, 24727416 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 21140372 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  2825 root      20   0   10.9g 348588  32372 S 485.5  0.5  45104:46 kubelet
286929 root      20   0   63.7g   4.8g   1.4g S  22.1  7.7   2:01.68 mnist.py
286795 root      20   0   63.8g   4.8g   1.4g S  20.5  7.6   1:50.81 mnist.py
...

- Generated model

$ tree /nfs_01/yoosung-jeon-mnist-pvc-pvc-6d571c28-c31a-4ac9-bcda-65035b38f067/
/nfs_01/yoosung-jeon-mnist-pvc-pvc-6d571c28-c31a-4ac9-bcda-65035b38f067/
├── checkpoint
├── events.out.tfevents.1621931437.mnist-training-6a23-chief-0
├── events.out.tfevents.1622176818.mnist-training-f856-chief-0
├── export
│   └── 1622204179
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
├── graph.pbtxt
├── model.ckpt-4009.data-00000-of-00001
├── model.ckpt-4009.index
└── model.ckpt-4009.meta

3 directories, 22 files
$
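
TensorFlow Serving treats each numeric subdirectory of export/ as a model version (1622204179 above is a Unix timestamp) and, by default, serves the highest one. A small sketch of that version-selection rule:

```python
def latest_model_version(subdirs):
    """Return the highest numeric version directory, ignoring non-numeric names."""
    versions = [d for d in subdirs if d.isdigit()]
    return max(versions, key=int) if versions else None

print(latest_model_version(["1621931437", "1622204179"]))  # 1622204179
```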

 

d. KFServing part (Jupyter Notebook)

- File: /fairing/examples/mnist/mnist_e2e_on_prem.ipynb

- Contents

import uuid
from kubeflow.fairing.deployers.kfserving.kfserving import KFServing

isvc_name = f'mnist-service-{uuid.uuid4().hex[:4]}'
isvc = KFServing('tensorflow', namespace=my_namespace, isvc_name=isvc_name,
                 default_storage_uri='pvc://' + pvc_name + '/export')
isvc.deploy(isvc.generate_isvc())

 

e. KFServing deployment result

- MNIST service endpoint: http://mnist-service-b0ab.yoosung-jeon.example.com

   Predict command: curl -v -H 'Host: mnist-service-b0ab.yoosung-jeon.example.com' http://14.52.244.137/v1/models/mnist-service-b0ab:predict -d @./input.json
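
The predict call above can also be built with the Python stdlib. The Host header is what routes the request through the Istio ingress gateway to the right InferenceService; the IP and hostnames below are the ones from this deployment, and the all-zeros instance is placeholder input:

```python
import json
import urllib.request

def build_predict_request(ingress_ip, isvc_name, namespace, instances):
    """Build a TF Serving REST predict request routed via the ingress Host header."""
    url = f"http://{ingress_ip}/v1/models/{isvc_name}:predict"
    body = json.dumps({"instances": instances}).encode()
    host = f"{isvc_name}.{namespace}.example.com"
    return urllib.request.Request(url, data=body,
                                  headers={"Host": host,
                                           "Content-Type": "application/json"})

req = build_predict_request("14.52.244.137", "mnist-service-b0ab",
                            "yoosung-jeon", [[0.0] * 784])
# urllib.request.urlopen(req)  # actually send it (requires cluster access)
```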

$ k get inferenceservices -n yoosung-jeon
NAME                URL                                                 READY DEFAULT TRAFFIC  CANARY TRAFFIC  AGE
mnist-service-b0ab  http://mnist-service-b0ab.yoosung-jeon.example.com  True  100                              10m
$ k describe inferenceservices mnist-service-b0ab -n yoosung-jeon | grep -i storage
        Storage Uri:      pvc://mnist-pvc/export
$ k get pod -n yoosung-jeon | egrep 'NAME|mnist-service-b0ab'
NAME                                                            READY STATUS  RESTARTS AGE
mnist-service-b0ab-predictor-default-nw7zf-deployment-565cq8q8d 3/3   Running 9        106m
$ k logs mnist-service-b0ab-predictor-default-nw7zf-deployment-565cq8q8d -n yoosung-jeon -c kfserving-container -f
2021-05-29 05:35:13.177114: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config:  model_name: mnist-service-b0ab model_base_path: /mnt/models
...
2021-05-29 05:35:13.279969: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /mnt/models/1622204179
...
2021-05-29 05:35:14.751536: I tensorflow_serving/model_servers/server.cc:344] Exporting HTTP/REST API at:localhost:8080 ...

$ k get revisions.serving.knative.dev -n yoosung-jeon
NAME                                       CONFIG NAME                          K8S SERVICE NAME                           GENERATION READY REASON
mnist-service-b0ab-predictor-default-nw7zf mnist-service-b0ab-predictor-default mnist-service-b0ab-predictor-default-nw7zf 1          True
mnist-service-b80d-predictor-default-bj5z5 mnist-service-b80d-predictor-default mnist-service-b80d-predictor-default-bj5z5 1          True
$ k get configurations.serving.knative.dev -n yoosung-jeon
NAME                                 LATESTCREATED                              LATESTREADY                                READY REASON
mnist-service-b0ab-predictor-default mnist-service-b0ab-predictor-default-nw7zf mnist-service-b0ab-predictor-default-nw7zf True
$ k get kservice -n yoosung-jeon
NAME                                 URL                                                                  LATESTCREATED                              LATESTREADY                                READY REASON
mnist-service-b0ab-predictor-default http://mnist-service-b0ab-predictor-default.yoosung-jeon.example.com mnist-service-b0ab-predictor-default-nw7zf mnist-service-b0ab-predictor-default-nw7zf True
mnist-service-b80d-predictor-default http://mnist-service-b80d-predictor-default.yoosung-jeon.example.com mnist-service-b80d-predictor-default-bj5z5 mnist-service-b80d-predictor-default-bj5z5 True

 

 

3. KFServing

- Overview

   ✓ KFServing provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks.

   ✓ It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX.

- Code snippet

isvc = KFServing('tensorflow', namespace=my_namespace, isvc_name=isvc_name,
                 default_storage_uri='pvc://' + pvc_name + '/export')
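
Under the hood, the KFServing helper above generates an InferenceService custom resource. A sketch of the equivalent manifest as a plain dict, assuming the v1alpha2 schema used by KFServing 0.4 (values taken from this deployment):

```python
def make_inference_service(name, namespace, storage_uri):
    """Build a v1alpha2 InferenceService manifest with a TensorFlow predictor."""
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "default": {
                "predictor": {
                    # pvc:// URIs tell the storage initializer to mount the PVC
                    "tensorflow": {"storageUri": storage_uri}
                }
            }
        },
    }

isvc = make_inference_service("mnist-service-b0ab", "yoosung-jeon",
                              "pvc://mnist-pvc/export")
```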

- References

   Kubeflow KFServing overview: https://www.kangwoo.kr/2020/04/11/kubeflow-kfserving-%EA%B0%9C%EC%9A%94/

   Kubeflow KFServing: https://v1-1-branch.kubeflow.org/docs/components/serving/kfserving/
