2021.5.28
1. Running the MNIST on-prem Jupyter notebook
- The MNIST on-prem notebook builds a Docker image, launches a TFJob to train a model, and creates an InferenceService (KFServing) to deploy the trained model.
a. Prerequisites
- Step 1: Set up a Python environment on macOS
yoosungjeon@ysjeon-Dev ~ % brew install anaconda
yoosungjeon@ysjeon-Dev ~ % conda init zsh
...
modified /Users/yoosungjeon/.zshrc
==> For changes to take effect, close and re-open your current shell. <==
yoosungjeon@ysjeon-Dev ~ % . ~/.zshrc
(base) yoosungjeon@ysjeon-Dev ~ % conda -V
conda 4.10.1
(base) yoosungjeon@ysjeon-Dev ~ % conda create --name mlpipeline python=3.7
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /usr/local/anaconda3/envs/mlpipeline
...
(base) yoosungjeon@ysjeon-Dev ~ % conda activate mlpipeline
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % conda info
active environment : mlpipeline
active env location : /usr/local/anaconda3/envs/mlpipeline
user config file : /Users/yoosungjeon/.condarc
populated config files : /Users/yoosungjeon/.condarc
...
(mlpipeline) yoosungjeon@ysjeon-Dev ~ %
- Step 2: Install Jupyter Notebook and Kubeflow Fairing
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip install --upgrade pip
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip install jupyter
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip install kubeflow-fairing
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % pip list | egrep "Package|jupyter|kubeflow|kfserving|kubernetes"
Package Version
jupyter 1.0.0
jupyter-client 6.1.7
jupyter-console 6.2.0
jupyter-core 4.6.3
jupyterlab 2.2.6
jupyterlab-pygments 0.1.2
jupyterlab-server 1.2.0
kfserving 0.4.1
kubeflow-fairing 1.0.2
kubeflow-pytorchjob 0.1.3
kubeflow-tfjob 0.1.3
kubernetes 10.0.1
(mlpipeline) yoosungjeon@ysjeon-Dev ~ %
- Step 3: Create a namespace to run the MNIST on-prem notebook
$ k get ns yoosung-jeon --show-labels | egrep serving.kubeflow.org/inferenceservice=enabled
yoosung-jeon Active 2d5h ...,istio-injection=disabled,...,serving.kubeflow.org/inferenceservice=enabled
$
- Step 4: Download the MNIST on-prem notebook
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % mkdir ~/Private/k8s-oss/kf-exam-mnist && cd ~/Private/k8s-oss/kf-exam-mnist
(mlpipeline) yoosungjeon@ysjeon-Dev kf-exam-mnist % git clone https://github.com/kubeflow/fairing.git
…
(mlpipeline) yoosungjeon@ysjeon-Dev kf-exam-mnist %
- Step 5: Configure Kubernetes
✓ Configure the Kubernetes client so that TFJob resources can be created
✓ Context: create a yoosung-jeon context instead of kubernetes-admin to limit permissions
✓ Related API: fairing.config.set_deployer(name='tfjob',...)
(mlpipeline) yoosungjeon@ysjeon-Dev ~ % vi ~/.kube/config
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0F...
server: https://14.52.244.136:7443
name: kubernetes
contexts:
- context:
cluster: kubernetes
user: default
namespace: yoosung-jeon
name: yoosung-jeon-context
current-context: yoosung-jeon-context
kind: Config
preferences: {}
users:
- name: default
user:
token: eyJhbGciOiJSUzI1NiIsImtpZCI6Inl5dzc5RHpNZHJ5T3hrWHhsV1VoZm5...
(base) yoosungjeon@ysjeon-Dev ~ % k config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
* yoosung-jeon-context kubernetes default yoosung-jeon
(base) yoosungjeon@ysjeon-Dev ~ %
- Step 6: Create PVCs & a Docker credential in Kubernetes
✓ Create the credential for the private Docker registry that Kubernetes will use when creating Pods
$ k create secret docker-registry agp-reg-cred -n yoosung-jeon \
--docker-server=repo.acp.kt.co.kr --docker-username=agp --docker-password=*****
$ k patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"agp-reg-cred\"}]}"
✓ Create the Persistent Volume Claims used by the MNIST application and download the training/validation data
$ cat mnist-pvc-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mnist-pvc
namespace: yoosung-jeon
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Gi
storageClassName: nfs-sc-iap
$ cat mnist-data-pvc-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mnist-data-pvc
namespace: yoosung-jeon
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Gi
storageClassName: nfs-sc-iap
$ k apply -f mnist-data-pvc-nfs.yaml
$ k apply -f mnist-pvc-nfs.yaml
$ k get pvc -n yoosung-jeon | egrep 'NAME|mnist'
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
mnist-data-pvc Bound pvc-b6a55474-36f9-42d4-8293-23bd4c6390af 1Gi RWX nfs-sc-iap 19s
mnist-pvc Bound pvc-9acc419f-dcb8-4c10-bc3c-f62fd4d57860 1Gi RWX nfs-sc-iap 25s
$ cd /nfs_01/yoosung-jeon-mnist-data-pvc-pvc-b6a55474-36f9-42d4-8293-23bd4c6390af/
$ wget https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz &&
wget https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz &&
wget https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz &&
wget https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz
$
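The downloaded files use the IDX format that mnist.py expects: a 4-byte big-endian magic number (whose low byte is the number of dimensions) followed by one big-endian uint32 per dimension. A small stdlib-only sketch to verify the headers before starting a training run:

```python
import gzip
import struct

def idx_header(path):
    """Read the magic number and dimension sizes from a gzipped IDX file."""
    with gzip.open(path, "rb") as f:
        magic = struct.unpack(">I", f.read(4))[0]
        ndim = magic & 0xFF                     # low byte = number of dims
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims

# e.g. idx_header("train-images-idx3-ubyte.gz") should give magic 0x00000803
# and dims (60000, 28, 28) for the MNIST training images.
```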
b. Launch Jupyter Notebook
(mlpipeline) yoosungjeon@ysjeon-Dev kf-exam-mnist % docker login repo.acp.kt.co.kr
Username: agp
Password:
Login Succeeded
(mlpipeline) yoosungjeon@ysjeon-Dev kf-exam-mnist % jupyter notebook --allow-root
...
2. Execute MNIST on-prem notebook
a. Modify the Dockerfile
- File: /fairing/examples/mnist/Dockerfile
- Changes
#FROM tensorflow/tensorflow:1.15.2-py3
FROM tensorflow/tensorflow:1.15.2-gpu-py3
b. Modify the Kubeflow Fairing part (Jupyter notebook)
- File: /fairing/examples/mnist/mnist_e2e_on_prem.ipynb
- Changes
DOCKER_REGISTRY = 'repo.acp.kt.co.kr/agp'
my_namespace = 'yoosung-jeon'
num_workers = 2 # number of Worker in TFJob
pvc_name = 'mnist-pvc'
pvc_data_name = 'mnist-data-pvc'
data_dir = '/data'
train_steps = "10000" # Default: 1000
command=["python",
"/opt/mnist.py",
"--tf-model-dir=" + model_dir,
"--tf-data-dir=" + data_dir,
"--tf-export-dir=" + export_path,
"--tf-batch-size=" + batch_size,
"--tf-train-steps=" + train_steps,
"--tf-learning-rate=" + learning_rate]
from kubeflow import fairing
from kubeflow.fairing.kubernetes.utils import mounting_pvc
from kubeflow.fairing.kubernetes.utils import get_resource_mutator
fairing.config.set_preprocessor('python', command=command, path_prefix="/app", output_map=output_map)
fairing.config.set_builder(name='docker', registry=DOCKER_REGISTRY, base_image="",
image_name="mnist", dockerfile_path="Dockerfile")
fairing.config.set_deployer(name='tfjob', namespace=my_namespace, stream_log=False, job_name=tfjob_name,
chief_count=num_chief, worker_count=num_workers, ps_count=num_ps,
pod_spec_mutators=[mounting_pvc(pvc_name=pvc_name, pvc_mount_path=model_dir),
mounting_pvc(pvc_name=pvc_data_name, pvc_mount_path=data_dir),
get_resource_mutator(gpu=1, gpu_vendor='nvidia')]
)
fairing.config.run()
...
c. Kubeflow Fairing results
- Docker image build results
- TFJob deployment results
$ k get tfjobs.kubeflow.org -n yoosung-jeon
NAME STATE AGE
mnist-training-f856 Running 16s
$ k get pod -l job-name=mnist-training-f856 -n yoosung-jeon
NAME READY STATUS RESTARTS AGE
mnist-training-f856-chief-0 1/1 Running 0 80s
mnist-training-f856-ps-0 1/1 Running 0 81s
mnist-training-f856-worker-0 1/1 Running 0 81s
mnist-training-f856-worker-1 1/1 Running 0 81s
$
- Resource check
$ k get pod mnist-training-f856-worker-0 -n yoosung-jeon -o yaml | grep 'resources:' -A4
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
$ check-gpu.sh
Node Available(GPUs) Used(GPUs)
iap10 2 1
iap11 2 2
Node Namespace POD Used(GPUs)
iap10 yoosung-jeon mnist-training-f856-chief-0 1
iap11 yoosung-jeon mnist-training-f856-worker-0 1
iap11 yoosung-jeon mnist-training-f856-worker-1 1
$ k get pod -n gpu-operator-resources -o wide | egrep 'NAME|nvidia-driver-daemonset'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nvidia-driver-daemonset-kl4gp 1/1 Running 3 18d 14.52.244.214 iap11 <none> <none>
nvidia-driver-daemonset-txb8j 1/1 Running 2 14d 14.52.244.213 iap10 <none> <none>
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-txb8j -- nvidia-smi
Fri May 28 04:47:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 24C P0 23W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:D8:00.0 Off | 0 |
| N/A 27C P0 41W / 250W | 31437MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 182410 C python 31431MiB |
+-----------------------------------------------------------------------------+
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-kl4gp -- nvidia-smi
Fri May 28 04:56:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 29C P0 43W / 250W | 31437MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:D8:00.0 Off | 0 |
| N/A 29C P0 36W / 250W | 31437MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 286795 C python 31431MiB |
| 1 N/A N/A 286929 C python 31431MiB |
+-----------------------------------------------------------------------------+
$ k top pod -l job-name=mnist-training-f856 -n yoosung-jeon
NAME CPU(cores) MEMORY(bytes)
mnist-training-f856-chief-0 71m 4518Mi
mnist-training-f856-ps-0 347m 3690Mi
mnist-training-f856-worker-0 161m 4483Mi
mnist-training-f856-worker-1 226m 4495Mi
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-txb8j -- ps -ef | egrep 'NAME|mnist.py'
root 182410 182337 17 04:40 ? 00:01:43 python /opt/mnist.py --tf-mode
$ k exec -n gpu-operator-resources -it nvidia-driver-daemonset-kl4gp -- ps -ef | egrep 'NAME|mnist.py'
root 286795 286767 14 04:40 ? 00:01:29 python /opt/mnist.py --tf-mode
root 286929 286901 16 04:40 ? 00:01:39 python /opt/mnist.py --tf-mode
$
[root@iap11 ~]# top
top - 13:52:40 up 10 days, 18:49, 1 user, load average: 33.58, 31.76, 27.04
Tasks: 1187 total, 4 running, 1181 sleeping, 0 stopped, 2 zombie
%Cpu(s): 7.8 us, 1.5 sy, 0.0 ni, 81.2 id, 8.8 wa, 0.0 hi, 0.7 si, 0.0 st
KiB Mem : 65383860 total, 2330036 free, 38326408 used, 24727416 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 21140372 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2825 root 20 0 10.9g 348588 32372 S 485.5 0.5 45104:46 kubelet
286929 root 20 0 63.7g 4.8g 1.4g S 22.1 7.7 2:01.68 mnist.py
286795 root 20 0 63.8g 4.8g 1.4g S 20.5 7.6 1:50.81 mnist.py
...
- Generated model
$ tree /nfs_01/yoosung-jeon-mnist-pvc-pvc-6d571c28-c31a-4ac9-bcda-65035b38f067/
/nfs_01/yoosung-jeon-mnist-pvc-pvc-6d571c28-c31a-4ac9-bcda-65035b38f067/
├── checkpoint
├── events.out.tfevents.1621931437.mnist-training-6a23-chief-0
├── events.out.tfevents.1622176818.mnist-training-f856-chief-0
├── export
│ └── 1622204179
│ ├── saved_model.pb
│ └── variables
│ ├── variables.data-00000-of-00001
│ └── variables.index
├── graph.pbtxt
├── model.ckpt-4009.data-00000-of-00001
├── model.ckpt-4009.index
└── model.ckpt-4009.meta
3 directories, 22 files
$
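TF Serving treats each integer-named directory under export/ (here 1622204179, a Unix timestamp) as a model version and, by default, serves the highest-numbered one. A small sketch of that version-selection rule, handy for checking which SavedModel will actually be loaded:

```python
import os

def latest_model_version(export_dir):
    """Return the path of the highest-numbered version directory,
    mirroring TF Serving's default policy of serving the latest version."""
    versions = [d for d in os.listdir(export_dir)
                if d.isdigit() and os.path.isdir(os.path.join(export_dir, d))]
    if not versions:
        raise FileNotFoundError(f"no version directories under {export_dir}")
    return os.path.join(export_dir, max(versions, key=int))
```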
d. KFServing part (Jupyter notebook)
- File: /fairing/examples/mnist/mnist_e2e_on_prem.ipynb
- Contents
from kubeflow.fairing.deployers.kfserving.kfserving import KFServing
isvc_name = f'mnist-service-{uuid.uuid4().hex[:4]}'
isvc = KFServing('tensorflow', namespace=my_namespace, isvc_name=isvc_name,
default_storage_uri='pvc://' + pvc_name + '/export')
isvc.deploy(isvc.generate_isvc())
e. KFServing deployment results
- MNIST service endpoint: http://mnist-service-b0ab.yoosung-jeon.example.com
Predict command: curl -v -H 'Host: mnist-service-b0ab.yoosung-jeon.example.com' http://14.52.244.137/v1/models/mnist-service-b0ab:predict -d @./input.json
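The same predict call can be issued from Python; a stdlib-only sketch that builds the request the curl command sends (the Host header is what routes the call to the right InferenceService behind the shared Istio ingress IP; `predict_request` is a hypothetical helper, and the request is only constructed here, not sent):

```python
import json
import urllib.request

def predict_request(ingress_ip, isvc_host, model_name, instances):
    """Build the KFServing v1 :predict HTTP request; the Host header
    selects the InferenceService behind the shared ingress IP."""
    url = f"http://{ingress_ip}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body,
        headers={"Host": isvc_host, "Content-Type": "application/json"})

# To actually send it:
#   urllib.request.urlopen(predict_request(
#       "14.52.244.137", "mnist-service-b0ab.yoosung-jeon.example.com",
#       "mnist-service-b0ab", instances), timeout=10)
```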
$ k get inferenceservices -n yoosung-jeon
NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE
mnist-service-b0ab http://mnist-service-b0ab.yoosung-jeon.example.com True 100 10m
$ k describe inferenceservices mnist-service-b0ab -n yoosung-jeon | grep -i storage
Storage Uri: pvc://mnist-pvc/export
$ k get pod -n yoosung-jeon | egrep 'NAME|mnist-service-b0ab'
NAME READY STATUS RESTARTS AGE
mnist-service-b0ab-predictor-default-nw7zf-deployment-565cq8q8d 3/3 Running 9 106m
$ k logs mnist-service-b0ab-predictor-default-nw7zf-deployment-565cq8q8d -n yoosung-jeon -c kfserving-container -f
2021-05-29 05:35:13.177114: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: mnist-service-b0ab model_base_path: /mnt/models
...
2021-05-29 05:35:13.279969: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /mnt/models/1622204179
...
2021-05-29 05:35:14.751536: I tensorflow_serving/model_servers/server.cc:344] Exporting HTTP/REST API at:localhost:8080 ...
$ k get revisions.serving.knative.dev -n yoosung-jeon
NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON
mnist-service-b0ab-predictor-default-nw7zf mnist-service-b0ab-predictor-default mnist-service-b0ab-predictor-default-nw7zf 1 True
mnist-service-b80d-predictor-default-bj5z5 mnist-service-b80d-predictor-default mnist-service-b80d-predictor-default-bj5z5 1 True
$ k get configurations.serving.knative.dev -n yoosung-jeon
NAME LATESTCREATED LATESTREADY READY REASON
mnist-service-b0ab-predictor-default mnist-service-b0ab-predictor-default-nw7zf mnist-service-b0ab-predictor-default-nw7zf True
$ k get kservice -n yoosung-jeon
NAME URL LATESTCREATED LATESTREADY READY REASON
mnist-service-b0ab-predictor-default http://mnist-service-b0ab-predictor-default.yoosung-jeon.example.com mnist-service-b0ab-predictor-default-nw7zf mnist-service-b0ab-predictor-default-nw7zf True
mnist-service-b80d-predictor-default http://mnist-service-b80d-predictor-default.yoosung-jeon.example.com mnist-service-b80d-predictor-default-bj5z5 mnist-service-b80d-predictor-default-bj5z5 True
3. KFServing
- Overview
✓ KFServing provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks.
✓ It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX.
- Code snippet
isvc = KFServing('tensorflow', namespace=my_namespace, isvc_name=isvc_name,
default_storage_uri='pvc://' + pvc_name + '/export')
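Under the hood, isvc.generate_isvc() produces a KFServing v1alpha2 InferenceService manifest from these arguments. A hand-rolled sketch of the equivalent resource (field names per the v1alpha2 API used by kfserving 0.4; `build_isvc` is a hypothetical helper, not part of the fairing API):

```python
def build_isvc(name, namespace, storage_uri):
    """Minimal v1alpha2 InferenceService with a TensorFlow predictor,
    roughly what the fairing KFServing deployer generates."""
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "default": {
                "predictor": {
                    "tensorflow": {"storageUri": storage_uri},
                },
            },
        },
    }

# build_isvc("mnist-service-b0ab", "yoosung-jeon", "pvc://mnist-pvc/export")
# matches the Storage Uri shown by `k describe inferenceservices` above.
```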
- References
Kubeflow KFServing overview (Korean): https://www.kangwoo.kr/2020/04/11/kubeflow-kfserving-%EA%B0%9C%EC%9A%94/
Kubeflow KFServing: https://v1-1-branch.kubeflow.org/docs/components/serving/kfserving/