2020.02.26
1. Overview
- To understand the Pipeline feature of Kubeflow installed on GKE, I worked through the example below
- Using Kubeflow for Financial Time Series (https://github.com/kubeflow/examples/tree/master/financial_time_series)
- This example covers the following concepts:
a. Deploying Kubeflow to a GKE cluster
b. Exploration via JupyterHub (prospect data, preprocess data, develop ML model)
c. Training several tensorflow models at scale with TF-jobs
d. Deploy and serve with TF-serving
e. Iterate training and serving
f. Training on GPU
g. Using Kubeflow Pipelines to automate ML workflow
2. TensorFlow training, deploy & serving in Kubernetes
a. Launch Cloud Shell
b. Training at scale with TF-jobs
- We will build a Docker image on Google Cloud by running the following commands
$ git clone https://github.com/kubeflow/examples.git
$ cd examples/financial_time_series/tensorflow_model/
$ cat Dockerfile
FROM tensorflow/tensorflow:1.15.0-py3
…
RUN pip3 install google-cloud-storage==1.17.0 \
google-cloud-bigquery==1.6.0 \
pandas==0.23.4
COPY . /opt/workdir
WORKDIR /opt/workdir
$ ls
CPU GPU __init__.py requirements.txt run_preprocess.py run_train.py tfserving.yaml
Dockerfile helpers ml_pipeline.py run_deploy.py run_preprocess_train_deploy.py serving_requests
$ export TRAIN_PATH=gcr.io/${PROJECT}/financial-model/cpu:v1 # format: gcr.io/<project>/<image-name>/cpu:v1
- Build the Docker image and register it in Google Container Registry
$ gcloud builds submit --tag $TRAIN_PATH .
Creating temporary tarball archive of 23 file(s) totalling 39.1 KiB before compression.
…
Uploading tarball of [.] to [gs://my-kubeflow-269301_cloudbuild/source/1582615138.56-855904751f8341e69964239f29d28135.tgz]
…
Operation "operations/acf.045e2496-0e8e-4679-95eb-508ccf68839f" finished successfully.
Created [https://cloudbuild.googleapis.com/v1/projects/my-kubeflow-269301/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3].
Logs are available at [https://console.cloud.google.com/gcr/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3?project=954212444059].
--------------------------------------------------------------- REMOTE BUILD OUTPUT ------------------------------------------
starting build "74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3"
FETCHSOURCE
…
BUILD
Already have image (with digest): gcr.io/cloud-builders/docker
Step 1/10 : FROM tensorflow/tensorflow:1.15.0-py3
1.15.0-py3: Pulling from tensorflow/tensorflow
…
Step 10/10 : WORKDIR /opt/workdir
Successfully tagged gcr.io/my-kubeflow-269301/financial-model/cpu:v1
PUSH
Pushing gcr.io/my-kubeflow-269301/financial-model/cpu:v1
The push refers to repository [gcr.io/my-kubeflow-269301/financial-model/cpu]
v1: digest: sha256:fd5178caff6cebc35393d39c5514ce5baf56764637841d29108fcba47caad327 size: 4098
…
$
- Verify that the Docker image is registered
- We will create a bucket to store our data and model artifacts:
$ BUCKET_NAME=financial-bucket-ysjeon7
$ gsutil mb gs://$BUCKET_NAME/
- We can launch the tf-job to our Kubeflow cluster and follow the progress via the logs of the pod.
In this case we use a very simple TF-job definition: it has only a single worker, since we are not doing any advanced training set-up (e.g. distributed training).
TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes. TFJob spec is designed to manage distributed TensorFlow training jobs.
$ vi CPU/tfjob1.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
tfReplicaSpecs:
Worker:
replicas: 1
…
- name: tensorflow
image: gcr.io/my-kubeflow-269301/financial-model/cpu:v1
command:
- python
- run_preprocess_train_deploy.py
- --model=FlatModel
- --epochs=30001
- --bucket=financial-bucket-ysjeon7
- --tag=v1
…
$ kubectl apply -f CPU/tfjob1.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-flat --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
Precision = 0.9142857142857143
Recall = 0.2222222222222222
F1 Score = 0.35754189944134074
Accuracy = 0.6006944444444444
$ kubectl get pods -n kubeflow | grep $POD_NAME
tfjob-flat-worker-0 0/1 Completed 0 12m
$
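The logged F1 score can be sanity-checked against the reported precision and recall, since F1 is their harmonic mean:

```python
# Metrics reported in the tfjob-flat training logs above
precision = 0.9142857142857143
recall = 0.2222222222222222

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.3575, consistent with the logged F1 Score
```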
- Verify the trained model
c. Deploy and serve with TF-serving
- We will use the standard TF-serving module that Kubeflow offers.
$ vi tfserving.yaml
apiVersion: v1
kind: Service
…
spec:
containers:
- args:
- --port=9000
- --rest_api_port=8500
- --model_name=finance-model
- --model_base_path=gs://financial-bucket-ysjeon7/tfserving/
command:
- /usr/bin/tensorflow_model_server
image: tensorflow/serving:1.11.1
…
$ kubectl apply -f tfserving.yaml
service/tfserving-service created
deployment.apps/tfserving created
$ POD=`kubectl get pods -n kubeflow --selector=app=model | awk '{print $1}' | tail -1`
$ kubectl logs -f $POD -n kubeflow
…
$ kubectl get pods -o wide | grep $POD
tfserving-5b6ddb8586-llkv9 1/1 Running 0 5h55m
$
- We can set up port-forwarding to our localhost.
$ kubectl get services -n kubeflow --selector=app=model
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
tfserving-service ClusterIP 10.63.248.92 <none> 9000/TCP,8500/TCP 9m46s
$ kubectl port-forward $POD 8500:8500 -n kubeflow 2>&1 >/dev/null &
[1] 1798
$ netstat -na | grep -w 8500
tcp 0 0 127.0.0.1:8500 0.0.0.0:* LISTEN
$ curl "http://127.0.0.1:8500/v1/models/finance-model"
…
$
- All we need to do is send a request to localhost:8500 with the expected input of the saved model, and it will return a prediction.
$ sudo pip3 install numpy requests
$ python3 -m serving_requests.request_random
{'model-tag': 'v1', 'prediction': 0}
$ sudo pip3 install -r requirements.txt
$ python3 -m serving_requests.request
{'prediction': 0, 'model-tag': 'v1'}
$
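The request scripts above wrap TF-Serving's REST predict API. Below is a minimal stdlib-only sketch of such a call; the eight-element feature vector is a placeholder assumption, as the real input shape is defined by the saved model's signature in the example repo:

```python
import json
import urllib.request

MODEL_URL = "http://127.0.0.1:8500/v1/models/finance-model:predict"

def build_request(features):
    """Build the JSON body expected by TF-Serving's REST predict endpoint."""
    return json.dumps({"instances": [features]}).encode("utf-8")

def predict(features, url=MODEL_URL):
    """POST one feature vector and return the parsed prediction response."""
    req = urllib.request.Request(
        url, data=build_request(features),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # placeholder feature vector; sending it requires the port-forward above
    print(build_request([0.1] * 8))
```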
d. Running another TF-job and serving update
- We will train a more complex neural network with several hidden layers.
$ vi CPU/tfjob2.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
tfReplicaSpecs:
Worker:
replicas: 1
…
- name: tensorflow
image: gcr.io/my-kubeflow-269301/financial-model/cpu:v1
command:
- python
- run_preprocess_train_deploy.py
- --model=DeepModel
- --epochs=30001
- --bucket=financial-bucket-ysjeon7
- --tag=v2
…
$ kubectl apply -f CPU/tfjob2.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-deep --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
Accuracy = 0.7222222222222222
$ python3 -m serving_requests.request
{'prediction': 1, 'model-tag': 'v2'}
$
The response returns the model tag 'v2' and the correct prediction 1.
e. Running TF-job on a GPU
$ cp GPU/Dockerfile ./Dockerfile
$ cat Dockerfile
FROM tensorflow/tensorflow:1.15.0-gpu-py3
…
$ export TRAIN_PATH_GPU=gcr.io/my-kubeflow-269301/financial-model/gpu:v1
$ gcloud builds submit --tag $TRAIN_PATH_GPU .
…
$
- The GKE deployment script for Kubeflow automatically adds a GPU-pool that can scale as needed so you don’t need to pay for a GPU when you don’t need it.
$ vi GPU/tfjob3.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
tfReplicaSpecs:
Worker:
replicas: 1
…
- name: tensorflow
image: gcr.io/my-kubeflow-269301/financial-model/gpu:v1
command:
- python
- run_preprocess_train_deploy.py
- --model=DeepModel
- --epochs=30001
- --bucket=financial-bucket-ysjeon7
- --tag=v3
…
# the container now has a nodeSelector to point to the GPU-pool.
resources:
limits:
# You can consume these GPUs from your containers by requesting <vendor>.com/gpu just like you request cpu or memory.
nvidia.com/gpu: 1
…
$ kubectl apply -f GPU/tfjob3.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-deep-gpu --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
$
- Troubleshooting
▷ Problem: Error msg = unschedulable Pod (PodUnschedulable indicates that the Pod cannot be scheduled because of insufficient resources or a configuration error.)
▷ Solution:
The error occurred because the allocated quota for GPUs (all regions) was 0; a quota increase must be requested from Google by mail.
Ref. https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-nvidia-gpu-device-plugin
3. Kubeflow Pipelines
- Kubeflow Pipelines offers an easy way of chaining these steps (preprocessing, training, and deployment) together.
- Kubeflow Pipelines requires us to compile our Python 3 pipeline file into a domain-specific language.
a. Create the pipeline definition
$ sudo pip3 install python-dateutil kfp==0.1.36
$ vi ml_pipeline.py
…
class Preprocess(dsl.ContainerOp):
def __init__(self, name, bucket, cutoff_year):
super(Preprocess, self).__init__(
name=name,
# image needs to be a compile-time string
image='gcr.io/my-kubeflow-269301/financial-model/cpu:v1',
command=['python3', 'run_preprocess.py'],
…
if __name__ == '__main__':
import kfp.compiler as compiler
compiler.Compiler().compile(preprocess_train_deploy, __file__ + '.tar.gz')
$ python3 ml_pipeline.py
$ tar tzvf ml_pipeline.py.tar.gz
-rw-r--r-- 0/0 7093 1970-01-01 09:00 pipeline.yaml
$
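The compiled package is just a gzipped tarball holding the generated workflow (pipeline.yaml, as the tar listing above shows). A short stdlib sketch to inspect it before uploading:

```python
import os
import tarfile

def list_package(path):
    """List the member files inside a compiled Kubeflow pipeline package."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()

if os.path.exists("ml_pipeline.py.tar.gz"):
    print(list_package("ml_pipeline.py.tar.gz"))  # expect ['pipeline.yaml']
```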
b. Register the pipeline
- Download the file to a PC
- Upload pipeline
Open Kubeflow: https://my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog/
Select "+ Upload pipeline", enter the following fields, and click "Create"
Pipeline Name: Finance model
Pipeline Description: Finance model pipeline
Upload a file: ml_pipeline.py.tar.gz
c. Run the pipeline
- "+ Create run"
- Check the pipeline run progress and results