2020.02.26
1. Overview
- To understand the Pipeline feature of Kubeflow installed on GKE, I worked through the example below
- Using Kubeflow for Financial Time Series (https://github.com/kubeflow/examples/tree/master/financial_time_series)
- This example covers the following concepts:
a. Deploying Kubeflow to a GKE cluster
b. Exploration via JupyterHub (prospect data, preprocess data, develop ML model)
c. Training several tensorflow models at scale with TF-jobs
d. Deploy and serve with TF-serving
e. Iterate training and serving
f. Training on GPU
g. Using Kubeflow Pipelines to automate ML workflow
2. TensorFlow training, deploy & serving in Kubernetes
a. Launch Cloud Shell
b. Training at scale with TF-jobs
- We will build a Docker image on Google Cloud by running the following commands
$ git clone https://github.com/kubeflow/examples.git
$ cd examples/financial_time_series/tensorflow_model/
$ cat Dockerfile
FROM tensorflow/tensorflow:1.15.0-py3
…
RUN pip3 install google-cloud-storage==1.17.0 \
google-cloud-bigquery==1.6.0 \
pandas==0.23.4
COPY . /opt/workdir
WORKDIR /opt/workdir
$ ls
CPU GPU __init__.py requirements.txt run_preprocess.py run_train.py tfserving.yaml
Dockerfile helpers ml_pipeline.py run_deploy.py run_preprocess_train_deploy.py serving_requests
$ export TRAIN_PATH=gcr.io/${PROJECT}/financial-model/cpu:v1 # format: gcr.io/<project>/<image-name>/cpu:v1
- Build the Docker image and register it in Google Container Registry
$ gcloud builds submit --tag $TRAIN_PATH .
Creating temporary tarball archive of 23 file(s) totalling 39.1 KiB before compression.
…
Uploading tarball of [.] to [gs://my-kubeflow-269301_cloudbuild/source/1582615138.56-855904751f8341e69964239f29d28135.tgz]
…
Operation "operations/acf.045e2496-0e8e-4679-95eb-508ccf68839f" finished successfully.
Created [https://cloudbuild.googleapis.com/v1/projects/my-kubeflow-269301/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3].
Logs are available at [https://console.cloud.google.com/gcr/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3?project=954212444059].
--------------------------------------------------------------- REMOTE BUILD OUTPUT ------------------------------------------
starting build "74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3"
FETCHSOURCE
…
BUILD
Already have image (with digest): gcr.io/cloud-builders/docker
Step 1/10 : FROM tensorflow/tensorflow:1.15.0-py3
1.15.0-py3: Pulling from tensorflow/tensorflow
…
Step 10/10 : WORKDIR /opt/workdir
Successfully tagged gcr.io/my-kubeflow-269301/financial-model/cpu:v1
PUSH
Pushing gcr.io/my-kubeflow-269301/financial-model/cpu:v1
The push refers to repository [gcr.io/my-kubeflow-269301/financial-model/cpu]
v1: digest: sha256:fd5178caff6cebc35393d39c5514ce5baf56764637841d29108fcba47caad327 size: 4098
…
$
- Verify that the Docker image is registered
- We will create a bucket to store our data and model artifacts:
$ BUCKET_NAME=financial-bucket-ysjeon7
$ gsutil mb gs://$BUCKET_NAME/
- We can launch the tf-job to our Kubeflow cluster and follow the progress via the logs of the pod.
In this case we use a very simple TF-job definition: it has only a single worker, since we are not doing any advanced training set-up (e.g. distributed training).
TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes. TFJob spec is designed to manage distributed TensorFlow training jobs.
$ vi CPU/tfjob1.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
tfReplicaSpecs:
Worker:
replicas: 1
…
- name: tensorflow
image: gcr.io/my-kubeflow-269301/financial-model/cpu:v1
command:
- python
- run_preprocess_train_deploy.py
- --model=FlatModel
- --epochs=30001
- --bucket=financial-bucket-ysjeon7
- --tag=v1
…
$ kubectl apply -f CPU/tfjob1.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-flat --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
Precision = 0.9142857142857143
Recall = 0.2222222222222222
F1 Score = 0.35754189944134074
Accuracy = 0.6006944444444444
$ kubectl get pods -n kubeflow | grep $POD_NAME
tfjob-flat-worker-0 0/1 Completed 0 12m
$
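The logged F1 score can be sanity-checked against the reported precision and recall, since F1 is their harmonic mean:

```python
# Metrics reported in the tfjob-flat training logs above
precision = 0.9142857142857143
recall = 0.2222222222222222

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.3575, consistent with the logged F1 Score
```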
- Verify the trained model
c. Deploy and serve with TF-serving
- We will use the standard TF-serving module that Kubeflow offers.
$ vi tfserving.yaml
apiVersion: v1
kind: Service
…
spec:
containers:
- args:
- --port=9000
- --rest_api_port=8500
- --model_name=finance-model
- --model_base_path=gs://financial-bucket-ysjeon7/tfserving/
command:
- /usr/bin/tensorflow_model_server
image: tensorflow/serving:1.11.1
…
$ kubectl apply -f tfserving.yaml
service/tfserving-service created
deployment.apps/tfserving created
$ POD=`kubectl get pods -n kubeflow --selector=app=model | awk '{print $1}' | tail -1`
$ kubectl logs -f $POD -n kubeflow
…
$ kubectl get pods -o wide | grep $POD
tfserving-5b6ddb8586-llkv9 1/1 Running 0 5h55m
$
- We can set up port-forwarding to our localhost.
$ kubectl get services -n kubeflow --selector=app=model
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
tfserving-service ClusterIP 10.63.248.92 <none> 9000/TCP,8500/TCP 9m46s
$ kubectl port-forward $POD 8500:8500 -n kubeflow 2>&1 >/dev/null &
[1] 1798
$ netstat -na | grep -w 8500
tcp 0 0 127.0.0.1:8500 0.0.0.0:* LISTEN
$ curl "http://127.0.0.1:8500/v1/models/finance-model"
…
$
- All we need to do is send a request to localhost:8500 with the expected input of the saved model, and it will return a prediction.
$ sudo pip3 install numpy requests
$ python3 -m serving_requests.request_random
{'model-tag': 'v1', 'prediction': 0}
$ sudo pip3 install -r requirements.txt
$ python3 -m serving_requests.request
{'prediction': 0, 'model-tag': 'v1'}
$
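The request scripts above wrap TF-Serving's REST predict API. Below is a minimal stdlib-only sketch of such a call; the eight-element feature vector is a placeholder assumption, as the real input shape is defined by the saved model's signature in the example repo:

```python
import json
import urllib.request

MODEL_URL = "http://127.0.0.1:8500/v1/models/finance-model:predict"

def build_request(features):
    """Build the JSON body expected by TF-Serving's REST predict endpoint."""
    return json.dumps({"instances": [features]}).encode("utf-8")

def predict(features, url=MODEL_URL):
    """POST one feature vector and return the parsed prediction response."""
    req = urllib.request.Request(
        url, data=build_request(features),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # placeholder feature vector; sending it requires the port-forward above
    print(build_request([0.1] * 8))
```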
d. Running another TF-job and serving update
- We will train a more complex neural network with several hidden layers.
$ vi CPU/tfjob2.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
tfReplicaSpecs:
Worker:
replicas: 1
…
- name: tensorflow
image: gcr.io/my-kubeflow-269301/financial-model/cpu:v1
command:
- python
- run_preprocess_train_deploy.py
- --model=DeepModel
- --epochs=30001
- --bucket=financial-bucket-ysjeon7
- --tag=v2
…
$ kubectl apply -f CPU/tfjob2.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-deep --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
Accuracy = 0.7222222222222222
$ python3 -m serving_requests.request
{'prediction': 1, 'model-tag': 'v2'}
$
The response returns the model tag 'v2' and the correct prediction 1.
e. Running TF-job on a GPU
$ cp GPU/Dockerfile ./Dockerfile
$ cat Dockerfile
FROM tensorflow/tensorflow:1.15.0-gpu-py3
…
$ export TRAIN_PATH_GPU=gcr.io/my-kubeflow-269301/financial-model/gpu:v1
$ gcloud builds submit --tag $TRAIN_PATH_GPU .
…
$
- The GKE deployment script for Kubeflow automatically adds a GPU-pool that can scale as needed so you don’t need to pay for a GPU when you don’t need it.
$ vi GPU/tfjob3.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
tfReplicaSpecs:
Worker:
replicas: 1
…
- name: tensorflow
image: gcr.io/my-kubeflow-269301/financial-model/gpu:v1
command:
- python
- run_preprocess_train_deploy.py
- --model=DeepModel
- --epochs=30001
- --bucket=financial-bucket-ysjeon7
- --tag=v3
…
# the container now has a nodeSelector to point to the GPU-pool.
resources:
limits:
# You can consume these GPUs from your containers by requesting <vendor>.com/gpu just like you request cpu or memory.
nvidia.com/gpu: 1
…
$ kubectl apply -f GPU/tfjob3.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-deep-gpu --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
$
- Troubleshooting
▷ Problem: Error msg = unschedulable Pod (PodUnschedulable indicates that the Pod cannot be scheduled because of insufficient resources or a configuration error.)
▷ Solution:
The error occurred because the allocated quota for GPUs (all regions) was 0; a quota increase must be requested from Google by mail.
Ref. https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-nvidia-gpu-device-plugin
3. Kubeflow Pipelines
- Kubeflow Pipelines offers an easy way of chaining these steps (preprocessing, training, and deployment) together.
- Kubeflow Pipelines requires us to compile our Python 3 pipeline file into a domain-specific language.
a. Create the pipeline definition
$ sudo pip3 install python-dateutil kfp==0.1.36
$ vi ml_pipeline.py
…
class Preprocess(dsl.ContainerOp):
def __init__(self, name, bucket, cutoff_year):
super(Preprocess, self).__init__(
name=name,
# image needs to be a compile-time string
image='gcr.io/my-kubeflow-269301/financial-model/cpu:v1',
command=['python3', 'run_preprocess.py'],
…
if __name__ == '__main__':
import kfp.compiler as compiler
compiler.Compiler().compile(preprocess_train_deploy, __file__ + '.tar.gz')
$ python3 ml_pipeline.py
$ tar tzvf ml_pipeline.py.tar.gz
-rw-r--r-- 0/0 7093 1970-01-01 09:00 pipeline.yaml
$
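The compiled package is just a gzipped tarball holding the generated workflow (pipeline.yaml, as the tar listing above shows). A short stdlib sketch to inspect it before uploading:

```python
import os
import tarfile

def list_package(path):
    """List the member files inside a compiled Kubeflow pipeline package."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()

if os.path.exists("ml_pipeline.py.tar.gz"):
    print(list_package("ml_pipeline.py.tar.gz"))  # expect ['pipeline.yaml']
```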
b. Register the pipeline
- Download the file to a PC
- Upload pipeline
Open Kubeflow: https://my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog/
Select "+ Upload pipeline", enter the following fields, and click "Create"
Pipeline Name: Finance model
Pipeline Description: Finance model pipeline
Upload a file: ml_pipeline.py.tar.gz
c. Run the pipeline
- "+ Create run"
- Check the pipeline run progress and results