
Kubeflow 1.0 Features #2 (TF-Job, TF-Serving, Kubeflow Pipelines)

by 여행을 떠나자! 2021. 9. 24.

2020.02.26

 

1. Overview

- To understand the Pipelines feature of the Kubeflow deployment on GKE, I worked through the example referenced below.

- Using Kubeflow for Financial Time Series (https://github.com/kubeflow/examples/tree/master/financial_time_series)

- This example covers the following concepts:

   a. Deploying Kubeflow to a GKE cluster

   b. Exploring data via JupyterHub (explore data, preprocess data, develop the ML model)

   c. Training several TensorFlow models at scale with TF-Jobs

   d. Deploying and serving with TF-Serving

   e. Iterating on training and serving

   f. Training on a GPU

   g. Using Kubeflow Pipelines to automate the ML workflow

 

 

2. TensorFlow training, deployment & serving on Kubernetes

a. Launch Cloud Shell

 

b. Training at scale with TF-jobs

- We will build a Docker image on Google Cloud by running the following commands:

$ git clone https://github.com/kubeflow/examples.git
$ cd examples/financial_time_series/tensorflow_model/
$ cat Dockerfile
FROM tensorflow/tensorflow:1.15.0-py3
…
RUN pip3 install google-cloud-storage==1.17.0 \
                 google-cloud-bigquery==1.6.0 \
                 pandas==0.23.4
COPY . /opt/workdir
WORKDIR /opt/workdir
$ ls
CPU         GPU      __init__.py     requirements.txt  run_preprocess.py               run_train.py      tfserving.yaml
Dockerfile  helpers  ml_pipeline.py  run_deploy.py     run_preprocess_train_deploy.py  serving_requests
$ export TRAIN_PATH=gcr.io/${PROJECT}/financial-model/cpu:v1         # format: gcr.io/<project>/<image-name>/cpu:v1

- Build the Docker image and push it to Google Container Registry:

$ gcloud builds submit --tag $TRAIN_PATH .
Creating temporary tarball archive of 23 file(s) totalling 39.1 KiB before compression.
…
 Uploading tarball of [.] to [gs://my-kubeflow-269301_cloudbuild/source/1582615138.56-855904751f8341e69964239f29d28135.tgz]
…
Operation "operations/acf.045e2496-0e8e-4679-95eb-508ccf68839f" finished successfully.
Created [https://cloudbuild.googleapis.com/v1/projects/my-kubeflow-269301/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3].
Logs are available at [https://console.cloud.google.com/gcr/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3?project=954212444059].
--------------------------------------------------------------- REMOTE BUILD OUTPUT ------------------------------------------
starting build "74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3"
FETCHSOURCE
…
BUILD
Already have image (with digest): gcr.io/cloud-builders/docker
Step 1/10 : FROM tensorflow/tensorflow:1.15.0-py3
1.15.0-py3: Pulling from tensorflow/tensorflow
…
Step 10/10 : WORKDIR /opt/workdir
Successfully tagged gcr.io/my-kubeflow-269301/financial-model/cpu:v1
PUSH
 Pushing gcr.io/my-kubeflow-269301/financial-model/cpu:v1
 The push refers to repository [gcr.io/my-kubeflow-269301/financial-model/cpu]
 v1: digest: sha256:fd5178caff6cebc35393d39c5514ce5baf56764637841d29108fcba47caad327 size: 4098
…
$

- Check the build at: https://console.cloud.google.com/gcr/builds/74fb9d48-5a2e-4a85-882e-f00bbcd8b9b3?project=954212444059

- Confirm the Docker image is registered in Container Registry

- We will create a bucket to store our data and model artifacts:

$ BUCKET_NAME=financial-bucket-ysjeon7
$ gsutil mb gs://$BUCKET_NAME/

- We can launch the TF-Job on our Kubeflow cluster and follow its progress via the pod logs.

   In this case we are using a very simple TF-Job definition: it has only a single worker, since we are not doing any advanced training set-up (e.g. distributed training).

   TFJob is a Kubernetes custom resource for running TensorFlow training jobs on Kubernetes; its spec is designed to manage distributed TensorFlow training jobs (a programmatic alternative to kubectl is sketched after the pod logs below).

$ vi CPU/tfjob1.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
…
        - name: tensorflow
          image: gcr.io/my-kubeflow-269301/financial-model/cpu:v1
          command:
            - python
            - run_preprocess_train_deploy.py
            - --model=FlatModel
            - --epochs=30001
            - --bucket=financial-bucket-ysjeon7
            - --tag=v1
…
$ kubectl apply -f CPU/tfjob1.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-flat --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
Precision =  0.9142857142857143
Recall =  0.2222222222222222
F1 Score =  0.35754189944134074
Accuracy =  0.6006944444444444
$ kubectl get pods -n kubeflow | grep $POD_NAME
tfjob-flat-worker-0                                            0/1     Completed   0          12m
$
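- Since TFJob is just a Kubernetes custom resource, the same manifest can also be submitted from Python with the official kubernetes client instead of kubectl. A minimal, illustrative sketch (assumes the kubernetes and PyYAML packages are installed and uses the CPU/tfjob1.yaml shown above):

import yaml
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

with open('CPU/tfjob1.yaml') as f:   # the manifest shown above
    tfjob = yaml.safe_load(f)

# TFJob lives in API group kubeflow.org, version v1, plural "tfjobs"
api.create_namespaced_custom_object(
    group='kubeflow.org', version='v1',
    namespace='kubeflow', plural='tfjobs', body=tfjob)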

- Check the trained model
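   A quick way to list what was exported, assuming the deploy step writes the SavedModel under gs://financial-bucket-ysjeon7/tfserving/ (the path referenced by tfserving.yaml below); google-cloud-storage is already installed in the training image:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('financial-bucket-ysjeon7')
for blob in bucket.list_blobs(prefix='tfserving/'):   # assumed export prefix
    print(blob.name)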

 

c. Deploy and serve with TF-serving

- We will use the standard TF-serving module that Kubeflow offers.

$ vi tfserving.yaml
apiVersion: v1
kind: Service
…
  spec:
    containers:
    - args:
      - --port=9000
      - --rest_api_port=8500
      - --model_name=finance-model
      - --model_base_path=gs://financial-bucket-ysjeon7/tfserving/
      command:
      - /usr/bin/tensorflow_model_server
      image: tensorflow/serving:1.11.1
…
$ kubectl apply -f tfserving.yaml
service/tfserving-service created
deployment.apps/tfserving created
$ POD=`kubectl get pods -n kubeflow --selector=app=model | awk '{print $1}' | tail -1`
$ kubectl logs -f $POD -n kubeflow
…
$ kubectl get pods -o wide | grep $POD
tfserving-5b6ddb8586-llkv9                                    1/1     Running     0          5h55m
$

- We can set up port-forwarding to our localhost.

$ kubectl get services -n kubeflow --selector=app=model
NAME                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
tfserving-service   ClusterIP   10.63.248.92   <none>        9000/TCP,8500/TCP   9m46s
$ kubectl port-forward $POD 8500:8500 -n kubeflow 2>&1 >/dev/null &
[1] 1798
$ netstat -na | grep -w 8500
tcp        0      0 127.0.0.1:8500          0.0.0.0:*               LISTEN
$ curl "http://127.0.0.1:8500/v1/models/finance-model"
…
$

- All we need to do is send a request to localhost:8500 with the expected input of the saved model, and it will return a prediction (a Python sketch of such a request follows the commands below).

$ sudo pip3 install numpy requests
$ python3 -m serving_requests.request_random
{'model-tag': 'v1', 'prediction': 0}
$ sudo pip3 install -r requirements.txt
$ python3 -m serving_requests.request
{'prediction': 0, 'model-tag': 'v1'}
$
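- For reference, the request scripts ultimately talk to the TensorFlow Serving REST API exposed on the forwarded port 8500. A rough, hand-rolled sketch with the requests library (the feature vector below is a placeholder; the bundled scripts build the real input from the preprocessed data, so their response format may differ from the raw API payload):

import requests

# placeholder input; the actual model expects the preprocessed feature vector
instances = [[0.1, -0.2, 0.3, 0.0, 0.05, -0.1, 0.2, 0.15]]

resp = requests.post(
    'http://127.0.0.1:8500/v1/models/finance-model:predict',
    json={'instances': instances})
print(resp.json())   # raw TF-Serving response, e.g. {"predictions": [...]}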

 

d. Running another TF-Job and updating the served model

- We will train a more complex neural network with several hidden layers.

$ vi CPU/tfjob2.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
…
        - name: tensorflow
          image: gcr.io/my-kubeflow-269301/financial-model/cpu:v1
          command:
            - python
            - run_preprocess_train_deploy.py
            - --model=DeepModel
            - --epochs=30001
            - --bucket=financial-bucket-ysjeon7
            - --tag=v2
…
$ kubectl apply -f CPU/tfjob2.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-deep  --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
Accuracy =  0.7222222222222222
$ python3 -m serving_requests.request
{'prediction': 1, 'model-tag': 'v2'}
$

    The response now returns model tag 'v2' and predicts the correct output, 1.

 

e. Running a TF-Job on a GPU

$ cp GPU/Dockerfile ./Dockerfile
$ cat Dockerfile
FROM tensorflow/tensorflow:1.15.0-gpu-py3
…
$ export TRAIN_PATH_GPU=gcr.io/my-kubeflow-269301/financial-model/gpu:v1
$ gcloud builds submit --tag $TRAIN_PATH_GPU .
…
$

- The GKE deployment script for Kubeflow automatically adds a GPU-pool that can scale as needed so you don’t need to pay for a GPU when you don’t need it.

$ vi GPU/tfjob3.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
…
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
…
        - name: tensorflow
          image: gcr.io/my-kubeflow-269301/financial-model/gpu:v1
          command:
            - python
            - run_preprocess_train_deploy.py
            - --model=DeepModel
            - --epochs=30001
            - --bucket=financial-bucket-ysjeon7
            - --tag=v3
…
          # the container now has a nodeSelector to point to the GPU pool
          resources:
            limits:
              # You can consume these GPUs from your containers by requesting
              # <vendor>.com/gpu just like you request cpu or memory.
              nvidia.com/gpu: 1
…
$ kubectl apply -f GPU/tfjob3.yaml
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-deep-gpu --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl logs -f $POD_NAME -n kubeflow
…
$

 

- Troubleshooting

   ▷ Problem: Error message = "Unschedulable Pod" (PodUnschedulable indicates that the Pod cannot be scheduled due to insufficient resources or a configuration error.)

   ▷ Solution:

       This occurred because the quota for "GPUs (all regions)" was 0; a quota increase has to be requested from Google by email.

       Ref. https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-nvidia-gpu-device-plugin

 

 

3. Kubeflow Pipelines

- Kubeflow Pipelines offers an easy way to chain these steps (preprocessing, training, and deployment) together.

- Kubeflow Pipelines requires us to compile the pipeline, written as a Python 3 file with its DSL, into a workflow package before it can be uploaded.
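
- For orientation, chaining steps in the kfp 0.1.x DSL looks roughly like the sketch below. This is illustrative only, not the actual contents of ml_pipeline.py; flag names and parameters are assumptions:

import kfp.dsl as dsl

@dsl.pipeline(name='financial time series (sketch)',
              description='preprocess -> train -> deploy')
def preprocess_train_deploy(bucket='financial-bucket-ysjeon7', cutoff_year='2010'):
  preprocess = dsl.ContainerOp(
      name='preprocess',
      image='gcr.io/my-kubeflow-269301/financial-model/cpu:v1',
      command=['python3', 'run_preprocess.py'],
      arguments=['--bucket', bucket, '--cutoff_year', cutoff_year])  # flags are illustrative
  train = dsl.ContainerOp(
      name='train',
      image='gcr.io/my-kubeflow-269301/financial-model/cpu:v1',
      command=['python3', 'run_train.py'],
      arguments=['--bucket', bucket])                                # flags are illustrative
  train.after(preprocess)   # explicit ordering when no outputs are passed between steps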

 

a. Create the pipeline definition

$ sudo pip3 install python-dateutil kfp==0.1.36
$ vi ml_pipeline.py
…
class Preprocess(dsl.ContainerOp):
  def __init__(self, name, bucket, cutoff_year):
    super(Preprocess, self).__init__(
      name=name,
      # image needs to be a compile-time string
      image='gcr.io/my-kubeflow-269301/financial-model/cpu:v1',
      command=['python3', 'run_preprocess.py'],
…
if __name__ == '__main__':
  import kfp.compiler as compiler
  compiler.Compiler().compile(preprocess_train_deploy, __file__ + '.tar.gz')
$ python3 ml_pipeline.py
$ tar tzvf ml_pipeline.py.tar.gz
-rw-r--r-- 0/0            7093 1970-01-01 09:00 pipeline.yaml
$

 

b. Register the pipeline

- Download ml_pipeline.py.tar.gz to your PC

- Upload pipeline

  Open the Kubeflow dashboard: https://my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog/

      Select "+ Upload pipeline", fill in the following fields, and click "Create":

      Pipeline Name: Finance model

      Pipeline Description : Finance model pipeline

      Upload a file: ml_pipeline.py.tar.gz
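
- Alternatively, uploading and running the pipeline can be scripted. A loose sketch assuming a newer kfp SDK than the 0.1.36 pinned above (kfp.Client's upload_pipeline and create_run_from_pipeline_package may be missing or differ in older versions, and authentication against an IAP-protected endpoint needs extra setup):

import kfp

# host is the pipeline endpoint behind the dashboard URL used above
client = kfp.Client(host='https://my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog/pipeline')

client.upload_pipeline('ml_pipeline.py.tar.gz', pipeline_name='Finance model')
client.create_run_from_pipeline_package('ml_pipeline.py.tar.gz',
                                        arguments={},               # pipeline parameters, if any
                                        run_name='finance-model-run')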

 

c. Run the pipeline

- Click "+ Create run"

Check the pipeline run's progress and results
