1. What is Kubeflow KFServing?
- Kubeflow supports two model serving systems that allow multi-framework model serving: KFServing and Seldon Core.
- KFServing enables serverless inferencing on Kubernetes.
✓ It encapsulates the complexity of autoscaling, networking, health checking, and server configuration
to bring cutting-edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your ML deployments.
- Kubeflow 1.2 includes KFServing v0.4.1.
Knative Serving v0.14.3, Istio v1.3
- Kubeflow 1.3 includes KFServing v0.5.1 which promoted the core InferenceService API from v1alpha2 to v1beta1 stable and added v1alpha1 version of Multi-Model Serving.
Knative Serving v0.17.4, Istio v1.9
- Reference
https://v1-2-branch.kubeflow.org/docs/components/kfserving/kfserving/
https://www.kangwoo.kr/2020/04/11/kubeflow-kfserving-%EA%B0%9C%EC%9A%94/
2. Deploying an inference service
- Deploy TensorFlow models with out-of-the-box model servers
https://github.com/kubeflow/kfserving/tree/master/docs/samples/v1alpha2/tensorflow
https://github.com/kubeflow/kfserving/tree/master/docs/samples/v1beta1/tensorflow ⇠ Kubeflow 1.3
$ vi tensorflow.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  namespace: yoosung-jeon
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
$ k apply -f tensorflow.yaml
...
$
- InferenceService data plane
✓ Endpoint: each InferenceService has a "Default" and a "Canary" endpoint.
✓ Component: each endpoint is composed of multiple components such as the "predictor", "explainer", and "transformer" (a manifest sketch follows this list).
Predictor: the core component of an InferenceService; the model and model server made available at a network endpoint.
Explainer: provides an explanation of how the model arrived at its prediction. Users can define their own explainer container; for common use cases KFServing provides built-in explainers such as Alibi.
Transformer: defines pre- and post-processing steps around the prediction and explanation workflows. Like the explainer, it is configured through the relevant environment variables; for common use cases KFServing provides built-in transformers such as Feast.
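As a sketch only (not part of this deployment), an explainer and transformer could be declared next to the predictor in the same v1alpha2 manifest; the Alibi type, bucket path, and transformer image below are illustrative assumptions:
  spec:
    default:
      predictor:
        tensorflow:
          storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
      explainer:
        alibi:
          type: AnchorImages                        # illustrative explainer type
          storageUri: "gs://<bucket>/<explainer>"   # hypothetical explainer artifacts
      transformer:
        custom:
          container:
            image: <registry>/flowers-transformer:latest   # hypothetical pre/post-processing image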
- Calling the inference service
Because Dex (OIDC) is configured, a call to the deployed inference service is redirected to an authentication URL. A setting that bypasses Dex authentication is required; refer to the 'Dex authentication / bypass' document.
$ MODEL_NAME=flowers-sample
$ SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
$ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
$ INPUT_PATH=@./input.json
$ echo "SERVICE_HOSTNAME=$SERVICE_HOSTNAME"
SERVICE_HOSTNAME=flowers-sample.yoosung-jeon.example.com
$ echo "INGRESS_HOST=$INGRESS_HOST, INGRESS_PORT=$INGRESS_PORT"
INGRESS_HOST=14.52.244.137, INGRESS_PORT=80
$
$ curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH
* About to connect() to 14.52.244.137 port 80 (#0)
* Trying 14.52.244.137...
* Connected to 14.52.244.137 (14.52.244.137) port 80 (#0)
> POST /v1/models/flowers-sample:predict HTTP/1.1
> User-Agent: curl/7.29.0
> Accept: */*
> Host: flowers-sample.yoosung-jeon.example.com
> Content-Length: 16201
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< content-length: 222
< content-type: application/json
< date: Wed, 21 Jul 2021 06:03:18 GMT
< x-envoy-upstream-service-time: 2049
< server: istio-envoy
<
{
"predictions": [
{
"scores": [0.999114931, 9.20987877e-05, 0.000136786475, 0.000337258185, 0.000300532876, 1.84813962e-05],
"prediction": 0,
"key": " 1"
}
]
* Connection #0 to host 14.52.244.137 left intact
}
$
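For reference, the input.json payload used above follows the TensorFlow Serving predict request format. A minimal sketch of how such a payload could be built from a local image (the image_bytes/key field names match this flowers sample and may differ for other models):
$ IMG_B64=$(base64 -w0 flower.jpg)    # flower.jpg: any local test image (hypothetical file)
$ cat > input.json <<EOF
{"instances": [{"image_bytes": {"b64": "${IMG_B64}"}, "key": "1"}]}
EOF
$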
3. Changing the custom domain
- When an inference service is deployed, the URL used to call it takes the form '{Inference service name}.{namespace}.example.com'.
- To change the 'example.com' domain, refer to the 'Knative - Changing the custom domain' document.
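The gist of that change, as a sketch assuming a default Knative Serving installation, is that the serving domain is a key in the config-domain ConfigMap (the kf-serv.acp.kt.co.kr domain below matches the URLs shown next; the original example.com entry would also need to be removed):
$ kubectl patch cm config-domain -n knative-serving --type merge \
    -p '{"data": {"kf-serv.acp.kt.co.kr": ""}}'    # empty value = applies to all services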
$ k get inferenceservices.serving.kubeflow.org -n yoosung-jeon
NAME             URL                                                        READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
bert-large       http://bert-large.yoosung-jeon.kf-serv.acp.kt.co.kr       True    100                                 5d16h
flowers-sample   http://flowers-sample.yoosung-jeon.kf-serv.acp.kt.co.kr   True    100                                 27m
$
4. Specifying the TensorFlow ModelServer version
- The TensorFlow ModelServer version used by default is 1.14.0.
$ k describe cm inferenceservice-config -n kubeflow | grep tensorflow -A 3
"tensorflow": {
"image": "tensorflow/serving",
"defaultImageVersion": "1.14.0",
"defaultGpuImageVersion": "1.14.0-gpu"
},
$ k exec flowers-sample-predictor-default-5jsk4-deployment-78cc8b8bfzkfc -c kfserving-container -n yoosung-jeon -it -- tensorflow_model_server --version
TensorFlow ModelServer: 1.14.0-rc0
TensorFlow Library: 1.14.0
$
- Specify the TensorFlow ModelServer version to use with runtimeVersion.
$ vi tensorflow.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  namespace: yoosung-jeon
spec:
  default:
    predictor:
      tensorflow:
        runtimeVersion: "2.5.1"
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
$ k apply -f tensorflow.yaml
...
$ k get pod -n yoosung-jeon -l serving.kubeflow.org/inferenceservice=flowers-sample
NAME                                                              READY   STATUS    RESTARTS   AGE
flowers-sample-predictor-default-zwwvl-deployment-f668b4498294f  2/2     Running   3          2d11h
$ k exec flowers-sample-predictor-default-zwwvl-deployment-f668b4498294f -c kfserving-container -n yoosung-jeon -it -- tensorflow_model_server --version
TensorFlow ModelServer: 2.5.1-rc3
TensorFlow Library: 2.5.0
$
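- To change the default version cluster-wide instead of per InferenceService, the defaultImageVersion in the inferenceservice-config ConfigMap shown above can be edited (a sketch based on that ConfigMap; newly deployed InferenceServices should then use the new default).
$ kubectl edit cm inferenceservice-config -n kubeflow
    "tensorflow": {
        "image": "tensorflow/serving",
        "defaultImageVersion": "2.5.1",        ⇠ change to the desired tensorflow/serving tag
        "defaultGpuImageVersion": "2.5.1-gpu"
    },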
5. Using a GPU
- Specify a GPU-enabled TensorFlow ModelServer and the number of GPU resources to use.
runtimeVersion: "2.5.1-gpu"
nvidia.com/gpu: 1
$ vi tensorflow.yaml
...
spec:
  default:
    predictor:
      tensorflow:
        runtimeVersion: "2.5.1-gpu"
        resources:
          limits:
            cpu: "1"
            memory: 16Gi
            nvidia.com/gpu: 1
          requests:
            cpu: "1"
            memory: 16Gi
            nvidia.com/gpu: 1
…
$ k apply -f tensorflow.yaml
...
$ k logs flowers-sample-predictor-default-cxqnw-deployment-5d94b997s4jfr -n yoosung-jeon -c kfserving-container
…
2021-08-13 02:26:35.968290: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
…
$
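- As a quick check that the GPU request actually landed on the serving container (a sketch reusing the pod name from the log command above):
$ k get pod flowers-sample-predictor-default-cxqnw-deployment-5d94b997s4jfr -n yoosung-jeon \
    -o jsonpath='{.spec.containers[?(@.name=="kfserving-container")].resources.limits}'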
- The deployed inference service has been allocated one GPU and is holding it.
$ k exec nvidia-driver-daemonset-txb8j -n gpu-operator-resources -it -- nvidia-smi
Fri Aug 13 02:59:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 28C P0 38W / 250W | 31461MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:D8:00.0 Off | 0 |
| N/A 25C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 160113 C ...n/tensorflow_model_server 31457MiB |
+-----------------------------------------------------------------------------+
$
6. Model storage (PVC)
- Creating a PVC and deploying the model
✓ Create a PVC to hold the model. This example uses nfs-client-provisioner as the storage class, so the created PV can be accessed directly.
The NAS mount point is '/nfs_01', and the PV for kfserving-pvc is 'yoosung-jeon-kfserving-pvc-pvc-a83bf69f-3b8f-41be-ba51-e76402a36f6d'.
Reference: NFS-Client Provisioner
✓ Copy the TensorFlow flowers model from Google storage with the gsutil command (a pod-based alternative is sketched after the transcript below).
✓ The model directory structure must follow the 'model_name/{numbers}/*' convention.
$ vi kfserving-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kfserving-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f kfserving-pvc.yaml
...
$ cd /nfs_01/yoosung-jeon-kfserving-pvc-pvc-a83bf69f-3b8f-41be-ba51-e76402a36f6d
$ gsutil cp -r gs://kfserving-samples/models/tensorflow/flowers .
Copying gs://kfserving-samples/models/tensorflow/flowers/0001/saved_model.pb...
Copying gs://kfserving-samples/models/tensorflow/flowers/0001/variables/variables.data-00000-of-00001...
Copying gs://kfserving-samples/models/tensorflow/flowers/0001/variables/variables.index...
\ [3 files][109.5 MiB/109.5 MiB] 2.8 MiB/s
Operation completed over 3 objects/109.5 MiB.
$ tree flowers
flowers
└── 0001
├── saved_model.pb
└── variables
├── variables.data-00000-of-00001
└── variables.index
2 directories, 3 files
$
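- If the PV cannot be reached directly from a host, the same copy can be done through a throwaway pod that mounts the PVC (a sketch; the pvc-loader name and busybox image are arbitrary choices, not part of this setup):
$ cat <<EOF | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pvc-loader          # arbitrary throwaway pod
  namespace: yoosung-jeon
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: model
          mountPath: /mnt/pvc
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: kfserving-pvc
EOF
$ k cp ./flowers yoosung-jeon/pvc-loader:/mnt/pvc/flowers
$ k delete pod pvc-loader -n yoosung-jeon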
- Set storageUri in the form 'pvc://{PVC name}/{model}'.
If the model contains multiple numbered directories, the directory with the highest number is served.
$ vi tensorflow.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  namespace: yoosung-jeon
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "pvc://kfserving-pvc/flowers"
        runtimeVersion: "2.5.1"
$ k apply -f tensorflow.yaml
…
$ k logs flowers-sample-predictor-default-7wjb9-deployment-848cbb7btn6gn -n yoosung-jeon -c storage-initializer
[I 210813 07:25:05 initializer-entrypoint:13] Initializing, args: src_uri [/mnt/pvc/flowers] dest_path[ [/mnt/models]
[I 210813 07:25:05 storage:35] Copying contents of /mnt/pvc/flowers to local
[I 210813 07:25:05 storage:205] Linking: /mnt/pvc/flowers/0001 to /mnt/models/0001
$
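- The PVC-backed model can be smoke-tested the same way as in section 2, reusing the SERVICE_HOSTNAME, INGRESS_HOST, INGRESS_PORT, and input.json prepared earlier:
$ curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/flowers-sample:predict -d @./input.json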
7. Autoscaling
- KFServing relies on Knative for autoscaling. For details on autoscaling, refer to the 'Knative - Autoscaling #1 (Concepts)' document.
$ vi tensorflow-autoscaler.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  namespace: yoosung-jeon
  annotations:
    autoscaling.knative.dev/target: "3"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "10"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "pvc://kfserving-pvc/flowers"
        runtimeVersion: "2.5.1"
$ k apply -f tensorflow-autoscaler.yaml
...
$ k get pod -l app=flowers-sample-predictor-default-zwwvl -n yoosung-jeon
NAME                                                              READY   STATUS    RESTARTS   AGE
flowers-sample-predictor-default-zwwvl-deployment-8cc7d458pkfvx  2/2     Running   0          3h21m
$
- Generating load with the hey tool for 30 seconds at 5 concurrent sessions, you can see the service autoscale.
$ ./hey_linux_amd64 -z 30s -c 5 -m POST -D ./input.json http://flowers-sample.yoosung-jeon.kf-serv.acp.kt.co.kr:80/v1/models/flowers-sample:predict
Summary:
Total: 33.4836 secs
Slowest: 16.0879 secs
Fastest: 1.4547 secs
Average: 5.1455 secs
Requests/sec: 0.9258
Total data: 6851 bytes
Size/request: 221 bytes
Response time histogram:
1.455 [1] |■■■■
2.918 [6] |■■■■■■■■■■■■■■■■■■■■■■
4.381 [7] |■■■■■■■■■■■■■■■■■■■■■■■■■
5.845 [11] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
7.308 [1] |■■■■
8.771 [1] |■■■■
10.235 [2] |■■■■■■■
11.698 [0] |
13.161 [1] |■■■■
14.625 [0] |
16.088 [1] |■■■■
Latency distribution:
10% in 2.5903 secs
25% in 3.0997 secs
50% in 4.8862 secs
75% in 5.5104 secs
90% in 9.8968 secs
95% in 16.0879 secs
0% in 0.0000 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0115 secs, 1.4547 secs, 16.0879 secs
DNS-lookup: 0.0113 secs, 0.0000 secs, 0.0700 secs
req write: 0.0002 secs, 0.0001 secs, 0.0004 secs
resp wait: 5.1336 secs, 1.4541 secs, 16.0160 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0005 secs
Status code distribution:
[200] 31 responses
$ k get pod -l app=flowers-sample-predictor-default-zwwvl -n yoosung-jeon
NAME                                                              READY   STATUS    RESTARTS   AGE
flowers-sample-predictor-default-zwwvl-deployment-8cc7d4586ggxv  2/2     Running   0          20s
flowers-sample-predictor-default-zwwvl-deployment-8cc7d4589c96m  2/2     Running   0          24s
flowers-sample-predictor-default-zwwvl-deployment-8cc7d458pkfvx  2/2     Running   0          3h21m
$
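- After the load stops, the extra replicas are scaled back down once the Knative stable window elapses; with minScale set to "1" as above, one pod remains. A sketch for watching this happen, reusing the label from earlier:
$ k get pod -l serving.kubeflow.org/inferenceservice=flowers-sample -n yoosung-jeon -w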