
Setting up Kubeflow 1.0 on GCE

by 여행을 떠나자! 2021. 9. 24.

2020.02.20

 

1. Overview

- This document describes how to set up Kubeflow, a Kubernetes-based end-to-end ML platform, on GCP (Google Cloud Platform).

- Ref. Page: https://www.kubeflow.org/docs/gke/deploy/deploy-cli/

 

 

2. What is Kubeflow?

- Kubeflow is the ML toolkit for Kubernetes. It serves as a platform for arranging the components of your ML system on top of Kubernetes.

- https://www.kubeflow.org/docs/about/kubeflow/

- https://bcho.tistory.com/1301

 

 

3. Setting up Kubeflow

a. Create a GCP Project 

a-1. Create a Google account

       - Go to https://accounts.google.com/ > create an account (ysjeon71.kubeflow2@gmail.com)

       - $300 in free credits is provided (valid for 12 months); entering billing information is required

a-2. Create GCP Project

       - Go to https://console.cloud.google.com/ > select 'My First Project' > select 'New Project' > Create

       - Project Name: My Kubeflow

          Project ID: my-kubeflow-269301 -> my-kubeflow-271310 (newly created)

 

b. Setup a GCP Project

b-1. Enable the specified APIs (configured per project)

       - https://www.kubeflow.org/docs/gke/deploy/project-setup

          Select Compute Engine API > Enable API

          Select Kubernetes Engine API > Enable API

          …

          Select AI Platform Training & Prediction API > Enable API
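The same APIs can probably be enabled from Cloud Shell instead of the console. The sketch below covers only the three APIs named above (the elided ones are not filled in). The service names in APIS are assumptions not stated in the original, and the DRY_RUN switch is a hypothetical convenience so the command is only printed by default.

```shell
# Assumed service names for the three APIs named in the list above.
APIS="compute.googleapis.com container.googleapis.com ml.googleapis.com"
PROJECT="${PROJECT:-my-kubeflow-271310}"

enable_apis() {
  # Word splitting of $APIS is intentional: one argument per service.
  if [ "${DRY_RUN:-true}" = "true" ]; then
    echo gcloud services enable $APIS --project "$PROJECT"
  else
    gcloud services enable $APIS --project "$PROJECT"
  fi
}

# Prints the command; run with DRY_RUN=false to actually enable the APIs.
enable_apis
```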

b-2. Set up an OAuth credential

       - https://www.kubeflow.org/docs/gke/deploy/oauth-setup/

          Set up your OAuth consent screen

               User type: External

               Application Name: Kubeflow

               Authorized domains: my-kubeflow-271310.cloud.goog

          On the Credentials screen:

               Select "+ Create Credentials" > "OAuth client ID"

                   Application type: Web application

               Result:

                   Client ID: 760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com

                   Client Secret: zoK2h8S9WXwqkhak9ROFOfZW

          On the Create credentials screen > edit

          Authorized redirect URIs: https://iap.googleapis.com/v1/oauth/clientIds/760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com:handleRedirect
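The redirect URI above follows a fixed pattern around the client ID. A minimal sketch that builds it, so the value can be pasted without hand-editing; the function name iap_redirect_uri is hypothetical.

```shell
# Build the IAP redirect URI for a given OAuth client ID.
iap_redirect_uri() {
  printf 'https://iap.googleapis.com/v1/oauth/clientIds/%s:handleRedirect\n' "$1"
}

# Falls back to the client ID created above when CLIENT_ID is not set.
CLIENT_ID="${CLIENT_ID:-760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com}"
iap_redirect_uri "$CLIENT_ID"
```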

 

c. Prepare your environment

- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#prepare-your-environment

c-1. Cloud Shell 

    Click the "Activate Cloud Shell" icon in the upper right to open a Cloud Shell window.

c-2. Prepare your environment

$ wget https://github.com/kubeflow/kfctl/releases/download/v1.0.1/kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz   ## latest release
$ tar xzf kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz

## Add the following to .bash_profile so the settings persist when reconnecting to Cloud Shell
$ vi ~/.bash_profile
#!/bin/bash
export PROJECT=my-kubeflow-271310
export ZONE=us-east1-b
export CLIENT_ID=760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com
export CLIENT_SECRET=zoK2h8S9WXwqkhak9ROFOfZW
export CONFIG_FILE=kfdef.yaml
export KF_NAME=my-kubeflow
export KF_DIR=~/kf_deployments/${KF_NAME}

gcloud config set project ${PROJECT}
gcloud config set compute/zone ${ZONE}
gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}
export PATH=~/kfctl:${PATH}
alias k=kubectl
$ source ~/.bash_profile

$ export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.1.yaml"   ## latest release
$ export CONFIG_FILE=kfdef.yaml
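Since the later steps depend on these variables, it may help to confirm they are all set before continuing. A minimal sketch; check_env is a hypothetical helper, and the variable list is taken from the exports above.

```shell
# Report any required variable that is unset or empty.
check_env() {
  missing=0
  for name in PROJECT ZONE CLIENT_ID CLIENT_SECRET CONFIG_FILE CONFIG_URI KF_NAME KF_DIR; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_env; then
  echo "environment OK"
else
  echo "environment incomplete"
fi
```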

 

d. Customizing your Kubeflow deployment

- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#customizing-your-kubeflow-deployment

d-1. Customizing Kubeflow config file

$ curl -L -o ${CONFIG_FILE} ${CONFIG_URI}
$ wget -O yq https://github.com/mikefarah/yq/releases/download/3.1.1/yq_linux_386 && chmod 744 yq
$ ./yq w -i ${CONFIG_FILE} 'spec.plugins[0].spec.project' ${PROJECT}
$ ./yq w -i ${CONFIG_FILE} 'spec.plugins[0].spec.zone' ${ZONE}
$ ./yq w -i ${CONFIG_FILE} 'metadata.name' ${KF_NAME}
$ kfctl build -V -f ${CONFIG_FILE}
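A coarse way to confirm that the three yq edits took effect is to search the config file for the expected key/value pairs. This is only a substring check under the assumption that the values are written as plain "key: value" lines; kfdef_contains is a hypothetical helper, not part of kfctl or yq.

```shell
# Succeed only if every expected string appears in the given file.
kfdef_contains() {
  file="$1"; shift
  for expected in "$@"; do
    grep -qF -- "$expected" "$file" || { echo "not found: $expected" >&2; return 1; }
  done
}

# Runs only when the config file is present in the current directory.
if [ -f "${CONFIG_FILE:-kfdef.yaml}" ]; then
  kfdef_contains "${CONFIG_FILE:-kfdef.yaml}" \
    "project: ${PROJECT}" "zone: ${ZONE}" "name: ${KF_NAME}" \
    && echo "kfdef looks consistent"
fi
```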

d-2. Customizing GKE(Google Kubernetes Engine) config file

    - https://www.kubeflow.org/docs/gke/customizing-gke/

       To customize your GKE cluster modify the deployment manager configuration files in the directory ${KF_DIR}/gcp_config

       To customize individual Kubeflow applications modify the Kustomize manifests in the directory ${KF_DIR}/kustomize

       Node autoprovisioning can be useful for autoscaling the cluster with node pools that are not user-defined.

       Make sure you set enableNodeAutoprovisioning to false in ${KF_DIR}/gcp_config/cluster-kubeflow.yaml, because this setup relies on the dedicated gpu-pool that the Kubeflow deployment provides.

$ vi ./gcp_config/cluster-kubeflow.yaml
…
   autoprovisioning-config:
     enabled: false                       # default: true
…
    cpu-pool-enable-autoscaling: true     # default: true
    cpu-pool-machine-type: n1-highmem-2   # default: n1-standard-8 (vCPU 8, 30GB) -> n1-highmem-2 (vCPU 2, 13GB)
…
    gpu-pool-enable-autoscaling: true     # default: true
    gpu-pool-initialNodeCount: 1          # default: 0
    gpu-pool-machine-type: n1-highmem-2   # default: n1-standard-8 (vCPU 8, 30GB) -> n1-highmem-2 (vCPU 2, 13GB)
    gpu-type: nvidia-tesla-p100           # default: nvidia-tesla-k80 (not offered in us-east1-b) -> nvidia-tesla-p100
…
$

    - Notes:

       To list the zones where each GPU type is available: $ gcloud compute accelerator-types list

       machine types: https://cloud.google.com/compute/docs/machine-types
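To check a single GPU type against a single zone, the listing can be filtered. A minimal sketch, assuming the listing prints NAME and ZONE as its first two columns; gpu_offered is a hypothetical helper that reads the listing on stdin.

```shell
# Read `gcloud compute accelerator-types list` output on stdin and
# succeed if the given type ($1) is listed for the given zone ($2).
gpu_offered() {
  awk -v t="$1" -v z="$2" '$1 == t && $2 == z { found = 1 } END { exit !found }'
}

# Usage (needs gcloud, so it is left commented out here):
# gcloud compute accelerator-types list | gpu_offered nvidia-tesla-p100 us-east1-b && echo "available"
```

Running this before editing gpu-type would flag a case like nvidia-tesla-k80 not being offered in us-east1-b.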

 

e. Deploying Kubeflow

- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#deploying-kubeflow

$ kfctl apply -V -f ${CONFIG_FILE}
…
INFO[0485] Applied the configuration Successfully!       filename="cmd/apply.go:72"
$

 

f. Check your deployment & Access the Kubeflow user interface (UI)

- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#check-your-deployment

$ gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}
 Fetching cluster endpoint and auth data.
 kubeconfig entry generated for my-kubeflow.
$
$ kubectl cluster-info
Kubernetes master is running at https://35.185.103.183
GLBCDefaultBackend is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
Heapster is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-dispatcher", GitCommit:"f5757a1dee5a89cc5e29cd7159076648bf21a02b", GitTreeState:"clean", BuildDate:"2020-02-06T03:29:33Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
$ kfctl version
kfctl v1.0.1-0-gf3edb9b
$

- To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

$ kubectl -n istio-system get ingress
NAME            HOSTS                                                 ADDRESS          PORTS   AGE
envoy-ingress   my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog   34.107.211.135   80      6m42s
$

- It can take 20 minutes for the URI to become available. Kubeflow needs to provision a signed SSL certificate and register a DNS name. (⌛ A good time for a leisurely cup of coffee.)

$ nslookup ${KF_NAME}.endpoints.${PROJECT}.cloud.goog
Server: 169.254.169.254
Address: 169.254.169.254#53
   
Non-authoritative answer:
Name: my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog
Address: 34.107.203.148
$
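Rather than re-running the lookup by hand during the wait, a small retry loop can poll until the endpoint responds. A minimal sketch; wait_until is a hypothetical helper, and the attempt count and interval are arbitrary choices that add up to roughly the 20 minutes mentioned above.

```shell
# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt remains.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: 40 attempts, 30 seconds apart (left commented out; needs network access).
# wait_until 40 30 curl -sS -o /dev/null "https://${KF_NAME}.endpoints.${PROJECT}.cloud.goog" && echo "endpoint is up"
```

curl exits non-zero both when the host does not resolve and when the TLS handshake fails, so the loop keeps waiting through both stages.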

 

g. Deleting your deployment

- If you want to delete all the resources, including storage:

$ ~/kfctl/kfctl delete -V -f ${CONFIG_FILE} --delete_storage

 

 

4. Troubleshooting

- Case 1

   ▷ Problem: Deploying Kubeflow fails when the asia-northeast3-a (Seoul) zone is selected

$ ~/kfctl/kfctl apply -V -f ${CONFIG_FILE}
…
 ERRO[0014] Creating my-kubeflow-storage error: &{Code:RESOURCE_ERROR Location:/deployments/my-kubeflow-storage/resources/my-kubeflow-storage-artifact-store Message:
{
   "ResourceType":"compute.v1.disk",
   "ResourceErrorCode":"400",
   "ResourceErrorMessage":{
      "code":400,
      "errors":[
         {
            "domain":"global",
            "message":"Invalid value for field 'zone': 'asia-northeast3-a'.  Unknown zone.",
            "reason":"invalid"
         }
      ],
      "message":"Invalid value for field 'zone': 'asia-northeast3-a'. Unknown zone.",
      "statusMessage":"Bad Request",
      "requestPath":"https://compute.googleapis.com/compute/v1/projects/notional-clover-268704/zones/asia-northeast3-a/disks",
      "httpMethod":"POST"
   }
}
ForceSendFields:[] NullFields:[]}  filename="gcp/gcp.go:386"
$

   ▷ Workaround: No fix was found; the deployment proceeded after switching to a different zone

$ export ZONE=us-east1-b

 

- Case 2

   ▷ Problem: Applying the following Compute Engine (worker node) configuration fails

$ cat gcp_config/cluster-kubeflow.yaml
…
    cpu-pool-initialNodeCount: 2
    cpu-pool-machine-type: n1-standard-4
…
    gpu-pool-initialNodeCount: 1
    gpu-pool-machine-type: n1-standard-4
…
$ ~/kfctl/kfctl apply -V -f ${CONFIG_FILE}
…
ERRO[0344] Updating my-kubeflow error: &{Code:RESOURCE_ERROR Location:/deployments/my-kubeflow/resources/my-kubeflow-gpu-pool-v1 Message:
{
   "ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters.nodePools",
   "ResourceErrorCode":"403",
   "ResourceErrorMessage":{
   "code":403,
   "message":"Insufficient regional quota to satisfy request: resource \"CPUS\": request requires '4.0' and is short '4.0'. project has a quota of '8.0' with '0.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=my-kubeflow-269301.",
   "status":"PERMISSION_DENIED",
   "statusMessage":"Forbidden",
   "requestPath":"https://container.googleapis.com/v1beta1/projects/my-kubeflow-269301/locations/us-east1-b/clusters/my-kubeflow/nodePools",
   "httpMethod":"POST"
}
ForceSendFields:[] NullFields:[]} filename="gcp/gcp.go:386"
Error: failed to apply: (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp: (kubeflow.error): Code 400 with message: gcp apply could not update deployment ma
$

   ▷ Workaround:

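        (Final conclusion) With the free trial ($300 credit), GPUs cannot be used and quota increases cannot be requested.

        https://cloud.google.com/free/docs/gcp-free-tier?_ga=2.118720457.-1501721812.1583963754

        https://cloud.google.com/compute/quotas?hl=ko

        The CPU shortage was resolved by readjusting the configuration, but a further GPU error followed, so a quota increase for GPUs (all regions) was attempted.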
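The CPU part of the error follows from simple arithmetic on the node pool settings in the Problem above. The quota of 8 comes from the error message, and n1-standard-4 has 4 vCPUs per node. A minimal sketch; pool_vcpus is a hypothetical helper.

```shell
# vCPUs requested by a node pool: node count x vCPUs per node.
pool_vcpus() { echo $(( $1 * $2 )); }

QUOTA=8                       # regional CPUS quota reported in the error message
cpu_pool=$(pool_vcpus 2 4)    # cpu-pool: 2 x n1-standard-4
gpu_pool=$(pool_vcpus 1 4)    # gpu-pool: 1 x n1-standard-4
total=$((cpu_pool + gpu_pool))
echo "requested=${total} quota=${QUOTA} short=$((total - QUOTA))"
# prints: requested=12 quota=8 short=4
```

The cpu-pool alone consumes the whole quota, which matches the message that the gpu-pool request requires 4.0 and is short 4.0.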

 

- Case 3

   ▷ Problem:

$ curl "https://my-kubeflow.endpoints.my-kubeflow-269206.cloud.goog"
curl: (6) Could not resolve host: my-kubeflow.endpoints.my-kubeflow-269206.cloud.goog
$ curl "https://my-kubeflow.endpoints.my-kubeflow-269206.cloud.goog"
curl: (35) error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure      
$

   ▷ Workaround:

       It can take 20 minutes for the URI to become available. Kubeflow needs to provision a signed SSL certificate and register a DNS name.

          or

       https://www.kubeflow.org/docs/gke/troubleshooting-gke/
