2020.02.20
1. Overview
- This document describes the procedure for setting up Kubeflow, a Kubernetes-based end-to-end ML platform, on GCP (Google Cloud Platform).
- Ref. Page: https://www.kubeflow.org/docs/gke/deploy/deploy-cli/
2. What is Kubeflow?
- Kubeflow is the ML toolkit for Kubernetes. The following diagram shows Kubeflow as a platform for arranging the components of your ML system on top of Kubernetes.
- https://www.kubeflow.org/docs/about/kubeflow/
- https://bcho.tistory.com/1301
3. Setting up Kubeflow
a. Create a GCP Project
a-1. Create a Google account
- Go to https://accounts.google.com/ > create an account (ysjeon71.kubeflow2@gmail.com)
A $300 free credit is provided (valid for 12 months); entering billing information is required
a-2. Create GCP Project
- Go to https://console.cloud.google.com/ > select 'My First Project' > select 'New Project' > Create
- Project Name: My Kubeflow
Project ID: my-kubeflow-269301 -> my-kubeflow-271310 (newly created)
b. Setup a GCP Project
b-1. Enable the required APIs (configured per project)
- https://www.kubeflow.org/docs/gke/deploy/project-setup
Select Compute Engine API > Enable
Select Kubernetes Engine API > Enable
…
Select AI Platform Training & Prediction API > Enable
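The same APIs can probably be enabled from Cloud Shell instead of the console. The sketch below is a dry run: it only prints the `gcloud services enable` commands so they can be reviewed before being run by hand. The service names (`compute.googleapis.com`, `container.googleapis.com`, `ml.googleapis.com`) are an assumed mapping of the three console entries named above, the function name is hypothetical, and the APIs elided by "…" are deliberately not filled in.

```shell
#!/bin/sh
# Dry run: print (do not execute) the commands that enable the APIs listed above.
# Assumption: these service names correspond to the console entries
# Compute Engine API, Kubernetes Engine API, and
# AI Platform Training & Prediction API.
print_enable_commands() {
  project="$1"
  for service in compute.googleapis.com container.googleapis.com ml.googleapis.com; do
    printf 'gcloud services enable %s --project %s\n' "$service" "$project"
  done
}

# Falls back to the project ID used in this post when PROJECT is not set.
print_enable_commands "${PROJECT:-my-kubeflow-271310}"
```

Printing first and running afterwards keeps the step reviewable, which helps because enabling an API is a per-project change.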
b-2. Set up an OAuth credential
- https://www.kubeflow.org/docs/gke/deploy/oauth-setup/
Set up your OAuth consent screen
User type: External
Application Name: Kubeflow
Authorized domains: my-kubeflow-271310.cloud.goog
On the Credentials screen:
"+ Create Credentials" > select "OAuth client ID"
Application type: Web application
Result:
Client ID: 760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com
Client Secret: zoK2h8S9WXwqkhak9ROFOfZW
On the Create credentials screen > edit
Authorized redirect URIs: https://iap.googleapis.com/v1/oauth/clientIds/760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com:handleRedirect
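The redirect URI follows a fixed pattern around the client ID, so it can be generated rather than typed by hand. A minimal sketch, assuming the pattern shown in the line above; the function name `iap_redirect_uri` is hypothetical.

```shell
#!/bin/sh
# Build the IAP redirect URI for a given OAuth client ID,
# following the pattern used in the "Authorized redirect URIs" entry above.
iap_redirect_uri() {
  printf 'https://iap.googleapis.com/v1/oauth/clientIds/%s:handleRedirect\n' "$1"
}

CLIENT_ID="760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com"
iap_redirect_uri "$CLIENT_ID"
```

Generating the value avoids the easiest mistake in this step, which is a client ID that is mistyped or truncated inside the URI.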
c. Prepare your environment
- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#prepare-your-environment
c-1. Cloud Shell
Click the "Activate Cloud Shell" icon at the top right to open the Cloud Shell pane
c-2. prepare your environment
$ wget https://github.com/kubeflow/kfctl/releases/download/v1.0.1/kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz ## latest release
$ tar xzf kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz
## Add the following to .bash_profile so the settings persist when reconnecting to Cloud Shell
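Later steps in this post put `~/kfctl` on `PATH` and call `~/kfctl/kfctl`, so the binary is expected to live in that directory. The sketch below extracts the tarball into a dedicated directory and checks that a `kfctl` file actually appears. The helper name `install_kfctl` is hypothetical, and it assumes the release archive contains `kfctl` at its top level.

```shell
#!/bin/sh
# Extract a kfctl release tarball into a dedicated directory (hypothetical helper).
# Usage: install_kfctl <tarball> <destination-dir>
# Assumption: the archive contains a file named "kfctl" at its top level.
install_kfctl() {
  tarball="$1"
  dest="$2"
  [ -f "$tarball" ] || { echo "tarball not found: $tarball" >&2; return 1; }
  mkdir -p "$dest" || return 1
  tar -xzf "$tarball" -C "$dest" || return 1
  [ -f "$dest/kfctl" ] || { echo "kfctl not found in $dest after extraction" >&2; return 1; }
  chmod +x "$dest/kfctl"
}
```

A possible call would be `install_kfctl kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz ~/kfctl`.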
$ vi ~/.bash_profile
#!/bin/bash
export PROJECT=my-kubeflow-271310
export ZONE=us-east1-b
export CLIENT_ID=760629890301-8fmr6nlu2g6f04gpqaq3ljc133pqel0i.apps.googleusercontent.com
export CLIENT_SECRET=zoK2h8S9WXwqkhak9ROFOfZW
export CONFIG_FILE=kfdef.yaml
export KF_NAME=my-kubeflow
export KF_DIR=~/kf_deployments/${KF_NAME}
gcloud config set project ${PROJECT}
gcloud config set compute/zone ${ZONE}
gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}
export PATH=~/kfctl:${PATH}
alias k=kubectl
$ source ~/.bash_profile
$ export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.1.yaml" ## latest release
$ export CONFIG_FILE=kfdef.yaml
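Every later command depends on these variables, and an unset one tends to fail in confusing ways, for example as an empty project or zone. The following is a minimal sketch of a pre-flight check. The function name `check_env` is hypothetical; the variable list is taken from the exports above.

```shell
#!/bin/sh
# Report which of the variables used in this guide are unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_env() {
  missing=0
  for name in PROJECT ZONE CLIENT_ID CLIENT_SECRET KF_NAME KF_DIR CONFIG_URI CONFIG_FILE; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_env && echo "environment looks complete"
```

Running it right after `source ~/.bash_profile` catches a missing export before `kfctl` is invoked.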
d. Customizing your Kubeflow deployment
- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#customizing-your-kubeflow-deployment
d-1. Customizing Kubeflow config file
$ curl -L -o ${CONFIG_FILE} ${CONFIG_URI}
$ wget -O yq https://github.com/mikefarah/yq/releases/download/3.1.1/yq_linux_386 && chmod 744 yq
$ ./yq w -i ${CONFIG_FILE} 'spec.plugins[0].spec.project' ${PROJECT}
$ ./yq w -i ${CONFIG_FILE} 'spec.plugins[0].spec.zone' ${ZONE}
$ ./yq w -i ${CONFIG_FILE} 'metadata.name' ${KF_NAME}
$ kfctl build -V -f ${CONFIG_FILE}
d-2. Customizing GKE(Google Kubernetes Engine) config file
- https://www.kubeflow.org/docs/gke/customizing-gke/
To customize your GKE cluster modify the deployment manager configuration files in the directory ${KF_DIR}/gcp_config
To customize individual Kubeflow applications modify the Kustomize manifests in the directory ${KF_DIR}/kustomize
The node autoprovisioning can be useful to autoscale the cluster with non-user defined node pools.
Make sure you set enableNodeAutoprovisioning to false in ${KF_DIR}/gcp_config/cluster-kubeflow.yaml, as we will work with the dedicated gpu-pool that the Kubeflow deployment provides.
$ vi ./gcp_config/cluster-kubeflow.yaml
…
autoprovisioning-config:
enabled: false # default: true
…
cpu-pool-enable-autoscaling: true # default: true
cpu-pool-machine-type: n1-highmem-2 # default: n1-standard-8 (vCPU 8, 30GB) -> n1-highmem-2 (vCPU 2, 13GB)
…
gpu-pool-enable-autoscaling: true # default: true
gpu-pool-initialNodeCount: 1 # default: 0
gpu-pool-machine-type: n1-highmem-2 # default: n1-standard-8 (vCPU 8, 30GB) -> n1-highmem-2 (vCPU 2, 13GB)
gpu-type: nvidia-tesla-p100 # default: nvidia-tesla-k80 (not available in us-east1-b) -> nvidia-tesla-p100
…
$
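After editing the file by hand it can be worth confirming that the autoprovisioning switch really is off. The sketch below reads the value back with `awk`. It assumes the `enabled:` line sits below the `autoprovisioning-config:` key, as in the excerpt above; the function name is hypothetical and this is not a general YAML parser.

```shell
#!/bin/sh
# Print the value of "enabled" found under the first "autoprovisioning-config:" key.
# Assumption: "enabled:" appears on a line below that key, as in the excerpt above.
autoprovisioning_enabled() {
  awk '
    /autoprovisioning-config:/ { in_block = 1; next }
    in_block && /enabled:/ {
      sub(/#.*/, "")   # drop a trailing comment such as "# default: true"
      print $2         # second field is the value after "enabled:"
      exit
    }
  ' "$1"
}
```

A possible call would be `autoprovisioning_enabled "${KF_DIR}/gcp_config/cluster-kubeflow.yaml"`, which should print `false` for the configuration described here.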
- Notes:
How to list the GPU types available in each zone: $ gcloud compute accelerator-types list
machine types: https://cloud.google.com/compute/docs/machine-types
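The accelerator list is long, so a small filter helps when checking whether a given GPU type is offered in the chosen zone. This sketch reads the command's table output on stdin. It assumes the default column order NAME, ZONE, DESCRIPTION; the function name `zones_for_accelerator` is hypothetical.

```shell
#!/bin/sh
# Read the tabular output of `gcloud compute accelerator-types list` on stdin
# and print the zones that offer the given accelerator type.
# Assumption: columns are NAME, ZONE, DESCRIPTION (header on the first line).
zones_for_accelerator() {
  awk -v accel="$1" 'NR > 1 && $1 == accel { print $2 }'
}
```

A possible call would be `gcloud compute accelerator-types list | zones_for_accelerator nvidia-tesla-p100`; if ${ZONE} is missing from the result, the gpu-type setting needs a different value or a different zone.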
e. Deploying Kubeflow
- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#deploying-kubeflow
$ kfctl apply -V -f ${CONFIG_FILE}
…
INFO[0485] Applied the configuration Successfully! filename="cmd/apply.go:72"
$
f. Check your deployment & Access the Kubeflow user interface (UI)
- https://www.kubeflow.org/docs/gke/deploy/deploy-cli/#check-your-deployment
$ gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}
Fetching cluster endpoint and auth data.
kubeconfig entry generated for my-kubeflow.
$
$ kubectl cluster-info
Kubernetes master is running at https://35.185.103.183
GLBCDefaultBackend is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
Heapster is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://35.185.103.183/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-dispatcher", GitCommit:"f5757a1dee5a89cc5e29cd7159076648bf21a02b", GitTreeState:"clean", BuildDate:"2020-02-06T03:29:33Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
$ kfctl version
kfctl v1.0.1-0-gf3edb9b
$
- To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ kubectl -n istio-system get ingress
NAME HOSTS ADDRESS PORTS AGE
envoy-ingress my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog 34.107.211.135 80 6m42s
$
- It can take 20 minutes for the URI to become available. Kubeflow needs to provision a signed SSL certificate and register a DNS name. (⌛ a good moment for a leisurely cup of coffee)
$ nslookup ${KF_NAME}.endpoints.${PROJECT}.cloud.goog
Server: 169.254.169.254
Address: 169.254.169.254#53
Non-authoritative answer:
Name: my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog
Address: 34.107.203.148
$
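Rather than re-running `nslookup` by hand during the wait, the check can be wrapped in a retry loop. The sketch below is a generic retry helper; the name `wait_for` and the attempt and delay values are hypothetical choices, not part of kfctl or gcloud.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: wait_for <max-attempts> <sleep-seconds> <command> [args...]
wait_for() {
  max="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    # Stop without sleeping once the last allowed attempt has failed.
    [ "$attempt" -ge "$max" ] && break
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "gave up after $max attempt(s)" >&2
  return 1
}
```

For example, `wait_for 40 30 nslookup "${KF_NAME}.endpoints.${PROJECT}.cloud.goog"` would poll every 30 seconds for roughly 20 minutes. This assumes `nslookup` exits non-zero while the name does not resolve, which can vary by implementation.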
g. Deleting your deployment
- If you want to delete all the resources, including storage:
$ ~/kfctl/kfctl delete -V -f ${CONFIG_FILE} --delete_storage
4. Troubleshooting
- Case 1
▷ Problem: An error occurs when deploying Kubeflow with the asia-northeast3-a (Seoul) zone selected
$ ~/kfctl/kfctl apply -V -f ${CONFIG_FILE}
…
ERRO[0014] Creating my-kubeflow-storage error: &{Code:RESOURCE_ERROR Location:/deployments/my-kubeflow-storage/resources/my-kubeflow-storage-artifact-store Message:
{
"ResourceType":"compute.v1.disk",
"ResourceErrorCode":"400",
"ResourceErrorMessage":{
"code":400,
"errors":[
{
"domain":"global",
"message":"Invalid value for field 'zone': 'asia-northeast3-a'. Unknown zone.",
"reason":"invalid"
}
],
"message":"Invalid value for field 'zone': 'asia-northeast3-a'. Unknown zone.",
"statusMessage":"Bad Request",
"requestPath":"https://compute.googleapis.com/compute/v1/projects/notional-clover-268704/zones/asia-northeast3-a/disks",
"httpMethod":"POST"
}
}
ForceSendFields:[] NullFields:[]} filename="gcp/gcp.go:386"
$
▷ Workaround: No fix was found; the deployment proceeded after switching to a different zone
$ export ZONE=us-east1-b
- Case 2
▷ Problem: An error occurs when applying the deployment with the Compute Engine worker nodes configured as follows
$ cat gcp_config/cluster-kubeflow.yaml
…
cpu-pool-initialNodeCount: 2
cpu-pool-machine-type: n1-standard-4
…
gpu-pool-initialNodeCount: 1
gpu-pool-machine-type: n1-standard-4
…
$ ~/kfctl/kfctl apply -V -f ${CONFIG_FILE}
…
ERRO[0344] Updating my-kubeflow error: &{Code:RESOURCE_ERROR Location:/deployments/my-kubeflow/resources/my-kubeflow-gpu-pool-v1 Message:
{
"ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters.nodePools",
"ResourceErrorCode":"403",
"ResourceErrorMessage":{
"code":403,
"message":"Insufficient regional quota to satisfy request: resource \"CPUS\": request requires '4.0' and is short '4.0'. project has a quota of '8.0' with '0.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=my-kubeflow-269301.",
"status":"PERMISSION_DENIED",
"statusMessage":"Forbidden",
"requestPath":"https://container.googleapis.com/v1beta1/projects/my-kubeflow-269301/locations/us-east1-b/clusters/my-kubeflow/nodePools",
"httpMethod":"POST"
}
ForceSendFields:[] NullFields:[]} filename="gcp/gcp.go:386"
Error: failed to apply: (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp: (kubeflow.error): Code 400 with message: gcp apply could not update deployment ma
$
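▷ Workaround:
(Final) With the free trial ($300 credit), GPUs cannot be used and quota increases cannot be requested
https://cloud.google.com/free/docs/gcp-free-tier?_ga=2.118720457.-1501721812.1583963754
https://cloud.google.com/compute/quotas?hl=ko
The CPU quota error was resolved by readjusting the node pool settings, but a further quota error then occurred for GPUs, so an increase of the GPUs (all regions) quota was attempted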
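The CPU part of this error is consistent with simple arithmetic on the node pool settings. The sketch below adds up the vCPUs requested by the Case 2 configuration and compares the total with the regional quota of 8 reported in the error message. The helper name `required_vcpus` and its `<count>x<vcpus>` argument format are hypothetical; n1-standard-4 provides 4 vCPUs per node.

```shell
#!/bin/sh
# Sum the vCPUs requested by a set of node pools.
# Each argument is "<node-count>x<vcpus-per-node>".
required_vcpus() {
  total=0
  for pool in "$@"; do
    count="${pool%x*}"   # part before the "x"
    vcpus="${pool#*x}"   # part after the "x"
    total=$((total + count * vcpus))
  done
  echo "$total"
}

# Case 2 configuration: cpu-pool 2 x n1-standard-4, gpu-pool 1 x n1-standard-4.
quota=8
needed="$(required_vcpus 2x4 1x4)"
echo "needed=$needed quota=$quota short=$((needed - quota))"
# prints: needed=12 quota=8 short=4
```

The two cpu-pool nodes already use all 8 vCPUs, so the gpu-pool request for 4 more falls short by exactly 4, matching "requires '4.0' and is short '4.0'". Choosing smaller machine types, as in the n1-highmem-2 settings used in section d-2, keeps the total within the quota.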
- Case 3
▷ Problem:
$ curl "https://my-kubeflow.endpoints.my-kubeflow-269206.cloud.goog"
curl: (6) Could not resolve host: my-kubeflow.endpoints.my-kubeflow-269206.cloud.goog
$ curl "https://my-kubeflow.endpoints.my-kubeflow-269206.cloud.goog"
curl: (35) error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
$
▷ Workaround:
It can take 20 minutes for the URI to become available. Kubeflow needs to provision a signed SSL certificate and register a DNS name.
or