Kubeflow (22)

Setting up Kubeflow 1.4.1 on Minikube 1. Overview - Install Kubeflow 1.4.1 on minikube. - Up to version 1.2, Kubeflow was installed with kfctl; from 1.3 onward it is installed with kustomize. ✓ kfctl is a CLI for deploying and managing Kubeflow. Latest release: 1.2.0 (21 Nov 2020), https://github.com/kubeflow/kfctl ▷ Setting up Kubeflow 1.2 on-prem ▷ Setting up Kubeflow 1.2 on Minikube ✓ Kubeflow 1.4.1 includes Istio 1.9.6 and Knative 0.22.1. 2. Environment - kubeflow 1.4.1 - minikube 1.17.1 - kubernete.. 2021. 12. 30.
KFServing - Canary rollout test 1. Overview - This post looks at the canary rollout feature provided by Kubeflow KFServing. - A canary release is a deployment strategy that serves a new model version to a small group of users first, so that risks can be detected quickly. https://m.blog.naver.com/muchine98/220262491992 2. Environments - Kubernetes 1.16.15 - Kubeflow 1.2 - nfs-client-provisioner v3.1.0 (dynamic provisioning of Kubernetes Persistent Volumes) Reference: NFS-Client Provisioner 3. Canary rollout test - PVC .. 2021. 10. 17.
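The canary release idea in the excerpt above (send only a small share of requests to the new model version) can be illustrated with a minimal Python sketch. This is not KFServing's API or routing implementation; `route_request`, `simulate`, and the 10% split are hypothetical names and values chosen for illustration only.

```python
import random


def route_request(canary_percent, rng):
    """Pick which model version serves one request (illustrative only)."""
    # rng.random() is uniform in [0, 1); scale it to a percentage.
    return "canary" if rng.random() * 100 < canary_percent else "default"


def simulate(num_requests, canary_percent, seed=0):
    """Count how many simulated requests land on each version."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    counts = {"default": 0, "canary": 0}
    for _ in range(num_requests):
        counts[route_request(canary_percent, rng)] += 1
    return counts


# Assumed split: roughly 10% of traffic goes to the new (canary) version.
counts = simulate(10_000, canary_percent=10)
print(counts)
```

Setting `canary_percent` to 0 sends every request to the existing version, and raising it step by step toward 100 mirrors a gradual rollout once the new version looks healthy.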
KFServing - Deep dive 1. Kubeflow KFServing? - Kubeflow supports two model serving systems that allow multi-framework model serving: KFServing and Seldon Core. - KFServing enables serverless inferencing on Kubernetes ✓ Encapsulate the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your .. 2021. 10. 14.
Dex authentication / bypass 1. Overview - Dex? (https://github.com/dexidp/dex) Dex is an identity service that uses OpenID Connect to drive authentication for other apps. - Dex setup: when Kubeflow (v1.2) is deployed using 'kfctl_istio_dex.v1.2.0.yaml', dex (v2.22) is installed along with it. - Dex authentication scope: Dex authentication is required when calling the following types of resources. ✓ Kubeflow dashboard login ✓ KFServing (runs on Knative) ✓ Knative serving (runs on Istio) ✓ Istio Virtual service - Apply.. 2021. 9. 29.
Distributed training example #4 (From KF Jupyter, PyTorch) 2021.06.30 1. Distributed training (PyTorch) from a Kubeflow Jupyter environment - Environments ✓ Remote - development environment: Kubeflow Jupyter (CPU) ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Nexus (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0 - Flow (PyTorch) ✓ Remote (Kubeflow Jupyter) a. Docker Image build b. Docker Image Push c. Kube.. 2021. 9. 27.
Distributed training example #3 (In Jupyter) 2021.06.25 1. Distributed training (Tensorflow) in a Kubeflow Jupyter environment with a GPU allocated - Environments ✓ Remote - development environment: Kubeflow Jupyter (GPU) ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Harbor 2.2.1 (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0, CentOS 7.8 - Related technology - Tensorflow MirroredStrategy (Data parallelism) 2. Prerequisites a.. 2021. 9. 26.
Distributed training example #2 (From KF Jupyter, Tensorflow) 2021.06.24 1. Distributed training (Tensorflow) from a Kubeflow Jupyter environment - Environments ✓ Remote - development environment: Kubeflow Jupyter ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Nexus (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0, CentOS 7.8 - Flow (Tensorflow) ✓ Remote (Kubeflow Jupyter) a. Docker Image build b. Docker Image P.. 2021. 9. 26.
Distributed training example #1 (From MacOS) 2021.06.24 1. Distributed training (Tensorflow) from a local environment - Environments ✓ Local - development environment: Python 3.8.5, Jupyter / PyCharm (option) ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Master nodes: 3, Worker nodes: 4, Harbor 2.2.1 (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0, CentOS 7.8 - Flow (based on Tensorflow) ✓ Local (M.. 2021. 9. 26.
Distributed training overview 2021.6.28 1. Distributed training? a. Types of distributed training (https://ettrends.etri.re.kr/ettrends/172/0905172001/) - Data Parallelism ✓ Splits a large dataset across multiple computers and trains on the shards in parallel - Model Parallelism ✓ When a deep learning model grows too large to be handled by a single computer, the model itself is partitioned for training ▷ Layer partitioning ▷ Feature partitioning b. What distributed training requires - Tensorflow/PyTorch distributed training APIs - A Kubeflow/Kubernetes-based distributed training environment 2. Tensorflow distribut.. 2021. 9. 26.
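As a rough illustration of the data parallelism described in the excerpt above, the following Python sketch splits a toy dataset across simulated workers, has each compute a gradient on its own shard, and averages the results. It does not use the Tensorflow or PyTorch distributed APIs; the one-parameter model `y ≈ w * x`, the function names, the worker count, and the learning rate are all assumptions made for illustration.

```python
def gradient(w, shard):
    """Gradient of mean squared error for y ≈ w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split_into_shards(data, num_workers):
    """Round-robin split so each worker gets a near-equal share of the data."""
    return [data[i::num_workers] for i in range(num_workers)]


def data_parallel_step(w, data, num_workers, lr):
    """One training step: per-shard gradients, then a size-weighted average."""
    shards = split_into_shards(data, num_workers)
    # Each "worker" computes a gradient on its own shard (run sequentially here).
    grads = [gradient(w, shard) for shard in shards]
    # Weighting by shard size reproduces the full-batch gradient.
    total = sum(len(shard) for shard in shards)
    avg_grad = sum(g * len(s) for g, s in zip(grads, shards)) / total
    return w - lr * avg_grad


# Toy dataset generated from y = 3x; training should recover w close to 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)
print(w)
```

Because the averaged gradient matches the full-batch gradient, the result is the same as single-machine training; what changes in a real cluster is that the per-shard work runs concurrently on separate devices.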
Kubeflow 1.0 기능 #6 (Metadata) 2020.03.23 1. Kubeflow Metadata ? - Tracking and managing metadata of machine learning workflows in Kubeflow - metadata means information about executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow 2. Try the Metadata SDK in a sample Jupyter notebook - demo.ipynb download from Github Te.. 2021. 9. 25.
Kubeflow 1.0 기능 #5 (KFServing, TFServing) 2020.03.13 1. Model serving overview - https://v1-0-branch.kubeflow.org/docs/reference/pytorchjob/v1/pytorch/ - Kubeflow supports two model serving systems that allow multi-framework model serving: KFServing and Seldon Core. Alternatively, you can use a standalone model serving system. a. Multi-framework model serving - A check mark (✓) indicates that the system (KFServing or Seldon Core) suppor.. 2021. 9. 25.
Kubeflow 1.0 features #4 (PyTorch Training) 2020.03.09 1. PyTorchJob? - A Kubernetes custom resource used to run PyTorch training on Kubeflow - https://v1-0-branch.kubeflow.org/docs/reference/pytorchjob/v1/pytorch/ 2. Running PyTorch training - https://v1-0-branch.kubeflow.org/docs/components/training/pytorch/components/training/pytorch/ a. Start Cloud Shell b. Verify that PyTorch support is included in your Kubeflow deployment $ kubectl get crd | head.. 2021. 9. 25.
Kubeflow 1.0 features #3 (Katib) 2020.03.09 1. Kubeflow Katib? - Katib is used for automated tuning of an ML model’s hyperparameters. Hyperparameters are the variables that control the model training process. For example: ✓ Learning rate. ✓ Number of layers in a neural network. ✓ Number of nodes in each layer. Hyperparameter values are not learned. Hyperparameter tuning is the process of optimizing the hyperparameter values to maxim.. 2021. 9. 25.
Kubeflow 1.0 features #2 (TF-Job, TF-Serving, Kubeflow pipeline) 2020.02.26 1. Overview - To understand the Pipeline feature of Kubeflow installed on GKE, I tried it out by following the site below - Using Kubeflow for Financial Time Series (https://github.com/kubeflow/examples/tree/master/financial_time_series) - This example covers the following concepts: a. Deploying Kubeflow to a GKE cluster b. Exploration via JupyterHub (prospect data, preprocess data, develop ML model) c. Training several tensorflow models.. 2021. 9. 24.
Kubeflow 1.0 features #1 (Jupyter notebook) 2020.03.12 1. Reference - https://v1-0-branch.kubeflow.org/docs/notebooks/setup/ 2. Create a notebook server a. Start Cloud Shell b. Check the URL $ kubectl -n istio-system get ingress NAME HOSTS ADDRESS PORTS AGE envoy-ingress my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog 34.107.211.135 80 6m42s c. Access Kubeflow (URL: my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog) d. Create a Jupyter notebook server a.. 2021. 9. 24.