
Kubeflow/Distributed training (6)

Distributed training case #4 (From KF Jupyter, PyTorch) 2021.06.30 1. Distributed training (PyTorch) in a Kubeflow Jupyter environment - Environments ✓ Remote - development environment: Kubeflow Jupyter (CPU) ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Nexus (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0 - Flow (PyTorch) ✓ Remote (Kubeflow Jupyter) a. Docker Image build b. Docker Image Push c. Kube.. 2021. 9. 27.
Distributed training case #3 (In Jupyter) 2021.06.25 1. Distributed training (TensorFlow) in a Kubeflow Jupyter environment (GPU allocated) - Environments ✓ Remote - development environment: Kubeflow Jupyter (GPU) ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Harbor 2.2.1 (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0, CentOS 7.8 - Related technology - TensorFlow MirroredStrategy (Data parallelism) 2. Prerequisites a.. 2021. 9. 26.
Distributed training case #2 (From KF Jupyter, TensorFlow) 2021.06.24 1. Distributed training (TensorFlow) in a Kubeflow Jupyter environment - Environments ✓ Remote - development environment: Kubeflow Jupyter ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Nexus (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0, CentOS 7.8 - Flow (TensorFlow) ✓ Remote (Kubeflow Jupyter) a. Docker Image build b. Docker Image P.. 2021. 9. 26.
Distributed training case #1 (From MacOS) 2021.06.24 1. Distributed training (TensorFlow) from a local environment - Environments ✓ Local - development environment: Python 3.8.5, Jupyter / PyCharm (option) ✓ Remote - training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Master nodes: 3, Worker nodes: 4, Harbor 2.2.1 (Private docker registry), Nvidia V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0, CentOS 7.8 - Flow (based on TensorFlow) ✓ Local (M.. 2021. 9. 26.
Distributed training overview 2021.6.28 1. Distributed training? a. Types of distributed training (https://ettrends.etri.re.kr/ettrends/172/0905172001/) - Data Parallelism ✓ Training on a large dataset by distributing the data across multiple computers - Model Parallelism ✓ Splitting the model for training when a deep learning model has grown too large for a single computer to handle ▷ Layer partitioning ▷ Training-feature partitioning b. What distributed training requires - TensorFlow/PyTorch distributed training APIs - A Kubeflow/Kubernetes-based distributed training environment 2. Tensorflow distribut.. 2021. 9. 26.
Running the MNIST using distributed training 2021.5.28 1. Running the MNIST on-prem Jupyter notebook - The MNIST on-prem notebook builds a Docker image, launches a TFJob to train a model, and creates an InferenceService (KFServing) to deploy the trained model. - https://v1-2-branch.kubeflow.org/docs/started/workstation/minikube-linux/#running-the-mnist-on-prem-jupyter-notebook a. Prerequisites - Step 1: Set up Python environment in MacOS y.. 2021. 9. 24.