
Data Parallelism (4)

Distributed training case #4 (From KF Jupyter, PyTorch) 2021.06.30 1. Distributed training in a Kubeflow Jupyter environment (PyTorch) - Environments ✓ Remote - Development environment: Kubeflow Jupyter (CPU) ✓ Remote - Training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Nexus (Private docker registry), Nvidia V100 / Driver 450.80, cuda 11.2, cuDNN 8.1.0 - Flow (PyTorch) ✓ Remote (Kubeflow Jupyter) a. Docker Image build b. Docker Image Push c. Kube.. 2021. 9. 27.
Distributed training case #3 (In Jupyter) 2021.06.25 1. Distributed training in a Kubeflow Jupyter environment with a GPU allocated (Tensorflow) - Environments ✓ Remote - Development environment: Kubeflow Jupyter (GPU) ✓ Remote - Training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Harbor 2.2.1 (Private docker registry), Nvidia V100 / Driver 450.80, cuda 11.2, cuDNN 8.1.0, CentOS 7.8 - Related technology - Tensorflow MirroredStrategy (Data parallelism) 2. Preliminary work a.. 2021. 9. 26.
Distributed training case #2 (From KF Jupyter, Tensorflow) 2021.06.24 1. Distributed training in a Kubeflow Jupyter environment (Tensorflow) - Environments ✓ Remote - Development environment: Kubeflow Jupyter ✓ Remote - Training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Nexus (Private docker registry), Nvidia V100 / Driver 450.80, cuda 11.2, cuDNN 8.1.0, CentOS 7.8 - Flow (Tensorflow) ✓ Remote (Kubeflow Jupyter) a. Docker Image build b. Docker Image P.. 2021. 9. 26.
Distributed training case #1 (From MacOS) 2021.06.24 1. Distributed training from a local environment (Tensorflow) - Environments ✓ Local - Development environment: Python 3.8.5, Jupyter / PyCharm (option) ✓ Remote - Training environment: Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15, Master nodes: 3, Worker nodes: 4, Harbor 2.2.1 (Private docker registry), Nvidia V100 / Driver 450.80, cuda 11.2, cuDNN 8.1.0, CentOS 7.8 - Flow (based on Tensorflow) ✓ Local (M.. 2021. 9. 26.