2021.06.30
1. Distributed training (PyTorch) in a Kubeflow Jupyter environment
- Environments
✓ Remote - Development environment
Kubeflow Jupyter (CPU)
✓ Remote - Training environment
Kubeflow 1.2 (The machine learning toolkit for Kubernetes) / Kubernetes 1.16.15
Nexus (Private docker registry)
NVIDIA V100 / Driver 450.80, CUDA 11.2, cuDNN 8.1.0
- Flow (PyTorch)
✓ Remote (Kubeflow Jupyter)
a. Docker image build
b. Docker image push
c. Kubeflow PyTorchJob deployment
✓ Remote (Kubeflow / Kubernetes)
d. Kubernetes pod creation (Master pod, Worker #n pods)
Docker image pull & container creation; GPU / CPU / memory resource allocation
e. Kubernetes pod execution
PyTorch distributed training
- Related technologies
✓ Kubeflow fairing
✓ PyTorch distributed (data parallelism; a minimal sketch follows below)
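For orientation, here is a minimal runnable sketch of the data-parallel pattern used later in section 3. The toy model and random dataset are placeholders (not the MNIST code): DistributedSampler gives each rank a disjoint shard of the data, and DistributedDataParallel all-reduces gradients during backward().

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train():
    # PyTorchJob injects MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE into
    # every pod; the defaults below only make the sketch runnable standalone.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "23456")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU nodes
    model = nn.parallel.DistributedDataParallel(nn.Linear(10, 2))
    data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    sampler = DistributedSampler(data)        # disjoint shard per rank
    loader = DataLoader(data, batch_size=8, sampler=sampler)
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(2):
        sampler.set_epoch(epoch)              # reshuffle shards every epoch
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()   # gradients all-reduced here
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    train()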
2. Prerequisites
a. Remote (Kubernetes cluster)
- Create a Kubeflow account & set resource quotas (GPU, CPU, memory, disk)
- Create a Docker registry credential
The private Docker registry must serve a TLS certificate (as of Kubeflow 1.2, self-signed certificates are not supported). The commands below use k as an alias for kubectl.
$ k create secret docker-registry chelsea-agp-reg-cred -n yoosung-jeon \
--docker-server=repo.chelsea.kt.co.kr --docker-username=agp --docker-password=*****
$ k patch serviceaccount default -n yoosung-jeon -p "{\"imagePullSecrets\": [{\"name\": \"chelsea-agp-reg-cred\"}]}"
- Create a Kubernetes PV (persistent volume)
Disk to be shared by the distributed training applications; a Python-client equivalent is sketched after this step.
$ vi mnist-pytorch-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-pytorch-pvc
  namespace: yoosung-jeon
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-sc-iap
$ k apply -f mnist-pytorch-pvc.yaml
…
$ cp -r data `k get pvc -n yoosung-jeon | grep -w mnist-pytorch-pvc | awk '{printf("/nfs_01/yoosung-jeon-%s-%s", $1, $3)}'`
$
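For reference, the same PVC can also be created from Python with the official kubernetes client (pip install kubernetes). This is a sketch assuming kubeconfig (or in-cluster) credentials, mirroring the YAML above; it is not part of the original workflow.

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="mnist-pytorch-pvc",
                                 namespace="yoosung-jeon"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="nfs-sc-iap",
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="yoosung-jeon", body=pvc
)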
b. Remote (Kubeflow Jupyter)
- Create a Kubeflow Jupyter notebook
i. Log in to the Kubeflow dashboard (http://kf.acp.kt.co.kr)
ii. Create a notebook server
Menu (top left) > Notebook Servers > 'New Server'
Name: Study-pytorch
Image: repo.acp.kt.co.kr/kubeflow/kubeflow-images-private/pytorch-1.9.0-cuda11.1-notebook:1.0.0
iii. Connect to Jupyter
iv. Open a terminal
v. Create a Docker registry credential (see the note on the auth field below)
$ mkdir .docker && vi .docker/config.json
{
  "auths": {
    "repo.chelsea.kt.co.kr": {"auth": "YWdwOm5ldzEy****"}
  }
}
$
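The auth value in .docker/config.json is simply base64("username:password"). A one-liner to generate it (the password here is a placeholder; substitute the real registry account):

import base64

# base64("username:password"); replace <password> with the real one
print(base64.b64encode(b"agp:<password>").decode())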
3. Coding the distributed training
a. Create the base image
- Add the Python packages used by the distributed training application to the base image; skip this step if there are no packages to add. A docker-py equivalent of the commands below is sketched after this step.
yoosungjeon@ysjeon-Dev create-base-docker-image % vi Dockerfile
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
COPY requirements.txt .
RUN pip install -r requirements.txt
yoosungjeon@ysjeon-Dev create-base-docker-image % vi requirements.txt
...
yoosungjeon@ysjeon-Dev create-base-docker-image % docker build -t repo.chelsea.kt.co.kr/agp/pytorch-custom:1.9.0-cuda11.1-cudnn8-runtime -f Dockerfile .
…
yoosungjeon@ysjeon-Dev create-base-docker-image % docker push repo.chelsea.kt.co.kr/agp/pytorch-custom:1.9.0-cuda11.1-cudnn8-runtime
yoosungjeon@ysjeon-Dev create-base-docker-image %
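The same build-and-push step can also be scripted with the Docker SDK for Python (pip install docker). This sketch assumes the local Docker daemon is already logged in to repo.chelsea.kt.co.kr; it is an alternative to the two CLI commands above, not part of the original workflow.

import docker

TAG = "repo.chelsea.kt.co.kr/agp/pytorch-custom:1.9.0-cuda11.1-cudnn8-runtime"

client = docker.from_env()
# Build from the Dockerfile in the current directory, then push to the registry
image, build_logs = client.images.build(path=".", dockerfile="Dockerfile", tag=TAG)
for line in client.images.push(TAG, stream=True, decode=True):
    print(line)  # push progress, one JSON dict per chunk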
b. Fairing & distributed training code
- Source
4.jupyter-cpu-fairing-pytorch/pytorch-mnist.ipynb
- Code snippet
def train_and_test(...):
    # create the default process group
    WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))
    backend = dist.Backend.NCCL
    if dist.is_available() and WORLD_SIZE > 1:
        logging.debug("Using distributed PyTorch with {} backend".format(backend))
        dist.init_process_group(backend=backend)
    ...
    # construct the DDP (Distributed Data Parallel) model
    if dist.is_available() and dist.is_initialized():
        if use_cuda:
            torch.cuda.set_device(torch.cuda.current_device())
        model = nn.parallel.DistributedDataParallel(model)
    ...

def fairing_run():
    ...
    fairing.config.set_builder(name='append',
                               registry=DOCKER_REGISTRY,
                               base_image='pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime',
                               image_name=project_name)
    fairing.config.set_deployer(name="pytorchjob",
                                job_name=pytorchjob_name,
                                master_count=1,
                                worker_count=worker_count,
                                pod_spec_mutators=[
                                    volume_mounts(volume_type='pvc', volume_name=k8s_pvc_name, mount_path=mount_name),
                                    get_resource_mutator(gpu=gpus_per_worker, gpu_vendor='nvidia')],
                                stream_log=False)
    fairing.config.run()

if __name__ == '__main__':
    ...
    if os.getenv('FAIRING_RUNTIME', None) is None:
        fairing_run()
    else:
        main()
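For context on why init_process_group() above needs no explicit addresses: the PyTorchJob operator injects the rendezvous environment into every pod, and PyTorch's default "env://" method reads it. A quick way to inspect it from inside a pod (the example values in the comments are illustrative):

import os

# Injected by the PyTorchJob operator; consumed by dist.init_process_group()
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
    print(var, "=", os.environ.get(var))
# e.g. on worker 0 of a 1-master / 2-worker job (illustrative):
#   MASTER_ADDR = <job-name>-master-0
#   MASTER_PORT = 23456
#   RANK = 1
#   WORLD_SIZE = 3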
4. Running the distributed training
a. Remote (Kubeflow Jupyter)
5. Source