Kubeflow 1.0 Features #3 (Katib)

by 여행을 떠나자! 2021. 9. 25.

2020.03.09

 

1. What is Kubeflow Katib?

- Katib is used for automated tuning of an ML model’s hyperparameters.

   Hyperparameters are the variables that control the model training process. For example:

      ✓ Learning rate.

      ✓ Number of layers in a neural network.

      ✓ Number of nodes in each layer.

   Hyperparameter values are not learned.

   Hyperparameter tuning is the process of optimizing the hyperparameter values to maximize the predictive accuracy of the model.

   If you don’t use Katib or a similar system for hyperparameter tuning, you need to run many training jobs yourself, manually adjusting the hyperparameters to find the optimal values.

- Katib offers a neural architecture search (NAS) feature. You can use NAS to design your artificial neural network.

   NAS in general uses various techniques to find the optimal neural network design.

   The NAS implementation in Katib uses reinforcement learning.

- Both hyperparameter tuning and NAS are subsets of automated machine learning (AutoML).

   An AutoML system takes labeled training data as input and produces an optimized model.

- Katib supports a number of ML frameworks, including TensorFlow, MXNet, PyTorch, XGBoost, and others.
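
To make the manual alternative concrete, the sketch below lists the training runs you would have to launch by hand for a small grid of learning rates and layer counts. The train.py script and its flags are hypothetical placeholders, not part of Katib, so the function only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch of a manual hyperparameter sweep (the work Katib automates).
# train.py and its flags are hypothetical placeholders.
manual_sweep() {
  local lr layers
  for lr in 0.01 0.02 0.03; do
    for layers in 2 3; do
      # Print the command instead of running it, since train.py is illustrative.
      echo "python3 train.py --lr=${lr} --num-layers=${layers}"
    done
  done
}

manual_sweep
```

Even this tiny grid produces six runs, and each extra hyperparameter multiplies the count, which is the bookkeeping an Experiment takes over.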

 

 

2. References

- https://v1-0-branch.kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter/

 

 

3. Katib setup

a. Installing Katib

- You can skip this step if you have already installed Kubeflow. Your Kubeflow deployment includes Katib.

 

b. Setting up persistent volumes 

- You can skip this step if you’re using Kubeflow on Google Kubernetes Engine (GKE) or if your Kubernetes cluster includes a StorageClass for dynamic volume provisioning. 
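
One way to check the second condition is to look for the (default) marker that kubectl get storageclass prints next to a class name; Kubernetes uses that class for volume claims that do not name one. The helper name below is an assumption for illustration, and the check needs access to your cluster.

```shell
# Returns success if `kubectl get storageclass` lists a class marked (default).
# has_default_storageclass is a hypothetical helper name.
has_default_storageclass() {
  kubectl get storageclass 2>/dev/null | grep -q '(default)'
}

# Usage (requires access to a cluster):
# if has_default_storageclass; then
#   echo "default StorageClass found"
# else
#   echo "no default StorageClass; set up persistent volumes manually"
# fi
```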

 

 

4. Accessing the Katib UI

- Open Kubeflow: https://my-kubeflow.endpoints.my-kubeflow-269301.cloud.goog

- Select the Katib menu

 

 

5. Examples

- Start Cloud Shell

- random-example Hyperparameter tuning

   The random algorithm example uses an MXNet neural network to train an image classification model using the MNIST dataset. 

   The experiment runs up to 12 trials (maxTrialCount), three at a time (parallelTrialCount), each with different hyperparameter values, and saves the results.

$ curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml --output random-example.yaml
$ cat random-example.yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
…
spec:
…
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
   - name: --lr
     parameterType: double
     feasibleSpace:
       min: "0.01"
       max: "0.03"
   - name: --num-layers
     parameterType: int
     feasibleSpace:
       min: "2"
       max: "5"
   - name: --optimizer
     parameterType: categorical
     feasibleSpace:
       list:
       - sgd
       - adam
       - ftrl
…
           spec:
             containers:
             - name: {{.Trial}}
               image: docker.io/kubeflowkatib/mxnet-mnist
               command:
               - "python3"
               - "/opt/mxnet-mnist/mnist.py"
               - "--batch-size=64"
               {{- with .HyperParameters}}
               {{- range .}}
               - "{{.Name}}={{.Value}}"
               {{- end}}
               {{- end}}
$ kubectl apply -f random-example.yaml
experiment.kubeflow.org/random-example created
$ kubectl -n kubeflow get pods --show-labels | grep random-example
random-example-2xh7zvhr-g5d7g           1/1  Running  0  8m51s  controller-uid=dd3773c1-5e7a-11ea-a37f-42010a8e023a,job-name=random-example-2xh7zvhr
random-example-59jb6pgg-zwpq7           1/1  Running  0  8m51s  controller-uid=dd1d4a63-5e7a-11ea-a37f-42010a8e023a,job-name=random-example-59jb6pgg
random-example-nlmwhpq9-q5788           1/1  Running  0  8m51s  controller-uid=dd248971-5e7a-11ea-a37f-42010a8e023a,job-name=random-example-nlmwhpq9
random-example-random-76fcbdd7f4-wwcsw  1/1  Running  0  11m    controller-tools.k8s.io=1.0,deployment=random-example-random,experiment=random-example,pod-template-hash=76fcbdd7f4,suggestion=random-example
$ POD_NAME=$(kubectl get pods -n kubeflow --selector=experiment=random-example --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
$ kubectl -n kubeflow logs -f "$POD_NAME"
…
$ kubectl -n kubeflow describe experiment random-example
…
Status:
 Conditions:
  …
  Message:               Experiment is running
  Reason:                ExperimentRunning
  Status:                True
  Type:                  Running
…
$
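
The {{- range .}} loop in the trial template above appends one name=value argument per hyperparameter to the container command. The sketch below mimics that expansion in plain shell. The function name and the sample values are illustrative assumptions; Katib chooses the actual values for each trial from the feasible space.

```shell
# Build the trial command the way the template does: fixed arguments first,
# then one "name=value" argument per hyperparameter assignment.
# render_trial_command is a hypothetical helper, not a Katib tool.
render_trial_command() {
  local cmd="python3 /opt/mxnet-mnist/mnist.py --batch-size=64"
  local assignment
  for assignment in "$@"; do
    cmd="${cmd} ${assignment}"
  done
  echo "${cmd}"
}

# Sample values picked by hand from the feasible space in random-example.yaml.
render_trial_command "--lr=0.02" "--num-layers=3" "--optimizer=adam"
```

This is why the parameter names in the Experiment spec carry the leading dashes (--lr, --num-layers, --optimizer): they are passed to mnist.py verbatim as command-line flags.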

- When the last value in Status.Conditions.Type is Succeeded, the experiment is complete.

   View the results of the experiment in the Katib UI: Kubeflow dashboard > Katib > HP > Monitor

- TensorFlow example 

$ kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml
$ kubectl -n kubeflow describe experiment tfjob-example

- PyTorch example 

$ kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/pytorchjob-example.yaml
$ kubectl -n kubeflow describe experiment pytorchjob-example
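
Instead of re-running describe by hand, the completion check can be scripted for any of the three experiments above. The sketch below reads the type of the last entry in status.conditions with a kubectl JSONPath query and polls until it is Succeeded or Failed. The function names and the 30-second default interval are assumptions for illustration, and the query assumes your kubectl version accepts the negative slice [-1:].

```shell
# Print the type of the most recent condition on an Experiment.
# Args: experiment name, namespace (defaults to kubeflow).
last_condition() {
  kubectl -n "${2:-kubeflow}" get experiment "$1" \
    -o jsonpath='{.status.conditions[-1:].type}'
}

# Poll until the last condition is Succeeded (returns 0) or Failed (returns 1).
# Args: experiment name, polling interval in seconds (defaults to 30).
wait_for_experiment() {
  local name="$1" interval="${2:-30}" state
  while true; do
    state="$(last_condition "${name}")"
    echo "${name}: ${state}"
    case "${state}" in
      Succeeded) return 0 ;;
      Failed) return 1 ;;
    esac
    sleep "${interval}"
  done
}

# Usage (requires access to the cluster):
# wait_for_experiment random-example
# wait_for_experiment tfjob-example 60
```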

 
