2021.05.11
1. Overview
- Issue
✓ When a node goes NotReady, the pods on that node are changed to Terminating and stay there, so self-healing does not proceed
✓ Impact
▷ Deployments remain serviceable (provided replicas > 1 and the pods are spread across nodes)
▷ StatefulSets vary by application (e.g., MariaDB becomes unavailable, while a Redis cluster keeps serving)
✓ If an entire node goes down, Kubernetes generally isn’t able to spin a new one up
▷ AWS and GCP cover this with autoscaling groups, Azure with scale sets
▷ On-prem environments need Kubernetes cluster management software such as Rancher, Cluster API, or Kublr
https://kublr.com/blog/reliable-self-healing-kubernetes-explained/
- Environments
✓ Kubernetes 1.16.15
✓ nfs-client-provisioner v3.1.0 (Dynamic NFS Provisioner)
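The storage class and provisioner referenced in the outputs below can be verified with standard kubectl commands; a quick check, assuming the provisioner runs as a pod somewhere in the cluster (namespace not shown in this note):
$ k get storageclass nfs-sc-acp
$ k get pods -A | grep nfs-client-provisioner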
- Status before the failure (healthy state)
$ k get nodes
NAME STATUS ROLES AGE VERSION
acp-master01 Ready master 24d v1.16.15
acp-master02 Ready master 24d v1.16.15
acp-master03 Ready master 24d v1.16.15
acp-worker01 Ready <none> 24d v1.16.15
acp-worker02 Ready <none> 24d v1.16.15
acp-worker03 Ready <none> 51m v1.16.15
$ k get pods -n kubeflow -o wide | egrep 'NAME|worker01'
admission-webhook-bootstrap-stateful-set-0 1/1 Running 6 4d 10.244.3.61 acp-worker01 <none> <none>
application-controller-stateful-set-0 1/1 Running 0 69m 10.244.3.83 acp-worker01 <none> <none>
argo-ui-669bcd8bfc-m9cpf 1/1 Running 0 4d 10.244.3.48 acp-worker01 <none> <none>
cache-server-5f59f9c4b6-h2zvj 2/2 Running 0 74m 10.244.3.82 acp-worker01 <none> <none>
harbor-registry-redis-master-0 1/1 Running 35 5d4h 10.244.3.31 acp-worker01 <none> <none>
jupyter-web-app-deployment-5c79699b86-nk88k 1/1 Running 0 4d 10.244.3.49 acp-worker01 <none> <none>
katib-db-manager-68966c4665-ptl2v 1/1 Running 7 74m 10.244.3.79 acp-worker01 <none> <none>
kfserving-controller-manager-0 2/2 Running 0 4d 10.244.3.62 acp-worker01 <none> <none>
metadata-envoy-deployment-c5985d64b-btsp9 1/1 Running 0 4d 10.244.3.50 acp-worker01 <none> <none>
metadata-grpc-deployment-9fdb476-p4wlw 1/1 Running 2 4d 10.244.3.51 acp-worker01 <none> <none>
ml-pipeline-544647c564-p258f 2/2 Running 0 23h 10.244.3.71 acp-worker01 <none> <none>
ml-pipeline-ui-5499f67d-pqwx5 2/2 Running 0 4d 10.244.3.59 acp-worker01 <none> <none>
ml-pipeline-visualizationserver-769546b47b-45dtn 2/2 Running 0 4d 10.244.3.53 acp-worker01 <none> <none>
mpi-operator-b46dfb59d-745f7 1/1 Running 0 4d 10.244.3.60 acp-worker01 <none> <none>
mxnet-operator-64b9d686c7-gbbm5 1/1 Running 0 74m 10.244.3.80 acp-worker01 <none> <none>
mysql-864f9c758b-zpcdt 2/2 Running 0 74m 10.244.3.81 acp-worker01 <none> <none>
notebook-controller-deployment-57bdb6c975-bx4bt 1/1 Running 0 74m 10.244.3.78 acp-worker01 <none> <none>
tf-job-operator-6f9dd89b5f-pndbd 1/1 Running 0 4d 10.244.3.56 acp-worker01 <none> <none>
workflow-controller-54dccb7dc4-t7pmm 1/1 Running 0 4d 10.244.3.58 acp-worker01 <none> <none>
$
- Stateful pod using an RWO (ReadWriteOnce) volume
$ k get pod -n harbor-registry -o wide | egrep 'NAME|harbor-registry-redis-master-0'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
harbor-registry-redis-master-0 1/1 Running 354 5d4h 10.244.3.31 acp-worker01 <none> <none>
$ k get pvc -n harbor-registry | egrep 'NAME|redis-data-harbor-registry-redis-master-0'
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
redis-data-harbor-registry-redis-master-0 Bound pvc-08aa4e60-1e12-49c0-b4ba-d770e008bfdd 8Gi RWO nfs-sc-acp 13d
$
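The access mode, reclaim policy, and NFS backing of the bound volume can be inspected on the PV itself (PV name taken from the output above; the second command assumes an in-tree nfs volume source, which is what nfs-client-provisioner creates):
$ k get pv pvc-08aa4e60-1e12-49c0-b4ba-d770e008bfdd
$ k get pv pvc-08aa4e60-1e12-49c0-b4ba-d770e008bfdd -o yaml | grep -A 3 'nfs:'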
2. Test
- Stop the kubelet process on node acp-worker01 (to simulate the failure)
root@acp-worker01:/home/acp# systemctl stop kubelet
root@acp-worker01:/home/acp#
- Status at the time of the failure (acp-worker01 changes to NotReady)
$ k get nodes
NAME STATUS ROLES AGE VERSION
acp-master01 Ready master 24d v1.16.15
acp-master02 Ready master 24d v1.16.15
acp-master03 Ready master 24d v1.16.15
acp-worker01 NotReady <none> 24d v1.16.15
acp-worker02 Ready <none> 24d v1.16.15
acp-worker03 Ready <none> 55m v1.16.15
$
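Once the kubelet stops reporting, the node controller taints the node with NoExecute taints (typically node.kubernetes.io/unreachable and/or node.kubernetes.io/not-ready, depending on the condition), which is what drives the pods below into Terminating. The taints can be checked with:
$ k describe node acp-worker01 | grep -A 2 Taints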
- Pods on the failed node change from Running to Terminating and stay there, and replacements are not created on other nodes (the API server has marked them for deletion, but with the kubelet down the deletion is never confirmed)
$ k get pods -n kubeflow -o wide | egrep worker01
admission-webhook-bootstrap-stateful-set-0 1/1 Terminating 6 4d 10.244.3.61 acp-worker01 <none> <none>
application-controller-stateful-set-0 1/1 Terminating 0 72m 10.244.3.83 acp-worker01 <none> <none>
...
$
3. Solution
- https://blog.mayadata.io/recover-from-volume-multi-attach-error-in-on-prem-kubernetes-clusters
Before deleting anything, check whether the node is really shut down or whether it is only a network/split-brain condition toward the master nodes.
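If only a few pods need to be recovered, a commonly used alternative to deleting the whole node is to force-delete the stuck pods so their controllers reschedule them (pod name below is the one from this note; use with care, since the dead kubelet cannot confirm the old container has actually stopped):
$ k delete pod harbor-registry-redis-master-0 -n harbor-registry --grace-period=0 --force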
- Delete the NotReady node
$ k delete nodes acp-worker01
node "acp-worker01" deleted
$ k get nodes
NAME STATUS ROLES AGE VERSION
acp-master01 Ready master 24d v1.16.15
acp-master02 Ready master 24d v1.16.15
acp-master03 Ready master 24d v1.16.15
acp-worker02 Ready <none> 24d v1.16.15
acp-worker03 Ready <none> 65m v1.16.15
$ k get pods -n kubeflow -o wide | egrep worker01
$
- Once the NotReady node is deleted, the pods that were stuck in Terminating on it are restarted on other nodes.
Stateful pods using NFS/RWO volumes also fail over normally.
$ k get pod -n harbor-registry -o wide | egrep 'NAME|harbor-registry-redis-master-0'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
harbor-registry-redis-master-0 1/1 Running 0 37s 10.244.5.89 acp-worker03 <none> <none>
$
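After the failover it is worth confirming that the same PVC is still Bound and is mounted by the rescheduled pod on the new node, for example:
$ k get pvc -n harbor-registry redis-data-harbor-registry-redis-master-0
$ k describe pod -n harbor-registry harbor-registry-redis-master-0 | grep -A 3 Volumes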
- Status after the previously NotReady node recovers
When acp-worker01 recovers, it rejoins Kubernetes automatically and its status returns to Ready
If acp-worker01 rejoins after something like a network partition, the pods that had been running on it are terminated automatically, since replacements have already been started on other nodes
root@acp-worker01:/home/acp# systemctl start kubelet
root@acp-worker01:/home/acp#
NAME STATUS ROLES AGE VERSION
acp-master01 Ready master 24d v1.16.15
acp-master02 Ready master 24d v1.16.15
acp-master03 Ready master 24d v1.16.15
acp-worker01 Ready <none> 1m v1.16.15
acp-worker02 Ready <none> 24d v1.16.15
acp-worker03 Ready <none> 65m v1.16.15
root@acp-worker01:/home/acp#
$ k get pod -A -o wide | grep worker01
istio-system istio-nodeagent-8xf7v 1/1 Running 0 38s 10.244.3.84 acp-worker01 <none> <none>
kube-system kube-flannel-ds-amd64-45brz 1/1 Running 0 38s 10.214.35.103 acp-worker01 <none> <none>
kube-system kube-proxy-j2jqj 1/1 Running 0 38s 10.214.35.103 acp-worker01 <none> <none>
metallb-system speaker-22zr8 1/1 Running 0 37s 10.214.35.103 acp-worker01 <none> <none>
$
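On-prem, the manual node deletion above can be approximated with a small watch loop, mirroring what the cloud providers' auto-repair does. A minimal sketch, assuming GNU date and that a node which has been NotReady past the threshold really is down (a production version should fence or power off the node first to avoid running the same stateful workload twice):
#!/bin/bash
# Sketch: delete nodes that have stayed NotReady longer than THRESHOLD seconds.
THRESHOLD=600
while true; do
  for node in $(kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}'); do
    # lastTransitionTime of the Ready condition ~ when the node stopped being Ready
    last=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
    elapsed=$(( $(date +%s) - $(date -d "$last" +%s) ))   # GNU date assumed
    if [ "$elapsed" -gt "$THRESHOLD" ]; then
      echo "$(date) deleting node $node (NotReady for ${elapsed}s)"
      kubectl delete node "$node"
    fi
  done
  sleep 60
done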
4. GKE's auto-repairing nodes
- https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-repair
- When enabled, GKE makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.
- Repair criteria
✓ A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
✓ A node does not report any status at all over the given time threshold (approximately 10 minutes).
✓ A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
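Node auto-repair is enabled per node pool with gcloud, for example (cluster, pool, and zone names below are placeholders):
$ gcloud container node-pools update default-pool --cluster my-cluster --zone us-central1-a --enable-autorepair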