
K8s - Node NotReady

by 여행을 떠나자! 2021. 9. 16.

2021.05.11

 

1. Overview

- Issue

   ✓ When a node goes NotReady, the pods on that node change to (and stay in) Terminating status, and self-healing does not take place

   ✓ Impact

       Deployments can keep serving (provided replicas > 1 and the pods are spread across nodes; see the sketch below this list)

       Statefulsets: whether service continues depends on the application (e.g. MariaDB becomes unavailable, a Redis cluster keeps serving)

   ✓ If an entire node goes down, Kubernetes generally isn’t able to spin a new one up

        AWS and GCP cover this with autoscaling groups, Azure with scale sets

        On-prem environments need Kubernetes cluster management software such as Rancher, Cluster API, or Kublr

             https://kublr.com/blog/reliable-self-healing-kubernetes-explained/
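
   As referenced above, a minimal sketch of what "replicas > 1 and spread across nodes" can look like for a Deployment. The name sample-web, its label, and the nginx image are hypothetical and only for illustration; pod anti-affinity is one way to force the spread (topologySpreadConstraints is another option on newer clusters):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-web                  # hypothetical name, for illustration only
spec:
  replicas: 2                       # at least 2, so one replica survives a node failure
  selector:
    matchLabels:
      app: sample-web
  template:
    metadata:
      labels:
        app: sample-web
    spec:
      affinity:
        podAntiAffinity:            # keep the replicas on different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: sample-web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.21
EOF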

 

- Environments

   ✓ Kubernetes 1.16.15

   ✓ nfs-client-provisioner v3.1.0 (Dynamic NFS Provisioner)
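
   A quick way to confirm the dynamic provisioner and its StorageClass (nfs-sc-acp, which shows up in the PVC further down). Where the provisioner pod lives depends on how nfs-client-provisioner was installed, so the grep below is only a guess:

$ kubectl get storageclass                              # nfs-sc-acp should be listed here
$ kubectl get pods -A | grep nfs-client-provisioner     # locate the provisioner pod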

 

- Cluster state before the failure

$ k get nodes
NAME           STATUS   ROLES    AGE   VERSION
acp-master01   Ready    master   24d   v1.16.15
acp-master02   Ready    master   24d   v1.16.15
acp-master03   Ready    master   24d   v1.16.15
acp-worker01   Ready    <none>   24d   v1.16.15
acp-worker02   Ready    <none>   24d   v1.16.15
acp-worker03   Ready    <none>   51m   v1.16.15
$ k get pods -n kubeflow -o wide | egrep 'NAME|worker01'
admission-webhook-bootstrap-stateful-set-0        1/1  Running  6  4d   10.244.3.61  acp-worker01  <none>  <none>
application-controller-stateful-set-0             1/1  Running  0  69m  10.244.3.83  acp-worker01  <none>  <none>
argo-ui-669bcd8bfc-m9cpf                          1/1  Running  0  4d   10.244.3.48  acp-worker01  <none>  <none>
cache-server-5f59f9c4b6-h2zvj                     2/2  Running  0  74m  10.244.3.82  acp-worker01  <none>  <none>
harbor-registry-redis-master-0                    1/1  Running  35 5d4h 10.244.3.31  acp-worker01  <none>  <none>
jupyter-web-app-deployment-5c79699b86-nk88k       1/1  Running  0  4d   10.244.3.49  acp-worker01  <none>  <none>
katib-db-manager-68966c4665-ptl2v                 1/1  Running  7  74m  10.244.3.79  acp-worker01  <none>  <none>
kfserving-controller-manager-0                    2/2  Running  0  4d   10.244.3.62  acp-worker01  <none>  <none>
metadata-envoy-deployment-c5985d64b-btsp9         1/1  Running  0  4d   10.244.3.50  acp-worker01  <none>  <none>
metadata-grpc-deployment-9fdb476-p4wlw            1/1  Running  2  4d   10.244.3.51  acp-worker01  <none>  <none>
ml-pipeline-544647c564-p258f                      2/2  Running  0  23h  10.244.3.71  acp-worker01  <none>  <none>
ml-pipeline-ui-5499f67d-pqwx5                     2/2  Running  0  4d   10.244.3.59  acp-worker01  <none>  <none>
ml-pipeline-visualizationserver-769546b47b-45dtn  2/2  Running  0  4d   10.244.3.53  acp-worker01  <none>  <none>
mpi-operator-b46dfb59d-745f7                      1/1  Running  0  4d   10.244.3.60  acp-worker01  <none>  <none>
mxnet-operator-64b9d686c7-gbbm5                   1/1  Running  0  74m  10.244.3.80  acp-worker01  <none>  <none>
mysql-864f9c758b-zpcdt                            2/2  Running  0  74m  10.244.3.81  acp-worker01  <none>  <none>
notebook-controller-deployment-57bdb6c975-bx4bt   1/1  Running  0  74m  10.244.3.78  acp-worker01  <none>  <none>
tf-job-operator-6f9dd89b5f-pndbd                  1/1  Running  0  4d   10.244.3.56  acp-worker01  <none>  <none>
workflow-controller-54dccb7dc4-t7pmm              1/1  Running  0  4d   10.244.3.58  acp-worker01  <none>  <none>
$

 

- Stateful pod using an RWO (ReadWriteOnce) volume

$ k get pod -n harbor-registry -o wide | egrep 'NAME|harbor-registry-redis-master-0'
NAME                            READY  STATUS   RESTARTS  AGE   IP           NODE          NOMINATED NODE  READINESS GATES
harbor-registry-redis-master-0  1/1    Running  354       5d4h  10.244.3.31  acp-worker01  <none>          <none>
$ k get pvc -n harbor-registry | egrep 'NAME|redis-data-harbor-registry-redis-master-0'
NAME                                       STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS  AGE
redis-data-harbor-registry-redis-master-0  Bound   pvc-08aa4e60-1e12-49c0-b4ba-d770e008bfdd  8Gi       RWO           nfs-sc-acp    13d
$
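
 For reference, a claim like redis-data-harbor-registry-redis-master-0 above boils down to roughly the following manifest; the name sample-data is hypothetical:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sample-data                 # hypothetical name, for illustration only
  namespace: harbor-registry
spec:
  accessModes:
  - ReadWriteOnce                   # RWO: mounted read-write by one node at a time
  storageClassName: nfs-sc-acp
  resources:
    requests:
      storage: 8Gi
EOF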

 

 

2. Test

- Stop the kubelet process on the acp-worker01 node (to induce a failure)

root@acp-worker01:/home/acp# systemctl stop kubelet
root@acp-worker01:/home/acp#

 

- Cluster state at the time of the failure (acp-worker01 changes to NotReady)

$ k get nodes
NAME           STATUS     ROLES    AGE   VERSION
acp-master01   Ready      master   24d   v1.16.15
acp-master02   Ready      master   24d   v1.16.15
acp-master03   Ready      master   24d   v1.16.15
acp-worker01   NotReady   <none>   24d   v1.16.15
acp-worker02   Ready      <none>   24d   v1.16.15
acp-worker03   Ready      <none>   55m   v1.16.15
$
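
   When kubelet stops reporting, the node controller marks the node NotReady and, with taint-based eviction (on by default in this version), adds NoExecute taints such as node.kubernetes.io/unreachable to it. One way to see what was applied:

$ kubectl describe node acp-worker01 | grep -A3 Taints     # taints added by the node controller
$ kubectl get node acp-worker01 -o jsonpath='{.spec.taints}'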

 

- Pods on the failed node change from Running to Terminating and stay that way; replacements are not created on other nodes

$ k get pods -n kubeflow -o wide | egrep worker01
admission-webhook-bootstrap-stateful-set-0  1/1  Terminating  6  4d   10.244.3.61  acp-worker01  <none>  <none>
application-controller-stateful-set-0       1/1  Terminating  0  72m  10.244.3.83  acp-worker01  <none>  <none>
...
$
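
   If only a handful of pods matter, an alternative (not what is done in the next section, and risky for StatefulSets because the kubelet can no longer confirm the old pod is actually gone) is to force-delete the stuck pods one by one:

$ kubectl delete pod admission-webhook-bootstrap-stateful-set-0 -n kubeflow --grace-period=0 --force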

 

 

3. Solution

- https://blog.mayadata.io/recover-from-volume-multi-attach-error-in-on-prem-kubernetes-clusters

   ✓ Before removing the node, check whether it is really shut down or whether the NotReady state comes from a network/split-brain condition toward the master nodes (see the sketch below).
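
   A rough sketch of that check, assuming the node's host IP (10.214.35.103 for acp-worker01 in this cluster) and an SSH account named acp, which may differ in other environments; <apiserver> is a placeholder:

$ ping -c 3 10.214.35.103                                            # is the machine reachable at all?
$ ssh acp@10.214.35.103 'systemctl is-active kubelet'                # is the host up and kubelet running?
$ ssh acp@10.214.35.103 'curl -k https://<apiserver>:6443/healthz'   # can the node still reach the API server?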

 

- Delete the node that is in NotReady state

$ k delete nodes acp-worker01
node "acp-worker01" deleted
$ k get nodes
NAME           STATUS   ROLES    AGE   VERSION
acp-master01   Ready    master   24d   v1.16.15
acp-master02   Ready    master   24d   v1.16.15
acp-master03   Ready    master   24d   v1.16.15
acp-worker02   Ready    <none>   24d   v1.16.15
acp-worker03   Ready    <none>   65m   v1.16.15
$ k get pods -n kubeflow -o wide | egrep worker01
$

 

- When the NotReady node is deleted, the pods that were stuck in Terminating on that node are restarted on other nodes.

   Stateful pods using NFS/RWO volumes also fail over normally.

$ k get pod -n harbor-registry -o wide | egrep 'NAME|harbor-registry-redis-master-0'
NAME                             READY  STATUS   RESTARTS  AGE  IP           NODE          NOMINATED NODE  READINESS GATES
harbor-registry-redis-master-0   1/1    Running  0         37s  10.244.5.89  acp-worker03  <none>          <none>
$

 

- State after the formerly NotReady node recovers

   When acp-worker01 recovers, it automatically rejoins the cluster and its status returns to Ready.

   If acp-worker01 rejoins after something like a network partition, the pods it had been running are terminated automatically, since they have already been restarted on other nodes.

root@acp-worker01:/home/acp# systemctl start kubelet
root@acp-worker01:/home/acp#
NAME           STATUS   ROLES    AGE   VERSION
acp-master01   Ready    master   24d   v1.16.15
acp-master02   Ready    master   24d   v1.16.15
acp-master03   Ready    master   24d   v1.16.15
acp-worker01   Ready    <none>   1m    v1.16.15
acp-worker02   Ready    <none>   24d   v1.16.15
acp-worker03   Ready    <none>   65m   v1.16.15
root@acp-worker01:/home/acp#

$ k get pod -A -o wide | grep worker01
istio-system    istio-nodeagent-8xf7v        1/1  Running  0  38s  10.244.3.84    acp-worker01  <none>  <none>
kube-system     kube-flannel-ds-amd64-45brz  1/1  Running  0  38s  10.214.35.103  acp-worker01  <none>  <none>
kube-system     kube-proxy-j2jqj             1/1  Running  0  38s  10.214.35.103  acp-worker01  <none>  <none>
metallb-system  speaker-22zr8                1/1  Running  0  37s  10.214.35.103  acp-worker01  <none>  <none>
$

 

 

4. GKE's auto-repairing nodes

- https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-repair

- When enabled, GKE makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.

- Repair criteria 
   ✓ A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
   ✓ A node does not report any status at all over the given time threshold (approximately 10 minutes).
   ✓ A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
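
- Enabling auto-repair on an existing node pool looks roughly like the following gcloud command (pool, cluster, and zone names are placeholders; the linked documentation is authoritative):

$ gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --zone COMPUTE_ZONE \
    --enable-autorepair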
