2020.12.01
a. Problem: Worker node - performance degradation caused by excessive CPU usage
- Environments
Kubernetes 1.16.15, istio 1.3
- Impact
When the Rook Ceph rook-ceph-mon-o pod runs on node iap04, it responds so slowly that it is dropped from the quorum, which triggers a fail-over.
[root@iap04 ~]# top
top - 10:46:25 up 6 days, 17:58, 1 user, load average: 73.37, 77.52, 79.46
Tasks: 403 total, 19 running, 382 sleeping, 0 stopped, 2 zombie
%Cpu(s): 90.1 us, 8.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st
KiB Mem : 32490092 total, 6647024 free, 16691484 used, 9151584 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13772944 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22552 1337 20 0 444656 313108 20312 R 85.0 1.0 16:14.61 envoy
22891 1337 20 0 392428 261624 20308 R 85.0 0.8 16:06.74 envoy
24234 1337 20 0 444712 315224 20312 R 85.0 1.0 15:35.31 envoy
24256 1337 20 0 445736 314428 20304 R 85.0 1.0 15:55.10 envoy
24590 1337 20 0 448804 317332 20312 R 85.0 1.0 15:32.17 envoy
25626 1337 20 0 449828 318916 20316 R 85.0 1.0 15:59.56 envoy
22889 1337 20 0 379172 247632 20304 R 75.0 0.8 16:01.93 envoy
17206 root 20 0 3779588 391124 23104 S 20.0 1.2 1373:21 kubelet
…
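To see at a glance which command is eating the node, the per-process rows can be summed by command name. This is a minimal sketch, not something from the original session: the function name `cpu_by_command` is hypothetical, and it assumes the default 12-column procps layout of the listing above.

```shell
#!/bin/sh
# Sum the %CPU column per command name from `top -b -n 1` output on stdin.
# Assumes the default procps column layout:
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND (12 fields).
cpu_by_command() {
    awk '
        $1 == "PID" { in_table = 1; next }   # process rows start after this header
        in_table && NF >= 12 {
            cpu[$12] += $9                   # field 9 = %CPU, field 12 = COMMAND
            n[$12]++
        }
        END {
            # one line per command: name, process count, summed %CPU
            for (c in cpu) printf "%s %d %.1f\n", c, n[c], cpu[c]
        }
    ' | sort -k3,3nr
}

# Usage on a node (hypothetical, not run here):
#   top -b -n 1 | cpu_by_command | head
```

For the rows shown in the listing, the seven envoy processes add up to 585.0 %CPU on a machine with 4 cores, against 20.0 for kubelet.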
b. Cause analysis: excessive CPU usage by the Istio envoy container
[root@iap04 ~]# ps -ef | grep 22552 | grep -v grep | cut -c-100
1337 22552 22412 88 10:28 ? 00:16:56 /usr/local/bin/envoy -c /etc/istio/proxy/envoy-rev0.
[root@iap04 ~]# ps -ef | grep 22412 | grep -v grep | cut -c-100
1337 22412 22380 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain kn
1337 22552 22412 88 10:28 ? 00:17:03 /usr/local/bin/envoy -c /etc/istio/proxy/envoy-rev0.
[root@iap04 ~]# ps -ef | grep "/usr/local/bin/pilot-agent proxy sidecar" | grep -v grep | cut -c-130
1337 22412 22380 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
1337 22736 22682 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
1337 22757 22707 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
1337 24015 23965 0 10:28 ? 00:00:02 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
1337 24028 23982 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
1337 24412 24377 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
1337 25513 25471 0 10:28 ? 00:00:01 /usr/local/bin/pilot-agent proxy sidecar --domain knative-serving.svc.cluster.local
[root@iap04 ~]#
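Walking the parent chain with ps shows the namespace only because pilot-agent happens to carry --domain on its command line. Another route from a PID to its pod is the process's cgroup path. The sketch below is an assumption about this cluster, not something verified in the session: it relies on kubelet naming pod cgroups `pod<uid>`, and `pod_uid_from_cgroup` is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the pod UID from /proc/<pid>/cgroup content supplied on stdin.
# Assumption: kubelet names pod cgroups "pod<uid>". The systemd cgroup driver
# writes the UID with underscores, so those are turned back into dashes.
pod_uid_from_cgroup() {
    grep -o 'pod[0-9a-f]\{8\}[-_][0-9a-f]\{4\}[-_][0-9a-f]\{4\}[-_][0-9a-f]\{4\}[-_][0-9a-f]\{12\}' |
        head -n 1 |
        sed -e 's/^pod//' -e 's/_/-/g'
}

# Usage (hypothetical, not run here):
#   on the node:   pod_uid_from_cgroup < /proc/22552/cgroup
#   then, with the UID that was printed:
#   kubectl get pod --all-namespaces \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid | grep <uid>
```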
# Find the namespaces configured for Istio envoy (sidecar) injection and check for abnormal pods
[iap@iap01 ~]$ kubectl get namespace -L istio-injection | grep enabled
admin Active 117d enabled
knative-serving Active 117d enabled
[iap@iap01 ~]$ k get pod -n knative-serving -o wide | grep "1/2"
activator-6dc4884-77wtg 1/2 Running 3 16h 10.244.6.217 iap04 <none> <none>
activator-6dc4884-pnrt5 1/2 Running 12 3d16h 10.244.6.205 iap04 <none> <none>
activator-6dc4884-w78wr 1/2 Running 3 16h 10.244.6.216 iap04 <none> <none>
activator-6dc4884-wr9k4 1/2 Running 3 16h 10.244.6.214 iap04 <none> <none>
activator-6dc4884-zrlbh 1/2 Running 3 16h 10.244.6.215 iap04 <none> <none>
activator-6dc4884-zz9js 1/2 Running 4 16h 10.244.6.213 iap04 <none> <none>
[iap@iap01 ~]$ k get pod -n knative-serving -o wide | grep activator | wc -l
20
[iap@iap01 ~]$ k get deployments.apps activator -n knative-serving
NAME READY UP-TO-DATE AVAILABLE AGE
activator 16/20 20 16 117d
[iap@iap01 ~]$ k describe pod activator-6dc4884-77wtg -n knative-serving
…
istio-proxy:
Container ID: docker://474fcacc7b235a02c51a7a3e789f0a27c7c28e11d6126136d12787b4d48ac927
Image: docker.io/istio/proxyv2:1.5.8
…
State: Running
Started: Tue, 01 Dec 2020 10:28:13 +0900
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 30 Nov 2020 18:19:55 +0900
Finished: Tue, 01 Dec 2020 10:28:07 +0900
Ready: False
Restart Count: 1
…
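The grep "1/2" check above only catches two-container pods with exactly one container ready. A slightly more general version compares the two halves of the READY column and groups the result by node. This is a sketch under the assumption that the output keeps the `-o wide` column order shown above (READY in field 2, STATUS in field 3, NODE in field 7); `not_ready_by_node` is a hypothetical name.

```shell
#!/bin/sh
# Count Running pods whose READY column shows fewer ready containers than
# expected, grouped by node.
# Reads `kubectl get pod -o wide --no-headers` output on stdin.
# Assumes READY = field 2, STATUS = field 3, NODE = field 7.
not_ready_by_node() {
    awk '
        $3 == "Running" {
            split($2, ready, "/")            # "1/2" -> ready[1]=1, ready[2]=2
            if (ready[1] + 0 < ready[2] + 0) count[$7]++
        }
        END { for (node in count) print count[node], node }
    ' | sort -k2,2
}

# Usage (hypothetical, not run here):
#   kubectl get pod -n knative-serving -o wide --no-headers | not_ready_by_node
```

In this incident every not-ready activator pod was on iap04, so the output would be a single line for that node.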
# Confirm that the envoy process is consuming excessive CPU
[iap@iap01 ~]$ k exec activator-6dc4884-77wtg -c istio-proxy -n knative-serving -it -- sh
$ top
top - 01:52:45 up 6 days, 18:04, 0 users, load average: 73.67, 77.36, 78.76
Tasks: 4 total, 2 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 91.8 us, 7.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.2 si, 0.0 st
KiB Mem : 32490092 total, 5735552 free, 17246084 used, 9508456 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13201572 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25 istio-p+ 20 0 542956 412108 20312 R 97.3 1.3 21:18.08 envoy
1 istio-p+ 20 0 158516 28904 16552 S 0.3 0.1 0:01.96 pilot-agent
…
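Opening a shell in each sidecar does not scale to 20 replicas. If metrics-server is installed (an assumption, nothing in the session confirms it), `kubectl top pod --containers` reports per-container CPU, and the istio-proxy rows can be filtered by a threshold. `hot_sidecars` is a hypothetical helper, and the threshold of 500 millicores in the usage line is only an example value.

```shell
#!/bin/sh
# Print istio-proxy containers using at least MIN_MILLICORES of CPU.
# Reads `kubectl top pod --containers` output on stdin, whose columns are:
# POD  NAME(container)  CPU(cores)  MEMORY(bytes)
hot_sidecars() {
    awk -v min="$1" '
        $2 == "istio-proxy" {                # the header row is skipped by this filter
            cpu = $3
            if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu += 0 }   # "850m" -> 850
            else            { cpu *= 1000 }                     # whole cores -> millicores
            if (cpu >= min + 0) print $1, cpu "m"
        }
    '
}

# Usage (hypothetical, not run here):
#   kubectl top pod -n knative-serving --containers | hot_sidecars 500
```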
c. Solution:
[iap@iap01 ~]$ k rollout restart deployment activator -n knative-serving
deployment.apps/activator restarted
[iap@iap01 ~]$
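The excessive CPU usage was resolved. However, iap04 and iap05 are low-spec nodes, so their CPU utilization remains high (lscpu reports 4 cores on iap04 versus 2 x 20 cores on iap10).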
[root@iap04 ~]# lscpu | grep -i socket
Core(s) per socket: 4
Socket(s): 1
[root@iap04 ~]#
[root@iap10 ~]# lscpu | grep -i socket
Core(s) per socket: 20
Socket(s): 2
[root@iap10 ~]#
[iap@iap01 ~]$ k get deployments.apps activator -n knative-serving
NAME READY UP-TO-DATE AVAILABLE AGE
activator 20/20 20 20 117d
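The 16/20 versus 20/20 comparison can be turned into a check that a script can act on after the restart. A minimal sketch, assuming the READY column keeps the "ready/desired" form shown above; `deployment_ready` is a hypothetical name.

```shell
#!/bin/sh
# Succeed only when the READY column ("ready/desired") shows all replicas ready.
# Reads `kubectl get deployment NAME --no-headers` output on stdin.
deployment_ready() {
    awk '
        { split($2, r, "/"); ok = (r[2] + 0 > 0 && r[1] + 0 == r[2] + 0) }
        END { exit (ok ? 0 : 1) }            # empty input counts as not ready
    '
}

# Usage (hypothetical, not run here):
#   kubectl rollout status deployment/activator -n knative-serving --timeout=300s
#   kubectl get deployments.apps activator -n knative-serving --no-headers \
#     | deployment_ready && echo "activator fully ready"
```

`kubectl rollout status` blocks until the rollout finishes or the timeout expires, so it is the simpler choice right after `rollout restart`; the helper is for later spot checks.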
[iap@iap01 ~]$ k get pod -n knative-serving -o wide | grep activator | grep -v grep | tr -s ' ' | cut -d' ' -f 7 | sort | uniq -c
5 iap04
4 iap05
8 iap10
3 iap11
[iap@iap01 ~]$ ~/bin/check-cpu.sh
procs -----------memory-------------- ---swap--- -----io----- ----syste---- ------cpu-----
node r b swpd free buff cache si so bi bo in cs us sy id wa st
iap04: 9 8 0 4000816 420852 11492244 0 0 0 154 16678 30269 71 10 7 12 0
iap05: 2 13 0 256456 117000 21081552 0 0 0 96 11376 22950 31 5 10 54 0
iap06: 1 2 0 3699588 283164 13359000 0 0 0 52 8015 20149 24 3 62 12 0
iap07: 0 0 0 7800952 6824 18504192 0 0 0 0 2387 6741 0 0 100 0 0
iap08: 25 0 0 10890024 7072 18057196 0 0 0 16436 29340 33996 24 3 73 0 0
iap09: 0 0 0 17706208 22344 12057468 0 0 0 8 1674 3914 1 0 99 0 0
iap10: 2 1 0 38194324 13944 1099520 0 0 28672 20844 20330 15851 5 1 93 1 0
iap11: 4 2 0 5350500 47616 6421024 0 0 0 29168 31602 110691 4 2 92 2 0
[iap@iap01 ~]$
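The per-node table can be reduced to the nodes that are actually short on CPU. The contents of ~/bin/check-cpu.sh are not shown, so this sketch only assumes its output layout as printed above: a "node:" label followed by the 17 vmstat columns, which puts idle (id) in field 16 and I/O wait (wa) in field 17. `busy_nodes` is a hypothetical name.

```shell
#!/bin/sh
# Print nodes whose CPU idle percentage is at or below MAX_IDLE.
# Reads the table shown above on stdin: "node:" followed by 17 vmstat columns,
# so idle (id) is field 16 and I/O wait (wa) is field 17.
busy_nodes() {
    awk -v max="$1" '
        $1 ~ /:$/ && NF == 18 && $16 + 0 <= max + 0 {
            node = $1
            sub(/:$/, "", node)              # "iap04:" -> "iap04"
            print node, "idle=" $16, "wait=" $17
        }
    '
}

# Usage (hypothetical, not run here):
#   ~/bin/check-cpu.sh | busy_nodes 20
```

With a threshold of 20, the table above yields iap04 (idle 7, wait 12) and iap05 (idle 10, wait 54). The high wait on iap05 suggests that node is limited by I/O more than by CPU.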
[iap@iap01 ~]$ k exec activator-6c8699d66-g9q2c -c istio-proxy -n knative-serving -it -- sh
$ top
top - 04:05:26 up 6 days, 20:17, 0 users, load average: 23.39, 22.03, 19.40
Tasks: 4 total, 2 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 57.6 us, 4.3 sy, 0.0 ni, 20.4 id, 16.6 wa, 0.0 hi, 1.1 si, 0.0 st
KiB Mem : 32490092 total, 2193476 free, 18648272 used, 11648344 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 11751124 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22 istio-p+ 20 0 215340 76296 20384 R 83.7 0.2 105:44.26 envoy