
kube-prometheus-stack

by 여행을 떠나자! 2021. 9. 21.

2021.02.24

 

1. Prometheus?

- https://github.com/prometheus-operator/kube-prometheus

- kube-prometheus-stack

  ✓ Installs the kube-prometheus stack, a collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules combined with documentation and scripts to provide easy to operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator.  

  ✓ Prometheus is deployed along with kube-state-metrics and node_exporter to expose cluster-level metrics for Kubernetes API objects and node-level metrics such as CPU utilization

- Prometheus (https://prometheus.io/)

  An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

  Prometheus's main features are:

      ▷ a multi-dimensional data model with time series data identified by metric name and key/value pairs

      ▷ PromQL, a flexible query language to leverage this dimensionality

      ▷ no reliance on distributed storage; single server nodes are autonomous

      ▷ time series collection happens via a pull model over HTTP

      ▷ pushing time series is supported via an intermediary gateway

      ▷ targets are discovered via service discovery or static configuration

      ▷ multiple modes of graphing and dashboarding support
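To make the first two bullets concrete, here is a small illustration of the data model and a PromQL query (the metric is node_exporter's CPU counter, which this stack installs; the label values are made up):

```promql
# A time series = metric name + key/value label pairs
node_cpu_seconds_total{mode="idle", instance="10.244.1.23:9100"}

# PromQL leverages those dimensions: per-instance CPU busy rate
# over the last 5 minutes, excluding idle time
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```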

- Prometheus Architecture

 

 

2. Environments
- Kubernetes 1.16.15
- kube-prometheus-stack 13.10.0
   prometheus operator 0.45.0
   grafana 7.4.1
   node-exporter v1.0.1
      Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors
   kube-state-metrics v1.9.7
      kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects

 

 

3. Software exposing metrics

- GPU Operator v1.4.0 (NVIDIA DCGM-exporter 2.0.13-2.1.2) : exposes GPU metrics for Prometheus, leveraging NVIDIA DCGM.

- RabbitMQ Cluster Operator 1.3.0 (annotation type)

- redis-cluster helm chart 4.2.5 (annotation type)

 

 

4. Install kube-prometheus-stack

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
$ helm repo update
…
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
$

$ helm inspect values prometheus-community/kube-prometheus-stack --version 13.10.0 > kube-prometheus-stack.values
$ vi kube-prometheus-stack.values
…
grafana:
  adminPassword: prom-operator # default: prom-operator

…
prometheus:
  service:
    nodePort: 30090
    type: NodePort        # default: ClusterIP

…
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false   # default: true
    retention: 30d        # default: 10d
…
    storageSpec:          # default: {}
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
…
    # additionalScrapeConfigs: []     # default: []
    #
    # Prometheus can be configured to scrape all Pods with the prometheus.io/scrape: true annotation.
    # Supported SW: RabbitMQ, Redis
    #    $ k describe pod aicc-server-0 -n rabbitmq-cluster | grep "prometheus.io"
    #       Annotations:  prometheus.io/port: 15692
    #                     prometheus.io/scrape: true
    #
    additionalScrapeConfigs:
    - job_name: kubernetes-pods
      honor_timestamps: true
      scrape_interval: 30s             # How frequently to scrape targets by default.
      scrape_timeout: 30s              # How long until a scrape request times out.
      metrics_path: /metrics
      scheme: http
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        separator: ;
        regex: "true"
        replacement: $1
        action: keep
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        separator: ;
        regex: ([^:]+)(?::\d+)?;(\d+)
        target_label: __address__
        replacement: $1:$2
        action: replace
    #
    # Kubernetes SD configurations : endpoints (DCGM)
    # Supported SW: DCGM(The NVIDIA Data Center GPU Manager)
    #
    - job_name: gpu-metrics
      scrape_interval: 30s
      scrape_timeout: 30s
      metrics_path: /metrics
      scheme: http
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - gpu-operator-resources
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_node

...
$ helm install prometheus-community/kube-prometheus-stack --create-namespace --namespace kube-prometheus-stack --generate-name \
  --values kube-prometheus-stack.values
NAME: kube-prometheus-stack-1613969557
LAST DEPLOYED: Mon Feb 22 13:52:39 2021
NAMESPACE: kube-prometheus-stack
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace kube-prometheus-stack get pods -l "release=kube-prometheus-stack-1613969557"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
$ kubectl patch svc kube-prometheus-stack-1613969557-grafana -n  kube-prometheus-stack -p '{ "spec": { "type": "NodePort" } }'
$
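The `__address__` relabel rule in the values file above can be sanity-checked locally with plain sed. This is an illustration only: Prometheus anchors relabel regexes and uses RE2, whose non-capturing group `(?::\d+)?` has no POSIX ERE equivalent, so the sed version uses a capturing group and the replacement index shifts from `$2` to `\3`:

```shell
# Mimic the relabel rule: the joined source labels
# "__address__;prometheus.io/port" become host:annotation-port
rewrite() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}

rewrite '10.244.1.23:9100;15692'   # pod already had a port -> 10.244.1.23:15692
rewrite '10.244.1.23;15692'        # no port in __address__ -> 10.244.1.23:15692
```

Either way the port from the `prometheus.io/port` annotation wins, which is exactly what the `action: replace` rule does to `__address__`.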

$ helm list -A | egrep 'NAME|prometheus'
NAME                               NAMESPACE               REVISION   UPDATED                 STATUS     CHART                           APP VERSION
kube-prometheus-stack-1613969557   kube-prometheus-stack   1          2021-02-22 13:52:39 KST deployed   kube-prometheus-stack-13.10.0   0.45.0
$ k get pod -n kube-prometheus-stack
NAME                                                              READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-1613-alertmanager-0            2/2     Running   0          9m48s
kube-prometheus-stack-1613-operator-5945548cd4-x7652              1/1     Running   0          9m50s
kube-prometheus-stack-1613969557-grafana-65ff6b7d5f-dbqvc         2/2     Running   0          9m50s
kube-prometheus-stack-1613969557-kube-state-metrics-754c984fshf   1/1     Running   0          9m50s
kube-prometheus-stack-1613969557-prometheus-node-exporter-vgszp   1/1     Running   0          9m50s
kube-prometheus-stack-1613969557-prometheus-node-exporter-vpb4b   1/1     Running   0          9m50s
kube-prometheus-stack-1613969557-prometheus-node-exporter-zhpn7   1/1     Running   0          9m50s
prometheus-kube-prometheus-stack-1613-prometheus-0                2/2     Running   1          9m48s
$ k get svc -n kube-prometheus-stack
NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                                       ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   9m55s
kube-prometheus-stack-1613-alertmanager                     ClusterIP   10.108.111.181   <none>        9093/TCP                     9m59s
kube-prometheus-stack-1613-operator                         ClusterIP   10.101.15.13     <none>        443/TCP                      9m59s
kube-prometheus-stack-1613-prometheus                       NodePort    10.96.39.250     <none>        9090:30090/TCP               9m59s
kube-prometheus-stack-1613969557-grafana                    ClusterIP   10.110.51.223    <none>        80/TCP                       10m
kube-prometheus-stack-1613969557-kube-state-metrics         ClusterIP   10.103.230.175   <none>        8080/TCP                     9m59s
kube-prometheus-stack-1613969557-prometheus-node-exporter   ClusterIP   10.97.101.160    <none>        9100/TCP                     10m
prometheus-operated                                         ClusterIP   None             <none>        9090/TCP                     9m55s
$ k get pvc -n kube-prometheus-stack
NAME                                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-kube-prometheus-stack-1613-prometheus-...   Bound   pvc-260ff471-f742-45f6-b032-b2dd5282cd32   50Gi       RWO            nfs-sc-gmd     9m55s
$

 

 

5. Add scrape config

$ k get secrets kube-prometheus-stack-1613-prometheus-scrape-confg -n kube-prometheus-stack \
  -o go-template='{{index .data "additional-scrape-configs.yaml" | base64decode }}' > additional-scrape-configs.yaml
$ vi additional-scrape-configs.yaml
...
# append the new scrape config below the last line
- job_name: kubernetes-pods
  scrape_interval: 30s
  scrape_timeout: 30s
  ...
$ ADDITIONAL_SCRAPE_CONFIG=$(base64 < "./additional-scrape-configs.yaml" | tr -d '\n')
$ kubectl patch secret kube-prometheus-stack-1613-prometheus-scrape-confg -n kube-prometheus-stack \
  -p "{\"data\":{\"additional-scrape-configs.yaml\":\"${ADDITIONAL_SCRAPE_CONFIG}\"}}"
$ k rollout restart statefulsets.apps prometheus-kube-prometheus-stack-1613-prometheus -n kube-prometheus-stack
$

 The changes can be verified in the Prometheus UI under Status > Configuration.
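The encode step above can be dry-run locally; only the final kubectl patch needs the API server. A minimal sketch (the file path and job snippet are placeholders):

```shell
# Build the single-line base64 payload exactly as the patch step expects.
# Kubernetes secret data must be base64 without embedded newlines,
# hence the tr -d '\n'.
cat > /tmp/additional-scrape-configs.yaml <<'EOF'
- job_name: kubernetes-pods
  scrape_interval: 30s
  scrape_timeout: 30s
EOF

ENCODED=$(base64 < /tmp/additional-scrape-configs.yaml | tr -d '\n')

# Round-trip check: decoding must reproduce the file byte for byte
printf '%s' "$ENCODED" | base64 -d | diff - /tmp/additional-scrape-configs.yaml \
  && echo "round-trip OK"
```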

 

 

6. Connect to Prometheus UI

- http://125.133.49.139:30090/

     $ k get svc kube-prometheus-stack-1613-prometheus -n kube-prometheus-stack -o jsonpath='{.spec.ports[?(@.name=="web")].nodePort}'

     30090

     $

 

- Service Discovery : Menu - Status > Service Discovery

 

 

7. Import Dashboards to Grafana

http://125.133.49.133:32200/

  ✓ username: admin

  ✓ password:

      $ grep adminPassword kube-prometheus-stack.values

       adminPassword: prom-operator

       $

  ✓ Port:

       $ k get svc kube-prometheus-stack-1613969557-grafana -n kube-prometheus-stack -o jsonpath='{.spec.ports[?(@.name=="service")].nodePort}'

       32200

       $

- Dashboard Import

   + > Import > Import via grafana.com > input “Dashboard ID”, load > import

- Dashboard IDs

  ✓ DCGM : 12239 (Official), https://grafana.com/grafana/dashboards/12239
  ✓ RabbitMQ : 10991 (Official), https://grafana.com/grafana/dashboards/10991
  ✓ Redis : 11835 (Non-official), https://grafana.com/grafana/dashboards/11835
                  12776 (Official), https://grafana.com/grafana/dashboards/12776 - does not support Prometheus as a data source
  ✓ Node : kube-prometheus-stack provides a number of dashboards out of the box (e.g. Nodes, Kubernetes / Compute Resources / Node (Pods))
                  1860 (Non-official), https://grafana.com/grafana/dashboards/1860

- Dashboards > Manage > Nodes > Mark as favorite
  ✓ Nodes
  ✓ Kubernetes / Compute Resources / Node (Pods)

 

 

8. Troubleshooting

a. Problem: PodMonitor resources are not recognized

https://www.rabbitmq.com/kubernetes/operator/operator-monitoring.html#prom-operator
- The Prometheus Operator defines scraping configuration through a more flexible custom resource called PodMonitor. 
   The PodMonitor custom resource definition (CRD) allows you to declaratively define how a dynamic set of pods should be monitored.
   The PodMonitor object introduced by the Prometheus Operator discovers these pods and generates the relevant configuration for the Prometheus server in order to monitor them.
- Test environment

   kube-prometheus-stack-12.10.6

   prometheus-operator:v0.44.0

   prometheus:v2.22.1

   kubernetes 1.16.15, 1.17.14

$ vi rabbitmq-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
  namespace: rabbitmq-monitor
spec:
  podMetricsEndpoints:
  - interval: 15s
    port: prometheus
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq
  namespaceSelector:
    any: true

$ k apply -f rabbitmq-podmonitor.yaml
podmonitor.monitoring.coreos.com/rabbitmq created
$

b. Cause

- No PodMonitor-related entries show up in Prometheus UI > Status > Targets …

- On k8s 1.16.15 and 1.17.14, after changing logLevel=debug

$ k logs -l app=kube-prometheus-stack-operator -n rabbitmq-monitor -f | grep -i monitors
level=debug ts=2021-01-07T11:49:12.248382883Z caller=operator.go:1631 component=prometheusoperator msg="filtering namespaces to select ServiceMonitors from" namespaces=rabbitmq-system,rabbitmq-cluster,rabbitmq-monitor,default,kube-node-lease,kube-public,kube-system namespace=rabbitmq-monitor prometheus=kube-prometheus-stack-1610-prometheus
level=debug ts=2021-01-07T11:49:12.25467112Z caller=operator.go:1692 component=prometheusoperator msg="selected ServiceMonitors" servicemonitors=rabbitmq-monitor/kube-prometheus-stack-1610-apiserver,rabbitmq-monitor/kube-prometheus-stack-1610-grafana,rabbitmq-monitor/kube-prometheus-stack-1610-kube-scheduler,rabbitmq-monitor/kube-prometheus-stack-1610-node-exporter,rabbitmq-monitor/kube-prometheus-stack-1610-kube-state-metrics,rabbitmq-monitor/kube-prometheus-stack-1610-operator,rabbitmq-monitor/kube-prometheus-stack-1610-coredns,rabbitmq-monitor/kube-prometheus-stack-1610-prometheus,rabbitmq-monitor/kube-prometheus-stack-1610-kubelet,rabbitmq-monitor/kube-prometheus-stack-1610-alertmanager,rabbitmq-monitor/kube-prometheus-stack-1610-kube-etcd,rabbitmq-monitor/kube-prometheus-stack-1610-kube-proxy,rabbitmq-monitor/kube-prometheus-stack-1610-kube-controller-manager namespace=rabbitmq-monitor prometheus=kube-prometheus-stack-1610-prometheus
level=debug ts=2021-01-07T11:49:12.254717786Z caller=operator.go:1727 component=prometheusoperator msg="filtering namespaces to select PodMonitors from" namespaces=rabbitmq-system,rabbitmq-cluster,rabbitmq-monitor,default,kube-node-lease,kube-public,kube-system namespace=rabbitmq-monitor prometheus=kube-prometheus-stack-1610-prometheus
level=debug ts=2021-01-07T11:49:12.254750262Z caller=operator.go:1780 component=prometheusoperator msg="selected PodMonitors" podmonitors= namespace=rabbitmq-monitor prometheus=kube-prometheus-stack-1610-prometheus

- k8s 1.17.16

$ k logs -l app=kube-prometheus-stack-operator -n rabbitmq-monitor -f | grep -i podmonitor
…
level=debug ts=2021-01-08T01:06:29.318066557Z caller=operator.go:1727 component=prometheusoperator msg="filtering namespaces to select PodMonitors from" namespaces=kube-node-lease,default,rabbitmq-monitor,rabbitmq-system,rabbitmq-cluster,kube-system,kube-public namespace=rabbitmq-monitor prometheus=kube-prometheus-stack-1610-prometheus
level=debug ts=2021-01-08T01:06:29.318082198Z caller=operator.go:1780 component=prometheusoperator msg="selected PodMonitors" podmonitors= namespace=rabbitmq-monitor prometheus=kube-prometheus-stack-1610-prometheus
level=debug ts=2021-01-08T01:06:51.906724664Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Watch close - *v1.PodMonitor total 2 items received"
level=debug ts=2021-01-08T01:06:51.908978768Z caller=klog.go:72 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/apis/monitoring.coreos.com/v1/podmonitors?       allowWatchBookmarks=true&resourceVersion=1102&timeout=7m51s&timeoutSeconds=471&watch=true 200 OK in 2 milliseconds"
…

 

c. Workaround: resolved by configuring scraping annotations instead of PodMonitor
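For reference, the annotation side of that workaround lives on the pod template and matches the kubernetes-pods scrape job from section 4 (the port shown is RabbitMQ's Prometheus port; adjust per workload):

```yaml
# Pod template metadata picked up by the kubernetes-pods job:
# the "scrape" annotation drives the keep rule, and the "port"
# annotation is rewritten into __address__.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15692"
```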
