Kubernetes (80)

GPU Operator Install on Ubuntu
2021.6.9
1. Environments
- Software
  ✓ Ubuntu 18.04.5 LTS (Bionic Beaver), Kubernetes 1.16.15, Docker 19.03.15
  ✓ NVIDIA Driver 460.73.01, cuda-libraries-11-2, libcudnn8_8.2.1.32 - installed separately
  ✓ GPU Operator 1.7.0
    NVIDIA k8s device plugin 0.9.0
    NVIDIA container toolkit 1.7.0
    NVIDIA DCGM-exporter 2.1.8-2.4.0
    Node Feature Discovery 0.6.0
    GPU Feature Discovery 0.4.1
- GPU Card
  ✓ NVIDIA Tesla V100
2. NVIDI..
2021. 9. 21.

GPU Operator on CentOS
2020.12.23
1. NVIDIA GPU Operator
- https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/
- Simplifying GPU Management in Kubernetes
- To provision GPU worker nodes in a Kubernetes cluster, the following NVIDIA software components are required: the driver, container runtime, device plugin, and monitoring. The GPU Operator simplifies both the initial depl..
2021. 9. 21.

Helm
2020.09.04
1. Helm
Helm is the Kubernetes package manager.
2. Setting up Helm 3.3.1
- https://helm.sh/docs/intro/install/
a. Install Helm
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/v3.3.1/scripts/get-helm-3
$ chmod 744 get_helm.sh
$ ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.3.1-darwin-amd64.tar.gz
Preparing to install helm into /usr/local/bin
Password:
helm inst..
2021. 9. 21.

kube-prometheus-stack
2021.02.24
1. Prometheus?
- https://github.com/prometheus-operator/kube-prometheus
- kube-prometheus-stack
  ✓ Installs the kube-prometheus stack, a collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules combined with documentation and scripts to provide easy-to-operate, end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator.
  ✓ Prometheus is de..
2021. 9. 21.

GPU Monitor
2020.12.23
1. GPU Monitor
- Prometheus
  Prometheus is deployed along with kube-state-metrics and node_exporter to expose cluster-level metrics for Kubernetes API objects and node-level metrics such as CPU utilization.
- DCGM-Exporter (https://github.com/NVIDIA/gpu-monitoring-tools)
  It exposes GPU metrics to Prometheus by leveraging NVIDIA DCGM.
- kube-state-metrics
  kube-state-metrics is a s..
2021. 9. 21.

Elastic Observability
2020.10.21
1. Elastic Observability
a. Elastic Observability (https://www.elastic.co/guide/en/kibana/7.9/observability.html)
- Combines your logs, metrics, and APM data for unified visibility and analysis using one tool.
  Elastic Observability: Logs + Metrics + APM + Uptime
b. Metrics app (https://www.elastic.co/guide/en/kibana/7.9/xpack-infra.html)
- The Metrics app in Kibana enables you to mon..
2021. 9. 20.

Jenkins
2021.03.20
1. Jenkins?
- Jenkins is a self-contained, open source automation server which can be used to automate all sorts of tasks related to building, testing, and delivering or deploying software.
- https://www.jenkins.io/
2. Environments
- Kubernetes 1.16.15
- jenkinsci/jenkins chart 3.2.4
- Jenkins 2.277.1
- Jenkins plug-in
  ✓ Kubernetes plugin 1.29.2
  ✓ Matrix Authorization Strategy 2.6.5 ..
2021. 9. 18.

Harbor
2021.4.29
1. Harbor (Private Docker Registry, https://goharbor.io/)
- What is Harbor?
  Harbor is an open source container image registry that secures images with role-based access control, scans images for vulnerabilities, and signs images as trusted. A CNCF Incubating project, Harbor delivers compliance, performance, and interoperability to help you consistently and securely manage images across..
2021. 9. 18.

GitLab
2021.03.20
1. GitLab?
- GitLab is a web-based DevOps lifecycle tool that provides a Git repository manager with wiki, issue-tracking, and continuous integration and deployment pipeline features.
- GitLab license
  ✓ GitLab Community Edition (MIT License) vs GitLab Enterprise Edition (EE) license
    https://about.gitlab.com/install/ce-or-ee/
  ✓ GitLab Community Edition is open source, with an MIT ..
2021. 9. 17.

Argo CD
2021.4.20
1. Argo CD
- Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.
- https://argoproj.github.io/argo-cd/
- Argo CD is largely stateless; all data is persisted as Kubernetes objects, which in turn are stored in Kubernetes' etcd. Redis is only used as a throw-away cache and can be lost. When lost, it will be rebuilt without loss of service.
2. Environments
- Kubernetes..
2021. 9. 16.

Rook Ceph - DiskPressure
2020.11.30
a. Problem: DiskPressure
- Environments
  Kubernetes 1.16.15, Rook Ceph 1.3.8, CentOS 7.8
[iap@iap01 ~]$ k get pod -n rook-ceph -o wide | egrep -v "Run|Com"
NAME READY STATUS RESTARTS AGE IP NODE ...
csi-cephfsplugin-tf82b 0/3 Evicted 0 13m iap04
csi-rbdplugin-jzkxk 0/3 Evicted 0 1s iap04
[iap@iap01 ~]$ k describe pod csi-cephfsplugin-tf82b -n rook-ceph | grep Events -A10
Events:
Type Re..
2021. 9. 16.

Rook Ceph - scrub error
2021.04.14
a. Problem: scrub error
Environments: Kubernetes 1.16.15, Rook Ceph 1.3.8
Data damage occurred in specific PGs (placement groups).
A Placement Group (PG) is a logical collection of objects that are replicated on OSDs to provide reliability in a storage system.
[iap@iap01 ~]$ ceph-toolbox.sh
[root@rook-ceph-tools-79d7c49c8d-kp6xh /]# ceph status
  cluster:
    id: 1ef6e249-005e-477e-999b-b874f9fa0854
    health..
2021. 9. 16.

Rook Ceph - rook-ceph-osd POD is CrashLoopBackOff
2021.05.10
a. Problem: the rook-ceph-osd-19-5b8c7f4787-klrfr POD is in the CrashLoopBackOff state
- Environments
  Kubernetes 1.16.15, Rook Ceph 1.3.8
[iap@iap01 ~]$ k get pod -n rook-ceph -o wide | egrep 'NAME|osd-[0-9]'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph-osd-12-686858c5dd-hsxh7 1/1 Running 1 37h 10.244.10.105 iap10
rook-ceph-osd-13-584d4ff974-wdtq9 1/1 Running 1 37h ..
2021. 9. 16.

Rook Ceph - pgs undersized
2020.12.31
a. Problem: pgs undersized
- Environments
  Kubernetes 1.16.15, Rook Ceph 1.3.8
[root@rook-ceph-tools-79d7c49c8d-4c4x5 /]# ceph status
  cluster:
    id: 1ef6e249-005e-477e-999b-b874f9fa0854
    health: HEALTH_WARN
            Degraded data redundancy: 2/1036142 objects degraded (0.000%), 2 pgs degraded, 14 pgs undersized
…
b. Cause analysis
- undersized
  The placement group has fewer copies than the configur..
2021. 9. 16.

Rook Ceph - OSD autoout
2021.05.14
a. Problem: OSD autoout
- Environments
  Kubernetes 1.16.15, Rook Ceph 1.3.8
- A specific OSD (object storage device) is in the autoout state, and the related rook-ceph-osd-[number] POD has not started.
[root@rook-ceph-tools-79d7c49c8d-kp6xh /]# ceph osd status
+----+-------+-------+-------+--------+---------+--------+---------+----------------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+..
2021. 9. 16.
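
The health line in the pgs undersized entry reports "2/1036142 objects degraded (0.000%)", which can look contradictory: two objects are degraded, yet the percentage is zero. The sketch below only redoes that arithmetic on the line as quoted in the excerpt. parse_degraded is a hypothetical helper written for illustration, not part of Ceph or Rook, and the regular expressions assume the wording of this one line.

```python
import re

# Health line copied from the `ceph status` output in the "pgs undersized" entry.
HEALTH_LINE = (
    "Degraded data redundancy: 2/1036142 objects degraded (0.000%), "
    "2 pgs degraded, 14 pgs undersized"
)


def parse_degraded(line):
    """Pull the object and PG counts out of a 'Degraded data redundancy' line.

    Hypothetical helper for illustration; it assumes the exact wording above.
    """
    objects = re.search(r"(\d+)/(\d+) objects degraded", line)
    degraded_pgs = re.search(r"(\d+) pgs degraded", line)
    undersized_pgs = re.search(r"(\d+) pgs undersized", line)
    if not (objects and degraded_pgs and undersized_pgs):
        raise ValueError("unexpected health line format")
    degraded, total = int(objects.group(1)), int(objects.group(2))
    return {
        "degraded_objects": degraded,
        "total_objects": total,
        "degraded_percent": 100.0 * degraded / total,
        "degraded_pgs": int(degraded_pgs.group(1)),
        "undersized_pgs": int(undersized_pgs.group(1)),
    }


if __name__ == "__main__":
    info = parse_degraded(HEALTH_LINE)
    # Three decimals, as in the excerpt: the ratio rounds down to zero.
    print(f"{info['degraded_percent']:.3f}%")  # 0.000%
    # Six decimals show the ratio is small but not zero.
    print(f"{info['degraded_percent']:.6f}%")  # 0.000193%
```

The percentage is therefore a rounding effect, so the counts (2 pgs degraded, 14 pgs undersized) are the figures to read in this line.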