Thursday, 7 May 2026

TASK K: Kubernetes Monitoring (Prometheus + Grafana + k9s) — Step-by-Step Guide

Overview

In this task, you add Kubernetes-native monitoring to your k3s cluster using Prometheus (metrics collection), Grafana (visualization), and k9s (terminal UI). This complements the Datadog infrastructure monitoring from Task D.

Why both Datadog AND Prometheus?

  • Datadog (Task D): Monitors the operating system — CPU, memory, disk, network. Installed via Ansible. Runs as a system service. Data goes to Datadog's cloud (SaaS).
  • Prometheus (Task K): Monitors Kubernetes objects — pods, deployments, containers, HPA. Runs INSIDE the cluster as pods. Data stays in-cluster.

Think of it this way: Datadog is the building security camera (watches the physical servers). Prometheus is the restaurant manager (watches the kitchen staff — your pods).

What you'll do:

  1. Understand the monitoring architecture
  2. Learn Helm — charts, releases, repositories, values, upgrade, rollback
  3. Install kube-prometheus-stack via Helm (Helm as the real-world use case)
  4. Access Grafana and explore pre-built dashboards
  5. Monitor HealthPulse pods and observe scaling events
  6. Create a custom HealthPulse dashboard with PromQL
  7. Set up alerts in Grafana
  8. Use k9s alongside Grafana for cluster management
  9. Compare Datadog vs Prometheus
  10. Document everything in MkDocs

Prerequisites

Before starting Task K, ensure you have completed:

  • k3s cluster running (1 master + 2 workers), application deployed
  • Datadog agents installed on all nodes (OS-level monitoring)
  • kubectl configured with KUBECONFIG=~/.kube/healthpulse-config
  • Application deployed in the healthpulse-dev, healthpulse-uat, and healthpulse-prod namespaces

Verify your cluster:

export KUBECONFIG=~/.kube/healthpulse-config
kubectl get nodes
# All 3 nodes should show Ready

Step 1: Understand the Monitoring Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         k3s Cluster                              │
│                                                                  │
│  ┌─── monitoring namespace ────────────────────────────────────┐ │
│  │                                                              │ │
│  │  ┌──────────────┐  scrapes   ┌────────────────────────────┐ │ │
│  │  │  Prometheus   │◄─every───│  kube-state-metrics         │ │ │
│  │  │  Server       │  30s     │  (pod/deploy/svc counts)    │ │ │
│  │  │              │          └────────────────────────────┘ │ │
│  │  │  Stores all   │  scrapes   ┌────────────────────────────┐ │ │
│  │  │  time-series   │◄─every───│  Node Exporter (DaemonSet) │ │ │
│  │  │  metrics       │  30s     │  (CPU/mem/disk per node)   │ │ │
│  │  └──────┬───────┘          └────────────────────────────┘ │ │
│  │         │ PromQL queries                                    │ │
│  │         ▼                     ┌────────────────────────────┐ │ │
│  │  ┌──────────────┐            │  Alertmanager              │ │ │
│  │  │   Grafana     │            │  (email/Slack alerts)      │ │ │
│  │  │  Dashboards   │            └────────────────────────────┘ │ │
│  │  └──────┬───────┘                                           │ │
│  └─────────┼───────────────────────────────────────────────────┘ │
│            │ port-forward :3000                                   │
│            ▼                                                      │
│  YOUR BROWSER → localhost:3000                                   │
│                                                                  │
│  ┌─── healthpulse-prod ──┐                                       │
│  │  Pod 1  │  Pod 2      │ ◄── Prometheus scrapes automatically  │
│  └─────────┴─────────────┘                                       │
└──────────────────────────────────────────────────────────────────┘

The scraping model: Prometheus uses a pull model — it calls each target's /metrics HTTP endpoint every 30 seconds, collects the data, and stores it as time-series. Grafana then queries Prometheus using PromQL.

Key insight: Datadog agents PUSH metrics to the cloud. Prometheus PULLS metrics from targets. Different approach, same goal.
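You can see what a scrape target actually serves once the stack from Step 3 is running. A minimal sketch, assuming the Node Exporter DaemonSet listens on its default host port 9100 (output abbreviated and illustrative, values will differ):

# On any cluster node: fetch the same text-format metrics Prometheus pulls
curl -s http://localhost:9100/metrics | head -5
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78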


Step 2: Learn Helm

2.1 — The Problem Helm Solves

Before Helm, installing something like a monitoring stack on Kubernetes meant managing dozens of YAML files by hand. Consider what the Prometheus + Grafana stack actually requires:

Without Helm — you manage ALL of this manually:
├── prometheus-deployment.yml
├── prometheus-service.yml
├── prometheus-configmap.yml          (scrape config, 200+ lines)
├── prometheus-rbac.yml               (ClusterRole, ClusterRoleBinding, ServiceAccount)
├── prometheus-persistentvolume.yml
├── grafana-deployment.yml
├── grafana-service.yml
├── grafana-configmap.yml             (datasources, dashboards)
├── grafana-secret.yml                (admin password)
├── alertmanager-deployment.yml
├── alertmanager-configmap.yml
├── alertmanager-service.yml
├── node-exporter-daemonset.yml       (runs on every node)
├── node-exporter-service.yml
├── kube-state-metrics-deployment.yml
├── kube-state-metrics-rbac.yml
└── ... (20+ more files)

And every time you upgrade, you diff all of them manually. Every environment (dev, UAT, prod) needs its own copy with slightly different values.

Helm solves this — all 20+ manifests become one install command, with configuration in one place.


2.2 — What Helm Is

Helm is the package manager for Kubernetes — like apt for Ubuntu or brew for Mac, but for Kubernetes applications.

apt install nginx          →   helm install nginx ingress-nginx/ingress-nginx
brew install postgresql    →   helm install postgres bitnami/postgresql
npm install react          →   helm install monitoring prometheus-community/kube-prometheus-stack

Three concepts to understand:

Chart — A Helm package. Contains templated Kubernetes YAML files + default configuration. Think of it like a .deb package (apt) or a formula (brew).

Release — A running instance of a chart in your cluster. You can install the same chart multiple times with different names and configs. Each installation is a separate release.

Repository — A collection of charts, hosted on a URL. Like npm registry or apt sources.

Repository                    Chart                         Release
─────────────────────         ──────────────────────        ─────────────────────
prometheus-community     →    kube-prometheus-stack    →    "monitoring" (your name)
(chart server)                (the package)                 (running in cluster)
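To make the chart-vs-release distinction concrete, here is a hypothetical illustration; the grafana repo, release names, and namespaces below are examples only, not part of this task:

# One chart, two independent releases (purely illustrative)
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana-dev  grafana/grafana -n dev  --create-namespace
helm install grafana-prod grafana/grafana -n prod --create-namespace
helm list -A    # both releases appear, each tracked separately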

2.3 — Helm Architecture

┌─────────────────────────────────────────────────────┐
│  Chart Repository (remote server)                   │
│  e.g. prometheus-community.github.io/helm-charts    │
│  ├── kube-prometheus-stack-65.1.0.tgz               │
│  ├── kube-prometheus-stack-64.0.0.tgz               │
│  └── ...                                            │
└────────────────────┬────────────────────────────────┘
                     │  helm repo add / helm install
                     ▼
┌─────────────────────────────────────────────────────┐
│  Your Machine (Helm CLI)                            │
│  ├── ~/.cache/helm/repository/   (cached charts)    │
│  └── reads KUBECONFIG → talks to cluster API        │
└────────────────────┬────────────────────────────────┘
                     │  generates + applies YAML
                     ▼
┌─────────────────────────────────────────────────────┐
│  k3s Cluster                                        │
│  └── monitoring namespace                           │
│      ├── Deployment: grafana                        │
│      ├── Deployment: kube-state-metrics             │
│      ├── StatefulSet: prometheus                    │
│      ├── DaemonSet: node-exporter                   │
│      ├── Service: monitoring-grafana                │
│      └── ... (20+ resources, all managed by Helm)   │
└─────────────────────────────────────────────────────┘

Helm also stores release history as secrets inside the cluster — which is how helm rollback works.


2.4 — Install Helm

On the k3s master (or your laptop):

Mac:

brew install helm

Linux (including EC2 master):

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Windows:

scoop install helm
# or
choco install kubernetes-helm

Verify:

helm version
# → version.BuildInfo{Version:"v3.x.x", ...}

Helm uses your KUBECONFIG — whichever cluster kubectl talks to, Helm talks to as well. No separate config needed.
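A quick sanity check that Helm is talking to the right cluster (expect an empty list before Step 3):

export KUBECONFIG=~/.kube/healthpulse-config
helm list -A    # no error means Helm can reach the cluster API; empty until you install something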


2.5 — Core Helm Commands

Before installing anything, get familiar with the Helm CLI:

# ── Repositories ─────────────────────────────────────────────────
helm repo add <name> <url>       # Add a chart repository
helm repo update                 # Refresh the local index from all repos
helm repo list                   # Show all added repositories
helm search repo <keyword>       # Search charts across all repos

# ── Inspecting Charts (before installing) ────────────────────────
helm show chart <repo/chart>     # Show chart metadata (name, version, description)
helm show values <repo/chart>    # Show ALL configurable values (can be 1000+ lines)
helm template <name> <repo/chart> --values <file>  # Preview the YAML that would be applied

# ── Installing & Managing ─────────────────────────────────────────
helm install <release> <repo/chart> [flags]        # Install a chart
helm upgrade <release> <repo/chart> [flags]        # Upgrade an existing release
helm rollback <release> [revision]                 # Rollback to a previous version
helm uninstall <release> -n <namespace>            # Remove a release

# ── Inspecting Releases ──────────────────────────────────────────
helm list -n <namespace>         # List all releases in a namespace
helm list -A                     # List releases across all namespaces
helm status <release> -n <namespace>   # Show release status and notes
helm get values <release> -n <namespace>   # Show values used for a release
helm get manifest <release> -n <namespace> # Show all YAML applied by a release
helm history <release> -n <namespace>      # Show upgrade/rollback history

You'll use all of these in the steps below.


Step 3: Install kube-prometheus-stack via Helm

Now apply what you just learned. You'll install the entire Prometheus + Grafana monitoring stack using a single Helm chart.

3.1 — Add the Chart Repository

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Verify it's there:

helm repo list
NAME                    URL
prometheus-community    https://prometheus-community.github.io/helm-charts

Search the repo to confirm the chart exists:

helm search repo prometheus-community/kube-prometheus-stack
NAME                                            CHART VERSION   APP VERSION   DESCRIPTION
prometheus-community/kube-prometheus-stack      65.1.0          v0.79.2       kube-prometheus-stack collects Kubernetes...

This tells you the chart version (65.1.0) and the app version it ships (v0.79.2 here is the Prometheus Operator version, not Prometheus itself). Pin the chart version in production to prevent unexpected upgrades.
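Pinning is done with the --version flag, which helm install and helm upgrade both accept. For example, the same install command you will run in Step 3.5, pinned to the chart version shown above:

helm install monitoring prometheus-community/kube-prometheus-stack --version 65.1.0 \
  --namespace monitoring --create-namespace \
  -f kubernetes/monitoring/values.yml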


3.2 — Inspect the Chart Before Installing

Always look before you install. Two commands to know:

# Show chart metadata — what it contains, dependencies, maintainers
helm show chart prometheus-community/kube-prometheus-stack
# Show all configurable values — scroll through to understand what can be changed
helm show values prometheus-community/kube-prometheus-stack | head -100

The values output is typically 1000+ lines. This is your configuration reference — every setting you might want to override is listed here with its default.
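Rather than scrolling through all of it, you can grep for the setting you care about. A small sketch, assuming the retention key you override later appears under that name in this chart's values:

helm show values prometheus-community/kube-prometheus-stack | grep -n -B2 -A2 "retention:"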


3.3 — Understand the Values File

Instead of passing every option as --set flags on the command line, Helm lets you put all your configuration overrides in a values.yml file.

Open kubernetes/monitoring/values.yml:

grafana:
  adminUser: admin
  adminPassword: healthpulse123    # override the default random password
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 300m
      memory: 256Mi
  service:
    type: ClusterIP                # use port-forward, not LoadBalancer

prometheus:
  prometheusSpec:
    retention: 7d                  # keep metrics for 7 days
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
    storageSpec: {}                # emptyDir — data lost on pod restart (fine for capstone)

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 128Mi

How values work:

Chart defaults (values.yaml in chart)
         +
Your overrides (kubernetes/monitoring/values.yml)
         =
Final configuration applied to cluster

You only need to specify what you want to change. Everything else uses the chart's defaults. This is the key advantage over managing raw YAML — you only touch what matters to you.

--set vs -f values.yml:

Method            Use When
--set key=value   One or two quick overrides, testing
-f values.yml     Multiple settings, version-controlled config

For this capstone, we use a values.yml file so your configuration is committed to Git alongside the rest of the code.
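For comparison, here is a sketch of the same two overrides from the values file expressed with --set flags (value paths match the values.yml above). This is only to show the syntax; in this capstone stick with the values file:

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=healthpulse123 \
  --set prometheus.prometheusSpec.retention=7d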


3.4 — Preview What Will Be Installed (helm template)

Before actually installing, use helm template to render the YAML that Helm would apply. This is useful for debugging and learning:

helm template monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f kubernetes/monitoring/values.yml | head -80

This outputs all the Kubernetes YAML that will be created — Deployments, Services, ConfigMaps, RBAC — without touching your cluster. Pipe it to a file to read through it:

helm template monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f kubernetes/monitoring/values.yml > /tmp/monitoring-preview.yml

wc -l /tmp/monitoring-preview.yml   # see how many lines Helm generates for you

Teaching moment: Count the lines. The chart generates thousands of lines of YAML that you would otherwise maintain by hand. This is what Helm does for you.
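A rough way to count the Kubernetes objects (not just lines) in the rendered output, assuming every manifest has a top-level kind: field at column 0:

grep -c '^kind:' /tmp/monitoring-preview.yml    # approximate number of resources the chart generates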


3.5 — Install the Stack

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f kubernetes/monitoring/values.yml

Breaking down the command:

Part                                          What It Does
helm install                                  Install a chart as a new release
monitoring                                    Release name — your chosen name for this installation; use it with helm upgrade, helm status, helm uninstall
prometheus-community/kube-prometheus-stack    <repo-name>/<chart-name> — which chart to install
--namespace monitoring                        Install all resources into this namespace
--create-namespace                            Create the namespace if it doesn't exist (saves a separate kubectl create namespace)
-f kubernetes/monitoring/values.yml           Apply your configuration overrides

Helm prints a summary and release notes when done — read them. They often contain the next steps (like how to access the UI).


3.6 — Inspect the Release

# See the release is deployed
helm list -n monitoring
NAME        NAMESPACE   REVISION  STATUS    CHART                           APP VERSION
monitoring  monitoring  1         deployed  kube-prometheus-stack-65.1.0    v0.79.2
# Show status and the release notes
helm status monitoring -n monitoring
# Show what values are actually in use (your overrides merged with chart defaults)
helm get values monitoring -n monitoring
# Show the full YAML that was applied to the cluster
helm get manifest monitoring -n monitoring | head -50
# Show release history (useful later when you upgrade)
helm history monitoring -n monitoring
REVISION  STATUS    CHART                           DESCRIPTION
1         deployed  kube-prometheus-stack-65.1.0    Install complete

3.7 — What Gets Installed

kubectl get pods -n monitoring

Wait 2–3 minutes. All pods should show Running:

NAME                                                         READY   STATUS    AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0       2/2     Running   2m
monitoring-grafana-7f8c9d6b4-xxxxx                           3/3     Running   2m
monitoring-kube-prometheus-operator-6b4c9f8d7-xxxxx          1/1     Running   2m
monitoring-kube-state-metrics-5c6d8f9b7-xxxxx                1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                    1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                    1/1     Running   2m
monitoring-prometheus-node-exporter-xxxxx                    1/1     Running   2m
prometheus-monitoring-kube-prometheus-prometheus-0           2/2     Running   2m

Component             Kind          What It Does
Prometheus            StatefulSet   Scrapes and stores all metrics
Grafana               Deployment    Dashboard UI for visualizing metrics
Node Exporter         DaemonSet     Exposes OS metrics from each node (one pod per node — that's why you see 3)
kube-state-metrics    Deployment    Exposes Kubernetes object metrics (pod count, deploy status)
Alertmanager          StatefulSet   Routes alerts to email, Slack, PagerDuty
Prometheus Operator   Deployment    Manages Prometheus config via CRDs — the "brain" of the stack

Notice the 3 node-exporter pods. Node Exporter is a DaemonSet — Kubernetes automatically runs one copy on every node. Helm configured this for you.

kubectl get svc -n monitoring

Checkpoint: All pods Running and services created. One command installed 8 pods, 10+ services, ConfigMaps, RBAC, and CRDs. That is what Helm does.


3.8 — Upgrading a Release (Optional)

Later, if you want to change the Grafana password or increase Prometheus retention, edit kubernetes/monitoring/values.yml and run:

helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f kubernetes/monitoring/values.yml

Helm applies only the diff — it doesn't reinstall everything. The revision number increments:

helm history monitoring -n monitoring
REVISION  STATUS      CHART                           DESCRIPTION
1         superseded  kube-prometheus-stack-65.1.0    Install complete
2         deployed    kube-prometheus-stack-65.1.0    Upgrade complete
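If you only want to tweak a single value without editing the file, helm upgrade also supports --reuse-values, which keeps the release's existing values and merges your change on top. A sketch (the 14d retention is just an example):

helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --reuse-values \
  --set prometheus.prometheusSpec.retention=14d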

3.9 — Rolling Back

If an upgrade breaks something, rollback to the previous revision:

# Rollback to revision 1
helm rollback monitoring 1 -n monitoring

# Verify
helm history monitoring -n monitoring
REVISION  STATUS      DESCRIPTION
1         superseded  Install complete
2         superseded  Upgrade complete
3         deployed    Rollback to 1

This is possible because Helm stores release history as Kubernetes secrets. Each revision is saved — Helm can reconstruct any previous state. Compare this to manually applying kubectl apply — there is no rollback history at all.
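You can see those history records yourself. Helm 3 stores each revision as a Secret in the release namespace; the names below show the usual pattern, and your revision suffixes will match your helm history output:

kubectl get secrets -n monitoring -l owner=helm
# sh.helm.release.v1.monitoring.v1     helm.sh/release.v1   ...
# sh.helm.release.v1.monitoring.v2     helm.sh/release.v1   ...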


3.10 — If You Need to Reinstall

helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring

# Wait for namespace to terminate
kubectl get namespace monitoring   # keep running until it disappears

# Then reinstall
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f kubernetes/monitoring/values.yml

Step 4: Access Grafana



4.1.1 — Option A: Traefik Ingress (Recommended — DNS required)

Grafana gets its own subdomain routed through Traefik, exactly like your application environments. No port-forward needed, no open terminal.

The wildcard DNS record (*.team-healthpulse.com) already covers grafana.team-healthpulse.com — no Terraform changes needed.

Apply the Grafana ingress:

kubectl apply -f kubernetes/ingress-grafana.yml

# Verify Traefik picked it up
kubectl get ingress -n monitoring
NAME              CLASS     HOSTS                           ADDRESS     PORTS   AGE
grafana-ingress   traefik   grafana.team-healthpulse.com   10.43.0.1   80      10s

Open http://grafana.team-healthpulse.com in your browser.

Why this works without DNS changes: The Terraform DNS config includes a wildcard A record (*.team-healthpulse.com → k3s master EIP). Any subdomain not explicitly defined — including grafana — automatically resolves to the k3s master, where Traefik picks it up and routes it based on the Ingress hostname rule.
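You can confirm the wildcard resolution from your laptop before opening the browser (assuming dig is installed; the returned IP should be the k3s master Elastic IP):

dig +short grafana.team-healthpulse.com
# → <k3s master Elastic IP>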

4.1.2 — Option B: Port-Forward

# Local machine
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

# EC2 master (bind to all interfaces so your laptop's browser can reach it)
kubectl port-forward --address 0.0.0.0 -n monitoring svc/monitoring-grafana 3000:80

Keep this terminal running. Open a new terminal for other commands.

4.2 — Login

  1. Open browser: http://localhost:3000
  2. Username: admin / Password: healthpulse123

4.3 — Grafana UI Overview

Area           Where                       What It Shows
Dashboards     Left sidebar → Dashboards   Browse all pre-built and custom dashboards
Explore        Left sidebar → Explore      Free-form PromQL query editor
Alerting       Left sidebar → Alerting     Alert rules, notification channels
Data Sources   Settings → Data Sources     Prometheus (pre-configured)

Step 5: Explore Pre-Built Dashboards

The kube-prometheus-stack ships with 20+ production-grade dashboards.

  1. Click Dashboards → Browse
  2. Look for dashboards starting with Kubernetes / and Node Exporter /

Key Dashboard 1: Kubernetes / Compute Resources / Namespace (Pods)

Select a namespace from the dropdown to see:

Panel          What It Shows
CPU Usage      CPU consumed by each pod — are pods CPU-starved?
CPU Quota      CPU requested vs actual — over-provisioning or under-provisioning?
Memory Usage   Memory per pod — approaching limits? (OOMKill risk)
Network I/O    Bytes sent/received per pod

Key Dashboard 2: Node Exporter / Nodes

Shows each EC2 instance's resources — compare with Datadog:

Panel             Datadog Equivalent
CPU Busy          system.cpu.user
Memory Usage      system.mem.used
Disk Space        system.disk.in_use
Network Traffic   system.net.bytes_rcvd

Key insight: The numbers from Node Exporter and Datadog should be very close. Cross-referencing validates both monitoring systems.

Key Dashboard 3: Kubernetes / Networking / Namespace (Pods)

Shows receive/transmit bandwidth and packet rates per pod.


Step 6: Monitor HealthPulse Pods

6.1 — View Production Metrics

  1. Open Kubernetes / Compute Resources / Namespace (Pods) dashboard
  2. Select namespace: healthpulse-prod
  3. Observe CPU and memory for every production pod

6.2 — Deploy and Watch Metrics Change

Terminal 1: Keep Grafana open on the namespace dashboard

Terminal 2: Deploy a new version:

kubectl set image deployment/healthpulse-portal \
  healthpulse-portal=<ARTIFACTORY_URL>/healthpulse-portal:2.0.0 \
  -n healthpulse-prod

Watch Grafana — old pod lines end, new pod lines appear, with a brief overlap (rolling update).

6.3 — Trigger HPA and Watch Scaling

# Generate load
kubectl run load-test --image=busybox -n healthpulse-prod --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://healthpulse-service/health; done"

# In another terminal — watch HPA scale
kubectl get hpa -n healthpulse-prod -w

# Clean up when done
kubectl delete pod load-test -n healthpulse-prod

Watch Grafana: CPU climbs, new pod lines appear as HPA scales up.
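Besides Grafana and kubectl get hpa -w, the scale-up also shows in cluster events, which is handy evidence for your documentation. A sketch, assuming the HPA emits the standard SuccessfulRescale event reason:

kubectl get events -n healthpulse-prod --field-selector reason=SuccessfulRescale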


Step 7: Create a Custom HealthPulse Dashboard

  1. Click Dashboards → New → New Dashboard
  2. Rename to: HealthPulse — Kubernetes Overview

Panel 1: Pod Count by Namespace

  • PromQL: count(kube_pod_info) by (namespace)
  • Type: Bar gauge or Stat
  • Counts pods in each namespace using kube-state-metrics data.

Panel 2: CPU Usage by Pod (Production)

  • PromQL: rate(container_cpu_usage_seconds_total{namespace="healthpulse-prod", container!="", container!="POD"}[5m])
  • Type: Time series
  • rate() converts a cumulative counter into per-second usage over 5 minutes. The container!="POD" filter excludes Kubernetes pause containers.

Panel 3: Memory Usage Trend

  • PromQL: container_memory_usage_bytes{namespace="healthpulse-prod", container!="", container!="POD"}
  • Type: Time series
  • Unit: bytes (Standard options → Unit → Data → bytes)
  • Memory is a gauge (goes up and down) — no rate() needed.

Panel 4: Pod Restart Count

  • PromQL: kube_pod_container_status_restarts_total{namespace=~"healthpulse-.*"}
  • Type: Stat or Table
  • High restart count = CrashLoopBackOff. Healthy pods show 0.

Panel 5: HPA Replica Count

  • PromQL: kube_horizontalpodautoscaler_status_current_replicas{namespace="healthpulse-prod"}
  • Type: Time series
  • Optionally add a second query: kube_horizontalpodautoscaler_status_desired_replicas{namespace="healthpulse-prod"}
  • When these lines diverge, the cluster is actively scaling.

Save the dashboard (disk icon at top).
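If a panel shows no data, it helps to test the PromQL outside Grafana first. A sketch using the Prometheus HTTP API through a port-forward (service name as installed by this chart; swap in whichever panel query you are debugging):

kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(kube_pod_info) by (namespace)'
# JSON response with "status":"success" and one result per namespace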


Step 8: Set Up Alerts in Grafana

8.1 — Alert 1: Pod Crash-Looping

  1. Alerting → Alert rules → New alert rule
  2. Name: Pod CrashLooping — HealthPulse
  3. PromQL:
    increase(kube_pod_container_status_restarts_total{namespace=~"healthpulse-.*"}[5m]) > 3
    
  4. Evaluate every: 1m | For: 5m
  5. Label: severity: warning

Fires when any HealthPulse pod restarts more than 3 times in 5 minutes.

8.2 — Alert 2: High CPU

  1. Name: High CPU — HealthPulse Prod
  2. PromQL:
    (sum(rate(container_cpu_usage_seconds_total{namespace="healthpulse-prod", container!="", container!="POD"}[5m])) by (pod)) > 0.8
    
  3. Evaluate every: 1m | For: 5m
  4. Label: severity: critical

8.3 — Configure Notification Channel

  1. Alerting → Contact points → New contact point
  2. Name: HealthPulse Notifications
  3. Type: Email (enter your address) or Slack (enter webhook URL, channel #healthpulse-alerts)
  4. Notification policies → edit default policy to use your contact point

Step 9: Install and Use k9s

k9s is a terminal-based Kubernetes UI — like htop for your cluster.

9.1 — Install

Mac:      brew install derailed/k9s/k9s
Linux:    curl -sS https://webinstall.dev/k9s | bash
Windows:  scoop install k9s

9.2 — Launch

KUBECONFIG=~/.kube/healthpulse-config k9s

9.3 — Navigation Commands

Type : to enter command mode:

Command   View
:pods     All pods
:deploy   Deployments
:svc      Services
:ns       Namespaces
:nodes    Cluster nodes
:hpa      Horizontal Pod Autoscalers
:events   Recent cluster events

9.4 — Actions on Resources

Key     Action
l       View logs
s       Shell into pod
d       Describe resource
y       View YAML
/       Filter/search
Enter   Drill into resource
Esc     Go back
:q      Quit

9.5 — Cross-Reference with Grafana

Metric         k9s                       Grafana
Pod count      :pods — count rows        Pod Count panel
CPU            :pods — CPU column        CPU Usage panel
Restarts       :pods — RESTARTS column   Restart Count panel
HPA replicas   :hpa — REPLICAS column    HPA panel

When to use which: k9s for real-time interactive checks ("What's happening NOW?"). Grafana for trends and history ("What happened over the last 6 hours?").


Step 10: Compare Datadog vs Prometheus

Aspect           Datadog (Task D)                Prometheus + Grafana (Task K)
Scope            Infrastructure/OS metrics       Kubernetes-native metrics
Where it runs    Agent on each EC2               Pods inside k3s cluster
Installed via    Ansible                         Helm
Data storage     Datadog cloud (SaaS)            In-cluster (Prometheus pod)
Dashboards       Datadog web UI                  Grafana (self-hosted)
Query language   Datadog queries                 PromQL
Alerting         Datadog Monitors                Alertmanager + Grafana
Cost             Free for 5 hosts, paid beyond   Free and open source
Industry         Enterprise (Netflix, Airbnb)    Cloud-native standard (CNCF)

When to use which (they complement each other):

Scenario                       Best Tool
EC2 running out of disk?       Datadog
Pods keep restarting?          Prometheus
HPA replica count?             Prometheus
Nginx healthy on bare-metal?   Datadog
Server went down?              Datadog
Pod crash-loop alert?          Prometheus/Grafana

The real-world pattern:

Layer 1: Infrastructure monitoring (Datadog) → "Are my machines healthy?"
Layer 2: Kubernetes monitoring (Prometheus)  → "Are my applications healthy?"
Layer 3: Application monitoring (APM)        → "Are my users happy?"

You've now built Layers 1 and 2.


Step 11: Document in MkDocs

11.1 — Add to Architecture Documentation

Add a monitoring section to docs/architecture.md describing the two-layer monitoring approach and Prometheus components.

11.2 — Add an ADR

Create docs/adr/007-monitoring-tools.md:

  • Decision: Datadog for infrastructure, Prometheus + Grafana for Kubernetes, k9s for interactive management
  • Rationale: Datadog watches machines, Prometheus watches workloads, they complement each other
  • Consequences: Two systems to maintain, Prometheus retention limited by in-cluster storage

11.3 — Screenshots

Add screenshots of:

  1. Node Exporter / Nodes dashboard
  2. Namespace (Pods) dashboard for healthpulse-prod
  3. Your custom HealthPulse dashboard

Acceptance Criteria Checklist

  •  Helm installed (helm version)
  •  kube-prometheus-stack deployed (kubectl get pods -n monitoring — all Running)
  •  Grafana accessible (via Ingress at grafana.team-healthpulse.com, or at localhost:3000 via port-forward)
  •  Pre-built dashboards explored (Node Exporter, Namespace Pods, Networking)
  •  Custom dashboard created with 5 panels (pod count, CPU, memory, restarts, HPA)
  •  At least one alert rule configured (crash-loop or high CPU)
  •  k9s installed and can navigate cluster resources
  •  Datadog vs Prometheus comparison understood and documented

Instructor Verification

Be prepared to:

  1. Show kubectl get pods -n monitoring and explain each component
  2. Open Grafana and navigate pre-built dashboards — explain the metrics
  3. Show your custom dashboard and explain each PromQL query
  4. Demo k9s — navigate pods, view logs, describe a resource
  5. Explain: Why both Datadog AND Prometheus? What does each monitor?
  6. Cross-reference: Show the same metric in both Datadog and Grafana

Troubleshooting

Pods stuck in Pending (monitoring namespace)

kubectl describe pod <POD_NAME> -n monitoring
# Check Events section
  • Insufficient resources: Monitoring needs CPU/memory. On t3.small, the cluster may be tight:
    kubectl top nodes
    Fix: Scale down dev replicas temporarily or use t3.medium instances.
  • PVC pending: Check kubectl get pvc -n monitoring. k3s local-path-provisioner should auto-bind.

Grafana shows "No data"

# Verify Prometheus is running and scraping
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets — all should show UP
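With the same port-forward open you can also check target health from the command line instead of the UI. A rough one-liner that counts how many targets report each health state (parses the JSON from the targets API with grep):

curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"' | sort | uniq -c
#   e.g.  25 "health":"up"   ← all targets up is what you want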

Port-forward disconnects

Just re-run:

kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

PromQL syntax errors

Error                     Fix
parse error               PromQL is case-sensitive — check typos
unknown metric            Use Explore → Metrics browser to find correct names
no data                   Label filter doesn't match — start broad, add filters gradually
rate() requires counter   Remove rate() — you're using it on a gauge metric

Helm install fails

# Check the release status
helm status monitoring -n monitoring

# Check what revision it's on and if it errored
helm history monitoring -n monitoring

# If stuck in a failed state, uninstall and retry:
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring

# Wait for namespace to fully terminate (can take 30-60 seconds)
kubectl get namespace monitoring   # run until it disappears

# Reinstall
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f kubernetes/monitoring/values.yml

Want to see exactly what Helm installed?

# List all Kubernetes resources created by this release
helm get manifest monitoring -n monitoring

# See which values are active (your overrides merged with defaults)
helm get values monitoring -n monitoring --all

k9s connection issues

# Test kubectl first
kubectl get nodes
# If kubectl works but k9s doesn't:
k9s --kubeconfig ~/.kube/healthpulse-config

Key Concepts Reference

  • Prometheus: Open-source monitoring that scrapes metrics from targets and stores them as time-series. CNCF graduated project.
  • PromQL: Prometheus Query Language. Like SQL but for time-series. Examples: rate(...), count(...) by (label)
  • Time series: Data points indexed by timestamp. Example: CPU at 10:00=45%, 10:01=47%. Every Prometheus metric is a time series.
  • Scraping: Prometheus PULLS data from targets by calling /metrics endpoints at regular intervals.
  • Exporter: Component that exposes metrics in Prometheus format. Node Exporter = OS metrics. kube-state-metrics = K8s objects.
  • Grafana: Visualization platform. Connects to Prometheus and renders dashboards with graphs, tables, alerts.
  • Helm: Package manager for Kubernetes. Like apt or brew but for Kubernetes apps. One command installs dozens of coordinated manifests.
  • Helm chart: A package of templated Kubernetes manifests + default values. kube-prometheus-stack = Prometheus + Grafana + exporters + RBAC in one package.
  • Helm release: A running instance of a chart in your cluster. Named by you at install time (monitoring). You can install the same chart multiple times with different names.
  • Helm repository: A server hosting a collection of charts. Like npm registry or apt sources. Add with helm repo add.
  • values.yml: Your configuration overrides for a chart. Merged with the chart's defaults at install time. Only specify what you want to change.
  • helm template: Renders the YAML Helm would apply — without touching the cluster. Use for previewing and debugging.
  • helm upgrade: Apply changed values or a newer chart version to an existing release. Helm diffs and applies only what changed.
  • helm rollback: Restore a release to a previous revision. Works because Helm stores release history as cluster secrets.
  • Alertmanager: Receives alerts from Prometheus, deduplicates them, routes to email/Slack/PagerDuty.
  • k9s: Terminal-based Kubernetes UI. Real-time interactive view without typing kubectl commands.
  • Counter vs Gauge: Counter only goes up (total restarts) — use rate(). Gauge goes up and down (current memory) — read directly.
