Ace Cloud Interviews
Observability | Intermediate | 8-12 hours

Kubernetes Observability Stack with Prometheus and Grafana

Kubernetes, Helm, Prometheus, Grafana, Alertmanager, kube-state-metrics, Node Exporter

Best for: Site Reliability Engineer, Platform Engineer, Cloud/DevOps Engineer

Overview

Set up a complete observability stack for a Kubernetes cluster using the kube-prometheus-stack Helm chart. The goal is not just to get the tools running, but to define meaningful SLO-based alerts, build dashboards that reflect real service health, and write runbooks so an on-call engineer knows what to do when an alert fires at 3am.

What you will build

  • Deploy the kube-prometheus-stack Helm chart (Prometheus Operator, Prometheus, Grafana, Alertmanager, kube-state-metrics, Node Exporter)
  • Deploy a sample multi-tier application and observe its default metrics
  • Write PromQL queries for request rate, error rate, and latency (the RED method)
  • Build a Grafana dashboard covering service health and node-level infrastructure metrics
  • Define alert rules as PrometheusRule CRDs: high error rate, CrashLoopBackOff, and node disk pressure
  • Configure Alertmanager to route alerts to a webhook, email, or Slack with deduplication and grouping
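
The three alert rules named above could be sketched as a single PrometheusRule manifest along these lines. This is a minimal sketch, not a tuned rule set: the application metric `http_requests_total` and its `status` label, the `monitoring` namespace, the `release: monitoring` label, and every threshold and `for` duration are assumptions to adjust for your own cluster. `kube_pod_container_status_waiting_reason` and `kube_node_status_condition` are exported by kube-state-metrics.

```yaml
# Sketch only. Assumed: app metric http_requests_total with a "status" label,
# a Helm release named "monitoring" installed in the "monitoring" namespace.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-app-alerts          # hypothetical name
  namespace: monitoring            # assumed namespace
  labels:
    release: monitoring            # assumed; must match the chart's rule selector
spec:
  groups:
    - name: sample-app.alerts
      rules:
        - alert: HighErrorRate
          # "Errors" in RED: share of 5xx responses over the last 5 minutes.
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 10m                 # illustrative duration
          labels:
            severity: critical
          annotations:
            summary: More than 5% of requests are returning 5xx.
            # Runbook: 1) check recent deploys, 2) inspect pod logs,
            # 3) check downstream dependencies.
        - alert: PodCrashLooping
          # kube-state-metrics reports the waiting reason per container.
          expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is crash looping."
        - alert: NodeDiskPressure
          # The kubelet sets the DiskPressure condition on the Node object.
          expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} reports DiskPressure."
```

With the chart's default settings, the operator is understood to pick up only PrometheusRule objects carrying the Helm release label, so a rule that never appears in the Prometheus UI is often a label mismatch rather than a syntax error.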

Before you start

  • A running Kubernetes cluster (kind, k3s, or a cloud-managed cluster)
  • Basic Kubernetes knowledge: Deployments, Services, namespaces
  • Helm installed locally

Deliverables

A complete submission should include all of the following.

  • A Helm values file for kube-prometheus-stack with custom resource sizing for a local cluster
  • PrometheusRule CRDs defining at least three alert rules with severity labels
  • A Grafana dashboard JSON file committed to the repository and loadable via a ConfigMap
  • Alertmanager configuration with routing, grouping, and at least one receiver configured
  • A runbook comment on each alert rule explaining what it means and the first three steps to investigate
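
As one possible shape for the values file and Alertmanager deliverables, the sketch below sizes the main components for a small local cluster and sets routing, grouping, and a single webhook receiver. The file name, all resource figures, the retention period, the receiver name, and the webhook URL are illustrative assumptions, and the value keys should be checked against the chart version you install.

```yaml
# values-local.yaml (hypothetical file name).
# Resource figures are illustrative guesses for kind or k3s, not recommendations.
prometheus:
  prometheusSpec:
    retention: 2d
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        memory: 1Gi

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
  config:
    route:
      receiver: default-webhook
      # Alerts sharing these labels are batched into one notification.
      group_by: ["alertname", "namespace"]
      group_wait: 30s        # wait before sending the first notification for a group
      group_interval: 5m     # wait before notifying about new alerts in the group
      repeat_interval: 4h    # re-send interval while the alert keeps firing
      routes:
        - receiver: default-webhook
          matchers:
            - severity = "critical"
          repeat_interval: 1h
    receivers:
      - name: default-webhook
        webhook_configs:
          - url: http://alert-sink.monitoring.svc:8080/alerts   # hypothetical endpoint

grafana:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard   # ConfigMaps with this label are loaded as dashboards
```

Helm merges these values with the chart defaults, so it is worth rendering the final Alertmanager configuration once (for example with `helm template`) to confirm that every receiver referenced by a route is still defined.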

Stretch goals

Optional extras that demonstrate deeper understanding and make your project stand out.

  • Add Loki for log aggregation and correlate logs with metrics in a single Grafana panel
  • Instrument a custom application with the Prometheus client library and expose a /metrics endpoint
  • Build a Grafana SLO dashboard using recording rules to track a 99.9% availability target
  • Configure remote write to a long-term storage backend such as Thanos or Grafana Mimir
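
For the SLO dashboard stretch goal, recording rules might look like the sketch below. It assumes the same hypothetical `http_requests_total` counter with a `status` label and the same assumed `release: monitoring` label; the rule names follow the common `level:metric:operations` naming convention but are otherwise made up for illustration.

```yaml
# Sketch only. Assumed: app metric http_requests_total with a "status" label.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-app-slo             # hypothetical name
  namespace: monitoring            # assumed namespace
  labels:
    release: monitoring            # assumed Helm release label
spec:
  groups:
    - name: sample-app.slo
      interval: 30s                # how often the group is evaluated
      rules:
        # Total request rate per job.
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        # Rate of 5xx responses per job.
        - record: job:http_requests_errors:rate5m
          expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
        # Availability ratio; rules in a group run in order, so this one
        # reads the two series recorded just above.
        # Caveat: yields no sample while no 5xx series exists at all.
        - record: job:http_requests:availability5m
          expr: 1 - (job:http_requests_errors:rate5m / job:http_requests:rate5m)
```

A Grafana panel can then plot `job:http_requests:availability5m` against a constant 0.999 line. A 99.9% target leaves an error budget of 0.1% of requests over whatever window you choose.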

Interview talking points

When you discuss this project in an interview, be ready to answer these questions specifically.

  1. The RED method (Rate, Errors, Duration for services) vs the USE method (Utilisation, Saturation, Errors for resources) and when to apply each
  2. Why you defined alerts as PrometheusRule CRDs rather than static config files: the Prometheus Operator watches for them and reloads Prometheus automatically, so rules can live in Git and be applied like any other manifest (GitOps)
  3. How Alertmanager deduplication and grouping prevent an alert storm from flooding an on-call channel
  4. What a recording rule is and why you use one for SLO calculations: Prometheus precomputes an expensive PromQL expression at each rule evaluation and stores the result as a new series, so dashboards and alerts read a cheap series instead of re-running the query every time
  5. How you would handle high-cardinality labels (e.g. per-user or per-request labels) and why they can cause Prometheus to OOM
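
On the cardinality question, it helps to have the arithmetic ready: the number of series for one metric is roughly the product of the distinct values of each label, so 10 endpoints × 5 status codes × 100,000 user IDs is 5,000,000 series, each held in memory by Prometheus. The cleanest fix is to remove the label in the application's instrumentation. As a stopgap, a ServiceMonitor can drop an offending metric at scrape time, as in the sketch below. The metric name `http_requests_by_user_total`, the `app: sample-app` selector, the `http-metrics` port name, and the `release: monitoring` label are all hypothetical.

```yaml
# Sketch only. All names below are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app
  namespace: monitoring            # assumed namespace
  labels:
    release: monitoring            # assumed Helm release label
spec:
  selector:
    matchLabels:
      app: sample-app              # assumed label on the application's Service
  endpoints:
    - port: http-metrics           # assumed name of the Service port
      interval: 30s
      metricRelabelings:
        # Drop the whole high-cardinality metric before it is stored.
        - sourceLabels: [__name__]
          regex: http_requests_by_user_total
          action: drop
```

Dropping the metric is used here instead of `labeldrop` because removing a single label can leave several samples with identical label sets in one scrape, which Prometheus treats as duplicates.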