Top 10 Kubernetes Interview Mistakes
Published 20 June 2026 by Ace Cloud Interviews
Kubernetes questions trip up even experienced engineers in interviews. The failures are rarely about not knowing kubectl commands - they are about misunderstanding how the control plane actually makes decisions, why those decisions matter in production, and what your answers reveal about your operational depth. These are the 10 most common gaps that interviewers notice.
Saying "the pod restarts" when you mean the container restarts
What candidates say
“When the application crashes, the pod restarts and comes back up.”
Why interviewers mark this down
Interviewers hear this constantly, and it signals a shallow understanding of the Kubernetes object model. A pod is a wrapper around one or more containers: when a container crashes, the kubelet restarts it in place according to the pod's restartPolicy, without the pod being destroyed and recreated. Each container's restartCount in the pod status tracks how many times that container has restarted. A pod is actually replaced only when it is evicted, when its node goes down, or when a controller decides to replace it, and in those cases the controller creates a new pod rather than moving the old one.
What to say instead
Say: "The container restarts based on the pod's restartPolicy - Always by default. The pod itself stays scheduled on the same node. If you check kubectl describe pod you will see the container's restartCount increment. The pod only gets replaced and rescheduled if it is evicted or if the node becomes unavailable."
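To make the distinction concrete, here is a minimal Python sketch that reads per-container restart counts out of a pod object. The JSON is a hypothetical, heavily trimmed stand-in for what kubectl get pod -o json returns; the pod name, node name, and counts are invented for illustration.

```python
import json

# Hypothetical, trimmed output of `kubectl get pod web-0 -o json`.
POD_JSON = """
{
  "metadata": {"name": "web-0"},
  "spec": {"nodeName": "node-a", "restartPolicy": "Always"},
  "status": {
    "containerStatuses": [
      {"name": "app", "restartCount": 3},
      {"name": "sidecar", "restartCount": 0}
    ]
  }
}
"""


def restart_counts(pod: dict) -> dict:
    """Map each container name to its restartCount from the pod status."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return {s["name"]: s["restartCount"] for s in statuses}


pod = json.loads(POD_JSON)
print(restart_counts(pod))  # {'app': 3, 'sidecar': 0}
```

The counts live on the containers, while the pod's name and node stay the same across restarts, which is the point worth making in the interview.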
Mixing up resource requests and limits
What candidates say
“Requests are what the container uses, limits are the maximum it can use.”
Why interviewers mark this down
This is half right and misses the critical operational distinction. Requests are not what the container uses; they are what the kube-scheduler reserves when deciding which node can fit the pod - a node is considered full when the sum of all pod requests equals its allocatable capacity, regardless of actual usage. Limits are enforced at runtime by the kernel through cgroups: CPU gets throttled when the container exceeds its limit, and the container is OOM-killed if it exceeds its memory limit. Not setting requests leads to suboptimal scheduling; not setting limits creates noisy-neighbour problems.
What to say instead
Say: "Requests are what the scheduler reserves - it uses them for bin-packing and ignores actual usage. A node can be full for scheduling purposes while actually using only 20% of its CPU. Limits are runtime constraints enforced by the kernel. CPU gets throttled at the limit; memory causes an OOM kill. In production I always set both, and I monitor for OOMKilled events and CPU throttling metrics."
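The "full at 20% usage" case can be sketched in a few lines of Python. This is a deliberately simplified model, not the scheduler's actual implementation: it looks only at CPU requests in millicores, and the function name and numbers are illustrative assumptions.

```python
from typing import Sequence


def fits(allocatable_m: int, scheduled_requests_m: Sequence[int], new_request_m: int) -> bool:
    """Return True if the new pod's CPU request fits in the unreserved allocatable CPU.

    Actual usage is intentionally not a parameter: in this model, as in the
    article's description, only requests count towards a node being "full".
    """
    return sum(scheduled_requests_m) + new_request_m <= allocatable_m


allocatable = 4000               # 4 CPUs allocatable, in millicores (assumed)
requests = [1000, 1000, 1500]    # 3500m already reserved by scheduled pods
actual_usage = 800               # 20% of the node, irrelevant to the fit check

print(fits(allocatable, requests, 500))  # True: exactly fills the node
print(fits(allocatable, requests, 600))  # False: rejected despite 80% idle CPU
```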
Treating kubectl exec as the primary debugging tool
What candidates say
“How do you debug a crashing pod? I'd kubectl exec into the container and look around.”
Why interviewers mark this down
If the container is crash-looping, you cannot exec into it - it is not running long enough. And even when the container is running, logs and events give faster signal. Interviewers asking this question are testing for a systematic, signal-first approach: start with what Kubernetes already knows, then dig deeper. Candidates who jump straight to exec reveal they do not have a mental model of the debugging hierarchy.
What to say instead
Say: "I start with kubectl describe pod to check status, conditions, and recent events. Then kubectl logs - and crucially kubectl logs --previous to see the last container's output before it crashed. kubectl get events --sort-by=.lastTimestamp for namespace-wide context, adding -A when I need the whole cluster. If the container is running and I need to inspect live state, then kubectl exec. For network issues I run a debug pod in the same namespace with the right network tools."
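The signal-first order can be written down as a small checklist generator. This is only a sketch: the helper name, pod name, and namespace are hypothetical, and it builds command strings rather than running anything against a cluster.

```python
def triage_commands(pod: str, namespace: str) -> list:
    """Return kubectl commands in signal-first order for a crashing pod."""
    return [
        # 1. What Kubernetes already knows: status, conditions, events.
        f"kubectl describe pod {pod} -n {namespace}",
        # 2. Output of the current container instance.
        f"kubectl logs {pod} -n {namespace}",
        # 3. Output of the previous instance, i.e. the one that crashed.
        f"kubectl logs {pod} -n {namespace} --previous",
        # 4. Recent events in the namespace, oldest first.
        f"kubectl get events -n {namespace} --sort-by=.lastTimestamp",
    ]


for cmd in triage_commands("checkout-7d9f", "shop"):
    print(cmd)
```

kubectl exec is left off the list on purpose: it only becomes useful once the container stays up long enough to inspect.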
Not knowing what actually happens when a node goes down
What candidates say
“The pods get rescheduled to another node.”
Why interviewers mark this down
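The timing and behaviour depend heavily on the controller type, and the gaps matter in production. The node is not declared unhealthy immediately: the node controller waits for a grace period of missed kubelet heartbeats (about 40 seconds by default) before marking it NotReady. Pods then tolerate the resulting not-ready and unreachable taints for a further 5 minutes by default (tolerationSeconds: 300) before they are evicted and Deployment replicas are recreated elsewhere. DaemonSet pods are not rescheduled - they run one-per-node by design and are simply gone until a new node joins. StatefulSet pods have stable identities, so the controller will not create a replacement until the old pod is confirmed gone, which on an unreachable node usually means removing the node or force-deleting the pod. Interviewers testing production readiness want to hear these details.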
What to say instead
Say: "For Deployment pods, the node goes NotReady after the heartbeat grace period (around 40 seconds), then pods are evicted once their toleration for the not-ready and unreachable taints expires - 5 minutes by default - and the ReplicaSet creates replacements on other nodes. DaemonSet pods are not rescheduled; they run per-node and are simply lost with the node. StatefulSet pods stay stuck in Terminating or Unknown until the node object is removed or the pod is force-deleted, because Kubernetes will not risk two pods with the same identity. I tune tolerationSeconds based on how quickly we can detect and replace a failed node in our environment."
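The arithmetic behind "the pods get rescheduled" is worth being able to do on the spot. The sketch below simply adds the two default delays quoted above; the 40-second and 300-second values are the article's figures passed in as parameters, and real clusters may be configured differently.

```python
def seconds_until_eviction(grace_period_s: int = 40, toleration_s: int = 300) -> int:
    """Approximate delay between a node going silent and its pods being evicted."""
    return grace_period_s + toleration_s


total = seconds_until_eviction()
minutes, seconds = divmod(total, 60)
print(f"{total}s = {minutes}m {seconds}s")  # 340s = 5m 40s

# A workload that tolerates the taints for only 30 seconds (an assumed value).
print(seconds_until_eviction(toleration_s=30))  # 70
```

With the defaults, a Deployment can run short of replicas for well over five minutes before replacements even start scheduling.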
Saying Secrets are encrypted
What candidates say
“Secrets are like ConfigMaps but for sensitive data - they're encrypted.”
Why interviewers mark this down
Secret values are only base64-encoded, not encrypted, and by default they are stored unencrypted in etcd. Unless encryption at rest is enabled on the API server (for example with a KMS provider), anyone with direct etcd access can read Secret values. The base64 encoding exists so that binary data can be carried in a manifest, not for security. Interviewers focused on security posture will probe this directly. A common production practice is to keep the source of truth in a dedicated secrets manager and sync from it with External Secrets Operator or Vault, so that secret values never live in Git; the synced Kubernetes Secrets still land in etcd, so encryption at rest remains necessary.
What to say instead
Say: "Secret values are only base64-encoded, and by default they sit unencrypted in etcd. They get distinct RBAC controls compared to ConfigMaps, but they are not secure without encryption at rest enabled. In production I enable encryption at rest and use External Secrets Operator to pull from AWS Secrets Manager, so the source of truth lives in a dedicated secrets manager and secret material is never committed to Git."
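A few lines of Python show why base64 is not a security control. The key name and password below are made up; the data dictionary mimics the shape of the data field in a Secret manifest.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode every base64 value in a Secret's data map back to plain text."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Hypothetical Secret data, encoded the way a manifest would carry it.
secret_data = {
    "DB_PASSWORD": base64.b64encode(b"s3cr3t-password").decode("ascii"),
}

print(secret_data["DB_PASSWORD"])       # czNjcjN0LXBhc3N3b3Jk
print(decode_secret_data(secret_data))  # {'DB_PASSWORD': 's3cr3t-password'}
```

No key is involved at any point, so anyone who can read the encoded value can read the secret.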
Knowing HPA exists but not how it calculates replicas
What candidates say
“HPA scales my pods up when CPU usage gets high.”
Why interviewers mark this down
This tells an interviewer you have read the docs but have not thought about how it works. The HPA controller queries the metrics API on a configurable interval (default 15 seconds) and calculates desired replicas using: ceil(current replicas * current metric value / target metric value). It has a stabilisation window - default 5 minutes on scale-down - to prevent thrashing. With default settings it does not scale to zero; that is usually handled with KEDA. It is also not limited to CPU: custom metrics from Prometheus and external metrics from queues are common production patterns.
What to say instead
Say: "HPA uses the formula desired = ceil(current * actual / target). It queries metrics-server for CPU and memory, or a custom metrics adapter for Prometheus metrics. There is a default 5-minute stabilisation window on scale-down to prevent flapping. For scale-to-zero I use KEDA. I have used KEDA to scale workers based on SQS queue depth - it feeds the queue metric to an HPA that it manages and handles the zero-to-one activation that a standard HPA does not."
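The replica formula is easy to sanity-check in code. This is a minimal sketch of the formula only: the real controller also applies a tolerance band, min/max replica bounds, and the stabilisation window, all omitted here. Metric values are integers (for example millicores or percent) so the ceiling can be computed without floating-point rounding.

```python
def desired_replicas(current_replicas: int, current_metric: int, target_metric: int) -> int:
    """ceil(current_replicas * current_metric / target_metric), using integer arithmetic."""
    numerator = current_replicas * current_metric
    # -(-a // b) is ceiling division for positive integers.
    return -(-numerator // target_metric)


# 4 replicas averaging 90% CPU against a 60% target: scale up.
print(desired_replicas(4, 90, 60))  # 6
# 5 replicas averaging 30% CPU against a 60% target: scale down.
print(desired_replicas(5, 30, 60))  # 3
```

Note that the second case rounds 2.5 up to 3: the ceiling biases the result towards keeping capacity.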
Treating namespaces as security isolation
What candidates say
“We put each team in their own namespace to isolate them.”
Why interviewers mark this down
Namespaces provide logical organisation and RBAC scoping, but they do not provide network isolation. By default, any pod in any namespace can reach any other pod in the cluster without restriction. A misconfigured or compromised pod can send traffic across namespace boundaries freely. Interviewers testing security posture want to hear that you know what namespaces actually protect (resource quotas, RBAC scoping, LimitRanges) and what they do not (network traffic).
What to say instead
Say: "Namespaces give you RBAC scoping and resource quotas, but not network isolation. For multi-tenant security I add NetworkPolicies with a default-deny baseline and allow only specific cross-namespace traffic, on a CNI plugin that actually enforces them. I also apply the Pod Security Standards to restrict privilege escalation within each namespace. For hard multi-tenancy where teams should not trust each other at all, separate node pools or clusters are safer than relying on namespace boundaries alone."
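A default-deny baseline is a small object. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl apply -f accepts alongside YAML. The policy name and the namespace "team-a" are illustrative assumptions, and the policy only has an effect when the cluster's network plugin enforces NetworkPolicy.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod in the namespace and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing both types with no ingress/egress rules denies both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(default_deny_policy("team-a"), indent=2))
```

Specific allow rules are then layered on top as separate policies, one per permitted flow.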
Giving vague RBAC answers
What candidates say
“I'd create a service account and give the pod the right permissions.”
Why interviewers mark this down
This is the answer that shows a candidate has memorised the concept without having operated it. Interviewers want specifics: Role vs ClusterRole (namespace-scoped vs cluster-wide), RoleBinding vs ClusterRoleBinding, the principle of least privilege applied to specific verbs and resources, and awareness that the default service account has automountServiceAccountToken set to true - which exposes a token inside every pod that did not ask for one.
What to say instead
Say: "I create a dedicated service account for the workload, define a Role with the minimum required verbs on specific resources - for example get and list on ConfigMaps, not wildcard access. Bind it with a RoleBinding scoped to the namespace. The default service account should have automountServiceAccountToken: false unless a pod actually needs API access, because the token is mounted automatically and can be misused if a container is compromised."
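Spelled out, the answer above comes to three objects. The sketch below builds them as Python dictionaries that could be dumped to JSON and applied; the names "config-reader" and "shop" are hypothetical, and the verbs and resource follow the ConfigMap example in the answer.

```python
import json


def least_privilege_manifests(namespace: str, name: str) -> list:
    """Build a ServiceAccount, a read-only ConfigMap Role, and the RoleBinding joining them."""
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": name, "namespace": namespace},
        # Pods using this account get no token unless they opt in.
        "automountServiceAccountToken": False,
    }
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Core API group (""), ConfigMaps only, read verbs only.
            {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": name, "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": name, "namespace": namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": name},
    }
    return [service_account, role, binding]


for manifest in least_privilege_manifests("shop", "config-reader"):
    print(json.dumps(manifest, indent=2))
```

A workload that does need API access would set automountServiceAccountToken back to true in its own pod spec, which keeps the opt-in explicit.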
Not knowing how Services actually route traffic
What candidates say
“A Service gives pods a stable IP and load balances traffic between them.”
Why interviewers mark this down
Correct, but surface-level. Interviewers testing production debugging skills want to know how Services work under the hood. kube-proxy watches the API server for Service and EndpointSlice changes (Endpoints in older releases) and programs iptables rules (or IPVS entries) on every node. Traffic to a ClusterIP is NAT'd to a pod IP by those iptables rules. The endpoints controllers update the list of backing pod IPs as pods become Ready or NotReady - which is exactly why readiness probes are critical for zero-downtime deployments. Without this understanding, candidates cannot diagnose Service connectivity failures.
What to say instead
Say: "kube-proxy programs iptables rules on every node. When a packet hits a ClusterIP, iptables NATs it probabilistically to one of the backing pod IPs. The Endpoints controller tracks which pods are Ready based on their readiness probe - if a probe fails, the pod is removed from Endpoints and receives no new traffic. This is why missing readiness probes cause traffic to hit pods that are not ready yet. For a broken Service I always check kubectl get endpoints first to see if any pods are listed."
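"Probabilistically" has a precise shape in iptables mode: the rules for a Service are evaluated in order, each with a random-match probability, and the probabilities are chosen so that every backend ends up with an equal share. The sketch below models that chain with exact fractions; it is an illustration of the arithmetic, not kube-proxy code, and the function names are made up.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list:
    """Match probability for each sequential rule: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(n: int) -> list:
    """Overall share of traffic each backend receives after the chain is evaluated."""
    shares = []
    remaining = Fraction(1)  # fraction of packets that reached this rule
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print([str(p) for p in rule_probabilities(3)])  # ['1/3', '1/2', '1']
print([str(s) for s in effective_shares(3)])    # ['1/3', '1/3', '1/3']
```

Removing a pod from the endpoints list shrinks the chain by one rule, which is how a failing readiness probe takes a backend out of rotation.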
Not knowing what etcd actually does
What candidates say
“Kubernetes stores all cluster state in the control plane.”
Why interviewers mark this down
The control plane is not a storage layer - etcd is. This vague answer tells an interviewer you do not know your control plane components. etcd is a distributed, consistent key-value store using the Raft consensus algorithm. It stores all cluster state: pod specs, ConfigMaps, Secrets, RBAC rules, everything the API server serves. If etcd loses quorum, it can no longer commit writes, so the API server is effectively read-only at best and consistent reads may fail as well - existing pods keep running but nothing can be scheduled or modified. This matters for reliability discussions and DR planning.
What to say instead
Say: "etcd stores all Kubernetes cluster state - pod specs, ConfigMaps, Secrets, RBAC policies. It uses Raft for consensus: with 3 nodes you tolerate 1 failure; with 5 nodes, 2. If you lose quorum, etcd stops accepting writes and the API server is effectively read-only at best - existing workloads keep running but you cannot schedule, modify, or create anything until quorum is restored. I always schedule regular etcd snapshots via etcdctl snapshot save and store them off-cluster. Losing etcd without a recent backup means rebuilding the entire cluster."
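The "3 tolerates 1, 5 tolerates 2" numbers follow from majority quorum, which is quick to verify. The function names below are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for size in (1, 3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```

A 4-member cluster tolerates no more failures than a 3-member one, which is why etcd clusters are normally sized with odd member counts.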
The bottom line
None of these are trick questions. They are the exact gaps that cause real production problems: pods not rescheduling when expected, Services sending traffic to unhealthy pods, clusters going read-only at 2am, namespaces that do not actually isolate anything. The candidates who stand out in Kubernetes interviews connect the mechanics to the consequences. Study how the scheduler places pods, how kube-proxy programs iptables, how etcd stores state - not because the interviewer wants to quiz you on internals, but because understanding the internals is what makes you reliable in production.