The Kubernetes Deployment Checklist
A production Kubernetes deployment checklist: resource limits, probes, rollout strategy, PodDisruptionBudgets, graceful shutdown, and the items teams skip.
Part of Kubernetes Operations for Production Platforms
A real Kubernetes deployment checklist is the difference between a rollout that drops zero requests and one that sheds a burst of errors every single time you ship. Kubernetes will happily run a deployment that has no resource requests, lying probes, and no graceful shutdown, and it will punish you for all three under load. This checklist is the set of items that make a deployment production-safe, ordered by how often teams skip them and pay for it.
Most production incidents on Kubernetes are not exotic. They are a missing PodDisruptionBudget during a node drain, or a service that never handled SIGTERM, surfacing on an ordinary Tuesday deploy. The checklist exists to make those non-events.
Why a deployment checklist matters
Kubernetes gives you enormous power and very few guardrails. A deployment manifest with none of the safety items still deploys; the failure shows up later, as dropped requests on rollout, a node drain that takes the whole service down, or a memory leak that kills a node. A checklist turns those latent failures into things you handled before they fired.
The throughline of every item below is the same: make the deployment behave correctly during the disruptions that are normal in Kubernetes: deploys, scale events, node drains, and pod evictions. This post is part of the Kubernetes operations series and ties together several of its deep dives.
What should a Kubernetes deployment checklist include?
A production deployment needs resource requests and limits, the three probe types, a safe rolling-update strategy, a PodDisruptionBudget, graceful shutdown, autoscaling, externalized config and secrets, and observability. The items most often missing, and most often the cause of deploy-time incidents, are graceful shutdown and PodDisruptionBudgets.
Here is the full checklist, grouped:
| Area | Item | Why it matters |
|---|---|---|
| Resources | CPU/memory requests | Scheduling + no CPU starvation |
| Resources | Memory limit | Contains a leak before it kills the node |
| Health | Readiness probe | Gates traffic to ready pods only |
| Health | Liveness probe | Restarts a wedged process; keep its thresholds lenient |
| Health | Startup probe | Covers slow boots without false kills |
| Rollout | Rolling update strategy | Gradual replace; set maxUnavailable/maxSurge |
| Rollout | PodDisruptionBudget | Caps simultaneous voluntary disruptions |
| Lifecycle | Graceful shutdown | Drains in-flight requests on rollout |
| Scaling | HPA (right signal) | Responds to real load |
| Config | Externalized config + secrets | No rebuild to change config; no secrets in image |
| Ops | Metrics, logs, traces | Debuggable when the deploy goes wrong |
You do not have to add all of these on day one, but a service that takes real traffic should have every one before you call it production-grade.
How do you deploy to Kubernetes without downtime?
Zero-downtime deploys come from four items working together: a rolling update (replace pods gradually), accurate readiness probes (send traffic only to pods that can serve), graceful shutdown (drain in-flight requests before a pod exits), and a PodDisruptionBudget (stop too many pods going down at once). Remove any one and a rollout starts dropping requests.
The sequence on a rollout should be: a new pod starts, passes its readiness probe, and joins the endpoints; an old pod is sent SIGTERM, immediately fails its readiness probe so it stops receiving new requests, finishes its in-flight work within the termination grace period, then exits. When that choreography is correct, a deploy is invisible to users. When readiness or shutdown is wrong, every deploy sheds a burst of errors.
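How gradual "replace pods gradually" is depends on maxSurge and maxUnavailable. The sketch below works through that arithmetic. It assumes the Deployment controller's documented rounding (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down) and the 25% defaults. The function names `resolve` and `rollout_bounds` are hypothetical helpers, not part of any Kubernetes client.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute int or a percentage string such as "25%"."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas
    # Integer ceil/floor of scaled / 100, avoiding float rounding.
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count envelope a rolling update stays inside."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # If both resolve to zero the rollout could never make progress,
        # so the controller treats maxUnavailable as 1.
        unavailable = 1
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))  # {'max_total_pods': 13, 'min_available_pods': 8}
print(rollout_bounds(2))   # {'max_total_pods': 3, 'min_available_pods': 2}
```

With two replicas and the defaults, no old pod is removed until a new one is ready, because maxUnavailable rounds down to zero. With ten replicas, up to two pods can be unavailable at once, which is worth checking against the capacity you actually need during a deploy.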
What is the most commonly missed item in Kubernetes deployments?
Graceful shutdown, by a wide margin. Many deployments never handle SIGTERM and have no preStop hook, so when Kubernetes terminates a pod on rollout, in-flight requests are cut mid-flight. The fix is to fail readiness immediately on shutdown so no new traffic arrives, then drain existing work within the termination grace period before exiting.
The related subtlety is the termination grace period (terminationGracePeriodSeconds, 30 seconds by default): it must be long enough for your longest in-flight request to finish, plus any preStop hook, which runs inside the same window, or the pod is force-killed with SIGKILL mid-request anyway. Set it to comfortably exceed your real request duration, and make the application actually stop accepting new work the moment it receives SIGTERM.
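As a rough illustration of that shutdown sequence, here is a minimal sketch using only the Python standard library. The `/readyz` path, the port, and the `drain_delay_s` parameter are assumptions made for the example. The delay is a small deviation from stopping "the moment" SIGTERM arrives: it keeps the listener open briefly so that load balancers have time to notice the failed readiness check. Set it to 0 to stop accepting immediately.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

shutting_down = threading.Event()  # set once shutdown has begun


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Readiness fails as soon as shutdown begins; other paths keep working.
        failing = self.path == "/readyz" and shutting_down.is_set()
        status, body = (503, b"shutting down\n") if failing else (200, b"ok\n")
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(host="0.0.0.0", port=8080):
    server = ThreadingHTTPServer((host, port), Handler)
    # With non-daemon handler threads, server_close() waits for in-flight
    # requests to finish (socketserver behaviour in Python 3.7+).
    server.daemon_threads = False
    return server


def begin_shutdown(server, drain_delay_s=5.0):
    """Fail readiness now, then stop accepting after drain_delay_s seconds."""
    shutting_down.set()
    # shutdown() must run on a different thread from serve_forever().
    threading.Timer(drain_delay_s, server.shutdown).start()


def main():
    server = make_server()
    signal.signal(signal.SIGTERM, lambda signum, frame: begin_shutdown(server))
    server.serve_forever()  # returns once shutdown() has been called
    server.server_close()   # blocks until in-flight handlers have finished
```

Calling `main()` runs the server. The total of the drain delay and the longest request still has to fit inside the pod's termination grace period, otherwise the process is killed before `server_close()` returns.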
Resource requests, limits, and the CPU-limit nuance
Always set resource requests, because they drive scheduling and guarantee the pod is not starved of CPU during normal operation or startup. Limits deserve more thought. A memory limit is valuable: it contains a memory leak so a single pod cannot take down the whole node. A CPU limit is double-edged: it caps a pod’s CPU and can throttle a latency-sensitive service at exactly the wrong moment.
The pragmatic stance most operators land on is: requests always, memory limits usually, CPU limits cautiously or not at all for latency-sensitive services. The reason is that CPU is compressible (throttling slows you down) while memory is not (running out kills you), so the two limits protect against different severities of failure and deserve different defaults.
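To make "throttle at exactly the wrong moment" concrete, the sketch below uses a deliberately simplified model of CPU-limit enforcement. It assumes a single busy thread that starts at the beginning of an enforcement period, and the common 100 ms period. The function name `wall_time_ms` is hypothetical, and real schedulers are messier than this.

```python
import math


def wall_time_ms(cpu_needed_ms, cpu_limit_millicores, period_ms=100):
    """Wall-clock time for one thread to get cpu_needed_ms of CPU under a limit.

    Simplified model: in each period the container may use
    limit * period of CPU time, then waits for the next period.
    """
    quota_ms = cpu_limit_millicores * period_ms / 1000
    if quota_ms >= period_ms:
        return float(cpu_needed_ms)  # one thread cannot exhaust the quota
    # Number of periods in which the quota runs out before the work is done.
    exhausted = math.ceil(cpu_needed_ms / quota_ms) - 1
    return exhausted * period_ms + (cpu_needed_ms - exhausted * quota_ms)


print(wall_time_ms(80, 1000))  # 80.0: no throttling at a full core
print(wall_time_ms(80, 500))   # 130.0: 50 ms of work, 50 ms stalled, 30 ms of work
print(wall_time_ms(80, 200))   # 320.0: four periods for 80 ms of CPU
```

In this model the request that needs 80 ms of CPU takes 130 ms under a 500m limit, even if the node has idle cores. That added latency is the cost the paragraph above is weighing against the protection a limit provides.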
How do you roll back a bad Kubernetes deployment?
Roll back fast and automatically where you can. Kubernetes keeps a deployment’s revision history, so a manual rollback to the previous known-good revision is a single command, but the better posture is to catch a bad rollout before it fully ships. A progressive rollout that watches health and halts on regression turns “page, diagnose, roll back” into “the rollout stopped itself.”
The baseline capability is the built-in rollout history: every deployment update creates a revision, and you can return to the prior one quickly. That is your floor, and every team should know the command and have tested it. The failure mode it guards against is the deploy that looked fine in CI and degrades under real traffic, which is exactly when seconds matter and nobody wants to be reconstructing the previous config by hand.
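For reference, the manual rollback can be wrapped in a small script so that nobody has to remember the flags under pressure. This is a sketch, assuming `kubectl` is on the PATH and already pointed at the right cluster. The deployment name `checkout`, the namespace, and the 120-second timeout are hypothetical values.

```python
import subprocess


def rollback_argv(deployment, namespace="default", to_revision=None):
    """Build the kubectl command that returns a Deployment to an earlier revision."""
    argv = ["kubectl", "rollout", "undo", f"deployment/{deployment}",
            "--namespace", namespace]
    if to_revision is not None:
        # Without this flag, kubectl rolls back to the previous revision.
        argv.append(f"--to-revision={to_revision}")
    return argv


def roll_back(deployment, namespace="default", to_revision=None):
    """Run the rollback, then wait for the rollout to settle."""
    subprocess.run(rollback_argv(deployment, namespace, to_revision), check=True)
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         "--namespace", namespace, "--timeout=120s"],
        check=True,
    )


print(" ".join(rollback_argv("checkout", namespace="shop")))
# kubectl rollout undo deployment/checkout --namespace shop
```

Running `kubectl rollout history deployment/checkout` beforehand shows which revisions are available to return to.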
Above that floor, progressive delivery shrinks the blast radius. A canary or blue-green rollout exposes the new version to a slice of traffic first, watches the golden signals, and only proceeds if they hold, rolling back automatically if they do not. This pairs directly with the observability you already need: the same metrics that page you can gate a rollout. The goal is that a bad deploy is contained to a fraction of traffic for a few minutes, not shipped to everyone and then frantically reverted.
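The gating decision itself can be very small. The following sketch compares a canary against the stable baseline on two of the golden signals, errors and latency. The `Signals` type, the `canary_verdict` function, and every threshold are hypothetical. A real progressive-delivery tool would read these numbers from your metrics system and apply its own analysis.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, min_requests=500,
                   max_error_rate_increase=0.01, max_latency_ratio=1.2):
    """Return "wait", "rollback", or "promote" for the current canary step."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = Signals(requests=10_000, errors=20, p99_latency_ms=180.0)
print(canary_verdict(baseline, Signals(1_000, 5, 190.0)))   # promote
print(canary_verdict(baseline, Signals(1_000, 40, 190.0)))  # rollback
```

The "wait" branch matters as much as the other two: judging a canary on a handful of requests produces both false alarms and false confidence.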
A pre-ship deployment checklist
Before a deployment serves production traffic:
- Resource requests set; memory limit set; CPU limit chosen deliberately.
- Readiness, liveness, and startup probes configured and meaningful (not fixed sleeps).
- Rolling update strategy with sane maxUnavailable/maxSurge.
- A PodDisruptionBudget so node drains cannot take the service down.
- Graceful shutdown: SIGTERM handled, readiness fails first, grace period exceeds longest request.
- Autoscaling configured on the right signal.
- Config and secrets externalized; nothing sensitive baked into the image.
- Metrics, logs, and traces wired so a bad rollout is immediately debuggable.
What I’d do differently
The lesson I keep relearning is that the boring items (graceful shutdown, PodDisruptionBudgets) are the ones that actually cause production pain, while the exciting ones get all the attention. A team will spend a week on a fancy deployment strategy and never add a preStop hook, then wonder why every deploy sheds errors.
If I were standardizing deployments again, I would encode this checklist as a template and a CI policy check, so a deployment literally cannot ship without requests, probes, a PDB, and graceful shutdown. Making the safe path the default path is the only way these items stop being “things we meant to add.” The checklist is cheap; the incidents it prevents are not.
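A CI policy check along those lines can start small. The sketch below inspects a Deployment that has already been parsed into a Python dict (for example by a YAML parser in the pipeline) and returns a list of violations. The function name `check_deployment` and the exact rules are assumptions made for illustration. Graceful shutdown cannot be proven from a manifest, so an explicitly set terminationGracePeriodSeconds stands in as a weak signal that someone thought about it. Only `matchLabels` selectors are considered for the PodDisruptionBudget.

```python
def check_deployment(deployment, pdbs=()):
    """Return a list of violations for one Deployment; an empty list means pass."""
    problems = []
    template = deployment["spec"]["template"]
    pod_spec = template["spec"]
    labels = template.get("metadata", {}).get("labels", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        for key in ("cpu", "memory"):
            if key not in resources.get("requests", {}):
                problems.append(f"{name}: missing {key} request")
        if "memory" not in resources.get("limits", {}):
            problems.append(f"{name}: missing memory limit")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                problems.append(f"{name}: missing {probe}")

    # Heuristic only: an explicit grace period suggests shutdown was considered.
    if "terminationGracePeriodSeconds" not in pod_spec:
        problems.append("pod: terminationGracePeriodSeconds not set explicitly")

    def covers(pdb):
        # A PDB covers these pods if its matchLabels are a subset of their labels.
        selector = pdb.get("spec", {}).get("selector", {}).get("matchLabels", {})
        return bool(selector) and all(labels.get(k) == v for k, v in selector.items())

    if not any(covers(pdb) for pdb in pdbs):
        problems.append("no PodDisruptionBudget selects these pods")
    return problems


bare = {"spec": {"template": {
    "metadata": {"labels": {"app": "web"}},
    "spec": {"containers": [{"name": "web", "image": "example/web:1"}]},
}}}
for problem in check_deployment(bare):
    print(problem)
```

In a pipeline, a non-empty result would fail the build. A check like this catches what is missing, not what is wrong, so a readiness probe that lies still passes. It complements the template rather than replacing review.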
Sources
- Kubernetes, Deployments and rolling updates: kubernetes.io/docs/concepts/workloads/controllers/deployment
- Kubernetes, Pod termination and graceful shutdown: kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
- Kubernetes, Resource management for pods: kubernetes.io/docs/concepts/configuration/manage-resources-containers
Frequently asked questions
What should a Kubernetes deployment checklist include?
Resource requests and limits, readiness/liveness/startup probes, a safe rolling-update strategy, a PodDisruptionBudget, graceful shutdown (a preStop hook plus an adequate termination grace period), autoscaling, observability, and externalized config and secrets. The items teams most often skip are graceful shutdown and PodDisruptionBudgets, which is why deploys drop requests.
How do you deploy to Kubernetes without downtime?
Combine a rolling update with correct readiness probes, graceful shutdown, and a PodDisruptionBudget. Readiness gates traffic to only-ready pods, the rolling strategy replaces pods gradually, graceful shutdown drains in-flight requests, and the PDB stops too many pods going down at once. Missing any one of these is how a deploy drops requests.
Do I need resource requests and limits on every deployment?
Set requests on every deployment; they drive scheduling and protect the pod from CPU starvation. Limits are more nuanced: memory limits prevent a leak from taking down a node, but aggressive CPU limits can throttle latency-sensitive services. Always set requests; set limits deliberately, especially memory.
What is the most commonly missed item in Kubernetes deployments?
Graceful shutdown. Many deployments have no preStop hook and do not handle SIGTERM, so on every rollout in-flight requests are killed mid-flight. The fix is to fail readiness on shutdown, then finish in-flight work within the termination grace period before the process exits.