
Readiness Probes That Don't Lie

Most Kubernetes readiness probes lie: they return 200 because the process started, not because the service can serve. How to write probes that tell the truth.

Part of Kubernetes Operations for Production Platforms

Most Kubernetes readiness probes lie. They return 200 because the process is running, not because the service can actually serve a request, so Kubernetes confidently routes traffic to a pod that immediately throws errors. A readiness probe that does not reflect real readiness is worse than no probe at all, because it makes failure look like health.

A truthful readiness probe answers one question: can this specific pod serve a real request right now? Not “did the process start,” not “is the binary alive,” but “would a request succeed if I sent one.” Writing the probe to answer that, and no more, is the whole skill.

Why readiness probes matter

In Kubernetes, the readiness probe is the gate between your pod and live traffic. Pass it and the pod joins the Service endpoints and starts receiving requests. Fail it and the pod is pulled from rotation: it gets no traffic, but it is not restarted either.

That makes the readiness probe a load-bearing piece of your reliability, and a deceptively easy one to get wrong. The default instinct, a handler that returns 200, technically “works” and quietly defeats the entire mechanism. This post is part of the Kubernetes operations series.

What is the difference between a readiness and liveness probe?

A readiness probe decides whether a pod receives traffic: failing it removes the pod from the Service endpoints but never restarts it. A liveness probe decides whether a pod is restarted: failing it kills and recreates the container. Readiness asks “can I serve right now,” liveness asks “am I broken beyond recovery.”

Conflating the two is one of the most common and damaging Kubernetes mistakes. If you put dependency checks in a liveness probe, a transient database blip will restart every pod instead of simply pausing traffic, turning a brief degradation into a self-inflicted restart storm.

| | Readiness probe | Liveness probe |
|---|---|---|
| Controls | Whether the pod gets traffic | Whether the pod is restarted |
| On failure | Removed from Service endpoints | Container killed and recreated |
| Answers | “Can I serve right now?” | “Am I unrecoverably broken?” |
| Dependency checks | Sometimes, carefully | Almost never |
| Failure during a DB blip | Pauses traffic (good) | Restart storm (bad) |
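The split between the two probes can be sketched as two separate handlers. This is a minimal illustration, not a definitive implementation: the handler names livez and readyz, the depsReady flag, and the probe helper are all assumptions made for the example. The point it shows is that a missing dependency flips readiness only, while liveness keeps passing.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// depsReady is a hypothetical per-pod flag that the application flips once
// its own connection pool and cache are initialized.
var depsReady atomic.Bool

// livez answers only "can this process respond at all?".
// It checks no dependencies, so a database blip never triggers a restart.
func livez(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyz answers "can this pod serve right now?" and fails while the
// per-pod dependencies are not ready, which only pauses traffic.
func readyz(w http.ResponseWriter, r *http.Request) {
	if !depsReady.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe calls a handler in-process and returns the status code it wrote.
func probe(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(livez), probe(readyz)) // dependencies not ready yet
	depsReady.Store(true)
	fmt.Println(probe(livez), probe(readyz)) // dependencies ready
}
```

In a real service the two handlers would be registered on the HTTP server and referenced from the livenessProbe and readinessProbe sections of the pod spec.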

There is also the startup probe, which protects slow-starting containers by holding off liveness and readiness checks until the app has finished booting. Use it for anything with a long warmup (a JVM service, a large model load) so a slow start is not mistaken for a failure.

Why do readiness probes give false positives?

Because the probe checks that the process is up, not that it can serve. The classic offender is a /healthz handler that returns 200 unconditionally. It passes while the database connection pool is empty, the cache is cold, or a critical downstream is unreachable, so Kubernetes sends real traffic straight into errors.

A truthful probe verifies the things the pod genuinely needs to serve. If the service cannot function without its database connection pool being initialized, the readiness probe should reflect that the pool is ready, not merely that the HTTP server is listening.

// LYING readiness probe: passes as long as the process runs
http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK) // says "ready" even with no DB, cold cache
})

// TRUTHFUL readiness probe: reflects real ability to serve
http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
    if !app.DBPoolReady() || !app.CacheWarm() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

Should a readiness probe check dependencies like the database?

Check the critical dependencies the pod truly cannot serve without, but do it carefully. The danger is a shared dependency: if every pod’s readiness probe checks the same database, one database blip fails all of them at once, removes the entire service from rotation, and converts a partial degradation into a full outage.

The balance is to distinguish what is local from what is shared. Checking your own initialized connection pool is safe, because it is per-pod. Checking that the shared database is reachable on every probe couples every pod’s fate to that dependency, which can be exactly the wrong behavior under stress.

A more resilient pattern is to keep serving in a degraded mode when a non-critical dependency is down, rather than failing readiness. Fail readiness only for dependencies without which a request genuinely cannot succeed, and even then consider whether removing every pod at once is better or worse than serving degraded responses.
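One way to express the local-versus-shared distinction is a readiness decision that fails only on per-pod critical state and reports a degraded mode for everything else. The sketch below is illustrative: the Deps struct, its fields, and the mode strings are hypothetical names, and which dependencies count as critical is a judgment call for each service.

```go
package main

import (
	"fmt"
	"net/http"
)

// Deps is a hypothetical snapshot of what this pod knows about its dependencies.
type Deps struct {
	PoolInitialized   bool // local, per-pod state: no request can succeed without it
	RecommendationsUp bool // illustrative non-critical downstream
}

// readiness returns the status code for the readiness endpoint and the
// serving mode. Only the critical local state can fail readiness.
func readiness(d Deps) (int, string) {
	if !d.PoolInitialized {
		return http.StatusServiceUnavailable, "not-ready"
	}
	if !d.RecommendationsUp {
		// Stay in rotation and serve reduced responses instead of failing.
		return http.StatusOK, "degraded"
	}
	return http.StatusOK, "full"
}

func main() {
	cases := []Deps{
		{PoolInitialized: false, RecommendationsUp: true},
		{PoolInitialized: true, RecommendationsUp: false},
		{PoolInitialized: true, RecommendationsUp: true},
	}
	for _, d := range cases {
		code, mode := readiness(d)
		fmt.Println(code, mode)
	}
}
```

The degraded mode is worth exporting as a metric or log field, so that a pod that is still in rotation but serving reduced responses is visible to operators.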

Can a bad liveness probe cause an outage?

Yes, and it is a classic self-inflicted one. A liveness probe that is too aggressive, or that checks external dependencies, restarts healthy pods during load spikes or transient blips. The restarts shed capacity exactly when you need it most, which increases load on the survivors, which fails more probes: a restart storm.

Liveness should detect only states a restart can actually fix: a deadlocked process, an unrecoverable internal error, a wedged event loop. It should never fail because a dependency is slow or because the pod is briefly busy. When in doubt, make the liveness probe more lenient and the readiness probe more precise.
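A liveness check that targets only wedged processes can be built on an internal heartbeat rather than on any dependency. The following is a rough sketch under stated assumptions: the functions beat, lastBeatTime, and livenessCode are hypothetical, and the 30-second silence limit is an arbitrary illustrative value. The main loop is assumed to call beat on every iteration, so only a stuck loop lets the heartbeat go stale.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// lastBeat holds the time of the most recent heartbeat in Unix nanoseconds.
var lastBeat atomic.Int64

// beat is called by the main work loop each time it makes progress.
func beat() { lastBeat.Store(time.Now().UnixNano()) }

// lastBeatTime returns the recorded heartbeat as a time.Time.
func lastBeatTime() time.Time { return time.Unix(0, lastBeat.Load()) }

// livenessCode fails only when the work loop has been silent for longer
// than maxSilence, a state a restart can plausibly fix. It never looks at
// databases, caches, or other external dependencies.
func livenessCode(now time.Time, maxSilence time.Duration) int {
	if now.Sub(lastBeatTime()) > maxSilence {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	beat()
	t0 := lastBeatTime()
	// Five seconds after the last heartbeat: still considered alive.
	fmt.Println(livenessCode(t0.Add(5*time.Second), 30*time.Second))
	// Two minutes of silence: the loop is treated as wedged.
	fmt.Println(livenessCode(t0.Add(2*time.Minute), 30*time.Second))
}
```

The silence limit should sit well above the longest pause a busy but healthy pod can show, in keeping with the advice to make liveness lenient.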

How do you tune probe timing?

Set the timing so a healthy pod is never marked unhealthy and a truly broken one is caught quickly. The four knobs that matter are initialDelaySeconds (or better, a startup probe), periodSeconds, timeoutSeconds, and failureThreshold. A common failure is a timeoutSeconds set so tight that a pod under load misses the deadline and gets pulled or restarted while it is actually fine.

A practical approach: use a startup probe to cover boot time instead of a long initialDelaySeconds, so slow starts do not force you to loosen the steady-state checks. Keep periodSeconds short enough to react quickly (a few seconds) but not so short that probes add meaningful load. Set timeoutSeconds generously relative to your real probe latency under load, because a probe that does a tiny bit of work can blow a 1-second timeout (the Kubernetes default) during a traffic spike. And set failureThreshold high enough that a single transient miss does not flap the pod out of rotation.

The asymmetry to remember: for readiness, err toward reacting fast so you stop sending traffic to a struggling pod. For liveness, err toward patience so you do not restart pods that are merely busy. Tuning both the same way is how a load spike turns into a restart storm.
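The asymmetry is easier to reason about with a little arithmetic. As a rough estimate, a pod that breaks just after a passing probe is acted on after about periodSeconds × failureThreshold seconds, plus up to timeoutSeconds for the last failing probe. The sketch below applies that estimate; the Probe struct and the numbers are illustrative assumptions, not recommended values, and the estimate ignores scheduling jitter.

```go
package main

import "fmt"

// Probe mirrors the timing fields discussed above. It is a plain struct
// for illustration, not the Kubernetes API type.
type Probe struct {
	PeriodSeconds    int
	TimeoutSeconds   int
	FailureThreshold int
}

// detectionWindowSeconds estimates the worst-case time between a pod
// breaking and Kubernetes acting on it: failureThreshold consecutive
// failures spaced one period apart, with the last probe running until
// its timeout.
func (p Probe) detectionWindowSeconds() int {
	return p.PeriodSeconds*p.FailureThreshold + p.TimeoutSeconds
}

func main() {
	// Readiness: react quickly so traffic stops reaching a struggling pod.
	readiness := Probe{PeriodSeconds: 5, TimeoutSeconds: 3, FailureThreshold: 2}
	// Liveness: be patient so a merely busy pod is not restarted.
	liveness := Probe{PeriodSeconds: 10, TimeoutSeconds: 5, FailureThreshold: 6}

	fmt.Printf("readiness acts within ~%ds\n", readiness.detectionWindowSeconds())
	fmt.Printf("liveness acts within ~%ds\n", liveness.detectionWindowSeconds())
}
```

With these example numbers, readiness pulls the pod after roughly 13 seconds while liveness waits roughly 65 seconds before restarting it, which is the kind of gap the asymmetry calls for.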

A readiness probe checklist

Before you ship a probe, confirm:

  • Readiness reflects real ability to serve, not just that the process started.
  • Liveness detects only unrecoverable states; it does not check external dependencies.
  • A startup probe guards any slow-booting container so warmup is not read as failure.
  • Shared-dependency checks will not remove the entire service on a single blip.
  • The pod degrades gracefully where it can, instead of failing readiness for every minor issue.
  • Probe timeouts, periods, and failure thresholds are tuned (a too-tight timeout fails healthy pods under load).
  • During a graceful shutdown, the pod fails readiness first so it drains traffic before terminating.

That last point matters for clean deploys: on shutdown, fail readiness immediately so Kubernetes stops sending new requests, then finish in-flight work, then exit. It is the difference between a deploy that drops zero requests and one that drops a burst on every rollout.
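The shutdown ordering above can be sketched with the standard library's http.Server. Treat this as a minimal sketch under assumptions: the shuttingDown flag, readyCode, newServer, and drain are hypothetical names, and the propagation delay is an assumed pause to let the failed readiness take effect before connections are closed. In a real service drain would run when the process receives SIGTERM; here main calls it directly so the example terminates on its own.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// shuttingDown is flipped first so readiness fails before anything closes.
var shuttingDown atomic.Bool

// readyCode is what the readiness endpoint reports.
func readyCode() int {
	if shuttingDown.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// newServer builds a server that exposes the readiness endpoint.
func newServer() *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(readyCode())
	})
	return &http.Server{Addr: "127.0.0.1:0", Handler: mux}
}

// drain fails readiness, waits for the pod to leave rotation, then lets
// in-flight requests finish for up to grace before returning.
func drain(srv *http.Server, propagation, grace time.Duration) error {
	shuttingDown.Store(true) // step 1: fail readiness immediately
	time.Sleep(propagation)  // step 2: assumed time for traffic to stop arriving
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	return srv.Shutdown(ctx) // step 3: finish in-flight work, then stop
}

func main() {
	srv := newServer()
	go srv.ListenAndServe() // returns once drain calls Shutdown

	fmt.Println("before drain:", readyCode())
	err := drain(srv, 50*time.Millisecond, 5*time.Second)
	fmt.Println("after drain:", readyCode(), "shutdown error:", err)
}
```

The propagation delay should be at least as long as it takes a failed readiness probe to pull the pod from rotation, and the total must fit inside the pod's termination grace period.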

What I’d do differently

The mistake I have made is copying a /healthz returns-200 handler from a tutorial into a real service and calling it done. It passes review, it passes in staging with everything healthy, and it lies the first time a dependency hiccups in production, sending traffic into a wall.

If I were writing probes from scratch, I would design them around the question “can this pod serve a request right now,” wire readiness to drain traffic on shutdown, keep liveness lenient enough that it only ever fixes truly stuck processes, and load-test the probe behavior under a dependency failure before trusting it. Probes are reliability code, and they deserve the same rigor as the request path they guard. For the broader platform context these probes run in, see Kubernetes Namespace Strategy for SaaS Platforms.
