HPA on Queue Depth, Not CPU
For queue-driven workers, CPU-based autoscaling reacts too late. Scale your Kubernetes HPA on queue depth or lag instead. Why CPU lies, and how to switch.
Part of Kubernetes Operations for Production Platforms
For queue-driven workers, autoscaling on CPU is scaling on the wrong signal. The fix is to drive the Kubernetes HPA from queue depth: scale on the backlog of pending work, because an async worker’s CPU reflects only the work it is already processing, and it rises (if at all) after the queue has built up. By the time CPU crosses your threshold, latency has already degraded. Queue depth is the leading indicator; CPU is the lagging one.
This is one of the most common autoscaling mistakes in production, and it is subtle because CPU autoscaling is the default everyone reaches for. It works fine for web services and quietly fails for workers.
Why the autoscaling signal matters
The Horizontal Pod Autoscaler adds and removes replicas based on a metric crossing a target. The entire usefulness of autoscaling depends on that metric being a timely measure of demand. Pick a metric that lags, and the autoscaler always reacts too late, adding capacity after the pain instead of before it.
For request-driven services, CPU is a reasonable proxy: more requests, more CPU, right now. For queue-driven workers, that relationship breaks, and the break is the whole subject of this post. This is part of the Kubernetes operations series and pairs closely with Backpressure Design for Real-Time Systems.
Why is CPU a bad autoscaling signal for queue workers?
Because a worker’s CPU usage reflects work it is currently processing, not work waiting to be processed. When a flood of messages lands in the queue, the workers keep chugging at whatever rate they can; CPU does not spike to signal the flood, because the workers were already busy. CPU only tells you about the present throughput, while the queue tells you about the unmet demand.
The result is a dangerous lag. The backlog builds, consumer lag climbs, end-to-end latency degrades, and CPU sits at a normal-looking level the whole time, because the workers are saturated on throughput (typically waiting on I/O or capped by concurrency) rather than on CPU. By the time CPU does cross the threshold (if it ever does), you are already deep in a backlog that will take a long time to drain.
| | CPU as the signal | Queue depth as the signal |
|---|---|---|
| What it measures | Current processing | Pending, unprocessed work |
| Indicator type | Lagging | Leading |
| Reacts to a burst | After backlog forms | As backlog forms |
| Scale-to-zero when idle | Awkward | Natural (queue empty) |
| Best for | Request-driven services | Async / queue workers |
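To make the lag concrete, here is a minimal simulation sketch. Every number in it is an assumption chosen for illustration (four workers, each capped at 10 messages per second by I/O, each message costing about 4% of a core-second); it is not a measurement from any real system. It shows a burst building a large backlog while average CPU never approaches a typical 70% HPA target.

```python
def simulate_burst(arrivals, workers=4, max_rate_per_worker=10, cpu_pct_per_msg=4.0):
    """Step a fixed worker pool through per-second arrival counts.

    Returns one (backlog, avg_cpu_percent) tuple per simulated second.
    All parameters are illustrative assumptions, not measured values.
    """
    backlog = 0
    history = []
    capacity = workers * max_rate_per_worker  # messages/s the pool can process
    for arrived in arrivals:
        backlog += arrived
        processed = min(backlog, capacity)  # throughput is capped, not CPU
        backlog -= processed
        avg_cpu = processed * cpu_pct_per_msg / workers
        history.append((backlog, avg_cpu))
    return history


# 5 s of steady load at 30 msg/s, then a 10 s burst at 120 msg/s.
arrivals = [30] * 5 + [120] * 10
history = simulate_burst(arrivals)

final_backlog = history[-1][0]
peak_cpu = max(cpu for _, cpu in history)
print(f"final backlog: {final_backlog} messages, peak CPU: {peak_cpu:.0f}%")
# prints "final backlog: 800 messages, peak CPU: 40%"
```

Under these assumptions CPU moves from 30% to 40% during the burst, so a CPU target of 70% would never fire, while the backlog grows by 80 messages every second.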
What should you scale a queue worker on?
Scale on the queue: pending message count (queue depth) or consumer lag. These are direct, leading measures of how much work is waiting, so the autoscaler adds workers the moment the backlog grows, before latency degrades. The target becomes “messages per worker” rather than “CPU percent,” which maps directly to how fast you drain the queue.
The mental model is simple: if you want to keep the backlog under control, scale the number of workers to the amount of waiting work. Twice the queue depth, roughly twice the workers, and the backlog drains at a predictable rate. CPU never enters the equation, because the queue already tells you exactly how behind you are.
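As a rough sketch of that mental model, the replica count is the backlog divided by the per-worker target, rounded up and clamped to a floor and ceiling. The function name and the default bounds below are hypothetical, and the real HPA also applies a tolerance band and scaling policies that this sketch leaves out.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=50):
    """Replicas needed so each worker owns at most target_per_replica messages.

    Simplified model of an average-value target; the bounds are illustrative.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(1000, 100))  # 10
print(desired_replicas(2000, 100))  # 20: twice the depth, twice the workers
print(desired_replicas(0, 100, min_replicas=0))  # 0: an empty queue needs no workers
```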
How do you autoscale Kubernetes on queue depth?
Feed the HPA an external metric representing the queue, most commonly via KEDA, which provides ready-made scalers for Kafka, SQS, RabbitMQ, Redis, and many others. KEDA reads the queue depth or consumer lag and drives the HPA, and it can scale to zero when the queue is empty, which a native CPU-based HPA cannot do cleanly. The plain HPA can also consume external metrics if you pipe them in yourself, but KEDA removes most of the plumbing.
Scale-to-zero is an underrated benefit. A queue worker with no pending messages does not need to run at all, and KEDA can take it to zero replicas and wake it when messages arrive. For bursty async workloads, that turns idle capacity into zero cost, which a CPU-based HPA (with its minimum-one-replica floor and CPU-never-quite-zero behavior) cannot match.
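For illustration, here is a sketch that builds a KEDA ScaledObject for a Kafka consumer as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The deployment name, topic, consumer group, broker address, and thresholds are all hypothetical placeholders. The field names follow the KEDA Kafka scaler as I understand it, so check them against the documentation for the KEDA version you run.

```python
import json


def kafka_scaled_object(name, deployment, topic, consumer_group,
                        bootstrap_servers, lag_per_replica, max_replicas):
    """Build a KEDA ScaledObject manifest that scales a Deployment on Kafka lag."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 0,  # allow scale-to-zero when the queue is empty
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "kafka",
                "metadata": {
                    "bootstrapServers": bootstrap_servers,
                    "consumerGroup": consumer_group,
                    "topic": topic,
                    # Trigger metadata values are strings in KEDA manifests.
                    "lagThreshold": str(lag_per_replica),
                },
            }],
        },
    }


# Hypothetical names throughout; substitute your own.
manifest = kafka_scaled_object(
    name="order-worker-scaler",
    deployment="order-worker",
    topic="orders",
    consumer_group="order-workers",
    bootstrap_servers="kafka.example.internal:9092",
    lag_per_replica=100,
    max_replicas=30,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would register the scaler, assuming KEDA is already installed in the cluster.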
When is CPU-based autoscaling fine?
CPU autoscaling is perfectly good for request-driven services where CPU rises with incoming traffic in real time: an HTTP API, a render service, anything that does CPU work synchronously as requests arrive. There, CPU is a close, timely proxy for demand, and adding replicas when CPU climbs is exactly right.
The rule is about workload shape, not a blanket preference. Synchronous, CPU-bound request handling scales fine on CPU. Asynchronous, queue-fed work scales on the queue. Many systems have both, in which case you use CPU for the web tier and queue depth for the workers, rather than forcing one signal on both.
How do you avoid autoscaling thrash on queue depth?
Tune the scaling target and stabilization so the autoscaler does not flap replicas up and down as the queue wobbles. The two levers are the target value (messages or lag per replica, set from your real drain rate) and a stabilization window that smooths brief dips so the autoscaler does not scale down the instant a burst clears. Without these, a spiky queue produces a thrashing replica count, which is its own kind of instability.
Thrash happens when the autoscaler reacts to every small fluctuation. A queue that briefly empties triggers a scale-down, then the next burst triggers a scale-up, and the churn of starting and stopping pods, each paying startup time, can be worse than running slightly more replicas steadily. The fix is a scale-down stabilization window so the autoscaler waits to confirm the queue has genuinely drained before removing capacity, while still scaling up promptly when the backlog grows.
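The effect of a scale-down stabilization window can be sketched in a few lines. The class below is a simplified model, not the HPA's actual code: it keeps recent replica recommendations and applies the highest one still inside the window, so scale-ups take effect immediately while scale-downs wait until the whole window agrees. The class name, the 120-second window, and the sample recommendations are assumptions for illustration.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold replicas at the highest recommendation seen within the window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended_replicas)

    def stabilize(self, now, recommended):
        self._history.append((now, recommended))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


# A spiky queue: raw recommendations arrive every 30 s.
raw = [8, 2, 9, 2, 2, 2, 2, 2]
stabilizer = ScaleDownStabilizer(window_seconds=120)
applied = [stabilizer.stabilize(now=i * 30, recommended=r) for i, r in enumerate(raw)]
print(applied)  # [8, 8, 9, 9, 9, 9, 9, 2]
```

Following the raw recommendations would bounce the deployment 8, 2, 9, 2. With the window, the replica count rises to 9 as soon as the second burst arrives and drops to 2 only after the queue has stayed quiet for the full window.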
Set the per-replica target from how fast one worker actually drains the queue. If one worker clears a known number of messages per second and you want the backlog gone within a target time, that math gives you the messages-per-replica target directly. Anchoring the target to a real drain rate, rather than a guessed number, is what makes the autoscaling both responsive and stable, scaling up fast on real demand and down calmly when the work is truly done.
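The drain-rate arithmetic is short enough to write out. This sketch assumes a steady per-worker drain rate and ignores messages that arrive while the backlog is draining, so treat the result as a starting point to refine under load tests. The 20 messages per second and 60-second goal are made-up example numbers.

```python
import math


def messages_per_replica_target(drain_rate_per_worker, target_drain_seconds):
    """Messages one worker can clear within the drain-time goal."""
    return drain_rate_per_worker * target_drain_seconds


def estimated_drain_seconds(backlog, replicas, drain_rate_per_worker):
    """Time to clear the backlog, ignoring new arrivals (a simplification)."""
    return backlog / (replicas * drain_rate_per_worker)


rate = 20        # assumed: messages/s one worker clears
goal = 60        # assumed: seconds within which the backlog should be gone
target = messages_per_replica_target(rate, goal)
print(target)  # 1200 messages per replica

backlog = 12_000
replicas = math.ceil(backlog / target)
print(replicas, estimated_drain_seconds(backlog, replicas, rate))  # 10 60.0
```

With a target of 1200 messages per replica, a 12,000-message backlog asks for 10 workers, which together clear it in the 60-second goal.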
An autoscaling-signal checklist
When configuring an HPA, decide:
- Is this workload request-driven (CPU tracks demand) or queue-driven (backlog is the truth)?
- For queue workers, the scaling metric is queue depth or consumer lag, not CPU.
- The target is expressed as work-per-replica (messages/lag per pod), tied to your drain-rate goal.
- KEDA (or an external-metrics pipeline) feeds the queue metric to the HPA.
- Scale-to-zero is enabled where idle workers cost money for no reason.
- The web tier and the worker tier use the signal appropriate to each.
- You verified the autoscaler reacts before latency degrades, by load-testing a burst.
What about scaling on latency or in-flight requests?
Queue depth is the cleanest signal for async workers, but for some workloads the right leading indicator is in-flight request count or end-to-end latency, and the same principle applies: scale on the metric that reflects demand before saturation, not after. The point was never “always use queue depth”; it was “use a leading indicator, and CPU usually isn’t one.”
For a synchronous service with bounded concurrency, the number of in-flight requests is often a better signal than CPU, because it rises the moment requests queue up at the application even if CPU has not yet maxed out. For some user-facing services, scaling on a latency SLO works too: if p95 latency starts climbing, add capacity, because rising latency is an early sign of saturation. Both are leading indicators in the way CPU is not for these shapes of work.
The unifying rule is to choose the metric that crosses its threshold while you still have time to react. CPU works when it rises with demand in real time. Queue depth works for async workers. In-flight count or latency works for concurrency-bound services. The wrong move, the one this whole post argues against, is defaulting to CPU for every workload because it is the built-in option, when the workload’s real demand shows up somewhere else first.
What I’d do differently
The mistake I have watched (and made) is reaching for CPU autoscaling everywhere because it is the default, then being confused when the worker tier falls behind under load while CPU looks fine. The autoscaler was doing its job; it was just watching the wrong number, one that could not see the backlog forming.
If I were setting up autoscaling again, I would ask one question per workload before touching the HPA: does CPU rise with demand in real time here? If yes, CPU is fine. If the work flows through a queue, I would wire KEDA to the queue depth from the start. Autoscaling is only as good as the signal it watches, and for async workers the queue is the only signal that tells the truth in time.
Sources
- Kubernetes, Horizontal Pod Autoscaler: kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale
- KEDA, Kubernetes Event-Driven Autoscaling: keda.sh/docs/latest/concepts
- Kubernetes, HPA with custom and external metrics: kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#scaling-on-custom-metrics
Frequently asked questions
Why is CPU a bad autoscaling signal for queue workers?
Because for a worker pulling from a queue, CPU usage does not rise until work is already being processed, which is after the backlog has built up. CPU is a lagging indicator of demand for async workloads. By the time CPU crosses the threshold, the queue is already deep and latency has already degraded.
What should you scale a queue worker on?
Scale on the queue itself: the number of pending messages (queue depth) or consumer lag. That is the leading indicator of demand. When the backlog grows, you want more workers immediately, before CPU even reflects the load, so scaling on queue depth responds in time rather than after the fact.
How do you autoscale Kubernetes on queue depth?
Use an external/custom metric the HPA can read, commonly via KEDA, which has scalers for Kafka, SQS, RabbitMQ, and others. KEDA reads the queue depth or lag and drives the HPA, including scaling to zero when the queue is empty. The native HPA can also use external metrics if you feed them in.
When is CPU-based autoscaling fine?
For request-driven services where CPU rises with traffic in real time, CPU autoscaling is reasonable, because the signal tracks demand closely. CPU autoscaling fails specifically for async, queue-driven workloads where the backlog, not CPU, is the true measure of pending work.