Grafana Dashboards for Operators, Not Executives
Most Grafana dashboards are decoration. An operator dashboard answers one question fast during an incident. Here is how to design dashboards that speed up debugging.
Part of Observability for Distributed Systems
A Grafana dashboard built for operators answers one question fast: is this service healthy, and if not, where is the problem? Most dashboards fail that test. They are walls of pretty panels built to impress in a review, useless at 3 a.m. when someone is paged and has thirty seconds to find the fault. Designing for the operator under pressure, not the executive in a meeting, is the whole discipline.
The tell is the panel count. A dashboard with fifty graphs is decoration; nobody scans fifty graphs while production is down. A dashboard with a tight golden-signals row and a clear path to drill deeper is a tool.
Why dashboard design matters
A dashboard is read at two very different moments: calmly, during a review, and frantically, during an incident. Most dashboards are unconsciously designed for the calm moment, which is exactly the wrong target, because the calm moment is not when the dashboard has to earn its keep.
The incident is when it matters. Under pressure, an operator needs to go from “something is wrong” to “the database is saturated” in seconds. A dashboard that buries that signal in a grid of low-value panels actively slows the response. This post is part of the Observability series and pairs with Jaeger Tracing for Cross-Service Debugging.
What are the four golden signals?
The four golden signals are latency, traffic, errors, and saturation. Latency is how long requests take (watch the tail, not the average), traffic is request volume, errors is the failure rate, and saturation is how full your constrained resources are. Together they summarize whether a service is healthy, which makes them the correct top row of any operator dashboard.
They come from Google’s SRE practice, and they are popular because they generalize: almost any service’s health can be read from those four numbers. If those four look good, the service is almost certainly fine; if one is bad, it points you at the category of problem.
| Signal | Question it answers | Watch for |
|---|---|---|
| Latency | How long do requests take? | p99/p99.9 tail, not the mean |
| Traffic | How much demand? | Sudden spikes or drops |
| Errors | What fraction fail? | Rate and sudden changes |
| Saturation | How full are resources? | The most constrained resource first |
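To see why the table says to watch the tail rather than the mean, here is a small worked example. The latency samples are invented for illustration, and `percentile` is a plain nearest-rank helper, not what Grafana or Prometheus compute internally.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [2000] * 2

mean_ms = mean(latencies_ms)           # 89 ms: looks healthy
p99_ms = percentile(latencies_ms, 99)  # 2000 ms: the requests users complain about

print(f"mean={mean_ms} ms, p99={p99_ms} ms")
```

The mean hides the two slow requests almost completely, while the p99 shows them directly, which is the number an operator needs during an incident.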
What makes a good Grafana dashboard?
A good Grafana dashboard answers a specific question fast and contains nothing that does not serve that question. It leads with the golden signals, uses a consistent time range across panels so they are comparable, labels axes and units clearly, and ruthlessly omits anything that does not help someone act. Clarity under pressure is the only metric that matters.
The design principles that follow from “answer fast”:
- One dashboard, one job. A service-health dashboard is not also a capacity-planning dashboard. Separate audiences, separate boards.
- Most important panels top-left. Eyes start there; put the golden signals where they land first.
- Consistent time windows. If panels show different ranges, you cannot correlate a spike across them.
- Units and thresholds visible. A number without a unit or a “good vs bad” threshold makes the operator do math under stress.
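The principles above can be sketched as a dashboard definition built in Python. This is a minimal sketch, not a complete dashboard: the field names follow the Grafana dashboard JSON model as I understand it, so check them against your Grafana version. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`, `db_pool_in_use`, `db_pool_size`) and every threshold value are hypothetical placeholders.

```python
GRID_WIDTH = 24  # Grafana lays panels out on a 24-column grid


def golden_panel(title, expr, unit, bad_above, column):
    """One top-row time-series panel with an explicit unit and a visible threshold."""
    width = GRID_WIDTH // 4  # four equal panels fill the top row
    return {
        "type": "timeseries",
        "title": title,
        "gridPos": {"x": column * width, "y": 0, "w": width, "h": 8},
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {"defaults": {
            "unit": unit,
            "custom": {"thresholdsStyle": {"mode": "line"}},  # draw the threshold on the graph
            "thresholds": {"mode": "absolute", "steps": [
                {"color": "green", "value": None},
                {"color": "red", "value": bad_above},
            ]},
        }},
    }


# (title, PromQL, Grafana unit id, placeholder threshold); all values are assumptions.
SIGNALS = [
    ("Latency p99",
     "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
     "s", 0.5),
    ("Traffic", "sum(rate(http_requests_total[5m]))", "reqps", 2000),
    ("Errors",
     'sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
     "percentunit", 0.01),
    ("Saturation", "max(db_pool_in_use / db_pool_size)", "percentunit", 0.8),
]

dashboard = {
    "title": "Service health",  # one dashboard, one job
    "time": {"from": "now-1h", "to": "now"},  # one range shared by every panel
    "panels": [golden_panel(*signal, column=i) for i, signal in enumerate(SIGNALS)],
}
```

Setting the time range once at the dashboard level, rather than per panel, is what keeps the four graphs comparable.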
How many panels should a dashboard have?
As few as answer the dashboard’s question, usually a handful, not dozens. A fifty-panel dashboard cannot be scanned during an incident, so the detail is effectively invisible when it is needed. Lead with the golden-signal panels and push everything else into linked drill-down dashboards that you open only once the top-level board points you there.
The structure that scales is a hierarchy: a top-level health dashboard with the golden signals and a handful of key indicators, linking down to detailed dashboards per subsystem. The operator starts at the top, sees which signal is bad, and clicks into the relevant detail. Depth on demand, not depth by default.
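One way to express "depth on demand" is with dashboard links on the top-level board. The sketch below assumes the `links` field of the Grafana dashboard JSON model; the dashboard UIDs (`db-detail`, `cache-detail`, `queue-detail`) are hypothetical and would be replaced by your own.

```python
def drill_down_link(title, uid):
    """A dashboard-level link that carries the current time range and variables along."""
    return {
        "type": "link",
        "title": title,
        "url": f"/d/{uid}",
        "keepTime": True,     # the drill-down opens on the same window as the top-level board
        "includeVars": True,  # the selected variables carry over too
    }


top_level = {
    "title": "Service health",
    "panels": [],  # golden-signal panels would go here
    "links": [
        drill_down_link("Database detail", "db-detail"),
        drill_down_link("Cache detail", "cache-detail"),
        drill_down_link("Queue detail", "queue-detail"),
    ],
}
```

Keeping the time range on the way down matters: the operator arrives at the detail board already looking at the spike that sent them there.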
Should dashboards and alerts use the same metrics?
Yes, always. The metric that pages you must be front and center on the dashboard you open when that page fires. If the alert references one number and the dashboard shows different ones, the operator wastes precious time reconciling them. Alerts and the operator dashboard should tell a single, consistent story.
This alignment is what makes a page actionable. The ideal flow is: the alert fires on a golden signal, the operator opens the linked dashboard, and the very first panel shows the metric that fired, in context, with its threshold marked. There is no translation step. The dashboard and the alert are two views of the same truth, which is also why dashboards should be tuned alongside your SLOs rather than treated as an afterthought.
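A simple way to keep the two in lockstep is to define the expression and the threshold once and derive both the alert and the panel from them. The sketch below assumes a Prometheus-style alerting rule and a Grafana panel; the service name `checkout`, the metric `http_requests_total`, the 1% threshold, and the dashboard URL are all hypothetical.

```python
# Single source of truth for the paging signal (names and values are assumptions).
ERROR_RATIO = (
    'sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="checkout"}[5m]))'
)
THRESHOLD = 0.01  # page above a 1% error ratio

# Prometheus-style alerting rule, as it would appear under a rule group.
alert_rule = {
    "alert": "CheckoutHighErrorRatio",
    "expr": f"{ERROR_RATIO} > {THRESHOLD}",
    "for": "5m",
    "labels": {"severity": "page"},
    "annotations": {"dashboard": "/d/service-health?var-service=checkout"},
}

# First panel on the dashboard the alert links to: same expression, same threshold.
first_panel = {
    "type": "timeseries",
    "title": "Errors",
    "gridPos": {"x": 0, "y": 0, "w": 6, "h": 8},
    "targets": [{"refId": "A", "expr": ERROR_RATIO}],
    "fieldConfig": {"defaults": {
        "unit": "percentunit",
        "thresholds": {"mode": "absolute", "steps": [
            {"color": "green", "value": None},
            {"color": "red", "value": THRESHOLD},
        ]},
    }},
}
```

Because both objects are built from `ERROR_RATIO` and `THRESHOLD`, changing the alert changes the panel in the same commit, and the two cannot drift apart.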
How do you organize dashboards across many services?
Use a hierarchy and a template, not one bespoke dashboard per service built by hand. At scale, the winning pattern is a single standardized service-health dashboard, parameterized by a variable that selects the service, so every service is viewed through the same consistent lens. Above it sits a fleet-level overview; below it sit subsystem drill-downs.
This solves the problem that kills observability at scale: dashboard sprawl. When every team hand-builds its own dashboards, you end up with hundreds of inconsistent boards, and an on-call engineer paged for an unfamiliar service has to learn a new layout under pressure. A templated dashboard means the golden signals are always in the same place, whatever service you are looking at, so muscle memory works across the whole fleet.
In practice the hierarchy has three tiers: a fleet overview showing the health of all services at a glance, a per-service health dashboard generated from one template, and detailed subsystem dashboards linked from each. Operators move down the tiers as they narrow the problem. Building dashboards as code (versioned, templated, reviewed) rather than clicking them together by hand is what keeps that structure consistent as the number of services grows.
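As a rough illustration of the templated, dashboards-as-code approach, the sketch below defines one service-health dashboard with a `service` variable and serializes it to JSON for version control. It assumes a Prometheus data source (whose variable queries support `label_values`) and a hypothetical `http_requests_total` metric carrying a `service` label; the UID is also invented.

```python
import json

# Template variable: a dropdown of services, filled from label values at view time.
SERVICE_VARIABLE = {
    "name": "service",
    "label": "Service",
    "type": "query",
    "query": "label_values(http_requests_total, service)",
}


def templated_panel(title, expr, column):
    """A top-row panel whose query is filtered by the selected $service."""
    return {
        "type": "timeseries",
        "title": title,
        "gridPos": {"x": column * 12, "y": 0, "w": 12, "h": 8},
        "targets": [{"refId": "A", "expr": expr}],
    }


template = {
    "uid": "service-health",
    "title": "Service health",
    "templating": {"list": [SERVICE_VARIABLE]},
    "time": {"from": "now-1h", "to": "now"},
    "panels": [
        templated_panel(
            "Traffic", 'sum(rate(http_requests_total{service="$service"}[5m]))', 0),
        templated_panel(
            "Errors",
            'sum(rate(http_requests_total{service="$service",code=~"5.."}[5m]))'
            ' / sum(rate(http_requests_total{service="$service"}[5m]))', 1),
    ],
}

# Stable, diff-friendly output suitable for code review.
rendered = json.dumps(template, indent=2, sort_keys=True)
```

Sorting keys and fixing the indent keeps diffs small, so a reviewer sees only the panel that actually changed.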
An operator-dashboard checklist
Before you call a dashboard done:
- It answers one clear question (service health, or one subsystem), not five.
- The golden signals (latency tail, traffic, errors, saturation) are the top row.
- Every panel would change what an operator does during an incident; the rest are cut or moved to drill-downs.
- Panels share a consistent time range and have clear units and thresholds.
- The metric behind each alert is visible on the dashboard the alert links to.
- A new on-call engineer can read service health from it in under thirty seconds.
- Detail lives in linked drill-down dashboards, not crammed onto the top-level view.
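Parts of this checklist can be checked mechanically if dashboards live in version control. The following is a sketch of such a lint, not an existing tool: `lint_dashboard`, the panel budget of 8, and the convention that golden-signal panels carry the signal name in their title are all assumptions made for illustration.

```python
GOLDEN_SIGNALS = ("latency", "traffic", "errors", "saturation")
MAX_PANELS = 8  # assumed budget; pick your own


def lint_dashboard(dashboard):
    """Return a list of human-readable problems found in a dashboard dict."""
    problems = []
    panels = dashboard.get("panels", [])

    if len(panels) > MAX_PANELS:
        problems.append(f"{len(panels)} panels exceeds the budget of {MAX_PANELS}")

    # Golden signals must sit in the top row (gridPos.y == 0).
    top_titles = " ".join(
        p.get("title", "").lower() for p in panels
        if p.get("gridPos", {}).get("y") == 0
    )
    for signal in GOLDEN_SIGNALS:
        if signal not in top_titles:
            problems.append(f"top row has no {signal} panel")

    # Every panel needs a unit and at least one good-versus-bad threshold.
    for p in panels:
        defaults = p.get("fieldConfig", {}).get("defaults", {})
        if not defaults.get("unit"):
            problems.append(f"panel '{p.get('title')}' has no unit")
        if len(defaults.get("thresholds", {}).get("steps", [])) < 2:
            problems.append(f"panel '{p.get('title')}' has no threshold")

    if not dashboard.get("links"):
        problems.append("no drill-down links")
    return problems


# A deliberately poor dashboard: one raw-counter panel, no unit, no links.
for problem in lint_dashboard({"panels": [{"title": "Total requests", "gridPos": {"y": 0}}]}):
    print(problem)
```

The items a script cannot check, such as whether a new on-call engineer can read the board in thirty seconds, still need a human review.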
What should you not put on an operator dashboard?
Leave off anything that does not change what an operator does in the next few minutes: vanity metrics, raw counters with no rate or threshold, business KPIs that belong on an executive dashboard, and any panel that is there to look comprehensive rather than to be acted on. If a graph cannot trigger a decision during an incident, it is clutter on the board that matters most.
The specific offenders show up again and again. Cumulative totals (total requests ever) instead of rates tell an operator nothing about right now. Panels with no unit or no good-versus-bad threshold force mental math under stress. Business metrics like signups or revenue belong on a different dashboard for a different audience; mixing them into a health view dilutes the signal an on-call engineer is scanning for. And the worst offender is the panel added “just in case,” which is never the panel anyone looks at during a real incident.
The discipline is subtractive. Every panel should justify its place by answering “what would an operator do differently because of this.” Anything that cannot answer that question moves to a drill-down or comes off the board entirely. A lean dashboard is not a less-thorough dashboard; it is one that respects the thirty seconds an operator actually has.
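The cumulative-total problem is easy to see with numbers. The sketch below uses invented counter samples and a hand-rolled `per_second_rate` helper; it ignores counter resets, which a real query function such as Prometheus `rate()` handles for you.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last (timestamp_s, counter) sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / (t1 - t0)


# Hypothetical scrapes of a "total requests ever" counter, one minute apart.
samples = [(0, 1_000_000), (60, 1_000_600), (120, 1_000_630)]

before = per_second_rate(samples[:2])  # 10.0 requests/s
after = per_second_rate(samples[1:])   # 0.5 requests/s

print(f"total went from {samples[1][1]} to {samples[2][1]}")
print(f"rate went from {before} to {after} requests/s")
```

On a graph of the raw total, the change from 1,000,600 to 1,000,630 is invisible. On a graph of the rate, traffic has dropped by 95%, which is exactly the signal an operator needs to see.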
What I’d do differently
The dashboards I regret are the impressive ones: dense grids of panels that looked authoritative in a review and that nobody could actually use when the system was on fire. They optimized for the wrong audience, the observer who wants to feel informed, instead of the operator who needs to act.
If I were building dashboards again, I would design every one for the worst moment: a tired engineer, paged at 3 a.m., who needs the answer in seconds. That constraint forces the right choices: golden signals on top, ruthless panel pruning, alerts and dashboards in lockstep, and detail one click away. A dashboard that serves that moment serves every other moment too. The traces that complement these dashboards during debugging are covered in Jaeger Tracing for Cross-Service Debugging.
Sources
- Google SRE Book, Monitoring Distributed Systems (the four golden signals): sre.google/sre-book/monitoring-distributed-systems
- Grafana, Best practices for dashboard management: grafana.com/docs/grafana/latest/dashboards/build-dashboards/best-practices
- Grafana, Common observability strategies: grafana.com/docs/grafana/latest/fundamentals
Frequently asked questions
What makes a good Grafana dashboard?
A good operator dashboard answers a specific question fast: is the service healthy, and if not, where is the problem. It leads with the golden signals (latency, traffic, errors, saturation), uses consistent time ranges, and cuts every panel that does not help someone act during an incident.
What are the four golden signals?
Latency, traffic, errors, and saturation. Latency is how long requests take, traffic is how many you are getting, errors is the failure rate, and saturation is how full your resources are. Together they summarize service health and are the right top row of an operator dashboard.
How many panels should a dashboard have?
As few as answer the dashboard’s question. A wall of fifty panels is for decoration, not debugging; under incident pressure nobody can scan it. Lead with a handful of golden-signal panels and push detail to linked drill-down dashboards.
Should dashboards and alerts use the same metrics?
Yes. The metric that pages you should be visible on the dashboard you open when paged, so you can immediately see what fired and why. Alerts and the operator dashboard should tell one consistent story, not reference different numbers.