Incident Postmortems for Staff Engineers
A blameless incident postmortem fixes the system, not the person. This post covers the structure, the root-cause discipline, and the action items that actually get done.
Part of Staff Engineer Craft: Design, Influence, and Learning
A good incident postmortem fixes the system, not the person. The single most important property is that it is blameless: it assumes a competent engineer acted reasonably inside a system that allowed the failure, and it asks what about the system let this happen. Get that culture right and people surface the honest detail you need to actually prevent recurrence. Get it wrong and every postmortem becomes theater, because nobody tells the truth when telling the truth gets punished.
The postmortem is where an organization either learns from failure or merely survives it. The difference is not how serious the meeting feels; it is whether the action items get done and whether people felt safe enough to tell you what really happened.
Why postmortems are a staff engineer’s responsibility
Incidents are inevitable in any system of real complexity. What separates a maturing engineering org from a stagnant one is whether each incident makes the system stronger or just exhausts the people who fought it. Driving that learning is squarely staff-level work, because it requires both technical depth and the influence to get systemic fixes prioritized.
A staff engineer running a postmortem is doing two jobs at once: finding the true root cause, and protecting the culture that makes finding it possible. Both matter, and the second is the one that quietly fails first. This post is part of the Staff Engineer Craft series.
What is a blameless postmortem?
A blameless postmortem examines an incident by focusing on the systemic conditions that allowed it, not on the individual who triggered it. Its premise is that good engineers cause incidents inside imperfect systems, so the productive question is never “who did this” but “what about our system made this possible, and how do we change it.” Blame is replaced by systemic curiosity.
The reasoning is practical, not just kind. If “root cause: human error” is an acceptable conclusion, you stop looking exactly when the useful analysis would begin, because human error is the start of the investigation, not the end. Why was that error possible? Why did nothing catch it? Those questions lead to fixes; “be more careful” leads nowhere.
Why blameless? Doesn’t someone need to be accountable?
Accountability belongs to fixing the system, not to punishing the person. Blame does not improve reliability; it teaches people to hide mistakes, downplay incidents, and withhold the candid detail a postmortem depends on, which makes the next failure both more likely and harder to learn from. Blameless cultures surface more truth and therefore fix more root causes.
This is the part leaders find counterintuitive: the way to get more accountability is to remove blame. The accountability is real; it just attaches to the system and to completing the action items, not to shaming whoever was on the keyboard when a latent flaw finally surfaced.
What should an incident postmortem include?
A complete postmortem has a summary and impact, a factual timeline, the root cause, the detection and mitigation story, and concrete action items with owners and due dates. Everything except the action items is context; the owned, tracked action items are the only part that actually changes the future.
A dependable structure:
- Summary and impact. What happened, for how long, and who or what was affected.
- Timeline. A factual, timestamped sequence: detection, diagnosis, mitigation, resolution.
- Root cause. The systemic condition that allowed it, not just the trigger.
- Detection and response. How you found out, how fast, and what slowed the response.
- What went well. The mitigations and instincts worth reinforcing.
- Action items. Specific, owned, dated changes that prevent recurrence.
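One way to keep that structure honest is to treat the postmortem as data rather than free text. The sketch below is a minimal illustration, not a standard template: the class names, the field names, and the extra "start" phase (when impact began) are all assumptions of mine. It shows how a timestamped timeline lets you compute the detection and mitigation gaps instead of estimating them from memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime
    # Hypothetical phase names: "start" (impact begins, my addition),
    # then "detection", "diagnosis", "mitigation", "resolution".
    phase: str
    note: str


@dataclass
class Postmortem:
    summary: str
    impact: str
    root_cause: str
    went_well: list[str] = field(default_factory=list)
    timeline: list[TimelineEvent] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def _first(self, phase: str) -> datetime:
        """Return the earliest timestamp recorded for a phase."""
        for event in sorted(self.timeline, key=lambda e: e.at):
            if event.phase == phase:
                return event.at
        raise ValueError(f"timeline has no {phase!r} event")

    def time_to_detect(self) -> timedelta:
        """How long the incident ran before anyone knew about it."""
        return self._first("detection") - self._first("start")

    def time_to_mitigate(self) -> timedelta:
        """How long it took to stop the impact once it was detected."""
        return self._first("mitigation") - self._first("detection")
```

A missing phase raises an error rather than returning a guess, which is deliberate: a timeline that cannot answer "when did we find out" is not finished yet.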
How do you find the real root cause of an incident?
Keep asking “why” past the first technical trigger until you reach a condition a system change can fix. The bad deploy, the wrong config, the unhandled input: those are triggers. The root cause is why that trigger was possible and why nothing caught it. Stop only when the answer is something you can actually change about the system.
The classic example: the incident was “a deploy took the service down.” Ask why. The deploy changed a config past a limit. Why did that ship? There was no validation on the value. Why did it reach production? The pipeline had no check for it. The root cause is the missing guardrail, not the engineer who typed the value, and the fix is a guardrail, not a reprimand.
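To make the "missing guardrail" concrete, here is a rough sketch of the kind of check the pipeline in that example lacked. Everything specific in it is hypothetical: the `max_message_bytes` key, the numeric bounds, and the function name are stand-ins I made up for illustration, and a real pipeline would load the limits from wherever the downstream system defines them.

```python
# Hypothetical limits table: config key -> (minimum, maximum) allowed value.
# The key name and the bounds are illustrative, not taken from a real system.
LIMITS = {
    "max_message_bytes": (1, 1_048_576),
}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable violations; empty means the config passes."""
    errors = []
    for key, (low, high) in LIMITS.items():
        if key not in config:
            errors.append(f"{key}: missing")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(f"{key}: expected an integer, got {value!r}")
        elif not low <= value <= high:
            errors.append(f"{key}: {value} is outside the allowed range {low}..{high}")
    return errors


def pipeline_gate(config: dict) -> int:
    """Exit-code style wrapper: 0 lets the deploy continue, 1 blocks it."""
    errors = validate_config(config)
    for error in errors:
        print(f"config check failed: {error}")
    return 1 if errors else 0
```

The point of the sketch is where the check lives, not how clever it is: once the gate runs on every deploy, the same typo gets rejected before production regardless of who typed it.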
This is also where the postmortem connects back to design: many root causes are exactly the failure modes the Distributed systems patterns series exists to prevent, from a missing timeout budget to absent backpressure to a readiness probe that lied. A good postmortem frequently produces a future RFC.
Action items that actually get done
The graveyard of postmortems is full of action items nobody ever completed. An action item prevents recurrence only if it is specific, owned by a named person, given a due date, and tracked like any other prioritized work. “We should improve monitoring” is not an action item; “add an alert on config values exceeding the broker limit, owned by X, due next sprint” is.
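The four properties above (specific, owned, dated, tracked) are mechanical enough to lint. The following is a small sketch under my own assumptions: the `ActionItem` fields, the `ticket` field as a proxy for "tracked", and especially the list of vague openers are hypothetical, and the vagueness check is a crude heuristic rather than a real measure of specificity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Crude, hypothetical heuristic: phrases that usually signal an intention
# rather than a concrete change. Tune or replace for your own team.
VAGUE_OPENERS = ("we should", "be more careful", "look into")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None  # backlog reference, e.g. an issue ID


def problems(item: ActionItem) -> list[str]:
    """List the reasons an action item is unlikely to get done."""
    found = []
    if item.description.strip().lower().startswith(VAGUE_OPENERS):
        found.append("description reads as an intention, not a specific change")
    if not item.owner:
        found.append("no named owner")
    if item.due is None:
        found.append("no due date")
    if not item.ticket:
        found.append("not tracked in the backlog")
    return found
```

Run against the two examples in the text, "We should improve monitoring" with no other fields fails all four checks, while the alert item with an owner, a date, and a ticket returns an empty list.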
The staff engineer’s real contribution here is getting those items prioritized against feature work. A postmortem whose action items lose every sprint-planning fight has not made the system safer; it has documented a failure you will repeat. Tracking completion, and following up when items slip, is the unglamorous part that determines whether the whole exercise was worth anything.
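Following up on slipped items is easier when the list of what slipped is generated rather than remembered. This is a minimal sketch, assuming the action items can be exported from your tracker into something like the hypothetical `TrackedItem` records below; the names and fields are mine, not any particular tool's.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[TrackedItem], today: date) -> list[TrackedItem]:
    """Open items whose due date has passed, most overdue first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


def completion_rate(items: list[TrackedItem]) -> float:
    """Fraction of action items completed; 1.0 when there is nothing to do."""
    if not items:
        return 1.0
    return sum(item.done for item in items) / len(items)
```

An item due today is not counted as overdue. A completion rate that stays low across several postmortems is a useful signal in itself, since it suggests the fixes are losing the prioritization fight rather than failing one by one.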
When should you write a postmortem?
Write one for any incident that had real user impact, exceeded a severity threshold, or surprised you, even a near-miss that could have been severe. The trigger should be defined in advance (for example, any SEV-1 or SEV-2 incident, or any customer-visible outage) so the decision is not relitigated emotionally after each event. A clear, pre-agreed threshold is what keeps postmortems consistent rather than political.
Near-misses deserve special attention, and teams routinely skip them. An incident that almost took down production but was caught in time exposed the same systemic weakness as a full outage would have, minus the damage. Writing the postmortem anyway means you fix the weakness before it actually fires, which is the cheapest possible time to learn the lesson. A culture that only writes postmortems for incidents that already hurt is one that waits for the damage before it learns.
The flip side is not to overdo it: a postmortem for a trivial, well-understood blip with no systemic cause is busywork that dilutes the practice. Reserve the full ritual for incidents that actually have something to teach, define the threshold up front, and include near-misses that crossed it. The goal is consistent learning from events worth learning from, not a document for every hiccup.
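A pre-agreed threshold can be written down as a rule small enough that nobody argues about it after the fact. The sketch below encodes the example policy from this section (SEV-1, SEV-2, or customer-visible) plus near-misses that would have crossed the line. The `Incident` fields and the `potential_severity` idea are assumptions of mine, and "it surprised us" stays a human judgment that this rule does not try to capture.

```python
from dataclasses import dataclass
from typing import Optional

# The example threshold from the text; adjust to whatever your org agrees on.
POSTMORTEM_SEVERITIES = frozenset({"SEV-1", "SEV-2"})


@dataclass(frozen=True)
class Incident:
    severity: str                             # what actually happened
    customer_visible: bool
    potential_severity: Optional[str] = None  # for near-misses: what it could have been


def requires_postmortem(incident: Incident) -> bool:
    """Apply the pre-agreed threshold; no per-incident negotiation."""
    if incident.severity in POSTMORTEM_SEVERITIES:
        return True
    if incident.customer_visible:
        return True
    # Near-miss: no real damage, but it would have crossed the threshold.
    return incident.potential_severity in POSTMORTEM_SEVERITIES
```

Under this rule a quiet internal SEV-4 blip is skipped, while a SEV-4 that was one step away from a SEV-1 still gets written up.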
An incident postmortem checklist
Before you call a postmortem done:
- The analysis is blameless, focused on systemic conditions, not individuals.
- There is a factual, timestamped timeline everyone agrees on.
- Root cause goes past the trigger to a condition a system change can fix.
- Detection and response gaps are named (how long to detect, what slowed mitigation).
- Action items are specific, owned, dated, and entered into the real backlog.
- What went well is captured, so good instincts are reinforced.
- The action items are actually prioritized, and someone tracks them to completion.
What I’d do differently
The failure I have watched most is the postmortem that is procedurally perfect and practically useless: a thorough document, a serious meeting, and a list of action items that quietly die in the backlog while the team moves on. The ritual was performed; the system did not change; the incident recurs.
If I were running postmortems again, I would spend less energy on the document and more on two things: protecting blamelessness ruthlessly so the truth comes out, and treating action items as real, prioritized work with owners and follow-through. A short postmortem whose fixes actually ship beats a beautiful one whose fixes never do. The point was never the write-up; it was a system that fails less next quarter than it did this one.
Sources
- Google SRE Book, Postmortem Culture: Learning from Failure: sre.google/sre-book/postmortem-culture
- Google SRE Workbook, Example postmortem and templates: sre.google/workbook/postmortem-culture
- Atlassian, Incident postmortem guide: atlassian.com/incident-management/postmortem
Frequently asked questions
What is a blameless postmortem?
A blameless postmortem analyzes an incident by focusing on the systems and conditions that allowed it, not on blaming the person involved. The premise is that good engineers cause incidents inside bad systems, so fixing the system prevents recurrence while blaming the person only teaches people to hide problems.
What should an incident postmortem include?
A summary and impact, a timeline of what happened, the root cause, how it was detected and mitigated, and concrete action items with owners. The action items, owned and tracked to completion, are the part that actually prevents recurrence; everything else is context.
Why blameless? Doesn’t someone need to be accountable?
Accountability lives in fixing the system, not in punishing the person. Blame drives people to hide mistakes and withhold the honest detail a postmortem needs, which makes the next incident more likely. Blameless cultures surface more truth and fix more root causes.
How do you find the real root cause of an incident?
Keep asking why past the first technical trigger until you reach the systemic condition that allowed it. The bad deploy is rarely the root cause; why it was possible to deploy, and why nothing caught it, usually is. Stop when the cause is something a system change can address.