Postmortem
External gate · Document timeline, root cause, action items, and prevention measures
The terminal stage of the incident lifecycle: convert the incident into organizational learning. Investigate produced the diagnosis, resolve built the fix — postmortem tells the full story of what happened, how it was detected and handled, why it happened, and what concrete changes will reduce the likelihood or impact of the next incident in this class.
Scope
The learning artifact: the consolidated narrative, the detection-and-response analysis, the action items with owners, and the prevention measures. Postmortem decides what the organization takes away from this incident — not the diagnosis (investigate) or the fix (resolve), which it draws on. It is blameless by design: systemic gaps are the subject, not individuals.
What to do
- Write the full timeline — detection story, response story, root cause, contributing factors — with evidence cited.
- Keep the framing blameless; name the systemic gaps that allowed the failure, because naming people produces fear, not improvement.
- Extract concrete action items with named owners, priorities, and tracking references, and file them into the team's work-management system.
- Make prevention measures address the systemic gap, not just the single instance that failed.
What NOT to do
- Don't re-investigate or rebuild the fix — consume resolve's summary and the upstream artifacts.
- Don't assign individual blame; the subject is the system the humans operated inside.
- Don't leave action items as prose in the document with no owner or tracking reference.
- Don't let prevention measures patch only this instance while the underlying class stays open.
How the engine runs this stage
1. Elaborate
autonomous · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact: Postmortem
Blameless retrospective document covering the full incident lifecycle. This output is the final artifact; it exists for organizational learning and systemic improvement.
Content Guide
Structure the postmortem for both immediate stakeholders and future readers:
- Summary — one-paragraph description of what happened, who was affected, and how it was resolved
- Impact — quantified user and business impact (error counts, duration, revenue, SLA)
- Complete timeline — every key event from trigger to full resolution with timestamps and actors
- Root cause — clear explanation accessible to non-specialists
- Detection — how the incident was found and how long it took from trigger to detection
- Response — what went well, what went poorly, and where the response could improve
- Mitigation and resolution — what was done immediately and what was done permanently
- Action items — specific, owned, prioritized follow-ups with tracking references
- Prevention measures — systemic changes to prevent this class of incident, not just this specific one
- Lessons learned — insights for the team beyond the specific technical fix
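As a rough illustration of the structure above, the sketch below checks a draft for the ten listed sections. It assumes, hypothetically, that the draft is plain text with each section heading on its own line (optionally prefixed with markdown `#` markers). The function name and the matching rule are illustrative only and are not part of any engine API.

```python
# Section names taken from the Content Guide list above.
REQUIRED_SECTIONS = [
    "Summary", "Impact", "Complete timeline", "Root cause", "Detection",
    "Response", "Mitigation and resolution", "Action items",
    "Prevention measures", "Lessons learned",
]


def missing_sections(draft: str) -> list[str]:
    """Return the required section names that have no matching heading line."""
    # Normalise every line: trim whitespace and any leading '#' markers.
    headings = {
        line.strip().lstrip("#").strip().lower() for line in draft.splitlines()
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


# Hypothetical two-section draft, used only to show the call.
draft = "Summary\nCheckout requests failed for some users.\n## Impact\n(to be quantified)\n"
print(missing_sections(draft))
```

A check like this only confirms that the sections exist; whether each one is substantive is still a reviewer's judgment.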
Quality Signals
- Narrative is blameless — focuses on systems and processes, not individuals
- Action items are specific, owned, and tracked in the team's work management system
- Prevention measures address the class of failure, not just this instance
- Detection story is told honestly, including delays or lucky catches
- Document is useful to someone encountering a similar incident in the future
Phase guidance
phase override: ELABORATION
Postmortem Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Postmortem timeline includes every key event from trigger to resolution with timestamps and actors"
- "Each action item has an owner, priority, and due date — no unassigned items"
- "Prevention measures address systemic gaps, not just the specific failure that occurred"
Bad criteria — vague (no clear check)
- "Postmortem is written"
- "Action items are listed"
- "Lessons are documented"
Outputs produced
output template: Postmortem Document
Blameless incident narrative with timeline, action items, and prevention measures.
Expected Artifacts
- Blameless narrative -- complete timeline from trigger to resolution with all key events
- Impact assessment -- quantified user and business impact
- Action items -- specific, owned, prioritized, and tracked items
- Prevention measures -- systemic improvements, not just "don't do that again"
Quality Signals
- Timeline includes every key event with timestamps and actors
- Each action item has an owner, priority, and due date
- Prevention measures address systemic gaps, not just the specific failure
- Postmortem has been reviewed by stakeholders
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Actionability
Mandate: The agent MUST verify that the postmortem produces actionable, owned, and tracked improvements that address systemic gaps (not just the specific incident instance), and that the narrative supports those improvements with cited evidence.
Check
The agent MUST verify, filing feedback for any violation:
- Action items are specific and testable — Each item names a concrete deliverable that someone could execute without asking "what does this mean?" Vague items ("improve monitoring," "better runbooks") are findings.
- Action items are owned — Each item names an individual or clearly-scoped rotation, not "the team" or "TBD." Unowned items don't get done.
- Action items are tracked — Each item has a reference to the team's work-management system (ticket ID or URL). Postmortem-only items are forgotten.
- Prevention addresses systemic gaps — Action items target the class of failure, not just the specific instance that occurred. "Add a check for this specific value" alone is not systemic; "harden the input-validation contract for this surface class" is.
- Detection improvements present — If the incident was detected after a significant latency or was customer-reported, the action items include detection-improvement work.
- Timeline is accurate and complete — Every timeline entry has a timestamp and source. Gaps between events are explained or flagged.
- Blameless framing — The narrative does not name individuals as the cause; systemic conditions are the subject.
- Detection-and-response measures stated — Detection latency, coordination latency, response latency, and comms latency appear where measurable. These are the inputs to most prevention work.
- Priorities distinguish urgency — Action items are not flat-priority; some are P0/P1, others P2/P3, with reasoning implied by category.
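The latency measures named in the checks can be derived from a timestamped timeline. The sketch below is one possible way to do that. The event labels (`trigger`, `detected`, `declared`, `first_comms`, `mitigated`) and the mapping of each latency to a pair of events are assumptions made for illustration, since the text above does not define them. The timestamps are invented sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TimelineEvent:
    at: datetime
    kind: str    # hypothetical label such as "trigger" or "detected"
    source: str  # where the timestamp came from (log, alert, chat message)


def latency(events, start_kind: str, end_kind: str) -> Optional[timedelta]:
    """Time from the first start_kind event to the first end_kind event.

    Returns None when either event is absent, so the gap is visible
    rather than silently reported as zero.
    """
    first = {}
    for event in sorted(events, key=lambda e: e.at):
        first.setdefault(event.kind, event.at)
    if start_kind not in first or end_kind not in first:
        return None
    return first[end_kind] - first[start_kind]


# Invented sample timeline, deliberately out of order.
events = [
    TimelineEvent(datetime(2024, 1, 1, 10, 12), "detected", "alert"),
    TimelineEvent(datetime(2024, 1, 1, 10, 0), "trigger", "deploy log"),
    TimelineEvent(datetime(2024, 1, 1, 10, 20), "declared", "incident channel"),
    TimelineEvent(datetime(2024, 1, 1, 10, 35), "first_comms", "status page"),
]

print("detection latency:", latency(events, "trigger", "detected"))
print("coordination latency:", latency(events, "detected", "declared"))
print("comms latency:", latency(events, "declared", "first_comms"))
print("response latency:", latency(events, "declared", "mitigated"))
```

Returning `None` for a missing event matches the "where measurable" wording: an unmeasurable latency is reported as a gap, not as a number.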
Common failure modes to look for
- Action items that read "improve X" with no concrete deliverable
- Action items without owners, or with "the team" as owner
- Action items not filed in any tracker — they live only in the document
- A postmortem with 25 P1 action items (functionally no priority)
- Root cause framed as "human error" with no analysis of the systemic conditions that allowed the error to reach production
- Timeline that jumps from "first anomaly" to "incident declared" with nothing in between
- Action items target the specific failing value or path but not the class of defect
- No action item addresses the detection gap when detection was clearly delayed
- Individuals named as the cause; the postmortem reads as accountability rather than learning
- Lessons section restates the timeline without naming what was learned or what will change
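Several of the failure modes above are mechanical enough to check in code before a human reads the document. The following is a minimal sketch under stated assumptions: the `ActionItem` fields, the placeholder-owner list, and the ticket reference `OPS-1` are hypothetical, and the flat-priority rule is one simple reading of "functionally no priority".

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionItem:
    id: str
    action: str
    owner: str
    priority: str
    tracking_ref: str


# Owner values treated as "unowned" in this sketch (an assumption).
PLACEHOLDER_OWNERS = {"", "tbd", "team", "the team"}


def lint(items: list[ActionItem]) -> list[str]:
    """Return one finding per mechanical violation found in the items."""
    findings = []
    for item in items:
        if item.owner.strip().lower() in PLACEHOLDER_OWNERS:
            findings.append(f"{item.id}: no individual owner")
        if not item.tracking_ref.strip():
            findings.append(f"{item.id}: no tracking reference")
    # Every item sharing one priority means priority carries no information.
    priorities = Counter(item.priority for item in items)
    if len(items) > 1 and len(priorities) == 1:
        only = next(iter(priorities))
        findings.append(f"all {len(items)} items are {only}: flat priority")
    return findings


items = [
    ActionItem("AI-1", "Add p99-latency alert on /api/checkout", "TBD", "P1", "OPS-1"),
    ActionItem("AI-2", "Audit input-validation contract", "alice", "P1", ""),
]
for finding in lint(items):
    print(finding)
```

Checks such as blameless framing or whether prevention targets the class of failure remain judgment calls and are not covered here.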
3. Execute
per-unit baton · Postmortem Author → Action Item Tracker → Verifier
hat 1: Action Item Tracker
Focus: Extract concrete, actionable follow-up items from the postmortem narrative and ensure each one has a named owner, a priority, and a tracking reference in the team's existing work-management system. Action items without owners are wishes. Action items that live only in the postmortem document are forgotten. Your job is to convert the postmortem's "what should change" into commitments that actually get done.
Process
1. Read the narrative for improvement gaps
Walk the postmortem-author's sections — detection, response, root cause, contributing factors, prevention — and for each gap or finding the narrative names, identify the concrete action that addresses it. Categories you should expect to find:
- Detection improvements — closing alerting gaps revealed by the detection latency (new alert, threshold change, new monitor, new dashboard)
- Response improvements — closing coordination or mitigation latency (new runbook, on-call training, role-assignment automation, escalation tweak)
- Root-cause remediation — work beyond the resolve-stage fix that addresses the class (architectural change, additional surfaces with the same defect class, test-suite gap)
- Tooling improvements — gaps in observability, deploy tooling, mitigation tooling, runbook tooling
- Process improvements — gaps in incident-process itself (severity-classification ambiguity, comms-cadence rule changes, postmortem-process changes)
2. Make each action item specific and testable
A vague action item is functionally a wish. Apply the same rigor as acceptance criteria:
- Vague (reject): "Improve monitoring for the checkout service"
- Specific (accept): "Add a p99-latency alert on the /api/checkout endpoint with threshold 500ms and the standard escalation path"
- Vague (reject): "Better runbooks"
- Specific (accept): "Write a runbook entry for connection-pool-exhaustion symptoms covering: detection signals, diagnosis steps, and the rollback command"
- Vague (reject): "Review error handling"
- Specific (accept): "Audit the input-validation contract on the four endpoints in the order-service that accept user-supplied IDs; file a fix unit per missing validator"
The test is: could a person on the team execute this action without coming back to ask what was meant?
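A crude first-pass filter for vague items can be sketched as below. This is a heuristic only, not the test described above: it flags actions that open with a vague verb, using an opener list that is partly drawn from the examples ("improve", "better", "review") and partly assumed ("enhance", "look into"). It will misfire on a specific item that happens to start with "Review", so treat a hit as a prompt to reread the item, not as a verdict.

```python
# Openers that tend to signal a wish rather than a deliverable.
# "enhance" and "look into" are assumptions added for illustration.
VAGUE_OPENERS = ("improve", "better", "review", "enhance", "look into")


def looks_vague(action: str) -> bool:
    """True when the action text opens with one of the vague openers."""
    return action.strip().lower().startswith(VAGUE_OPENERS)


examples = [
    "Improve monitoring for the checkout service",
    "Add a p99-latency alert on the /api/checkout endpoint with threshold 500ms",
    "Better runbooks",
    "Write a runbook entry for connection-pool-exhaustion symptoms",
]
for text in examples:
    print("reject?" if looks_vague(text) else "ok     ", text)
```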
3. Assign an owner
Every action item names an individual owner (or a clearly-scoped rotation slot, not "the team"). The owner is the person responsible for either doing the work or routing it to someone who will. Items without owners are a finding — push back to the IC or the postmortem-author rather than accepting them.
If the right owner is unclear, list the most-likely owning team and flag the item as needing an owner assignment within a stated window. Unassigned items that drift past that window become postmortem debt.
4. Assign priority
Priority distinguishes "do this before the next on-call rotation" from "include this in the next quarter's planning." Use the team's existing priority scheme; common shape:
- P0 — do before declaring the incident fully closed (typically the mitigation cleanup, sometimes a critical monitoring gap)
- P1 — do within the immediate work cycle following the postmortem
- P2 — schedule into the team's standard planning
- P3 — track for future planning rounds
Avoid filing everything as P1 — a postmortem that creates 25 P1 items will result in zero P1 items getting done.
5. File into the work-management system
The action item must exist in the team's actual work-management system (ticket tracker, planning tool, whatever the team uses) — not just in the postmortem document. Record the tracking reference (ticket ID, URL) next to the action item in the document so anyone reading the postmortem can follow the work.
For action items that span multiple tickets (e.g., a multi-surface remediation), file an epic / parent ticket and list the child tickets, or at least name the breakdown so the work doesn't get lost when the postmortem is closed.
6. Limit the count
A postmortem with 40 action items produces 0 completed action items because nothing gets prioritized. Aim for the smallest set that addresses the systemic gaps. If the narrative implies more work than that, group related items into themed initiatives rather than fragmenting them into dozens of micro-tickets.
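One way to picture the grouping step is shown below. The sketch collapses items into one themed initiative per category once the list grows past a limit. The `max_items` limit is an assumption supplied by the caller, since the text above gives no number, and grouping by category is just one possible definition of a theme.

```python
from collections import defaultdict


def group_into_themes(items, max_items: int):
    """Group (category, action) pairs into themes when there are too many.

    Returns a list of (category, [actions]) tuples. Below the limit each
    item stays on its own; above it, items sharing a category are merged.
    """
    if len(items) <= max_items:
        return [(category, [action]) for category, action in items]
    themes = defaultdict(list)
    for category, action in items:
        themes[category].append(action)
    return list(themes.items())


items = [
    ("Detection", "Add p99-latency alert on /api/checkout"),
    ("Detection", "Add dashboard for connection-pool saturation"),
    ("Process", "Clarify SEV-2-vs-SEV-1 boundary"),
]
for category, actions in group_into_themes(items, max_items=2):
    print(category, "->", actions)
```

Each resulting theme would still need its own owner, priority, and tracking reference, typically as a parent ticket with the merged actions as children.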
Format guidance
Append an action-item table to the postmortem with this shape:
| ID | Category | Action (specific, testable) | Owner | Priority | Tracking ref |
|---|---|---|---|---|---|
| AI-1 | Detection | Add p99-latency alert on /api/checkout with 500ms threshold | name | P1 | ticket-ref |
| AI-2 | Root-cause | Audit input-validation contract on order-service ID-accepting endpoints | name | P1 | ticket-ref |
| AI-3 | Process | Update severity-classification doc to clarify SEV-2-vs-SEV-1 boundary at <1% impact threshold | name | P2 | ticket-ref |
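If action items are held as structured records, the table above can be generated rather than hand-edited. The sketch below assumes a hypothetical `ActionItem` record whose fields mirror the six columns. The column headers are copied from the table shape above.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    id: str
    category: str
    action: str
    owner: str
    priority: str
    tracking_ref: str


HEADER = "| ID | Category | Action (specific, testable) | Owner | Priority | Tracking ref |"
DIVIDER = "|---|---|---|---|---|---|"


def render_table(items: list[ActionItem]) -> str:
    """Render action items as a pipe-delimited table, one row per item."""
    rows = [HEADER, DIVIDER]
    for item in items:
        cells = [item.id, item.category, item.action,
                 item.owner, item.priority, item.tracking_ref]
        # Escape literal pipes so a cell cannot break the row layout.
        rows.append("| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |")
    return "\n".join(rows)


print(render_table([
    ActionItem("AI-1", "Detection",
               "Add p99-latency alert on /api/checkout with 500ms threshold",
               "name", "P1", "ticket-ref"),
]))
```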
Anti-patterns (RFC 2119)
- The agent MUST NOT create action items without owners — unowned items don't get done
- The agent MUST NOT list vague actions like "improve monitoring" or "better runbooks" instead of specific ones with named surfaces, thresholds, or procedures
- The agent MUST distinguish quick wins (P0/P1) from systemic improvements (P2/P3); flat priority is no priority
- The agent MUST file action items in the team's existing work-management system; postmortem-only action items are forgotten
- The agent MUST NOT create so many action items that none get prioritized — group themes rather than fragmenting
- The agent MUST include action items targeting the detection gap (if there was one), not only the root-cause fix
- The agent MUST NOT include action items that just restate the resolve stage's work — the permanent fix is already tracked
- The agent MUST push back when the postmortem-author surfaces a gap with no concrete action implied; gaps without actions are findings, not deliverables
- The agent MUST record a tracking reference (ticket ID or URL) next to each action item; "filed in the tracker" without a reference is not filed
hat 3: Verifier
Focus: Validate the per-unit knowledge artifact for the postmortem stage of incident-response. Units here are postmortem sections: knowledge artifacts that downstream stages consume. Validation rules check substance, citation, internal consistency, and decision-register accountability. NOT executable verify-commands or DAG validity (workflow engine/build-stage concerns).
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
- The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check; those are workflow engine responsibilities.
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST name a specific failed criterion in any rejection.
- The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
What you check (BODY ONLY)
1. Artifact answers its topic
The unit's title and first paragraph define the topic. The remaining body MUST deliver substantive content on that topic. Reject placeholders, content-free outlines, or redirects.
2. Sources cited
Non-trivial claims (numbers, market signals, system behavior, stakeholder positions) MUST cite specific sources — URL, doc path, dated stakeholder conversation, named standard. Reject "industry common knowledge" or unsourced numerical claims.
3. Internal consistency
Title, mission, and body must align. Numerical/categorical claims must be consistent across the body. Recommendations must follow from the evidence presented.
4. Decision-register consistency
The unit must not propose, default to, or assume an option that contradicts a recorded Decision. Cite the Decision ID in any rejection.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted with veto-style approval, OR flagged (needs human escalation).
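The mechanical part of the body checks (TODO markers and empty sections) can be sketched as follows. This assumes, hypothetically, that a unit body uses markdown-style `#` headings; the function name and finding strings are illustrative. Citation quality, internal consistency, and decision-register checks need a reader and are not attempted here.

```python
import re


def body_findings(body: str) -> list[str]:
    """Return findings for TODO markers and sections with no content."""
    findings = []
    if re.search(r"\bTODO\b", body):
        findings.append("body contains TODO markers")
    # With one capture group, re.split returns
    # [preamble, title1, text1, title2, text2, ...].
    parts = re.split(r"^#+[ \t]+(.*)$", body, flags=re.MULTILINE)
    for title, text in zip(parts[1::2], parts[2::2]):
        if not text.strip():
            findings.append(f"empty section: {title.strip()}")
    return findings


body = (
    "# Detection analysis\n"
    "The alert fired after the first customer report.\n"
    "## Open Questions\n"
    "\n"
    "## Sources\n"
    "TODO add links\n"
)
print(body_findings(body))
```

Note that a parent heading followed directly by a subheading is reported as empty by this sketch, which may or may not be what a given team wants.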
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here, now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Actionability
Mandate: The agent MUST verify that the postmortem produces actionable, owned, and tracked improvements that address systemic gaps (not just the specific incident instance), and that the narrative supports those improvements with cited evidence.
Check
The agent MUST verify, filing feedback for any violation:
- Action items are specific and testable — Each item names a concrete deliverable that someone could execute without asking "what does this mean?" Vague items ("improve monitoring," "better runbooks") are findings.
- Action items are owned — Each item names an individual or clearly-scoped rotation, not "the team" or "TBD." Unowned items don't get done.
- Action items are tracked — Each item has a reference to the team's work-management system (ticket ID or URL). Postmortem-only items are forgotten.
- Prevention addresses systemic gaps — Action items target the class of failure, not just the specific instance that occurred. "Add a check for this specific value" alone is not systemic; "harden the input-validation contract for this surface class" is.
- Detection improvements present — If the incident was detected after a significant latency or was customer-reported, the action items include detection-improvement work.
- Timeline is accurate and complete — Every timeline entry has a timestamp and source. Gaps between events are explained or flagged.
- Blameless framing — The narrative does not name individuals as the cause; systemic conditions are the subject.
- Detection-and-response measures stated — Detection latency, coordination latency, response latency, and comms latency appear where measurable. These are the inputs to most prevention work.
- Priorities distinguish urgency — Action items are not flat-priority; some are P0/P1, others P2/P3, with reasoning implied by category.
Common failure modes to look for
- Action items that read "improve X" with no concrete deliverable
- Action items without owners, or with "the team" as owner
- Action items not filed in any tracker — they live only in the document
- A postmortem with 25 P1 action items (functionally no priority)
- Root cause framed as "human error" with no analysis of the systemic conditions that allowed the error to reach production
- Timeline that jumps from "first anomaly" to "incident declared" with nothing in between
- Action items target the specific failing value or path but not the class of defect
- No action item addresses the detection gap when detection was clearly delayed
- Individuals named as the cause; the postmortem reads as accountability rather than learning
- Lessons section restates the timeline without naming what was learned or what will change
5. Gate
controls advancement to the next stage
Blocks until an external system (GitHub/GitLab) signals approval, usually via branch merge.
Fix loop
a separate track · Classifier → Postmortem Author → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain (one hat at a time, per finding), then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation; the tool returns severity_already_set and you simply advance. Only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop has started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
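The two default rules in the steps above (origin to cleared approval roles, and severity-first dispatch) can be written down compactly. The sketch below is an illustration only: the function names, the `filer_agent` parameter, and the `FB-…` identifiers are hypothetical, and it does not call or reproduce any of the haiku_* tools. Keeping input order within a severity rank is a property of this sketch, not a documented engine behavior.

```python
from typing import Optional

# Lower rank is dispatched first; names come from the severity rubric above.
SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_invalidates(origin: str, filer_agent: Optional[str] = None) -> list[str]:
    """Map a feedback origin to the approval roles cleared on closure."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # the human re-reviews; drift always escalates
    if origin in {"adversarial-review", "studio-review"}:
        if filer_agent is None:
            raise ValueError("review-origin feedback needs the filer agent name")
        return [filer_agent]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings):
    """Sort (feedback_id, severity) pairs so higher severity comes first."""
    return sorted(findings, key=lambda finding: SEVERITY_RANK[finding[1]])


print(default_invalidates("studio-review", filer_agent="actionability"))
print(dispatch_order([("FB-1", "low"), ("FB-2", "blocker"), ("FB-3", "high")]))
```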
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject; that marks the finding invalid, and you can't reject a valid finding. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6; that's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence; the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
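The whole-set rule can be expressed as a small decision function. In this sketch the caller supplies `still_defective`, a hypothetical callback that re-checks a single enumerated item against what is on disk. The return values reuse the names `advance_hat` and `reject_hat` from the text above as plain strings and do not invoke any tool.

```python
def survivors(enumerated, still_defective):
    """Return every enumerated item that is still defective.

    The check runs over the finding's full enumerated set, not over the
    subset the fixer reported touching.
    """
    return [item for item in enumerated if still_defective(item)]


def closure_decision(enumerated, still_defective):
    """Close only when no enumerated item survives; otherwise name survivors."""
    open_items = survivors(enumerated, still_defective)
    if open_items:
        return ("reject_hat", open_items)
    return ("advance_hat", [])


# Illustrative data: a finding enumerating 8 rows, of which the fix corrected 3.
enumerated = [f"row-{n}" for n in range(1, 9)]
fixed = {"row-1", "row-2", "row-3"}
print(closure_decision(enumerated, lambda row: row not in fixed))
```

The list of survivors is what the rejection message would name, so the next fixer knows exactly which items remain open.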
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing"; that is the human's decision, not yours. Either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched; spot-check the untouched items on disk first. Survivors mean reject_hat