Incident Response · stage 5 of 5

Postmortem

External gate

Document timeline, root cause, action items, and prevention measures

Postmortem

The terminal stage of the incident lifecycle: convert the incident into organizational learning. Investigate produced the diagnosis, resolve built the fix — postmortem tells the full story of what happened, how it was detected and handled, why it happened, and what concrete changes will reduce the likelihood or impact of the next incident in this class.

Scope

The learning artifact: the consolidated narrative, the detection-and-response analysis, the action items with owners, and the prevention measures. Postmortem decides what the organization takes away from this incident — not the diagnosis (investigate) or the fix (resolve), which it draws on. It is blameless by design: systemic gaps are the subject, not individuals.

What to do

  • Write the full timeline — detection story, response story, root cause, contributing factors — with evidence cited.
  • Keep the framing blameless; name the systemic gaps that allowed the failure, because naming people produces fear, not improvement.
  • Extract concrete action items with named owners, priorities, and tracking references, and file them into the team's work-management system.
  • Make prevention measures address the systemic gap, not just the single instance that failed.

What NOT to do

  • Don't re-investigate or rebuild the fix — consume resolve's summary and the upstream artifacts.
  • Don't assign individual blame; the subject is the system the humans operated inside.
  • Don't leave action items as prose in the document with no owner or tracking reference.
  • Don't let prevention measures patch only this instance while the underlying class stays open.

How the engine runs this stage

1 · Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Discovery fan-out

knowledge artifact · Postmortem

Blameless retrospective document covering the full incident lifecycle. This output is the final artifact — it exists for organizational learning and systemic improvement.

Content Guide

Structure the postmortem for both immediate stakeholders and future readers:

  • Summary — one-paragraph description of what happened, who was affected, and how it was resolved
  • Impact — quantified user and business impact (error counts, duration, revenue, SLA)
  • Complete timeline — every key event from trigger to full resolution with timestamps and actors
  • Root cause — clear explanation accessible to non-specialists
  • Detection — how the incident was found and how long it took from trigger to detection
  • Response — what went well, what went poorly, and where the response could improve
  • Mitigation and resolution — what was done immediately and what was done permanently
  • Action items — specific, owned, prioritized follow-ups with tracking references
  • Prevention measures — systemic changes to prevent this class of incident, not just this specific one
  • Lessons learned — insights for the team beyond the specific technical fix

Quality Signals

  • Narrative is blameless — focuses on systems and processes, not individuals
  • Action items are specific, owned, and tracked in the team's work-management system
  • Prevention measures address the class of failure, not just this instance
  • Detection story is told honestly, including delays or lucky catches
  • Document is useful to someone encountering a similar incident in the future

Phase guidance

phase override · ELABORATION

Postmortem Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

  • "Postmortem timeline includes every key event from trigger to resolution with timestamps and actors"
  • "Each action item has an owner, priority, and due date — no unassigned items"
  • "Prevention measures address systemic gaps, not just the specific failure that occurred"

Bad criteria — vague (no clear check)

  • "Postmortem is written"
  • "Action items are listed"
  • "Lessons are documented"

Outputs produced

output template · Postmortem Document

Blameless incident narrative with timeline, action items, and prevention measures.

Expected Artifacts

  • Blameless narrative -- complete timeline from trigger to resolution with all key events
  • Impact assessment -- quantified user and business impact
  • Action items -- specific, owned, prioritized, and tracked items
  • Prevention measures -- systemic improvements, not just "don't do that again"

Quality Signals

  • Timeline includes every key event with timestamps and actors
  • Each action item has an owner, priority, and due date
  • Prevention measures address systemic gaps, not just the specific failure
  • Postmortem has been reviewed by stakeholders

2 · Review

pre-execute · agents audit the planned spec before any code lands
review agent · Actionability

Mandate: The agent MUST verify that the postmortem produces actionable, owned, and tracked improvements that address systemic gaps (not just the specific incident instance), and that the narrative supports those improvements with cited evidence.

Check

The agent MUST verify, filing feedback for any violation:

  • Action items are specific and testable — Each item names a concrete deliverable that someone could execute without asking "what does this mean?" Vague items ("improve monitoring," "better runbooks") are findings.
  • Action items are owned — Each item names an individual or clearly-scoped rotation, not "the team" or "TBD." Unowned items don't get done.
  • Action items are tracked — Each item has a reference to the team's work-management system (ticket ID or URL). Postmortem-only items are forgotten.
  • Prevention addresses systemic gaps — Action items target the class of failure, not just the specific instance that occurred. "Add a check for this specific value" alone is not systemic; "harden the input-validation contract for this surface class" is.
  • Detection improvements present — If the incident was detected after a significant latency or was customer-reported, the action items include detection-improvement work.
  • Timeline is accurate and complete — Every timeline entry has a timestamp and source. Gaps between events are explained or flagged.
  • Blameless framing — The narrative does not name individuals as the cause; systemic conditions are the subject.
  • Detection-and-response measures stated — Detection latency, coordination latency, response latency, and comms latency appear where measurable. These are the inputs to most prevention work.
  • Priorities distinguish urgency — Action items are not flat-priority; some are P0/P1, others P2/P3, with reasoning implied by category.

Common failure modes to look for

  • Action items that read "improve X" with no concrete deliverable
  • Action items without owners, or with "the team" as owner
  • Action items not filed in any tracker — they live only in the document
  • A postmortem with 25 P1 action items (functionally no priority)
  • Root cause framed as "human error" with no analysis of the systemic conditions that allowed the error to reach production
  • Timeline that jumps from "first anomaly" to "incident declared" with nothing in between
  • Action items target the specific failing value or path but not the class of defect
  • No action item addresses the detection gap when detection was clearly delayed
  • Individuals named as the cause; the postmortem reads as accountability rather than learning
  • Lessons section restates the timeline without naming what was learned or what will change

3 · Execute

per-unit baton · Postmortem Author → Action Item Tracker → Verifier
hat 1 · Action Item Tracker

Focus: Extract concrete, actionable follow-up items from the postmortem narrative and ensure each one has a named owner, a priority, and a tracking reference in the team's existing work-management system. Action items without owners are wishes. Action items that live only in the postmortem document are forgotten. Your job is to convert the postmortem's "what should change" into commitments that actually get done.

Process

1. Read the narrative for improvement gaps

Walk the postmortem-author's sections — detection, response, root cause, contributing factors, prevention — and for each gap or finding the narrative names, identify the concrete action that addresses it. Categories you should expect to find:

  • Detection improvements — closing alerting gaps revealed by the detection latency (new alert, threshold change, new monitor, new dashboard)
  • Response improvements — closing coordination or mitigation latency (new runbook, on-call training, role-assignment automation, escalation tweak)
  • Root-cause remediation — work beyond the resolve-stage fix that addresses the class (architectural change, additional surfaces with the same defect class, test-suite gap)
  • Tooling improvements — gaps in observability, deploy tooling, mitigation tooling, runbook tooling
  • Process improvements — gaps in incident-process itself (severity-classification ambiguity, comms-cadence rule changes, postmortem-process changes)

2. Make each action item specific and testable

A vague action item is functionally a wish. Apply the same rigor as acceptance criteria:

  • Vague (reject): "Improve monitoring for the checkout service"

  • Specific (accept): "Add a p99-latency alert on the /api/checkout endpoint with threshold 500ms and the standard escalation path"

  • Vague (reject): "Better runbooks"

  • Specific (accept): "Write a runbook entry for connection-pool-exhaustion symptoms covering: detection signals, diagnosis steps, and the rollback command"

  • Vague (reject): "Review error handling"

  • Specific (accept): "Audit the input-validation contract on the four endpoints in the order-service that accept user-supplied IDs; file a fix unit per missing validator"

The test is: could a person on the team execute this action without coming back to ask what was meant?

3. Assign an owner

Every action item names an individual owner (or a clearly-scoped rotation slot, not "the team"). The owner is the person responsible for either doing the work or routing it to someone who will. Items without owners are a finding — push back to the IC or the postmortem-author rather than accepting them.

If the right owner is unclear, list the most-likely owning team and flag the item as needing an owner assignment within a stated window. Unassigned items that drift past that window become postmortem debt.

4. Assign priority

Priority distinguishes "do this before the next on-call rotation" from "include this in the next quarter's planning." Use the team's existing priority scheme; common shape:

  • P0 — do before declaring the incident fully closed (typically the mitigation cleanup, sometimes a critical monitoring gap)
  • P1 — do within the immediate work cycle following the postmortem
  • P2 — schedule into the team's standard planning
  • P3 — track for future planning rounds

Avoid filing everything as P1 — a postmortem that creates 25 P1 items will result in zero P1 items getting done.
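The flat-priority warning above can be turned into a quick check. The sketch below is only an illustration, not part of the engine: the function name `is_flat_priority`, the three-item minimum, and the 50% ceiling on P0/P1 items are all assumptions a team would tune to its own priority scheme.

```python
from collections import Counter

def is_flat_priority(priorities, urgent=("P0", "P1"), max_urgent_share=0.5):
    """Return True when a list of priorities carries no real urgency signal.

    Hypothetical heuristic: with fewer than three items there is too little
    to judge. Otherwise the list is "flat" if every item shares one priority,
    or if the urgent tiers exceed `max_urgent_share` of all items.
    """
    if len(priorities) < 3:
        return False
    counts = Counter(priorities)
    if len(counts) == 1:
        return True
    urgent_count = sum(counts[p] for p in urgent)
    return urgent_count / len(priorities) > max_urgent_share
```

Under these assumptions, a postmortem with 25 P1 items is flagged, while a mix such as one P0, one P1, two P2 and one P3 passes.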

5. File into the work-management system

The action item must exist in the team's actual work-management system (ticket tracker, planning tool, whatever the team uses) — not just in the postmortem document. Record the tracking reference (ticket ID, URL) next to the action item in the document so anyone reading the postmortem can follow the work.

For action items that span multiple tickets (e.g., a multi-surface remediation), file an epic / parent ticket and list the child tickets, or at least name the breakdown so the work doesn't get lost when the postmortem is closed.

6. Limit the count

A postmortem with 40 action items produces 0 completed action items because nothing gets prioritized. Aim for the smallest set that addresses the systemic gaps. If the narrative implies more work than that, group related items into themed initiatives rather than fragmenting them into dozens of micro-tickets.

Format guidance

Append an action-item table to the postmortem with this shape:

| ID   | Category   | Action (specific, testable)                                                                      | Owner | Priority | Tracking ref |
| AI-1 | Detection  | Add p99-latency alert on /api/checkout with 500ms threshold                                      | name  | P1       | ticket-ref   |
| AI-2 | Root-cause | Audit input-validation contract on order-service ID-accepting endpoints                          | name  | P1       | ticket-ref   |
| AI-3 | Process    | Update severity-classification doc to clarify SEV-2-vs-SEV-1 boundary at <1% impact threshold    | name  | P2       | ticket-ref   |
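One row of this table can be modelled as a small record with a mechanical check for the owner, priority, and tracking-reference rules. This is a minimal sketch under stated assumptions: the `ActionItem` class, the `findings` function, the placeholder-owner list, and the sample values `payments on-call rotation` and `TICKET-1` are hypothetical, and whether an action is specific enough still needs a human or agent judgment.

```python
from dataclasses import dataclass

# Owner values treated as "no owner at all" (assumed list, extend as needed).
PLACEHOLDER_OWNERS = {"", "tbd", "the team", "team"}
VALID_PRIORITIES = {"P0", "P1", "P2", "P3"}

@dataclass
class ActionItem:
    id: str
    category: str
    action: str
    owner: str
    priority: str
    tracking_ref: str

def findings(item):
    """Return a list of problems with one action item (empty list = passes)."""
    problems = []
    if item.owner.strip().lower() in PLACEHOLDER_OWNERS:
        problems.append("no named owner")
    if item.priority not in VALID_PRIORITIES:
        problems.append("priority outside P0-P3")
    if not item.tracking_ref.strip():
        problems.append("no tracking reference")
    return problems

unowned = ActionItem("AI-1", "Detection",
                     "Add p99-latency alert on /api/checkout with 500ms threshold",
                     "the team", "P1", "")
print(findings(unowned))  # ['no named owner', 'no tracking reference']
```

Returning a list of problems rather than a boolean lets each violation be filed as its own finding.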

Anti-patterns (RFC 2119)

  • The agent MUST NOT create action items without owners — unowned items don't get done
  • The agent MUST NOT list vague actions like "improve monitoring" or "better runbooks" instead of specific ones with named surfaces, thresholds, or procedures
  • The agent MUST distinguish quick wins (P0/P1) from systemic improvements (P2/P3); flat priority is no priority
  • The agent MUST file every action item in the team's existing work-management system; postmortem-only action items are forgotten
  • The agent MUST NOT create so many action items that none get prioritized — group themes rather than fragmenting
  • The agent MUST include action items targeting the detection gap (if there was one), not only the root-cause fix
  • The agent MUST NOT include action items that just restate the resolve stage's work — the permanent fix is already tracked
  • The agent MUST push back when the postmortem-author surfaces a gap with no concrete action implied; gaps without actions are findings, not deliverables
  • The agent MUST record a tracking reference (ticket ID or URL) next to each action item; "filed in the tracker" without a reference is not filed
hat 2 · Postmortem Author

Focus: Write a blameless postmortem that turns the incident into organizational learning. The narrative tells the full story — detection, response, root cause, contributing factors, prevention — in a way that someone who wasn't on the call can understand and that someone on the next on-call rotation can learn from. The postmortem is for learning, not accountability. Naming individuals as the cause is a documented anti-pattern; systemic gaps are the subject.

Process

1. Establish the blameless frame

Before writing, internalize the blameless lens: every action a human took during the incident was the locally rational choice given what they knew at that moment. The postmortem describes the systemic conditions (alerting gaps, knowledge gaps, tooling gaps, process gaps) that made the locally rational choice produce a bad outcome — not the human who made the choice.

Practical language patterns:

  • Write actions in passive or system-attributed voice ("the deploy was rolled back," "the alert routed to the on-call rotation") rather than naming the individual unless their role is the salient detail
  • Where naming a role is needed, use the role ("the on-call engineer," "the IC") rather than the person
  • Frame mistakes as system findings, not personal findings: "the team did not have a runbook for this failure mode" rather than "the engineer didn't know what to do"

2. Write the timeline

The timeline is the spine of the postmortem. Reconstruct it from the cited evidence in the investigate and mitigate artifacts. Every entry has a timestamp, a source, and a one-line description of what changed in the system or what the response did:

T+00:00  First anomaly:        error rate on /api/checkout crossed warning threshold (source: observability platform)
T+02:14  Alert fired:           paging rotation paged the on-call (source: paging system)
T+03:47  IC declared:           SEV-2 declared, scribe and comms lead assigned (source: incident channel)
T+05:22  First mitigation:      deploy X-123 rolled back (source: mitigation log)
T+09:01  Recovery confirmed:    error rate back below warning threshold for 5+ minutes (source: observability platform)
T+15:30  Customer comms:        status page updated to "resolved" (source: status page)

Do not skip the "boring" parts between events. A 12-minute gap between detection and IC declaration is itself a finding; if you compress it, the action items downstream will miss the response-time improvement work.
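Unexplained gaps in a timeline written in the format above can be surfaced mechanically. The sketch below assumes the offsets are minutes and seconds (`T+MM:SS`), which the source does not state; the function names and the five-minute threshold are also hypothetical.

```python
import re

# Matches lines shaped like: "T+02:14  Alert fired:   description (source: ...)"
ENTRY = re.compile(r"^T\+(\d+):(\d{2})\s+(.+?):\s+(.*)$")

def parse_timeline(text):
    """Parse timeline lines into (offset_seconds, label, description) tuples."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            minutes, seconds, label, description = match.groups()
            entries.append((int(minutes) * 60 + int(seconds), label, description))
    return entries

def long_gaps(entries, threshold_seconds=300):
    """Return (from_label, to_label, gap_seconds) for gaps above the threshold."""
    gaps = []
    for (t0, start, _), (t1, end, _) in zip(entries, entries[1:]):
        if t1 - t0 > threshold_seconds:
            gaps.append((start, end, t1 - t0))
    return gaps
```

Each gap this returns is a prompt for the author: either explain what happened in that interval or record it as a finding.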

3. Tell the detection story

How was the incident found? Was it caught by alerting, by a customer report, or by an engineer noticing something wrong on a dashboard? What was the gap between the first anomaly the system experienced and the moment a human became aware? That gap is the detection latency (averaged across incidents, it is the metric usually reported as MTTD, mean time to detect), and it is one of the highest-leverage improvement targets: a fix that closes the alerting gap helps every future incident in this class, not just this one.

If the detection was driven by a customer report rather than internal alerting, name that explicitly. It's a finding, and it should produce monitoring action items.

4. Tell the response story

Walk the response: who paged whom, how long until the IC declared, how quickly the right roles were assigned, what mitigations were attempted (including the ones that didn't work), how long until recovery was confirmed. Cite the mitigation log for specific actions and the incident channel for coordination decisions.

Response time has several useful sub-measures:

  • Time from detection to IC declared (coordination latency)
  • Time from declaration to first mitigation applied (response latency)
  • Time from first mitigation to recovery confirmed (mitigation latency)
  • Time from recovery to customer-facing communication (comms latency)

Name the ones that are notably long; they're action-item inputs.
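As a worked example, the sub-measures can be computed from the sample timeline in step 2. This sketch assumes `T+MM:SS` offsets and treats the moment the alert paged the on-call as the detection point; the event keys and function names are hypothetical.

```python
def to_seconds(offset):
    """Convert a 'T+MM:SS' offset into seconds."""
    minutes, seconds = offset.replace("T+", "").split(":")
    return int(minutes) * 60 + int(seconds)

def latencies(events):
    """Map event offsets to the latency measures, in seconds."""
    s = {name: to_seconds(value) for name, value in events.items()}
    return {
        "detection": s["alert_fired"] - s["first_anomaly"],
        "coordination": s["ic_declared"] - s["alert_fired"],
        "response": s["first_mitigation"] - s["ic_declared"],
        "mitigation": s["recovery_confirmed"] - s["first_mitigation"],
        "comms": s["customer_comms"] - s["recovery_confirmed"],
    }

# Offsets copied from the sample timeline in step 2.
example = {
    "first_anomaly": "T+00:00",
    "alert_fired": "T+02:14",
    "ic_declared": "T+03:47",
    "first_mitigation": "T+05:22",
    "recovery_confirmed": "T+09:01",
    "customer_comms": "T+15:30",
}

for name, secs in latencies(example).items():
    print(name, f"{secs // 60}:{secs % 60:02d}")
```

For the sample timeline this gives 2:14 detection, 1:33 coordination, 1:35 response, 3:39 mitigation, and 6:29 comms latency, so comms latency is the longest measure there.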

5. Write the root-cause section

This is the investigate stage's output, written for a wider audience. State the root cause in plain language, distinguish it from contributing factors, and explain the mechanism — how the systemic condition produced the observable failure. Cite the evidence the investigate stage gathered.

If the root cause is a class of defect rather than a single instance (which is often the case), state the class explicitly. Action items in the prevention section will target the class.

6. Identify prevention measures

For each gap the incident exposed (detection gap, response gap, root cause gap, tooling gap), name a prevention measure that addresses it systemically. Specific monitoring, specific runbook, specific architectural change, specific test, specific process improvement. These flow into the action-item-tracker hat as the raw material.

Prevention measures must address the class, not just the instance. "Add a check for the specific value that broke" is necessary but not sufficient; "harden the input-validation contract for this surface class" is the systemic measure.

Format guidance

The postmortem document typically includes (in this order):

  • Header: incident slug, severity, declared-at, resolved-at, duration, customer impact summary
  • One-paragraph summary for executive audience
  • Timeline (as above)
  • Detection: how the incident was found, detection latency, alerting evaluation
  • Response: how the response unfolded, coordination latency, mitigation latency
  • Root cause: the systemic condition, the mechanism, the cited evidence
  • Contributing factors: separate from the root cause, each with its own mechanism
  • Action items (this stage's action-item-tracker hat appends)
  • Lessons / what went well / what we can improve

Anti-patterns (RFC 2119)

  • The agent MUST NOT assign blame to individuals — humans are not the root cause; systemic gaps that produced the locally rational mistake are the subject
  • The agent MUST NOT skip the "boring" parts of the timeline between detection and resolution — gaps in the narrative hide the improvement targets
  • The agent MUST include the detection story; how the incident was found is as important as what caused it
  • The agent MUST NOT propose only tactical patches ("add a check here") without addressing the systemic gap the incident exposed
  • The agent MUST NOT write the postmortem for a compliance audience; a document nobody reads prevents nothing
  • The agent MUST distinguish the root cause from contributing factors with a stated mechanism for each
  • The agent MUST cite specific evidence from the investigate and mitigate artifacts; "logs showed the issue" is not citation
  • The agent MUST NOT suppress an embarrassing finding — postmortems that hide difficult truths are how organizations stop learning
  • The agent MUST state detection latency, coordination latency, response latency, and comms latency where measurable; these are the inputs to most prevention work
hat 3 · Verifier

Focus: Validate the per-unit knowledge artifact for the postmortem stage of incident-response. Units here are postmortem sections: knowledge artifacts that downstream stages consume. Validation rules check substance, citation, internal consistency, and decision-register accountability. They do NOT cover executable verify-commands or DAG validity (workflow engine/build-stage concerns).

Anti-patterns (RFC 2119):

  • The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose; that is workflow engine territory per architecture §1.1.
  • The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
  • The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
  • The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
  • The agent MUST name a specific failed criterion in any rejection.
  • The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Artifact answers its topic

The unit's title and first paragraph define the topic. The remaining body MUST deliver substantive content on that topic. Reject placeholders, content-free outlines, or redirects.
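The mechanical part of this check (TODO markers and empty sections) can be sketched as a pre-filter. This is an illustration only: it assumes the unit body uses `#`-prefixed headings, which the source does not confirm, and the function name `body_problems` is hypothetical. Whether the content is substantive still needs judgment.

```python
import re

TODO_MARKER = re.compile(r"\bTODO\b")

def body_problems(body):
    """Return mechanical problems found in a unit body (empty list = none)."""
    problems = []
    if TODO_MARKER.search(body):
        problems.append("TODO marker present")
    heading = None
    has_content = True
    # A sentinel heading at the end closes the final section.
    for line in body.splitlines() + ["# end"]:
        if line.startswith("#"):
            if heading is not None and not has_content:
                problems.append(f"empty section: {heading}")
            heading = line.lstrip("#").strip()
            has_content = False
        elif line.strip():
            has_content = True
    return problems
```

A body that passes this filter can still be a content-free outline, so an empty result means only that no mechanical problem was found.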

2. Sources cited

Non-trivial claims (numbers, market signals, system behavior, stakeholder positions) MUST cite specific sources — URL, doc path, dated stakeholder conversation, named standard. Reject "industry common knowledge" or unsourced numerical claims.

3. Internal consistency

Title, mission, and body must align. Numerical/categorical claims must be consistent across the body. Recommendations must follow from the evidence presented.

4. Decision-register consistency

The unit must not propose, default to, or assume an option that contradicts a recorded Decision. Cite the Decision ID in any rejection.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted with veto-style approval, OR flagged (needs human escalation).

4 · Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent · Actionability

Mandate: The agent MUST verify that the postmortem produces actionable, owned, and tracked improvements that address systemic gaps (not just the specific incident instance), and that the narrative supports those improvements with cited evidence.

Check

The agent MUST verify, filing feedback for any violation:

  • Action items are specific and testable — Each item names a concrete deliverable that someone could execute without asking "what does this mean?" Vague items ("improve monitoring," "better runbooks") are findings.
  • Action items are owned — Each item names an individual or clearly-scoped rotation, not "the team" or "TBD." Unowned items don't get done.
  • Action items are tracked — Each item has a reference to the team's work-management system (ticket ID or URL). Postmortem-only items are forgotten.
  • Prevention addresses systemic gaps — Action items target the class of failure, not just the specific instance that occurred. "Add a check for this specific value" alone is not systemic; "harden the input-validation contract for this surface class" is.
  • Detection improvements present — If the incident was detected after a significant latency or was customer-reported, the action items include detection-improvement work.
  • Timeline is accurate and complete — Every timeline entry has a timestamp and source. Gaps between events are explained or flagged.
  • Blameless framing — The narrative does not name individuals as the cause; systemic conditions are the subject.
  • Detection-and-response measures stated — Detection latency, coordination latency, response latency, and comms latency appear where measurable. These are the inputs to most prevention work.
  • Priorities distinguish urgency — Action items are not flat-priority; some are P0/P1, others P2/P3, with reasoning implied by category.

Common failure modes to look for

  • Action items that read "improve X" with no concrete deliverable
  • Action items without owners, or with "the team" as owner
  • Action items not filed in any tracker — they live only in the document
  • A postmortem with 25 P1 action items (functionally no priority)
  • Root cause framed as "human error" with no analysis of the systemic conditions that allowed the error to reach production
  • Timeline that jumps from "first anomaly" to "incident declared" with nothing in between
  • Action items target the specific failing value or path but not the class of defect
  • No action item addresses the detection gap when detection was clearly delayed
  • Individuals named as the cause; the postmortem reads as accountability rather than learning
  • Lessons section restates the timeline without naming what was learned or what will change

5 · Gate

controls advancement to the next stage
External

Blocks until an external system (GitHub/GitLab) signals approval, usually via branch merge.

Fix loop

a separate track · Classifier → Postmortem Author → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1 · Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

  1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.

  2. Read the stage's unit list via haiku_unit_list { intent, stage }.

  3. Decide:

    • target_unit — which unit this FB counter-signals.
      • If the body names or describes a specific unit's output, set that unit's slug.
      • If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
      • When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
    • target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
      • user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
      • adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
      • drift origin → ["user"] (drift always escalates to human).
      • agent origin → [] (informational; no rerun).
  4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.

  5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.

    • blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
    • high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
    • medium — a genuine issue worth fixing; not delivery-blocking.
    • low — a nit, polish, or nice-to-have.

    Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.

  6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.

  7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write to the FB body: it's the immutable finding and is locked once the fix loop has started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
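
As a minimal sketch of the step 5 rubric, the snippet below ranks findings so that higher-severity ones come first, which is the order the fix-loop is described as using. The `SEVERITY_RANK` table, the `dispatch_order` helper, and the shape of the finding dictionaries are illustrative assumptions, not the engine's actual data model.

```python
# Hypothetical sketch: the rank table and the finding shape are assumptions.
SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def dispatch_order(findings):
    """Return findings sorted so that higher-severity items come first.

    sorted() is stable, so findings of equal severity keep their filing order.
    """
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])


findings = [
    {"feedback_id": "FB-01", "severity": "low"},      # urgent-sounding typo request
    {"feedback_id": "FB-02", "severity": "blocker"},  # calmly worded credential leak
    {"feedback_id": "FB-03", "severity": "medium"},
]

ordered_ids = [f["feedback_id"] for f in dispatch_order(findings)]
print(ordered_ids)
```

The ranking depends only on the severity label, which mirrors the rule that impact, not the requester's tone, decides the order.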

What you do NOT do

  • You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
  • You do NOT call haiku_feedback_reject; that marks the finding invalid, and a valid finding can't be rejected. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6; that's an acknowledgement, not a rejection.)
  • You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2 · Postmortem Author

Focus: Write a blameless postmortem that turns the incident into organizational learning. The narrative tells the full story — detection, response, root cause, contributing factors, prevention — in a way that someone who wasn't on the call can understand and that someone on the next on-call rotation can learn from. The postmortem is for learning, not accountability. Naming individuals as the cause is a documented anti-pattern; systemic gaps are the subject.

Process

1. Establish the blameless frame

Before writing, internalize the blameless lens: every action a human took during the incident was the locally rational choice given what they knew at that moment. The postmortem describes the systemic conditions (alerting gaps, knowledge gaps, tooling gaps, process gaps) that made the locally rational choice produce a bad outcome — not the human who made the choice.

Practical language patterns:

  • Write actions in passive or system-attributed voice ("the deploy was rolled back," "the alert routed to the on-call rotation") rather than naming the individual unless their role is the salient detail
  • Where naming a role is needed, use the role ("the on-call engineer," "the IC") rather than the person
  • Frame mistakes as system findings, not personal findings: "the team did not have a runbook for this failure mode" rather than "the engineer didn't know what to do"

2. Write the timeline

The timeline is the spine of the postmortem. Reconstruct it from the cited evidence in the investigate and mitigate artifacts. Every entry has a timestamp, a source, and a one-line description of what changed in the system or what the response did:

T+00:00  First anomaly:         error rate on /api/checkout crossed warning threshold (source: observability platform)
T+02:14  Alert fired:           paging rotation paged the on-call (source: paging system)
T+03:47  IC declared:           SEV-2 declared, scribe and comms lead assigned (source: incident channel)
T+05:22  First mitigation:      deploy X-123 rolled back (source: mitigation log)
T+09:01  Recovery confirmed:    error rate back below warning threshold for 5+ minutes (source: observability platform)
T+15:30  Customer comms:        status page updated to "resolved" (source: status page)

Do not skip the "boring" parts between events. A 12-minute gap between detection and IC declaration is itself a finding; if you compress it, the action items downstream will miss the response-time improvement work.
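
One way to keep the gaps visible is to derive the timeline from timestamped entries and list the intervals between consecutive events. The sketch below assumes each entry is a `(timestamp, label, description, source)` tuple; the helper names, the sample date, and the 120-second threshold are hypothetical choices for illustration.

```python
from datetime import datetime


def format_offset(start, moment):
    """Render the time elapsed since `start` as T+MM:SS."""
    total = int((moment - start).total_seconds())
    return f"T+{total // 60:02d}:{total % 60:02d}"


def render_timeline(entries):
    """Return one line per entry, ordered by timestamp, with T+ offsets."""
    ordered = sorted(entries, key=lambda e: e[0])
    start = ordered[0][0]
    return [
        f"{format_offset(start, ts)}  {label}: {desc} (source: {src})"
        for ts, label, desc, src in ordered
    ]


def long_gaps(entries, threshold_seconds):
    """List (from_label, to_label, seconds) for gaps at or above the threshold."""
    ordered = sorted(entries, key=lambda e: e[0])
    gaps = []
    for (t0, label0, _, _), (t1, label1, _, _) in zip(ordered, ordered[1:]):
        seconds = int((t1 - t0).total_seconds())
        if seconds >= threshold_seconds:
            gaps.append((label0, label1, seconds))
    return gaps


# Illustrative entries only; the date is made up for the example.
entries = [
    (datetime(2024, 1, 1, 10, 0, 0), "First anomaly",
     "error rate crossed warning threshold", "observability platform"),
    (datetime(2024, 1, 1, 10, 2, 14), "Alert fired",
     "paging rotation paged the on-call", "paging system"),
    (datetime(2024, 1, 1, 10, 3, 47), "IC declared",
     "SEV-2 declared", "incident channel"),
]

timeline_lines = render_timeline(entries)
gaps = long_gaps(entries, threshold_seconds=120)
```

Each gap returned by `long_gaps` is a candidate finding in its own right and can be carried into the response analysis.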

3. Tell the detection story

How was the incident found? Was it caught by alerting, by a customer report, or by an engineer noticing something wrong on a dashboard? What was the gap between the first anomaly the system experienced and the moment a human became aware? That gap is the detection latency (time to detect; averaged across incidents it is reported as MTTD), and it is one of the highest-leverage improvement targets: a fix that closes the alerting gap helps every future incident in this class, not just this one.

If the detection was driven by a customer report rather than internal alerting, name that explicitly. It's a finding, and it should produce monitoring action items.

4. Tell the response story

Walk the response: who paged whom, how long until the IC declared, how quickly the right roles were assigned, what mitigations were attempted (including the ones that didn't work), how long until recovery was confirmed. Cite the mitigation log for specific actions and the incident channel for coordination decisions.

Response time has several useful sub-measures:

  • Time from detection to IC declared (coordination latency)
  • Time from declaration to first mitigation applied (response latency)
  • Time from first mitigation to recovery confirmed (mitigation latency)
  • Time from recovery to customer-facing communication (comms latency)

Name the ones that are notably long; they're action-item inputs.
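
The sub-measures can be computed directly from the timeline. The sketch below uses the offsets from the example timeline in step 2 and treats the moment the alert fired as the moment of detection, which is an assumption; the milestone keys and helper names are hypothetical.

```python
def to_seconds(offset):
    """Convert a 'T+MM:SS' offset into seconds."""
    minutes, seconds = offset[2:].split(":")
    return int(minutes) * 60 + int(seconds)


# Offsets taken from the example timeline; key names are illustrative.
milestones = {
    "first_anomaly": "T+00:00",
    "detected": "T+02:14",          # assumption: alert fired == human aware
    "ic_declared": "T+03:47",
    "first_mitigation": "T+05:22",
    "recovery_confirmed": "T+09:01",
    "customer_comms": "T+15:30",
}

# (latency name, start milestone, end milestone)
LATENCY_SPANS = [
    ("detection", "first_anomaly", "detected"),
    ("coordination", "detected", "ic_declared"),
    ("response", "ic_declared", "first_mitigation"),
    ("mitigation", "first_mitigation", "recovery_confirmed"),
    ("comms", "recovery_confirmed", "customer_comms"),
]


def compute_latencies(milestones):
    """Return each latency in seconds, keyed by name."""
    return {
        name: to_seconds(milestones[end]) - to_seconds(milestones[start])
        for name, start, end in LATENCY_SPANS
    }


latencies = compute_latencies(milestones)
for name, seconds in latencies.items():
    print(f"{name} latency: {seconds // 60}m {seconds % 60}s")
```

In this example the comms latency (6m 29s) is the longest span, so it would be the first one to name as an action-item input.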

5. Write the root-cause section

This is the investigate stage's output, written for a wider audience. State the root cause in plain language, distinguish it from contributing factors, and explain the mechanism — how the systemic condition produced the observable failure. Cite the evidence the investigate stage gathered.

If the root cause is a class of defect rather than a single instance (which is often the case), state the class explicitly. Action items in the prevention section will target the class.

6. Identify prevention measures

For each gap the incident exposed (detection gap, response gap, root-cause gap, tooling gap), name a prevention measure that addresses it systemically. Be specific: a named monitor, runbook, architectural change, test, or process improvement. These measures are the raw material for the action-item-tracker hat.

Prevention measures must address the class, not just the instance. "Add a check for the specific value that broke" is necessary but not sufficient; "harden the input-validation contract for this surface class" is the systemic measure.

Format guidance

The postmortem document typically includes (in this order):

  • Header: incident slug, severity, declared-at, resolved-at, duration, customer impact summary
  • One-paragraph summary for executive audience
  • Timeline (as above)
  • Detection: how the incident was found, detection latency, alerting evaluation
  • Response: how the response unfolded, coordination latency, mitigation latency
  • Root cause: the systemic condition, the mechanism, the cited evidence
  • Contributing factors: separate from the root cause, each with its own mechanism
  • Action items (this stage's action-item-tracker hat appends)
  • Lessons / what went well / what we can improve
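
A draft can be checked against this ordering with a small helper. The sketch below is an assumption-laden convenience, not part of the engine: the short section names are abbreviations of the list above, and because the guidance says "typically", the result is best treated as a prompt to look again rather than a hard failure.

```python
# Abbreviated section names, in the order given above (names are illustrative).
EXPECTED_SECTIONS = [
    "Header",
    "Summary",
    "Timeline",
    "Detection",
    "Response",
    "Root cause",
    "Contributing factors",
    "Action items",
    "Lessons",
]


def missing_sections(headings):
    """Return expected sections that do not appear in the draft."""
    return [s for s in EXPECTED_SECTIONS if s not in headings]


def in_expected_order(headings):
    """True when the recognised headings follow the expected order."""
    positions = [
        EXPECTED_SECTIONS.index(h) for h in headings if h in EXPECTED_SECTIONS
    ]
    return positions == sorted(positions)


draft = ["Header", "Summary", "Timeline", "Response", "Detection", "Root cause"]
missing = missing_sections(draft)
ordered = in_expected_order(draft)
```

Here `missing` lists the three sections still to be written, and `ordered` is false because Response precedes Detection.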

Anti-patterns (RFC 2119)

  • The agent MUST NOT assign blame to individuals — humans are not the root cause; systemic gaps that produced the locally rational mistake are the subject
  • The agent MUST NOT skip the "boring" parts of the timeline between detection and resolution — gaps in the narrative hide the improvement targets
  • The agent MUST include the detection story; how the incident was found is as important as what caused it
  • The agent MUST NOT propose only tactical patches ("add a check here") without addressing the systemic gap the incident exposed
  • The agent MUST NOT write the postmortem for a compliance audience; a document nobody reads prevents nothing
  • The agent MUST distinguish the root cause from contributing factors with a stated mechanism for each
  • The agent MUST cite specific evidence from the investigate and mitigate artifacts; "logs showed the issue" is not citation
  • The agent MUST NOT suppress an embarrassing finding — postmortems that hide difficult truths are how organizations stop learning
  • The agent MUST state detection latency, coordination latency, response latency, and comms latency where measurable; these are the inputs to most prevention work

fix-hat 3 · Feedback Assessor

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
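
The whole-set rule can be sketched as a check over every enumerated item, not only the ones the fix touched. In the sketch below, `is_still_defective` stands in for whatever on-disk check applies to the finding, and the returned tuple is an illustrative stand-in for the closure decision, not an actual tool call; both are assumptions.

```python
def find_survivors(enumerated_items, is_still_defective):
    """Return every enumerated item that is still defective."""
    return [item for item in enumerated_items if is_still_defective(item)]


def closure_decision(enumerated_items, is_still_defective):
    """Decide between closing and rejecting based on the FULL enumerated set."""
    survivors = find_survivors(enumerated_items, is_still_defective)
    if survivors:
        # A partial fix on an enumerated finding is an open finding.
        return ("reject_hat", survivors)
    return ("advance_hat", [])


# Illustrative case: the finding lists 8 rows and the fixer corrected 3 of them.
rows = [f"row-{n}" for n in range(1, 9)]
fixed_on_disk = {"row-1", "row-2", "row-3"}

decision, open_rows = closure_decision(
    rows, lambda row: row not in fixed_on_disk
)
```

The untouched rows surface as survivors, so the decision is `reject_hat` with the five remaining rows named.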

Anti-patterns (RFC 2119):

  • The agent MUST NOT edit any file — you are a verifier, not a fixer
  • The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
  • The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
  • The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
  • The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
  • The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat