Resolve
Ask gate
Implement permanent fix with proper testing and review
Build the permanent fix. Mitigate stopped the bleeding with a reversible action; investigate produced the root cause. Resolve lands the code or system change that actually addresses that cause, ships a regression test that would have caught the incident, and plans the removal of the temporary mitigation once the fix is verified.
Scope
The durable fix and its safety net: the change that addresses the root cause, a regression test that fails without it, a deployment plan, the mitigation-cleanup plan, and a check for the same defect class elsewhere. Resolve decides how the incident is fixed for good — not why it happened (investigate) or how impact was stopped in the moment (mitigate).
What to do
- Address the diagnosed root cause itself, not the symptom the mitigation papered over.
- Ship a regression test that fails without the fix — proof it would have caught this incident.
- Plan the deployment with rollback criteria and a plan to remove the temporary mitigation once verified.
- Check whether the same class of defect exists elsewhere; a one-instance patch leaves the weakness for the next surface.
What NOT to do
- Don't rely on the mitigation as the fix — it's a holding action that resolve is meant to retire.
- Don't redo the diagnosis or re-litigate the root cause; consume investigate's output.
- Don't land a fix without a regression test that proves it.
- Don't patch the single instance and walk away from the broader defect class.
How the engine runs this stage
1. Elaborate
autonomous · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact
Resolution Summary
Record of the permanent fix, its testing, and deployment. This output feeds the postmortem stage with the technical closure needed for a complete incident narrative.
Content Guide
Document the permanent resolution:
- Fix description — what was changed and why, with references to specific code, configuration, or infrastructure changes
- Root cause addressed — how the fix prevents the specific root cause from recurring
- Difference from mitigation — how the permanent fix differs from the temporary mitigation and why the mitigation alone was insufficient
- Regression tests — tests added that would catch this failure mode, with evidence they fail without the fix
- Deployment details — rollout strategy, monitoring criteria, and rollback plan
- Mitigation cleanup — status of removing temporary mitigations (completed, scheduled, or blocked)
- Related improvements — any additional hardening or defensive changes made alongside the primary fix
Quality Signals
- Fix addresses the root cause, not just the symptom that triggered the incident
- Regression tests are specific to this failure mode, not generic
- Deployment plan reflects the risk level — not just "merge and deploy"
- Mitigation cleanup has a clear timeline and owner
Phase guidance
phase override: ELABORATION
Resolve Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Fix addresses the root cause identified in investigation, not just the symptom the mitigation covered"
- "Test coverage includes a regression test that would have caught this incident before it reached production"
- "Deployment plan includes canary or staged rollout with rollback criteria"
Bad criteria — vague (no clear check)
- "Code is fixed"
- "Tests pass"
- "Deployed to production"
Outputs produced
output template
Resolution Summary
Permanent fix documentation with regression tests and deployment plan.
Expected Artifacts
- Fix description -- how the permanent fix addresses root cause, not just the symptom
- Regression tests -- tests that would catch this specific failure mode
- Deployment plan -- rollout strategy with monitoring criteria
- Mitigation removal -- confirmation that the temporary mitigation can be safely removed
Quality Signals
- Fix addresses the root cause, not just the symptom the mitigation covered
- Regression tests exist that would have caught this incident
- Deployment plan includes canary or staged rollout with rollback criteria
- Code review is complete before deployment
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Correctness
Mandate: The agent MUST verify that the permanent fix addresses the actual root cause (not just the symptom), is covered by a regression test that would have caught this incident before production, and ships with a deployment plan and mitigation-cleanup plan calibrated to the post-incident risk profile.
Check
The agent MUST verify, filing feedback for any violation:
- Root-cause targeting — The fix addresses the systemic condition named in the investigate-stage root cause, not a downstream symptom the mitigation already covered. A fix that prevents the trigger without closing the underlying gap leaves the next trigger free to recur.
- Regression test fails on pre-fix code — The named test asserts the failure mode and demonstrably fails when the fix is removed. A test that passes on both versions is not regression protection.
- Regression test at the right layer — The test is at a layer where the failure mode is reproducible. A unit test for a cross-service race condition does not prevent that race.
- Class-search performed — The engineer documented a search for the same defect class elsewhere in the codebase, and either rolled the additional surfaces into this fix or split them into tracked follow-ups. Silent partial fixes are a finding.
- Deployment plan reflects post-incident risk — Rollout shape, rollback criteria, signal-watching ownership, and verification window are stated. "Merge to main, standard pipeline" is not a deployment plan for a fix that lands on a system that just had an incident.
- Mitigation cleanup planned — The mitigation has stated lift criteria (signal threshold), a lift procedure (exact reversal steps), and a lift owner. A mitigation left in place becomes invisible debt that hides the next incident.
- Fix does not reintroduce the incident conditions — The fix doesn't include behavior that would reproduce the original failure mode under a slightly different input or load profile.
Common failure modes to look for
- The mitigation promoted to permanent fix without addressing the root cause
- "Add a check" patch that addresses the specific value that triggered the incident but not the input class that contains it
- Regression test exists but is a smoke test in the general area; it passes on the pre-fix code
- "Tested locally" cited as deployment plan
- Class search done with a substring search that missed structurally similar code with different naming
- Mitigation cleanup not mentioned; the rollback or feature flag stays in place by accident
- Deployment plan implicitly relies on the mitigation staying in place, with no statement of what happens when it's lifted
- A fix that handles the failure mode for one code path but introduces the same fragility in a parallel path
3. Execute
per-unit baton · Engineer → Reviewer → Verifier
hat 1: Engineer
Focus: Build the permanent fix for the root cause identified during investigation. The mitigation bought the team time; use that time to do the job properly rather than promote the mitigation to permanent. A permanent fix is one that addresses the systemic condition (not just the symptom), is covered by a regression test that would have caught this failure mode before production, and has a deployment plan calibrated to its risk profile.
Process
1. Confirm the target
Re-read the root-cause artifact before writing any code. The mitigation may have addressed a proximate trigger while leaving the underlying condition in place — your fix targets the underlying condition, not the trigger. If after reading you can't articulate the systemic gap the fix is closing in one sentence, hand the case back to the investigator before writing code.
2. Look for the class, not just the instance
Before writing the fix, search the codebase for the same defect pattern in other places. If the root cause is "missing input validation on field X in endpoint Y," search every endpoint that accepts a similar field shape and check whether they validate. If the root cause is "race condition in pattern Z," search for every place that pattern is used.
A fix that closes one surface while leaving five other surfaces exposed is not a permanent fix — it's a partial fix that defers the rest of the work to the next incident. List the other surfaces found in the resolution summary and either include them in this resolve unit or split them into follow-up resolve units. Do not silently ignore them.
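One way to search for the class rather than the instance is to match on code structure instead of names. The following is a minimal sketch using Python's standard `ast` module. The defect pattern it looks for (a function that subscripts its request payload but never calls a `validate_*` helper) and the names `payload` and `validate_` are hypothetical stand-ins for whatever the root-cause artifact actually names.

```python
import ast

# Hypothetical defect class: a function reads `payload[...]`
# without calling any helper whose name starts with `validate_`.
PAYLOAD_NAME = "payload"        # assumption: how handlers name their input
VALIDATOR_PREFIX = "validate_"  # assumption: naming scheme for validators


def _reads_payload(func: ast.AST) -> bool:
    """True if the function contains a payload[...] subscript anywhere."""
    return any(
        isinstance(node, ast.Subscript)
        and isinstance(node.value, ast.Name)
        and node.value.id == PAYLOAD_NAME
        for node in ast.walk(func)
    )


def _calls_validator(func: ast.AST) -> bool:
    """True if the function calls validate_*(...) by bare name."""
    return any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id.startswith(VALIDATOR_PREFIX)
        for node in ast.walk(func)
    )


def find_unvalidated_handlers(source: str) -> list[str]:
    """Return names of functions that match the hypothetical defect class."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and _reads_payload(node)
        and not _calls_validator(node)
    ]


SAMPLE = '''
def create_order(payload):
    validate_order(payload)
    return payload["quantity"]

def update_order(payload):
    return payload["quantity"] * 2

def refund(payload):
    amount = payload["amount"]
    return amount
'''

print(find_unvalidated_handlers(SAMPLE))
```

Here `update_order` and `refund` are reported even though neither shares a name with the endpoint that failed. Each surface found this way still needs a human read before it goes into the resolution summary as in-scope or deferred.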
3. Write the regression test first
Write a test that fails with the current code (before the fix) and passes with the fix applied. The test should target the failure mode at the smallest layer that can reproduce it — a unit test for a logic defect, a contract or integration test for a service-boundary defect, an end-to-end test for a multi-service interaction. State in the summary which layer the test lives at and why that layer was chosen.
A regression test is not a generic test "in the area" — it's a test that would specifically have caught this incident. If the test would still pass with the bug present, it's not a regression test. The reviewer hat verifies this by running the test on the pre-fix code.
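As an illustration, here is a minimal sketch of a regression test that fails on pre-fix code and passes on the fix. The incident is invented for the example: a quantity parser that accepted non-positive values. The function names and the input values are assumptions, not part of any real codebase.

```python
def parse_quantity_pre_fix(raw: str) -> int:
    """Hypothetical pre-fix code: accepts any integer, including <= 0."""
    return int(raw)


def parse_quantity_fixed(raw: str) -> int:
    """Hypothetical fix: rejects the whole input class, not one value."""
    value = int(raw)
    if value <= 0:
        raise ValueError(f"quantity must be positive, got {value}")
    return value


def regression_test(parse) -> None:
    """Asserts the failure mode: non-positive quantities must be rejected."""
    # Cover the input class, not only the value that triggered the incident.
    for raw in ("0", "-1", "-100"):
        try:
            parse(raw)
        except ValueError:
            continue
        raise AssertionError(f"{raw!r} was accepted")
    assert parse("3") == 3  # valid input still works after the fix


def fails(test, implementation) -> bool:
    """True if the test raises AssertionError against the implementation."""
    try:
        test(implementation)
    except AssertionError:
        return True
    return False


print(fails(regression_test, parse_quantity_pre_fix))  # True
print(fails(regression_test, parse_quantity_fixed))    # False
```

This is a unit-layer test because the invented defect is a logic defect. A service-boundary or multi-service defect would need the same fail-then-pass property demonstrated at a higher layer.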
4. Plan the deployment
Permanent fixes deserve more careful deployment than routine changes because they're being landed on a system that just had an incident. State explicitly:
- Rollout shape: canary, staged percentage rollout, region-by-region, or all-at-once with a reason
- Rollback criteria: which signals at which thresholds trigger immediate rollback, and who's watching them during the rollout window
- Coordination with the mitigation: does the rollout require the mitigation to remain in place, or does it require the mitigation to be lifted? In what order?
- Verification plan: what signals must hold at acceptable values for how long before the deployment is declared safe
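The four elements above can be written down as data so that rollback criteria are checkable rather than a judgment call. This is a rough sketch only: the signal names, thresholds, stage percentages, and watcher are illustrative assumptions, and treating an unobserved signal as breached is a design choice of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriterion:
    signal: str       # metric name (hypothetical)
    threshold: float  # roll back when the observed value exceeds this
    watcher: str      # who watches the signal during the rollout window


@dataclass(frozen=True)
class RolloutPlan:
    stages_percent: tuple[int, ...]         # rollout shape
    criteria: tuple[RollbackCriterion, ...]  # rollback criteria
    mitigation_order: str                   # coordination with the mitigation
    verification_minutes: int               # how long signals must hold


def breached_signals(plan: RolloutPlan, observed: dict[str, float]) -> list[str]:
    """Return signals over threshold; a missing signal counts as breached."""
    return [
        c.signal
        for c in plan.criteria
        if observed.get(c.signal, float("inf")) > c.threshold
    ]


PLAN = RolloutPlan(
    stages_percent=(1, 10, 50, 100),
    criteria=(
        RollbackCriterion("error_rate", 0.01, "on-call engineer"),
        RollbackCriterion("p99_latency_ms", 800.0, "on-call engineer"),
    ),
    mitigation_order="mitigation stays on until the 100% stage has verified",
    verification_minutes=60,
)

print(breached_signals(PLAN, {"error_rate": 0.002, "p99_latency_ms": 950.0}))
# ['p99_latency_ms']
```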
"Standard CI/CD pipeline" is not a deployment plan for an incident fix; the standard pipeline is what was in place when the incident happened.
5. Plan the mitigation cleanup
The mitigation is reversible by design but is also typically a degraded mode — a feature is off, a region is drained, capacity is doubled at extra cost, a fallback path is engaged. The permanent fix must include a plan to lift the mitigation:
- Lift criteria: what signal at what threshold confirms the fix made the mitigation unnecessary
- Lift procedure: exact steps to reverse the mitigation
- Lift owner: who runs the cleanup and when
If the mitigation is staying in place permanently (e.g., a feature flag is now part of the default-off config), state that explicitly with rationale. A mitigation left in place by accident is a future incident waiting for someone to lift it without understanding why it was there.
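A lift plan with the three parts above might be recorded as follows. This is a sketch under assumptions: the signal name, threshold, steps, and owner are invented, and the stability window (how many consecutive per-minute samples must stay at or below the threshold) is an addition of this example, not something the lift criteria are required to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiftPlan:
    signal: str                 # lift criteria: which signal
    threshold: float            # lift criteria: at what threshold
    stable_minutes: int         # assumption: required stability window
    procedure: tuple[str, ...]  # lift procedure: exact reversal steps
    owner: str                  # lift owner


def ready_to_lift(plan: LiftPlan, per_minute_values: list[float]) -> bool:
    """True only if the most recent window is complete and within threshold."""
    if plan.stable_minutes <= 0:
        raise ValueError("stable_minutes must be positive")
    window = per_minute_values[-plan.stable_minutes:]
    return len(window) == plan.stable_minutes and all(
        value <= plan.threshold for value in window
    )


PLAN = LiftPlan(
    signal="queue_depth",  # hypothetical signal
    threshold=100.0,
    stable_minutes=3,
    procedure=(
        "re-enable the feature flag for 10% of traffic",
        "watch queue_depth for the stability window",
        "re-enable the feature flag for all traffic",
    ),
    owner="payments on-call",  # hypothetical owner
)

print(ready_to_lift(PLAN, [400.0, 90.0, 80.0, 70.0]))  # True
print(ready_to_lift(PLAN, [90.0, 80.0]))               # False
```

The second call returns `False` because the window is incomplete: too little data is not evidence that the mitigation is unnecessary.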
Format guidance
Each resolve unit's section in RESOLUTION-SUMMARY.md should include:
- Root cause being addressed (verbatim from investigate)
- Class search results: other surfaces with the same defect pattern, what's in this unit vs. what's deferred
- Fix description: the code or system change, with file or component references
- Regression test reference: which test, at which layer, why that layer
- Deployment plan: rollout shape, rollback criteria, coordination with mitigation, verification plan
- Mitigation cleanup plan: lift criteria, lift procedure, lift owner
Anti-patterns (RFC 2119)
- The agent MUST NOT promote the mitigation to the permanent fix without first checking whether the mitigation addresses the root cause (it usually doesn't)
- The agent MUST NOT ship a fix without a regression test that fails on the pre-fix code
- The agent MUST search for the same defect class elsewhere in the codebase before declaring the fix complete; silent partial fixes defer the rest of the work to the next incident
- The agent MUST NOT skip the deployment plan because the change is small — incident fixes land on systems that just had an incident; the deployment plan reflects the elevated risk
- The agent MUST NOT leave the temporary mitigation in place without a documented lift plan (criteria, procedure, owner) — orphaned mitigations become invisible technical debt
- The agent MUST state which root-cause condition the fix addresses, not just describe what the code change does
- The agent MUST NOT write a regression test that passes on the pre-fix code — that test does not regress the failure mode
- The agent MUST state which test layer (unit / integration / end-to-end) the regression test lives at and why that layer was chosen
- The agent MUST NOT silently merge the mitigation lift into the fix deployment — the lift has its own criteria, procedure, and owner
hat 2: Reviewer
Focus: Verify that the permanent fix actually addresses the root cause, the regression test would have caught the incident, the deployment plan matches the elevated risk profile of post-incident changes, and the mitigation has a documented lift plan. You are the verify role for the resolve stage. The temptation is to rubber-stamp because the incident is mitigated and urgency has dropped — that temptation is the failure mode this hat exists to prevent.
You are body-only. You read the unit body, the root-cause artifact, the diff, the test, and the deployment plan, and you decide based on the substance of what's recorded.
Process
1. Re-read the root cause before the diff
Read the investigate-stage root cause first. Then read the engineer's resolution summary's root-cause section and confirm they match — same condition described, same level of mechanism, same blast radius implications. A summary that paraphrases the root cause incorrectly often correlates with a fix that targets the paraphrase rather than the actual condition.
Then read the diff with the root cause in hand. Ask: does this code change close the systemic gap, or does it patch the surface that the incident hit? If you can describe a similar input or condition that would still trigger the same failure mode with the fix applied, the fix is incomplete — reject and name the gap.
2. Verify the regression test actually regresses
The engineer claims the test fails without the fix. Verify it. Mentally (or actually, if running the tests is feasible) apply the test to the pre-fix code: does the assertion fail in the way the incident manifested? A test that passes on both pre-fix and post-fix code is decorative, not regression-protective.
The test must target the failure mode at a layer where that failure mode is reproducible. A unit test for a multi-service race condition will pass cleanly while the production bug remains; the engineer should have stated why this layer was chosen, and you check whether that rationale holds.
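When running the tests is feasible, the fail-then-pass check can be made mechanical. The sketch below assumes the pre-fix and post-fix behavior can each be called directly; the backoff functions and both tests are hypothetical. The verdict labels reuse the confirmed-regresses / does-not-regress wording from the format guidance in this hat.

```python
from typing import Callable


def passes(test: Callable[[Callable], None], implementation: Callable) -> bool:
    """True if the test raises no AssertionError against the implementation."""
    try:
        test(implementation)
    except AssertionError:
        return False
    return True


def regression_verdict(test, pre_fix, post_fix) -> str:
    """A test regresses only if it fails pre-fix AND passes post-fix."""
    if not passes(test, pre_fix) and passes(test, post_fix):
        return "confirmed-regresses"
    return "does-not-regress"


# Hypothetical defect: the retry delay grew without an upper bound.
def backoff_pre_fix(attempt: int) -> float:
    return 0.5 * 2 ** attempt


def backoff_fixed(attempt: int) -> float:
    return min(0.5 * 2 ** attempt, 30.0)


def smoke_test(backoff) -> None:
    assert backoff(0) == 0.5  # holds before and after the fix


def targeted_test(backoff) -> None:
    assert backoff(20) <= 30.0  # asserts the failure mode itself


print(regression_verdict(smoke_test, backoff_pre_fix, backoff_fixed))
# does-not-regress
print(regression_verdict(targeted_test, backoff_pre_fix, backoff_fixed))
# confirmed-regresses
```

The smoke test is the decorative case: it lives in the right area and passes on both versions, so it protects nothing.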
3. Walk the class-search results
The engineer searched the codebase for the same defect class. Read the search results. If the search was perfunctory ("checked similar functions, found none") and you can think of three places worth checking that weren't named, that's a finding. If the search found other surfaces and the engineer deferred them to follow-up units, confirm the deferral is intentional and the follow-up units exist or are tracked.
4. Check deployment plan against residual risk
The system just had an incident. Standard CI/CD with no extra precautions is the deployment posture that existed when the incident happened. The deployment plan should reflect elevated caution:
- Rollout shape matches the blast radius of the fix
- Rollback criteria are specific signal thresholds, not "if it looks bad"
- Coordination with the mitigation is explicit — does the mitigation stay in place during rollout? Is it lifted before, during, or after?
- Verification plan names the signals and stability window
A deployment plan that says "merge and deploy" without naming any of these is a reject.
5. Check the mitigation cleanup
The mitigation is currently degrading something — capacity is doubled at extra cost, a feature is off, a region is drained. The fix should make the mitigation unnecessary. Confirm:
- Lift criteria are stated as specific signal thresholds
- Lift procedure exists and matches the rollback procedure documented at mitigation-apply time
- Lift owner is named, or the lift is scheduled in the team's work system
If the resolution is permanently consuming the mitigation (e.g., the feature flag is staying), confirm that's intentional with a stated rationale.
6. Decide
- All five checks pass with substance → call haiku_unit_advance_hat.
- Any check fails → call haiku_unit_reject_hat, naming the specific failed check.
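The decision rule can be sketched as a small function. The check identifiers are labels invented here for the five checks; the function only returns which tool to call and which checks failed. It does not call the tools, whose argument shapes are not described in this section.

```python
# Labels for the five reviewer checks (identifiers are assumptions).
CHECKS = (
    "root-cause-match",
    "regression-test",
    "class-search",
    "deployment-plan",
    "mitigation-cleanup",
)


def decide(results: dict[str, bool]) -> tuple[str, list[str]]:
    """Return (tool name to call, failed checks).

    A check that is missing from `results` counts as failed, so an
    unperformed check can never produce an advance.
    """
    failed = [name for name in CHECKS if not results.get(name, False)]
    if failed:
        return ("haiku_unit_reject_hat", failed)
    return ("haiku_unit_advance_hat", [])


all_pass = {name: True for name in CHECKS}
print(decide(all_pass))
# ('haiku_unit_advance_hat', [])
print(decide({**all_pass, "class-search": False}))
# ('haiku_unit_reject_hat', ['class-search'])
```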
Format guidance
Your section in RESOLUTION-SUMMARY.md (verifier addendum) should include:
- Root-cause-match check: confirmed / mismatch with explanation
- Regression-test check: confirmed-regresses / does-not-regress with reasoning
- Class-search check: confirmed-complete / gaps-found with the missed surfaces named
- Deployment-plan check: adequate / inadequate with the missing element named
- Mitigation-cleanup check: adequate / inadequate with the missing element named
Anti-patterns (RFC 2119)
- The agent MUST NOT rubber-stamp because the incident is mitigated and urgency has dropped — the verify role exists precisely to resist that temptation
- The agent MUST NOT review the diff without re-reading the root-cause artifact first
- The agent MUST verify the regression test actually fails on the pre-fix code, not just that a test exists
- The agent MUST NOT accept "standard CI/CD" as a deployment plan for an incident fix
- The agent MUST check the class-search results and flag any obvious additional surfaces the engineer missed
- The agent MUST confirm the mitigation cleanup plan exists with criteria, procedure, and owner
- The agent MUST NOT approve a fix that closes only the surface the incident hit while leaving the same defect class exposed elsewhere — that's a partial fix, not a permanent one
- The agent MUST name the specific failed check in any rejection so the engineer knows what to address
hat 3: Verifier
Focus: Validate the per-unit permanent-fix record for the resolve stage of incident-response. Units here are post-incident code or system changes shipping the durable fix. Validation rules check that the fix traces to the investigated root cause, that the regression test would have caught the original incident, and that the mitigation-cleanup plan is concrete.
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow-engine territory per architecture §1.1.
- The agent MUST NOT re-run the regression test (the reviewer hat verified the fail-then-pass behavior; verify the body cites it).
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST NOT propose alternative fixes — the resolve hat's choice stands unless the body is incoherent.
- The agent MUST name a specific failed criterion in any rejection.
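The placeholder rule above (no TODO markers, no empty sections) is mechanical enough to sketch. This example assumes the unit body is Markdown with `#`-style headings, which this section does not confirm, and it treats nested headings as flat, so a parent heading followed directly by a subheading would be reported as empty.

```python
import re

# Assumption: sections are introduced by Markdown ATX headings.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def body_gaps(body: str) -> list[str]:
    """Return substantive gaps: empty body, TODO markers, empty sections."""
    if not body.strip():
        return ["body is empty"]

    gaps = []
    if re.search(r"\bTODO\b", body):
        gaps.append("contains TODO marker")

    current = None       # title of the section being scanned
    has_content = False  # whether that section has any non-blank line
    for line in body.splitlines():
        match = HEADING.match(line)
        if match:
            if current is not None and not has_content:
                gaps.append(f"empty section: {current}")
            current, has_content = match.group(1).strip(), False
        elif line.strip():
            has_content = True
    if current is not None and not has_content:
        gaps.append(f"empty section: {current}")
    return gaps


BODY = """## Root cause
Cache key omitted the tenant id.

## Regression test

## Cleanup
TODO: name the lift owner
"""

print(body_gaps(BODY))
# ['contains TODO marker', 'empty section: Regression test']
```

Each returned string names a specific gap, which is what a rejection needs to cite.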
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
What you check (BODY ONLY)
1. Fix traces to root cause
The unit body MUST cite the root cause from the investigate stage AND name how the proposed change addresses it. A fix that patches a symptom without naming the root-cause path it closes is a reject — the same incident class will recur.
2. Regression test specification is complete
The body MUST describe a regression test that fails without the fix applied AND state where it lives (test file path) and how it's run. "Tests pass" is not sufficient. The reviewer hat verifies the test actually catches the original incident; the verifier checks the body cites that verification.
3. Mitigation-cleanup plan is concrete
If the mitigate stage applied a temporary action that's still in place, the resolve unit body MUST name the cleanup step (revert the flag, remove the rate limit, restore the original config) and when to run it (after canary, after full rollout, on a calendar date). Missing cleanup plans leave temporary mitigations turning permanent.
4. Decision-register consistency
The unit body MUST NOT propose an architectural change that contradicts a Decision in the intent's register. Cite the Decision ID.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Questions about whether the same class of defect exists elsewhere MUST escalate or be answered with a named investigation.
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Correctness
Mandate: The agent MUST verify that the permanent fix addresses the actual root cause (not just the symptom), is covered by a regression test that would have caught this incident before production, and ships with a deployment plan and mitigation-cleanup plan calibrated to the post-incident risk profile.
Check
The agent MUST verify, filing feedback for any violation:
- Root-cause targeting — The fix addresses the systemic condition named in the investigate-stage root cause, not a downstream symptom the mitigation already covered. A fix that prevents the trigger without closing the underlying gap leaves the next trigger free to recur.
- Regression test fails on pre-fix code — The named test asserts the failure mode and demonstrably fails when the fix is removed. A test that passes on both versions is not regression protection.
- Regression test at the right layer — The test is at a layer where the failure mode is reproducible. A unit test for a cross-service race condition does not prevent that race.
- Class-search performed — The engineer documented a search for the same defect class elsewhere in the codebase, and either rolled the additional surfaces into this fix or split them into tracked follow-ups. Silent partial fixes are a finding.
- Deployment plan reflects post-incident risk — Rollout shape, rollback criteria, signal-watching ownership, and verification window are stated. "Merge to main, standard pipeline" is not a deployment plan for a fix that lands on a system that just had an incident.
- Mitigation cleanup planned — The mitigation has stated lift criteria (signal threshold), a lift procedure (exact reversal steps), and a lift owner. A mitigation left in place becomes invisible debt that hides the next incident.
- Fix does not reintroduce the incident conditions — The fix doesn't include behavior that would reproduce the original failure mode under a slightly different input or load profile.
Common failure modes to look for
- The mitigation promoted to permanent fix without addressing the root cause
- "Add a check" patch that addresses the specific value that triggered the incident but not the input class that contains it
- Regression test exists but is a smoke test in the general area; it passes on the pre-fix code
- "Tested locally" cited as deployment plan
- Class search done with a substring search that missed structurally similar code with different naming
- Mitigation cleanup not mentioned; the rollback or feature flag stays in place by accident
- Deployment plan implicitly relies on the mitigation staying in place, with no statement of what happens when it's lifted
- A fix that handles the failure mode for one code path but introduces the same fragility in a parallel path
5. Gate
controls advancement to the next stage
A local review UI opens; a human approves or requests changes via the review tool.
Fix loop
a separate track · Classifier → Engineer → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit — which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium — a genuine issue worth fixing; not delivery-blocking.
   - low — a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
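The default routing and the severity ranking described in the steps above can be sketched as two small pure functions. The origin strings and severity names come from the text; the function names, the feedback ids, and the choice to keep filing order among findings of equal severity are assumptions of this example, not documented engine behavior.

```python
from typing import Optional

SEVERITY_ORDER = ("blocker", "high", "medium", "low")


def default_invalidates(origin: str, filer_agent: Optional[str] = None) -> list[str]:
    """Default rule of thumb for which approval roles get cleared on closure."""
    if origin in {"user-chat", "user-visual", "user-question"}:
        return ["user"]  # the human will re-review
    if origin in {"adversarial-review", "studio-review"}:
        if not filer_agent:
            raise ValueError("review-origin feedback needs the filer agent name")
        return [filer_agent]  # the originating reviewer re-runs
    if origin == "drift":
        return ["user"]  # drift always escalates to a human
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin!r}")


def dispatch_order(findings: list[tuple[str, str]]) -> list[str]:
    """Order (feedback_id, severity) pairs so higher severity comes first.

    sorted() is stable, so equal severities keep their filing order
    (an assumption about tie-breaking, not documented behavior).
    """
    rank = {severity: index for index, severity in enumerate(SEVERITY_ORDER)}
    ordered = sorted(findings, key=lambda finding: rank[finding[1]])
    return [feedback_id for feedback_id, _ in ordered]


print(default_invalidates("studio-review", filer_agent="correctness"))
# ['correctness']
print(dispatch_order([("FB-1", "low"), ("FB-2", "blocker"), ("FB-3", "high")]))
# ['FB-2', 'FB-3', 'FB-1']
```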
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject — that marks the finding invalid, and you can't reject a valid finding. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2: Engineer
Focus: Build the permanent fix for the root cause identified during investigation. The mitigation bought the team time; use that time to do the job properly rather than promote the mitigation to permanent. A permanent fix is one that addresses the systemic condition (not just the symptom), is covered by a regression test that would have caught this failure mode before production, and has a deployment plan calibrated to its risk profile.
Process
1. Confirm the target
Re-read the root-cause artifact before writing any code. The mitigation may have addressed a proximate trigger while leaving the underlying condition in place — your fix targets the underlying condition, not the trigger. If after reading you can't articulate the systemic gap the fix is closing in one sentence, hand the case back to the investigator before writing code.
2. Look for the class, not just the instance
Before writing the fix, search the codebase for the same defect pattern in other places. If the root cause is "missing input validation on field X in endpoint Y," search every endpoint that accepts a similar field shape and check whether they validate. If the root cause is "race condition in pattern Z," search for every place that pattern is used.
A fix that closes one surface while leaving five other surfaces exposed is not a permanent fix — it's a partial fix that defers the rest of the work to the next incident. List the other surfaces found in the resolution summary and either include them in this resolve unit or split them into follow-up resolve units. Do not silently ignore them.
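The class search can be made repeatable with a small script. A minimal sketch, assuming the defect pattern can be approximated by a regular expression; the function name find_defect_surfaces, the default file glob, and the example pattern are hypothetical.

```python
import re
from pathlib import Path


def find_defect_surfaces(root, pattern, glob="*.py"):
    """Return (path, line_number, line) for every line under root matching pattern.

    A regex hit is only a candidate surface; each hit still needs a manual read
    to confirm it shares the root cause.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


# Hypothetical usage: list every place a request field is read directly.
# for path, number, line in find_defect_surfaces("src", r"request\.args\["):
#     print(f"{path}:{number}: {line}")
```

The output is a starting list for the resolution summary: each confirmed hit is either fixed in this unit or named in a follow-up unit.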
3. Write the regression test first
Write a test that fails with the current code (before the fix) and passes with the fix applied. The test should target the failure mode at the smallest layer that can reproduce it — a unit test for a logic defect, a contract or integration test for a service-boundary defect, an end-to-end test for a multi-service interaction. State in the summary which layer the test lives at and why that layer was chosen.
A regression test is not a generic test "in the area" — it's a test that would specifically have caught this incident. If the test would still pass with the bug present, it's not a regression test. The reviewer hat verifies this by running the test on the pre-fix code.
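The fail-before, pass-after property can be checked mechanically. A minimal sketch, assuming a hypothetical logic defect (a quantity parser that accepted non-positive values); every name here is illustrative and stands in for the real pre-fix and post-fix code.

```python
def parse_quantity_prefix(raw):
    """Hypothetical pre-fix behaviour: accepts any integer, including negatives."""
    return int(raw)


def parse_quantity_fixed(raw):
    """Hypothetical fix: reject the non-positive values behind the incident."""
    value = int(raw)
    if value <= 0:
        raise ValueError(f"quantity must be positive, got {value}")
    return value


def regression_test(parse_quantity):
    """Return True when the implementation rejects the incident input."""
    try:
        parse_quantity("-3")
    except ValueError:
        return True
    return False


# The reviewer's check: the test fails on pre-fix code and passes on the fix.
assert regression_test(parse_quantity_prefix) is False
assert regression_test(parse_quantity_fixed) is True
```

This is a unit-layer test because the hypothetical defect is pure logic; a service-boundary defect would need the same two-sided check at the contract or integration layer.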
4. Plan the deployment
Permanent fixes deserve more careful deployment than routine changes because they're being landed on a system that just had an incident. State explicitly:
- Rollout shape: canary, staged percentage rollout, region-by-region, or all-at-once (with a stated reason for choosing it)
- Rollback criteria: which signals at which thresholds trigger immediate rollback, and who's watching them during the rollout window
- Coordination with the mitigation: does the rollout require the mitigation to remain in place, or does it require the mitigation to be lifted? In what order?
- Verification plan: what signals must hold at acceptable values for how long before the deployment is declared safe
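Rollback criteria are easier to review when written as data rather than prose. A minimal sketch, assuming each criterion is an upper bound on a named signal; the class, the signal names, and the thresholds are hypothetical, and treating a missing signal as a breach is a design choice of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriterion:
    signal: str       # metric name (hypothetical)
    threshold: float  # value above which rollback is triggered
    watcher: str      # who watches this signal during the rollout window


def breached(criteria, observed):
    """Return the criteria whose observed value exceeds the threshold.

    A signal missing from `observed` counts as breached: an unwatched signal
    cannot confirm the rollout is healthy.
    """
    result = []
    for criterion in criteria:
        value = observed.get(criterion.signal)
        if value is None or value > criterion.threshold:
            result.append(criterion)
    return result


criteria = [
    RollbackCriterion("error_rate", 0.01, "on-call engineer"),
    RollbackCriterion("p99_latency_ms", 800.0, "on-call engineer"),
]
tripped = breached(criteria, {"error_rate": 0.002, "p99_latency_ms": 950.0})
# tripped contains only the p99_latency_ms criterion
```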
"Standard CI/CD pipeline" is not a deployment plan for an incident fix; the standard pipeline is what was in place when the incident happened.
5. Plan the mitigation cleanup
The mitigation is reversible by design but is also typically a degraded mode — a feature is off, a region is drained, capacity is doubled at extra cost, a fallback path is engaged. The permanent fix must include a plan to lift the mitigation:
- Lift criteria: what signal at what threshold confirms the fix made the mitigation unnecessary
- Lift procedure: exact steps to reverse the mitigation
- Lift owner: who runs the cleanup and when
If the mitigation is staying in place permanently (e.g., a feature flag is now part of the default-off config), state that explicitly with rationale. A mitigation left in place by accident is a future incident waiting for someone to lift it without understanding why it was there.
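The "lift plan or explicit rationale" rule can be expressed as a small completeness check. A minimal sketch, assuming the plan is captured as structured data; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MitigationPlan:
    lift_criteria: str = ""
    lift_procedure: List[str] = field(default_factory=list)
    lift_owner: str = ""
    # Set only when the mitigation is deliberately staying in place.
    permanent_rationale: Optional[str] = None


def plan_gaps(plan):
    """Return the missing elements; an empty list means the plan is documented."""
    if plan.permanent_rationale and plan.permanent_rationale.strip():
        return []  # staying permanently, with a stated reason
    gaps = []
    if not plan.lift_criteria.strip():
        gaps.append("lift criteria")
    if not plan.lift_procedure:
        gaps.append("lift procedure")
    if not plan.lift_owner.strip():
        gaps.append("lift owner")
    return gaps


orphaned = MitigationPlan(lift_criteria="error rate below 0.1% for 24h")
# plan_gaps(orphaned) names the procedure and the owner as still missing
```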
Format guidance
Each resolve unit's section in RESOLUTION-SUMMARY.md should include:
- Root cause being addressed (verbatim from investigate)
- Class search results: other surfaces with the same defect pattern, what's in this unit vs. what's deferred
- Fix description: the code or system change, with file or component references
- Regression test reference: which test, at which layer, why that layer
- Deployment plan: rollout shape, rollback criteria, coordination with mitigation, verification plan
- Mitigation cleanup plan: lift criteria, lift procedure, lift owner
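A section can be checked against this list before review. A minimal sketch, assuming each field appears in the section under the label used above; the exact labels in a real RESOLUTION-SUMMARY.md may differ, and the function name is hypothetical.

```python
REQUIRED_FIELDS = (
    "Root cause being addressed",
    "Class search results",
    "Fix description",
    "Regression test reference",
    "Deployment plan",
    "Mitigation cleanup plan",
)


def missing_fields(section_text):
    """Return the required field labels that do not appear in the section."""
    lowered = section_text.lower()
    return [name for name in REQUIRED_FIELDS if name.lower() not in lowered]


draft = """
Root cause being addressed: (verbatim from investigate)
Fix description: validate the field before use
Deployment plan: canary, then staged rollout
"""
# missing_fields(draft) lists the three labels the draft has not covered yet
```

A label match only shows the field is present; whether its content is adequate is still the reviewer's call.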
Anti-patterns (RFC 2119)
- The agent MUST NOT promote the mitigation to the permanent fix without first checking whether the mitigation addresses the root cause (it usually doesn't)
- The agent MUST NOT ship a fix without a regression test that fails on the pre-fix code
- The agent MUST search for the same defect class elsewhere in the codebase before declaring the fix complete; silent partial fixes defer the rest of the work to the next incident
- The agent MUST NOT skip the deployment plan because the change is small — incident fixes land on systems that just had an incident; the deployment plan reflects the elevated risk
- The agent MUST NOT leave the temporary mitigation in place without a documented lift plan (criteria, procedure, owner) — orphaned mitigations become invisible technical debt
- The agent MUST state which root-cause condition the fix addresses, not just describe what the code change does
- The agent MUST NOT write a regression test that passes on the pre-fix code — that test does not regress the failure mode
- The agent MUST state which test layer (unit / integration / end-to-end) the regression test lives at and why that layer was chosen
- The agent MUST NOT silently merge the mitigation lift into the fix deployment — the lift has its own criteria, procedure, and owner
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
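The whole-set rule reduces to a filter over the finding's enumerated items. A minimal sketch, assuming the assessor can evaluate each item against the files on disk; the function names and the row identifiers are hypothetical, and the returned strings only label the decision, they do not call the engine.

```python
def surviving_items(enumerated, still_defective):
    """Return the enumerated items that are still defective, in the finding's order.

    `still_defective` is a predicate evaluated for EVERY enumerated item,
    not only the ones the fixer touched.
    """
    return [item for item in enumerated if still_defective(item)]


def closure_decision(enumerated, still_defective):
    """Reject when any enumerated item survives; close only on an empty list."""
    survivors = surviving_items(enumerated, still_defective)
    if survivors:
        return ("reject_hat", survivors)
    return ("advance_hat", [])


# The finding enumerates 8 rows; the fixer corrected rows 1 to 3 only.
rows = [f"row-{i}" for i in range(1, 9)]
stale_on_disk = {"row-4", "row-5", "row-6", "row-7", "row-8"}
decision = closure_decision(rows, lambda row: row in stale_on_disk)
# decision == ("reject_hat", ["row-4", "row-5", "row-6", "row-7", "row-8"])
```

The input is the finding's enumerated set, never the fixer's change list, which is what keeps a partial fix from closing the finding.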
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing"; that is the human's decision, not yours. Either close when resolved, leave open when not, or reject when genuinely invalid.
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched. Spot-check the untouched items on disk first; survivors mean reject_hat.