Mitigate
Ask gate · Apply immediate fixes to stop the bleeding: rollbacks, feature flags, scaling
Stop user-facing impact as fast as safely possible. Mitigation is not the permanent fix — it returns the system to acceptable behavior while resolve builds the proper fix on a calmer timeline. It runs in parallel with investigate: a known-safe mitigation doesn't wait for a confirmed root cause, as long as the action names the hypothesis it's acting on.
Scope
Restoring acceptable behavior with reversible actions: rollbacks, feature-flag flips, scaling, load shedding, traffic draining. Mitigate decides how to stop the bleeding now — not why it's bleeding (investigate) or how to fix it for good (resolve). Its moves are temporary by design and meant to be undone once resolve lands.
What to do
- Prefer reversible, known-safe actions; name the hypothesis each mitigation targets and the signal that will confirm it worked.
- Record every action — the exact change, the timestamp, and the rollback procedure — as you go.
- Watch the verification signal; a non-recovering signal means the hypothesis was wrong, not that you should escalate the same move.
- Require an explicit acknowledgment that user-facing impact has actually stopped before calling the incident mitigated.
What NOT to do
- Don't build the permanent fix or ship the regression test — that's resolve; mitigation is the holding action.
- Don't redo the diagnosis; consume investigate's working hypothesis.
- Don't apply an irreversible change as a mitigation when a reversible one exists.
- Don't leave a mitigation in place without recording which hypothesis it's holding back, or resolve can't clean it up safely.
How the engine runs this stage
1 · Elaborate
collaborative · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact
Mitigation Log
Record of all mitigation actions taken, their effects, and how to reverse them. This output feeds the resolve stage so the permanent fix can be built with full context of the temporary measures in place.
Content Guide
Document every action taken to stop the bleeding:
- Actions taken — exact commands, config changes, rollback versions, or feature flags toggled, with timestamps
- Rationale — why this mitigation was chosen over alternatives
- Verification results — before/after metrics showing the mitigation's effect
- Rollback plan — how to reverse each mitigation action if it causes its own problems
- Known side effects — any degraded functionality, reduced capacity, or disabled features resulting from the mitigation
- Remaining risk — what the mitigation does NOT address and what could still go wrong
- Cleanup required — what temporary measures need to be removed once a permanent fix is in place
Quality Signals
- Every action has a timestamp, an actor, and a reversal procedure
- Verification uses the same metrics that detected the incident
- Side effects are explicitly documented, not discovered later
- The mitigation is clearly labeled as temporary, with cleanup expectations
Phase guidance
phase override · ELABORATION
Mitigate Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Mitigation action is documented with exact commands or config changes applied"
- "Verification confirms user-facing impact has stopped, measured by the same metrics that triggered the incident"
- "Rollback plan exists in case the mitigation itself causes regression"
Bad criteria — vague (no clear check)
- "Issue is mitigated"
- "Fix is applied"
- "Things are back to normal"
Outputs produced
output template
Mitigation Log
Record of immediate actions taken to stop user-facing impact.
Expected Artifacts
- Actions taken -- exact commands or config changes applied with timestamps
- Verification -- confirmation that user-facing impact has stopped using the same detection signals
- Rollback plan -- documented procedure in case the mitigation itself causes regression
- Known side effects -- any side effects of the mitigation called out
Quality Signals
- Mitigation actions are documented with exact details and timestamps
- Impact cessation is verified using the same metrics that triggered the incident
- A rollback plan exists for the mitigation itself
- Side effects are explicitly documented
2 · Review
pre-execute · agents audit the planned spec before any code lands
review agent · Safety
Mandate: The agent MUST verify that the mitigation actually stopped user-facing impact, was reversible by design, and did not introduce new risks that the response is unaware of.
Check
The agent MUST verify, filing feedback for any violation:
- Impact addressed, not deflected — The mitigation acts on the user-facing symptom, not on a downstream effect of it. Suppressing the alert is not mitigating the incident; clearing the queue is not fixing the producer.
- Reversibility documented — Every applied action has a stated rollback procedure in the log. A non-reversible action (a destructive data operation, an irreversible config rewrite) used as a mitigation is the highest-priority finding.
- Verified in production — The mitigation is verified to have stopped impact by measuring the same signals that detected the incident, not just verified to have deployed cleanly. "Deployed successfully" is not "mitigated."
- No new data loss or corruption — The mitigation did not cause data loss, data corruption, or state inconsistency. Where the mitigation touched data paths, the log states whether data was inspected post-mitigation and what it showed.
- Single-variable change discipline — Mitigations were applied one at a time, with stability windows between them. Concurrent mitigations are flagged because attribution becomes impossible.
- Hypothesis tied to action — Each mitigation cites the root-cause hypothesis it was acting on. A mitigation without a stated hypothesis is a coin flip and worth a finding.
- Communication trail — The log shows pre-apply announcement and post-apply confirmation timestamps so the timeline is intact for the postmortem.
Common failure modes to look for
- A mitigation applied without a rollback procedure recorded
- "Restarted the service" with no investigation into why it was stuck — restarts can mask conditions that recur
- A permanent code fix shipped as the mitigation because "it was a one-line change" — this is resolve-stage work, not mitigate-stage
- Two mitigations applied in quick succession; the second was applied before the first had time to show effect
- Verification using a different signal than detection (alert was on error rate; verification was on CPU)
- Partial mitigation accepted as full recovery — residual impact was real but not surfaced
- A mitigation that affected surfaces outside the incident's blast radius without documenting that those surfaces were checked
- "Hotfix deployed" with no rollback path because the fix is not reversible — a non-reversible mitigation defeats the purpose of mitigation
3 · Execute
per-unit baton · Mitigation Planner → Mitigator → Verifier
hat 1 · Mitigation Planner
Focus: Plan the mitigation action before any change lands. Decide which reversible action addresses the working hypothesis from investigate/root-cause, what signal will confirm the mitigation worked, and what the rollback procedure for the mitigation itself looks like. The mitigator then executes against your plan.
Time matters here: a slow plan is a worse outcome than a fast, 80%-good plan. Bias toward action, but the action MUST be reversible and the verification signal MUST be named.
Process
- Read the working hypothesis: investigate/root-cause plus the incident's timeline and signals. Identify the smallest reversible action that addresses the hypothesis.
- Pick from the known-safe playbook first: flag flip, deploy rollback, traffic drain, scale-up, rate limit. If the project has named mitigation runbooks, use those.
- Name the verification signal — the same metric / dashboard / log query that detected the incident is the canonical confirmation. State which one and what value it should drop to.
- Name the rollback procedure for the mitigation itself — every mitigation MUST have a path back if it makes things worse.
- Write the unit body with the action, the hypothesis it acts on, the verification signal, and the rollback procedure. Call haiku_unit_advance_hat.
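The planner's process can be sketched as a small pre-handoff check. This is an illustrative sketch only: `MitigationPlan`, its field names, and `plan_gaps` are hypothetical and not part of the engine; they simply mirror the four things the process asks the planner to name, plus the reversibility requirement.

```python
from dataclasses import dataclass


@dataclass
class MitigationPlan:
    # All field names are hypothetical; they mirror the planner's process steps.
    action: str               # smallest reversible action chosen
    hypothesis: str           # working hypothesis from investigate
    verification_signal: str  # the signal that detected the incident
    target_value: str         # value the signal should drop to
    rollback: str             # how to undo the mitigation itself
    reversible: bool


def plan_gaps(plan: MitigationPlan) -> list[str]:
    """Return the reasons a plan is not ready to hand to the mitigator."""
    gaps = []
    if not plan.reversible:
        gaps.append("action is not reversible")
    for name in ("action", "hypothesis", "verification_signal",
                 "target_value", "rollback"):
        if not getattr(plan, name).strip():
            gaps.append(f"{name} is empty")
    return gaps
```

An empty result means the plan names everything the process requires; it says nothing about whether the hypothesis is right.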
Anti-patterns (RFC 2119)
- The agent MUST NOT plan a non-reversible mitigation (destructive migration, data deletion, irreversible deploy).
- The agent MUST NOT execute the mitigation itself — that's the mitigator's role.
- The agent MUST NOT wait for a confirmed root cause if a known-safe mitigation targets the working hypothesis.
- The agent MUST name the verification signal explicitly; "we'll see if it works" is the failure mode that turns mitigations into prolonged incidents.
hat 2 · Mitigator
Focus: Apply the fastest safe action that stops user-facing impact. Speed matters because impact compounds — every minute a SEV-1 runs adds users affected, revenue exposure, and regulatory-clock pressure. Safety matters because a wrong mitigation can convert a contained outage into an uncontained one. Every mitigation must be reversible, must address the hypothesized cause (not a guess), and must be observable — you need a signal that confirms it worked.
Process
1. Pick the safest reversible action
Common mitigation moves, ordered roughly by reversibility:
- Roll back the most recent deploy — fast, well-understood, reverses the most common cause of new incidents
- Flip a feature flag off — fast, reverses anything gated behind the flag
- Scale a resource up — addresses saturation, easy to revert
- Drain traffic from a failing region / shard — isolates impact, redirectable
- Apply a known-good config rollback — depends on having a previous good config recorded
- Restart a stuck service — last-resort, reversible by definition but can mask the cause and lose ephemeral state
A hotfix is a permanent fix in mitigation clothing — avoid it. The resolve stage builds the permanent fix; the mitigate stage stops the bleeding with reversible moves.
2. Name the hypothesis the mitigation acts on
State which root-cause hypothesis the chosen mitigation is acting on, taken from the investigate stage's working hypothesis. Example: "Acting on the connection-pool-saturation hypothesis; rolling back deploy X-123 which doubled the pool-consuming worker count." A mitigation that doesn't tie to a hypothesis is a coin flip.
If multiple competing mitigations could address the same hypothesis, pick the most-reversible one first. If the hypothesis is wrong, you'll learn from the signal not recovering and you can step back without compounding the problem.
3. Document the exact change before applying
Before executing, write down in MITIGATION-LOG.md:
- The exact action: command, config-change snippet, flag name and target value, scale target
- The expected effect: which signal should change, by how much, on what timeline
- The rollback procedure for the mitigation itself: how to undo it if it makes things worse
- The blast radius of the mitigation: what else could be affected by this action
This is the single most important habit during a high-pressure response. The documented change is what the verifier checks against and what the postmortem references; the rollback line is what saves the incident if the mitigation backfires.
4. Apply one change at a time
Apply the documented action, then stop. Wait for the signal to stabilize before applying another change. If two mitigations are applied simultaneously and recovery follows, attribution is impossible: both get the credit, and the system keeps at least one change it may not have needed.
If the first mitigation doesn't recover the signal within the expected timeline, hand the case back to the IC and the investigator before stacking a second mitigation. A non-recovering signal usually means the hypothesis was wrong, not that more mitigations are needed.
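One way to audit the one-change-at-a-time rule after the fact is to compare apply timestamps against a stability window. The sketch below is an assumption-laden illustration: `overlapping_pairs` and the `(name, apply_time)` tuple shape are hypothetical, and the window length is something the response team would choose, not a value given here.

```python
from datetime import datetime, timedelta


def overlapping_pairs(applied, window):
    """Flag consecutive mitigations applied too close together.

    applied: list of (name, apply_time) tuples (hypothetical shape).
    window:  timedelta the first change needs before a second may land.
    Only adjacent changes in time order are compared.
    """
    ordered = sorted(applied, key=lambda item: item[1])
    pairs = []
    for (prev_name, prev_t), (next_name, next_t) in zip(ordered, ordered[1:]):
        if next_t - prev_t < window:
            pairs.append((prev_name, next_name))
    return pairs
```

Any pair returned is a case where recovery cannot be attributed to a single change.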
5. Communicate every action
The IC and comms lead need to know every mitigation as it's applied. Internal stakeholders need to know what's being done so they don't deploy a conflicting change. Customer comms downstream depends on knowing what's been tried. State the action in the incident channel before applying it (so others can object) and after applying it (so the timeline records it).
Format guidance
Each mitigation unit's entry in MITIGATION-LOG.md should include:
- Hypothesis acted on: the working root-cause hypothesis from investigate
- Action chosen: from the reversibility-ordered list above, with rationale for the choice
- Exact change: command, config diff, flag-and-value, scale target
- Pre-apply timestamp: when you announced the action
- Apply timestamp: when the change took effect
- Expected signal: what should change, by how much, on what timeline
- Rollback procedure: exact steps to undo the mitigation
- Mitigation blast radius: what else could be affected by this action
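The entry fields above could be checked mechanically before a unit moves on. This is a minimal sketch, assuming entries are kept as dictionaries; the key names and `entry_problems` are hypothetical and do not describe the actual layout of MITIGATION-LOG.md.

```python
from datetime import datetime

# Hypothetical keys, one per field in the format guidance.
REQUIRED_FIELDS = (
    "hypothesis", "action", "exact_change", "pre_apply_ts",
    "apply_ts", "expected_signal", "rollback", "blast_radius",
)


def entry_problems(entry: dict) -> list[str]:
    """List missing fields and flag an announcement that came after the apply."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if not entry.get(name)]
    announced = entry.get("pre_apply_ts")
    applied = entry.get("apply_ts")
    if (isinstance(announced, datetime) and isinstance(applied, datetime)
            and announced > applied):
        problems.append("announced after apply")
    return problems
```

The timestamp comparison reflects the rule that the action is stated in the incident channel before it is applied.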
Anti-patterns (RFC 2119)
- The agent MUST NOT apply a mitigation without a rollback procedure for the mitigation itself
- The agent MUST NOT ship a permanent code fix as a mitigation when a faster reversible mitigation exists — the resolve stage builds permanent fixes
- The agent MUST document the exact command or config change applied before applying it
- The agent MUST NOT apply multiple mitigations simultaneously — single-variable changes are the only attributable changes
- The agent MUST NOT stack a second mitigation when the first didn't recover the signal within the expected window — escalate back to the IC and investigator instead
- The agent MUST name which root-cause hypothesis the mitigation is acting on; a mitigation without a hypothesis is a guess
- The agent MUST NOT skip the communication step — every action applied without announcement creates timeline gaps and risks a conflicting change from another responder
- The agent MUST wait for the signal to stabilize before declaring the mitigation effective — recovery measured at a single data point is not recovery
- The agent MUST NOT select a non-reversible action when a reversible one is available; reversibility is the safety budget for a wrong hypothesis
hat 3 · Verifier
Focus: Confirm that the mitigation actually stopped user-facing impact. The mitigator applied a change and predicted what the recovery signal should look like — your job is to measure the signal, wait long enough for stability, check for side effects introduced by the mitigation itself, and either advance the unit (mitigation confirmed) or reject it back to the mitigator (signal didn't recover, recovery was partial, or the mitigation introduced new problems).
You are the verify role for the mitigate stage. Your mandate is body-only: you read the MITIGATION-LOG.md entry, you read the verification signal, and you decide based on the substance of what's recorded.
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
Process
1. Use the same signals that detected the incident
If the incident was detected by error rate, verify recovery with error rate. If it was detected by a user-impact metric (failed checkouts, login failures), verify with that same user-impact metric. Switching signals for the verify step is how false recovery gets declared: a system that recovers on one dimension can still be broken on the dimension that mattered originally.
Cross-check with at least one secondary signal so a stuck dashboard doesn't fool the verification. If the error rate dropped but the user-impact metric didn't move, the mitigation didn't work; the error class just moved.
2. Wait for stability
A signal that crosses the recovery threshold for one data point has not stabilized. The minimum wait depends on signal granularity (a 1-minute-resolution metric needs several intervals; a 5-minute-resolution metric needs more wall-clock time). State the wait period explicitly in the verification entry. "Recovery confirmed at first dip below threshold" is a reject — that's a single point, not a recovery.
For SEV-1 incidents, the wait period should also cover one normal traffic cycle (e.g., spanning a known traffic peak or trough) so that a recovery driven by reduced load doesn't get mistaken for a recovery driven by the mitigation.
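The single-data-point rule can be expressed as a check over the most recent samples. This is a sketch under stated assumptions: `is_stable` is a hypothetical helper, samples are assumed to be evenly spaced readings of the detection signal in time order, and `min_intervals` is whatever wait period the verification entry states.

```python
def is_stable(samples, threshold, min_intervals):
    """True only when the latest min_intervals samples are all at or below threshold.

    samples: readings of the detection signal, oldest first (assumed evenly spaced).
    """
    if min_intervals < 1 or len(samples) < min_intervals:
        return False
    return all(value <= threshold for value in samples[-min_intervals:])
```

A first dip below the threshold fails this check until enough further intervals have passed. It does not cover the SEV-1 traffic-cycle requirement, which depends on wall-clock coverage rather than sample count.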
3. Check for partial mitigation
Recovery is not binary. The user-impact number may drop from 12% to 2% rather than to 0%. Partial mitigation must be flagged explicitly: the incident is not resolved, and the IC needs to decide whether to apply another mitigation, escalate, or accept the residual impact while the resolve stage works on the permanent fix.
Quantify the residual: state the post-mitigation impact number, compare it to the pre-mitigation number, and state whether the residual is at an acceptable threshold.
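Quantifying the residual is simple arithmetic, sketched here with the 12% to 2% example. `residual_report` and the acceptable threshold passed to it are hypothetical; the threshold in particular is a value the IC would set, not one given in this guidance.

```python
def residual_report(pre_pct, post_pct, acceptable_pct):
    """Compare post-mitigation impact to pre-mitigation impact.

    acceptable_pct is an assumed, team-chosen threshold for residual impact.
    """
    if pre_pct <= 0:
        raise ValueError("pre-mitigation impact must be positive")
    return {
        "reduction": round((pre_pct - post_pct) / pre_pct, 3),
        "partial": post_pct > 0,
        "residual_acceptable": post_pct <= acceptable_pct,
    }
```

For a drop from 12% to 2% against a hypothetical 0.5% threshold, the reduction is 0.833 and the result is a partial mitigation whose residual is not acceptable.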
4. Check for mitigation side effects
A mitigation can fix the primary failure while breaking something else: a rollback that took an unrelated feature with it, a feature flag that gated a dependency, a scale-up that overwhelmed a downstream dependency. Walk the blast radius the mitigator named and check the health signals for each. New errors that started at the mitigation-apply timestamp are mitigation-induced and must be flagged.
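The timestamp comparison in the side-effect check can be sketched as a filter over first-seen times. `mitigation_induced` and the mapping of error class to first-seen time are hypothetical; how first-seen times are obtained depends on the project's logging and is not specified here.

```python
from datetime import datetime


def mitigation_induced(first_seen, apply_ts):
    """Return error classes first seen at or after the apply timestamp.

    first_seen: dict of error-class name -> datetime first observed (assumed input).
    The result is a list of candidates to flag, sorted by name.
    """
    return sorted(name for name, ts in first_seen.items() if ts >= apply_ts)
```

Errors that predate the apply timestamp are excluded because the mitigation cannot have caused them.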
5. Decide
- All primary signals recovered, secondary signal agrees, no side effects, stability period satisfied → call haiku_unit_advance_hat.
- Any of the above failed → call haiku_unit_reject_hat with the specific failure named (signal not recovered, partial recovery, side effect detected, stability period not met).
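The decision rule can be sketched as a function that collects named failures. This is illustrative only: `decide` is hypothetical, it returns the name of the tool to call rather than calling it (the tools' parameters are not shown here), and the label "secondary signal disagrees" is an added assumption alongside the four failure names listed above.

```python
ADVANCE = "haiku_unit_advance_hat"
REJECT = "haiku_unit_reject_hat"


def decide(*, primary_recovered, partial, secondary_agrees,
           side_effects, stability_met):
    """Return (tool_name, failures). Advance only when no check failed."""
    failures = []
    if not primary_recovered:
        failures.append("signal not recovered")
    elif partial:
        failures.append("partial recovery")
    if not secondary_agrees:
        failures.append("secondary signal disagrees")  # assumed label
    if side_effects:
        failures.append("side effect detected")
    if not stability_met:
        failures.append("stability period not met")
    return (REJECT, failures) if failures else (ADVANCE, [])
```

Returning every failed check, rather than the first one, gives the mitigator the full list to address in one pass.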
Format guidance
Each verification entry should include:
- Pre-mitigation signal values (with timestamps)
- Post-mitigation signal values (with timestamps after the stability wait)
- Secondary signal cross-check: source and value
- Side-effect check: which surfaces in the mitigation blast radius were checked, what their signals showed
- Decision: confirmed / partial / refuted, with the specific signal value that drove the decision
Anti-patterns (RFC 2119)
- The agent MUST NOT declare "fixed" based on a single data point — stability across multiple intervals is required
- The agent MUST NOT verify with different signals than the ones that detected the incident
- The agent MUST NOT wave through a partial mitigation — residual impact must be quantified and surfaced to the IC
- The agent MUST NOT skip the side-effect check — a mitigation that fixes A while breaking B is not a fix
- The agent MUST state the explicit wait period used for stability, not just "waited for signal to stabilize"
- The agent MUST cross-check the primary signal with at least one secondary signal so a stuck dashboard doesn't fool the verify
- The agent MUST NOT advance based on intent ("the mitigator clearly addressed the cause") — only on measured signal values
- The agent MUST name the specific failed criterion in any rejection so the mitigator knows what to address
4 · Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here, now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent · Safety
Mandate: The agent MUST verify that the mitigation actually stopped user-facing impact, was reversible by design, and did not introduce new risks that the response is unaware of.
Check
The agent MUST verify, filing feedback for any violation:
- Impact addressed, not deflected — The mitigation acts on the user-facing symptom, not on a downstream effect of it. Suppressing the alert is not mitigating the incident; clearing the queue is not fixing the producer.
- Reversibility documented — Every applied action has a stated rollback procedure in the log. A non-reversible action (a destructive data operation, an irreversible config rewrite) used as a mitigation is the highest-priority finding.
- Verified in production — The mitigation is verified to have stopped impact by measuring the same signals that detected the incident, not just verified to have deployed cleanly. "Deployed successfully" is not "mitigated."
- No new data loss or corruption — The mitigation did not cause data loss, data corruption, or state inconsistency. Where the mitigation touched data paths, the log states whether data was inspected post-mitigation and what it showed.
- Single-variable change discipline — Mitigations were applied one at a time, with stability windows between them. Concurrent mitigations are flagged because attribution becomes impossible.
- Hypothesis tied to action — Each mitigation cites the root-cause hypothesis it was acting on. A mitigation without a stated hypothesis is a coin flip and worth a finding.
- Communication trail — The log shows pre-apply announcement and post-apply confirmation timestamps so the timeline is intact for the postmortem.
Common failure modes to look for
- A mitigation applied without a rollback procedure recorded
- "Restarted the service" with no investigation into why it was stuck — restarts can mask conditions that recur
- A permanent code fix shipped as the mitigation because "it was a one-line change" — this is resolve-stage work, not mitigate-stage
- Two mitigations applied in quick succession; the second was applied before the first had time to show effect
- Verification using a different signal than detection (alert was on error rate; verification was on CPU)
- Partial mitigation accepted as full recovery — residual impact was real but not surfaced
- A mitigation that affected surfaces outside the incident's blast radius without documenting that those surfaces were checked
- "Hotfix deployed" with no rollback path because the fix is not reversible — a non-reversible mitigation defeats the purpose of mitigation
5 · Gate
Controls whether work advances to the next stage.
Fix loop
a separate track · Classifier → Mitigator → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain, one hat at a time, per finding, then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1 · Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation, so the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
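The default target_invalidates rule of thumb and the severity ordering can be sketched as two lookups. This is a hedged illustration, not the engine's implementation: `default_invalidates`, `dispatch_order`, and the finding dictionaries are hypothetical, and keeping input order among findings of equal severity is an assumption (the guidance only says higher severity goes first).

```python
# Origins whose default is to clear the human's approval, or nothing at all.
DEFAULT_INVALIDATES = {
    "user-chat": ["user"],
    "user-visual": ["user"],
    "user-question": ["user"],
    "drift": ["user"],
    "agent": [],
}
REVIEW_ORIGINS = {"adversarial-review", "studio-review"}
SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_invalidates(origin, filer_agent=None):
    """Map a feedback origin to the approval roles cleared on closure."""
    if origin in REVIEW_ORIGINS:
        if not filer_agent:
            raise ValueError("review origins need the filer agent's name")
        return [filer_agent]
    return list(DEFAULT_INVALIDATES[origin])  # KeyError on an unknown origin


def dispatch_order(findings):
    """Sort findings so higher severity comes first; ties keep input order."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
```

These are defaults only; the classifier still judges each finding by its actual impact.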
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject; that marks the finding invalid, and you can't reject a valid finding. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6. That's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2 · Mitigator
Focus: Apply the fastest safe action that stops user-facing impact. Speed matters because impact compounds — every minute a SEV-1 runs adds users affected, revenue exposure, and regulatory-clock pressure. Safety matters because a wrong mitigation can convert a contained outage into an uncontained one. Every mitigation must be reversible, must address the hypothesized cause (not a guess), and must be observable — you need a signal that confirms it worked.
Process
1. Pick the safest reversible action
Common mitigation moves, ordered roughly by reversibility:
- Roll back the most recent deploy — fast, well-understood, reverses the most common cause of new incidents
- Flip a feature flag off — fast, reverses anything gated behind the flag
- Scale a resource up — addresses saturation, easy to revert
- Drain traffic from a failing region / shard — isolates impact, redirectable
- Apply a known-good config rollback — depends on having a previous good config recorded
- Restart a stuck service — last-resort, reversible by definition but can mask the cause and lose ephemeral state
A hotfix is a permanent fix in mitigation clothing — avoid it. The resolve stage builds the permanent fix; the mitigate stage stops the bleeding with reversible moves.
2. Name the hypothesis the mitigation acts on
State which root-cause hypothesis the chosen mitigation is acting on, taken from the investigate stage's working hypothesis. Example: "Acting on the connection-pool-saturation hypothesis; rolling back deploy X-123 which doubled the pool-consuming worker count." A mitigation that doesn't tie to a hypothesis is a coin flip.
If multiple competing mitigations could address the same hypothesis, pick the most-reversible one first. If the hypothesis is wrong, you'll learn from the signal not recovering and you can step back without compounding the problem.
3. Document the exact change before applying
Before executing, write down in MITIGATION-LOG.md:
- The exact action: command, config-change snippet, flag name and target value, scale target
- The expected effect: which signal should change, by how much, on what timeline
- The rollback procedure for the mitigation itself: how to undo it if it makes things worse
- The blast radius of the mitigation: what else could be affected by this action
This is the single most important habit during a high-pressure response. The documented change is what the verifier checks against and what the postmortem references; the rollback line is what saves the incident if the mitigation backfires.
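One way to make the habit hard to skip is a small pre-apply gate that refuses to proceed while any of the four items is blank. This is a sketch under assumptions: the field names and the draft values are hypothetical, and a real log lives in MITIGATION-LOG.md rather than in a dict.

```python
# Hypothetical field names mirroring the four items listed in step 3.
REQUIRED_FIELDS = ("exact_action", "expected_effect", "rollback_procedure", "blast_radius")


def missing_fields(entry: dict[str, str]) -> list[str]:
    """List the required fields that are absent or blank in a draft log entry."""
    return [f for f in REQUIRED_FIELDS if not entry.get(f, "").strip()]


# Illustrative draft: the rollback line is empty and the blast radius is absent.
draft = {
    "exact_action": "roll back deploy X-123",
    "expected_effect": "<signal, size of change, timeline>",
    "rollback_procedure": "",
}
gaps = missing_fields(draft)
if gaps:
    print("do not apply yet; missing:", ", ".join(gaps))
```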
4. Apply one change at a time
Apply the documented action, then stop. Wait for the signal to stabilize before applying another change. If two mitigations are applied simultaneously and recovery follows, attribution is impossible: both get the credit, and the system may keep a change in its history that it never needed.
If the first mitigation doesn't recover the signal within the expected timeline, hand the case back to the IC and the investigator before stacking a second mitigation. A non-recovering signal usually means the hypothesis was wrong, not that more mitigations are needed.
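The wait-then-decide rule can be sketched as a three-way verdict on the post-apply signal. The function name, the "at or below threshold means healthy" convention, and the numbers in the usage lines are assumptions for illustration; the real threshold and window come from the expected-effect line written in step 3.

```python
def verdict(samples: list[float], threshold: float, stable_count: int, window: int) -> str:
    """Classify a post-apply signal.

    samples       readings taken after the change, oldest first
    threshold     value at or below which a reading counts as healthy
    stable_count  consecutive healthy readings required to call it recovered
    window        number of readings that make up the expected timeline
    """
    if stable_count < 1:
        raise ValueError("stable_count must be at least 1")
    tail = samples[-stable_count:]
    # A single healthy reading is not recovery; require a full stable run.
    if len(tail) == stable_count and all(s <= threshold for s in tail):
        return "recovered"
    # Window exhausted without recovery: the hypothesis is probably wrong.
    if len(samples) >= window:
        return "hand back to IC and investigator"
    return "keep waiting"


print(verdict([0.30, 0.04], threshold=0.05, stable_count=3, window=10))              # keep waiting
print(verdict([0.30, 0.04, 0.03, 0.02], threshold=0.05, stable_count=3, window=10))  # recovered
print(verdict([0.30] * 10, threshold=0.05, stable_count=3, window=10))               # hand back to IC and investigator
```

Note that the sketch has no branch that returns "apply another mitigation": a non-recovering signal routes back to people, not to a second change.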
5. Communicate every action
The IC and comms lead need to know every mitigation as it's applied. Internal stakeholders need to know what's being done so they don't deploy a conflicting change. Customer comms downstream depends on knowing what's been tried. State the action in the incident channel before applying it (so others can object) and after applying it (so the timeline records it).
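As a rough illustration, the announce-before and announce-after pattern can be wrapped around the change itself so neither message is forgotten. Everything here is hypothetical: `apply_announced`, the message wording, the flag name, and the list standing in for the incident channel. The sketch also does not model the pause that lets others object, which a real flow would need between the first announcement and the apply.

```python
from datetime import datetime, timezone
from typing import Callable


def apply_announced(
    action: str,
    announce: Callable[[str], None],
    apply: Callable[[], None],
) -> dict[str, str]:
    """Announce, apply, announce again; return both timestamps for the log."""
    pre = datetime.now(timezone.utc).isoformat()
    announce(f"ABOUT TO APPLY: {action}")   # others can object at this point
    apply()
    applied = datetime.now(timezone.utc).isoformat()
    announce(f"APPLIED: {action}")          # the timeline records it
    return {"pre_apply": pre, "apply": applied}


channel: list[str] = []   # stand-in for the incident channel
stamps = apply_announced("flag checkout_v2 -> off", channel.append, lambda: None)
print(channel)
```

The two returned timestamps line up with the pre-apply and apply timestamps that each log entry records.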
Format guidance
Each mitigation unit's entry in MITIGATION-LOG.md should include:
- Hypothesis acted on: the working root-cause hypothesis from investigate
- Action chosen: from the reversibility-ordered list above, with rationale for the choice
- Exact change: command, config diff, flag-and-value, scale target
- Pre-apply timestamp: when you announced the action
- Apply timestamp: when the change took effect
- Expected signal: what should change, by how much, on what timeline
- Rollback procedure: exact steps to undo the mitigation
- Mitigation blast radius: what else could be affected by this action
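For illustration, the eight fields above could be captured in a small record type that renders one log entry in the same "- label: value" list style. The class name, attribute names, and placeholder values are assumptions; the source does not prescribe a schema beyond the field list.

```python
from dataclasses import dataclass


@dataclass
class MitigationEntry:
    hypothesis: str
    action_chosen: str
    exact_change: str
    pre_apply_timestamp: str
    apply_timestamp: str
    expected_signal: str
    rollback_procedure: str
    blast_radius: str

    def render(self) -> str:
        """Render the entry as a plain-text list, one field per line."""
        rows = [
            ("Hypothesis acted on", self.hypothesis),
            ("Action chosen", self.action_chosen),
            ("Exact change", self.exact_change),
            ("Pre-apply timestamp", self.pre_apply_timestamp),
            ("Apply timestamp", self.apply_timestamp),
            ("Expected signal", self.expected_signal),
            ("Rollback procedure", self.rollback_procedure),
            ("Mitigation blast radius", self.blast_radius),
        ]
        return "\n".join(f"- {label}: {value}" for label, value in rows)


# Placeholder values in angle brackets are to be filled in by the responder.
entry = MitigationEntry(
    hypothesis="connection-pool-saturation",
    action_chosen="deploy rollback (most reversible option for this hypothesis)",
    exact_change="roll back deploy X-123",
    pre_apply_timestamp="<time announced>",
    apply_timestamp="<time change took effect>",
    expected_signal="<signal, size of change, timeline>",
    rollback_procedure="<exact steps to undo>",
    blast_radius="<what else this action could affect>",
)
print(entry.render())
```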
Anti-patterns (RFC 2119)
- The agent MUST NOT apply a mitigation without a rollback procedure for the mitigation itself
- The agent MUST NOT ship a permanent code fix as a mitigation when a faster reversible mitigation exists — the resolve stage builds permanent fixes
- The agent MUST document the exact command or config change applied before applying it
- The agent MUST NOT apply multiple mitigations simultaneously — single-variable changes are the only attributable changes
- The agent MUST NOT stack a second mitigation when the first didn't recover the signal within the expected window — escalate back to the IC and investigator instead
- The agent MUST name which root-cause hypothesis the mitigation is acting on; a mitigation without a hypothesis is a guess
- The agent MUST NOT skip the communication step — every action applied without announcement creates timeline gaps and risks a conflicting change from another responder
- The agent MUST wait for the signal to stabilize before declaring the mitigation effective — recovery measured at a single data point is not recovery
- The agent MUST NOT select a non-reversible action when a reversible one is available; reversibility is the safety budget for a wrong hypothesis
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
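The two closure rules above reduce to a single check: close only when every enumerated item is independently verified and the handoff message names no blocker. The sketch below is an assumption-laden illustration; `closure_decision`, the row identifiers, and the returned labels are hypothetical and do not call or represent the engine's actual haiku_feedback_advance_hat or reject_hat tools.

```python
def closure_decision(
    enumerated: set[str],
    verified_resolved: set[str],
    handoff_blockers: list[str],
) -> tuple[str, list[str]]:
    """Return ("advance", []) only when nothing is outstanding.

    enumerated         every item the finding lists, not just what the fix touched
    verified_resolved  items the assessor independently confirmed on disk
    handoff_blockers   unresolved blockers named in the assessor's own handoff message
    """
    survivors = sorted(enumerated - verified_resolved)
    outstanding = [f"unresolved item: {s}" for s in survivors]
    outstanding += [f"blocker in handoff: {b}" for b in handoff_blockers]
    if outstanding:
        return ("reject", outstanding)
    return ("advance", [])


# Hypothetical finding: three stale rows, the fixer touched only one.
decision, reasons = closure_decision(
    enumerated={"row-1", "row-2", "row-3"},
    verified_resolved={"row-1"},
    handoff_blockers=[],
)
print(decision, reasons)
```

Returning the list of survivors alongside the decision mirrors the requirement that a rejection name exactly what is still open.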
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat(close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding
- The agent MUST NOT reject a finding because "it's not worth fixing". That is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched. Spot-check the untouched items on disk first; survivors mean reject_hat