Incident Response · stage 3 of 5

Mitigate

Ask gate

Apply immediate fixes to stop the bleeding — rollbacks, feature flags, scaling

Stop user-facing impact as fast as safely possible. Mitigation is not the permanent fix — it returns the system to acceptable behavior while resolve builds the proper fix on a calmer timeline. It runs in parallel with investigate: a known-safe mitigation doesn't wait for a confirmed root cause, as long as the action names the hypothesis it's acting on.

Scope

Restoring acceptable behavior with reversible actions: rollbacks, feature-flag flips, scaling, load shedding, traffic draining. Mitigate decides how to stop the bleeding now — not why it's bleeding (investigate) or how to fix it for good (resolve). Its moves are temporary by design and meant to be undone once resolve lands.

What to do

Prefer reversible, known-safe actions; name the hypothesis each mitigation targets and the signal that will confirm it worked.
Record every action — the exact change, the timestamp, and the rollback procedure — as you go.
Watch the verification signal; a non-recovering signal means the hypothesis was wrong, not that you should escalate the same move.
Require an explicit acknowledgment that user-facing impact has actually stopped before calling the incident mitigated.

What NOT to do

Don't build the permanent fix or ship the regression test — that's resolve; mitigation is the holding action.
Don't redo the diagnosis; consume investigate's working hypothesis.
Don't apply an irreversible change as a mitigation when a reversible one exists.
Don't leave a mitigation in place without recording which hypothesis it's holding back, or resolve can't clean it up safely.

How the engine runs this stage

1 · Elaborate

collaborative · plan the work, fan out discovery, declare outputs

Inputs consumed

root-cause from Investigate

Discovery fan-out

knowledge artifact: Mitigation Log

Record of all mitigation actions taken, their effects, and how to reverse them. This output feeds the resolve stage so the permanent fix can be built with full context of the temporary measures in place.

Content Guide

Document every action taken to stop the bleeding:

Actions taken — exact commands, config changes, rollback versions, or feature flags toggled, with timestamps
Rationale — why this mitigation was chosen over alternatives
Verification results — before/after metrics showing the mitigation's effect
Rollback plan — how to reverse each mitigation action if it causes its own problems
Known side effects — any degraded functionality, reduced capacity, or disabled features resulting from the mitigation
Remaining risk — what the mitigation does NOT address and what could still go wrong
Cleanup required — what temporary measures need to be removed once a permanent fix is in place

Quality Signals

Every action has a timestamp, an actor, and a reversal procedure
Verification uses the same metrics that detected the incident
Side effects are explicitly documented, not discovered later
The mitigation is clearly labeled as temporary, with cleanup expectations

Phase guidance

phase override: ELABORATION

Mitigate Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

"Mitigation action is documented with exact commands or config changes applied"
"Verification confirms user-facing impact has stopped, measured by the same metrics that triggered the incident"
"Rollback plan exists in case the mitigation itself causes regression"

Bad criteria — vague (no clear check)

"Issue is mitigated"
"Fix is applied"
"Things are back to normal"

Outputs produced

output template: Mitigation Log

Record of immediate actions taken to stop user-facing impact.

Expected Artifacts

Actions taken -- exact commands or config changes applied with timestamps
Verification -- confirmation that user-facing impact has stopped using the same detection signals
Rollback plan -- documented procedure in case the mitigation itself causes regression
Known side effects -- any side effects of the mitigation called out

Quality Signals

Mitigation actions are documented with exact details and timestamps
Impact cessation is verified using the same metrics that triggered the incident
A rollback plan exists for the mitigation itself
Side effects are explicitly documented

2 · Review

pre-execute · agents audit the planned spec before any code lands

review agent: Safety

Mandate: The agent MUST verify that the mitigation actually stopped user-facing impact, was reversible by design, and did not introduce new risks that the response is unaware of.

Check

The agent MUST verify, filing feedback for any violation:

Impact addressed, not deflected — The mitigation acts on the user-facing symptom, not on a downstream effect of it. Suppressing the alert is not mitigating the incident; clearing the queue is not fixing the producer.
Reversibility documented — Every applied action has a stated rollback procedure in the log. A non-reversible action (a destructive data operation, an irreversible config rewrite) used as a mitigation is the highest-priority finding.
Verified in production — The mitigation is verified to have stopped impact by measuring the same signals that detected the incident, not just verified to have deployed cleanly. "Deployed successfully" is not "mitigated."
No new data loss or corruption — The mitigation did not cause data loss, data corruption, or state inconsistency. Where the mitigation touched data paths, the log states whether data was inspected post-mitigation and what it showed.
Single-variable change discipline — Mitigations were applied one at a time, with stability windows between them. Concurrent mitigations are flagged because attribution becomes impossible.
Hypothesis tied to action — Each mitigation cites the root-cause hypothesis it was acting on. A mitigation without a stated hypothesis is a coin flip and worth a finding.
Communication trail — The log shows pre-apply announcement and post-apply confirmation timestamps so the timeline is intact for the postmortem.

Common failure modes to look for

A mitigation applied without a rollback procedure recorded
"Restarted the service" with no investigation into why it was stuck — restarts can mask conditions that recur
A permanent code fix shipped as the mitigation because "it was a one-line change" — this is resolve-stage work, not mitigate-stage
Two mitigations applied in quick succession; the second was applied before the first had time to show effect
Verification using a different signal than detection (alert was on error rate; verification was on CPU)
Partial mitigation accepted as full recovery — residual impact was real but not surfaced
A mitigation that affected surfaces outside the incident's blast radius without documenting that those surfaces were checked
"Hotfix deployed" with no rollback path because the fix is not reversible — a non-reversible mitigation defeats the purpose of mitigation

3 · Execute

per-unit baton · Mitigation Planner → Mitigator → Verifier

hat 1: Mitigation Planner

Focus: Plan the mitigation action before any change lands. Decide which reversible action addresses the working hypothesis from investigate/root-cause, what signal will confirm the mitigation worked, and what the rollback procedure for the mitigation itself looks like. The mitigator then executes against your plan.

Time matters here: a slow plan is a worse outcome than a fast, 80%-good plan. Bias toward action, but the action MUST be reversible and the verification signal MUST be named.

Process

Read the working hypothesis: investigate/root-cause plus the incident's timeline and signals. Identify the smallest reversible action that addresses the hypothesis.
Pick from the known-safe playbook first — flag flip, deploy rollback, traffic drain, scale-up, rate limit. If the project has named mitigation runbooks, use those.
Name the verification signal — the same metric / dashboard / log query that detected the incident is the canonical confirmation. State which one and what value it should drop to.
Name the rollback procedure for the mitigation itself — every mitigation MUST have a path back if it makes things worse.
Write the unit body with the action, the hypothesis it acts on, the verification signal, the rollback procedure. Call haiku_unit_advance_hat.

Anti-patterns (RFC 2119)

The agent MUST NOT plan a non-reversible mitigation (destructive migration, data deletion, irreversible deploy).
The agent MUST NOT execute the mitigation itself — that's the mitigator's role.
The agent MUST NOT wait for a confirmed root cause if a known-safe mitigation targets the working hypothesis.
The agent MUST name the verification signal explicitly; "we'll see if it works" is the failure mode that turns mitigations into prolonged incidents.

hat 2: Mitigator

Focus: Apply the fastest safe action that stops user-facing impact. Speed matters because impact compounds: every minute a SEV-1 runs adds users affected, revenue exposure, and regulatory-clock pressure. Safety matters because a wrong mitigation can convert a contained outage into an uncontained one. Every mitigation must be reversible, must address the hypothesized cause (not a guess), and must be observable: you need a signal that confirms it worked.

Process

1. Pick the safest reversible action

Common mitigation moves, ordered roughly by reversibility:

Roll back the most recent deploy — fast, well-understood, reverses the most common cause of new incidents
Flip a feature flag off — fast, reverses anything gated behind the flag
Scale a resource up — addresses saturation, easy to revert
Drain traffic from a failing region / shard — isolates impact, redirectable
Apply a known-good config rollback — depends on having a previous good config recorded
Restart a stuck service — last-resort, reversible by definition but can mask the cause and lose ephemeral state

A hotfix is a permanent fix in mitigation clothing — avoid it. The resolve stage builds the permanent fix; the mitigate stage stops the bleeding with reversible moves.

2. Name the hypothesis the mitigation acts on

State which root-cause hypothesis the chosen mitigation is acting on, taken from the investigate stage's working hypothesis. Example: "Acting on the connection-pool-saturation hypothesis; rolling back deploy X-123 which doubled the pool-consuming worker count." A mitigation that doesn't tie to a hypothesis is a coin flip.

If multiple competing mitigations could address the same hypothesis, pick the most-reversible one first. If the hypothesis is wrong, you'll learn from the signal not recovering and you can step back without compounding the problem.

3. Document the exact change before applying

Before executing, write down in MITIGATION-LOG.md:

The exact action: command, config-change snippet, flag name and target value, scale target
The expected effect: which signal should change, by how much, on what timeline
The rollback procedure for the mitigation itself: how to undo it if it makes things worse
The blast radius of the mitigation: what else could be affected by this action

This is the single most important habit during a high-pressure response. The documented change is what the verifier checks against and what the postmortem references; the rollback line is what saves the incident if the mitigation backfires.

4. Apply one change at a time

Apply the documented action, then stop. Wait for the signal to stabilize before applying another change. If two mitigations are applied simultaneously and recovery follows, attribution is impossible — both look credited, and the system gets two unnecessary changes in its history.

If the first mitigation doesn't recover the signal within the expected timeline, hand the case back to the IC and the investigator before stacking a second mitigation. A non-recovering signal usually means the hypothesis was wrong, not that more mitigations are needed.
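The one-change-at-a-time rule above can be sketched as a small gate. This is a minimal illustration, not part of the engine: the function name `next_step` and its three return values are hypothetical, and the expected window is whatever the mitigator recorded in the log before applying the change.

```python
from datetime import datetime, timedelta


def next_step(applied_at: datetime, now: datetime,
              expected_window: timedelta, signal_recovered: bool) -> str:
    """Decide what happens after a single mitigation has been applied.

    Hypothetical helper. Returns one of:
      "hand_to_verifier" - the signal moved as predicted; stability is the verifier's call
      "wait"             - still inside the expected recovery window
      "escalate"         - window elapsed with no recovery; back to the IC and investigator
    There is deliberately no "apply another mitigation" outcome.
    """
    if signal_recovered:
        return "hand_to_verifier"
    if now - applied_at < expected_window:
        return "wait"
    return "escalate"
```

The design point is the missing branch: a non-recovering signal routes back to people who can revise the hypothesis rather than to a second change.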

5. Communicate every action

The IC and comms lead need to know every mitigation as it's applied. Internal stakeholders need to know what's being done so they don't deploy a conflicting change. Customer comms downstream depends on knowing what's been tried. State the action in the incident channel before applying it (so others can object) and after applying it (so the timeline records it).

Format guidance

Each mitigation unit's entry in MITIGATION-LOG.md should include:

Hypothesis acted on: the working root-cause hypothesis from investigate
Action chosen: from the reversibility-ordered list above, with rationale for the choice
Exact change: command, config diff, flag-and-value, scale target
Pre-apply timestamp: when you announced the action
Apply timestamp: when the change took effect
Expected signal: what should change, by how much, on what timeline
Rollback procedure: exact steps to undo the mitigation
Mitigation blast radius: what else could be affected by this action
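One way to keep entries complete under pressure is to treat the eight fields above as a record and check for blanks before applying the change. The sketch below assumes Python 3.9+; the class name, field names, and helper functions are illustrative only and are not an engine API or a required file format.

```python
from dataclasses import dataclass, fields


@dataclass
class MitigationEntry:
    """One MITIGATION-LOG.md entry. Field names are hypothetical."""
    hypothesis: str
    action_chosen: str
    exact_change: str
    pre_apply_timestamp: str
    apply_timestamp: str
    expected_signal: str
    rollback_procedure: str
    blast_radius: str


def missing_fields(entry: MitigationEntry) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


def render(entry: MitigationEntry) -> str:
    """Render the entry as plain 'Label: value' lines, one per field."""
    lines = []
    for f in fields(entry):
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {getattr(entry, f.name)}")
    return "\n".join(lines)
```

Before the change is applied, `apply_timestamp` is expected to be the only missing field; anything else reported by `missing_fields` is a gap to close first.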

Anti-patterns (RFC 2119)

The agent MUST NOT apply a mitigation without a rollback procedure for the mitigation itself
The agent MUST NOT ship a permanent code fix as a mitigation when a faster reversible mitigation exists — the resolve stage builds permanent fixes
The agent MUST document the exact command or config change applied before applying it
The agent MUST NOT apply multiple mitigations simultaneously — single-variable changes are the only attributable changes
The agent MUST NOT stack a second mitigation when the first didn't recover the signal within the expected window — escalate back to the IC and investigator instead
The agent MUST name which root-cause hypothesis the mitigation is acting on; a mitigation without a hypothesis is a guess
The agent MUST NOT skip the communication step — every action applied without announcement creates timeline gaps and risks a conflicting change from another responder
The agent MUST wait for the signal to stabilize before declaring the mitigation effective — recovery measured at a single data point is not recovery
The agent MUST NOT select a non-reversible action when a reversible one is available; reversibility is the safety budget for a wrong hypothesis

hat 3: Verifier

Focus: Confirm that the mitigation actually stopped user-facing impact. The mitigator applied a change and predicted what the recovery signal should look like; your job is to measure the signal, wait long enough for stability, check for side effects introduced by the mitigation itself, and either advance the unit (mitigation confirmed) or reject it back to the mitigator (signal didn't recover, recovery was partial, or the mitigation introduced new problems).

You are the verify role for the mitigate stage. Your mandate is body-only: you read the MITIGATION-LOG.md entry, you read the verification signal, and you decide based on the substance of what's recorded.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

Process

1. Use the same signals that detected the incident

If the incident was detected by error rate, verify recovery with error rate. If it was detected by a user-impact metric (failed checkouts, login failures), verify with that same metric. Switching signals for the verify step is how false recovery gets declared: a system that recovers on one dimension can still be broken on the dimension that mattered originally.

Cross-check with at least one secondary signal so a stuck dashboard doesn't fool the verification. If error rate dropped but the user-impact metric didn't move, the mitigation didn't work; the error class just moved.

2. Wait for stability

A signal that crosses the recovery threshold for one data point has not stabilized. The minimum wait depends on signal granularity (a 1-minute-resolution metric needs several intervals; a 5-minute-resolution metric needs more wall-clock time). State the wait period explicitly in the verification entry. "Recovery confirmed at first dip below threshold" is a reject — that's a single point, not a recovery.

For SEV-1 incidents, the wait period should also cover one normal traffic cycle (e.g., spanning a known traffic peak or trough) so that a recovery driven by reduced load doesn't get mistaken for a recovery driven by the mitigation.
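The single-point rule can be made mechanical. The sketch below assumes a lower-is-better signal (such as error rate) sampled at a fixed interval; the function name and the choice of `min_intervals` are hypothetical, and the right number of intervals depends on the signal's resolution as described above.

```python
def is_stable(samples: list[float], threshold: float, min_intervals: int) -> bool:
    """True only if the most recent `min_intervals` samples are all at or below threshold.

    Assumes `samples` is ordered oldest to newest and that lower values are healthier.
    """
    if min_intervals < 2:
        # A stability window of one sample is exactly the case the text rejects.
        raise ValueError("a single data point is not a recovery")
    if len(samples) < min_intervals:
        return False
    return all(value <= threshold for value in samples[-min_intervals:])
```

With a threshold of 1.0 and three required intervals, `[12.0, 9.0, 0.8]` is not stable (one point below threshold), while `[12.0, 0.9, 0.7, 0.8]` is.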

3. Check for partial mitigation

Recovery is not binary. The user-impact number may drop from 12% to 2% rather than to 0%. Partial mitigation must be flagged explicitly: the incident is not resolved, and the IC needs to decide whether to apply another mitigation, escalate, or accept the residual impact while the resolve stage works on the permanent fix.

Quantify the residual: state the post-mitigation impact number, compare it to the pre-mitigation number, and state whether the residual is at an acceptable threshold.
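As a worked version of the 12% to 2% case, the residual can be reported with a few lines of arithmetic. The function name, the dictionary keys, and the acceptable threshold are all assumptions for illustration; the sketch only reports numbers and leaves the accept-or-escalate decision to the IC.

```python
def residual_report(pre: float, post: float, acceptable: float) -> dict:
    """Quantify residual impact. All three values share one unit (e.g. % of requests failing)."""
    if pre <= 0:
        raise ValueError("pre-mitigation impact must be positive")
    return {
        "residual": post,
        "relative_reduction": (pre - post) / pre,  # fraction of the impact removed
        "partial": 0 < post < pre,                 # improved, but impact remains
        "within_acceptable": post <= acceptable,   # threshold is supplied by the caller
    }
```

For `residual_report(12.0, 2.0, 1.0)` the relative reduction is 10/12 (about 83%), `partial` is true, and the 2% residual is above the hypothetical 1% acceptable threshold, so the entry should surface it rather than report recovery.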

4. Check for mitigation side effects

A mitigation can fix the primary failure while breaking something else: a rollback that took an unrelated feature with it, a feature flag that gated a dependency, a scale-up that overwhelmed a downstream service. Walk the blast radius the mitigator named and check the health signals for each surface in it. New errors that started at the mitigation-apply timestamp are mitigation-induced and must be flagged.
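The timestamp comparison in the side-effect check can be sketched as follows. It assumes you can obtain, for each error class seen in the blast radius, the time it was first observed; the function name and the input shape are hypothetical, and "started at" is read here as "at or after the apply timestamp".

```python
from datetime import datetime


def mitigation_induced(first_seen: dict[str, datetime], applied_at: datetime) -> list[str]:
    """Return error classes first observed at or after the apply timestamp, sorted by name.

    `first_seen` maps an error-class name to its first-observed time (assumed input).
    """
    return sorted(name for name, ts in first_seen.items() if ts >= applied_at)
```

Errors that predate the apply timestamp belong to the incident itself; anything this returns is a candidate side effect to name in the verification entry.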

5. Decide

All primary signals recovered, secondary signal agrees, no side effects, stability period satisfied → call haiku_unit_advance_hat.
Any of the above failed → call haiku_unit_reject_hat with the specific failure named (signal not recovered, partial recovery, side effect detected, stability period not met).
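The decision rule above reduces to collecting named failures and advancing only when the list is empty. This sketch returns the name of the tool to call rather than calling it; the class, its fields, and the "secondary signal disagrees" label are assumptions added for illustration, while the other failure labels are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class VerificationChecks:
    """Results of the four verification steps. Field names are hypothetical."""
    primary_recovered: bool
    partial: bool
    secondary_agrees: bool
    side_effect_detected: bool
    stability_met: bool


def decide(c: VerificationChecks) -> tuple[str, list[str]]:
    """Return (tool name, named failures). An empty failure list means advance."""
    failures = []
    if not c.primary_recovered:
        failures.append("signal not recovered")
    if c.partial:
        failures.append("partial recovery")
    if not c.secondary_agrees:
        failures.append("secondary signal disagrees")
    if c.side_effect_detected:
        failures.append("side effect detected")
    if not c.stability_met:
        failures.append("stability period not met")
    tool = "haiku_unit_reject_hat" if failures else "haiku_unit_advance_hat"
    return tool, failures
```

Returning every failed check, not just the first, gives the mitigator the full list to address in one pass.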

Format guidance

Each verification entry should include:

Pre-mitigation signal values (with timestamps)
Post-mitigation signal values (with timestamps after the stability wait)
Secondary signal cross-check: source and value
Side-effect check: which surfaces in the mitigation blast radius were checked, what their signals showed
Decision: confirmed / partial / refuted, with the specific signal value that drove the decision

Anti-patterns (RFC 2119)

The agent MUST NOT declare "fixed" based on a single data point — stability across multiple intervals is required
The agent MUST NOT verify with different signals than the ones that detected the incident
The agent MUST NOT wave through a partial mitigation — residual impact must be quantified and surfaced to the IC
The agent MUST NOT skip the side-effect check — a mitigation that fixes A while breaking B is not a fix
The agent MUST state the explicit wait period used for stability, not just "waited for signal to stabilize"
The agent MUST cross-check the primary signal with at least one secondary signal so a stuck dashboard doesn't fool the verification
The agent MUST NOT advance based on intent ("the mitigator clearly addressed the cause") — only on measured signal values
The agent MUST name the specific failed criterion in any rejection so the mitigator knows what to address

4 · Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent: Safety

Mandate: The agent MUST verify that the mitigation actually stopped user-facing impact, was reversible by design, and did not introduce new risks that the response is unaware of.

Check

The agent MUST verify, filing feedback for any violation:

Impact addressed, not deflected — The mitigation acts on the user-facing symptom, not on a downstream effect of it. Suppressing the alert is not mitigating the incident; clearing the queue is not fixing the producer.
Reversibility documented — Every applied action has a stated rollback procedure in the log. A non-reversible action (a destructive data operation, an irreversible config rewrite) used as a mitigation is the highest-priority finding.
Verified in production — The mitigation is verified to have stopped impact by measuring the same signals that detected the incident, not just verified to have deployed cleanly. "Deployed successfully" is not "mitigated."
No new data loss or corruption — The mitigation did not cause data loss, data corruption, or state inconsistency. Where the mitigation touched data paths, the log states whether data was inspected post-mitigation and what it showed.
Single-variable change discipline — Mitigations were applied one at a time, with stability windows between them. Concurrent mitigations are flagged because attribution becomes impossible.
Hypothesis tied to action — Each mitigation cites the root-cause hypothesis it was acting on. A mitigation without a stated hypothesis is a coin flip and worth a finding.
Communication trail — The log shows pre-apply announcement and post-apply confirmation timestamps so the timeline is intact for the postmortem.

Common failure modes to look for

A mitigation applied without a rollback procedure recorded
"Restarted the service" with no investigation into why it was stuck — restarts can mask conditions that recur
A permanent code fix shipped as the mitigation because "it was a one-line change" — this is resolve-stage work, not mitigate-stage
Two mitigations applied in quick succession; the second was applied before the first had time to show effect
Verification using a different signal than detection (alert was on error rate; verification was on CPU)
Partial mitigation accepted as full recovery — residual impact was real but not surfaced
A mitigation that affected surfaces outside the incident's blast radius without documenting that those surfaces were checked
"Hotfix deployed" with no rollback path because the fix is not reversible — a non-reversible mitigation defeats the purpose of mitigation

5 · Gate

controls advancement to the next stage

Ask

Controls whether work advances to the next stage.

Fix loop

a separate track · Classifier → Mitigator → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1: Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
Read the stage's unit list via haiku_unit_list { intent, stage }.
Decide:
- target_unit — which unit this FB counter-signals.
  - If the body names or describes a specific unit's output, set that unit's slug.
  - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
  - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
- target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
  - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
  - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
  - drift origin → ["user"] (drift always escalates to human).
  - agent origin → [] (informational; no rerun).
Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.
Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
- blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
- high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
- medium — a genuine issue worth fixing; not delivery-blocking.
- low — a nit, polish, or nice-to-have.
Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.
Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop has started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
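The default routing rule of thumb and the severity ordering described in the steps above can be written down compactly. This is a sketch of the rules as stated, not the engine's implementation: the function names are hypothetical, the origin strings are the ones listed above, and the classifier is still expected to use judgment where the defaults don't fit.

```python
from typing import Optional

# Highest priority first, per the rubric above.
SEVERITY_ORDER = ["blocker", "high", "medium", "low"]


def default_invalidates(origin: str, filer: Optional[str] = None) -> list[str]:
    """Default target_invalidates for a feedback origin (rule of thumb only)."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # the human re-reviews; drift always escalates to a human
    if origin in {"adversarial-review", "studio-review"}:
        if not filer:
            raise ValueError("review-origin feedback needs the filer agent name")
        return [filer]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings: list[dict]) -> list[dict]:
    """Sort findings so higher-severity ones come first (stable for ties)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
```

Because `sorted` is stable, findings of equal severity keep the order in which they were filed.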

What you do NOT do

You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
You do NOT call haiku_feedback_reject; that marks the finding invalid, and a valid finding must not be rejected. (Closing a valid finding that simply has no code fix uses the resolution: "non_actionable" shortcut described under "Non-actionable shortcut" above; that is an acknowledgement, not a rejection.)
You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2: Mitigator

Focus: Apply the fastest safe action that stops user-facing impact. Speed matters because impact compounds: every minute a SEV-1 runs adds users affected, revenue exposure, and regulatory-clock pressure. Safety matters because a wrong mitigation can convert a contained outage into an uncontained one. Every mitigation must be reversible, must address the hypothesized cause (not a guess), and must be observable: you need a signal that confirms it worked.

Process

1. Pick the safest reversible action

Common mitigation moves, ordered roughly by reversibility:

Roll back the most recent deploy — fast, well-understood, reverses the most common cause of new incidents
Flip a feature flag off — fast, reverses anything gated behind the flag
Scale a resource up — addresses saturation, easy to revert
Drain traffic from a failing region / shard — isolates impact, redirectable
Apply a known-good config rollback — depends on having a previous good config recorded
Restart a stuck service — last-resort, reversible by definition but can mask the cause and lose ephemeral state

A hotfix is a permanent fix in mitigation clothing — avoid it. The resolve stage builds the permanent fix; the mitigate stage stops the bleeding with reversible moves.

2. Name the hypothesis the mitigation acts on

3. Document the exact change before applying

Before executing, write down in MITIGATION-LOG.md:

The exact action: command, config-change snippet, flag name and target value, scale target
The expected effect: which signal should change, by how much, on what timeline
The rollback procedure for the mitigation itself: how to undo it if it makes things worse
The blast radius of the mitigation: what else could be affected by this action

4. Apply one change at a time

5. Communicate every action

Format guidance

Each mitigation unit's entry in MITIGATION-LOG.md should include:

Hypothesis acted on: the working root-cause hypothesis from investigate
Action chosen: from the reversibility-ordered list above, with rationale for the choice
Exact change: command, config diff, flag-and-value, scale target
Pre-apply timestamp: when you announced the action
Apply timestamp: when the change took effect
Expected signal: what should change, by how much, on what timeline
Rollback procedure: exact steps to undo the mitigation
Mitigation blast radius: what else could be affected by this action

Anti-patterns (RFC 2119)

The agent MUST NOT apply a mitigation without a rollback procedure for the mitigation itself
The agent MUST NOT ship a permanent code fix as a mitigation when a faster reversible mitigation exists — the resolve stage builds permanent fixes
The agent MUST document the exact command or config change before applying it
The agent MUST NOT apply multiple mitigations simultaneously — single-variable changes are the only attributable changes
The agent MUST NOT stack a second mitigation when the first didn't recover the signal within the expected window — escalate back to the IC and investigator instead
The agent MUST name which root-cause hypothesis the mitigation is acting on; a mitigation without a hypothesis is a guess
The agent MUST NOT skip the communication step — every action applied without announcement creates timeline gaps and risks a conflicting change from another responder
The agent MUST wait for the signal to stabilize before declaring the mitigation effective — recovery measured at a single data point is not recovery
The agent MUST NOT select a non-reversible action when a reversible one is available; reversibility is the safety budget for a wrong hypothesis
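The stabilization rule above can be made concrete by requiring a run of consecutive healthy samples rather than one good reading. A minimal sketch, assuming a "lower is better" signal such as an error rate; the function name, the threshold, and the default window of five samples are hypothetical choices, not values the source defines:

```python
def is_recovered(samples, threshold, required_consecutive=5):
    """True only if the most recent `required_consecutive` samples
    are all at or below `threshold`."""
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(samples) < required_consecutive:
        return False  # not enough data yet to call it stable
    return all(s <= threshold for s in samples[-required_consecutive:])


# One good data point after the mitigation is not recovery.
print(is_recovered([0.31, 0.28, 0.02], threshold=0.05))  # False

# Five consecutive samples under the threshold count as stable here.
print(is_recovered([0.31, 0.02, 0.01, 0.02, 0.01, 0.02], threshold=0.05))  # True
```

If the window elapses without a passing check, the rules above point to escalation rather than a second stacked mitigation.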

fix-hat 3 · Feedback Assessor

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.

Anti-patterns (RFC 2119):

The agent MUST NOT edit any file — you are a verifier, not a fixer
The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched; check the untouched items on disk first, and reject_hat if any are still defective
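The closure rules above reduce to one decision: advance only when every enumerated item is resolved and the handoff message names no open blocker. The sketch below models that decision only. The `closure_decision` helper and its arguments are hypothetical, and it does not call the engine's advance or reject tools, whose signatures are not shown here.

```python
def closure_decision(enumerated, still_defective, handoff_blockers):
    """Return ("advance", []) or ("reject", reasons).

    enumerated       -- every item the finding lists, not just the fixed ones
    still_defective  -- items found defective when re-checked on disk
    handoff_blockers -- unresolved issues named in the assessor's own handoff
    """
    survivors = [item for item in enumerated if item in still_defective]
    reasons = [f"still defective: {item}" for item in survivors]
    reasons += [f"unresolved blocker: {b}" for b in handoff_blockers]
    return ("reject", reasons) if reasons else ("advance", [])


# The fixer touched row-1 and row-2; the re-check finds row-3 still stale.
print(closure_decision(["row-1", "row-2", "row-3"], {"row-3"}, []))
# ('reject', ['still defective: row-3'])
```

Iterating over the full enumerated set, rather than the list of items the fixer reported, is what keeps a partial fix from closing the finding.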