Cutover
External gate · Plan and execute the production cutover with rollback procedures
Plan and execute the production cutover: the runbook the on-call team follows during the maintenance window, with a rollback procedure or an explicit forward-fix rationale for every step. This is the operational stage of the migration — the point where the validated work goes live, and the point of no return is real.
Scope
Authoring and executing the cutover runbook. Cutover decides how the production switch happens, in what order, with what go/no-go gates and rollback paths — not whether the migration is correct (validation) or how the data moves (migrate). Units are operational steps: preconditions, action, post-condition check, and a named rollback or a stated reason none exists.
What to do
- Sequence each step with preconditions, owner, expected duration, action, post-condition check, and go/no-go criteria.
- Pair every step with its rollback procedure, or state explicitly why the step is forward-fix only and mark the point of no return.
- Define a data-sync strategy for writes that arrive during the maintenance window.
- Make each post-condition produce a mechanical pass/fail signal the on-call team can act on without judgment calls.
What NOT to do
- Don't proceed on a migration the validation stage hasn't signed off, including the rollback rehearsal.
- Don't change migration code or mappings here; cutover executes, it doesn't rebuild.
- Don't write a step with no rollback and no stated forward-fix rationale.
- Don't self-advance the cutover gate — the runbook proceeds through the team's actual change-management approval.
How the engine runs this stage
1. Elaborate
collaborative · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact
Cutover Runbook
Document the production cutover plan, rollback procedure, and post-cutover verification. This is the final deliverable of the migration studio.
Content Guide
Structure the runbook around execution phases:
- Pre-cutover checklist — prerequisites verified before starting
- Cutover steps — sequenced steps with owner, expected duration, and go/no-go checkpoints
- Traffic routing plan — how traffic shifts from source to target
- Point of no return — the step after which rollback becomes significantly more expensive
- Rollback procedure — step-by-step restoration to pre-migration state with RTO verification
- Post-cutover verification — checks confirming the target is serving correctly
- Communication plan — stakeholder notifications for maintenance window, completion, and escalation
- Escalation contacts — who to reach for each category of issue
Quality Signals
- Every step has an explicit owner and go/no-go criteria
- Rollback procedure is tested end-to-end, not theoretical
- Point of no return is clearly marked and understood by all participants
- Post-cutover verification reuses validation checks from the validation stage
Phase guidance
phase override: ELABORATION
Cutover Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Cutover runbook lists every step with owner, expected duration, and go/no-go checkpoint"
- "Rollback procedure is tested end-to-end and restores the source system to pre-migration state within the defined RTO"
- "Communication plan notifies all downstream consumers with maintenance window, expected impact, and escalation contacts"
Bad criteria — vague (no clear check)
- "Cutover plan exists"
- "Rollback is possible"
- "Stakeholders are notified"
Outputs produced
output template
Cutover Runbook
Step-by-step cutover plan with rollback procedures and communication plan.
Expected Artifacts
- Runbook -- every step with owner, expected duration, and go/no-go checkpoint
- Rollback procedure -- tested end-to-end, restores source system within defined RTO
- Communication plan -- downstream consumers notified with maintenance window and escalation contacts
- Cutover verification -- post-cutover checks confirming successful migration
Quality Signals
- Every step has an owner and go/no-go checkpoint
- Rollback procedure is tested end-to-end before cutover
- Communication plan covers all downstream consumers
- Post-cutover verification confirms production is healthy
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Rollback Readiness
Mandate: The agent MUST verify the cutover runbook includes a viable rollback (or an explicit forward-fix-only rationale) at every step, that the point of no return is marked exactly once per dependency chain, that the validation stage's rollback rehearsal record is cited, and that post-cutover write handling is addressed. Untested rollback under outage pressure is how migrations turn into incidents.
Check
The agent MUST verify, filing feedback for any violation:
- Rollback entry per reversible step — every step classified as reversible (fully / with-loss / at-cost) has a matching rollback entry with the same step id, mirrored structure (preconditions, action, post-condition, duration), and a reverse procedure.
- Forward-fix rationale for irreversible steps — every step past the point of no return explicitly states "forward-fix only — see forward-fix procedure" and links the procedure. Silent absence of rollback is a hard finding.
- Point of no return marked exactly once — the cumulative cutover chain has exactly one step (per dependency path) flagged as crossing the point of no return. Multiple markers or none at all are findings.
- Validation rehearsal cited — every rollback procedure cites the validation-stage rollback rehearsal record (procedure, dataset, observed RTO). If no rehearsal record exists, the fix is to run validation, not to rehearse inside cutover — file feedback against validation, not cutover.
- Reverse-duration fits cumulative RTO — each rollback step's expected reverse duration sums into the cumulative RTO budget the intent declared. Steps that don't fit are findings.
- Post-cutover write handling — every reversible step that crosses any window where the target accepts writes addresses how those writes are handled on rollback (replicate back, drop with impact statement, escalate). Silent loss is a hard finding.
- Communication plan covers rollback — the runbook's communication plan names audiences and triggers for rollback initiation, completion, and partial-rollback states, not just success paths.
- Reversibility classification explicit — every step carries an explicit class (fully reversible / reversible with loss / reversible at material cost / forward-fix only).
Common failure modes to look for
- A rollback entry that references state the forward step destroys (no snapshot, no log, no source-as-authoritative remnant)
- Reverse duration much shorter than the forward duration without justification — usually a sign the rollback hasn't been thought through
- Point of no return implicitly assumed but not marked on a specific step
- "Rollback is tested" claim without citing the validation rehearsal record
- Post-cutover writes addressed only for the happy rollback path, not for partial-rollback states
- Communication plan that names audiences for go but not for no-go
- Rollback procedure that depends on the same person being on-call who executed the forward step
- A step classified as "fully reversible" that actually loses data written to the target during its window
Borrowed from other stages
3. Execute
per-unit baton · Cutover Coordinator → Rollback Engineer → Verifier
hat 1: Cutover Coordinator
Focus: Author the runbook entry for this cutover step — preconditions, owner, expected duration, action, post-condition check, go/no-go criteria, communication triggers. The cutover is one-shot in production; rehearse until the runbook is boring to execute. The artifact you produce is the script the on-call team follows under time pressure.
You produce one output: the unit's section of CUTOVER-RUNBOOK.md — the step's runbook entry, in the format the rest of the runbook follows.
Process
1. Read the validation report and the relevant assessment risks
Cutover is downstream of every other stage. Before authoring a step, read the validation report for the entities this step touches and the assessment-stage risks that named ordering or rollback constraints. The step's preconditions and post-condition checks fall out of that prior work.
2. Pick the cutover style this step participates in
Four common styles; the intent's mode picks one, but each step may differ in detail:
- Big-bang — entire system flips at once during a maintenance window. Steps are tightly sequenced; rollback windows are short and explicit.
- Phased — system flips piece by piece over scheduled windows. Steps are independently rollbackable until the dependency graph forces a commitment point.
- Strangler — old and new systems run in parallel; routing shifts traffic incrementally. Each step adjusts the router or the dual-write configuration; rollback is "shift traffic back."
- Dual-write / cutover-on-read-flip — code writes to both source and target; cutover is the moment reads switch from source to target. Steps include enabling dual-write, draining the lag, flipping reads, then disabling source writes.
Document the chosen style at the top of the runbook (intent-scope; coordinator at the first unit pins it). Each step's entry MUST be consistent with the style.
3. Write the step's runbook entry
Each step gets the same fields:
- Step ID — stable identifier referenced by other steps and by the rollback procedure
- Owner — named role or person responsible for executing this step
- Preconditions — what MUST be true before this step starts (named, individually checkable)
- Action — the unambiguous procedure (one sentence per action; reference the script / command / dashboard change explicitly)
- Expected duration — the rehearsed time, with the maximum tolerated time before this step is considered stuck
- Post-condition check — the mechanical verification that the action succeeded (a query to run, a metric to read, a dashboard to inspect with named expected values)
- Go / no-go criteria — what conditions advance to the next step; what conditions trigger rollback; what conditions trigger pause-and-escalate
- Communication triggers — what messages go to which audiences at this step (start, success, failure)
- Rollback reference — the matching rollback step id (the rollback-engineer's deliverable)
- Point-of-no-return marker — explicit flag if this step crosses the threshold after which rollback becomes impossible or significantly more expensive
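The field list above maps naturally onto a small record type. The following is a minimal sketch, not the engine's actual schema: the class name `RunbookStep`, its attribute names, and the `missing_fields` helper are all hypothetical, chosen only to mirror the fields listed above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunbookStep:
    """Hypothetical shape for one cutover step; names mirror the field list."""
    step_id: str
    owner: str
    preconditions: list[str]
    action: str
    expected_minutes: float          # rehearsed duration
    max_minutes: float               # beyond this the step counts as stuck
    post_condition_check: str
    go_no_go: str
    communication_triggers: list[str]
    rollback_ref: str                # matching rollback step id
    crosses_point_of_no_return: bool = False


def missing_fields(step: RunbookStep) -> list[str]:
    """Return the names of fields that are empty, blank, or non-positive."""
    missing = []
    for name, value in vars(step).items():
        if isinstance(value, bool):
            continue  # the point-of-no-return flag is always explicit
        if isinstance(value, (int, float)):
            if value <= 0:
                missing.append(name)
        elif not value or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing
```

A check like this only catches empty fields. Whether a precondition is individually checkable or an action is unambiguous still needs a reader.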
4. Establish go/no-go decision criteria
Every step ends with a go/no-go decision. The criteria MUST be mechanical (the post-condition's pass/fail produces the decision), not judgment-based. Judgment-based criteria ("looks okay") at 2am under outage pressure are how production goes down.
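One way to picture a mechanical decision is a pure function from the post-condition result to an outcome. The sketch below is illustrative only: the `Decision` enum and the `decide` function are hypothetical names, and the policy it encodes (a failed check before the point of no return triggers rollback, a failed check after it triggers pause-and-escalate) is an assumed example, not a rule the runbook format prescribes.

```python
from enum import Enum


class Decision(Enum):
    GO = "go"
    ROLLBACK = "rollback"
    PAUSE_AND_ESCALATE = "pause-and-escalate"


def decide(check_passed: bool, past_point_of_no_return: bool) -> Decision:
    """Turn the post-condition's pass/fail signal into a decision."""
    if check_passed:
        return Decision.GO
    if past_point_of_no_return:
        # Assumed policy: rollback is unavailable, so hand off to forward-fix.
        return Decision.PAUSE_AND_ESCALATE
    return Decision.ROLLBACK
```

The point of the shape is that the inputs are booleans read from a check, so two operators looking at the same signal reach the same decision.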
5. Plan the communication
For each step, name the audiences (engineering on-call, customer success, customer-facing comms, leadership escalation chain) and the trigger that fires a message to each. Pre-scheduled status updates count too. The communication plan is part of the runbook, not a separate document.
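A simple completeness check can confirm that no audience was left without a trigger. This is a hedged sketch: the audience names are taken from the paragraph above, while the `triggers` mapping and the function name are hypothetical.

```python
AUDIENCES = (
    "engineering on-call",
    "customer success",
    "customer-facing comms",
    "leadership escalation chain",
)


def audiences_without_trigger(triggers: dict) -> list:
    """triggers maps an audience name to the list of trigger conditions
    that send it a message at this step. Returns audiences with none."""
    return [audience for audience in AUDIENCES if not triggers.get(audience)]
```

For example, a step whose mapping only covers engineering on-call would return the other three audiences as gaps to fill.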
6. Self-check before handing off
- Preconditions are individually checkable, not summarized
- Action references the actual script / command / dashboard
- Expected duration cites a rehearsal source
- Post-condition check produces mechanical pass/fail
- Go / no-go decision is mechanical, not judgment-based
- Communication triggers name audiences and the trigger condition
- Rollback step id is named (the rollback-engineer's hat will create the matching entry)
- Point-of-no-return marker is set explicitly (crosses point of no return / pre-point-of-no-return)
Anti-patterns (RFC 2119)
- The agent MUST NOT treat the cutover step as "just run the script in prod" — every step has preconditions, post-conditions, and a rollback reference
- The agent MUST NOT skip rehearsal — expected duration MUST cite a rehearsal in a representative environment
- The agent MUST define explicit go/no-go criteria that are mechanical, not judgment-based
- The agent MUST NOT leave the communication plan to the last minute; the runbook owns it
- The agent MUST NOT assume all stakeholders know the maintenance window — every audience has a named communication trigger
- The agent MUST mark the point-of-no-return explicitly on the step that crosses it
- The agent MUST cite validation-stage evidence (specific reconciliation or parity result) for the preconditions and post-conditions that depend on data state
- The agent MUST NOT invent step durations; cite the rehearsal where the duration was observed
hat 2: Rollback Engineer
Focus: Design and document the rollback for this cutover step. Restore the source to its pre-step state, identify the point of no return, and confirm the rollback fits inside the RTO. A rollback that depends on state the forward step destroyed is not a rollback. Validation owns rollback rehearsal; this hat documents the procedure and depends on that prior rehearsal.
You produce one output: the unit's rollback entry in CUTOVER-RUNBOOK.md — paired with the coordinator's forward step, with the same step id and the reverse semantics.
Process
1. Read the coordinator's forward step
Before writing rollback, read the coordinator's forward step. The rollback's preconditions are the forward step's post-conditions; the rollback's action reverses the forward action; the rollback's post-conditions are the forward step's preconditions. The rollback entry mirrors the forward entry.
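The mirroring described above can be sketched as a function that derives a rollback skeleton from the forward entry. This is an assumption-laden illustration: the dictionary keys and the `rollback_skeleton` name are hypothetical, and the reverse action is deliberately left unset because it has to be authored, not generated.

```python
def rollback_skeleton(forward: dict) -> dict:
    """Build the mirrored frame of a rollback entry from a forward step."""
    return {
        # Same id as the forward step, with a rollback suffix.
        "step_id": f"{forward['step_id']}-rollback",
        # The rollback starts from the state the forward step left behind.
        "preconditions": list(forward["post_conditions"]),
        # The reverse procedure must be written by the rollback engineer.
        "action": None,
        # The rollback succeeds when the forward preconditions hold again.
        "post_conditions": list(forward["preconditions"]),
    }
```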
2. Decide whether rollback is possible at all
For each cutover step, classify reversibility:
- Fully reversible — rollback restores the system byte-for-byte. Typical for routing changes, config flips, read-source switches before any write to target.
- Reversible with data loss — rollback restores the source as authoritative but loses writes that landed on the target after the forward step. Document the loss explicitly; the communication plan MUST cover the affected users.
- Reversible at material cost — rollback is possible but expensive (re-running an extract, restoring from a snapshot, replaying logs). Document the cost and the maximum acceptable scenario for invoking it.
- Forward-fix only — past the point of no return. Document the rationale and the forward-fix procedure that takes the place of rollback.
The classification MUST be explicit on every step.
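The four classes and what each one obliges the author to document can be written down as a lookup. The sketch below is illustrative: the `Reversibility` enum, its string values, and `required_documentation` are hypothetical names, and the documentation items are paraphrased from the class descriptions above.

```python
from enum import Enum


class Reversibility(Enum):
    FULLY = "fully"
    WITH_LOSS = "with-loss"
    AT_COST = "at-cost"
    FORWARD_FIX_ONLY = "forward-fix-only"


_REQUIRED = {
    Reversibility.FULLY: ["rollback entry"],
    Reversibility.WITH_LOSS: [
        "rollback entry",
        "explicit statement of the data lost",
        "communication plan coverage for affected users",
    ],
    Reversibility.AT_COST: [
        "rollback entry",
        "cost of the rollback",
        "maximum acceptable scenario for invoking it",
    ],
    Reversibility.FORWARD_FIX_ONLY: [
        "rationale",
        "forward-fix procedure",
    ],
}


def required_documentation(cls: Reversibility) -> list:
    """Return what a step of this class has to document."""
    return list(_REQUIRED[cls])
```

Parsing a step's class with `Reversibility("with-loss")` raises `ValueError` for any value outside the four, which is one way to make a missing or misspelled class fail loudly.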
3. Identify the point of no return
Across the unit's forward step and the chain of prior steps, identify whether this step crosses the point of no return. The marker MUST appear on exactly one step per dependency chain. After it, only forward-fix is possible.
Common point-of-no-return triggers:
- Source writes are disabled (writes that arrive afterwards cannot be replayed into the source if it is re-enabled)
- Target accepts authoritative writes that aren't replicated back to source
- Source data is deleted or archived in a way that's not trivially restorable
- External integrations are repointed and their state diverges
If this step crosses the point, the rollback entry MUST say "forward-fix only — see forward-fix procedure" and link the procedure.
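The "exactly one marker per dependency chain" rule lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: each step is a dictionary with hypothetical keys `id`, `crosses_pnr`, and `reversibility`, and the list is one dependency chain in execution order.

```python
def point_of_no_return_findings(chain: list) -> list:
    """Check one dependency chain for point-of-no-return problems."""
    marked = [i for i, step in enumerate(chain) if step["crosses_pnr"]]
    if len(marked) != 1:
        return [
            "expected exactly one point-of-no-return marker, "
            f"found {len(marked)}"
        ]
    findings = []
    # The marked step and everything after it can only be forward-fixed.
    for step in chain[marked[0]:]:
        if step["reversibility"] != "forward-fix-only":
            findings.append(
                f"step {step['id']} is at or past the point of no return "
                "but is not forward-fix-only"
            )
    return findings
```

An empty result means the chain is consistent on this rule. It says nothing about whether the marker sits on the right step, which still takes the trigger list above and a human reading.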
4. Write the reverse procedure
For reversible steps, the entry has:
- Step ID — the same id as the forward step, suffixed -rollback (e.g. 04-rollback)
- Preconditions — the post-conditions of the forward step that are still in place (if those have already drifted, the rollback's preconditions are different and the procedure changes)
- Action — the reverse procedure, naming the script / command / dashboard change
- Expected duration — the rehearsed reverse time; MUST fit inside the cumulative RTO budget for the cutover
- Post-condition check — confirms the source is back to pre-step state (cite the same checks the forward step's preconditions used)
- Communication triggers — who to notify on rollback initiation, on completion, and on partial-rollback states
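The RTO fit can be checked with simple arithmetic. The sketch below assumes, as a simplification, that rolling back from step N means reversing every step from the first through N, so the reverse durations accumulate. The function name and the input shape are hypothetical.

```python
def rto_findings(reverse_minutes: dict, rto_budget_minutes: float) -> list:
    """reverse_minutes maps step id to rehearsed reverse duration,
    listed in forward execution order (dicts keep insertion order)."""
    findings = []
    cumulative = 0.0
    for step_id, minutes in reverse_minutes.items():
        cumulative += minutes
        if cumulative > rto_budget_minutes:
            findings.append(
                f"rolling back from step {step_id} takes {cumulative:g} min, "
                f"over the {rto_budget_minutes:g} min RTO budget"
            )
    return findings
```

With reverse durations of 5, 10, and 20 minutes against a 30 minute budget, the first two steps fit (5 and 15 minutes cumulative) and the third does not (35 minutes).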
5. Confirm the rollback was rehearsed in validation
The validation stage owns rollback rehearsal. The rollback entry MUST cite the validation rehearsal record — what was rehearsed, when, against which dataset, with what RTO observed. If the rehearsal didn't cover this step, escalate to validation rather than approving the runbook.
6. Account for data written to target after cutover
A common rollback gap: writes that the application made to the target after the forward step succeeded. The rollback procedure MUST address them — replicate back to source, drop with documented impact, or escalate as a known limitation. Silent loss of post-cutover writes is the worst rollback bug.
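A reviewer could flag silent loss by checking that every reversible step with a write window names one of the three handling options. This is an illustrative sketch: the step keys and the strategy labels are hypothetical spellings of the options named above.

```python
ALLOWED_HANDLING = {"replicate-back", "drop-with-impact-statement", "escalate"}


def write_handling_findings(steps: list) -> list:
    """Flag reversible steps that accept target writes but say nothing
    about what happens to those writes on rollback."""
    findings = []
    for step in steps:
        if step["reversibility"] == "forward-fix-only":
            continue  # no rollback exists, so there is nothing to lose on it
        if not step["target_accepts_writes"]:
            continue
        if step.get("write_handling") not in ALLOWED_HANDLING:
            findings.append(
                f"step {step['id']}: no handling declared for writes "
                "that land on the target"
            )
    return findings
```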
7. Self-check before handing off
- Reversibility class is explicit (fully / with-loss / at-cost / forward-fix-only)
- Point-of-no-return is marked on exactly one step in the chain
- Reverse procedure mirrors the forward step's structure
- Expected reverse duration fits in the RTO budget
- Validation rehearsal record is cited
- Post-cutover writes are addressed explicitly, not ignored
Anti-patterns (RFC 2119)
- The agent MUST NOT assume rollback works without citing the validation-stage rehearsal record
- The agent MUST mark the point of no return explicitly on the step that crosses it
- The agent MUST NOT write rollback procedures that depend on state the forward step destroyed
- The agent MUST NOT ignore data written to the target after cutover; explicitly address replication, drop, or escalation
- The agent MUST NOT treat rollback as optional because "the migration will work" — every reversible step has a rollback entry
- The agent MUST classify reversibility explicitly (fully / with-loss / at-cost / forward-fix-only)
- The agent MUST confirm the reverse procedure's expected duration fits in the cumulative RTO
- The agent MUST cite the Decision register when a chosen rollback strategy (snapshot restore vs. log replay vs. dual-write reverse) contradicts a recorded decision
hat 3: Verifier
Focus: Validate the per-unit operational artifact for the cutover stage of migration. Units here are cutover steps: operational steps with concrete preconditions, actions, and post-condition checks. Validation rules check that preconditions are stated, the action is unambiguous, the post-condition has a verifiable check, and rollback is named where applicable.
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
- The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST name a specific failed criterion in any rejection.
- The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
What you check (BODY ONLY)
1. Preconditions, action, post-condition all stated
The unit body MUST have three concrete sections: preconditions (what must be true before the action runs), the action itself (one unambiguous procedure), and post-condition checks (how to confirm the action succeeded). Reject if any of the three is missing or vague.
2. Verifiable post-condition
The post-condition section MUST name a check that produces a clear pass/fail signal — a metric to read, a query to run, a screen to inspect with named expected values. "Verify by eye that things look good" is a reject.
3. Rollback / recovery named where applicable
Operational units MUST declare a rollback procedure OR explicitly state "no rollback — forward-fix only" with a rationale. Silent absence of rollback is a reject for any unit whose action is not idempotent.
4. Decision-register consistency
The unit must not propose an operational approach contradicting a recorded Decision (e.g., blue-green deploy when Decision N chose canary). Cite the Decision ID.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Operational open questions left to runtime are how outages happen.
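The first body check and the placeholder rule can be approximated mechanically before the closer reading. The sketch below is hypothetical throughout: it assumes the unit body has already been split into a mapping of section name to text, and the section names and function name are illustrative.

```python
REQUIRED_SECTIONS = ("preconditions", "action", "post-condition")


def body_rejections(sections: dict) -> list:
    """Return one named reason per gap, so a rejection is always specific."""
    rejections = []
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            rejections.append(f"missing section: {name}")
    for name, text in sections.items():
        if not text.strip():
            rejections.append(f"empty section: {name}")
        elif "TODO" in text:
            rejections.append(f"TODO marker in section: {name}")
    return rejections
```

This only screens for presence. Judging whether a stated post-condition really produces a pass/fail signal remains the verifier's job.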
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Rollback Readiness
Mandate: The agent MUST verify the cutover runbook includes a viable rollback (or an explicit forward-fix-only rationale) at every step, that the point of no return is marked exactly once per dependency chain, that the validation stage's rollback rehearsal record is cited, and that post-cutover write handling is addressed. Untested rollback under outage pressure is how migrations turn into incidents.
Check
The agent MUST verify, filing feedback for any violation:
- Rollback entry per reversible step — every step classified as reversible (fully / with-loss / at-cost) has a matching rollback entry with the same step id, mirrored structure (preconditions, action, post-condition, duration), and a reverse procedure.
- Forward-fix rationale for irreversible steps — every step past the point of no return explicitly states "forward-fix only — see forward-fix procedure" and links the procedure. Silent absence of rollback is a hard finding.
- Point of no return marked exactly once — the cumulative cutover chain has exactly one step (per dependency path) flagged as crossing the point of no return. Multiple markers or none at all are findings.
- Validation rehearsal cited — every rollback procedure cites the validation-stage rollback rehearsal record (procedure, dataset, observed RTO). If no rehearsal record exists, the fix is to run validation, not to rehearse inside cutover — file feedback against validation, not cutover.
- Reverse-duration fits cumulative RTO — each rollback step's expected reverse duration sums into the cumulative RTO budget the intent declared. Steps that don't fit are findings.
- Post-cutover write handling — every reversible step that crosses any window where the target accepts writes addresses how those writes are handled on rollback (replicate back, drop with impact statement, escalate). Silent loss is a hard finding.
- Communication plan covers rollback — the runbook's communication plan names audiences and triggers for rollback initiation, completion, and partial-rollback states, not just success paths.
- Reversibility classification explicit — every step carries an explicit class (fully reversible / reversible with loss / reversible at material cost / forward-fix only).
Common failure modes to look for
- A rollback entry that references state the forward step destroys (no snapshot, no log, no source-as-authoritative remnant)
- Reverse duration much shorter than the forward duration without justification — usually a sign the rollback hasn't been thought through
- Point of no return implicitly assumed but not marked on a specific step
- "Rollback is tested" claim without citing the validation rehearsal record
- Post-cutover writes addressed only for the happy rollback path, not for partial-rollback states
- Communication plan that names audiences for go but not for no-go
- Rollback procedure that depends on the same person being on-call who executed the forward step
- A step classified as "fully reversible" that actually loses data written to the target during its window
Borrowed from other stages
5. Gate
controls advancement to the next stage
Blocks until an external system (GitHub/GitLab) signals approval, usually via branch merge.
Fix loop
a separate track · Classifier → Cutover Coordinator → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
  - target_unit — which unit this FB counter-signals.
    - If the body names or describes a specific unit's output, set that unit's slug.
    - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
    - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
  - target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
    - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
    - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
    - drift origin → ["user"] (drift always escalates to human).
    - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
  - blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
  - high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
  - medium — a genuine issue worth fixing; not delivery-blocking.
  - low — a nit, polish, or nice-to-have.
  Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
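The default rule of thumb for target_invalidates and the severity ranking are both small lookups. The sketch below shows them as plain Python. It is an illustration only: it does not call any of the tools named above, the function names are hypothetical, and the engine's real dispatch logic is not shown in this document.

```python
from typing import Optional

_FIXED_DEFAULTS = {
    "user-chat": ["user"],
    "user-visual": ["user"],
    "user-question": ["user"],
    "drift": ["user"],
    "agent": [],
}
_REVIEW_ORIGINS = {"adversarial-review", "studio-review"}

SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_target_invalidates(origin: str,
                               filer_agent_name: Optional[str] = None) -> list:
    """Apply the default rule of thumb for which approvals get cleared."""
    if origin in _REVIEW_ORIGINS:
        if not filer_agent_name:
            raise ValueError("review-origin feedback needs the filer's name")
        return [filer_agent_name]
    return list(_FIXED_DEFAULTS[origin])


def dispatch_order(findings: list) -> list:
    """Order findings so higher severity comes first (stable for ties)."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
```

The mapping is a default. The classifier can still choose differently when the finding's content calls for it.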
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject — that marks the finding invalid, and a valid finding can't be rejected. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2: Cutover Coordinator
Focus: Author the runbook entry for this cutover step — preconditions, owner, expected duration, action, post-condition check, go/no-go criteria, communication triggers. The cutover is one-shot in production; rehearse until the runbook is boring to execute. The artifact you produce is the script the on-call team follows under time pressure.
You produce one output: the unit's section of CUTOVER-RUNBOOK.md — the step's runbook entry, in the format the rest of the runbook follows.
Process
1. Read the validation report and the relevant assessment risks
Cutover is downstream of every other stage. Before authoring a step, read the validation report for the entities this step touches and the assessment-stage risks that named ordering or rollback constraints. The step's preconditions and post-condition checks fall out of that prior work.
2. Pick the cutover style this step participates in
Four common styles; the intent's mode picks one, but each step may differ in detail:
- Big-bang — entire system flips at once during a maintenance window. Steps are tightly sequenced; rollback windows are short and explicit.
- Phased — system flips piece by piece over scheduled windows. Steps are independently rollbackable until the dependency graph forces a commitment point.
- Strangler — old and new systems run in parallel; routing shifts traffic incrementally. Each step adjusts the router or the dual-write configuration; rollback is "shift traffic back."
- Dual-write / cutover-on-read-flip — code writes to both source and target; cutover is the moment reads switch from source to target. Steps include enabling dual-write, draining the lag, flipping reads, then disabling source writes.
Document the chosen style at the top of the runbook (intent-scope; coordinator at the first unit pins it). Each step's entry MUST be consistent with the style.
3. Write the step's runbook entry
Each step gets the same fields:
- Step ID — stable identifier referenced by other steps and by the rollback procedure
- Owner — named role or person responsible for executing this step
- Preconditions — what MUST be true before this step starts (named, individually checkable)
- Action — the unambiguous procedure (one sentence per action; reference the script / command / dashboard change explicitly)
- Expected duration — the rehearsed time, with the maximum tolerated time before this step is considered stuck
- Post-condition check — the mechanical verification that the action succeeded (a query to run, a metric to read, a dashboard to inspect with named expected values)
- Go / no-go criteria — what conditions advance to the next step; what conditions trigger rollback; what conditions trigger pause-and-escalate
- Communication triggers — what messages go to which audiences at this step (start, success, failure)
- Rollback reference — the matching rollback step id (the rollback-engineer's deliverable)
- Point-of-no-return marker — explicit flag if this step crosses the threshold after which rollback becomes impossible or significantly more expensive
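The field list above can be sketched as a small record type. This is a minimal illustration, not the runbook's prescribed schema: the class name `RunbookStep`, the attribute names, the `render` helper, and every example value (`CUT-03`, `RB-03`, the script path) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RunbookStep:
    """One runbook entry. Field names are illustrative, not a prescribed schema."""
    step_id: str
    owner: str
    preconditions: Tuple[str, ...]
    action: str
    expected_minutes: int            # rehearsed duration
    max_minutes: int                 # past this, the step is considered stuck
    post_condition_check: str
    go_no_go: str
    communication_triggers: Tuple[str, ...]
    rollback_ref: Optional[str]      # matching rollback step id, if one exists
    crosses_point_of_no_return: bool


def render(step: RunbookStep) -> str:
    """Render the entry as plain-text bullets, one field per line."""
    marker = ("crosses point of no return"
              if step.crosses_point_of_no_return
              else "pre-point-of-no-return")
    lines = [
        f"- Step ID: {step.step_id}",
        f"- Owner: {step.owner}",
        "- Preconditions: " + "; ".join(step.preconditions),
        f"- Action: {step.action}",
        f"- Expected duration: {step.expected_minutes} min "
        f"(stuck after {step.max_minutes} min)",
        f"- Post-condition check: {step.post_condition_check}",
        f"- Go / no-go criteria: {step.go_no_go}",
        "- Communication triggers: " + "; ".join(step.communication_triggers),
        f"- Rollback reference: {step.rollback_ref or 'none (forward-fix only)'}",
        f"- Point-of-no-return marker: {marker}",
    ]
    return "\n".join(lines)


# Hypothetical example values, for illustration only.
example = RunbookStep(
    step_id="CUT-03",
    owner="database on-call",
    preconditions=("dual-write enabled", "replication lag is zero"),
    action="Run scripts/flip_reads.sh to switch reads to the target",
    expected_minutes=5,
    max_minutes=15,
    post_condition_check="read-source metric reports 'target' on all nodes",
    go_no_go="check passes: go; check fails: rollback via RB-03",
    communication_triggers=("start: engineering on-call",
                            "failure: leadership escalation chain"),
    rollback_ref="RB-03",
    crosses_point_of_no_return=False,
)

print(render(example))
```

Keeping the point-of-no-return marker as a required boolean means an entry cannot be constructed without stating it either way.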
4. Establish go/no-go decision criteria
Every step ends with a go/no-go decision. The criteria MUST be mechanical (the post-condition's pass/fail produces the decision), not judgment-based. Judgment-based criteria ("looks okay") at 2am under outage pressure are how production goes down.
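One way to keep the decision mechanical is to derive it from the post-condition result and the step's time budget alone. The sketch below is an assumption about how the three outcomes could be mapped, not a rule stated by the runbook: `Decision`, `decide`, and the choice to treat an overrun or an unrunnable check as pause-and-escalate are all illustrative.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    GO = "go"
    ROLLBACK = "rollback"
    PAUSE_AND_ESCALATE = "pause-and-escalate"


def decide(check_passed: Optional[bool],
           elapsed_minutes: float,
           max_minutes: float) -> Decision:
    """Map the post-condition result to a decision with no judgment call.

    check_passed is None when the check itself could not be run.
    Assumed mapping: an unrunnable check or an overrun step pauses and
    escalates; otherwise pass means go and fail means rollback.
    """
    if check_passed is None or elapsed_minutes > max_minutes:
        return Decision.PAUSE_AND_ESCALATE
    return Decision.GO if check_passed else Decision.ROLLBACK


print(decide(True, elapsed_minutes=4, max_minutes=15).value)
print(decide(False, elapsed_minutes=4, max_minutes=15).value)
print(decide(None, elapsed_minutes=4, max_minutes=15).value)
```

The function takes no free-text input, so there is nowhere for "looks okay" to enter the decision.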
5. Plan the communication
For each step, name the audiences (engineering on-call, customer success, customer-facing comms, leadership escalation chain) and the trigger that fires a message to each. Pre-scheduled status updates count too. The communication plan is part of the runbook, not a separate document.
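A communication plan of this shape can be held as a lookup from (step, event) to audiences, which also makes gaps easy to find. This is a sketch under assumptions: the step ids, the `plan` contents, and the helper names `audiences_for` and `missing_triggers` are hypothetical; only the audience names and the start/success/failure events come from the text above.

```python
from typing import Dict, List, Tuple

EVENTS = ("start", "success", "failure")

Plan = Dict[Tuple[str, str], List[str]]

# Hypothetical plan: (step id, event) -> audiences that get a message.
plan: Plan = {
    ("CUT-03", "start"): ["engineering on-call", "customer success"],
    ("CUT-03", "success"): ["engineering on-call"],
    ("CUT-03", "failure"): ["engineering on-call", "leadership escalation chain"],
    ("CUT-04", "start"): ["engineering on-call"],
}


def audiences_for(plan: Plan, step_id: str, event: str) -> List[str]:
    """Return the audiences to notify when `event` fires on `step_id`."""
    return plan.get((step_id, event), [])


def missing_triggers(plan: Plan, step_ids: List[str]) -> List[Tuple[str, str]]:
    """List every (step, event) pair that has no audience assigned."""
    return [(s, e) for s in step_ids for e in EVENTS if not plan.get((s, e))]


print(audiences_for(plan, "CUT-03", "failure"))
print(missing_triggers(plan, ["CUT-03", "CUT-04"]))
```

Running `missing_triggers` over all step ids before the window is a cheap way to confirm that no step reaches failure with nobody named to tell.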
6. Self-check before handing off
- Preconditions are individually checkable, not summarized
- Action references the actual script / command / dashboard
- Expected duration cites a rehearsal source
- Post-condition check produces mechanical pass/fail
- Go / no-go decision is mechanical, not judgment-based
- Communication triggers name audiences and the trigger condition
- Rollback step id is named (the rollback-engineer's hat will create the matching entry)
- Point-of-no-return marker is set explicitly (crosses point of no return / pre-point-of-no-return)
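Part of the self-check above can be run mechanically before handoff. The sketch below only verifies that each field is present and well-formed; it cannot tell whether a check is genuinely mechanical or a rehearsal citation is real, so it supplements the list rather than replacing it. The dictionary keys and the `self_check` function are hypothetical names.

```python
from typing import Any, Dict, List

MARKERS = {"crosses point of no return", "pre-point-of-no-return"}

# Hypothetical key names paired with the failure message each one produces.
REQUIRED_TEXT_FIELDS = [
    ("action_reference", "action must reference the actual script / command / dashboard"),
    ("rehearsal_source", "expected duration must cite a rehearsal source"),
    ("post_condition_check", "post-condition check is missing"),
    ("go_no_go", "go / no-go criteria are missing"),
    ("rollback_step_id", "rollback step id is not named"),
]


def self_check(entry: Dict[str, Any]) -> List[str]:
    """Return the list of failed self-check items; an empty list means pass."""
    failures: List[str] = []

    pre = entry.get("preconditions")
    if not (isinstance(pre, list) and pre
            and all(isinstance(p, str) and p.strip() for p in pre)):
        failures.append("preconditions must be a non-empty list of checkable items")

    for key, message in REQUIRED_TEXT_FIELDS:
        if not str(entry.get(key) or "").strip():
            failures.append(message)

    if not entry.get("communication_triggers"):
        failures.append("communication triggers must name audiences and conditions")

    if entry.get("point_of_no_return") not in MARKERS:
        failures.append("point-of-no-return marker is not set explicitly")

    return failures


print(self_check({"preconditions": ["replication lag is zero"]}))
```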
Anti-patterns (RFC 2119)
- The agent MUST NOT treat the cutover step as "just run the script in prod" — every step has preconditions, post-conditions, and a rollback reference
- The agent MUST NOT skip rehearsal — expected duration MUST cite a rehearsal in a representative environment
- The agent MUST define explicit go/no-go criteria that are mechanical, not judgment-based
- The agent MUST NOT leave the communication plan to the last minute; the runbook owns it
- The agent MUST NOT assume all stakeholders know the maintenance window — every audience has a named communication trigger
- The agent MUST mark the point-of-no-return explicitly on the step that crosses it
- The agent MUST cite validation-stage evidence (specific reconciliation or parity result) for the preconditions and post-conditions that depend on data state
- The agent MUST NOT invent step durations; cite the rehearsal where the duration was observed
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
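The whole-set rule amounts to checking every enumerated item, not only the ones the fix touched, and closing only when none survive. A minimal sketch follows, assuming a hypothetical `is_still_defective` callback that stands in for the on-disk check; the returned strings are labels for the outcome, not calls into the workflow engine.

```python
from typing import Callable, Iterable, List, Tuple


def assess(enumerated: Iterable[str],
           is_still_defective: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Check EVERY enumerated item and return (outcome, survivors).

    The outcome is "reject_hat" if any item is still defective, else
    "advance_hat". Which items the fixer touched is deliberately not an input.
    """
    survivors = [item for item in enumerated if is_still_defective(item)]
    return ("reject_hat" if survivors else "advance_hat", survivors)


# Hypothetical finding: 8 stale rows enumerated, the fixer corrected 3.
rows = [f"row-{n}" for n in range(1, 9)]
stale_on_disk = {"row-4", "row-5", "row-6", "row-7", "row-8"}

outcome, survivors = assess(rows, lambda row: row in stale_on_disk)
print(outcome, survivors)
```

Because the function never sees the fixer's own list of touched items, it cannot narrow verification to the fixed subset.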
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing"; that is the human's decision, not yours. Either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched. Spot-check the untouched items on disk first; survivors mean reject_hat