Migration · stage 4 of 5

Validation

Ask gate

Verify data integrity, functional parity, and performance

Prove the migrated target matches the source — quantitatively through counts, hashes, and sampled reconciliation, and functionally by showing downstream consumers produce identical results. This is the stage that gates cutover: if it can't show parity, the migration isn't ready to go live. It also owns rehearsing the rollback end-to-end.

Scope

Reconciliation, functional-parity testing, performance benchmarking, and rollback rehearsal. Validation decides whether the migrated target is correct and the rollback works — not how the data moved (migrate) or how the production cutover is sequenced (cutover). Units are verification surfaces, each naming its method, threshold, and mechanical pass/fail criteria.

What to do

  • Produce quantitative reconciliation evidence — row counts, hash digests, sampled field-by-field diffs, constraint and referential-integrity checks.
  • Replay production queries, workflows, and consumer flows against both systems and compare output side by side, including performance deltas.
  • Exercise the rollback procedure end-to-end against a representative dataset and record the rehearsal.
  • Anchor every parity claim to reconciled data, and state each surface's threshold and evidence shape.

What NOT to do

  • Don't fix the migration code here — file findings; the migrate stage owns corrections.
  • Don't plan or execute the production cutover; that's the cutover stage consuming this report.
  • Don't claim parity from a test that doesn't cite reconciled data.
  • Don't advance without the rollback rehearsal — cutover will refuse to proceed without it.

How the engine runs this stage

1. Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Discovery fan-out

knowledge artifact · Validation Report

Document quantitative verification of the migration. This output feeds the cutover stage as evidence for go/no-go decisions.

Content Guide

Structure the report around verification categories:

  • Row-count reconciliation — source vs. target counts per entity with discrepancy analysis
  • Checksum comparison — hash-based verification for data integrity
  • Sample-based validation — randomly sampled records with field-level diffs
  • Constraint verification — unique keys, foreign keys, check constraints all satisfied in target
  • Regression test results — downstream consumers produce identical output against migrated data
  • Performance benchmarks — query latency comparison for critical paths (source vs. target)
  • Discrepancy register — any differences found with root cause and resolution

Quality Signals

  • Reconciliation is quantitative, not qualitative — numbers, not opinions
  • Sampling is random and statistically meaningful, not cherry-picked
  • Records intentionally transformed or dropped are accounted for in reconciliation
  • Performance benchmarks cover the critical query patterns, not just simple lookups

Phase guidance

phase override · ELABORATION

Validation Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

  • "Row-count reconciliation shows zero discrepancy between source and target for every entity"
  • "Spot-check validation compares at least 100 randomly sampled records per entity with field-level diff"
  • "Performance benchmarks show target query latency within 10% of source for critical paths"

Bad criteria — vague (no clear check)

  • "Data looks correct"
  • "Validation is complete"
  • "Performance is acceptable"

Outputs produced

output template · Validation Report

Data integrity, functional parity, and performance verification results.

Expected Artifacts

  • Row-count reconciliation -- source-to-target comparison for every entity
  • Spot-check results -- randomly sampled records compared with field-level diff
  • Performance benchmarks -- target query latency compared to source for critical paths
  • Parity assessment -- functional equivalence verified between source and target

Quality Signals

  • Row-count reconciliation shows zero or acceptable discrepancy per entity
  • At least 100 randomly sampled records are compared per entity
  • Performance benchmarks show target latency within acceptable range of source
  • All critical functional paths are verified for parity

2. Review

pre-execute · agents audit the planned spec before any code lands
review agent · Parity

Mandate: The agent MUST verify the migrated target achieves functional parity with the source — downstream consumers produce identical results, real query patterns replay cleanly, performance fits within the agreed thresholds, no behavioral regression slips through to cutover. Parity gaps that ship to cutover become user-visible regressions.

Check

The agent MUST verify, filing feedback for any violation:

  • Reconciliation evidence is in place — the validator hat's quantitative reconciliation section is complete (counts, hashes, sampled field-level diffs, constraint checks). Functional parity claims rest on it.
  • Consumer-surface coverage — every read consumer named in the upstream inventory has a replay test against the target. Surfaces silently skipped are a hard finding.
  • Real query-pattern replay — replay is from production / staging logs with a justified sample size, not hand-crafted tests alone. Sample size MUST be large enough to cover the long tail.
  • Existing test suites run against target — every test suite the application has produces the same pass/fail signal against the migrated target as against the source. Net-new test failures are the highest-priority finding.
  • Performance deltas measured — p50 / p95 / p99 latency captured for each replayed query pattern with the source-vs-target delta and a PASS / DEGRADED / IMPROVED status. Any DEGRADED status is a finding tied to the threshold cited in the unit's acceptance criteria.
  • Behavioral differences itemized — ordering changes, error-code shifts, null-handling differences, timing differences are recorded with reproduction steps, not summarized as "looks the same."
  • Rollback rehearsal captured — at least one validation unit produced a rollback rehearsal record (procedure, dataset, RTO observed). Cutover depends on it; absence is a hard finding.
  • No "no errors in logs" shortcut — claims of parity rest on explicit output comparison, not on the absence of errors.

Common failure modes to look for

  • Performance numbers without source-side baseline for comparison
  • Replay tests captured against a target that's been pre-warmed in a way production won't be
  • Test suites that pass against a fixture target but haven't been run against the migrated target
  • Behavioral differences acknowledged but not reproduced step-by-step
  • Rollback rehearsal claimed but with no captured RTO observation or dataset description
  • Parity surfaces marked PASS without citing the captured outputs
  • Replay sample size justified as "looks representative" without quantitative reasoning
  • A read consumer in the inventory with no replay test in the validation evidence

3. Execute

per-unit baton · Validator → Regression Tester → Verifier
hat 1 · Regression Tester

Focus: Confirm downstream consumers and application logic produce identical results when reading from the migrated target instead of the original source. Existing test suites run, real production query patterns replay, behavioral differences surface — no matter how small. The output is the parity evidence that cutover relies on.

You produce one output: the ## Functional parity evidence section of the unit's body — the consumer flows replayed, the queries run, the outputs compared, the performance deltas measured, and any behavioral differences itemized.

Process

1. Read the validator hat's reconciliation evidence

Functional parity is built on quantitative parity. If the validator's reconciliation reported gaps, the parity test SHOULD focus on the affected surfaces first — they're the highest-risk consumer paths.

2. Identify the consumer surfaces in scope

From the upstream inventory, every read consumer of the source artifact is a candidate parity surface. Pick the surfaces this unit owns and list them: application services that query the entity, batch jobs that process it, downstream search indexes / caches / replicas that derive from it, external APIs that expose it.

3. Replay real query patterns

Static unit tests are insufficient; replay actual query patterns observed in production:

  • Capture a representative sample of recent queries / requests / events from production logs (or staging logs if production is not safe to sample)
  • Replay each one against both source and target (or against the application backed by source and the application backed by target)
  • Compare outputs structurally — field-level, not "looks similar"
  • Record any difference

Sample size MUST be justified — query patterns have long tails, and a small sample misses the rare-but-load-bearing cases.
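The field-level comparison in step 3 can be sketched as a recursive diff. This is a minimal illustration, assuming each replayed response has already been captured as a JSON-like Python structure; the name `diff_outputs` and the `$.field[index]` path notation are hypothetical and not part of any tool this stage prescribes.

```python
from typing import Any


def diff_outputs(source: Any, target: Any, path: str = "$") -> list[str]:
    """Return one line per field where the target response differs from the source."""
    if isinstance(source, dict) and isinstance(target, dict):
        diffs: list[str] = []
        for key in sorted(source.keys() | target.keys()):
            child = f"{path}.{key}"
            if key not in target:
                diffs.append(f"{child}: missing in target")
            elif key not in source:
                diffs.append(f"{child}: unexpected in target")
            else:
                diffs.extend(diff_outputs(source[key], target[key], child))
        return diffs
    if isinstance(source, list) and isinstance(target, list):
        diffs = []
        if len(source) != len(target):
            diffs.append(f"{path}: length {len(source)} != {len(target)}")
        # Compared position by position, so an ordering change shows up as a diff.
        for i, (s, t) in enumerate(zip(source, target)):
            diffs.extend(diff_outputs(s, t, f"{path}[{i}]"))
        return diffs
    # Compare type as well as value so that 10 vs "10" is caught.
    if type(source) is not type(target) or source != target:
        return [f"{path}: {source!r} != {target!r}"]
    return []
```

An empty result is the only outcome that supports a parity claim for that replayed pattern; each returned line is a candidate entry for the behavioral-differences list.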

4. Run existing test suites against the target

Every test suite the application has (unit, integration, contract, end-to-end) MUST run against the migrated target and produce the same pass/fail signal as it does against the source. Test failures that the validator's reconciliation didn't predict are the highest-priority findings.

5. Measure performance

For each replayed query pattern, capture latency (p50, p95, p99) and throughput against both source and target. Report deltas:

Query pattern | Source p99 | Target p99 | Delta | Status
...           | ...        | ...        | ...   | PASS (within 10%) / DEGRADED / IMPROVED

Correct but materially slower is still a regression; the unit's acceptance criteria name the threshold (typical default: target within 10% of source on p99). Flag any pattern exceeding the threshold as a finding.
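One way to turn captured latencies into the status column is sketched below. It assumes a nearest-rank percentile and treats any target percentile faster than the source as IMPROVED; the source text does not define that boundary, so both choices are assumptions, as are the function names.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples captured")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_status(source_ms: list[float], target_ms: list[float],
                   pct: int = 99, threshold: float = 0.10) -> dict:
    """Compare one percentile across source and target and label the result."""
    src = percentile(source_ms, pct)
    tgt = percentile(target_ms, pct)
    if src <= 0:
        raise ValueError("source baseline must be positive to compute a relative delta")
    delta = (tgt - src) / src  # relative change; positive means the target is slower
    if delta > threshold:
        status = "DEGRADED"
    elif delta < 0:
        status = "IMPROVED"    # assumption: any speed-up counts as IMPROVED
    else:
        status = "PASS"
    return {"source": src, "target": tgt, "delta": delta, "status": status}
```

Calling `latency_status` once per percentile (50, 95, 99) for each replayed pattern yields the rows of the table, with the threshold taken from the unit's acceptance criteria rather than the default used here.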

6. Surface behavioral differences explicitly

If a consumer behaves differently against target (different ordering, different error code, different null handling, different timing), record it. "No errors in logs" is not equivalent to "functionally correct" — the comparison MUST be against expected behavior, not against absence of errors.

7. Self-check before handing off

  • Every read consumer in this unit's scope has been replayed against the target
  • Existing test suites run against the target with the same pass/fail signal
  • Performance is measured with explicit p50 / p95 / p99 deltas
  • Behavioral differences are itemized with reproduction steps, not summarized
  • No surface is silently skipped

Anti-patterns (RFC 2119)

  • The agent MUST NOT only test the data layer without exercising application logic on top of it
  • The agent MUST NOT ignore performance regressions — correct but materially slower is still a regression
  • The agent MUST NOT assume passing unit tests means the integration is correct; replay real query patterns
  • The agent MUST replay representative query patterns from production / staging logs, not just hand-crafted tests
  • The agent MUST NOT treat "no errors in logs" as equivalent to "functionally correct"
  • The agent MUST NOT use vague summaries ("works the same") — every parity claim cites the replayed pattern and the captured outputs
  • The agent MUST justify the sample size for replayed query patterns
  • The agent MUST cite the Decision register when a parity threshold (latency budget, throughput floor) was explicitly set
hat 2 · Validator

Focus: Perform quantitative reconciliation between source and target for this unit's verification surface. Row counts, hash digests, sampled field-level diffs, constraint and referential-integrity checks. Confidence is not the goal — proof is. The output gates cutover, so weak evidence here means cutover ships with weak ground.

You produce one output: the ## Reconciliation evidence section of the unit's body — the methods used, the queries run, the counts and hashes captured, the sampled diffs, and the constraint-check results.

Process

1. Name the verification surface precisely

The unit's surface is one of:

  • An entity / table comparison (every row in source has a corresponding row in target, transformed per the mapping spec)
  • A constraint / referential-integrity check (target enforces what the mapping spec said it would enforce)
  • A relationship integrity check (foreign-key cardinality holds; orphan / dangling refs are absent)
  • An invariant that crosses entities (totals reconcile, derived aggregates match)

Write the surface in plain language at the top of the section. Vague surfaces ("data looks right") cannot be reconciled.

2. Reconcile counts

Run the count comparison: source rows of category X versus target rows after migration, accounting for any rows the mapping spec dropped or merged. Output:

Metric             | Source value | Target value | Expected delta   | Actual delta    | Status
Total rows         | N            | M            | per mapping spec | M − N − dropped | PASS / FAIL
Rows per partition | ...          | ...          | ...              | ...             | ...

A non-zero unexpected delta is a hard FAIL; the rationale for any expected delta MUST cite the mapping-spec row that produced it.
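The count comparison can be made mechanical with a small record type. The sketch below assumes a signed convention in which the expected delta is target minus source (so dropped rows appear as a negative number); that convention, the `CountRow` name, and the `MS-12` citation in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CountRow:
    metric: str
    source: int
    target: int
    expected_delta: int  # signed target - source, e.g. -10 when the spec drops 10 rows
    spec_ref: str        # mapping-spec row that justifies a non-zero expected delta

    @property
    def actual_delta(self) -> int:
        return self.target - self.source

    @property
    def status(self) -> str:
        # Any delta the mapping spec did not predict is a hard FAIL.
        return "PASS" if self.actual_delta == self.expected_delta else "FAIL"


def reconcile(rows: list[CountRow]) -> list[str]:
    """Render one table line per metric, refusing expected deltas with no citation."""
    lines = []
    for r in rows:
        if r.expected_delta != 0 and not r.spec_ref:
            raise ValueError(f"{r.metric}: expected delta lacks a mapping-spec citation")
        lines.append(
            f"{r.metric} | {r.source} | {r.target} | "
            f"{r.expected_delta:+d} | {r.actual_delta:+d} | {r.status}"
        )
    return lines
```

For example, `reconcile([CountRow("Total rows", 1000, 990, -10, "MS-12")])` produces a PASS line, while the same counts with an expected delta of 0 would FAIL.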

3. Reconcile content via hashes

For each entity in scope, compute a stable hash digest over the canonical row representation (after the same normalization the mapping spec describes) and compare source vs. target. Hash equality is the strongest evidence; hash drift drives sample-based investigation.

Document the hash method (which fields, in which order, with which normalization) so a reviewer can re-run it.
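A reproducible hash method might look like the following sketch. It assumes SHA-256 over a JSON encoding of the chosen fields in a fixed order, with whitespace trimming standing in for whatever normalization the mapping spec actually describes; none of these choices come from the source text.

```python
import hashlib
import json


def row_digest(row: dict, fields: list[str]) -> str:
    """Hash one row over a fixed field order with explicit normalization."""
    canonical = []
    for name in fields:
        value = row.get(name)
        if isinstance(value, str):
            value = value.strip()  # assumed normalization: trim surrounding whitespace
        canonical.append(value)
    # default=str keeps dates and decimals serializable in a stable textual form.
    payload = json.dumps(canonical, separators=(",", ":"), ensure_ascii=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def entity_digest(rows: list[dict], fields: list[str], key: str) -> str:
    """Combine row digests in primary-key order so retrieval order does not matter."""
    combined = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key]):
        combined.update(row_digest(row, fields).encode("ascii"))
    return combined.hexdigest()
```

Recording the field list, the field order, and the normalization alongside the digest is what lets a reviewer re-run the comparison; when the entity digests differ, comparing the per-row digests narrows the sample-based investigation.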

4. Sample-based field-level diff

Take a random sample of records. The sample size MUST be justified by source volume and large enough to surface a statistically meaningful difference if one exists; the unit's elaboration phase MUST have pinned it. For each sampled record:

  • Pull source representation
  • Pull target representation
  • Apply the mapping-spec transforms to source
  • Diff transformed-source vs. target field-by-field
  • Record any diff

Zero diffs across the whole sample is the success signal; any diff is a finding that cites the field that differed.
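The sampling loop above can be sketched as follows, assuming source and target records are available as dictionaries keyed by primary key and that the mapping-spec transforms are supplied as a single function. The names `sample_diff` and `transform`, and the use of a fixed seed for reproducibility, are assumptions for illustration.

```python
import random
from typing import Callable


def sample_diff(source: dict, target: dict, transform: Callable[[dict], dict],
                sample_size: int, seed: int) -> list[tuple]:
    """Diff a seeded random sample of source records against the target, field by field.

    source and target map primary key -> record (a dict of field values).
    Returns (key, field, expected, actual) tuples; an empty list means zero diffs.
    """
    rng = random.Random(seed)  # a recorded seed makes the sample re-runnable
    keys = rng.sample(sorted(source), min(sample_size, len(source)))
    findings = []
    for key in keys:
        expected = transform(source[key])  # apply the mapping-spec transforms to source
        actual = target.get(key)
        if actual is None:
            findings.append((key, "<record>", expected, None))
            continue
        for field in sorted(expected.keys() | actual.keys()):
            if expected.get(field) != actual.get(field):
                findings.append((key, field, expected.get(field), actual.get(field)))
    return findings
```

Sampling from the sorted key list with a seeded generator avoids the "first N rows" trap while keeping the exact sample reproducible for a reviewer.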

5. Run constraint and referential checks

Verify the target enforces what the mapping spec promised:

  • Unique constraints — query for duplicate keys
  • Foreign keys — query for orphan rows
  • Check constraints — query for rows violating the check predicate
  • Not-null — query for nulls in non-null columns
  • Indexes — verify presence (and selectivity if the mapping spec specified it)
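Most of the checks listed above reduce to queries that return violating rows. The sketch below runs four of them with Python's built-in `sqlite3` module against a hypothetical `customers` / `orders` schema; the table names, column names, and check predicate are invented for illustration, and the index-presence check is left out because it is engine-specific.

```python
import sqlite3

# Hypothetical schema: orders.customer_id references customers.id.
CHECKS = {
    "unique customers.email": """
        SELECT email FROM customers
        WHERE email IS NOT NULL
        GROUP BY email HAVING COUNT(*) > 1""",
    "fk orders.customer_id": """
        SELECT o.id FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE o.customer_id IS NOT NULL AND c.id IS NULL""",
    "check orders.amount >= 0": """
        SELECT id FROM orders WHERE amount < 0""",
    "not-null customers.email": """
        SELECT id FROM customers WHERE email IS NULL""",
}


def run_checks(conn: sqlite3.Connection) -> dict[str, str]:
    """Each query returns violating rows; an empty result is a PASS."""
    results = {}
    for name, sql in CHECKS.items():
        violations = conn.execute(sql).fetchall()
        results[name] = "PASS" if not violations else f"FAIL ({len(violations)} rows)"
    return results
```

Querying for violations directly, rather than trusting that a constraint is declared, also covers targets where constraints were disabled or deferred during the load.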

6. Account for intentionally dropped or transformed records

Records the mapping spec said to drop or transform-away MUST be accounted for. The reconciliation table includes a row for each such category with the expected count and the rationale citing the mapping-spec row.

7. Self-check before handing off

  • Surface is named in plain language
  • Counts are reconciled with explicit expected deltas
  • Hash digests are computed with a documented method
  • Sample-based diff has a justified sample size and reports zero or itemizes findings
  • Constraint and referential checks are run and pass / fail status is recorded
  • Dropped / transformed records are accounted for, not ignored

Anti-patterns (RFC 2119)

  • The agent MUST NOT declare validation complete after checking only row counts; counts without content are weak evidence
  • The agent MUST NOT sample records non-randomly (the first N rows is not a random sample)
  • The agent MUST NOT ignore records that were intentionally dropped or transformed — they still need accounting
  • The agent MUST NOT treat "zero errors in the run" as proof of correctness without verifying coverage
  • The agent MUST NOT validate against the mapping spec only and not against actual source data — both must match
  • The agent MUST NOT use ambiguous status labels ("looks good", "probably fine") — every row is PASS or FAIL with cited evidence
  • The agent MUST document the hash method so the reconciliation is reproducible
  • The agent MUST cite the mapping-spec row for every expected delta
hat 3 · Verifier

Focus: Validate the per-unit verification artifact for the validation stage of migration. Units here are verification surfaces that test built artifacts against requirements, contracts, or standards. Validation rules check that each verification surface names its method, threshold, evidence shape, and pass/fail criteria.

Anti-patterns (RFC 2119):

  • The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose; that is workflow engine territory per architecture §1.1.
  • The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
  • The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
  • The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
  • The agent MUST name a specific failed criterion in any rejection.
  • The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Verification surface scoped to a testable boundary

The unit body MUST name exactly one boundary being verified (an API contract, a regulatory criterion, a hardware envelope, a behavior class). "Verify the system works" is a reject. The scope must be tight enough that pass/fail is unambiguous.

2. Method, threshold, and evidence shape declared

Every verification surface MUST name HOW it will be verified (test type / instrument / inspection / analysis / demonstration), the measurable threshold or expected outcome, and the shape of the recorded evidence (log file, oscilloscope trace, signed audit record, test-suite output).

3. Pass/fail criteria are mechanical

Pass/fail must be decidable without judgment calls. "Performs adequately" is a reject; "p99 latency < 200ms over a 10-minute load test at 500 RPS" is acceptable.
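A criterion is mechanical when it can be written as data and evaluated without interpretation. The sketch below shows one possible shape; the `Criterion` type, the metric name, and the rule that a missing measurement fails are all assumptions rather than anything the engine defines.

```python
import operator
from dataclasses import dataclass

_OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
        ">=": operator.ge, ">": operator.gt}


@dataclass(frozen=True)
class Criterion:
    metric: str       # name of a measured value, e.g. "p99_latency_ms"
    op: str           # one of the comparators in _OPS
    threshold: float

    def evaluate(self, measurements: dict[str, float]) -> str:
        if self.metric not in measurements:
            return "FAIL"  # assumption: missing evidence is never a pass
        passed = _OPS[self.op](measurements[self.metric], self.threshold)
        return "PASS" if passed else "FAIL"
```

The acceptable example above becomes `Criterion("p99_latency_ms", "<", 200)`, whereas "performs adequately" has no metric, comparator, or threshold to fill in, which is why it is a reject.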

4. Decision-register consistency

The unit must not propose a verification approach contradicting a recorded Decision (e.g., verifying against an SLO that the user explicitly relaxed). Cite the Decision ID.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Verification gaps that ship are how regressions reach production.

4. Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent · Parity

Mandate: The agent MUST verify the migrated target achieves functional parity with the source — downstream consumers produce identical results, real query patterns replay cleanly, performance fits within the agreed thresholds, no behavioral regression slips through to cutover. Parity gaps that ship to cutover become user-visible regressions.

Check

The agent MUST verify, filing feedback for any violation:

  • Reconciliation evidence is in place — the validator hat's quantitative reconciliation section is complete (counts, hashes, sampled field-level diffs, constraint checks). Functional parity claims rest on it.
  • Consumer-surface coverage — every read consumer named in the upstream inventory has a replay test against the target. Surfaces silently skipped are a hard finding.
  • Real query-pattern replay — replay is from production / staging logs with a justified sample size, not hand-crafted tests alone. Sample size MUST be large enough to cover the long tail.
  • Existing test suites run against target — every test suite the application has produces the same pass/fail signal against the migrated target as against the source. Net-new test failures are the highest-priority finding.
  • Performance deltas measured — p50 / p95 / p99 latency captured for each replayed query pattern with the source-vs-target delta and a PASS / DEGRADED / IMPROVED status. Any DEGRADED status is a finding tied to the threshold cited in the unit's acceptance criteria.
  • Behavioral differences itemized — ordering changes, error-code shifts, null-handling differences, timing differences are recorded with reproduction steps, not summarized as "looks the same."
  • Rollback rehearsal captured — at least one validation unit produced a rollback rehearsal record (procedure, dataset, RTO observed). Cutover depends on it; absence is a hard finding.
  • No "no errors in logs" shortcut — claims of parity rest on explicit output comparison, not on the absence of errors.

Common failure modes to look for

  • Performance numbers without source-side baseline for comparison
  • Replay tests captured against a target that's been pre-warmed in a way production won't be
  • Test suites that pass against a fixture target but haven't been run against the migrated target
  • Behavioral differences acknowledged but not reproduced step-by-step
  • Rollback rehearsal claimed but with no captured RTO observation or dataset description
  • Parity surfaces marked PASS without citing the captured outputs
  • Replay sample size justified as "looks representative" without quantitative reasoning
  • A read consumer in the inventory with no replay test in the validation evidence

5. Gate

controls advancement to the next stage
Ask

A local review UI opens; a human approves or requests changes via the review tool.

Fix loop

a separate track · Classifier → Validator → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1 · Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

  1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.

  2. Read the stage's unit list via haiku_unit_list { intent, stage }.

  3. Decide:

    • target_unit — which unit this FB counter-signals.
      • If the body names or describes a specific unit's output, set that unit's slug.
      • If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
      • When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
    • target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
      • user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
      • adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
      • drift origin → ["user"] (drift always escalates to human).
      • agent origin → [] (informational; no rerun).
  4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.

  5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.

    • blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
    • high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
    • medium — a genuine issue worth fixing; not delivery-blocking.
    • low — a nit, polish, or nice-to-have.

    Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.

  6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.

  7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
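The default routing rule in step 3 and the severity ordering in step 5 can be expressed as a lookup and a sort. This sketch does not call any haiku_* tool; it assumes the origin labels are the literal strings listed above and that each finding is a dictionary with a `severity` key, both of which are illustrative assumptions.

```python
from typing import Optional

USER_ORIGINS = {"user-chat", "user-visual", "user-question", "drift"}
REVIEW_ORIGINS = {"adversarial-review", "studio-review"}
SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_invalidates(origin: str, filer_agent: Optional[str] = None) -> list[str]:
    """Default approval roles to clear on closure, keyed by feedback origin."""
    if origin in USER_ORIGINS:
        return ["user"]       # the human re-reviews; drift always escalates to a human
    if origin in REVIEW_ORIGINS:
        if not filer_agent:
            raise ValueError("review-origin feedback needs the filing agent's name")
        return [filer_agent]  # the originating reviewer re-runs
    if origin == "agent":
        return []             # informational; no rerun
    raise ValueError(f"unrecognized origin: {origin!r}")


def dispatch_order(findings: list[dict]) -> list[dict]:
    """Order findings so higher-severity ones are dispatched first."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
```

The lookup only encodes the rule of thumb; the classifier still judges severity by actual impact and can depart from the default when the finding calls for it.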

What you do NOT do

  • You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
  • You do NOT call haiku_feedback_reject; that marks the finding invalid, and a valid finding must not be rejected. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6, which is an acknowledgement, not a rejection.)
  • You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2 · Validator

Focus: Perform quantitative reconciliation between source and target for this unit's verification surface. Row counts, hash digests, sampled field-level diffs, constraint and referential-integrity checks. Confidence is not the goal — proof is. The output gates cutover, so weak evidence here means cutover ships with weak ground.

You produce one output: the ## Reconciliation evidence section of the unit's body — the methods used, the queries run, the counts and hashes captured, the sampled diffs, and the constraint-check results.

Process

1. Name the verification surface precisely

The unit's surface is one of:

  • An entity / table comparison (every row in source has a corresponding row in target, transformed per the mapping spec)
  • A constraint / referential-integrity check (target enforces what the mapping spec said it would enforce)
  • A relationship integrity check (foreign-key cardinality holds; orphan / dangling refs are absent)
  • An invariant that crosses entities (totals reconcile, derived aggregates match)

Write the surface in plain language at the top of the section. Vague surfaces ("data looks right") cannot be reconciled.

2. Reconcile counts

Run the count comparison: source rows of category X versus target rows after migration, accounting for any rows the mapping spec dropped or merged. Output:

Metric             | Source value | Target value | Expected delta   | Actual delta    | Status
Total rows         | N            | M            | per mapping spec | M − N − dropped | PASS / FAIL
Rows per partition | ...          | ...          | ...              | ...             | ...

A non-zero unexpected delta is a hard FAIL; the rationale for any expected delta MUST cite the mapping-spec row that produced it.

3. Reconcile content via hashes

For each entity in scope, compute a stable hash digest over the canonical row representation (after the same normalization the mapping spec describes) and compare source vs. target. Hash equality is the strongest evidence; hash drift drives sample-based investigation.

Document the hash method (which fields, in which order, with which normalization) so a reviewer can re-run it.

4. Sample-based field-level diff

Take a random sample of records. The sample size MUST be justified by source volume and large enough to surface a statistically meaningful difference if one exists; the unit's elaboration phase MUST have pinned it. For each sampled record:

  • Pull source representation
  • Pull target representation
  • Apply the mapping-spec transforms to source
  • Diff transformed-source vs. target field-by-field
  • Record any diff

Zero diffs across the sample is the success signal; any non-zero diff is a finding, cited to the field that differed.
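
The sampling steps above can be sketched as follows. Everything named here is hypothetical: `sample_size` uses one common rule of thumb (the smallest sample that would contain at least one bad record with the stated confidence if bad records occur at the stated rate, assuming a large population), and `sampled_diffs` assumes source and target are available as dictionaries keyed by primary key. The pinned sample size from elaboration takes precedence over any formula.

```python
import math
import random


def sample_size(max_defect_rate: float, confidence: float) -> int:
    """Smallest n with P(at least one bad record in sample) >= confidence,
    assuming independent draws from a large population."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_defect_rate))


def sampled_diffs(source: dict, target: dict, transform, n: int, seed: int):
    """Diff transformed-source against target for a seeded random sample.

    Returns a list of (key, field, expected, actual) findings.
    """
    rng = random.Random(seed)  # a recorded seed makes the sample re-runnable
    keys = rng.sample(sorted(source), min(n, len(source)))
    findings = []
    for key in keys:
        expected = transform(source[key])
        actual = target.get(key)
        if actual is None:
            findings.append((key, "<row>", expected, None))
            continue
        for field in sorted(expected.keys() | actual.keys()):
            if expected.get(field) != actual.get(field):
                findings.append(
                    (key, field, expected.get(field), actual.get(field))
                )
    return findings


# Hypothetical mapping-spec transform: trim whitespace from names.
def trim_name(row: dict) -> dict:
    return {"name": row["name"].strip()}


source = {i: {"name": f" user{i} "} for i in range(50)}
target = {i: {"name": f"user{i}"} for i in range(50)}
target[7]["name"] = "oops"  # inject one defect for the demonstration

for finding in sampled_diffs(source, target, trim_name, n=50, seed=1):
    print(finding)
```

Sampling with a seeded generator over the sorted key set keeps the draw random while letting a reviewer reproduce exactly the same sample.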

5. Run constraint and referential checks

Verify the target enforces what the mapping spec promised:

  • Unique constraints — query for duplicate keys
  • Foreign keys — query for orphan rows
  • Check constraints — query for rows violating the check predicate
  • Not-null — query for nulls in non-null columns
  • Indexes — verify presence (and selectivity if the mapping spec specified it)
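
Each check above reduces to a query that should return no rows. The sketch below runs such queries against an in-memory SQLite database; the `customers` and `orders` tables, their columns, and the index name are hypothetical stand-ins, and a real target would use its own dialect and catalog views.

```python
import sqlite3

# Each query returns the violating rows; an empty result is a PASS.
CHECKS = {
    "unique customers.email":
        "SELECT email, COUNT(*) FROM customers "
        "WHERE email IS NOT NULL GROUP BY email HAVING COUNT(*) > 1",
    "fk orders.customer_id":
        "SELECT o.id FROM orders o "
        "LEFT JOIN customers c ON c.id = o.customer_id "
        "WHERE o.customer_id IS NOT NULL AND c.id IS NULL",
    "check orders.amount >= 0":
        "SELECT id FROM orders WHERE amount < 0",
    "not-null customers.email":
        "SELECT id FROM customers WHERE email IS NULL",
}


def run_checks(conn: sqlite3.Connection) -> dict:
    """Return {check name: (status, violating rows)}."""
    results = {}
    for name, sql in CHECKS.items():
        violations = conn.execute(sql).fetchall()
        results[name] = ("PASS" if not violations else "FAIL", violations)
    return results


def index_exists(conn: sqlite3.Connection, index_name: str) -> bool:
    """Verify index presence via SQLite's schema table."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (index_name,),
    ).fetchone()
    return row is not None


# Demonstration target with two deliberate defects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    INSERT INTO customers VALUES (1, 'a@example.com'), (2, 'a@example.com');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 99, 7.5);
""")

results = run_checks(conn)
for name, (status, violations) in results.items():
    print(name, status, violations)
print("idx_orders_customer present:", index_exists(conn, "idx_orders_customer"))
```

Returning the violating rows alongside the status gives the report the cited evidence that each PASS or FAIL needs.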

6. Account for intentionally dropped or transformed records

Records the mapping spec said to drop or transform-away MUST be accounted for. The reconciliation table includes a row for each such category with the expected count and the rationale citing the mapping-spec row.
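
A small accounting check can make "accounted for" mechanical. This sketch assumes the simplest case, where every source row either lands in the target as one row or falls into exactly one dropped category. The `DroppedCategory` shape and the `MAP-7` citation are hypothetical, and merges or splits would need their own terms in the identity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DroppedCategory:
    name: str
    expected: int   # count the mapping spec predicts
    observed: int   # count actually measured in the source
    spec_ref: str   # mapping-spec row that authorizes the drop


def account_for_rows(source_rows: int, target_rows: int, categories) -> list:
    """Return a list of problems; an empty list means every row is accounted for."""
    problems = []
    for cat in categories:
        if not cat.spec_ref:
            problems.append(f"{cat.name}: no mapping-spec citation")
        if cat.observed != cat.expected:
            problems.append(
                f"{cat.name}: expected {cat.expected}, observed {cat.observed}"
            )
    unaccounted = source_rows - target_rows - sum(c.observed for c in categories)
    if unaccounted != 0:
        problems.append(f"{unaccounted} source rows unaccounted for")
    return problems


categories = [DroppedCategory("soft-deleted", 10, 10, "MAP-7")]
print(account_for_rows(1000, 990, categories))
print(account_for_rows(1000, 989, categories))
```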

7. Self-check before handing off

  • Surface is named in plain language
  • Counts are reconciled with explicit expected deltas
  • Hash digests are computed with a documented method
  • Sample-based diff has a justified sample size and reports zero or itemizes findings
  • Constraint and referential checks are run and pass / fail status is recorded
  • Dropped / transformed records are accounted for, not ignored

Anti-patterns (RFC 2119)

  • The agent MUST NOT declare validation complete after checking only row counts; counts without content are weak evidence
  • The agent MUST NOT sample records non-randomly (the first N rows is not a random sample)
  • The agent MUST NOT ignore records that were intentionally dropped or transformed — they still need accounting
  • The agent MUST NOT treat "zero errors in the run" as proof of correctness without verifying coverage
  • The agent MUST NOT validate against the mapping spec alone; the target must match both the mapping spec and the actual source data
  • The agent MUST NOT use ambiguous status labels ("looks good", "probably fine") — every row is PASS or FAIL with cited evidence
  • The agent MUST document the hash method so the reconciliation is reproducible
  • The agent MUST cite the mapping-spec row for every expected delta
fix-hat 3 · Feedback Assessor

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
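
The two closure rules above amount to a set difference plus a blocker check. The following is only an illustration of that logic: `closure_decision` and its parameters are hypothetical, it returns a label rather than calling any engine tool, and the row identifiers are made up.

```python
def closure_decision(
    enumerated: set[str],
    verified_resolved: set[str],
    handoff_blockers: list[str],
) -> tuple[str, list[str]]:
    """Decide between advancing and rejecting.

    enumerated:        every item the finding lists as defective
    verified_resolved: items the assessor independently confirmed on disk
    handoff_blockers:  unresolved issues named in the assessor's own message
    """
    survivors = sorted(enumerated - verified_resolved)
    open_items = survivors + list(handoff_blockers)
    if open_items:
        return ("reject_hat", open_items)
    return ("advance_hat", [])


# A finding that enumerates 8 rows, of which only 3 were verified as fixed.
enumerated = {f"row-{i}" for i in range(1, 9)}
verified = {"row-1", "row-2", "row-3"}

print(closure_decision(enumerated, verified, []))
print(closure_decision(enumerated, enumerated, ["deferred to CI"]))
print(closure_decision(enumerated, enumerated, []))
```

The point of the set difference is that the default for an unchecked item is "still open": an item only leaves the survivor list once the assessor has verified it, not because the fixer reported it.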

Anti-patterns (RFC 2119):

  • The agent MUST NOT edit any file — you are a verifier, not a fixer
  • The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
  • The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
  • The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
  • The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
  • The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat