Quality Assurance · stage 2 of 5

Design Tests

Auto gate

Design test cases and plan automation

Turn the test strategy into executable test artifacts: explicit test cases, a traceability matrix back to requirements, and an assessment of which cases to automate. This is where the strategy's intent becomes something a tester or a framework can actually run.

Scope

Test design and automation strategy — case definition (preconditions, steps, expected results), requirement traceability, and automation feasibility. Design-tests decides what the tests are, not what to test (that's plan), whether they pass (execute-tests), or what failures mean (analyze).

What to do

Trace every test case back to a requirement or quality dimension the strategy named — leave no case unanchored and no in-scope requirement uncovered.
Apply real design techniques (boundary, equivalence partition, decision table, state transition) rather than happy-path-only cases.
Decide which cases automate and which stay manual, and justify each call against cost and stability.
Write cases precise enough that someone other than the author could run them and get the same result.

What NOT to do

Don't redefine scope or risk priority — that's a revisit to plan, not a quiet reinterpretation here.
Don't execute the cases or capture results; designing and running are separate stages.
Don't leave a strategy-named area without coverage.
Don't write cases whose expected result is ambiguous or unverifiable.

How the engine runs this stage

1 · Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Inputs consumed

test-strategy from Plan

Discovery fan-out

knowledge artifact

Test Suite Spec

Test case inventory with requirement traceability and automation plan.

Content Guide

Structure the spec for efficient test execution:

Traceability matrix -- mapping of test cases to requirements
Test cases -- for each: ID, description, preconditions, steps, expected results, pass/fail criteria
Test data requirements -- data sets needed including boundary conditions and edge cases
Automation plan -- which tests to automate, framework, and tooling requirements
Coverage analysis -- coverage targets vs planned coverage with gap justification
Execution priority -- test execution order based on strategy priorities

Quality Signals

Every requirement has at least one associated test case
Test cases have explicit expected results, not just steps
Automation candidates are selected based on ROI analysis
Coverage meets the targets defined in the test strategy

Phase guidance

phase override · ELABORATION

Design Tests Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

"Test suite spec includes test cases for every requirement with traceability matrix linking tests to requirements"
"Each test case has explicit preconditions, steps, expected results, and pass/fail criteria"
"Automation feasibility assessment identifies which tests to automate, which to run manually, and the rationale"

Bad criteria — vague (no clear check)

"Test cases are designed"
"Automation is planned"
"Tests are ready"

Outputs produced

output template

Test Suite Specification

Test cases with traceability matrix, automation plan, and test data requirements.

Expected Artifacts

Test cases -- each with preconditions, steps, expected results, and pass/fail criteria
Traceability matrix -- tests linked to requirements ensuring coverage
Automation plan -- which tests to automate vs manual with rationale
Test data requirements -- data needed for test execution documented

Quality Signals

Every requirement has at least one test case linked via traceability matrix
Test cases have explicit preconditions, steps, and expected results
Automation feasibility is assessed with framework and tooling identified
Coverage meets the strategy targets

2 · Review

pre-execute · agents audit the planned spec before any code lands

review agent · Traceability

Mandate: The agent MUST verify every test case traces forward to a requirement / risk / AC item it covers AND every upstream requirement traces backward to at least one covering case. Coverage is bidirectional — orphan cases and uncovered requirements are both findings.

Check

The agent MUST verify each of the following and file feedback for any violation:

Forward trace (case → requirement) — Every test case names the requirement / risk / AC item it covers. Cases with no upstream trace are scope creep; flag them.
Backward trace (requirement → case) — Every requirement / risk / AC item in the upstream strategy has at least one covering case. Uncovered items are coverage gaps; flag the responsible hat (designer).
Technique honesty — Every case names the design technique used (boundary, equivalence partitioning, decision-table, state-transition, scenario, exploratory charter). A case claiming a technique but applying a different shape (e.g., labeled "boundary" but only testing one value) is a finding.
Format completeness — Every case has explicit preconditions, single-action steps, observable expected results, and explicit PASS / FAIL criteria.
Error and boundary coverage per case set — For any in-scope area, the case set includes happy path, error path, and boundary case. Happy-only suites are incomplete.
Severity consistency — Every case's severity label matches the upstream strategy's taxonomy. Mid-suite invention of a new severity band is a finding.
Pyramid placement — Every case recommended for automation is placed on a layer appropriate to its scope (unit / integration / contract / end-to-end / performance / accessibility / security-smoke). End-to-end cases that should be unit-level are a finding.
Automation rationale — Every AUTOMATE / MANUAL recommendation has a rationale. Recommendations without rationale are findings.
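
The forward and backward trace checks above can be sketched as a set comparison. This is a minimal illustration, not the review agent's actual implementation: the `Case` type, the `audit_trace` function, and the sample IDs are hypothetical, and the sketch assumes each case carries a list of upstream IDs it claims to cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    traces_to: tuple[str, ...] = ()  # requirement / risk / AC IDs the case claims to cover


def audit_trace(cases: list[Case], upstream_ids: set[str]) -> dict[str, list[str]]:
    """Return orphan cases, uncovered upstream items, and traces to unknown IDs."""
    covered: set[str] = set()
    orphans: list[str] = []
    unknown: list[str] = []
    for case in cases:
        known = [t for t in case.traces_to if t in upstream_ids]
        unknown.extend(f"{case.case_id} -> {t}" for t in case.traces_to if t not in upstream_ids)
        if not known:
            orphans.append(case.case_id)  # forward trace fails: scope creep
        covered.update(known)
    return {
        "orphan_cases": orphans,
        "uncovered_items": sorted(upstream_ids - covered),  # backward trace fails: coverage gap
        "unknown_targets": unknown,
    }


findings = audit_trace(
    [
        Case("TC-auth-01", ("REQ-1.2",)),
        Case("TC-auth-04", ("REQ-1.2",)),
        Case("TC-auth-09", ()),  # hypothetical case with no upstream trace
    ],
    upstream_ids={"REQ-1.2", "RISK-3"},
)
print(findings)
```

Both result lists are findings in their own right: an empty `orphan_cases` list says nothing about whether `uncovered_items` is empty, which is why the check runs in both directions.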

Common failure modes to look for

A traceability matrix where every requirement maps to "covered by all cases" — that's not trace, that's hand-wave
Cases with vague expected results ("system responds correctly")
A test set with only happy-path cases for an area marked high-risk in the strategy
Boundary-value cases that test only one boundary value, not at / inside / outside
A Scenario Outline / parameterized case used to merge genuinely different behaviors
Every case pushed to end-to-end automation because that's "what the team knows"
An exploratory charter listed as AUTOMATE — charters belong in manual
Severity labels drifting between sibling units (P1 in one, Critical in another)
A requirement marked as "indirectly covered" without a specific case ID

3 · Execute

per-unit baton · Designer → Automator → Verifier

hat 1 · Automator

Focus: Assess automation feasibility for every test case the designer produced. Decide which cases automate, which stay manual, and why. Automation is leverage when it amortizes well over many runs; it's a tax when the case runs rarely, breaks on every UI change, or guards behavior nobody actually relies on.

You read the designer's test cases and traceability matrix. You produce the unit's automation feasibility assessment — appended to the same artifact. You do not implement the automation; you do not pick named products. You decide what's worth automating and what category of framework it belongs to.

Process

1. Read your inputs

The unit's test cases (preconditions, steps, expected results, severity, technique)
The upstream strategy slice — what's high-risk, what's regression-prone, what's release-blocking
Sibling units' automation assessments — keep framework-category names consistent ("unit", "integration", "contract", "end-to-end", "performance", "accessibility", "security-smoke")
Recorded Decisions on automation posture (mandatory automation tiers, manual-only categories, environment constraints)

2. Place each case on the test pyramid

The test pyramid is the load-bearing decision framework. For each case, pick the layer:

Unit — exercises a single function / class / module in isolation. Fast, deterministic, plentiful. Run on every commit.
Integration — exercises a boundary between components (service ↔ DB, service ↔ service contract, module ↔ module). Slower, fewer, run on every PR.
Contract — exercises a published interface (API schema, event payload). Owned by either side of the contract, run on every change to that side.
End-to-end — exercises a user-visible flow through the full stack. Slowest, fewest, run on a cadence (per release, per main merge).
Performance / load — exercises throughput, latency, scaling under load profile. Run on dedicated cadence, not every commit.
Accessibility — exercises WCAG / ARIA conformance through automated probes; manual confirmation for nuanced cases. Run on UI-changing PRs.
Security smoke — exercises basic auth / input / authorization classes; the deep pen-test lives in a security stage. Run on relevant-surface changes.

A case at the wrong layer is either end-to-end automation that breaks on every UI change when a unit-level test would have sufficed, or a unit-level test that doesn't actually prove the integration. Justify the placement when it's non-obvious.

3. ROI decision per case

For each case, assess:

Frequency of execution — every commit, every PR, every release, on-demand only
Cost of authoring — small / medium / large (boundary cases are typically small; full e2e scenarios are typically large)
Cost of maintenance — does the case break when implementation details change (high-maintenance) or only when behavior changes (low-maintenance)?
Cost of manual run — minutes per execution × executions per cycle
Risk if regression slips — high / medium / low based on the strategy's risk priority

Recommend AUTOMATE if (frequency × manual-cost) > (authoring + maintenance), weighted by regression risk. Recommend MANUAL if the case runs rarely OR the cost-of-maintenance dominates OR the case requires human judgment (exploratory, usability nuance, security smoke that needs an attacker mindset).

The recommendation table:

Case ID	Layer	Recommendation	Rationale
TC-auth-01	unit	AUTOMATE	runs every commit, low maintenance, P1 risk
TC-onboard-07	end-to-end	AUTOMATE	per-release run, high regression risk, scenario test
TC-exploratory-charter-3	exploratory	MANUAL	needs human judgment; charter not script
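
The ROI rule in this step can be sketched as a small comparison. Everything numeric here is an assumption for illustration: the source says the comparison is "weighted by regression risk" but gives no weights, so `RISK_WEIGHT`, the `cycles` planning horizon, the minute-based units, and the `TC-report-02` case are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical multipliers; the stage text does not define numeric risk weights.
RISK_WEIGHT = {"high": 2.0, "medium": 1.0, "low": 0.5}


@dataclass(frozen=True)
class CaseCost:
    case_id: str
    runs_per_cycle: int                   # frequency of execution
    manual_minutes_per_run: float         # cost of one manual run
    authoring_minutes: float              # one-off cost of writing the automation
    maintenance_minutes_per_cycle: float  # recurring cost of keeping it green
    regression_risk: str                  # "high" / "medium" / "low"
    needs_human_judgment: bool = False


def recommend(case: CaseCost, cycles: int = 10) -> tuple[str, str]:
    """Return (recommendation, rationale) over an assumed horizon of `cycles`."""
    if case.needs_human_judgment:
        return "MANUAL", "needs human judgment"
    manual_cost = case.runs_per_cycle * case.manual_minutes_per_run * cycles
    automation_cost = case.authoring_minutes + case.maintenance_minutes_per_cycle * cycles
    weighted_saving = manual_cost * RISK_WEIGHT[case.regression_risk]
    if weighted_saving > automation_cost:
        return "AUTOMATE", (
            f"weighted manual cost {weighted_saving:.0f} min > automation cost {automation_cost:.0f} min"
        )
    return "MANUAL", (
        f"automation cost {automation_cost:.0f} min >= weighted manual cost {weighted_saving:.0f} min"
    )


auto = recommend(CaseCost("TC-auth-01", 200, 2, 60, 5, "high"))
rare = recommend(CaseCost("TC-report-02", 1, 10, 480, 60, "low"))
charter = recommend(
    CaseCost("TC-exploratory-charter-3", 1, 30, 0, 0, "high", needs_human_judgment=True)
)
for case_id, (decision, why) in zip(
    ("TC-auth-01", "TC-report-02", "TC-exploratory-charter-3"), (auto, rare, charter)
):
    print(case_id, decision, why)
```

The returned rationale string is what would land in the Rationale column; the numbers only make the trade-off explicit, they do not replace the written justification.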

4. Framework category (NOT product)

Per layer, declare the framework category needed — "unit test runner", "http-mock-based integration", "contract testing", "browser-driving end-to-end", "load generator", "accessibility probe", "security smoke / fuzzer". The overlay picks the actual product.

5. Maintainability principles

For cases being automated, declare the maintainability principles the implementing team must follow:

Test the contract, not the implementation — assert on observable behavior, not on internal calls or DOM structure that may shift
Stable selectors / fixtures — name the abstraction (data-testid, semantic role, named fixture) without naming a tool
Idempotent setup / teardown — every case can run independently
Deterministic timing — no wait(N seconds) heuristics; use explicit ready-conditions
One responsibility per case — same rule as the designer's "one action per step"
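
As one way to read the "deterministic timing" principle, the sketch below waits on an explicit ready-condition with an upper bound instead of sleeping for a fixed duration. The `wait_until` helper and `FakeJob` class are hypothetical names, not part of any framework the stage prescribes; the short sleep inside the loop is only a polling interval, and the outcome depends on the condition, not on elapsed time.

```python
import time
from typing import Callable


def wait_until(ready: Callable[[], bool], timeout_s: float = 5.0, poll_s: float = 0.05) -> None:
    """Block until `ready()` is true; raise if it is not met within `timeout_s`."""
    deadline = time.monotonic() + timeout_s
    while not ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(poll_s)


class FakeJob:
    """Stand-in for a system under test that becomes ready on the third status check."""

    def __init__(self) -> None:
        self.polls = 0

    def is_done(self) -> bool:
        self.polls += 1
        return self.polls >= 3


job = FakeJob()
wait_until(job.is_done, timeout_s=1.0, poll_s=0.01)
print("ready after", job.polls, "checks")
```

A timeout here fails the case loudly with a named condition, whereas a fixed wait either wastes time or passes and fails depending on machine load.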

6. Self-check before handing off

Every case has a recommendation (AUTOMATE or MANUAL) with rationale
Every AUTOMATE case is placed on the right pyramid layer
Framework categories are named without product names
Maintainability principles are listed
Recommendations are consistent with sibling units' assessments

Anti-patterns (RFC 2119)

The agent MUST NOT automate everything without considering maintenance cost vs execution frequency
The agent MUST NOT choose automation tools before understanding the test requirements
The agent MUST NOT design automation that is tightly coupled to implementation details (UI markup, internal calls, private state)
The agent MUST account for test data management and environment setup in automation — they're part of the maintenance cost
The agent MUST NOT name specific products (named runners, browser drivers, load tools, fuzzers, accessibility probes) in the plugin default — name the category instead, let the overlay pick the product
The agent MUST NOT push exploratory or judgment-heavy cases into automation — they belong in manual charters
The agent MUST NOT place every case at the end-to-end layer to look thorough; the pyramid exists for a reason
The agent MUST flag cases where automation is impossible in the current environment (missing hooks, opaque integrations) rather than silently dropping them
The agent MUST NOT invent automation categories not on the pyramid; if a case doesn't fit, escalate the categorization

hat 2 · Designer

Focus: Design test cases that turn the upstream test strategy into executable, traceable artifacts. Each case has explicit preconditions, steps, expected results, and pass / fail criteria. Each case traces back to the requirement or risk it covers. Apply test-design techniques deliberately — don't write happy-path-only suites and don't write case-per-line-of-code suites.

You produce the test-case design and the traceability matrix for this unit. The automator hat adds the automation feasibility assessment. The verifier validates substance.

Process

1. Read your inputs

The unit's upstream strategy slice (scope, quality dimensions, risk priority, exit criteria for this area)
The intent's product / requirements context (the behavior being tested)
Recorded Decisions on test depth, severity bands, or required techniques
Sibling units' test cases — keep naming conventions, severity labels, and traceability IDs consistent

2. Pick the design techniques per case

Different behaviors need different techniques. Be explicit about which one each case applies, so a reviewer sees the coverage logic:

Equivalence partitioning — group inputs into classes (valid / invalid / boundary classes); one case per class, not one per input value
Boundary value analysis — at, just-inside, and just-outside each boundary. Off-by-one bugs live here.
Decision tables — for behavior that depends on combinations of conditions; one row per condition combination with the expected action
State-transition — for stateful behavior; cover each transition, each invalid transition, and the boundary states (start / end / interrupted)
Use-case / scenario — end-to-end flows that exercise multiple components in user-visible sequences
Error-guessing / exploratory charters — for unknowns; produce a charter (mission + scope + duration) rather than scripted steps

Reference the technique used in the test case header. "Pattern: boundary value analysis on quantity field" makes the design auditable.
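
To make the boundary value analysis bullet concrete, here is a minimal sketch that derives the at / just-inside / just-outside points for an inclusive integer range. The quantity rule (1 to 100 inclusive) and the function names are hypothetical and stand in for whatever behavior the requirement actually defines.

```python
def boundary_values(low: int, high: int) -> dict[str, list[int]]:
    """Boundary points for an inclusive integer range [low, high]."""
    if low >= high:
        raise ValueError("expected low < high")
    return {
        "at": [low, high],
        "just_inside": [low + 1, high - 1],
        "just_outside": [low - 1, high + 1],
    }


def quantity_is_valid(quantity: int) -> bool:
    # Hypothetical rule under test: quantity must be between 1 and 100 inclusive.
    return 1 <= quantity <= 100


values = boundary_values(1, 100)
for label, points in values.items():
    for quantity in points:
        expected = label != "just_outside"  # only the outside points should be rejected
        assert quantity_is_valid(quantity) == expected, (label, quantity)
print(values)
```

Six points per range is the shape a reviewer would expect behind a case labeled "boundary"; a single value at one edge does not qualify.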

3. Test case format

Every case has the same structure:

ID: TC-<slice>-<NN>
Title: <one-line user-language summary>
Pattern: <technique used — equivalence / boundary / decision-table / state-transition / scenario / exploratory>
Traces to: <REQ-ID / RISK-ID / AC item>
Severity if it fails: <P0 / P1 / P2 / P3 — match the strategy's taxonomy>

Preconditions:
- <state of the system before this case runs>
- <state of the data>
- <auth context if applicable>

Steps:
1. <single action; one per step>
2. <next action>

Expected results:
- <observable outcome 1>
- <observable outcome 2>

Pass / fail criteria:
- <PASS condition stated as a check against the expected results>
- <FAIL condition — what specifically constitutes failure>

Principles:

One action per step. "Click submit and verify the toast" is two steps masquerading as one.
Observable outcomes. "User is logged in" is observable (URL change, session cookie, profile visible). "Auth works" is not.
Explicit fail criteria. Saying what PASS means is necessary but not sufficient — FAIL should be unambiguous too.
Severity matches the strategy. Don't introduce new severity bands here.
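
The case structure and principles above lend themselves to a mechanical completeness check. The sketch below is one possible encoding, with hypothetical names (`CaseSpec`, `format_findings`); the " and " test for multi-action steps is a crude heuristic chosen for illustration, not a rule from the stage.

```python
from dataclasses import dataclass, field

ALLOWED_PATTERNS = {
    "equivalence", "boundary", "decision-table", "state-transition", "scenario", "exploratory",
}


@dataclass
class CaseSpec:
    case_id: str
    title: str
    pattern: str
    traces_to: list[str]
    severity: str
    preconditions: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    expected_results: list[str] = field(default_factory=list)
    pass_criteria: list[str] = field(default_factory=list)
    fail_criteria: list[str] = field(default_factory=list)


def format_findings(case: CaseSpec, severities: set[str]) -> list[str]:
    """List format problems; an empty list means the case is structurally complete."""
    findings: list[str] = []
    if case.pattern not in ALLOWED_PATTERNS:
        findings.append(f"unknown pattern: {case.pattern}")
    if not case.traces_to:
        findings.append("no upstream trace")
    if case.severity not in severities:
        findings.append(f"severity {case.severity} not in strategy taxonomy")
    for name in ("preconditions", "steps", "expected_results", "pass_criteria", "fail_criteria"):
        if not getattr(case, name):
            findings.append(f"{name} is empty")
    for number, step in enumerate(case.steps, start=1):
        if " and " in step.lower():  # heuristic only: flags likely two-in-one steps
            findings.append(f"step {number} may combine two actions")
    return findings


draft = CaseSpec(
    case_id="TC-auth-01",
    title="Log in with valid credentials",
    pattern="scenario",
    traces_to=["REQ-1.2"],
    severity="P1",
    preconditions=["User account exists"],
    steps=["Click submit and verify the toast"],
    expected_results=["Profile page is visible"],
    pass_criteria=["Profile page is visible"],
)
problems = format_findings(draft, severities={"P0", "P1", "P2", "P3"})
print(problems)
```

A check like this only catches missing structure. Whether an expected result is genuinely observable still needs a reader.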

4. Build the traceability matrix

One row per requirement / AC item / risk in the upstream strategy slice. Each row names the cases that cover it:

Requirement / Risk ID	Description	Covering Cases	Coverage Type
REQ-1.2	verbatim	TC-auth-01, TC-auth-04	Functional + boundary
RISK-3	verbatim	TC-auth-07	Exploratory charter

A requirement with zero covering cases is a gap — name it as a gap rather than silently dropping it. Don't pad coverage with duplicate cases (TC-01 and TC-02 both check the happy path); the reviewer should be able to scan and see real differentiation.
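
Building the matrix is an inversion of each case's "Traces to" field. A minimal sketch, assuming the cases are available as a mapping from case ID to traced IDs; `build_matrix`, the placeholder descriptions, and `REQ-2.1` are hypothetical and exist only to show a gap row.

```python
from collections import defaultdict


def build_matrix(
    requirements: dict[str, str], case_traces: dict[str, list[str]]
) -> list[tuple[str, str, str]]:
    """Return (id, description, covering cases) rows, one per upstream item."""
    covering: defaultdict[str, list[str]] = defaultdict(list)
    for case_id, traced_ids in case_traces.items():
        for req_id in traced_ids:
            covering[req_id].append(case_id)
    rows = []
    for req_id, description in requirements.items():
        # A requirement with no cases is named as a gap instead of being dropped.
        cases = ", ".join(sorted(covering[req_id])) or "GAP: no covering case"
        rows.append((req_id, description, cases))
    return rows


rows = build_matrix(
    requirements={
        "REQ-1.2": "<verbatim requirement text>",
        "RISK-3": "<verbatim risk text>",
        "REQ-2.1": "<verbatim requirement text>",  # hypothetical uncovered item
    },
    case_traces={
        "TC-auth-01": ["REQ-1.2"],
        "TC-auth-04": ["REQ-1.2"],
        "TC-auth-07": ["RISK-3"],
    },
)
for row in rows:
    print("\t".join(row))
```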

5. Per-discipline format adaptation

Different test types need different shapes. Pick the right format up front:

UI / front-end cases — steps name screens / components / states; expected results are visible states and observable side effects
API / contract cases — steps name endpoint + payload; expected results are status code, response schema, side effects (DB, events)
Integration cases — steps name the boundary (Service A → Service B); expected results name the contract upheld at the boundary
Performance / load cases — preconditions name the load profile (concurrent users, request rate); expected results are thresholds (p95 / p99 latency, error rate)
Accessibility cases — preconditions name the assistive tech context (screen reader, keyboard-only, high contrast); expected results name the WCAG / ARIA criterion satisfied
Security smoke cases — steps exercise the attack class (authn bypass attempt, input injection, missing-authorization access); expected results are the system rejecting / sanitizing as designed
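
For performance / load cases, "expected results are thresholds" can be made checkable along these lines. This sketch uses the nearest-rank percentile definition, which is one of several common definitions; the function names, limits, and sample data are assumptions for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_thresholds(
    latencies_ms: list[float],
    error_count: int,
    p95_limit_ms: float,
    p99_limit_ms: float,
    max_error_rate: float,
) -> dict[str, bool]:
    """Evaluate each threshold separately so a failure names the criterion that broke."""
    return {
        "p95": percentile(latencies_ms, 95) <= p95_limit_ms,
        "p99": percentile(latencies_ms, 99) <= p99_limit_ms,
        "error_rate": error_count / len(latencies_ms) <= max_error_rate,
    }


latencies = [float(ms) for ms in range(1, 101)]  # synthetic sample: 1 ms .. 100 ms
result = check_thresholds(
    latencies, error_count=1, p95_limit_ms=100, p99_limit_ms=98, max_error_rate=0.02
)
print(percentile(latencies, 95), percentile(latencies, 99), result)
```

Whichever percentile definition the team uses should be stated in the case's preconditions, since different definitions can disagree on small samples.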

6. Self-check before handing off

Every requirement / risk in the strategy slice has at least one covering case OR is named as a gap
Every case names the technique used (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
Every case has explicit preconditions, single-action steps, observable expected results, and PASS / FAIL criteria
Severity labels match the strategy's taxonomy
Traceability matrix has no orphan cases and no uncovered requirements (without a gap callout)
Naming conventions match sibling units

Anti-patterns (RFC 2119)

The agent MUST NOT write test cases without explicit expected results AND explicit fail criteria
The agent MUST NOT design tests that only cover the happy path — every case set covers at least one error and one boundary
The agent MUST maintain traceability between every test case and a requirement / risk / AC item; orphan cases get rejected
The agent MUST NOT create unnecessarily verbose cases that re-test obvious state (every step must add information)
The agent MUST NOT invent a new severity / priority taxonomy mid-suite — match the strategy
The agent MUST name the design technique each case applies (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
The agent MUST NOT pad coverage with near-duplicate cases that don't exercise meaningfully different inputs
The agent MUST NOT name specific test-management or case-tracking products in the plugin default — overlay territory
The agent MUST flag a requirement with zero covering cases as a gap explicitly, never as silence

hat 3 · Verifier

Focus: Validate the per-unit design/synthesis artifact for the design-tests stage of quality-assurance. Units here are test design — designed outputs that downstream stages execute against. Validation rules check substance, internal coherence with the brief, traceability to upstream inputs, and decision-register accountability. NOT executable verify-commands.

Anti-patterns (RFC 2119):

The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
The agent MUST name a specific failed criterion in any rejection.
The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Artifact answers its design brief

The unit's title and first paragraph define the design problem. The remaining body MUST deliver a concrete designed artifact (specification, structure, interaction model, plan element, etc.) — not an outline, not a deferral, not a "we'll figure this out later".

2. Trace to upstream inputs

Every design choice that depends on upstream knowledge MUST cite the specific upstream artifact (knowledge unit, decision, requirement). Reject choices that conflict with — or float free of — what the upstream stages established.

3. Internal coherence

Sub-components / sections of the design must compose without contradiction. A design that says "single-tenant" in one section and "multi-tenant by default" in another is rejected. Cite the contradicting paragraphs.

4. Decision-register consistency

The unit must not propose an option contradicting a recorded Decision. Cite the Decision ID.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Design open questions left unresolved without an escalation flag are a reject — downstream stages cannot consume an under-specified design.

4 · Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent · Traceability

Mandate: The agent MUST verify every test case traces forward to a requirement / risk / AC item it covers AND every upstream requirement traces backward to at least one covering case. Coverage is bidirectional: orphan cases and uncovered requirements are both findings.

Check

The agent MUST verify each of the following and file feedback for any violation:

Forward trace (case → requirement) — Every test case names the requirement / risk / AC item it covers. Cases with no upstream trace are scope creep; flag them.
Backward trace (requirement → case) — Every requirement / risk / AC item in the upstream strategy has at least one covering case. Uncovered items are coverage gaps; flag the responsible hat (designer).
Technique honesty — Every case names the design technique used (boundary, equivalence partitioning, decision-table, state-transition, scenario, exploratory charter). A case claiming a technique but applying a different shape (e.g., labeled "boundary" but only testing one value) is a finding.
Format completeness — Every case has explicit preconditions, single-action steps, observable expected results, and explicit PASS / FAIL criteria.
Error and boundary coverage per case set — For any in-scope area, the case set includes happy path, error path, and boundary case. Happy-only suites are incomplete.
Severity consistency — Every case's severity label matches the upstream strategy's taxonomy. Mid-suite invention of a new severity band is a finding.
Pyramid placement — Every case recommended for automation is placed on a layer appropriate to its scope (unit / integration / contract / end-to-end / performance / accessibility / security-smoke). End-to-end cases that should be unit-level are a finding.
Automation rationale — Every AUTOMATE / MANUAL recommendation has a rationale. Recommendations without rationale are findings.

Common failure modes to look for

A traceability matrix where every requirement maps to "covered by all cases" — that's not trace, that's hand-wave
Cases with vague expected results ("system responds correctly")
A test set with only happy-path cases for an area marked high-risk in the strategy
Boundary-value cases that test only one boundary value, not at / inside / outside
A Scenario Outline / parameterized case used to merge genuinely different behaviors
Every case pushed to end-to-end automation because that's "what the team knows"
An exploratory charter listed as AUTOMATE — charters belong in manual
Severity labels drifting between sibling units (P1 in one, Critical in another)
A requirement marked as "indirectly covered" without a specific case ID

5 · Gate

controls advancement to the next stage

Auto

The harness advances automatically — no human in the loop at this gate.

Fix loop

a separate track · Classifier → Designer → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1 · Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation: the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop has started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
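
The default invalidation rule and the severity ordering described above reduce to two small lookups. This is a sketch of the decision logic only, not the engine's implementation: `default_invalidates`, `dispatch_order`, the finding dictionaries, and the `FB-…` IDs are hypothetical, and no haiku_* tool is called.

```python
from typing import Optional

SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_invalidates(origin: str, filer: Optional[str] = None) -> list[str]:
    """Default approval roles to clear on closure, keyed by the feedback's origin."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # the human re-reviews; drift always escalates to human
    if origin in {"adversarial-review", "studio-review"}:
        if not filer:
            raise ValueError("review-origin feedback needs the filing agent's name")
        return [filer]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings: list[dict]) -> list[dict]:
    """Higher-severity findings first; ties keep their original order (sorted is stable)."""
    return sorted(findings, key=lambda finding: SEVERITY_RANK[finding["severity"]])


queue = dispatch_order(
    [
        {"id": "FB-03", "severity": "low"},
        {"id": "FB-01", "severity": "blocker"},
        {"id": "FB-02", "severity": "medium"},
        {"id": "FB-04", "severity": "blocker"},
    ]
)
print([finding["id"] for finding in queue])
print(default_invalidates("studio-review", filer="traceability"))
```

These are defaults only. The classifier still judges each finding on its actual impact, as the rubric requires.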

What you do NOT do

You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
You do NOT call haiku_feedback_reject; that marks the finding invalid, and a valid finding can't be rejected. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6; that's an acknowledgement, not a rejection.)
You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2 · Designer

Focus: Design test cases that turn the upstream test strategy into executable, traceable artifacts. Each case has explicit preconditions, steps, expected results, and pass / fail criteria. Each case traces back to the requirement or risk it covers. Apply test-design techniques deliberately: don't write happy-path-only suites and don't write case-per-line-of-code suites.

You produce the test-case design and the traceability matrix for this unit. The automator hat adds the automation feasibility assessment. The verifier validates substance.

Process

1. Read your inputs

The unit's upstream strategy slice (scope, quality dimensions, risk priority, exit criteria for this area)
The intent's product / requirements context (the behavior being tested)
Recorded Decisions on test depth, severity bands, or required techniques
Sibling units' test cases — keep naming conventions, severity labels, and traceability IDs consistent

2. Pick the design techniques per case

Different behaviors need different techniques. Be explicit about which one each case applies, so a reviewer sees the coverage logic:

Equivalence partitioning — group inputs into classes (valid / invalid / boundary classes); one case per class, not one per input value
Boundary value analysis — at, just-inside, and just-outside each boundary. Off-by-one bugs live here.
Decision tables — for behavior that depends on combinations of conditions; one row per condition combination with the expected action
State-transition — for stateful behavior; cover each transition, each invalid transition, and the boundary states (start / end / interrupted)
Use-case / scenario — end-to-end flows that exercise multiple components in user-visible sequences
Error-guessing / exploratory charters — for unknowns; produce a charter (mission + scope + duration) rather than scripted steps

Reference the technique used in the test case header. "Pattern: boundary value analysis on quantity field" makes the design auditable.
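
Two of the techniques above are mechanical enough to sketch. The following is an illustration only, assuming an inclusive integer range for boundary value analysis and a small set of discrete conditions for a decision table; the quantity range 1..99 and the condition names are made-up examples, not requirements from any strategy:

```python
from itertools import product


def boundary_values(low, high):
    """At, just-inside, and just-outside values for an inclusive integer range."""
    return {
        "just_outside_low": low - 1,
        "at_low": low,
        "just_inside_low": low + 1,
        "just_inside_high": high - 1,
        "at_high": high,
        "just_outside_high": high + 1,
    }


def decision_table(conditions):
    """One row per combination of condition values; the expected action is filled in by hand."""
    names = list(conditions)
    combos = product(*(conditions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical quantity field that accepts 1..99 inclusive.
quantity_cases = boundary_values(1, 99)

# Hypothetical rule driven by two boolean conditions: four rows to cover.
rows = decision_table({"logged_in": [True, False], "cart_empty": [True, False]})
```

Each generated value or row would become one case, with the technique recorded in its Pattern header.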

3. Test case format

Every case has the same structure:

ID: TC-<slice>-<NN>
Title: <one-line user-language summary>
Pattern: <technique used — equivalence / boundary / decision-table / state-transition / scenario / exploratory>
Traces to: <REQ-ID / RISK-ID / AC item>
Severity if it fails: <P0 / P1 / P2 / P3 — match the strategy's taxonomy>

Preconditions:
- <state of the system before this case runs>
- <state of the data>
- <auth context if applicable>

Steps:
1. <single action; one per step>
2. <next action>

Expected results:
- <observable outcome 1>
- <observable outcome 2>

Pass / fail criteria:
- <PASS condition stated as a check against the expected results>
- <FAIL condition — what specifically constitutes failure>

Principles:

One action per step. "Click submit and verify the toast" is two steps masquerading as one.
Observable outcomes. "User is logged in" is observable (URL change, session cookie, profile visible). "Auth works" is not.
Explicit fail criteria. Saying what PASS means is necessary but not sufficient — FAIL should be unambiguous too.
Severity matches the strategy. Don't introduce new severity bands here.
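
The structure and principles above can be checked mechanically before review. This is a rough sketch, not the plugin's validator: the CaseSpec class, the lint function, the regular expression for TC-<slice>-<NN>, and the "and / then" heuristic for multi-action steps are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Assumed reading of TC-<slice>-<NN>: lowercase slice name, at least two digits.
ID_PATTERN = re.compile(r"^TC-[a-z0-9-]+-\d{2,}$")
# Crude heuristic (assumption): joining words hint at more than one action.
MULTI_ACTION_HINT = re.compile(r"\b(and|then)\b", re.IGNORECASE)


@dataclass
class CaseSpec:
    id: str
    title: str
    pattern: str
    traces_to: list
    severity: str
    preconditions: list = field(default_factory=list)
    steps: list = field(default_factory=list)
    expected: list = field(default_factory=list)
    pass_criteria: list = field(default_factory=list)
    fail_criteria: list = field(default_factory=list)


def lint(case):
    """Return a list of problems; an empty list means the case has the required shape."""
    problems = []
    if not ID_PATTERN.match(case.id):
        problems.append("id does not match TC-<slice>-<NN>")
    if not case.traces_to:
        problems.append("orphan case: traces to no requirement / risk / AC item")
    for name in ("preconditions", "steps", "expected", "pass_criteria", "fail_criteria"):
        if not getattr(case, name):
            problems.append(f"missing {name}")
    for number, step in enumerate(case.steps, start=1):
        if MULTI_ACTION_HINT.search(step):
            problems.append(f"step {number} may combine several actions")
    return problems


# Hypothetical draft that breaks two of the principles.
draft = CaseSpec(
    id="TC-auth-01",
    title="Valid credentials sign the user in",
    pattern="equivalence",
    traces_to=["REQ-1.2"],
    severity="P1",
    preconditions=["A registered account exists"],
    steps=["Click submit and verify the toast"],
    expected=["Profile page is visible"],
    pass_criteria=["Profile page renders for the signed-in user"],
)
problems = lint(draft)
```

Here the draft is flagged twice: it has no fail criteria, and its single step bundles an action with a verification.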

4. Build the traceability matrix

One row per requirement / AC item / risk in the upstream strategy slice. Each row names the cases that cover it:

Requirement / Risk ID	Description	Covering Cases	Coverage Type
REQ-1.2	verbatim	TC-auth-01, TC-auth-04	Functional + boundary
RISK-3	verbatim	TC-auth-07	Exploratory charter
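
A matrix like this can be derived from each case's "Traces to" field, which also surfaces uncovered requirements and orphan cases. A minimal sketch under assumed data shapes (plain dicts keyed by ID); build_matrix is a hypothetical helper, and REQ-2.1 and TC-auth-09 are invented here to show a gap and an orphan:

```python
def build_matrix(requirements, cases):
    """requirements: {req_id: description}; cases: {case_id: [traced IDs]}."""
    matrix = {req_id: [] for req_id in requirements}
    orphans = []
    for case_id, traced in sorted(cases.items()):
        known = [req_id for req_id in traced if req_id in matrix]
        if not known:
            # The case anchors to nothing in this strategy slice.
            orphans.append(case_id)
        for req_id in known:
            matrix[req_id].append(case_id)
    # A requirement with no covering case must be called out, never left silent.
    gaps = [req_id for req_id, covering in matrix.items() if not covering]
    return {"matrix": matrix, "gaps": gaps, "orphans": orphans}


requirements = {
    "REQ-1.2": "<verbatim from the strategy>",
    "RISK-3": "<verbatim from the strategy>",
    "REQ-2.1": "<verbatim from the strategy>",  # hypothetical, left uncovered
}
cases = {
    "TC-auth-01": ["REQ-1.2"],
    "TC-auth-04": ["REQ-1.2"],
    "TC-auth-07": ["RISK-3"],
    "TC-auth-09": [],  # hypothetical orphan
}
report = build_matrix(requirements, cases)
```

The gaps list feeds the explicit gap callouts, and anything in orphans either gains a trace or is dropped.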

5. Per-discipline format adaptation

Different test types need different shapes. Pick the right format up front:

UI / front-end cases — steps name screens / components / states; expected results are visible states and observable side effects
API / contract cases — steps name endpoint + payload; expected results are status code, response schema, side effects (DB, events)
Integration cases — steps name the boundary (Service A → Service B); expected results name the contract upheld at the boundary
Performance / load cases — preconditions name the load profile (concurrent users, request rate); expected results are thresholds (p95 / p99 latency, error rate)
Accessibility cases — preconditions name the assistive tech context (screen reader, keyboard-only, high contrast); expected results name the WCAG / ARIA criterion satisfied
Security smoke cases — steps exercise the attack class (authn bypass attempt, input injection, missing-authorization access); expected results are the system rejecting / sanitizing as designed
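
For performance cases, an expected result written as a threshold is unambiguous because it reduces to a comparison. The sketch below assumes the nearest-rank percentile definition and invented sample data and limits; the design stage does not run it, it only shows that a threshold stated this way can be checked mechanically later.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate(latencies_ms, errors, total, p95_limit_ms, error_rate_limit):
    """Compare observed p95 latency and error rate against the case's stated thresholds."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / total
    passed = p95 <= p95_limit_ms and error_rate <= error_rate_limit
    return {"p95_ms": p95, "error_rate": error_rate, "passed": passed}


# Hypothetical latencies in milliseconds and hypothetical limits.
latencies = [120, 95, 110, 480, 105, 98, 101, 130, 99, 102]
result = evaluate(latencies, errors=1, total=10, p95_limit_ms=300, error_rate_limit=0.01)
```

With these numbers the case fails on both counts, which is exactly the kind of explicit FAIL condition the format asks for.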

6. Self-check before handing off

Every requirement / risk in the strategy slice has at least one covering case OR is named as a gap
Every case names the technique used (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
Every case has explicit preconditions, single-action steps, observable expected results, and PASS / FAIL criteria
Severity labels match the strategy's taxonomy
Traceability matrix has no orphan cases and no uncovered requirements (without a gap callout)
Naming conventions match sibling units

Anti-patterns (RFC 2119)

The agent MUST NOT write test cases without explicit expected results AND explicit fail criteria
The agent MUST NOT design tests that only cover the happy path — every case set covers at least one error and one boundary
The agent MUST maintain traceability between every test case and a requirement / risk / AC item; orphan cases get rejected
The agent MUST NOT create unnecessarily verbose cases that re-test obvious state (every step must add information)
The agent MUST NOT invent a new severity / priority taxonomy mid-suite — match the strategy
The agent MUST name the design technique each case applies (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
The agent MUST NOT pad coverage with near-duplicate cases that don't exercise meaningfully different inputs
The agent MUST NOT name specific test-management or case-tracking products in the plugin default; product-specific guidance belongs in an overlay
The agent MUST flag a requirement with zero covering cases as a gap explicitly, never as silence

fix-hat 3 · Feedback Assessor

Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence; the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
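
The whole-set rule above comes down to set arithmetic. This is a sketch with hypothetical names (closure_decision, the row IDs), assuming the assessor has already checked each enumerated item on disk and recorded which ones are still defective:

```python
def closure_decision(enumerated, touched_by_fix, still_defective):
    """Decide closure from the finding's full enumerated set, not the fixed subset."""
    # Items the fixer never touched are where stale defects hide.
    untouched = sorted(enumerated - touched_by_fix)
    # Any enumerated item still defective keeps the finding open.
    survivors = sorted(enumerated & still_defective)
    action = "reject_hat" if survivors else "advance_hat"
    return {"check_on_disk": untouched, "survivors": survivors, "action": action}


# Hypothetical finding that enumerates 8 stale matrix rows; the fixer corrected 3.
enumerated = {f"row-{n}" for n in range(1, 9)}
touched = {"row-1", "row-2", "row-3"}
still_defective = {"row-4", "row-5", "row-6", "row-7", "row-8"}
decision = closure_decision(enumerated, touched, still_defective)
```

The survivors list is what the reject_hat message should name.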

Anti-patterns (RFC 2119):

The agent MUST NOT edit any file — you are a verifier, not a fixer
The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched; check the untouched items on disk first, and reject_hat if any are still defective