Design Tests
Auto gate · Design test cases and plan automation
Turn the test strategy into executable test artifacts: explicit test cases, a traceability matrix back to requirements, and an assessment of which cases to automate. This is where the strategy's intent becomes something a tester or a framework can actually run.
Scope
Test design and automation strategy — case definition (preconditions, steps, expected results), requirement traceability, and automation feasibility. Design-tests decides what the tests are, not what to test (that's plan), whether they pass (execute-tests), or what failures mean (analyze).
What to do
- Trace every test case back to a requirement or quality dimension the strategy named — leave no case unanchored and no in-scope requirement uncovered.
- Apply real design techniques (boundary, equivalence partition, decision table, state transition) rather than happy-path-only cases.
- Decide which cases automate and which stay manual, and justify each call against cost and stability.
- Write cases precise enough that someone other than the author could run them and get the same result.
What NOT to do
- Don't redefine scope or risk priority — that's a revisit to plan, not a quiet reinterpretation here.
- Don't execute the cases or capture results; designing and running are separate stages.
- Don't leave a strategy-named area without coverage.
- Don't write cases whose expected result is ambiguous or unverifiable.
How the engine runs this stage
1. Elaborate
autonomous · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact: Test Suite Spec
Test case inventory with requirement traceability and automation plan.
Content Guide
Structure the spec for efficient test execution:
- Traceability matrix -- mapping of test cases to requirements
- Test cases -- for each: ID, description, preconditions, steps, expected results, pass/fail criteria
- Test data requirements -- data sets needed including boundary conditions and edge cases
- Automation plan -- which tests to automate, framework, and tooling requirements
- Coverage analysis -- coverage targets vs planned coverage with gap justification
- Execution priority -- test execution order based on strategy priorities
Quality Signals
- Every requirement has at least one associated test case
- Test cases have explicit expected results, not just steps
- Automation candidates are selected based on ROI analysis
- Coverage meets the targets defined in the test strategy
Phase guidance
phase override: ELABORATION
Design Tests Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Test suite spec includes test cases for every requirement with traceability matrix linking tests to requirements"
- "Each test case has explicit preconditions, steps, expected results, and pass/fail criteria"
- "Automation feasibility assessment identifies which tests to automate, which to run manually, and the rationale"
Bad criteria — vague (no clear check)
- "Test cases are designed"
- "Automation is planned"
- "Tests are ready"
Outputs produced
output template: Test Suite Spec
Test Suite Specification
Test cases with traceability matrix, automation plan, and test data requirements.
Expected Artifacts
- Test cases -- each with preconditions, steps, expected results, and pass/fail criteria
- Traceability matrix -- tests linked to requirements ensuring coverage
- Automation plan -- which tests to automate vs manual with rationale
- Test data requirements -- data needed for test execution documented
Quality Signals
- Every requirement has at least one test case linked via traceability matrix
- Test cases have explicit preconditions, steps, and expected results
- Automation feasibility is assessed with framework and tooling identified
- Coverage meets the strategy targets
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Traceability
Mandate: The agent MUST verify every test case traces forward to a requirement / risk / AC item it covers AND every upstream requirement traces backward to at least one covering case. Coverage is bidirectional — orphan cases and uncovered requirements are both findings.
Check
The agent MUST verify each of the following and file feedback for any violation:
- Forward trace (case → requirement) — Every test case names the requirement / risk / AC item it covers. Cases with no upstream trace are scope creep; flag them.
- Backward trace (requirement → case) -- Every requirement / risk / AC item in the upstream strategy has at least one covering case. Uncovered items are coverage gaps; flag the responsible hat (`designer`).
- Technique honesty -- Every case names the design technique used (boundary, equivalence partitioning, decision-table, state-transition, scenario, exploratory charter). A case claiming a technique but applying a different shape (e.g., labeled "boundary" but only testing one value) is a finding.
- Format completeness — Every case has explicit preconditions, single-action steps, observable expected results, and explicit PASS / FAIL criteria.
- Error and boundary coverage per case set — For any in-scope area, the case set includes happy path, error path, and boundary case. Happy-only suites are incomplete.
- Severity consistency — Every case's severity label matches the upstream strategy's taxonomy. Mid-suite invention of a new severity band is a finding.
- Pyramid placement — Every case recommended for automation is placed on a layer appropriate to its scope (unit / integration / contract / end-to-end / performance / accessibility / security-smoke). End-to-end cases that should be unit-level are a finding.
- Automation rationale — Every AUTOMATE / MANUAL recommendation has a rationale. Recommendations without rationale are findings.
Common failure modes to look for
- A traceability matrix where every requirement maps to "covered by all cases"; that's not a trace, that's a hand-wave
- Cases with vague expected results ("system responds correctly")
- A test set with only happy-path cases for an area marked high-risk in the strategy
- Boundary-value cases that test only one boundary value, not at / inside / outside
- A `Scenario Outline` / parameterized case used to merge genuinely different behaviors
- Every case pushed to end-to-end automation because that's "what the team knows"
- An exploratory charter listed as `AUTOMATE`; charters belong in manual
- Severity labels drifting between sibling units (P1 in one, Critical in another)
- A requirement marked as "indirectly covered" without a specific case ID
3. Execute
per-unit baton · Designer → Automator → Verifier
hat 1: Automator
Focus: Assess automation feasibility for every test case the designer produced. Decide which cases automate, which stay manual, and why. Automation is leverage when it amortizes well over many runs; it's a tax when the case runs rarely, breaks on every UI change, or guards behavior nobody actually relies on.
You read the designer's test cases and traceability matrix. You produce the unit's automation feasibility assessment — appended to the same artifact. You do not implement the automation; you do not pick named products. You decide what's worth automating and what category of framework it belongs to.
Process
1. Read your inputs
- The unit's test cases (preconditions, steps, expected results, severity, technique)
- The upstream strategy slice — what's high-risk, what's regression-prone, what's release-blocking
- Sibling units' automation assessments -- keep framework-category names consistent ("unit", "integration", "contract", "end-to-end", "performance", "accessibility", "security-smoke")
- Recorded Decisions on automation posture (mandatory automation tiers, manual-only categories, environment constraints)
2. Place each case on the test pyramid
The test pyramid is the load-bearing decision framework. For each case, pick the layer:
- Unit — exercises a single function / class / module in isolation. Fast, deterministic, plentiful. Run on every commit.
- Integration — exercises a boundary between components (service ↔ DB, service ↔ service contract, module ↔ module). Slower, fewer, run on every PR.
- Contract — exercises a published interface (API schema, event payload). Owned by either side of the contract, run on every change to that side.
- End-to-end — exercises a user-visible flow through the full stack. Slowest, fewest, run on a cadence (per release, per main merge).
- Performance / load — exercises throughput, latency, scaling under load profile. Run on dedicated cadence, not every commit.
- Accessibility — exercises WCAG / ARIA conformance through automated probes; manual confirmation for nuanced cases. Run on UI-changing PRs.
- Security smoke — exercises basic auth / input / authorization classes; the deep pen-test lives in a security stage. Run on relevant-surface changes.
A case sitting at the wrong layer is automation that breaks on every UI change when it could have been a unit-level test, or a unit-level test that doesn't actually prove the integration. Justify the placement when it's non-obvious.
3. ROI decision per case
For each case, assess:
- Frequency of execution — every commit, every PR, every release, on-demand only
- Cost of authoring — small / medium / large (boundary cases are typically small; full e2e scenarios are typically large)
- Cost of maintenance — does the case break when implementation details change (high-maintenance) or only when behavior changes (low-maintenance)?
- Cost of manual run — minutes per execution × executions per cycle
- Risk if regression slips — high / medium / low based on the strategy's risk priority
Recommend AUTOMATE if (frequency × manual-cost) > (authoring + maintenance), weighted by regression risk. Recommend MANUAL if the case runs rarely OR the cost-of-maintenance dominates OR the case requires human judgment (exploratory, usability nuance, security smoke that needs an attacker mindset).
The recommendation table:
| Case ID | Layer | Recommendation | Rationale |
|---|---|---|---|
| TC-auth-01 | unit | AUTOMATE | runs every commit, low maintenance, P1 risk |
| TC-onboard-07 | end-to-end | AUTOMATE | per-release run, high regression risk, scenario test |
| TC-exploratory-charter-3 | exploratory | MANUAL | needs human judgment; charter not script |
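The ROI rule in step 3 can be sketched in a few lines of Python. This is an illustration only: the `CaseCost` fields, the `RISK_WEIGHT` multipliers, and the ten-cycle horizon are assumptions, since the text says the comparison is "weighted by regression risk" without giving numbers.

```python
from dataclasses import dataclass

# Hypothetical multipliers; the stage text does not define numeric weights.
RISK_WEIGHT = {"high": 2.0, "medium": 1.0, "low": 0.5}


@dataclass
class CaseCost:
    case_id: str
    runs_per_cycle: int                   # frequency of execution
    manual_minutes_per_run: float         # cost of one manual run
    authoring_minutes: float              # one-off cost of writing the automation
    maintenance_minutes_per_cycle: float  # recurring upkeep cost
    regression_risk: str                  # "high" / "medium" / "low"
    needs_human_judgment: bool = False    # exploratory, usability nuance, etc.


def recommend(case: CaseCost, cycles: int = 10) -> str:
    """Return AUTOMATE or MANUAL using (frequency x manual cost) vs (authoring + maintenance)."""
    if case.needs_human_judgment:
        return "MANUAL"
    manual_cost = case.runs_per_cycle * case.manual_minutes_per_run * cycles
    automation_cost = case.authoring_minutes + case.maintenance_minutes_per_cycle * cycles
    weighted_saving = manual_cost * RISK_WEIGHT[case.regression_risk]
    return "AUTOMATE" if weighted_saving > automation_cost else "MANUAL"
```

The numbers fed into such a function are estimates, so the output is a starting point for the rationale column, not a substitute for it.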
4. Framework category (NOT product)
Per layer, declare the framework category needed — "unit test runner", "http-mock-based integration", "contract testing", "browser-driving end-to-end", "load generator", "accessibility probe", "security smoke / fuzzer". The overlay picks the actual product.
5. Maintainability principles
For cases being automated, declare the maintainability principles the implementing team must follow:
- Test the contract, not the implementation — assert on observable behavior, not on internal calls or DOM structure that may shift
- Stable selectors / fixtures -- name the abstraction (`data-testid`, semantic role, named fixture) without naming a tool
- Idempotent setup / teardown -- every case can run independently
- Deterministic timing -- no `wait(N seconds)` heuristics; use explicit ready-conditions
- One responsibility per case -- same rule as the designer's "one action per step"
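The "deterministic timing" principle above can be illustrated with a small polling helper that waits on an explicit ready-condition instead of sleeping for a fixed time. The helper name `wait_until` and its parameters are hypothetical and use only the Python standard library; a real framework would normally supply its own equivalent.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 5.0,
               poll_s: float = 0.05) -> None:
    """Poll `condition` until it returns True, or raise TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(poll_s)
```

A case that uses this returns as soon as the system is ready and fails with a clear error when it never becomes ready, so a slow environment makes the run slower rather than flaky.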
6. Self-check before handing off
- Every case has a recommendation (`AUTOMATE` or `MANUAL`) with rationale
- Every `AUTOMATE` case is placed on the right pyramid layer
- Framework categories are named without product names
- Maintainability principles are listed
- Recommendations are consistent with sibling units' assessments
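Parts of this self-check are mechanical enough to sketch as a lint over the recommendation table. The row shape (`case_id`, `layer`, `recommendation`, `rationale`) and the function name are assumptions for illustration; the layer names are the ones listed in step 1. Whether a layer is the right one for a case remains a judgment call that code like this cannot make.

```python
PYRAMID_LAYERS = {
    "unit", "integration", "contract", "end-to-end",
    "performance", "accessibility", "security-smoke",
}


def lint_recommendations(rows: list[dict]) -> list[str]:
    """Return findings for rows that miss a recommendation, rationale, or pyramid layer."""
    findings = []
    for row in rows:
        cid = row.get("case_id", "<missing id>")
        rec = row.get("recommendation")
        if rec not in {"AUTOMATE", "MANUAL"}:
            findings.append(f"{cid}: recommendation must be AUTOMATE or MANUAL")
        if not row.get("rationale", "").strip():
            findings.append(f"{cid}: missing rationale")
        # Only automated cases must sit on a pyramid layer; manual charters may not.
        if rec == "AUTOMATE" and row.get("layer") not in PYRAMID_LAYERS:
            findings.append(f"{cid}: AUTOMATE case is not on a pyramid layer")
    return findings
```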
Anti-patterns (RFC 2119)
- The agent MUST NOT automate everything without considering maintenance cost vs execution frequency
- The agent MUST NOT choose automation tools before understanding the test requirements
- The agent MUST NOT design automation that is tightly coupled to implementation details (UI markup, internal calls, private state)
- The agent MUST account for test data management and environment setup in automation — they're part of the maintenance cost
- The agent MUST NOT name specific products (named runners, browser drivers, load tools, fuzzers, accessibility probes) in the plugin default — name the category instead, let the overlay pick the product
- The agent MUST NOT push exploratory or judgment-heavy cases into automation — they belong in manual charters
- The agent MUST NOT place every case at the end-to-end layer to look thorough; the pyramid exists for a reason
- The agent MUST flag cases where automation is impossible in the current environment (missing hooks, opaque integrations) rather than silently dropping them
- The agent MUST NOT invent automation categories not on the pyramid; if a case doesn't fit, escalate the categorization
hat 2: Designer
Focus: Design test cases that turn the upstream test strategy into executable, traceable artifacts. Each case has explicit preconditions, steps, expected results, and pass / fail criteria. Each case traces back to the requirement or risk it covers. Apply test-design techniques deliberately — don't write happy-path-only suites and don't write case-per-line-of-code suites.
You produce the test-case design and the traceability matrix for this unit. The automator hat adds the automation feasibility assessment. The verifier validates substance.
Process
1. Read your inputs
- The unit's upstream strategy slice (scope, quality dimensions, risk priority, exit criteria for this area)
- The intent's product / requirements context (the behavior being tested)
- Recorded Decisions on test depth, severity bands, or required techniques
- Sibling units' test cases — keep naming conventions, severity labels, and traceability IDs consistent
2. Pick the design techniques per case
Different behaviors need different techniques. Be explicit about which one each case applies, so a reviewer sees the coverage logic:
- Equivalence partitioning — group inputs into classes (valid / invalid / boundary classes); one case per class, not one per input value
- Boundary value analysis — at, just-inside, and just-outside each boundary. Off-by-one bugs live here.
- Decision tables — for behavior that depends on combinations of conditions; one row per condition combination with the expected action
- State-transition — for stateful behavior; cover each transition, each invalid transition, and the boundary states (start / end / interrupted)
- Use-case / scenario — end-to-end flows that exercise multiple components in user-visible sequences
- Error-guessing / exploratory charters — for unknowns; produce a charter (mission + scope + duration) rather than scripted steps
Reference the technique used in the test case header. "Pattern: boundary value analysis on quantity field" makes the design auditable.
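As a small illustration of boundary value analysis, the sketch below derives the at / just-inside / just-outside values for an inclusive integer range. The function name and the example range of 1 to 99 for a quantity field are assumptions, not values from the strategy.

```python
def boundary_values(lower: int, upper: int) -> dict[str, list[int]]:
    """Return boundary test values for the inclusive integer range [lower, upper]."""
    if lower >= upper:
        raise ValueError("lower must be less than upper")
    return {
        "at": [lower, upper],
        "just_inside": [lower + 1, upper - 1],
        "just_outside": [lower - 1, upper + 1],
    }


# Hypothetical quantity field that accepts 1..99.
quantity_cases = boundary_values(1, 99)
```

Each returned value would become its own case (or one row of a parameterized case), with the two `just_outside` values expected to be rejected.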
3. Test case format
Every case has the same structure:
ID: TC-<slice>-<NN>
Title: <one-line user-language summary>
Pattern: <technique used — equivalence / boundary / decision-table / state-transition / scenario / exploratory>
Traces to: <REQ-ID / RISK-ID / AC item>
Severity if it fails: <P0 / P1 / P2 / P3 — match the strategy's taxonomy>
Preconditions:
- <state of the system before this case runs>
- <state of the data>
- <auth context if applicable>
Steps:
1. <single action; one per step>
2. <next action>
Expected results:
- <observable outcome 1>
- <observable outcome 2>
Pass / fail criteria:
- <PASS condition stated as a check against the expected results>
- <FAIL condition — what specifically constitutes failure>
Principles:
- One action per step. "Click submit and verify the toast" is two steps masquerading as one.
- Observable outcomes. "User is logged in" is observable (URL change, session cookie, profile visible). "Auth works" is not.
- Explicit fail criteria. Saying what `PASS` means is necessary but not sufficient; `FAIL` should be unambiguous too.
- Severity matches the strategy. Don't introduce new severity bands here.
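The case format and principles above could be captured as a data structure with a simple completeness check. This is a sketch under assumptions: `CaseSpec`, `format_findings`, and the default severity bands are hypothetical names, and the test for "one action per step" is a crude text heuristic (it only looks for the word "and") that a reviewer would still need to confirm.

```python
from dataclasses import dataclass

TECHNIQUES = {"equivalence", "boundary", "decision-table",
              "state-transition", "scenario", "exploratory"}


@dataclass
class CaseSpec:
    case_id: str
    title: str
    pattern: str
    traces_to: list[str]
    severity: str
    preconditions: list[str]
    steps: list[str]
    expected_results: list[str]
    pass_criteria: list[str]
    fail_criteria: list[str]


def format_findings(case: CaseSpec,
                    severities: tuple[str, ...] = ("P0", "P1", "P2", "P3")) -> list[str]:
    """Return format problems; an empty list means the case is structurally complete."""
    findings = []
    if case.pattern not in TECHNIQUES:
        findings.append(f"{case.case_id}: unknown technique '{case.pattern}'")
    if case.severity not in severities:
        findings.append(f"{case.case_id}: severity '{case.severity}' is not in the strategy taxonomy")
    if not case.traces_to:
        findings.append(f"{case.case_id}: no upstream trace")
    for name in ("preconditions", "steps", "expected_results", "pass_criteria", "fail_criteria"):
        if not getattr(case, name):
            findings.append(f"{case.case_id}: {name} is empty")
    for number, step in enumerate(case.steps, start=1):
        if " and " in step.lower():  # heuristic only: may flag legitimate single actions
            findings.append(f"{case.case_id}: step {number} may combine two actions")
    return findings
```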
4. Build the traceability matrix
One row per requirement / AC item / risk in the upstream strategy slice. Each row names the cases that cover it:
| Requirement / Risk ID | Description | Covering Cases | Coverage Type |
|---|---|---|---|
| REQ-1.2 | verbatim | TC-auth-01, TC-auth-04 | Functional + boundary |
| RISK-3 | verbatim | TC-auth-07 | Exploratory charter |
A requirement with zero covering cases is a gap — name it as a gap rather than silently dropping it. Don't pad coverage with duplicate cases (TC-01 and TC-02 both check the happy path); the reviewer should be able to scan and see real differentiation.
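The two directions of the traceability check can be sketched as set operations. The function name, argument shapes, and IDs below are illustrative assumptions; `declared_gaps` models requirements that were explicitly named as gaps rather than silently dropped.

```python
from typing import Iterable


def trace_findings(requirement_ids: Iterable[str],
                   case_traces: dict[str, set[str]],
                   declared_gaps: Iterable[str] = ()) -> dict[str, list[str]]:
    """Find orphan cases (no upstream trace) and requirements with no covering case."""
    known = set(requirement_ids)
    covered = set().union(*case_traces.values())
    orphan_cases = sorted(
        case_id for case_id, traces in case_traces.items()
        if not (set(traces) & known)
    )
    uncovered = sorted(known - covered - set(declared_gaps))
    return {"orphan_cases": orphan_cases, "uncovered_requirements": uncovered}
```

A case counts as anchored only if at least one of its traces matches a known upstream ID, so a typo in a requirement ID surfaces as an orphan instead of passing quietly.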
5. Per-discipline format adaptation
Different test types need different shapes. Pick the right format up front:
- UI / front-end cases — steps name screens / components / states; expected results are visible states and observable side effects
- API / contract cases — steps name endpoint + payload; expected results are status code, response schema, side effects (DB, events)
- Integration cases — steps name the boundary (Service A → Service B); expected results name the contract upheld at the boundary
- Performance / load cases — preconditions name the load profile (concurrent users, request rate); expected results are thresholds (p95 / p99 latency, error rate)
- Accessibility cases — preconditions name the assistive tech context (screen reader, keyboard-only, high contrast); expected results name the WCAG / ARIA criterion satisfied
- Security smoke cases — steps exercise the attack class (authn bypass attempt, input injection, missing-authorization access); expected results are the system rejecting / sanitizing as designed
6. Self-check before handing off
- Every requirement / risk in the strategy slice has at least one covering case OR is named as a gap
- Every case names the technique used (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
- Every case has explicit preconditions, single-action steps, observable expected results, and PASS / FAIL criteria
- Severity labels match the strategy's taxonomy
- Traceability matrix has no orphan cases and no uncovered requirements (without a gap callout)
- Naming conventions match sibling units
Anti-patterns (RFC 2119)
- The agent MUST NOT write test cases without explicit expected results AND explicit fail criteria
- The agent MUST NOT design tests that only cover the happy path — every case set covers at least one error and one boundary
- The agent MUST maintain traceability between every test case and a requirement / risk / AC item; orphan cases get rejected
- The agent MUST NOT create unnecessarily verbose cases that re-test obvious state (every step must add information)
- The agent MUST NOT invent a new severity / priority taxonomy mid-suite — match the strategy
- The agent MUST name the design technique each case applies (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
- The agent MUST NOT pad coverage with near-duplicate cases that don't exercise meaningfully different inputs
- The agent MUST NOT name specific test-management or case-tracking products in the plugin default — overlay territory
- The agent MUST flag a requirement with zero covering cases as a gap explicitly, never as silence
hat 3: Verifier
Focus: Validate the per-unit design/synthesis artifact for the design-tests stage of quality-assurance. Units here are test design — designed outputs that downstream stages execute against. Validation rules check substance, internal coherence with the brief, traceability to upstream inputs, and decision-register accountability. NOT executable verify-commands.
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
- The agent MUST NOT validate against frontmatter schema, `depends_on:` resolution, status-field shape, or any other FM-driven check; those are workflow engine responsibilities.
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST name a specific failed criterion in any rejection.
- The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.
Validate this unit's outputs against its criteria
List this unit's declared outputs with `haiku_unit_get { intent, stage, unit, field: "outputs" }`, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit; sibling units have their own verify passes.
What you check (BODY ONLY)
1. Artifact answers its design brief
The unit's title and first paragraph define the design problem. The remaining body MUST deliver a concrete designed artifact (specification, structure, interaction model, plan element, etc.) — not an outline, not a deferral, not a "we'll figure this out later".
2. Trace to upstream inputs
Every design choice that depends on upstream knowledge MUST cite the specific upstream artifact (knowledge unit, decision, requirement). Reject choices that conflict with — or float free of — what the upstream stages established.
3. Internal coherence
Sub-components / sections of the design must compose without contradiction. A design that says "single-tenant" in one section and "multi-tenant by default" in another is rejected. Cite the contradicting paragraphs.
4. Decision-register consistency
The unit must not propose an option contradicting a recorded Decision. Cite the Decision ID.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Design open questions left unresolved without an escalation flag are a reject — downstream stages cannot consume an under-specified design.
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here, now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Traceability
Mandate: The agent MUST verify every test case traces forward to a requirement / risk / AC item it covers AND every upstream requirement traces backward to at least one covering case. Coverage is bidirectional — orphan cases and uncovered requirements are both findings.
Check
The agent MUST verify each of the following and file feedback for any violation:
- Forward trace (case → requirement) — Every test case names the requirement / risk / AC item it covers. Cases with no upstream trace are scope creep; flag them.
- Backward trace (requirement → case) -- Every requirement / risk / AC item in the upstream strategy has at least one covering case. Uncovered items are coverage gaps; flag the responsible hat (`designer`).
- Technique honesty -- Every case names the design technique used (boundary, equivalence partitioning, decision-table, state-transition, scenario, exploratory charter). A case claiming a technique but applying a different shape (e.g., labeled "boundary" but only testing one value) is a finding.
- Format completeness — Every case has explicit preconditions, single-action steps, observable expected results, and explicit PASS / FAIL criteria.
- Error and boundary coverage per case set — For any in-scope area, the case set includes happy path, error path, and boundary case. Happy-only suites are incomplete.
- Severity consistency — Every case's severity label matches the upstream strategy's taxonomy. Mid-suite invention of a new severity band is a finding.
- Pyramid placement — Every case recommended for automation is placed on a layer appropriate to its scope (unit / integration / contract / end-to-end / performance / accessibility / security-smoke). End-to-end cases that should be unit-level are a finding.
- Automation rationale — Every AUTOMATE / MANUAL recommendation has a rationale. Recommendations without rationale are findings.
Common failure modes to look for
- A traceability matrix where every requirement maps to "covered by all cases"; that's not a trace, that's a hand-wave
- Cases with vague expected results ("system responds correctly")
- A test set with only happy-path cases for an area marked high-risk in the strategy
- Boundary-value cases that test only one boundary value, not at / inside / outside
- A `Scenario Outline` / parameterized case used to merge genuinely different behaviors
- Every case pushed to end-to-end automation because that's "what the team knows"
- An exploratory charter listed as `AUTOMATE`; charters belong in manual
- Severity labels drifting between sibling units (P1 in one, Critical in another)
- A requirement marked as "indirectly covered" without a specific case ID
5. Gate
controls advancement to the next stage
The harness advances automatically, with no human in the loop at this gate.
Fix loop
a separate track · Classifier → Designer → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain (one hat at a time, per finding), then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via `haiku_feedback_read { intent, stage, feedback_id }`.
2. Read the stage's unit list via `haiku_unit_list { intent, stage }`.
3. Decide:
   - `target_unit`: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set `null` (intent-scope).
     - When in doubt: `null`. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - `target_invalidates`: which approval roles get cleared on closure. Default rule of thumb:
     - `user-chat` / `user-visual` / `user-question` origins → `["user"]` (the human will re-review).
     - `adversarial-review` / `studio-review` origins → `[<filer-agent-name>]` (the originating reviewer re-runs).
     - `drift` origin → `["user"]` (drift always escalates to human).
     - `agent` origin → `[]` (informational; no rerun).
4. Call `haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }`. This writes the `target_unit` / `target_invalidates` routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call `haiku_feedback_set_severity { intent, stage, feedback_id, severity }`. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation: the tool returns `severity_already_set` and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.

   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only `reject_hat`, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: `haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }`. This closes the FB as `non_actionable` (acknowledged, valid, no code fix), distinct from `haiku_feedback_reject` (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call `haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" }` to hand off to the next fix-hat. The `message` is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (`haiku_feedback_write` is refused). Your reasoning lives in the handoff `message`.
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call `haiku_feedback_reject`; that marks the finding invalid. A valid finding you can't reject. (Closing a valid finding that simply has no code fix is the `resolution: "non_actionable"` shortcut in step 6; that's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2: Designer
Focus: Design test cases that turn the upstream test strategy into executable, traceable artifacts. Each case has explicit preconditions, steps, expected results, and pass / fail criteria. Each case traces back to the requirement or risk it covers. Apply test-design techniques deliberately — don't write happy-path-only suites and don't write case-per-line-of-code suites.
You produce the test-case design and the traceability matrix for this unit. The automator hat adds the automation feasibility assessment. The verifier validates substance.
Process
1. Read your inputs
- The unit's upstream strategy slice (scope, quality dimensions, risk priority, exit criteria for this area)
- The intent's product / requirements context (the behavior being tested)
- Recorded Decisions on test depth, severity bands, or required techniques
- Sibling units' test cases — keep naming conventions, severity labels, and traceability IDs consistent
2. Pick the design techniques per case
Different behaviors need different techniques. Be explicit about which one each case applies, so a reviewer sees the coverage logic:
- Equivalence partitioning — group inputs into classes (valid / invalid / boundary classes); one case per class, not one per input value
- Boundary value analysis — at, just-inside, and just-outside each boundary. Off-by-one bugs live here.
- Decision tables — for behavior that depends on combinations of conditions; one row per condition combination with the expected action
- State-transition — for stateful behavior; cover each transition, each invalid transition, and the boundary states (start / end / interrupted)
- Use-case / scenario — end-to-end flows that exercise multiple components in user-visible sequences
- Error-guessing / exploratory charters — for unknowns; produce a charter (mission + scope + duration) rather than scripted steps
Reference the technique used in the test case header. "Pattern: boundary value analysis on quantity field" makes the design auditable.
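Two of the techniques above are mechanical enough to sketch. The following is a minimal illustration, assuming an inclusive integer range at least two wide for boundary value analysis and boolean-style conditions for the decision table. The function names, the quantity range 1..99, and the condition names are hypothetical.

```python
from itertools import product


def boundary_values(low: int, high: int) -> dict:
    """At, just-inside, and just-outside values for an inclusive integer range."""
    return {
        "just_outside": [low - 1, high + 1],  # expected to be rejected
        "at": [low, high],                    # expected to be accepted
        "just_inside": [low + 1, high - 1],   # expected to be accepted
    }


def decision_table(conditions: dict) -> list:
    """One row per combination of condition values."""
    names = list(conditions)
    return [dict(zip(names, combo)) for combo in product(*conditions.values())]


# Hypothetical quantity field that accepts 1..99.
print(boundary_values(1, 99))

# Hypothetical conditions; the designer fills in the expected action per row.
rows = decision_table({"logged_in": [True, False], "cart_empty": [True, False]})
print(len(rows))  # 4 rows, one per combination
```

Each generated value or row still needs an expected result written by the designer; the helpers only enumerate inputs, they do not decide what correct behavior is.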
3. Test case format
Every case has the same structure:
ID: TC-<slice>-<NN>
Title: <one-line user-language summary>
Pattern: <technique used — equivalence / boundary / decision-table / state-transition / scenario / exploratory>
Traces to: <REQ-ID / RISK-ID / AC item>
Severity if it fails: <P0 / P1 / P2 / P3 — match the strategy's taxonomy>
Preconditions:
- <state of the system before this case runs>
- <state of the data>
- <auth context if applicable>
Steps:
1. <single action; one per step>
2. <next action>
Expected results:
- <observable outcome 1>
- <observable outcome 2>
Pass / fail criteria:
- <PASS condition stated as a check against the expected results>
- <FAIL condition — what specifically constitutes failure>
Principles:
- One action per step. "Click submit and verify the toast" is two steps masquerading as one.
- Observable outcomes. "User is logged in" is observable (URL change, session cookie, profile visible). "Auth works" is not.
- Explicit fail criteria. Saying what PASS means is necessary but not sufficient; FAIL should be unambiguous too.
- Severity matches the strategy. Don't introduce new severity bands here.
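The format and principles above lend themselves to a structural lint. This is a sketch only: the `CaseSpec` class, its field names, and the `lint_case` helper are assumptions for illustration, and the check for fused steps is a crude text heuristic, not a definitive rule.

```python
from dataclasses import dataclass, field


@dataclass
class CaseSpec:
    case_id: str
    title: str
    pattern: str
    traces_to: list
    severity: str
    preconditions: list = field(default_factory=list)
    steps: list = field(default_factory=list)
    expected: list = field(default_factory=list)
    pass_criteria: list = field(default_factory=list)
    fail_criteria: list = field(default_factory=list)


def lint_case(case: CaseSpec) -> list:
    """Return a list of problems; an empty list means the structural checks passed."""
    problems = []
    for name in ("traces_to", "preconditions", "steps", "expected",
                 "pass_criteria", "fail_criteria"):
        if not getattr(case, name):
            problems.append(f"{case.case_id}: missing {name}")
    for i, step in enumerate(case.steps, start=1):
        # Crude heuristic: " and " often signals two actions fused into one step.
        if " and " in step.lower():
            problems.append(f"{case.case_id}: step {i} may contain two actions")
    return problems
```

A case whose only step is "Click submit and verify the toast" and that has no fail criteria would produce two problems. The lint cannot judge whether an expected result is truly observable; that remains a reviewer's call.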
4. Build the traceability matrix
One row per requirement / AC item / risk in the upstream strategy slice. Each row names the cases that cover it:
| Requirement / Risk ID | Description | Covering Cases | Coverage Type |
|---|---|---|---|
| REQ-1.2 | verbatim | TC-auth-01, TC-auth-04 | Functional + boundary |
| RISK-3 | verbatim | TC-auth-07 | Exploratory charter |
A requirement with zero covering cases is a gap — name it as a gap rather than silently dropping it. Don't pad coverage with duplicate cases (TC-01 and TC-02 both check the happy path); the reviewer should be able to scan and see real differentiation.
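The gap and orphan rules can be checked mechanically once each case's "Traces to" field is available. A minimal sketch, assuming cases are supplied as a mapping from case ID to the IDs they trace to; `build_matrix`, REQ-2.1, and TC-auth-09 are hypothetical and exist only to show a gap and an orphan.

```python
def build_matrix(requirement_ids, cases):
    """cases maps case ID -> list of requirement / risk IDs it traces to."""
    matrix = {req: [] for req in requirement_ids}
    orphans = []
    for case_id, traces in cases.items():
        known = [t for t in traces if t in matrix]
        if not known:
            # Traces to nothing in the strategy slice: an orphan case.
            orphans.append(case_id)
        for req in known:
            matrix[req].append(case_id)
    # A requirement with zero covering cases is a gap and must be named.
    gaps = [req for req, covering in matrix.items() if not covering]
    return matrix, gaps, orphans


matrix, gaps, orphans = build_matrix(
    ["REQ-1.2", "RISK-3", "REQ-2.1"],
    {
        "TC-auth-01": ["REQ-1.2"],
        "TC-auth-04": ["REQ-1.2"],
        "TC-auth-07": ["RISK-3"],
        "TC-auth-09": [],
    },
)
print(gaps)     # ['REQ-2.1']
print(orphans)  # ['TC-auth-09']
```

This only counts coverage. It does not detect padded coverage, since two cases tracing to the same requirement may or may not exercise different inputs.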
5. Per-discipline format adaptation
Different test types need different shapes. Pick the right format up front:
- UI / front-end cases — steps name screens / components / states; expected results are visible states and observable side effects
- API / contract cases — steps name endpoint + payload; expected results are status code, response schema, side effects (DB, events)
- Integration cases — steps name the boundary (Service A → Service B); expected results name the contract upheld at the boundary
- Performance / load cases — preconditions name the load profile (concurrent users, request rate); expected results are thresholds (p95 / p99 latency, error rate)
- Accessibility cases — preconditions name the assistive tech context (screen reader, keyboard-only, high contrast); expected results name the WCAG / ARIA criterion satisfied
- Security smoke cases — steps exercise the attack class (authn bypass attempt, input injection, missing-authorization access); expected results are the system rejecting / sanitizing as designed
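For the performance / load shape, an expected result stated as a threshold can be turned into a direct check. The sketch below assumes a nearest-rank percentile definition and one latency sample per request; both are assumptions, as are the function names and the limits, and a real strategy may define its percentiles differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_thresholds(latencies_ms, errors, p95_limit_ms, max_error_rate):
    """PASS only if both the p95 latency and the error rate are within limits."""
    error_rate = errors / len(latencies_ms)
    return (percentile(latencies_ms, 95) <= p95_limit_ms
            and error_rate <= max_error_rate)
```

Stating the percentile definition in the case itself keeps the PASS / FAIL criteria unambiguous, since different definitions can disagree on small sample sets.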
6. Self-check before handing off
- Every requirement / risk in the strategy slice has at least one covering case OR is named as a gap
- Every case names the technique used (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
- Every case has explicit preconditions, single-action steps, observable expected results, and PASS / FAIL criteria
- Severity labels match the strategy's taxonomy
- Traceability matrix has no orphan cases and no uncovered requirements (without a gap callout)
- Naming conventions match sibling units
Anti-patterns (RFC 2119)
- The agent MUST NOT write test cases without explicit expected results AND explicit fail criteria
- The agent MUST NOT design tests that only cover the happy path — every case set covers at least one error and one boundary
- The agent MUST maintain traceability between every test case and a requirement / risk / AC item; orphan cases get rejected
- The agent MUST NOT create unnecessarily verbose cases that re-test obvious state (every step must add information)
- The agent MUST NOT invent a new severity / priority taxonomy mid-suite — match the strategy
- The agent MUST name the design technique each case applies (boundary, equivalence, decision-table, state-transition, scenario, exploratory)
- The agent MUST NOT pad coverage with near-duplicate cases that don't exercise meaningfully different inputs
- The agent MUST NOT name specific test-management or case-tracking products in the plugin default — overlay territory
- The agent MUST flag a requirement with zero covering cases as a gap explicitly, never as silence
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
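The whole-set rule amounts to checking every enumerated item and not only the touched subset. A minimal sketch, assuming the assessor can list the finding's enumerated items and has some way to test each one on disk; `surviving_defects`, the `is_defective` callback, and the row names are hypothetical.

```python
def surviving_defects(enumerated, touched_by_fix, is_defective):
    """Check every enumerated item, untouched ones included."""
    untouched = [item for item in enumerated if item not in touched_by_fix]
    # Verify the full set: a touched item can still be wrong too.
    survivors = [item for item in enumerated if is_defective(item)]
    return {
        "untouched": untouched,
        "survivors": survivors,
        "decision": "reject" if survivors else "close",
    }


rows = [f"row-{n}" for n in range(1, 9)]   # finding enumerates 8 rows
fixed = {"row-1", "row-2", "row-3"}        # the fixer touched 3 of them
result = surviving_defects(rows, fixed, lambda r: r not in fixed)
print(result["decision"])        # reject
print(len(result["survivors"]))  # 5
```

The "close" outcome here covers only the enumerated-set condition; the closure discipline above still applies before advancing.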
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing": that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched; spot-check the untouched items on disk first; survivors mean reject_hat