Quality Assurance · stage 1 of 5

Plan

Ask gate

Define test strategy and coverage planning

The opening stage of the QA lifecycle: define the test strategy and execution plan that every downstream QA stage reads from. This is where scope, risk, and what "tested enough" means get decided — before any test is designed, run, or analyzed.

Scope

Strategy and planning: what to test, in what risk order, against which quality dimensions, with what entry and exit criteria — and the logistics to make it happen (resources, environments, data, scheduling). Plan decides what gets tested and why, not how individual cases are written (design-tests), whether they pass (execute-tests), or what the results mean (analyze).

What to do

  • Anchor scope and prioritization in real risk — concentrate effort where failure costs the most.
  • State entry and exit criteria concretely enough that a later stage can check work against them.
  • Name the quality dimensions in play (functional, performance, security, accessibility, regression) and what coverage each needs.
  • Plan the logistics — environments, data, resources — so execution isn't blocked by something the strategy left unspecified.

What NOT to do

  • Don't write individual test cases or design automation — that belongs to design-tests.
  • Don't run tests or interpret results here.
  • Don't leave exit criteria vague; an unmeasurable criterion can't gate certification.
  • Don't expand scope past the risk the intent actually carries.

How the engine runs this stage

1. Elaborate

collaborative · plan the work, fan out discovery, declare outputs

Discovery fan-out

knowledge artifact · Test Strategy

Overall test approach including scope, risk-based prioritization, and entry/exit criteria.

Content Guide

Structure the strategy to guide all downstream QA stages:

  • Test scope -- what is being tested and what is excluded with rationale
  • Quality objectives -- quality dimensions to be assessed (functional, performance, security, usability)
  • Risk-based prioritization -- test areas ranked by business impact and failure probability
  • Test approaches -- methodologies for each quality dimension
  • Entry criteria -- conditions that must be met before testing can begin
  • Exit criteria -- measurable conditions that must be met for quality certification
  • Resource requirements -- environments, tools, data, and personnel

Quality Signals

  • Scope is aligned with product requirements and stakeholder quality priorities
  • Prioritization reflects actual business risk, not generic severity
  • Exit criteria are specific and measurable
  • Resource requirements are validated as feasible

Phase guidance

phase override · ELABORATION

Plan Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

  • "Test strategy defines scope, approach, and resource requirements with explicit coverage targets for each quality dimension"
  • "Risk-based prioritization ranks test areas by business impact and failure probability with justification"
  • "Entry and exit criteria are defined for each test phase with measurable thresholds"

Bad criteria — vague (no clear check)

  • "Strategy is defined"
  • "Plan is ready"
  • "Coverage is planned"

Outputs produced

output template · Test Strategy

Scope, approach, risk-based priorities, and resource requirements for quality assurance.

Expected Artifacts

  • Strategy document -- scope, approach, and resource requirements with coverage targets
  • Risk-based priorities -- test areas ranked by business impact and failure probability
  • Entry/exit criteria -- defined for each test phase with measurable thresholds
  • Resource plan -- resource availability and sequencing validated against constraints

Quality Signals

  • Strategy covers all critical quality dimensions
  • Risk-based priorities are ranked with justification
  • Entry and exit criteria have measurable thresholds
  • Resource availability and sequencing are feasible

2. Review

pre-execute · agents audit the planned spec before any code lands
review agent · Coverage

Mandate: The agent MUST verify the test strategy and plan provide adequate, risk-justified coverage across every quality dimension that applies — and that gaps are surfaced as explicit choices, not silences.

Check

The agent MUST verify each item below and file feedback for any violation:

  • Scope completeness — In-scope and out-of-scope are both enumerated. Every product area mentioned in the intent appears in one of the two lists. Silence on an area is a coverage gap, not a default.
  • Risk-based prioritization is honest — The risk table's scores reflect business impact and failure probability, not personal interest or test convenience. Any high-impact area scored low needs an explicit rationale.
  • Quality dimensions are explicit per area — Functional, integration, regression, performance, accessibility, security smoke, compatibility, usability. For each in-scope area, every applicable dimension is either claimed (with depth) or excluded (with reason).
  • Entry and exit criteria are measurable — Every exit criterion has a specific threshold (count, percentage, severity band). Reject any criterion that reads as "acceptable", "sufficient", "reasonable" without a number behind it.
  • Resource and environment feasibility — Resources and environments named in the planner section are achievable within the stated constraints, or the constraint is escalated.
  • Coverage targets are linked to risk — High-priority areas get exhaustive coverage; low-priority get smoke. The plan does NOT spend equal depth on every area regardless of risk.

Common failure modes to look for

  • An out-of-scope list that's empty or missing — every team has out-of-scope; an empty list means the author hasn't thought about it
  • Exit criteria like "quality is acceptable", "sufficient coverage", "team is comfortable releasing" — these are vibes, not gates
  • A risk table where everything is High or everything is Medium — risk should differentiate
  • A regulated-data area without an explicit data-handling note
  • Quality dimensions silently omitted (accessibility, security smoke) without a reason
  • A schedule expressed in calendar dates that conflicts with the dependency DAG
  • The same severity / priority taxonomy used inconsistently across sibling units

3. Execute

per-unit baton · Strategist → Planner → Verifier
hat 1 · Planner

Focus: Translate the strategist's strategy into concrete execution logistics — resource allocation, test environments, test data, scheduling, and dependencies. The strategy says what gets tested and at what depth; the plan says how it actually runs.

You read the strategy section the strategist produced for this unit. You add the logistics section. You do not change scope, priority, or exit criteria — those are the strategy. If you find logistics genuinely impossible (e.g., the strategy demands a test environment that does not exist and can't be built in scope), flag it as a finding rather than silently dropping the criterion.

Process

1. Read your inputs

  • The unit's strategist section (scope, dimensions, risk, entry / exit criteria)
  • Sibling units' planner sections — keep environment names, data set names, and resource pool names consistent across the strategy
  • Recorded Decisions on environment, tooling category, or scheduling constraints

2. Resource allocation

For each in-scope area at each quality dimension, declare:

  • Owner role — who runs this slice (test engineer, exploratory tester, performance specialist, accessibility auditor, security smoke tester). Use roles, not named people; the overlay handles named assignment.
  • Approximate effort band — small / medium / large, with the rationale (number of cases, breadth of variants, depth of exploration). Avoid hard hour estimates in the plugin default; the overlay applies team-specific velocity.
  • Sequencing — does this slice run in parallel with others, or does it depend on another slice's output (e.g., performance can't run until functional smoke passes)?

3. Environment requirements

For each slice, declare:

  • Environment class — local / shared dev / integration / staging / production-like / production (read-only smoke). Production write tests are out-of-scope by default unless the strategy explicitly authorizes them.
  • Fidelity to production — what must match (data shape, integration endpoints, feature flags, scaling profile)? What may differ (volume, traffic shape, observability sampling)?
  • Provisioning path — how the environment is brought up (existing shared env, on-demand ephemeral, dedicated long-lived). Don't name specific provisioning products in the plugin default.

4. Test data plan

For each slice, declare:

  • Data classes — what categories of test data are needed (synthetic, anonymized production-derived, seeded fixtures, generated boundary cases)
  • Data sensitivity — anything that touches PII / PHI / regulated data needs an explicit handling note (anonymization, retention, access scope)
  • Refresh cadence — single-shot, refreshed each run, refreshed per phase

5. Scheduling and dependencies

Build the dependency graph:

| Slice | Depends on | Blocks | Parallel with |
| --- | --- | --- | --- |
| name | what must complete first | what waits on this | what runs alongside |

Sequencing-by-dependency is more durable than sequencing-by-calendar. Don't write "Week 1: scope; Week 2: logistics" — write "smoke must pass before regression starts; regression must pass before performance starts." The calendar belongs to the overlay or the project plan, not the plugin default.
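The sequencing table above can be checked mechanically. The following is a minimal sketch, assuming the table has been transcribed into a mapping from each slice to the slices it depends on. The slice names and both helper functions are hypothetical illustrations, not part of the plugin. It uses Python's standard `graphlib` module to reject cycles and to group slices that could run in parallel.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical slices: each key maps to the slices that must complete first.
depends_on = {
    "smoke": set(),
    "regression": {"smoke"},
    "performance": {"regression"},
    "accessibility": {"smoke"},
}


def execution_waves(graph):
    """Group slices into waves; slices in the same wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the table contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def has_cycle(graph):
    """Return True when the sequencing table is not a DAG."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


waves = execution_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Nothing in the sketch refers to a calendar: the waves express only ordering, which leaves date anchoring to the overlay or project plan.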

6. Risk to the plan itself

Plans fail. Capture:

  • Single points of failure — environment, dataset, or person whose absence stops the slice
  • Mitigation — backup environment, dataset re-derivation, role coverage
  • Contingency exit criteria — if a slice can't run, what's the minimum-bar substitute that still gates certification?

7. Self-check before handing off

  • Every strategy slice has explicit owner role, environment, data, sequencing
  • No hour estimates that are really team-specific velocity
  • Dependencies form a DAG (no cycles in the sequencing table)
  • PII / PHI / regulated data has an explicit handling note
  • Single points of failure are named with at least one mitigation each

Anti-patterns (RFC 2119)

  • The agent MUST NOT plan execution without confirming the strategist's entry criteria can actually be met (test env exists or can be provisioned; data is reachable)
  • The agent MUST account for test data preparation, refresh, and teardown effort — they are not free
  • The agent MUST NOT schedule test phases without considering development delivery dependencies
  • The agent MUST NOT underestimate the effort required for environment setup and teardown — they are a load-bearing part of the timeline
  • The agent MUST NOT write calendar-anchored schedules in the plugin default — sequence by dependency, let the overlay anchor to dates
  • The agent MUST NOT name specific products for runners, schedulers, environments, or data-management tools in the plugin default — overlay territory
  • The agent MUST NOT silently drop a strategy criterion because it's logistically inconvenient — escalate as a finding instead
  • The agent MUST cite the Decision ID when a logistics choice implements a recorded Decision (e.g., approved environment posture, data-handling policy)
hat 2 · Strategist

Focus: Define the test strategy for this slice — scope, quality dimensions in play, risk-based prioritization, and entry / exit criteria. The strategy is the contract the rest of the QA lifecycle reads from. Ambiguity here compounds: a vague exit criterion becomes a vague pass / fail in execution, becomes a vague certification in sign-off.

You produce the unit's strategy section. The planner hat translates it into logistics. The verifier validates substance.

Process

1. Read your inputs

  • The intent's product / requirements context (features, behaviors, integrations, regulatory obligations)
  • Any prior release's certification report or known-issues list, if available (for trend continuity)
  • Recorded Decisions on quality posture (release-blocking severities, acceptable risk thresholds, compliance scope)
  • Sibling units' strategy sections — keep terminology consistent (a "P1 defect" must mean the same thing across every unit)

2. Define scope explicitly

Scope is what's tested AND what isn't. List both. For each in-scope area, name the feature / component / integration. For out-of-scope, name it and cite the reason (deferred, third-party owned, prior release, separate program). Silence on an area is ambiguity; future readers will read it as "covered" when it wasn't.
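A scope-completeness check along these lines could look like the sketch below. It assumes the intent's product areas and the two scope lists are available as plain Python collections. The area names and function names are hypothetical examples.

```python
def unlisted_areas(intent_areas, in_scope, out_of_scope):
    """Return intent areas that appear in neither list (silent coverage gaps)."""
    covered = set(in_scope) | set(out_of_scope)
    return sorted(set(intent_areas) - covered)


def missing_reasons(out_of_scope):
    """Return out-of-scope areas that carry no stated reason."""
    return sorted(area for area, reason in out_of_scope.items() if not reason.strip())


# Hypothetical example data.
intent_areas = ["checkout", "search", "payments gateway", "admin console"]
in_scope = ["checkout", "search"]
out_of_scope = {"payments gateway": "third-party owned"}

gaps = unlisted_areas(intent_areas, in_scope, out_of_scope)
print(gaps)  # areas the strategy is silent on
```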

3. Map quality dimensions

For every in-scope area, declare which quality dimensions apply:

  • Functional — does the behavior match the spec?
  • Integration — do components and external systems wire up correctly?
  • Regression — do existing flows still work?
  • Performance / load — does it hold up under expected and peak load?
  • Accessibility — is it usable by people with disabilities (WCAG / ARIA conformance level)?
  • Security smoke — are basic auth / input validation / data-exposure issues exercised? (Deep pen-test belongs to a dedicated security stage.)
  • Compatibility — browsers, devices, OS versions, locales
  • Usability / exploratory — does the experience hold together for an unscripted user?

Not every dimension applies to every slice. Naming the ones that don't apply (with a reason) is part of the strategy.
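One way to keep dimensions from being silently omitted is to require an explicit entry for each one. The sketch below assumes a per-area declaration that maps each dimension to either ("applies", depth) or ("excluded", reason). That shape and the names used are illustrative assumptions, not a format the plugin defines.

```python
DIMENSIONS = (
    "functional", "integration", "regression", "performance",
    "accessibility", "security smoke", "compatibility", "usability",
)


def undeclared_dimensions(declaration):
    """List dimensions that are missing, or declared without depth or reason."""
    problems = []
    for dimension in DIMENSIONS:
        entry = declaration.get(dimension)
        if entry is None:
            problems.append(f"{dimension}: not declared")
        elif not entry[1].strip():
            problems.append(f"{dimension}: {entry[0]} without depth or reason")
    return problems


# Hypothetical declaration for one in-scope area.
declaration = {
    "functional": ("applies", "exhaustive"),
    "integration": ("applies", "smoke"),
    "regression": ("applies", "exhaustive"),
    "performance": ("excluded", "no load-bearing path in this slice"),
    "accessibility": ("excluded", ""),
    "security smoke": ("applies", "smoke"),
    "compatibility": ("excluded", "API only, no client surface"),
}

problems = undeclared_dimensions(declaration)
```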

4. Risk-based prioritization

Rank in-scope areas by business impact × failure probability, not by personal interest or test-ease:

| Area | Business impact (1-5) | Failure probability (1-5) | Priority | Rationale |
| --- | --- | --- | --- | --- |
| name | score | score | impact × probability | why this ranking |

Priority drives test depth: high-priority areas get exhaustive coverage (boundary / equivalence / decision-table / state-transition where applicable); low-priority areas may get a single happy-path smoke. Rationale matters — the next reviewer reads it.
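As a worked illustration of the table, the sketch below computes priority as impact × probability and maps it to a test depth. The areas, scores, the band cut-offs (15 and 6), and the middle "targeted" band are all assumptions made for the example. The text above specifies only that high priority gets exhaustive coverage and low priority may get a happy-path smoke.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskRow:
    area: str
    impact: int        # business impact, 1-5
    probability: int   # failure probability, 1-5
    rationale: str

    @property
    def priority(self) -> int:
        return self.impact * self.probability


def depth_for(priority, high=15, low=6):
    """Map a priority score to a depth band (cut-offs are assumed, not prescribed)."""
    if priority >= high:
        return "exhaustive"
    if priority <= low:
        return "happy-path smoke"
    return "targeted"


# Hypothetical risk table.
rows = [
    RiskRow("search", 3, 3, "degraded results are recoverable"),
    RiskRow("checkout", 5, 4, "revenue path; recently refactored"),
    RiskRow("profile settings", 2, 2, "low traffic; stable code"),
]

ranked = sorted(rows, key=lambda row: row.priority, reverse=True)
plan = [(row.area, row.priority, depth_for(row.priority)) for row in ranked]
```

A table where every row lands in the same band is a signal that the scores are not differentiating risk.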

5. Define entry / exit criteria

Entry criteria are the gate before execution starts (e.g., build deployed to test env, smoke passes, test data loaded). Exit criteria are the gate before certification (e.g., 100% of P1 / P2 cases executed, zero open P1 defects, regression suite passes, performance within target).

Every exit criterion MUST be measurable with a specific threshold. "Quality is acceptable" is not an exit criterion. "P1 defect count = 0; P2 defect count ≤ 3 with risk acceptance signed" is.
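A crude first-pass screen for unmeasurable criteria might look like the sketch below. It is a heuristic under stated assumptions: a criterion counts as carrying a threshold if it contains a comparison against a number, a percentage, or the word "zero". The regular expression and function name are hypothetical, and a flagged criterion still needs a human or reviewer judgment.

```python
import re

# Assumed threshold shapes: "= 0", "≤ 3", "100%", or the word "zero".
THRESHOLD = re.compile(r"(=|≤|≥|<|>)\s*\d+|\d+\s*%|\bzero\b", re.IGNORECASE)


def has_threshold(criterion):
    """Return True when the criterion states a number to check against."""
    return THRESHOLD.search(criterion) is not None


criteria = [
    "P1 defect count = 0; P2 defect count ≤ 3 with risk acceptance signed",
    "100% of P1 / P2 cases executed",
    "Quality is acceptable",
    "Team is comfortable releasing",
]

flagged = [criterion for criterion in criteria if not has_threshold(criterion)]
```

The screen will also flag binary criteria such as a suite passing, so it is better treated as a prompt for review than as a gate.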

6. Self-check before handing off

  • In-scope and out-of-scope are both listed; nothing is left implicit
  • Every in-scope area has at least one quality dimension and a stated reason for any dimension omitted
  • Risk table is filled with explicit numeric scoring and rationale
  • Entry and exit criteria are measurable; no "quality is acceptable" placeholders
  • Terminology (severity levels, dimension names, priority bands) matches sibling units

Anti-patterns (RFC 2119)

  • The agent MUST NOT create a strategy that tries to test everything equally instead of prioritizing by risk
  • The agent MUST NOT define strategy without consulting stakeholders on quality priorities — escalate via (needs human escalation) rather than guess
  • The agent MUST NOT select test approaches based on team familiarity rather than effectiveness for the risk
  • The agent MUST define measurable exit criteria for each phase with explicit thresholds
  • The agent MUST NOT leave a quality dimension implicit — name it, applied or not, with a reason
  • The agent MUST NOT introduce a severity / priority scheme that contradicts a sibling unit; consistency beats personal preference
  • The agent MUST NOT specify test tooling by product name (named runners, named load tools, named browser drivers) in the plugin default — that's project-overlay territory
  • The agent MUST cite the Decision ID when a strategy choice implements a recorded Decision
hat 3 · Verifier

Focus: Validate the per-unit knowledge artifact for the plan stage of quality-assurance. Units here are test-plan elements: knowledge artifacts that downstream stages consume. Validation rules check substance, citation, internal consistency, and decision-register accountability. They do NOT cover executable verify-commands or DAG validity (workflow engine / build-stage concerns).

Anti-patterns (RFC 2119):

  • The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose; that is workflow engine territory per architecture §1.1.
  • The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
  • The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
  • The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
  • The agent MUST name a specific failed criterion in any rejection.
  • The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Artifact answers its topic

The unit's title and first paragraph define the topic. The remaining body MUST deliver substantive content on that topic. Reject placeholders, content-free outlines, or redirects.

2. Sources cited

Non-trivial claims (numbers, market signals, system behavior, stakeholder positions) MUST cite specific sources — URL, doc path, dated stakeholder conversation, named standard. Reject "industry common knowledge" or unsourced numerical claims.

3. Internal consistency

Title, mission, and body must align. Numerical/categorical claims must be consistent across the body. Recommendations must follow from the evidence presented.

4. Decision-register consistency

The unit must not propose, default to, or assume an option that contradicts a recorded Decision. Cite the Decision ID in any rejection.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted with veto-style approval, OR flagged (needs human escalation).

4. Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent · Coverage

Mandate: The agent MUST verify the test strategy and plan provide adequate, risk-justified coverage across every quality dimension that applies — and that gaps are surfaced as explicit choices, not silences.

Check

The agent MUST verify each item below and file feedback for any violation:

  • Scope completeness — In-scope and out-of-scope are both enumerated. Every product area mentioned in the intent appears in one of the two lists. Silence on an area is a coverage gap, not a default.
  • Risk-based prioritization is honest — The risk table's scores reflect business impact and failure probability, not personal interest or test convenience. Any high-impact area scored low needs an explicit rationale.
  • Quality dimensions are explicit per area — Functional, integration, regression, performance, accessibility, security smoke, compatibility, usability. For each in-scope area, every applicable dimension is either claimed (with depth) or excluded (with reason).
  • Entry and exit criteria are measurable — Every exit criterion has a specific threshold (count, percentage, severity band). Reject any criterion that reads as "acceptable", "sufficient", "reasonable" without a number behind it.
  • Resource and environment feasibility — Resources and environments named in the planner section are achievable within the stated constraints, or the constraint is escalated.
  • Coverage targets are linked to risk — High-priority areas get exhaustive coverage; low-priority get smoke. The plan does NOT spend equal depth on every area regardless of risk.

Common failure modes to look for

  • An out-of-scope list that's empty or missing — every team has out-of-scope; an empty list means the author hasn't thought about it
  • Exit criteria like "quality is acceptable", "sufficient coverage", "team is comfortable releasing" — these are vibes, not gates
  • A risk table where everything is High or everything is Medium — risk should differentiate
  • A regulated-data area without an explicit data-handling note
  • Quality dimensions silently omitted (accessibility, security smoke) without a reason
  • A schedule expressed in calendar dates that conflicts with the dependency DAG
  • The same severity / priority taxonomy used inconsistently across sibling units

5. Gate

controls advancement to the next stage
Ask

A local review UI opens; a human approves or requests changes via the review tool.

Fix loop

a separate track · Classifier → Strategist → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1 · Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

  1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.

  2. Read the stage's unit list via haiku_unit_list { intent, stage }.

  3. Decide:

    • target_unit — which unit this FB counter-signals.
      • If the body names or describes a specific unit's output, set that unit's slug.
      • If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
      • When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
    • target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
      • user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
      • adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
      • drift origin → ["user"] (drift always escalates to human).
      • agent origin → [] (informational; no rerun).
  4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.

  5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.

    • blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
    • high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
    • medium — a genuine issue worth fixing; not delivery-blocking.
    • low — a nit, polish, or nice-to-have.

    Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.

  6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.

  7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
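The default rules in steps 3 and 5 can be summarized as a small lookup. The sketch below is only an illustration of those defaults: the function names, the dictionary shape of a finding, and the tie-breaking behaviour (findings of equal severity keep their filing order) are assumptions, and it does not call any engine tool.

```python
USER_ORIGINS = {"user-chat", "user-visual", "user-question", "drift"}
REVIEW_ORIGINS = {"adversarial-review", "studio-review"}
SEVERITY_ORDER = ("blocker", "high", "medium", "low")


def default_invalidates(origin, filer_agent=None):
    """Return the default target_invalidates list for a feedback origin."""
    if origin in USER_ORIGINS:
        return ["user"]  # the human re-reviews; drift always escalates to human
    if origin in REVIEW_ORIGINS:
        if filer_agent is None:
            raise ValueError("review-origin feedback needs the filing agent's name")
        return [filer_agent]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings):
    """Sort findings so higher severity comes first (stable for equal severity)."""
    return sorted(findings, key=lambda finding: SEVERITY_ORDER.index(finding["severity"]))


# Hypothetical findings.
findings = [
    {"id": "FB-1", "severity": "low"},
    {"id": "FB-2", "severity": "blocker"},
    {"id": "FB-3", "severity": "medium"},
]
ordered_ids = [finding["id"] for finding in dispatch_order(findings)]
```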

What you do NOT do

  • You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
  • You do NOT call haiku_feedback_reject; that marks the finding invalid, and a valid finding must not be rejected. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6, which is an acknowledgement, not a rejection.)
  • You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2 · Strategist

Focus: Define the test strategy for this slice — scope, quality dimensions in play, risk-based prioritization, and entry / exit criteria. The strategy is the contract the rest of the QA lifecycle reads from. Ambiguity here compounds: a vague exit criterion becomes a vague pass / fail in execution, becomes a vague certification in sign-off.

You produce the unit's strategy section. The planner hat translates it into logistics. The verifier validates substance.

Process

1. Read your inputs

  • The intent's product / requirements context (features, behaviors, integrations, regulatory obligations)
  • Any prior release's certification report or known-issues list, if available (for trend continuity)
  • Recorded Decisions on quality posture (release-blocking severities, acceptable risk thresholds, compliance scope)
  • Sibling units' strategy sections — keep terminology consistent (a "P1 defect" must mean the same thing across every unit)

2. Define scope explicitly

Scope is what's tested AND what isn't. List both. For each in-scope area, name the feature / component / integration. For out-of-scope, name it and cite the reason (deferred, third-party owned, prior release, separate program). Silence on an area is ambiguity; future readers will read it as "covered" when it wasn't.

3. Map quality dimensions

For every in-scope area, declare which quality dimensions apply:

  • Functional — does the behavior match the spec?
  • Integration — do components and external systems wire up correctly?
  • Regression — do existing flows still work?
  • Performance / load — does it hold up under expected and peak load?
  • Accessibility — is it usable by people with disabilities (WCAG / ARIA conformance level)?
  • Security smoke — are basic auth / input validation / data-exposure issues exercised? (Deep pen-test belongs to a dedicated security stage.)
  • Compatibility — browsers, devices, OS versions, locales
  • Usability / exploratory — does the experience hold together for an unscripted user?

Not every dimension applies to every slice. Naming the ones that don't apply (with a reason) is part of the strategy.

4. Risk-based prioritization

Rank in-scope areas by business impact × failure probability, not by personal interest or test-ease:

| Area | Business impact (1-5) | Failure probability (1-5) | Priority | Rationale |
| --- | --- | --- | --- | --- |
| name | score | score | impact × probability | why this ranking |

Priority drives test depth: high-priority areas get exhaustive coverage (boundary / equivalence / decision-table / state-transition where applicable); low-priority areas may get a single happy-path smoke. Rationale matters — the next reviewer reads it.

5. Define entry / exit criteria

Entry criteria are the gate before execution starts (e.g., build deployed to test env, smoke passes, test data loaded). Exit criteria are the gate before certification (e.g., 100% of P1 / P2 cases executed, zero open P1 defects, regression suite passes, performance within target).

Every exit criterion MUST be measurable with a specific threshold. "Quality is acceptable" is not an exit criterion. "P1 defect count = 0; P2 defect count ≤ 3 with risk acceptance signed" is.

6. Self-check before handing off

  • In-scope and out-of-scope are both listed; nothing is left implicit
  • Every in-scope area has at least one quality dimension and a stated reason for any dimension omitted
  • Risk table is filled with explicit numeric scoring and rationale
  • Entry and exit criteria are measurable; no "quality is acceptable" placeholders
  • Terminology (severity levels, dimension names, priority bands) matches sibling units

Anti-patterns (RFC 2119)

  • The agent MUST NOT create a strategy that tries to test everything equally instead of prioritizing by risk
  • The agent MUST NOT define strategy without consulting stakeholders on quality priorities — escalate via (needs human escalation) rather than guess
  • The agent MUST NOT select test approaches based on team familiarity rather than effectiveness for the risk
  • The agent MUST define measurable exit criteria for each phase with explicit thresholds
  • The agent MUST NOT leave a quality dimension implicit — name it, applied or not, with a reason
  • The agent MUST NOT introduce a severity / priority scheme that contradicts a sibling unit; consistency beats personal preference
  • The agent MUST NOT specify test tooling by product name (named runners, named load tools, named browser drivers) in the plugin default — that's project-overlay territory
  • The agent MUST cite the Decision ID when a strategy choice implements a recorded Decision

fix-hat 3 · Feedback Assessor

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
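
The two closure rules above reduce to a set difference plus a blocker check. The following is a minimal sketch of that logic only; `closure_decision` and its arguments are hypothetical names, and it returns a plain label rather than calling any engine tool.

```python
def closure_decision(
    enumerated: set[str],
    verified_resolved: set[str],
    open_blockers: list[str],
) -> tuple[str, list[str]]:
    """Decide whether a finding may be closed.

    enumerated: every item the finding lists (rows, scenarios, fields...).
    verified_resolved: items the assessor independently confirmed on disk.
    open_blockers: unresolved issues the handoff message would mention.
    """
    # Any enumerated item not independently verified is a survivor,
    # whether or not the fixer touched it.
    survivors = sorted(enumerated - verified_resolved)
    outstanding = survivors + list(open_blockers)
    if outstanding:
        return ("reject", outstanding)
    return ("advance", [])


# Hypothetical finding: eight stale rows listed, three confirmed fixed.
finding_rows = {f"row-{n}" for n in range(1, 9)}
confirmed = {"row-1", "row-2", "row-3"}

decision, outstanding = closure_decision(finding_rows, confirmed, [])
print(decision, outstanding)
```

The point of the design is that the verified set is built by the assessor, not inherited from the fixer's report, so an untouched item can never be counted as resolved by default.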

Anti-patterns (RFC 2119):

  • The agent MUST NOT edit any file — you are a verifier, not a fixer
  • The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
  • The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
  • The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
  • The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
  • The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched. Check every untouched item on disk first; any survivor means reject_hat