Quality Assurance · stage 3 of 5

Execute Tests

Auto gate

Execute tests and log defects

Execute Tests

Run the designed test suite against the planned environment, capture evidence, and log defects — producing the test-results record that analyze and certify depend on. Execution discipline here is what makes the downstream data trustworthy.

Scope

Test execution and evidence: running cases at the planned environment fidelity, recording each result with proof, and writing accurate defect reports. Execute-tests decides what actually happened when the tests ran, not what the tests are (design-tests) or what the results imply (analyze).

What to do

Confirm the environment matches the planned fidelity before running anything — results from the wrong environment are noise.
Capture concrete evidence for every result, pass or fail, so the record stands on its own.
Write defect reports with enough reproduction detail and accurate severity that someone else could confirm them.
Flag blocked or unexecutable cases explicitly rather than silently skipping them.

What NOT to do

Don't redesign or reinterpret cases mid-run to make them pass — a wrong case is feedback to design-tests.
Don't analyze trends, compute quality verdicts, or recommend release/defer/block — that's analyze.
Don't record a result without the evidence that backs it.
Don't leave a case's outcome unrecorded or its blocked status unexplained.

How the engine runs this stage

1Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Inputs consumed

test-suite-specfrom Design Tests test-strategyfrom Plan

Phase guidance

phase overrideELABORATION- "Test results document pass/fail status for every test case with evidence (screenshots, logs, or output) for each failure"

Execute Tests Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

"Test results document pass/fail status for every test case with evidence (screenshots, logs, or output) for each failure"
"Defect reports include reproduction steps, environment details, severity classification, and root cause hypothesis"
"Coverage report confirms execution percentage against the planned test suite with justification for any unexecuted tests"

Bad criteria — vague (no clear check)

"Tests are run"
"Defects are logged"
"Testing is complete"

Outputs produced

output templateTest ResultsTest execution results with evidence, defect reports, and coverage metrics. Two artifact families land here: defect entries the `reporter` hat files for each failing case, and execution-progress metrics appended to each unit's body.

Test Results

Test execution results with evidence, defect reports, and coverage metrics. Two artifact families land here: defect entries the reporter hat files for each failing case, and execution-progress metrics appended to each unit's body.

Content Guide

Execution summary — pass/fail/skip counts with overall coverage percentage
Test results — each test case with status and evidence (screenshots, logs for failures)
Defect reports — each defect with reproduction steps, environment, severity, and root cause hypothesis
Blocked tests — tests that could not be executed with reasons and impact assessment
Coverage metrics — execution percentage against planned suite with gap justification
Environment record — test environment configuration confirming production fidelity

Quality Signals

All planned tests are accounted for (pass, fail, skip, blocked)
Failures include evidence sufficient for defect reproduction
Defect reports have reproduction steps, severity, and environment details
Coverage metrics are accurate against the planned test suite

Defect entry shape

DEFECT ID: <stable ID — match the project's taxonomy if one exists>
Title: <one-line, observable, in user language>
Severity: <P0 / P1 / P2 / P3 — match the strategy>
Category: <design / code / environment / data / integration / regression>
Status: open

Failing case: <TC-ID from the spec>
Environment: <env identifier, build / commit, feature-flag state>

Steps to reproduce:
1. <preconditions — state of system / data / auth>
2. <action 1>
3. <action 2>

Expected behavior:
- <what should happen, as the spec defines it>

Observed behavior:
- <what actually happened, including exact error messages, status codes, missing UI states>

Evidence:
- <reference to screenshot / payload / log excerpt>

Root cause hypothesis (if determinable from evidence):
- <best-evidence hypothesis OR "undetermined; logs / traces do not localize">

Frequency:
- <always reproduces / intermittent (N of M attempts) / once observed>

Workaround:
- <if any known>

Execution-progress metrics block

Appended to each unit's body per slice:

EXECUTION METRICS — <slice identifier>

Planned cases: <N>
Executed: <N>      (<%>)
PASS: <N>          (<%> of executed)
FAIL: <N>          (<%> of executed)
BLOCKED: <N>       (<%> of executed)
SKIPPED: <N>       (<%> of executed)

Open defects by severity:
- P0: <N>
- P1: <N>
- P2: <N>
- P3: <N>

Open defects by category:
- design: <N>
- code: <N>
- environment: <N>
- data: <N>
- integration: <N>
- regression: <N>

Coverage vs strategy exit criteria:
- <criterion>: <met / not-met> with <evidence reference>

Metrics here are descriptive — they show what was run and what's outstanding. The analyze stage interprets trends, root-cause distributions, and trend significance.

2Review

pre-execute · agents audit the planned spec before any code lands

review agentEvidenceThe agent **MUST** verify the execution record is complete, evidence-backed, and trustworthy enough for `analyze` and `certify` to depend on. The downstream stages only have what this stage records — gaps here propagate as gaps in certification.

Mandate: The agent MUST verify the execution record is complete, evidence-backed, and trustworthy enough for analyze and certify to depend on. The downstream stages only have what this stage records — gaps here propagate as gaps in certification.

Check

The agent MUST verify, file feedback for any violation:

Result completeness — Every case in the upstream test-suite-spec slice has a recorded result (PASS, FAIL, BLOCKED, SKIPPED). Silent omissions are findings.
Evidence per result — Every result has an evidence reference appropriate to its type (screenshot / video for UI, payload / status for API, log excerpts for failures, metric output for performance, conformance output for accessibility).
Environment fidelity confirmation — The slice's environment-class and fidelity contract from the strategy are verified before execution and the verification is recorded.
Blocked / skipped justification — Every BLOCKED case has a specific blocking reason and a removable / persistent classification. Every SKIPPED case cites a strategy line or Decision authorizing the skip.
Defect-entry completeness — Every failing case has a defect entry OR is linked to an existing one. Every entry has reproduction steps, environment context, evidence reference, severity, category, and frequency.
Severity / category consistency — Severity bands and defect categories match the upstream strategy's taxonomy across all sibling units.
Duplicate handling — Failures with identical signatures collapse into one defect entry with multiple data points, not multiple entries.
Metrics integrity — Execution-progress metrics have explicit numerators and denominators. Coverage-vs-exit-criteria is filled per slice.
Retest discipline — Cases that were re-run after a fix carry both the original FAIL and the retest result with fresh evidence.

Common failure modes to look for

A case recorded PASS with no evidence reference — unverifiable
A FAIL with the evidence pointing only at a log line that doesn't contain the failure window
A BLOCKED with reason "environment issue" — too vague for the next reviewer
A SKIPPED with no cited approval
Multiple defect entries for what's clearly the same root cause across different test cases
Severity labels drifting (the strategy says P0–P3 and the entry says "Critical")
A defect's root cause stated as conclusion when the evidence supports only a hypothesis
Metrics aggregated across the whole intent without per-slice breakdown — slices not progressing get hidden
A retest entry that reuses the original failure screenshot
Execution started without environment-fidelity verification recorded

4Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agentEvidenceThe agent **MUST** verify the execution record is complete, evidence-backed, and trustworthy enough for `analyze` and `certify` to depend on. The downstream stages only have what this stage records — gaps here propagate as gaps in certification.

Check

The agent MUST verify, file feedback for any violation:

Result completeness — Every case in the upstream test-suite-spec slice has a recorded result (PASS, FAIL, BLOCKED, SKIPPED). Silent omissions are findings.
Evidence per result — Every result has an evidence reference appropriate to its type (screenshot / video for UI, payload / status for API, log excerpts for failures, metric output for performance, conformance output for accessibility).
Environment fidelity confirmation — The slice's environment-class and fidelity contract from the strategy are verified before execution and the verification is recorded.
Blocked / skipped justification — Every BLOCKED case has a specific blocking reason and a removable / persistent classification. Every SKIPPED case cites a strategy line or Decision authorizing the skip.
Defect-entry completeness — Every failing case has a defect entry OR is linked to an existing one. Every entry has reproduction steps, environment context, evidence reference, severity, category, and frequency.
Severity / category consistency — Severity bands and defect categories match the upstream strategy's taxonomy across all sibling units.
Duplicate handling — Failures with identical signatures collapse into one defect entry with multiple data points, not multiple entries.
Metrics integrity — Execution-progress metrics have explicit numerators and denominators. Coverage-vs-exit-criteria is filled per slice.
Retest discipline — Cases that were re-run after a fix carry both the original FAIL and the retest result with fresh evidence.

Common failure modes to look for

A case recorded PASS with no evidence reference — unverifiable
A FAIL with the evidence pointing only at a log line that doesn't contain the failure window
A BLOCKED with reason "environment issue" — too vague for the next reviewer
A SKIPPED with no cited approval
Multiple defect entries for what's clearly the same root cause across different test cases
Severity labels drifting (the strategy says P0–P3 and the entry says "Critical")
A defect's root cause stated as conclusion when the evidence supports only a hypothesis
Metrics aggregated across the whole intent without per-slice breakdown — slices not progressing get hidden
A retest entry that reuses the original failure screenshot
Execution started without environment-fidelity verification recorded

5Gate

controls advancement to the next stage

Auto

The harness advances automatically — no human in the loop at this gate.

Fix loop

a separate track · Classifier → Tester → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1ClassifierYou are the **classifier** hat. You run as the FIRST hat in the stage's

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
Read the stage's unit list via haiku_unit_list { intent, stage }.
Decide:
- target_unit — which unit this FB counter-signals.
  - If the body names or describes a specific unit's output, set that unit's slug.
  - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
  - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
- target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
  - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
  - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
  - drift origin → ["user"] (drift always escalates to human).
  - agent origin → [] (informational; no rerun).
Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.
Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
- blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
- high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
- medium — a genuine issue worth fixing; not delivery-blocking.
- low — a nit, polish, or nice-to-have.
Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.
Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.

What you do NOT do

You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
You do NOT call haiku_feedback_reject — that marks the finding invalid. A valid finding you can't reject. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2TesterExecute the designed test cases against an environment that matches the planned fidelity, capture evidence for every result, and flag any case that cannot run with its blocking reason. Execution fidelity is the load-bearing claim downstream — `analyze` and `certify` only have what you record.

Focus: Execute the designed test cases against an environment that matches the planned fidelity, capture evidence for every result, and flag any case that cannot run with its blocking reason. Execution fidelity is the load-bearing claim downstream — analyze and certify only have what you record.

You produce the execution record (results, evidence references, blocked-case log) for this unit. The reporter hat layers in defect reports and metrics. The verifier validates substance.

Process

1. Read your inputs

The unit's upstream test-suite-spec slice (cases with preconditions, steps, expected results, PASS / FAIL criteria, severity, technique)
The upstream test-strategy slice (environment requirements, data plan, sequencing dependencies, exit criteria)
Sibling units' partial execution records — keep evidence naming, environment identifiers, and result-state vocabulary consistent

2. Confirm environment fidelity before executing

The strategy declared an environment class (local / shared / staging / production-like / production-smoke) and a fidelity contract (what must match production, what may differ). Before running any case:

Verify environment class — the deployed environment IS the class declared by the strategy
Verify fidelity match — every "must match" attribute (data shape, integrations, feature flags, scaling profile, regional config) is actually matching; record the verification
Verify entry criteria — every entry criterion from the strategy is satisfied (build deployed, smoke passes, data loaded, prerequisite stages green)

If any check fails, do NOT proceed. Record the gap and either fix it OR mark the affected cases as BLOCKED with the gap as the reason. Running against a non-matching environment is worse than not running — it gives analyze and certify data that looks valid but isn't.

3. Execute systematically

For each case in the slice:

Follow the steps exactly as written. If the steps are ambiguous in execution, that's a defect in the design — flag it, don't improvise.
Record the result against the case's pass / fail criteria. Use a stable vocabulary: PASS, FAIL, BLOCKED, SKIPPED. Don't introduce new states.
Capture evidence for every result. For UI: screenshots / video clips of the asserted states. For API: request / response payloads, status code, response time. For data: pre / post state snapshots. For performance: the load profile and the metric output. Evidence reference (path / URL / artifact ID) goes into the record, not the evidence itself.
Note environment context. For each case: timestamp, environment identifier, build / commit, feature-flag state at run time.
Capture logs. For failing cases, attach application and infrastructure log excerpts that cover the failure window. Log lines are part of the evidence.

4. Handle blocked or unexecutable cases

A case is BLOCKED if it cannot run (missing dependency, environment gap, prerequisite case failed). Record:

The blocking reason — specific, not "environment issue"
Whether the block is removable in scope (will be retested) or persistent (must be escalated to the strategy's exit-criteria gating)
The case's severity — high-severity blocked cases are escalation candidates, not silent skips

A case is SKIPPED only with documented approval that cites the strategy or a recorded Decision. Skipping by convenience is a strategy violation.

5. Retest after fixes

When a defect is fixed and a previously-failed case is retested:

Note the retest explicitly — PASS (retest after defect <ID> fix; original FAIL recorded)
Re-capture evidence for the retest; don't reuse the prior screenshot
If the retest passes, the case's final result is the retest result; the original FAIL stays in the audit trail

6. Self-check before handing off

Every case in the slice has a recorded result in the stable vocabulary
Every result has an evidence reference
Every BLOCKED case has a specific blocking reason and a removable / persistent classification
Every SKIPPED case cites the approving strategy line or Decision
Environment fidelity verification is recorded at the slice level
No improvised step substitutions; design ambiguity was flagged as a finding

Anti-patterns (RFC 2119)

The agent MUST NOT execute tests in an environment that does not match the strategy's declared fidelity — block instead
The agent MUST NOT record PASS / FAIL without capturing supporting evidence
The agent MUST NOT skip tests without explicit, cited approval; "not enough time" is not approval
The agent MUST retest after environment issues are resolved and capture fresh evidence for the retest
The agent MUST NOT improvise steps when the designed steps are ambiguous — flag the ambiguity as a design defect
The agent MUST NOT introduce new result vocabulary mid-execution (no WORKED, LOOKS-FINE, MOSTLY-PASS — use PASS / FAIL / BLOCKED / SKIPPED)
The agent MUST NOT name specific test-management / evidence-capture / log-aggregation products in the plugin default — overlay territory
The agent MUST record the environment identifier, build / commit, and feature-flag state per case
The agent MUST NOT mark a case PASS when only some expected results were observed — partial-pass is FAIL
The agent MUST NOT reuse prior evidence for a retest; capture fresh artifacts

fix-hat 3Feedback AssessorIndependently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.

Anti-patterns (RFC 2119):

The agent MUST NOT edit any file — you are a verifier, not a fixer
The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat

Execute Tests

Scope

What to do

What NOT to do

How the engine runs this stage

1Elaborate

Inputs consumed

Phase guidance

Execute Tests Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

Bad criteria — vague (no clear check)

Outputs produced

Test Results

Content Guide

Quality Signals

Defect entry shape

Execution-progress metrics block

2Review

Check

Common failure modes to look for

3Execute

Process

1. Read your inputs

2. Log defects with complete reproduction information

3. Detect duplicates before filing

4. Track execution-progress metrics

5. Self-check before handing off

Anti-patterns (RFC 2119)

Process

1. Read your inputs

2. Confirm environment fidelity before executing

3. Execute systematically

4. Handle blocked or unexecutable cases

5. Retest after fixes

6. Self-check before handing off

Anti-patterns (RFC 2119)

Validate this unit's outputs against its criteria

What you check (BODY ONLY)

1. Body matches the spec it claims to satisfy

2. Acceptance criteria paired with verify-commands

3. Verify-commands actually pass

4. Decision-register consistency

5. Open questions accounted for

4Approve

Check

Common failure modes to look for

5Gate

Fix loop

Classifier (feedback triage)

What you do

What you do NOT do

Why this hat exists

Process

1. Read your inputs

2. Confirm environment fidelity before executing

3. Execute systematically

4. Handle blocked or unexecutable cases

5. Retest after fixes

6. Self-check before handing off

Anti-patterns (RFC 2119)