Execute Tests
Auto gate
Execute tests and log defects
Run the designed test suite against the planned environment, capture evidence, and log defects — producing the test-results record that analyze and certify depend on. Execution discipline here is what makes the downstream data trustworthy.
Scope
Test execution and evidence: running cases at the planned environment fidelity, recording each result with proof, and writing accurate defect reports. Execute-tests decides what actually happened when the tests ran, not what the tests are (design-tests) or what the results imply (analyze).
What to do
- Confirm the environment matches the planned fidelity before running anything — results from the wrong environment are noise.
- Capture concrete evidence for every result, pass or fail, so the record stands on its own.
- Write defect reports with enough reproduction detail and accurate severity that someone else could confirm them.
- Flag blocked or unexecutable cases explicitly rather than silently skipping them.
What NOT to do
- Don't redesign or reinterpret cases mid-run to make them pass — a wrong case is feedback to design-tests.
- Don't analyze trends, compute quality verdicts, or recommend release/defer/block — that's analyze.
- Don't record a result without the evidence that backs it.
- Don't leave a case's outcome unrecorded or its blocked status unexplained.
How the engine runs this stage
1. Elaborate
autonomous · plan the work, fan out discovery, declare outputs
Inputs consumed
Phase guidance
phase override: ELABORATION
Execute Tests Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Test results document pass/fail status for every test case with evidence (screenshots, logs, or output) for each failure"
- "Defect reports include reproduction steps, environment details, severity classification, and root cause hypothesis"
- "Coverage report confirms execution percentage against the planned test suite with justification for any unexecuted tests"
Bad criteria — vague (no clear check)
- "Tests are run"
- "Defects are logged"
- "Testing is complete"
Outputs produced
output template: Test Results
Test Results
Test execution results with evidence, defect reports, and coverage metrics. Two artifact families land here: defect entries the reporter hat files for each failing case, and execution-progress metrics appended to each unit's body.
Content Guide
- Execution summary — pass/fail/blocked/skipped counts with overall coverage percentage
- Test results — each test case with status and evidence (screenshots, logs for failures)
- Defect reports — each defect with reproduction steps, environment, severity, and root cause hypothesis
- Blocked tests — tests that could not be executed with reasons and impact assessment
- Coverage metrics — execution percentage against planned suite with gap justification
- Environment record — test environment configuration confirming the fidelity the strategy planned for
Quality Signals
- All planned tests are accounted for (pass, fail, skip, blocked)
- Failures include evidence sufficient for defect reproduction
- Defect reports have reproduction steps, severity, and environment details
- Coverage metrics are accurate against the planned test suite
Defect entry shape
DEFECT ID: <stable ID — match the project's taxonomy if one exists>
Title: <one-line, observable, in user language>
Severity: <P0 / P1 / P2 / P3 — match the strategy>
Category: <design / code / environment / data / integration / regression>
Status: open
Failing case: <TC-ID from the spec>
Environment: <env identifier, build / commit, feature-flag state>
Steps to reproduce:
1. <preconditions — state of system / data / auth>
2. <action 1>
3. <action 2>
Expected behavior:
- <what should happen, as the spec defines it>
Observed behavior:
- <what actually happened, including exact error messages, status codes, missing UI states>
Evidence:
- <reference to screenshot / payload / log excerpt>
Root cause hypothesis (if determinable from evidence):
- <best-evidence hypothesis OR "undetermined; logs / traces do not localize">
Frequency:
- <always reproduces / intermittent (N of M attempts) / once observed>
Workaround:
- <if any known>
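The entry shape above lends itself to a mechanical completeness check before an entry is filed. The following is a minimal sketch, not part of the plugin: the dictionary keys, the `entry_problems` helper, and the hard-coded P0 to P3 and category sets are all assumptions for illustration, and a real project would take its taxonomy from the strategy.

```python
# Hypothetical field names mirroring the labels in the entry shape above.
REQUIRED_FIELDS = (
    "defect_id", "title", "severity", "category", "failing_case",
    "environment", "steps", "evidence", "frequency",
)
# Assumed taxonomy; the strategy's own bands and categories take precedence.
SEVERITIES = {"P0", "P1", "P2", "P3"}
CATEGORIES = {"design", "code", "environment", "data", "integration", "regression"}


def entry_problems(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is complete."""
    problems = [f"missing: {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    severity = entry.get("severity")
    if severity and severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity}")
    category = entry.get("category")
    if category and category not in CATEGORIES:
        problems.append(f"unknown category: {category}")
    return problems
```

An entry labelled "Critical" instead of one of the strategy's bands would surface as `unknown severity: Critical`, which is the label drift the review agent looks for.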
Execution-progress metrics block
Appended to each unit's body per slice:
EXECUTION METRICS — <slice identifier>
Planned cases: <N>
Executed: <N> (<%>)
PASS: <N> (<%> of executed)
FAIL: <N> (<%> of executed)
BLOCKED: <N> (<%> of executed)
SKIPPED: <N> (<%> of executed)
Open defects by severity:
- P0: <N>
- P1: <N>
- P2: <N>
- P3: <N>
Open defects by category:
- design: <N>
- code: <N>
- environment: <N>
- data: <N>
- integration: <N>
- regression: <N>
Coverage vs strategy exit criteria:
- <criterion>: <met / not-met> with <evidence reference>
Metrics here are descriptive — they show what was run and what's outstanding. The analyze stage interprets trends, their significance, and root-cause distributions.
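As an illustration of explicit numerators and denominators, here is a minimal sketch that derives the block's counts from a per-case result map. It is not the engine's implementation. The function names are hypothetical, and the sketch assumes that "executed" means PASS plus FAIL and that each state is reported against the executed count, as the template's labels state.

```python
from collections import Counter

VOCABULARY = ("PASS", "FAIL", "BLOCKED", "SKIPPED")


def execution_metrics(planned: int, results: dict) -> dict:
    """Map each metric name to an explicit (numerator, denominator) pair."""
    counts = Counter(results.values())
    unknown = sorted(set(counts) - set(VOCABULARY))
    if unknown:
        raise ValueError(f"result states outside the vocabulary: {unknown}")
    # Assumption: only PASS and FAIL count as executed.
    executed = counts["PASS"] + counts["FAIL"]
    metrics = {"Executed": (executed, planned)}
    for state in VOCABULARY:
        metrics[state] = (counts[state], executed)
    # Planned cases with no recorded result at all are silent omissions.
    metrics["Unrecorded"] = (planned - len(results), planned)
    return metrics


def fmt(numerator: int, denominator: int) -> str:
    """Render 'N (P%)', or 'N (n/a)' when the denominator is zero."""
    if denominator == 0:
        return f"{numerator} (n/a)"
    return f"{numerator} ({numerator / denominator:.0%})"
```

Keeping the pair rather than only the percentage lets a reviewer see which denominator was used, and the "Unrecorded" entry makes a gap between planned and recorded cases visible instead of hiding it inside a percentage.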
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Evidence
Mandate: The agent MUST verify the execution record is complete, evidence-backed, and trustworthy enough for analyze and certify to depend on. The downstream stages only have what this stage records — gaps here propagate as gaps in certification.
Check
The agent MUST verify each of the following and file feedback for any violation:
- Result completeness — Every case in the upstream test-suite-spec slice has a recorded result (PASS, FAIL, BLOCKED, SKIPPED). Silent omissions are findings.
- Evidence per result — Every result has an evidence reference appropriate to its type (screenshot / video for UI, payload / status for API, log excerpts for failures, metric output for performance, conformance output for accessibility).
- Environment fidelity confirmation — The slice's environment-class and fidelity contract from the strategy are verified before execution and the verification is recorded.
- Blocked / skipped justification — Every BLOCKED case has a specific blocking reason and a removable / persistent classification. Every SKIPPED case cites a strategy line or Decision authorizing the skip.
- Defect-entry completeness — Every failing case has a defect entry OR is linked to an existing one. Every entry has reproduction steps, environment context, evidence reference, severity, category, and frequency.
- Severity / category consistency — Severity bands and defect categories match the upstream strategy's taxonomy across all sibling units.
- Duplicate handling — Failures with identical signatures collapse into one defect entry with multiple data points, not multiple entries.
- Metrics integrity — Execution-progress metrics have explicit numerators and denominators. Coverage-vs-exit-criteria is filled per slice.
- Retest discipline — Cases that were re-run after a fix carry both the original FAIL and the retest result with fresh evidence.
Common failure modes to look for
- A case recorded PASS with no evidence reference — unverifiable
- A FAIL with the evidence pointing only at a log line that doesn't contain the failure window
- A BLOCKED with reason "environment issue" — too vague for the next reviewer
- A SKIPPED with no cited approval
- Multiple defect entries for what's clearly the same root cause across different test cases
- Severity labels drifting (the strategy says P0–P3 and the entry says "Critical")
- A defect's root cause stated as conclusion when the evidence supports only a hypothesis
- Metrics aggregated across the whole intent without per-slice breakdown — slices not progressing get hidden
- A retest entry that reuses the original failure screenshot
- Execution started without environment-fidelity verification recorded
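Several of the checks and failure modes above are mechanical enough to sketch in code. The following is a minimal illustration, assuming a hypothetical record layout (a dict keyed by case ID with `result`, `evidence`, `blocking_reason`, `block_class`, and `approval` keys). None of these names come from the engine, and the list of vague reasons is an assumption.

```python
VOCABULARY = {"PASS", "FAIL", "BLOCKED", "SKIPPED"}
# Assumed examples of reasons too vague for the next reviewer.
VAGUE_REASONS = {"environment issue"}


def audit_record(planned_ids: list, record: dict) -> list:
    """Return one finding per violation, in planned-case order."""
    findings = []
    for case_id in planned_ids:
        entry = record.get(case_id)
        if entry is None:
            findings.append(f"{case_id}: no recorded result")
            continue
        result = entry.get("result")
        if result not in VOCABULARY:
            findings.append(f"{case_id}: result {result!r} is outside the vocabulary")
        if not entry.get("evidence"):
            findings.append(f"{case_id}: no evidence reference")
        if result == "BLOCKED":
            reason = (entry.get("blocking_reason") or "").strip()
            if not reason or reason.lower() in VAGUE_REASONS:
                findings.append(f"{case_id}: blocking reason missing or too vague")
            if entry.get("block_class") not in {"removable", "persistent"}:
                findings.append(f"{case_id}: block not classified removable / persistent")
        if result == "SKIPPED" and not entry.get("approval"):
            findings.append(f"{case_id}: SKIPPED without cited approval")
    return findings
```

A check like this only covers presence and vocabulary. Whether a log excerpt actually contains the failure window, or whether a root cause is overstated, still needs the agent's judgment.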
3. Execute
per-unit baton · Tester → Reporter → Verifier
hat 1: Reporter
Focus: Turn the tester's execution record into defect entries an engineer can act on without follow-up, and track execution-progress metrics that downstream stages compare against the plan. A defect missing reproduction information loops back through a triage cycle that costs more than the original entry.
You read the tester's execution record. You produce defect entries and the metrics summary, appended to the unit's body. You do not change PASS / FAIL results or evidence — that's the tester's record of truth.
Process
1. Read your inputs
- The unit's executed results, including evidence references, environment context, and blocked-case log
- The upstream test-suite spec (so defect severity references the case's planned severity)
- The upstream test strategy (severity / priority taxonomy, defect-categorization rules)
- Sibling units' defect entries — keep severity labels, category names, and reproduction-template structure consistent
2. Log defects with complete reproduction information
The canonical defect-entry shape lives in plugin/studios/quality-assurance/stages/execute-tests/outputs/TEST-RESULTS.md. Use that shape directly. Principles:
- Stable reproduction over rich prose. A reader who has never seen the system should reproduce it from the steps alone.
- Severity matches the strategy's taxonomy. If the strategy says P0 / P1 / P2 / P3 with thresholds, use those. Don't introduce new labels mid-cycle.
- Categorization drives later analysis. Use the strategy's defined categories (design, code, environment, data, integration, regression). If a defect spans categories, pick the primary and note the secondary.
- Root cause is a hypothesis, not a conclusion. Mark it as such; the analyze stage refines it.
- Frequency matters. Intermittent failures are the most expensive to triage; recording the N-of-M attempt count saves the developer guessing.
3. Detect duplicates before filing
Before filing a new defect, scan sibling-unit defect entries for the same failure signature (same case, same observed behavior, same environment). If a duplicate exists:
- Reference the existing defect ID instead of filing a new one
- Add the new failure observation as a frequency / environment data point on the existing entry
Duplicate filing is noise that triage spends hours collapsing later.
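The duplicate check can be pictured as a lookup on a failure signature. This is a minimal sketch under stated assumptions: the signature is the triple named above (case, observed behavior, environment), the observed text is normalized by trimming and lower-casing (an assumption, since the source does not define matching rules), and the dict layout and function names are hypothetical.

```python
def signature(observation: dict) -> tuple:
    """Failure signature: same case, same observed behavior, same environment."""
    return (
        observation["case_id"],
        observation["observed"].strip().lower(),  # assumed normalization
        observation["environment"],
    )


def file_or_attach(defects: list, observation: dict, new_id: str) -> str:
    """Attach to a matching defect if one exists; otherwise file a new one.

    Returns the ID of the defect that now carries the observation.
    """
    sig = signature(observation)
    for defect in defects:
        if defect["signature"] == sig:
            # Duplicate: record the observation as an extra data point.
            defect["observations"].append(observation)
            return defect["id"]
    defects.append({"id": new_id, "signature": sig, "observations": [observation]})
    return new_id
```

Exact matching like this only catches literal repeats. Failures that share a root cause but surface through different cases still need the reporter to recognize and link them by hand.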
4. Track execution-progress metrics
Append the metrics summary to the unit body. The canonical block shape is in the outputs file linked above. Metrics here are descriptive — they show what was run and what's outstanding. The analyze stage interprets trends and root-cause distributions.
5. Self-check before handing off
- Every failing case has a defect entry OR is linked to an existing defect (no failures without trace)
- Every defect entry has full reproduction steps, environment context, evidence reference, severity, category, and frequency
- Severity and category labels match the strategy's taxonomy
- No duplicate defects filed (existing IDs referenced instead)
- Execution-progress metrics are recorded with explicit numerator / denominator
- Coverage-vs-exit-criteria section is filled per slice
Anti-patterns (RFC 2119)
- The agent MUST NOT file defects without reproduction steps or environment context
- The agent MUST NOT misclassify defect severity based on personal judgment when the strategy defines explicit thresholds
- The agent MUST NOT file duplicate defects without checking for existing entries
- The agent MUST NOT edit the tester's PASS / FAIL / BLOCKED / SKIPPED results or evidence references — those are the record of truth
hat 2: Tester
Focus: Execute the designed test cases against an environment that matches the planned fidelity, capture evidence for every result, and flag any case that cannot run with its blocking reason. Execution fidelity is the load-bearing claim downstream — analyze and certify only have what you record.
You produce the execution record (results, evidence references, blocked-case log) for this unit. The reporter hat layers in defect reports and metrics. The verifier validates substance.
Process
1. Read your inputs
- The unit's upstream test-suite-spec slice (cases with preconditions, steps, expected results, PASS / FAIL criteria, severity, technique)
- The upstream test-strategy slice (environment requirements, data plan, sequencing dependencies, exit criteria)
- Sibling units' partial execution records — keep evidence naming, environment identifiers, and result-state vocabulary consistent
2. Confirm environment fidelity before executing
The strategy declared an environment class (local / shared / staging / production-like / production-smoke) and a fidelity contract (what must match production, what may differ). Before running any case:
- Verify environment class — the deployed environment IS the class declared by the strategy
- Verify fidelity match — every "must match" attribute (data shape, integrations, feature flags, scaling profile, regional config) is actually matching; record the verification
- Verify entry criteria — every entry criterion from the strategy is satisfied (build deployed, smoke passes, data loaded, prerequisite stages green)
If any check fails, do NOT proceed. Record the gap and either fix it OR mark the affected cases as BLOCKED with the gap as the reason. Running against a non-matching environment is worse than not running — it gives analyze and certify data that looks valid but isn't.
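The "verify, record, then block on any gap" rule can be sketched as a comparison of the strategy's must-match attributes against what the environment actually reports. This is an illustrative sketch only: the attribute names, the flat key/value representation of the fidelity contract, and both function names are assumptions, not the engine's API.

```python
def fidelity_gaps(must_match: dict, actual: dict) -> dict:
    """Return {attribute: (expected, found)} for every must-match mismatch."""
    return {
        name: (expected, actual.get(name))
        for name, expected in must_match.items()
        if actual.get(name) != expected
    }


def block_affected(case_ids: list, gaps: dict) -> dict:
    """Mark cases BLOCKED with the specific gap as the reason; {} if no gaps."""
    if not gaps:
        return {}
    reason = "fidelity gap: " + "; ".join(
        f"{name} expected {expected!r}, found {found!r}"
        for name, (expected, found) in sorted(gaps.items())
    )
    return {
        case_id: {"result": "BLOCKED", "blocking_reason": reason}
        for case_id in case_ids
    }
```

Writing the expected and found values into the blocking reason keeps it specific, so the record does not fall back to the "environment issue" wording the review agent rejects.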
3. Execute systematically
For each case in the slice:
- Follow the steps exactly as written. If the steps are ambiguous in execution, that's a defect in the design — flag it, don't improvise.
- Record the result against the case's pass / fail criteria. Use a stable vocabulary: PASS, FAIL, BLOCKED, SKIPPED. Don't introduce new states.
- Capture evidence for every result. For UI: screenshots / video clips of the asserted states. For API: request / response payloads, status code, response time. For data: pre / post state snapshots. For performance: the load profile and the metric output. Evidence reference (path / URL / artifact ID) goes into the record, not the evidence itself.
- Note environment context. For each case: timestamp, environment identifier, build / commit, feature-flag state at run time.
- Capture logs. For failing cases, attach application and infrastructure log excerpts that cover the failure window. Log lines are part of the evidence.
4. Handle blocked or unexecutable cases
A case is BLOCKED if it cannot run (missing dependency, environment gap, prerequisite case failed). Record:
- The blocking reason — specific, not "environment issue"
- Whether the block is removable in scope (will be retested) or persistent (must be escalated to the strategy's exit-criteria gating)
- The case's severity — high-severity blocked cases are escalation candidates, not silent skips
A case is SKIPPED only with documented approval that cites the strategy or a recorded Decision. Skipping by convenience is a strategy violation.
5. Retest after fixes
When a defect is fixed and a previously-failed case is retested:
- Note the retest explicitly — PASS (retest after defect <ID> fix; original FAIL recorded)
- Re-capture evidence for the retest; don't reuse the prior screenshot
- If the retest passes, the case's final result is the retest result; the original FAIL stays in the audit trail
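One way to picture the retest rules is an append-only history per case, where the last entry is the final result and earlier entries remain as the audit trail. The sketch below is an assumption-laden illustration: the history layout, the function names, and the use of evidence-reference equality as the test for "fresh evidence" are all hypothetical.

```python
def record_retest(history: list, result: str, evidence: str, defect_id: str) -> list:
    """Return a new history with the retest appended; the original is untouched."""
    if not history or history[-1]["result"] != "FAIL":
        raise ValueError("a retest must follow a recorded FAIL")
    if any(entry["evidence"] == evidence for entry in history):
        raise ValueError("a retest must carry fresh evidence")
    note = f"{result} (retest after defect {defect_id} fix; original FAIL recorded)"
    return history + [{"result": result, "evidence": evidence, "note": note}]


def final_result(history: list) -> str:
    """The case's final result is the most recent entry."""
    return history[-1]["result"]
```

Returning a new list instead of mutating the old one is a design choice in this sketch: it makes it harder for a retest to overwrite the original FAIL by accident.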
6. Self-check before handing off
- Every case in the slice has a recorded result in the stable vocabulary
- Every result has an evidence reference
- Every BLOCKED case has a specific blocking reason and a removable / persistent classification
- Every SKIPPED case cites the approving strategy line or Decision
- Environment fidelity verification is recorded at the slice level
- No improvised step substitutions; design ambiguity was flagged as a finding
Anti-patterns (RFC 2119)
- The agent MUST NOT execute tests in an environment that does not match the strategy's declared fidelity — block instead
- The agent MUST NOT record PASS / FAIL without capturing supporting evidence
- The agent MUST NOT skip tests without explicit, cited approval; "not enough time" is not approval
- The agent MUST retest after environment issues are resolved and capture fresh evidence for the retest
- The agent MUST NOT improvise steps when the designed steps are ambiguous — flag the ambiguity as a design defect
- The agent MUST NOT introduce new result vocabulary mid-execution (no WORKED, LOOKS-FINE, MOSTLY-PASS — use PASS / FAIL / BLOCKED / SKIPPED)
- The agent MUST NOT name specific test-management / evidence-capture / log-aggregation products in the plugin default — overlay territory
- The agent MUST record the environment identifier, build / commit, and feature-flag state per case
- The agent MUST NOT mark a case PASS when only some expected results were observed — partial-pass is FAIL
- The agent MUST NOT reuse prior evidence for a retest; capture fresh artifacts
hat 3: Verifier
Focus: Validate the per-unit build artifact for the execute-tests stage of quality-assurance. Units here are test-execution surface — discrete pieces of work with executable acceptance criteria. Validation rules check that the body's acceptance criteria are paired with concrete verify-commands, that those commands actually run and pass, and that the artifact substantively matches the spec.
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
- The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST name a specific failed criterion in any rejection.
- The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
What you check (BODY ONLY)
1. Body matches the spec it claims to satisfy
The unit body MUST substantively address every acceptance criterion declared in the unit's spec section. Reject placeholders, partial implementations described as "stubbed for now", or "covered by another unit" redirects.
2. Acceptance criteria paired with verify-commands
Every acceptance criterion in the body MUST be paired with a concrete shell command (or test invocation) that returns a clear pass/fail signal. Vague criteria ("works correctly", "tests pass") are a reject. Map verify-commands to the project's actual stack — read package.json / pyproject.toml / Cargo.toml / go.mod to know which test runner / coverage tool / linter the project uses.
3. Verify-commands actually pass
Run the named verify-commands. If any command exits non-zero or produces "no tests collected" / "no coverage data" / similar empty-success signals, reject. Cite the failing command and its exit code in the rejection reason.
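The "non-zero exit or empty success" rule can be sketched with the standard library's `subprocess.run`. This is a minimal illustration, not the verifier's implementation: the marker list contains only the two phrases quoted above, the function name is hypothetical, and the demo command is a stand-in for whatever runner the project's manifest actually names.

```python
import subprocess
import sys

# Only the two signals quoted in the text; a real list is project-specific.
EMPTY_SUCCESS_MARKERS = ("no tests collected", "no coverage data")


def run_verify_command(argv: list) -> tuple:
    """Run one verify-command and return (passed, reason)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    command = " ".join(argv)
    if proc.returncode != 0:
        return False, f"{command} exited with code {proc.returncode}"
    output = (proc.stdout + proc.stderr).lower()
    for marker in EMPTY_SUCCESS_MARKERS:
        if marker in output:
            return False, f"{command} reported an empty success: {marker!r}"
    return True, "ok"


if __name__ == "__main__":
    # Stand-in command; a real unit would name its project's own test runner.
    print(run_verify_command([sys.executable, "-c", "print('3 passed')"]))
```

The returned reason already names the failing command and its exit code, which is what the rejection is required to cite.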
4. Decision-register consistency
The unit must not introduce an approach contradicting a recorded Decision (e.g., a sync API when Decision N chose async). Cite the Decision ID.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Build-stage open questions block downstream consumers — be strict.
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Evidence
Mandate: The agent MUST verify the execution record is complete, evidence-backed, and trustworthy enough for analyze and certify to depend on. The downstream stages only have what this stage records — gaps here propagate as gaps in certification.
Check
The agent MUST verify each of the following and file feedback for any violation:
- Result completeness — Every case in the upstream test-suite-spec slice has a recorded result (PASS, FAIL, BLOCKED, SKIPPED). Silent omissions are findings.
- Evidence per result — Every result has an evidence reference appropriate to its type (screenshot / video for UI, payload / status for API, log excerpts for failures, metric output for performance, conformance output for accessibility).
- Environment fidelity confirmation — The slice's environment-class and fidelity contract from the strategy are verified before execution and the verification is recorded.
- Blocked / skipped justification — Every BLOCKED case has a specific blocking reason and a removable / persistent classification. Every SKIPPED case cites a strategy line or Decision authorizing the skip.
- Defect-entry completeness — Every failing case has a defect entry OR is linked to an existing one. Every entry has reproduction steps, environment context, evidence reference, severity, category, and frequency.
- Severity / category consistency — Severity bands and defect categories match the upstream strategy's taxonomy across all sibling units.
- Duplicate handling — Failures with identical signatures collapse into one defect entry with multiple data points, not multiple entries.
- Metrics integrity — Execution-progress metrics have explicit numerators and denominators. Coverage-vs-exit-criteria is filled per slice.
- Retest discipline — Cases that were re-run after a fix carry both the original FAIL and the retest result with fresh evidence.
Common failure modes to look for
- A case recorded PASS with no evidence reference — unverifiable
- A FAIL with the evidence pointing only at a log line that doesn't contain the failure window
- A BLOCKED with reason "environment issue" — too vague for the next reviewer
- A SKIPPED with no cited approval
- Multiple defect entries for what's clearly the same root cause across different test cases
- Severity labels drifting (the strategy says P0–P3 and the entry says "Critical")
- A defect's root cause stated as conclusion when the evidence supports only a hypothesis
- Metrics aggregated across the whole intent without per-slice breakdown — slices not progressing get hidden
- A retest entry that reuses the original failure screenshot
- Execution started without environment-fidelity verification recorded
5. Gate
controls advancement to the next stage
The harness advances automatically — no human in the loop at this gate.
Fix loop
a separate track · Classifier → Tester → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback item (FB) is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit — which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium — a genuine issue worth fixing; not delivery-blocking.
   - low — a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
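The default rule of thumb for target_invalidates and the severity-first dispatch described above can be sketched as two small functions. This is a hedged illustration, not the engine's code: the function names and the finding dicts are hypothetical, and the source does not show how the fix loop actually orders findings beyond "higher severity first".

```python
from typing import List, Optional

# Rubric order from the text: blocker first, low last.
SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_invalidates(origin: str, filer: Optional[str] = None) -> List[str]:
    """Default approval roles to clear on closure, keyed by the FB's origin."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # the human re-reviews; drift always escalates
    if origin in {"adversarial-review", "studio-review"}:
        if not filer:
            raise ValueError("review-origin feedback needs the filing agent's name")
        return [filer]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings: List[dict]) -> List[dict]:
    """Order findings so higher-severity ones come first (stable for ties)."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
```

These are defaults only. The classifier still judges each finding by its actual impact, and a cross-cutting finding keeps target_unit as null regardless of what this mapping returns.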
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject — that marks the finding invalid, and a valid finding is not yours to reject. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2: Tester
Focus: Execute the designed test cases against an environment that matches the planned fidelity, capture evidence for every result, and flag any case that cannot run with its blocking reason. Execution fidelity is the load-bearing claim downstream — analyze and certify only have what you record.
You produce the execution record (results, evidence references, blocked-case log) for this unit. The reporter hat layers in defect reports and metrics. The verifier validates substance.
Process
1. Read your inputs
- The unit's upstream test-suite-spec slice (cases with preconditions, steps, expected results, PASS / FAIL criteria, severity, technique)
- The upstream test-strategy slice (environment requirements, data plan, sequencing dependencies, exit criteria)
- Sibling units' partial execution records — keep evidence naming, environment identifiers, and result-state vocabulary consistent
2. Confirm environment fidelity before executing
The strategy declared an environment class (local / shared / staging / production-like / production-smoke) and a fidelity contract (what must match production, what may differ). Before running any case:
- Verify environment class — the deployed environment IS the class declared by the strategy
- Verify fidelity match — every "must match" attribute (data shape, integrations, feature flags, scaling profile, regional config) is actually matching; record the verification
- Verify entry criteria — every entry criterion from the strategy is satisfied (build deployed, smoke passes, data loaded, prerequisite stages green)
If any check fails, do NOT proceed. Record the gap and either fix it OR mark the affected cases as BLOCKED with the gap as the reason. Running against a non-matching environment is worse than not running — it gives analyze and certify data that looks valid but isn't.
3. Execute systematically
For each case in the slice:
- Follow the steps exactly as written. If the steps are ambiguous in execution, that's a defect in the design — flag it, don't improvise.
- Record the result against the case's pass / fail criteria. Use a stable vocabulary: PASS, FAIL, BLOCKED, SKIPPED. Don't introduce new states.
- Capture evidence for every result. For UI: screenshots / video clips of the asserted states. For API: request / response payloads, status code, response time. For data: pre / post state snapshots. For performance: the load profile and the metric output. Evidence reference (path / URL / artifact ID) goes into the record, not the evidence itself.
- Note environment context. For each case: timestamp, environment identifier, build / commit, feature-flag state at run time.
- Capture logs. For failing cases, attach application and infrastructure log excerpts that cover the failure window. Log lines are part of the evidence.
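One way to make these rules hard to break is to encode them in the record type itself. The sketch below assumes a hypothetical `CaseRecord` shape. The field names and sample values are illustrative, and only the four result states come from the guidance above.

```python
from dataclasses import dataclass
from enum import Enum


class Result(Enum):
    """The stable vocabulary; any other value is rejected by Enum lookup."""
    PASS = "PASS"
    FAIL = "FAIL"
    BLOCKED = "BLOCKED"
    SKIPPED = "SKIPPED"


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical per-case record; stores references, not the evidence itself."""
    case_id: str
    result: Result
    evidence_ref: str       # path / URL / artifact ID
    timestamp: str
    environment_id: str
    build: str
    feature_flags: tuple
    log_refs: tuple = ()    # log excerpts covering the failure window

    def __post_init__(self):
        if not isinstance(self.result, Result):
            raise ValueError(f"{self.case_id}: result must be a Result member")
        if not self.evidence_ref:
            raise ValueError(f"{self.case_id}: every result needs an evidence reference")
        if self.result is Result.FAIL and not self.log_refs:
            raise ValueError(f"{self.case_id}: failing cases need log excerpts")


# Illustrative values only.
record = CaseRecord(
    case_id="TC-001",
    result=Result("PASS"),
    evidence_ref="evidence/TC-001/response.json",
    timestamp="2026-01-01T10:00:00Z",
    environment_id="staging-1",
    build="abc1234",
    feature_flags=("new-checkout=on",),
)
```

Parsing raw strings through `Result(...)` means an improvised state such as `Result("WORKED")` raises `ValueError` instead of entering the record.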
4. Handle blocked or unexecutable cases
A case is BLOCKED if it cannot run (missing dependency, environment gap, prerequisite case failed). Record:
- The blocking reason — specific, not "environment issue"
- Whether the block is removable in scope (will be retested) or persistent (must be escalated to the strategy's exit-criteria gating)
- The case's severity — high-severity blocked cases are escalation candidates, not silent skips
A case is SKIPPED only with documented approval that cites the strategy or a recorded Decision. Skipping by convenience is a strategy violation.
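The blocked and skipped rules above could be enforced with a small triage step. This is a sketch under assumptions: `BlockedCase`, `triage_blocked`, `validate_skip`, the list of vague reasons, and the severity labels are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical examples of reasons too vague to accept as a blocking reason.
VAGUE_REASONS = {"", "environment issue", "blocked", "unknown"}


@dataclass
class BlockedCase:
    case_id: str
    reason: str
    removable: bool  # True: will be retested in scope; False: persistent
    severity: str    # assumed labels: "high", "medium", "low"


def triage_blocked(case):
    """Reject vague reasons, then decide how the block is followed up."""
    if case.reason.strip().lower() in VAGUE_REASONS:
        raise ValueError(f"{case.case_id}: blocking reason is not specific")
    if not case.removable:
        return "escalate"              # persistent: goes to exit-criteria gating
    if case.severity == "high":
        return "escalation-candidate"  # removable, but too important to sit silently
    return "retest-when-unblocked"


def validate_skip(case_id, approval_ref):
    """A SKIPPED result is only valid with a cited strategy line or Decision."""
    if not approval_ref:
        raise ValueError(f"{case_id}: SKIPPED requires a cited approval")
    return {"case_id": case_id, "result": "SKIPPED", "approval_ref": approval_ref}
```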
5. Retest after fixes
When a defect is fixed and a previously-failed case is retested:
- Note the retest explicitly: PASS (retest after defect <ID> fix; original FAIL recorded)
- Re-capture evidence for the retest; don't reuse the prior screenshot
- If the retest passes, the case's final result is the retest result; the original FAIL stays in the audit trail
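The retest rules amount to an append-only history per case. The sketch below assumes hypothetical `Attempt` and `CaseHistory` types. The note format follows the wording above, and the case ID, defect ID, and evidence paths are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    result: str
    evidence_ref: str
    note: str = ""


@dataclass
class CaseHistory:
    """Append-only: earlier attempts are never overwritten."""
    case_id: str
    attempts: list = field(default_factory=list)

    def record(self, result, evidence_ref, note=""):
        # Fresh evidence is required for every attempt, including retests.
        if any(a.evidence_ref == evidence_ref for a in self.attempts):
            raise ValueError(f"{self.case_id}: evidence must be re-captured, not reused")
        self.attempts.append(Attempt(result, evidence_ref, note))

    def record_retest(self, result, evidence_ref, defect_id):
        original = self.attempts[-1].result
        note = f"retest after defect {defect_id} fix; original {original} recorded"
        self.record(result, evidence_ref, note)

    @property
    def final_result(self):
        # The latest attempt is the final result; the rest is the audit trail.
        return self.attempts[-1].result


history = CaseHistory("TC-001")
history.record("FAIL", "evidence/TC-001/run-1.png")
history.record_retest("PASS", "evidence/TC-001/run-2.png", defect_id="DEF-7")
print(history.final_result, "|", history.attempts[-1].note)
```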
6. Self-check before handing off
- Every case in the slice has a recorded result in the stable vocabulary
- Every result has an evidence reference
- Every BLOCKED case has a specific blocking reason and a removable / persistent classification
- Every SKIPPED case cites the approving strategy line or Decision
- Environment fidelity verification is recorded at the slice level
- No improvised step substitutions; design ambiguity was flagged as a finding
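Most of this checklist is mechanical and could be run as a script before handoff. The sketch below is one possible shape, assuming records are plain dictionaries. The key names (`result`, `evidence_ref`, `blocking_reason`, `block_kind`, `approval_ref`) are hypothetical. The last checklist item, on improvised steps, still needs human judgment and is not covered.

```python
VOCABULARY = {"PASS", "FAIL", "BLOCKED", "SKIPPED"}


def self_check(slice_case_ids, records, fidelity_verified):
    """Return a list of problems; an empty list means the slice is ready to hand off."""
    problems = []
    if not fidelity_verified:
        problems.append("slice: environment fidelity verification not recorded")
    for case_id in slice_case_ids:
        rec = records.get(case_id)
        if rec is None:
            problems.append(f"{case_id}: no recorded result")
            continue
        result = rec.get("result")
        if result not in VOCABULARY:
            problems.append(f"{case_id}: result {result!r} is outside the stable vocabulary")
        if not rec.get("evidence_ref"):
            problems.append(f"{case_id}: missing evidence reference")
        if result == "BLOCKED":
            if not rec.get("blocking_reason"):
                problems.append(f"{case_id}: BLOCKED without a specific reason")
            if rec.get("block_kind") not in {"removable", "persistent"}:
                problems.append(f"{case_id}: BLOCKED without removable / persistent classification")
        if result == "SKIPPED" and not rec.get("approval_ref"):
            problems.append(f"{case_id}: SKIPPED without a cited approval")
    return problems
```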
Anti-patterns (RFC 2119)
- The agent MUST NOT execute tests in an environment that does not match the strategy's declared fidelity — block instead
- The agent MUST NOT record PASS / FAIL without capturing supporting evidence
- The agent MUST NOT skip tests without explicit, cited approval; "not enough time" is not approval
- The agent MUST retest after environment issues are resolved and capture fresh evidence for the retest
- The agent MUST NOT improvise steps when the designed steps are ambiguous — flag the ambiguity as a design defect
- The agent MUST NOT introduce new result vocabulary mid-execution (no WORKED, LOOKS-FINE, MOSTLY-PASS; use PASS / FAIL / BLOCKED / SKIPPED)
- The agent MUST NOT name specific test-management / evidence-capture / log-aggregation products in the plugin default; that is overlay territory
- The agent MUST record the environment identifier, build / commit, and feature-flag state per case
- The agent MUST NOT mark a case PASS when only some expected results were observed — partial-pass is FAIL
- The agent MUST NOT reuse prior evidence for a retest; capture fresh artifacts
fix-hat 3 Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
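The two closure rules above reduce to one decision: close only when nothing is outstanding. The sketch below is illustrative and does not call the workflow engine. `closure_decision` and its parameters are hypothetical, and the returned strings merely name the action the assessor would then take.

```python
def closure_decision(enumerated_items, is_resolved, open_blockers=()):
    """Check EVERY enumerated item, not only the ones the fix touched.

    is_resolved is a caller-supplied check that inspects the item on disk.
    open_blockers are unresolved issues the handoff message would mention.
    """
    survivors = [item for item in enumerated_items if not is_resolved(item)]
    outstanding = survivors + list(open_blockers)
    if outstanding:
        return ("reject_hat", outstanding)  # name exactly what is still open
    return ("advance_hat", [])


# Illustrative: the fixer touched rows 1 and 2, but row 3 is still stale.
decision, still_open = closure_decision(
    ["row-1", "row-2", "row-3"], lambda row: row != "row-3"
)
print(decision, still_open)
```

Because the untouched items pass through the same check as the fixed ones, a partial fix on an enumerated finding cannot be closed by accident.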
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing"; that is the human's decision, not yours. Either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched. Spot-check the untouched items on disk first; survivors mean reject_hat