Quality Assurance · stage 4 of 5

Analyze

Ask gate

Analyze test results and compute quality metrics

Analyze

Turn raw test results into actionable quality insight: defect density and distribution, pass rates, defect-pattern clusters, root-cause categorization, trend analysis against baselines, and a release / defer / block recommendation. Descriptive numbers alone aren't analysis — the value is in what the data means and what to do about it.

Scope

Interpretation of the results: patterns, root causes, statistical rigor, trends, and a defensible recommendation. Analyze decides what the test data means, not what happened during the run (execute-tests) or whether the product is signed off (certify).

What to do

Move past description to meaning — name the defect patterns, the likely root causes, and the actions they imply.
Hold the metric math to real statistical rigor: check sample sufficiency and apply significance/trend analysis where it applies.
Compare against historical baselines so a number reads as better, worse, or in line — not just a value in isolation.
Make the release / defer / block recommendation explicit and tie it to the evidence behind it.

What NOT to do

Don't sign off on release readiness against exit criteria — that's certify's call.
Don't re-run tests or re-author cases; gaps in the data are feedback to execute-tests or design-tests.
Don't present raw numbers as conclusions without saying what they mean.
Don't assert a root cause or trend the evidence doesn't support.

How the engine runs this stage

1Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Inputs consumed

test-resultsfrom Execute Tests test-strategyfrom Plan

Discovery fan-out

knowledge artifactQuality ReportQuality metrics, defect pattern analysis, and improvement recommendations.

Quality Report

Quality metrics, defect pattern analysis, and improvement recommendations.

Content Guide

Structure the report for quality decision-making:

Quality metrics -- defect density, severity distribution, pass rates, and coverage
Defect patterns -- clusters and categories with root cause analysis
Trend analysis -- comparison against historical baselines with significance assessment
Risk assessment -- unresolved defects mapped to business impact
Root cause analysis -- systemic quality issues grouped by category
Improvement recommendations -- prioritized actions for quality improvement

Quality Signals

Metrics are accurately computed with verified underlying data
Pattern analysis identifies systemic issues, not just individual defects
Trend comparisons control for scope and complexity differences
Recommendations are specific, actionable, and prioritized by impact

Phase guidance

phase overrideELABORATION- "Quality report includes defect density, severity distribution, and trend analysis compared to previous releases"

Analyze Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

"Quality report includes defect density, severity distribution, and trend analysis compared to previous releases"
"Root cause analysis groups defects into categories (design, code, environment, data) with distribution percentages"
"Risk assessment maps unresolved defects to business impact with recommendation for release, defer, or block"

Bad criteria — vague (no clear check)

"Results are analyzed"
"Metrics are computed"
"Quality is assessed"

Outputs produced

output templateQuality ReportTest result analysis with quality metrics, root cause categorization, and release recommendation.

Quality Report

Test result analysis with quality metrics, root cause categorization, and release recommendation.

Expected Artifacts

Quality metrics -- defect density, severity distribution, and trend analysis
Root cause analysis -- defects grouped by category (design, code, environment, data) with distribution
Risk assessment -- unresolved defects mapped to business impact
Release recommendation -- release, defer, or block with supporting rationale

Quality Signals

Defect density and severity distribution are computed with trend analysis
Root causes are categorized with distribution percentages
Unresolved defects are mapped to business impact
Release recommendation is evidence-based with clear rationale

2Review

pre-execute · agents audit the planned spec before any code lands

review agentInsightThe agent **MUST** verify the quality analysis produces actionable insight — not just metrics, not just description, not just noise. Every finding has evidence; every recommendation has a priority; every trend claim has rigor behind it.

Mandate: The agent MUST verify the quality analysis produces actionable insight — not just metrics, not just description, not just noise. Every finding has evidence; every recommendation has a priority; every trend claim has rigor behind it.

Check

The agent MUST verify, file feedback for any violation:

Metric integrity — Every percentage has explicit numerator and denominator. Every count has a defined scope (slice, area, severity band). Validation results from the statistician are recorded.
Pattern walk discipline — The analyst has walked at least two of the pattern lenses (code area, integration boundary, data class, environment dimension, state transition, regression-vs-new) and recorded the result, including "no cluster found" outcomes.
Root-cause distribution honesty — The distribution table is filled with the strategy's categories. Skews are surfaced as findings, not buried in raw counts.
Trend rigor — Every comparison against a baseline declares baseline comparability (scope / taxonomy / sampling) and effect-size vs noise-floor reasoning. Significance claims survive the noise check.
Actionability — Every finding has FINDING + EVIDENCE + SO WHAT + RECOMMENDATION + PRIORITY. Findings without a recommendation are descriptive-only.
Distribution awareness — Means and percentages that hide skew / bimodality / zero variation are flagged.
Release-blocking candidates named — The analyst identifies which findings are release-blocking based on the strategy's exit criteria, even though certify makes the final call.

Common failure modes to look for

A quality report that's just metrics, with no interpretation paragraph or recommendation
A pattern walk that only looked at "by area" and never tried "by boundary", "by data class", or "by environment"
A defect distribution that mirrors every prior release without anyone noticing
A trend claim ("quality is improving!") with one prior data point and no noise-floor reasoning
Recommendations like "improve code quality" with no specific action, owner, or scope
A statistical claim ("the pass rate is significantly higher") with no comparison to typical noise
An average pass rate that hides a 50%-failing area inside a 95%-passing aggregate
A baseline comparison where the prior release had different scope and the comparison didn't qualify it
Findings that don't tier release-blocking-vs-tolerable, leaving certify to invent the tiering
Recommendations that conflict with a recorded Decision without citing it

3Execute

per-unit baton · Analyst → Statistician → Verifier

hat 1AnalystRead the executed test results and surface what they mean — defect patterns, root-cause categorizations, areas of concentrated risk, and actionable recommendations. Numbers without interpretation are not analysis. The audience is whoever decides release / defer / block; they need the "so what."

Focus: Read the executed test results and surface what they mean — defect patterns, root-cause categorizations, areas of concentrated risk, and actionable recommendations. Numbers without interpretation are not analysis. The audience is whoever decides release / defer / block; they need the "so what."

You produce the analysis section for this unit. The statistician hat validates the math and adds rigor. The verifier validates substance.

Process

1. Read your inputs

The upstream test-results slice (results by case, defect entries with severity / category / root-cause hypothesis, execution-progress metrics, coverage-vs-exit-criteria)
The upstream test-strategy (severity bands, exit criteria, risk-prioritized areas)
Prior release's quality report, if available (for trend comparison)
Sibling units' analysis — keep categorization names, pattern-cluster labels, and recommendation taxonomy consistent

2. Compute the descriptive metrics

Start from the raw numbers, scoped to the slice:

Defect density — defects per case, defects per area, defects per category. Show numerator AND denominator.
Severity distribution — count by P0 / P1 / P2 / P3; show as both count and percentage
Pass rates — total, by area, by quality dimension, by severity of cases (failing P0 cases hurt more than failing P3)
Coverage — executed-vs-planned, by area, by dimension; unexecuted cases noted with reason

Tables and short narrative blocks beat prose paragraphs here. The next reader needs to scan and absorb.

3. Pattern clustering — find groupings, not just individuals

The single biggest mistake is treating each defect in isolation. Walk the defect set with these lenses:

By code area — multiple defects in the same module / file / component suggest a hotspot worth a closer look
By integration boundary — defects clustering at the same Service A ↔ Service B boundary suggest a contract issue
By data class — defects appearing only with certain data shapes (specific locales, certain account types, expired records) suggest validation gaps
By environment dimension — defects only on one browser / device / OS / locale suggest a compatibility issue
By state transition — defects clustering at the same state change suggest a state-machine gap
By regression vs new — regressions (worked previously, broke now) hit differently than net-new feature defects; separate them

Name each cluster with a stable label and list the defect IDs that compose it.

4. Root-cause categorization

Take every defect (clustered or standalone) and attach a root-cause category from the strategy's taxonomy (design / code / environment / data / integration / regression). Show the distribution:

Category	Count	% of total	Notable cluster
design	N	%	cluster label, if any
code	N	%
environment	N	%
data	N	%
integration	N	%
regression	N	%

A skew (e.g., 60% in code and 5% in design) is itself a finding — surface it. Equally, a defect distribution that looks like every release's says the team's quality system isn't moving — that's also a finding.

5. Trend analysis vs historical baseline (if available)

If a baseline exists:

Compare current-release severity distribution against baseline; note significant shifts
Compare defect-density per area against baseline; note hotspots that newly emerged or that resolved
Compare coverage against baseline; declining coverage with stable defect counts is a red flag

If the baseline doesn't exist or isn't comparable (scope shift, taxonomy change, sampling difference), say so. The statistician hat checks sample-size sufficiency; the analyst's job is to surface the comparison candidates and the gaps.

6. Recommendations — actionable and prioritized

Each finding earns a recommendation. Structure:

FINDING: <pattern / cluster / trend>
EVIDENCE: <metric / defect IDs / trend line>
SO WHAT: <impact on release readiness or future quality>
RECOMMENDATION: <action, by whom, at what scope (this release / next release / process-level)>
PRIORITY: <release-blocking / release-with-risk-acceptance / next-release / process-level>

The release / defer / block call lives in certify, but the analyst names the candidates: which findings, if any, are release-blocking based on the strategy's exit criteria, and which are tolerable-with-risk-acceptance.

7. Self-check before handing off

Every defect is categorized; the distribution table is filled
At least two pattern lenses (code area, boundary, data class, environment, state, regression-vs-new) have been walked, with the result recorded — even if the result is "no cluster found"
Every metric has explicit numerator and denominator
Every recommendation has FINDING + EVIDENCE + SO WHAT + RECOMMENDATION + PRIORITY
Trend comparison vs baseline is recorded OR the absence of a baseline is explicitly noted

Anti-patterns (RFC 2119)

The agent MUST NOT report metrics without analyzing what they mean — descriptive-only is not analysis
The agent MUST NOT treat each defect in isolation without walking the pattern lenses
The agent MUST NOT compute averages that mask important variation — show distribution, not just mean
The agent MUST NOT produce analysis that is descriptive but not actionable — every finding earns a recommendation
The agent MUST NOT make up trend numbers — if no baseline exists, say so
The agent MUST NOT introduce new severity / category taxonomy mid-analysis — match the strategy
The agent MUST NOT assert statistical significance without the statistician hat's validation — flag candidates and let rigor confirm
The agent MUST show explicit numerator and denominator for every percentage
The agent MUST NOT name specific analytics / BI products in the plugin default — overlay territory
The agent MUST cite the Decision ID when a recommendation implements or contradicts a recorded Decision

hat 2StatisticianValidate the analyst's metrics and trend claims with statistical rigor. Distinguish signal from noise. Check whether sample sizes support the conclusions. Apply significance reasoning to comparisons against baselines. Reject claims that the data does not support, in either direction — under-claiming a real trend is as harmful as over-claiming a fake one.

Focus: Validate the analyst's metrics and trend claims with statistical rigor. Distinguish signal from noise. Check whether sample sizes support the conclusions. Apply significance reasoning to comparisons against baselines. Reject claims that the data does not support, in either direction — under-claiming a real trend is as harmful as over-claiming a fake one.

You read the analyst's section. You add the statistical-validation section. You do not change the analyst's pattern interpretations — you assess whether the data supports them.

Process

1. Read your inputs

The analyst's findings, recommendations, and metric tables for this unit
The raw test-results (results by case, defect entries, execution-progress metrics)
Prior release baselines, if any
Sibling units' statistical sections — keep technique names and threshold conventions consistent

2. Validate the descriptive metrics

For each metric the analyst computed:

Verify the numerator and denominator — is the count correct against the raw data? Are excluded items (BLOCKED, SKIPPED) handled consistently?
Check the math — averages, percentages, distributions; flag arithmetic errors or unit mismatches
Confirm scope — does the metric's denominator actually match what the metric claims to describe? pass rate over executed-only is a different number than over planned; both are legitimate but they must be labeled

Record the validation result per metric: VALIDATED / CORRECTED <was, now> / FLAGGED <reason>.

3. Sample-size sufficiency

For every claim the analyst makes that depends on a count (pattern cluster, trend shift, severity-distribution skew), assess whether the sample is large enough to support the claim:

Pattern cluster — at least 3 defects sharing the signature, OR the cluster is high-severity (P0 / P1) where even one is meaningful
Trend shift — at least one prior baseline data point AND the shift exceeds typical run-to-run noise; smaller samples warrant a "directional, not confirmed" flag
Severity-distribution comparison — the prior baseline has comparable scope; otherwise the comparison is "scope-confounded"
Per-environment / per-dimension claims — at least one case per (environment × dimension) cell that's being compared, or the claim is "insufficient sampling"

Where the sample is insufficient, do NOT block the analyst's finding — flag it with the appropriate qualifier. Insufficient sample is often the real finding ("we don't have enough data to know whether this is a trend yet").

4. Baseline comparability

If the analyst compared against a prior release baseline:

Scope comparability — is the prior release's scope (features under test, areas exercised, depth of coverage) comparable to this one? Scope creep / shrinkage confounds direct comparison.
Taxonomy comparability — were severity bands, categories, and defect-state vocabularies the same? Re-labeling between releases means apples-to-oranges.
Sampling comparability — was the prior release's execution-vs-planned coverage comparable? Comparing a 100%-executed release to a 60%-executed one inflates the recent release's apparent quality.

Where any of these fail, mark the comparison "scope-confounded", "taxonomy-confounded", or "sampling-confounded". Do not delete the comparison; just qualify it.

5. Significance reasoning (lightweight, not academic)

The QA context doesn't usually need formal hypothesis testing, but it does need significance reasoning. For each trend claim:

Effect size — how much did the metric move (absolute, percentage)?
Noise floor — what's the typical run-to-run variation in this metric across recent baselines?
Decision — signal (effect ≫ noise), directional (effect ~ noise; suggestive but not confirmed), or noise (effect within typical variation)

A signal claim from the analyst that the noise floor doesn't support gets demoted to directional with a note explaining the noise context.

6. Distribution checks — beware the mean

Means and percentages hide variation. For any metric the analyst summarized as a single number:

If the underlying distribution is highly skewed (most cases at one value, a few outliers), name the distribution shape and surface the outliers
If the metric is bimodal (two clear groups), note it — the average is misleading
If the metric has zero variation across an axis the analyst implied differentiation on, surface it (e.g., "pass rate is 95% in every area; the area-by-area table doesn't actually differentiate")

7. Self-check before handing off

Every metric the analyst computed has a validation result (VALIDATED / CORRECTED / FLAGGED)
Every pattern cluster / trend / comparison has a sample-size assessment
Baseline comparability is explicit when comparisons exist
Effect-size vs noise-floor reasoning is recorded for trend claims
Where the data is insufficient, the gap is named — not suppressed

Anti-patterns (RFC 2119)

The agent MUST NOT present trends without enough data points to be meaningful — flag insufficient sampling instead of asserting
The agent MUST NOT draw conclusions from metrics without considering sample sizes
The agent MUST NOT compare releases without controlling for scope, taxonomy, or sampling differences — qualify the comparison instead
The agent MUST NOT use complex statistics when simple descriptive metrics would be more useful — rigor serves the audience, not the other way around
The agent MUST explicitly flag means / percentages that hide distribution skew, bimodality, or zero variation
The agent MUST NOT overwrite the analyst's pattern interpretations; flag what the data does and does not support
The agent MUST NOT fabricate baseline numbers — if no baseline exists, name it
The agent MUST NOT assert statistical significance language ("significant", "trend confirmed") unless the effect exceeds the noise floor by a reasoned margin
The agent MUST NOT name specific analytics products in the plugin default — overlay territory
The agent MUST keep the rigor proportionate to the QA decision being supported; this is not academic statistics

hat 3VerifierValidate the per-unit knowledge artifact for the analyze stage of quality-assurance. Units here are analysis finding — knowledge artifacts that downstream stages consume. Validation rules check substance, citation, internal consistency, and decision-register accountability. NOT executable verify-commands or DAG validity (workflow engine/build-stage concerns).

Focus: Validate the per-unit knowledge artifact for the analyze stage of quality-assurance. Units here are analysis finding — knowledge artifacts that downstream stages consume. Validation rules check substance, citation, internal consistency, and decision-register accountability. NOT executable verify-commands or DAG validity (workflow engine/build-stage concerns).

Anti-patterns (RFC 2119):

The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. workflow engine territory per architecture §1.1.
The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
The agent MUST name a specific failed criterion in any rejection.
The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Artifact answers its topic

The unit's title and first paragraph define the topic. The remaining body MUST deliver substantive content on that topic. Reject placeholders, content-free outlines, or redirects.

2. Sources cited

Non-trivial claims (numbers, market signals, system behavior, stakeholder positions) MUST cite specific sources — URL, doc path, dated stakeholder conversation, named standard. Reject "industry common knowledge" or unsourced numerical claims.

3. Internal consistency

Title, mission, and body must align. Numerical/categorical claims must be consistent across the body. Recommendations must follow from the evidence presented.

4. Decision-register consistency

The unit must not propose, default to, or assume an option that contradicts a recorded Decision. Cite the Decision ID in any rejection.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted with veto-style approval, OR flagged (needs human escalation).

4Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agentInsightThe agent **MUST** verify the quality analysis produces actionable insight — not just metrics, not just description, not just noise. Every finding has evidence; every recommendation has a priority; every trend claim has rigor behind it.

Check

The agent MUST verify, file feedback for any violation:

Metric integrity — Every percentage has explicit numerator and denominator. Every count has a defined scope (slice, area, severity band). Validation results from the statistician are recorded.
Pattern walk discipline — The analyst has walked at least two of the pattern lenses (code area, integration boundary, data class, environment dimension, state transition, regression-vs-new) and recorded the result, including "no cluster found" outcomes.
Root-cause distribution honesty — The distribution table is filled with the strategy's categories. Skews are surfaced as findings, not buried in raw counts.
Trend rigor — Every comparison against a baseline declares baseline comparability (scope / taxonomy / sampling) and effect-size vs noise-floor reasoning. Significance claims survive the noise check.
Actionability — Every finding has FINDING + EVIDENCE + SO WHAT + RECOMMENDATION + PRIORITY. Findings without a recommendation are descriptive-only.
Distribution awareness — Means and percentages that hide skew / bimodality / zero variation are flagged.
Release-blocking candidates named — The analyst identifies which findings are release-blocking based on the strategy's exit criteria, even though certify makes the final call.

Common failure modes to look for

A quality report that's just metrics, with no interpretation paragraph or recommendation
A pattern walk that only looked at "by area" and never tried "by boundary", "by data class", or "by environment"
A defect distribution that mirrors every prior release without anyone noticing
A trend claim ("quality is improving!") with one prior data point and no noise-floor reasoning
Recommendations like "improve code quality" with no specific action, owner, or scope
A statistical claim ("the pass rate is significantly higher") with no comparison to typical noise
An average pass rate that hides a 50%-failing area inside a 95%-passing aggregate
A baseline comparison where the prior release had different scope and the comparison didn't qualify it
Findings that don't tier release-blocking-vs-tolerable, leaving certify to invent the tiering
Recommendations that conflict with a recorded Decision without citing it

5Gate

controls advancement to the next stage

Ask

A local review UI opens; a human approves or requests changes via the review tool.

Fix loop

a separate track · Classifier → Analyst → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1ClassifierYou are the **classifier** hat. You run as the FIRST hat in the stage's

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
Read the stage's unit list via haiku_unit_list { intent, stage }.
Decide:
- target_unit — which unit this FB counter-signals.
  - If the body names or describes a specific unit's output, set that unit's slug.
  - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
  - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
- target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
  - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
  - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
  - drift origin → ["user"] (drift always escalates to human).
  - agent origin → [] (informational; no rerun).
Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.
Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
- blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
- high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
- medium — a genuine issue worth fixing; not delivery-blocking.
- low — a nit, polish, or nice-to-have.
Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.
Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.

What you do NOT do

You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
You do NOT call haiku_feedback_reject — that marks the finding invalid. A valid finding you can't reject. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2AnalystRead the executed test results and surface what they mean — defect patterns, root-cause categorizations, areas of concentrated risk, and actionable recommendations. Numbers without interpretation are not analysis. The audience is whoever decides release / defer / block; they need the "so what."

You produce the analysis section for this unit. The statistician hat validates the math and adds rigor. The verifier validates substance.

Process

1. Read your inputs

The upstream test-results slice (results by case, defect entries with severity / category / root-cause hypothesis, execution-progress metrics, coverage-vs-exit-criteria)
The upstream test-strategy (severity bands, exit criteria, risk-prioritized areas)
Prior release's quality report, if available (for trend comparison)
Sibling units' analysis — keep categorization names, pattern-cluster labels, and recommendation taxonomy consistent

2. Compute the descriptive metrics

Start from the raw numbers, scoped to the slice:

Defect density — defects per case, defects per area, defects per category. Show numerator AND denominator.
Severity distribution — count by P0 / P1 / P2 / P3; show as both count and percentage
Pass rates — total, by area, by quality dimension, by severity of cases (failing P0 cases hurt more than failing P3)
Coverage — executed-vs-planned, by area, by dimension; unexecuted cases noted with reason

Tables and short narrative blocks beat prose paragraphs here. The next reader needs to scan and absorb.

3. Pattern clustering — find groupings, not just individuals

The single biggest mistake is treating each defect in isolation. Walk the defect set with these lenses:

By code area — multiple defects in the same module / file / component suggest a hotspot worth a closer look
By integration boundary — defects clustering at the same Service A ↔ Service B boundary suggest a contract issue
By data class — defects appearing only with certain data shapes (specific locales, certain account types, expired records) suggest validation gaps
By environment dimension — defects only on one browser / device / OS / locale suggest a compatibility issue
By state transition — defects clustering at the same state change suggest a state-machine gap
By regression vs new — regressions (worked previously, broke now) hit differently than net-new feature defects; separate them

Name each cluster with a stable label and list the defect IDs that compose it.

4. Root-cause categorization

Take every defect (clustered or standalone) and attach a root-cause category from the strategy's taxonomy (design / code / environment / data / integration / regression). Show the distribution:

Category	Count	% of total	Notable cluster
design	N	%	cluster label, if any
code	N	%
environment	N	%
data	N	%
integration	N	%
regression	N	%

5. Trend analysis vs historical baseline (if available)

If a baseline exists:

Compare current-release severity distribution against baseline; note significant shifts
Compare defect-density per area against baseline; note hotspots that newly emerged or that resolved
Compare coverage against baseline; declining coverage with stable defect counts is a red flag

6. Recommendations — actionable and prioritized

Each finding earns a recommendation. Structure:

FINDING: <pattern / cluster / trend>
EVIDENCE: <metric / defect IDs / trend line>
SO WHAT: <impact on release readiness or future quality>
RECOMMENDATION: <action, by whom, at what scope (this release / next release / process-level)>
PRIORITY: <release-blocking / release-with-risk-acceptance / next-release / process-level>

7. Self-check before handing off

Every defect is categorized; the distribution table is filled
At least two pattern lenses (code area, boundary, data class, environment, state, regression-vs-new) have been walked, with the result recorded — even if the result is "no cluster found"
Every metric has explicit numerator and denominator
Every recommendation has FINDING + EVIDENCE + SO WHAT + RECOMMENDATION + PRIORITY
Trend comparison vs baseline is recorded OR the absence of a baseline is explicitly noted

Anti-patterns (RFC 2119)

The agent MUST NOT report metrics without analyzing what they mean — descriptive-only is not analysis
The agent MUST NOT treat each defect in isolation without walking the pattern lenses
The agent MUST NOT compute averages that mask important variation — show distribution, not just mean
The agent MUST NOT produce analysis that is descriptive but not actionable — every finding earns a recommendation
The agent MUST NOT make up trend numbers — if no baseline exists, say so
The agent MUST NOT introduce new severity / category taxonomy mid-analysis — match the strategy
The agent MUST NOT assert statistical significance without the statistician hat's validation — flag candidates and let rigor confirm
The agent MUST show explicit numerator and denominator for every percentage
The agent MUST NOT name specific analytics / BI products in the plugin default — overlay territory
The agent MUST cite the Decision ID when a recommendation implements or contradicts a recorded Decision

fix-hat 3Feedback AssessorIndependently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.

Anti-patterns (RFC 2119):

The agent MUST NOT edit any file — you are a verifier, not a fixer
The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat