Vendor Management · stage 2 of 5

Evaluate

Ask gate

Assess vendors and score against criteria

Score and shortlist vendor responses against the RFP's evaluation criteria. This stage takes the solicitation and scoring methodology from requirements and produces a comparative scorecard that negotiate uses to drive its counter-positions.

Scope

Comparative vendor assessment: scoring each response against the established criteria, documenting rationale, running TCO analysis, and producing a defensible ranking and shortlist. Evaluate decides which vendors advance and why — not what was asked for (requirements) or what terms get agreed (negotiate).

What to do

Score every vendor against the criteria requirements defined, applying them consistently across responses.
Document the rationale for each score so the shortlist would survive a stakeholder challenge.
Run TCO analysis that captures the real cost of ownership, not just headline price.
Ground technical assessments in verification, not vendor claims taken at face value.

What NOT to do

Don't change the evaluation criteria mid-scoring — a wrong criterion is a revisit to requirements.
Don't open negotiation or make commitments to vendors; that's negotiate.
Don't rank a vendor on an unverified technical claim.
Don't ship a shortlist whose ranking you can't justify from documented rationale.

How the engine runs this stage

1 · Elaborate

collaborative · plan the work, fan out discovery, declare outputs

Inputs consumed

rfp-document from Requirements

Discovery fan-out

knowledge artifact · Vendor Scorecard

Vendor evaluation results with scoring, technical assessment, and total cost of ownership analysis.

Content Guide

Structure the scorecard for decision-making:

Scoring summary -- each vendor scored against all RFP criteria with ranking
Detailed scores -- criterion-by-criterion scores with documented reasoning
Technical assessment -- proof-of-concept results, reference check findings, architecture fit
Total cost of ownership -- licensing, implementation, integration, training, and ongoing costs
Strengths and weaknesses -- key differentiators for each vendor
Recommendation -- shortlist for negotiation with rationale

Quality Signals

Scoring is consistent across vendors using the predefined methodology
Technical claims are validated through proof-of-concept, not just vendor demos
Total cost of ownership includes all direct and indirect costs
Recommendation is supported by the scoring data

Phase guidance

phase override · ELABORATION

Evaluate Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

"Vendor scorecard rates each vendor against every RFP criterion using the pre-defined scoring methodology"
"Technical evaluation includes proof-of-concept results, reference checks, and architecture compatibility assessment"
"Total cost of ownership analysis covers licensing, implementation, integration, training, and ongoing maintenance"

Bad criteria — vague (no clear check)

"Vendors are evaluated"
"Scores are calculated"
"Best vendor is identified"

Outputs produced

output template · Vendor Scorecard

Vendor ratings against RFP criteria with technical evaluation and TCO analysis.

Expected Artifacts

Vendor ratings -- each vendor scored against every RFP criterion using pre-defined methodology
Technical evaluation -- proof-of-concept results, reference checks, and architecture compatibility
Total cost of ownership -- licensing, implementation, integration, training, and ongoing maintenance
Comparative ranking -- vendors ranked with scoring consistency validated

Quality Signals

All vendors are rated against the same criteria using consistent methodology
Technical evaluations include proof-of-concept results and reference feedback
TCO covers all cost dimensions, not just licensing
Scoring consistency is validated across vendors

2 · Review

pre-execute · agents audit the planned spec before any code lands

review agent · Objectivity

Mandate: The agent MUST verify the vendor evaluation is objective, the scoring methodology was applied consistently across vendors, and the technical claims behind the scores survived independent verification. Subjective scoring with preferred outcomes is the #1 source of post-procurement regret.

Check

The agent MUST verify each item below and file feedback for any violation:

Methodology applied consistently — The same scoring scale, anchor points, and weights were applied to every vendor. No vendor was scored on a rubric that didn't apply to the others.
Mandatory gates applied before scoring — Vendors that failed a mandatory requirement are disqualified, not scored down. The disqualification reason is recorded.
Score rationale per cell — Every score has a one-line rationale citing the specific evidence used (response text section, reference customer call, POC result, certification). Scores without rationale are not auditable.
POC-backed technical claims — Where the technical reviewer ran a POC, the score reflects POC outcomes; where the reviewer flagged a claim as unsupported, the score has been revised or the disqualification recorded.
Reference checks beyond the vendor list — Reference contacts include at least one customer the vendor did not supply. Calls cite real, named, contactable customers — no anonymous attributions.
Total cost of ownership complete — TCO includes every component the methodology named (licensing, implementation, integration, training, ongoing operational, exit). Zero rows have a note explaining the zero.
Comparative differentiation explained — The ranking summary names the meaningful differences between top candidates (not just score deltas), so the user can decide on substance.
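
The reference-check item in the list above lends itself to a mechanical pre-check. The following is a minimal sketch, not part of the engine: `ReferenceCall` and `reference_check_problems` are hypothetical names, and an empty customer name is assumed to stand for an anonymous attribution.

```python
from dataclasses import dataclass


@dataclass
class ReferenceCall:
    customer: str          # named, contactable customer; empty means anonymous (assumption)
    vendor_supplied: bool  # True if the vendor provided this contact


def reference_check_problems(calls: list[ReferenceCall]) -> list[str]:
    """Return the reference-check violations found in one vendor's call list."""
    problems = []
    # Every cited reference must be a named customer.
    if any(not c.customer.strip() for c in calls):
        problems.append("anonymous attribution: every reference must be a named customer")
    # At least one contact must come from outside the vendor's own list.
    if not any(not c.vendor_supplied for c in calls):
        problems.append("no reference outside the vendor-supplied list")
    return problems
```

An empty result means this one check passed; it says nothing about the quality of the calls themselves.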

Common failure modes to look for

A scorecard whose cells are numbers without rationale
A vendor scored well on a capability category but no POC or reference evidence backs the score
A TCO column that omits a cost the methodology required, or a row with no note explaining a zero
A reference-check section that only cites vendor-provided contacts
Mid-evaluation criterion changes — weights, scale, or category definitions that drifted between the first and last vendor scored
Vendor-product-named scoring rubrics embedded in the plugin default (those belong in a project overlay)

3 · Execute

per-unit baton · Evaluator → Technical Reviewer → Verifier

hat 1 · Evaluator

Focus: Apply the RFP's pre-defined scoring methodology to every vendor response. You are the plan / do role of the evaluate stage. Your output is the comparative scorecard the negotiation stage will use to drive counter-positions, and the rationale that lets the organization audit the selection later. Consistency across vendors matters more than precision on any single score.

Process

1. Lock the methodology before scoring

Re-read the scoring methodology produced in the requirements stage. Do NOT modify it. If a methodology gap surfaces (e.g., a vendor response category the methodology doesn't cover), file feedback against the requirements stage instead of inventing an ad-hoc rule.

Confirm before scoring:

The mandatory gates (binary go / no-go) — apply these first; disqualified vendors don't enter scoring
The weighted categories and their weights (sum to 100)
The scoring scale and anchor points
The TCO components in scope
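
The confirmation list above can be expressed as a quick sanity check before any scoring starts. This is a sketch under assumptions: the `Methodology` container and its field names are hypothetical, and the real methodology comes from the requirements stage in whatever shape that stage produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Methodology:
    """Hypothetical container for what the requirements stage handed over."""
    mandatory_gates: tuple[str, ...]
    category_weights: dict[str, float]  # category name -> weight
    scale_anchors: dict[int, str]       # score value -> anchor description
    tco_components: tuple[str, ...]


def methodology_problems(m: Methodology) -> list[str]:
    """Return reasons the methodology is not ready to score against."""
    problems = []
    total = sum(m.category_weights.values())
    if abs(total - 100) > 1e-9:
        problems.append(f"category weights sum to {total:g}, expected 100")
    if len(m.scale_anchors) < 2:
        problems.append("scoring scale needs at least two anchor points")
    if not m.tco_components:
        problems.append("no TCO components in scope")
    return problems
```

Any problem reported here is a gap in the methodology itself, so the response is feedback against the requirements stage rather than a local patch.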

2. Apply mandatory gates first

For each vendor:

Walk the mandatory requirements one by one
For each, mark meets / fails / unclear
A "fails" on any mandatory requirement disqualifies the vendor from scoring
An "unclear" requires a follow-up question to the vendor before scoring proceeds (don't guess in favor of either side)

Document the gate outcomes per vendor in the scorecard. A vendor that passed gates moves to scoring; a vendor that failed has its disqualification reason recorded and is not scored.
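
The gate logic above can be sketched as follows. The three result values come from the process text; the function name, the outcome labels ("disqualified", "follow_up", "proceed"), and the choice to report failures ahead of unclear items are illustrative assumptions.

```python
from enum import Enum


class GateResult(Enum):
    MEETS = "meets"
    FAILS = "fails"
    UNCLEAR = "unclear"


def gate_outcome(results: dict[str, GateResult]) -> tuple[str, list[str]]:
    """Map one vendor's per-requirement gate results to an outcome.

    Returns (outcome, requirements that caused it). A single failure
    disqualifies; otherwise any unclear item blocks scoring until the
    vendor has answered a follow-up question.
    """
    failed = [req for req, r in results.items() if r is GateResult.FAILS]
    if failed:
        return "disqualified", failed  # the reasons to record in the scorecard
    unclear = [req for req, r in results.items() if r is GateResult.UNCLEAR]
    if unclear:
        return "follow_up", unclear
    return "proceed", []
```

The returned requirement list doubles as the recorded disqualification reason, so a disqualified vendor is never dropped silently.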

3. Score every requirement against the same scale

For each surviving vendor and each scored requirement:

Read the vendor's evidence (response text, reference customer, certification, demo notes, POC results if available)
Score against the anchor points of the rubric — don't invent intermediate values that aren't on the scale
Write a one-line rationale per score citing the specific evidence

The rationale is the contract. A score with no rationale is unscored — the methodology requires evidence-backed scoring, not gut feeling. If two evaluators score the same response differently, the rationales make the disagreement visible.
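
A minimal sketch of that contract, assuming a hypothetical `Score` record and a scale given as the set of values that have anchor points: a score is accepted only if its value is on the scale and its rationale is non-empty.

```python
from dataclasses import dataclass


@dataclass
class Score:
    vendor: str
    requirement: str
    value: int
    rationale: str  # one line citing the specific evidence used


def score_problems(score: Score, scale: set[int]) -> list[str]:
    """Return the reasons a single scorecard cell would count as unscored."""
    problems = []
    if score.value not in scale:
        problems.append(f"value {score.value} is not an anchor point on the scale")
    if not score.rationale.strip():
        problems.append("missing rationale")
    return problems
```

The check cannot judge whether the rationale is good, only that one exists; that judgment stays with the reviewers.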

4. Calculate total cost of ownership

TCO is one of the scored categories; calculate it explicitly and show the work:

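| Cost component | Year 1 | Year 2 | Year 3 | Notes |
| --- | --- | --- | --- | --- |
| Licensing / subscription | | | | |
| Implementation / professional services | | | | |
| Integration cost (internal + external) | | | | |
| Training | | | | |
| Ongoing operational / support | | | | |
| Exit / data migration estimate | | | | |
| Total | | | | |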

Show every component, even when zero. A blank cell is ambiguous; an explicit zero with a note is the contract.
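
As a sketch of the "explicit zero with a note" rule, the function below totals a three-year TCO and refuses rows that are missing or that are all zeros without an explanation. The component names are the ones listed in this section; the row shape `(year1, year2, year3, note)` and the function name are assumptions for illustration.

```python
TCO_COMPONENTS = (
    "Licensing / subscription",
    "Implementation / professional services",
    "Integration cost (internal + external)",
    "Training",
    "Ongoing operational / support",
    "Exit / data migration estimate",
)


def tco_totals(rows, components=TCO_COMPONENTS):
    """Total a TCO table given rows of component -> (year1, year2, year3, note).

    Returns ((year1_total, year2_total, year3_total), grand_total).
    Raises ValueError for a missing component or an unexplained zero row.
    """
    missing = [c for c in components if c not in rows]
    if missing:
        raise ValueError(f"missing TCO rows: {missing}")
    for name in components:
        y1, y2, y3, note = rows[name]
        if y1 == y2 == y3 == 0 and not note.strip():
            raise ValueError(f"zero row without a note: {name}")
    yearly = tuple(sum(rows[c][i] for c in components) for i in range(3))
    return yearly, sum(yearly)
```

Raising instead of defaulting a missing row to zero keeps an omitted cost from looking like a free one.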

5. Produce the comparative ranking

After every surviving vendor is scored:

Calculate the weighted total per vendor
Show the per-category subtotals (functional, technical / integration, operational, commercial, strategic) — these often differ even when totals are close, and the differences drive the shortlist decision
Write a comparative summary: top N candidates, the gaps that separate them, the risk profile differences, any vendor whose strengths concentrate in one category

A ranking with no differentiation analysis is not a ranking — it's a sorted list. Name the meaningful differences, not just the score deltas.
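
One way to compute the weighted totals and per-category subtotals is sketched below. The formula is an assumption, not something this stage prescribes: each category contributes its weight times the mean score in that category divided by the top of the scale, so a vendor with perfect scores totals 100. Use whatever formula the locked methodology actually defines.

```python
def category_subtotals(scores, weights, scale_max):
    """scores maps category -> list of requirement scores for one vendor."""
    return {
        cat: weights[cat] * (sum(vals) / len(vals)) / scale_max
        for cat, vals in scores.items()
    }


def rank_vendors(vendors, weights, scale_max):
    """Return (vendor, weighted_total, subtotals) tuples, best total first."""
    rows = []
    for name, scores in vendors.items():
        subtotals = category_subtotals(scores, weights, scale_max)
        rows.append((name, sum(subtotals.values()), subtotals))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative data only: two hypothetical vendors, two categories, a 0-5 scale.
weights = {"functional": 60, "commercial": 40}
vendors = {
    "Vendor A": {"functional": [5, 3], "commercial": [5]},
    "Vendor B": {"functional": [5, 5], "commercial": [2]},
}
ranking = rank_vendors(vendors, weights, scale_max=5)
```

In this made-up data Vendor A leads on total (88 against 76) while Vendor B leads the functional category (60 against 48). That is the kind of difference the comparative summary should spell out in words.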

6. Hand off to the technical reviewer

The scorecard plus rationale plus TCO plus comparative summary goes to the technical reviewer. The technical reviewer verifies that the scored capabilities survive hands-on verification (POC, reference checks, integration assessment) and either confirms the scoring or files findings naming the entries that didn't survive.

Anti-patterns (RFC 2119)

The agent MUST NOT change scoring criteria, weights, or scale mid-evaluation to favor any vendor.
The agent MUST NOT score based on vendor presentations or marketing collateral rather than the documented response evidence.
The agent MUST NOT score a requirement without a documented rationale citing the specific evidence used.
The agent MUST NOT skip TCO components — every component in the methodology gets a row, even when zero, with a note explaining the zero.
The agent MUST record the disqualification reason for any vendor that fails a mandatory gate; don't silently drop them.
The agent MUST NOT invent intermediate scoring values that aren't on the methodology's scale.
The agent MUST NOT name vendor products as preferred ahead of evaluation — the methodology is the only legitimate driver of the ranking.
The agent MUST NOT embed organization-specific scoring rubrics or named procurement systems — those belong in a project overlay.
The agent MUST show the work for every score — a sortable list with no rationale is not auditable.

hat 2 · Technical Reviewer

Focus: Verify the technical claims behind the evaluator's scores survive hands-on contact with reality — proof-of-concept testing, reference checks with actual customers, and architecture / integration compatibility assessment. You are the verify lens for the evaluate stage. A vendor that scored well on paper but fails a real POC, or whose references contradict the claimed capability, must surface here before the negotiation stage commits to terms.

Process

1. Read the evaluator's output

Read the scorecard, the per-score rationale, and the comparative ranking. Identify which entries are claim-based (vendor said so in the response) versus evidence-based (POC notes, named customer, documented architecture). Claim-based entries on the top-ranked vendors are your priority verification targets.
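
The triage described above can be sketched as a simple filter. The evidence labels and the entry shape (a dict with `vendor`, `requirement`, and `evidence` keys) are hypothetical; the scorecard may record evidence differently.

```python
# Evidence kinds treated as evidence-based; anything else counts as claim-based.
EVIDENCE_BASED = {"poc_notes", "named_customer", "documented_architecture"}


def priority_targets(entries, top_vendors):
    """Return (vendor, requirement) pairs that are claim-based on a top-ranked vendor."""
    return [
        (e["vendor"], e["requirement"])
        for e in entries
        if e["vendor"] in top_vendors and e["evidence"] not in EVIDENCE_BASED
    ]
```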

2. Design proof-of-concept evaluations

For the shortlisted vendors, design a POC that exercises the capabilities that drove their score. The POC is not a sales demo — the vendor's reps may participate, but the test must be designed and observed by the buying organization.

A useful POC includes:

A specific scenario derived from the organization's real workload (representative data shapes, realistic data volumes, the actual integration counterparties where possible)
Pass / fail criteria tied to specific scored requirements
Failure mode probes — what happens when input is malformed, when a counterparty is down, when the data volume exceeds a threshold
Performance measurement under realistic load, not synthetic best-case
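
Tying pass / fail criteria to scored requirements might look like the sketch below. The `PocCriterion` shape, the numeric thresholds, and the "not_measured" outcome are assumptions; the point is that every criterion names the requirement it backs and that an unmeasured criterion is never reported as a pass.

```python
from dataclasses import dataclass


@dataclass
class PocCriterion:
    requirement: str              # the scored requirement this criterion backs
    description: str
    threshold: float
    higher_is_better: bool = True


def poc_results(criteria, measurements):
    """Return requirement -> 'pass' | 'fail' | 'not_measured'."""
    results = {}
    for c in criteria:
        measured = measurements.get(c.requirement)
        if measured is None:
            results[c.requirement] = "not_measured"
        elif c.higher_is_better:
            results[c.requirement] = "pass" if measured >= c.threshold else "fail"
        else:
            results[c.requirement] = "pass" if measured <= c.threshold else "fail"
    return results
```

Measurements should come from the realistic-load run observed by the buying organization, not from figures quoted in a vendor demo.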

3. Conduct reference checks with non-curated customers

Vendor-provided references self-select. Call them, but also identify and contact reference customers the vendor did NOT supply — public case studies, industry-association directories, named partners on the vendor's public site, customers known to peers in your network.

Ask reference customers:

What does the vendor do well versus poorly in production?
What broke during onboarding that you didn't expect?
How does the vendor handle escalations, security incidents, and SLA misses?
What would you do differently if you were re-procuring?

4. Assess architecture and integration compatibility

Map the vendor's architecture against the organization's existing systems:

Identity / SSO / role-mapping fit
Data flow patterns (push / pull, batch / streaming, sync / async)
Failure-mode compatibility (what happens to the organization's system if the vendor is unavailable)
Operational fit (monitoring, alerting, runbooks, on-call coverage)

A vendor that scored well on paper but requires deep architectural rework to integrate carries hidden cost that should surface in TCO; file feedback against the evaluator if so.

5. File findings

For every claim that didn't survive verification, file a finding via haiku_feedback against the evaluator. Findings should name the specific score, the specific evidence that contradicted it, and the recommended adjustment (rescoring, disqualification, TCO update).

For claims that did survive, confirm the score stands. Your output is a per-vendor verification annotation on the scorecard, not a re-scoring of the whole thing.

Anti-patterns (RFC 2119)

The agent MUST NOT accept vendor demos as proof of capability without independent hands-on testing.
The agent MUST NOT contact only vendor-provided references — supplement with non-curated reference customers.
The agent MUST NOT evaluate technical capabilities in isolation from integration and operational fit.
The agent MUST NOT ignore performance under realistic load — synthetic best-case results don't predict production behavior.
The agent MUST NOT invent or attribute statements to unnamed reference customers — every cited reference is a real, named, contactable customer.
The agent MUST file feedback against the evaluator for any claim that didn't survive verification, naming the specific score and evidence.
The agent MUST NOT rescore the vendor — you flag, the evaluator rescores.
The agent MUST NOT introduce vendor-product-specific testing protocols — describe the POC shape generically and let the project overlay name the specific testing platform if one applies.

hat 3 · Verifier

Focus: Validate the per-unit vendor scorecard for the evaluate stage of vendor-management. Units here are vendor-comparison artifacts the negotiate stage uses to drive counter-positions. Validation rules check that every score has documented rationale, that the technical-reviewer's verification findings are reflected in the body, and that the ranking is internally consistent.

Anti-patterns (RFC 2119):

The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose; that is workflow engine territory per architecture §1.1.
The agent MUST NOT re-score vendors (that's the evaluator's role, already run) — verify scoring is methodologically consistent.
The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
The agent MUST NOT invent rules not in this mandate.
The agent MUST name a specific failed criterion in any rejection.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Every score has rationale

Each cell in the scorecard MUST cite the evaluation methodology + the specific evidence (response section, reference-check call, PoC measurement). Scores without rationale are unauditable downstream.

2. Technical-reviewer findings are captured

If the technical-reviewer flagged any score as not surviving hands-on verification, the unit body MUST reflect either an updated score OR a documented disagreement with the reviewer. Silent omission of reviewer findings is a reject.

3. Ranking follows from scores

The shortlist ranking MUST be derivable from the score totals + the documented tie-breaking rule. A ranking that doesn't follow from the scorecard is a reject.
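
"Derivable" can be checked mechanically. The sketch below assumes the documented tie-breaking rule reduces to one numeric key per vendor where higher wins (for example a technical subtotal); that is an illustrative assumption, and a different rule would need a different sort key.

```python
def ranking_is_derivable(ranking, totals, tie_break):
    """Check that a ranking follows from score totals plus a tie-break key.

    ranking:   vendor names in shortlisted order
    totals:    vendor -> weighted total
    tie_break: vendor -> tie-break value, higher wins (assumed rule)
    """
    expected = sorted(totals, key=lambda v: (-totals[v], -tie_break[v]))
    return list(ranking) == expected
```

A `False` result points at the specific criterion to name in the rejection: the ranking does not follow from the scorecard.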

4. Decision-register consistency

The unit body MUST NOT recommend a vendor whose category contradicts a Decision in the intent's register. Cite the Decision ID.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation).

4 · Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent · Objectivity

Mandate: The agent MUST verify the vendor evaluation is objective, the scoring methodology was applied consistently across vendors, and the technical claims behind the scores survived independent verification. Subjective scoring with preferred outcomes is the #1 source of post-procurement regret.

Check

The agent MUST verify each item below and file feedback for any violation:

Methodology applied consistently — The same scoring scale, anchor points, and weights were applied to every vendor. No vendor was scored on a rubric that didn't apply to the others.
Mandatory gates applied before scoring — Vendors that failed a mandatory requirement are disqualified, not scored down. The disqualification reason is recorded.
Score rationale per cell — Every score has a one-line rationale citing the specific evidence used (response text section, reference customer call, POC result, certification). Scores without rationale are not auditable.
POC-backed technical claims — Where the technical reviewer ran a POC, the score reflects POC outcomes; where the reviewer flagged a claim as unsupported, the score has been revised or the disqualification recorded.
Reference checks beyond the vendor list — Reference contacts include at least one customer the vendor did not supply. Calls cite real, named, contactable customers — no anonymous attributions.
Total cost of ownership complete — TCO includes every component the methodology named (licensing, implementation, integration, training, ongoing operational, exit). Zero rows have a note explaining the zero.
Comparative differentiation explained — The ranking summary names the meaningful differences between top candidates (not just score deltas), so the user can decide on substance.

Common failure modes to look for

A scorecard whose cells are numbers without rationale
A vendor scored well on a capability category but no POC or reference evidence backs the score
A TCO column that omits a cost the methodology required, or a row with no note explaining a zero
A reference-check section that only cites vendor-provided contacts
Mid-evaluation criterion changes — weights, scale, or category definitions that drifted between the first and last vendor scored
Vendor-product-named scoring rubrics embedded in the plugin default (those belong in a project overlay)

5 · Gate

controls advancement to the next stage

Ask

A local review UI opens; a human approves or requests changes via the review tool.

Fix loop

a separate track · Classifier → Evaluator → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1 · Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation: the tool returns severity_already_set and you simply advance. Only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
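
The default routing rule and the severity-first dispatch order described above can be modeled in a few lines. This is plain Python for illustration only; it does not call or reproduce the haiku_* tools, and the function names and the finding shape are hypothetical. The origin strings and severity levels are the ones named in this section.

```python
from typing import Optional

SEVERITY_ORDER = ("blocker", "high", "medium", "low")


def default_invalidates(origin: str, filer: Optional[str] = None) -> list[str]:
    """Rule-of-thumb approval roles to clear on closure, by feedback origin."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # the human re-reviews; drift always escalates to human
    if origin in {"adversarial-review", "studio-review"}:
        if filer is None:
            raise ValueError("reviewer-originated feedback needs the filer's agent name")
        return [filer]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings: list[dict]) -> list[dict]:
    """Sort findings so higher-severity ones are handled first (stable sort)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
```

These are defaults. The classifier still judges each finding on its actual impact and may route differently when the body calls for it.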

What you do NOT do

You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
You do NOT call haiku_feedback_reject, which marks the finding invalid; a valid finding must not be rejected. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6. That is an acknowledgement, not a rejection.)
You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2 · Evaluator

Focus: Apply the RFP's pre-defined scoring methodology to every vendor response. You are the plan / do role of the evaluate stage. Your output is the comparative scorecard the negotiation stage will use to drive counter-positions, and the rationale that lets the organization audit the selection later. Consistency across vendors matters more than precision on any single score.

Process

1. Lock the methodology before scoring

Confirm before scoring:

The mandatory gates (binary go / no-go) — apply these first; disqualified vendors don't enter scoring
The weighted categories and their weights (sum to 100)
The scoring scale and anchor points
The TCO components in scope

2. Apply mandatory gates first

For each vendor:

Walk the mandatory requirements one by one
For each, mark meets / fails / unclear
A "fails" on any mandatory requirement disqualifies the vendor from scoring
An "unclear" requires a follow-up question to the vendor before scoring proceeds (don't guess in favor of either side)

Document the gate outcomes per vendor in the scorecard. A vendor that passed gates moves to scoring; a vendor that failed has its disqualification reason recorded and is not scored.

3. Score every requirement against the same scale

For each surviving vendor and each scored requirement:

Read the vendor's evidence (response text, reference customer, certification, demo notes, POC results if available)
Score against the anchor points of the rubric — don't invent intermediate values that aren't on the scale
Write a one-line rationale per score citing the specific evidence

4. Calculate total cost of ownership

TCO is one of the scored categories; calculate it explicitly and show the work:

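| Cost component | Year 1 | Year 2 | Year 3 | Notes |
| --- | --- | --- | --- | --- |
| Licensing / subscription | | | | |
| Implementation / professional services | | | | |
| Integration cost (internal + external) | | | | |
| Training | | | | |
| Ongoing operational / support | | | | |
| Exit / data migration estimate | | | | |
| Total | | | | |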

Show every component, even when zero. A blank cell is ambiguous; an explicit zero with a note is the contract.

5. Produce the comparative ranking

After every surviving vendor is scored:

Calculate the weighted total per vendor
Show the per-category subtotals (functional, technical / integration, operational, commercial, strategic) — these often differ even when totals are close, and the differences drive the shortlist decision
Write a comparative summary: top N candidates, the gaps that separate them, the risk profile differences, any vendor whose strengths concentrate in one category

A ranking with no differentiation analysis is not a ranking — it's a sorted list. Name the meaningful differences, not just the score deltas.

6. Hand off to the technical reviewer

Anti-patterns (RFC 2119)

The agent MUST NOT change scoring criteria, weights, or scale mid-evaluation to favor any vendor.
The agent MUST NOT score based on vendor presentations or marketing collateral rather than the documented response evidence.
The agent MUST NOT score a requirement without a documented rationale citing the specific evidence used.
The agent MUST NOT skip TCO components — every component in the methodology gets a row, even when zero, with a note explaining the zero.
The agent MUST record the disqualification reason for any vendor that fails a mandatory gate; don't silently drop them.
The agent MUST NOT invent intermediate scoring values that aren't on the methodology's scale.
The agent MUST NOT name vendor products as preferred ahead of evaluation — the methodology is the only legitimate driver of the ranking.
The agent MUST NOT embed organization-specific scoring rubrics or named procurement systems — those belong in a project overlay.
The agent MUST show the work for every score — a sortable list with no rationale is not auditable.

fix-hat 3 · Feedback Assessor

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
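
The whole-set rule above amounts to checking every enumerated item, whether or not the fixer touched it. A minimal sketch, assuming a caller-supplied `is_resolved` predicate that inspects the item on disk; the function names and the returned decision labels are illustrative, not engine calls.

```python
from typing import Callable, Iterable


def survivors(enumerated: Iterable[str], is_resolved: Callable[[str], bool]) -> list[str]:
    """Return every enumerated item that is still defective."""
    return [item for item in enumerated if not is_resolved(item)]


def closure_decision(enumerated: Iterable[str], is_resolved: Callable[[str], bool]):
    """Decide between closing and rejecting, based on the full enumerated set."""
    still_open = survivors(enumerated, is_resolved)
    if still_open:
        return "reject_hat", still_open  # name the survivors in the rejection
    return "advance_hat", []
```

The fixer's list of touched items never appears as an input here. That is deliberate: the decision depends on the finding's enumerated set, not on what the fix claims to have covered.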

Anti-patterns (RFC 2119):

The agent MUST NOT edit any file — you are a verifier, not a fixer
The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat