Evaluate
Ask gate
Assess vendors and score against criteria
Score and shortlist vendor responses against the RFP's evaluation criteria. This stage takes the solicitation and scoring methodology from requirements and produces a comparative scorecard that negotiate uses to drive its counter-positions.
Scope
Comparative vendor assessment: scoring each response against the established criteria, documenting rationale, running TCO analysis, and producing a defensible ranking and shortlist. Evaluate decides which vendors advance and why — not what was asked for (requirements) or what terms get agreed (negotiate).
What to do
- Score every vendor against the criteria requirements defined, applying them consistently across responses.
- Document the rationale for each score so the shortlist would survive a stakeholder challenge.
- Run TCO analysis that captures the real cost of ownership, not just headline price.
- Ground technical assessments in verification, not vendor claims taken at face value.
What NOT to do
- Don't change the evaluation criteria mid-scoring — a wrong criterion is a revisit to requirements.
- Don't open negotiation or make commitments to vendors; that's negotiate.
- Don't rank a vendor on an unverified technical claim.
- Don't ship a shortlist whose ranking you can't justify from documented rationale.
How the engine runs this stage
1. Elaborate
collaborative · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact
Vendor Scorecard
Vendor evaluation results with scoring, technical assessment, and total cost of ownership analysis.
Content Guide
Structure the scorecard for decision-making:
- Scoring summary -- each vendor scored against all RFP criteria with ranking
- Detailed scores -- criterion-by-criterion scores with documented reasoning
- Technical assessment -- proof-of-concept results, reference check findings, architecture fit
- Total cost of ownership -- licensing, implementation, integration, training, and ongoing costs
- Strengths and weaknesses -- key differentiators for each vendor
- Recommendation -- shortlist for negotiation with rationale
Quality Signals
- Scoring is consistent across vendors using the predefined methodology
- Technical claims are validated through proof-of-concept, not just vendor demos
- Total cost of ownership includes all direct and indirect costs
- Recommendation is supported by the scoring data
Phase guidance
phase override: ELABORATION
Evaluate Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Vendor scorecard rates each vendor against every RFP criterion using the pre-defined scoring methodology"
- "Technical evaluation includes proof-of-concept results, reference checks, and architecture compatibility assessment"
- "Total cost of ownership analysis covers licensing, implementation, integration, training, and ongoing maintenance"
Bad criteria — vague (no clear check)
- "Vendors are evaluated"
- "Scores are calculated"
- "Best vendor is identified"
Outputs produced
output template
Vendor Scorecard
Vendor ratings against RFP criteria with technical evaluation and TCO analysis.
Expected Artifacts
- Vendor ratings -- each vendor scored against every RFP criterion using pre-defined methodology
- Technical evaluation -- proof-of-concept results, reference checks, and architecture compatibility
- Total cost of ownership -- licensing, implementation, integration, training, and ongoing maintenance
- Comparative ranking -- vendors ranked with scoring consistency validated
Quality Signals
- All vendors are rated against the same criteria using consistent methodology
- Technical evaluations include proof-of-concept results and reference feedback
- TCO covers all cost dimensions, not just licensing
- Scoring consistency is validated across vendors
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Objectivity
Mandate: The agent MUST verify the vendor evaluation is objective, the scoring methodology was applied consistently across vendors, and the technical claims behind the scores survived independent verification. Subjective scoring with preferred outcomes is the #1 source of post-procurement regret.
Check
The agent MUST verify each of the following and file feedback for any violation:
- Methodology applied consistently — The same scoring scale, anchor points, and weights were applied to every vendor. No vendor was scored on a rubric that didn't apply to the others.
- Mandatory gates applied before scoring — Vendors that failed a mandatory requirement are disqualified, not scored down. The disqualification reason is recorded.
- Score rationale per cell — Every score has a one-line rationale citing the specific evidence used (response text section, reference customer call, POC result, certification). Scores without rationale are not auditable.
- POC-backed technical claims — Where the technical reviewer ran a POC, the score reflects POC outcomes; where the reviewer flagged a claim as unsupported, the score has been revised or the disqualification recorded.
- Reference checks beyond the vendor list — Reference contacts include at least one customer the vendor did not supply. Calls cite real, named, contactable customers — no anonymous attributions.
- Total cost of ownership complete — TCO includes every component the methodology named (licensing, implementation, integration, training, ongoing operational, exit). Zero rows have a note explaining the zero.
- Comparative differentiation explained — The ranking summary names the meaningful differences between top candidates (not just score deltas), so the user can decide on substance.
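The reference-check item above is mechanical enough to sketch in code. The following is a minimal illustration, not part of the engine: the `Reference` record and its field names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    customer: str          # named customer; empty string means anonymous
    contact: str           # how the customer can be reached
    vendor_supplied: bool  # True if the vendor provided this reference


def reference_problems(refs):
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    # At least one reference must come from outside the vendor's own list.
    if not any(not r.vendor_supplied for r in refs):
        problems.append("no reference outside the vendor-supplied list")
    # Every cited reference must be named and contactable.
    for r in refs:
        if not r.customer.strip() or not r.contact.strip():
            problems.append("anonymous or uncontactable reference")
    return problems
```

A reviewer could run this per vendor and file one feedback item per returned problem.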
Common failure modes to look for
- A scorecard whose cells are numbers without rationale
- A vendor scored well on a capability category but no POC or reference evidence backs the score
- A TCO column that omits a cost the methodology required, or a row with no note explaining a zero
- A reference-check section that only cites vendor-provided contacts
- Mid-evaluation criterion changes — weights, scale, or category definitions that drifted between the first and last vendor scored
- Vendor-product-named scoring rubrics embedded in the plugin default (those belong in a project overlay)
3. Execute
per-unit baton · Evaluator → Technical Reviewer → Verifier
hat 1: Evaluator
Focus: Apply the RFP's pre-defined scoring methodology to every vendor response. You are the plan / do role of the evaluate stage. Your output is the comparative scorecard the negotiation stage will use to drive counter-positions, and the rationale that lets the organization audit the selection later. Consistency across vendors matters more than precision on any single score.
Process
1. Lock the methodology before scoring
Re-read the scoring methodology produced in the requirements stage. Do NOT modify it. If a methodology gap surfaces (e.g., a vendor response category the methodology doesn't cover), file feedback against the requirements stage instead of inventing an ad-hoc rule.
Confirm before scoring:
- The mandatory gates (binary go / no-go) — apply these first; disqualified vendors don't enter scoring
- The weighted categories and their weights (sum to 100)
- The scoring scale and anchor points
- The TCO components in scope
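A pre-scoring check along these lines can confirm the methodology is complete before any vendor is scored. This is a sketch under assumptions: the `Methodology` structure and its field names are hypothetical, and the real methodology comes from the requirements stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Methodology:
    mandatory_gates: tuple   # names of the binary go / no-go requirements
    weights: dict            # category -> weight, expected to sum to 100
    scale: tuple             # permitted anchor-point scores
    tco_components: tuple    # cost components in scope


def methodology_problems(m):
    """Return human-readable problems; an empty list means scoring can start."""
    problems = []
    total = sum(m.weights.values())
    if abs(total - 100) > 1e-9:
        problems.append(f"weights sum to {total}, expected 100")
    if not m.scale:
        problems.append("scoring scale has no anchor points")
    if not m.tco_components:
        problems.append("no TCO components in scope")
    return problems
```

Any reported problem is a gap to file against the requirements stage, not something to patch locally.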
2. Apply mandatory gates first
For each vendor:
- Walk the mandatory requirements one by one
- For each, mark "meets" / "fails" / "unclear"
- A "fails" on any mandatory requirement disqualifies the vendor from scoring
- An "unclear" requires a follow-up question to the vendor before scoring proceeds (don't guess in favor of either side)
Document the gate outcomes per vendor in the scorecard. A vendor that passed gates moves to scoring; a vendor that failed has its disqualification reason recorded and is not scored.
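The gate rules reduce to a small decision function. The sketch below assumes a per-vendor mapping from requirement name to result; the names `GateResult` and `gate_outcome` and the status strings are illustrative, not engine identifiers.

```python
from enum import Enum


class GateResult(Enum):
    MEETS = "meets"
    FAILS = "fails"
    UNCLEAR = "unclear"


def gate_outcome(results):
    """Map requirement -> GateResult to a (status, requirements) pair.

    status is 'disqualified', 'follow_up', or 'proceed'. The returned list
    names the requirements that caused the status, so the reason is recorded.
    """
    failed = [req for req, r in results.items() if r is GateResult.FAILS]
    if failed:
        return "disqualified", failed
    unclear = [req for req, r in results.items() if r is GateResult.UNCLEAR]
    if unclear:
        return "follow_up", unclear
    return "proceed", []
```

A failure takes precedence over an unclear result, because a disqualified vendor does not need the follow-up question answered.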
3. Score every requirement against the same scale
For each surviving vendor and each scored requirement:
- Read the vendor's evidence (response text, reference customer, certification, demo notes, POC results if available)
- Score against the anchor points of the rubric — don't invent intermediate values that aren't on the scale
- Write a one-line rationale per score citing the specific evidence
The rationale is the contract. A score with no rationale is unscored — the methodology requires evidence-backed scoring, not gut feeling. If two evaluators score the same response differently, the rationales make the disagreement visible.
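One way to enforce both rules (on-scale values, mandatory rationale) is to validate each score record as it is written. This is a minimal sketch; the anchor points `(0, 1, 3, 5)` are an assumed example scale, and the `Score` record is hypothetical.

```python
from dataclasses import dataclass

# Assumed example scale; the real anchor points come from the requirements stage.
ANCHOR_POINTS = (0, 1, 3, 5)


@dataclass
class Score:
    vendor: str
    requirement: str
    value: int
    rationale: str  # one line citing the specific evidence


def score_problems(score, anchors=ANCHOR_POINTS):
    """Return problems with a single score cell; empty list means it is valid."""
    problems = []
    if score.value not in anchors:
        problems.append(f"{score.value} is not an anchor point on the scale {anchors}")
    if not score.rationale.strip():
        problems.append("missing rationale: treat this cell as unscored")
    return problems
```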
4. Calculate total cost of ownership
TCO is one of the scored categories; calculate it explicitly and show the work:
| Cost component | Year 1 | Year 2 | Year 3 | Notes |
|---|---|---|---|---|
| Licensing / subscription | | | | |
| Implementation / professional services | | | | |
| Integration cost (internal + external) | | | | |
| Training | | | | |
| Ongoing operational / support | | | | |
| Exit / data migration estimate | | | | |
| Total | | | | |
Show every component, even when zero. A blank cell is ambiguous; an explicit zero with a note is the contract.
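The "every row present, every zero explained" rule can be checked while summing the table. The sketch below is illustrative only: the component keys are shorthand for the table rows above, and the rule that any zero cell needs a note is one reading of the guidance.

```python
# Shorthand keys for the six table rows; illustrative names, not engine identifiers.
TCO_COMPONENTS = ("licensing", "implementation", "integration",
                  "training", "operations", "exit")


def tco_totals(rows, components=TCO_COMPONENTS):
    """rows maps component -> (year1, year2, year3, note).

    Returns per-year totals as a tuple. Raises ValueError if a component
    row is missing or a zero appears without an explanatory note.
    """
    missing = [c for c in components if c not in rows]
    if missing:
        raise ValueError(f"missing TCO rows: {missing}")
    totals = [0, 0, 0]
    for component in components:
        *years, note = rows[component]
        if 0 in years and not note.strip():
            raise ValueError(f"zero in '{component}' needs a note explaining it")
        totals = [t + y for t, y in zip(totals, years)]
    return tuple(totals)
```

Raising on a missing row or unexplained zero keeps an incomplete table from producing a plausible-looking total.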
5. Produce the comparative ranking
After every surviving vendor is scored:
- Calculate the weighted total per vendor
- Show the per-category subtotals (functional, technical / integration, operational, commercial, strategic) — these often differ even when totals are close, and the differences drive the shortlist decision
- Write a comparative summary: top N candidates, the gaps that separate them, the risk profile differences, any vendor whose strengths concentrate in one category
A ranking with no differentiation analysis is not a ranking — it's a sorted list. Name the meaningful differences, not just the score deltas.
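The weighted total and per-category subtotals can be computed together so both appear in the scorecard. This sketch assumes one common convention that the source does not specify: each category's mean score is normalized by the scale maximum and multiplied by the category weight, giving a 0 to 100 total.

```python
from collections import defaultdict


def weighted_scores(scores, weights, max_score):
    """scores: iterable of (category, value) pairs for one vendor.

    Returns (subtotals, total). Assumes weights sum to 100 and that each
    category contributes weight * mean(score) / max_score.
    """
    by_category = defaultdict(list)
    for category, value in scores:
        by_category[category].append(value)
    unscored = [c for c in weights if c not in by_category]
    if unscored:
        raise ValueError(f"categories with no scores: {unscored}")
    subtotals = {
        category: weight * (sum(by_category[category]) / len(by_category[category])) / max_score
        for category, weight in weights.items()
    }
    return subtotals, sum(subtotals.values())
```

Keeping the subtotals alongside the total makes it easy to show where two closely ranked vendors actually differ.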
6. Hand off to the technical reviewer
The scorecard plus rationale plus TCO plus comparative summary goes to the technical reviewer. The technical reviewer verifies that the scored capabilities survive hands-on verification (POC, reference checks, integration assessment) and either confirms the scoring or files findings naming the entries that didn't survive.
Anti-patterns (RFC 2119)
- The agent MUST NOT change scoring criteria, weights, or scale mid-evaluation to favor any vendor.
- The agent MUST NOT score based on vendor presentations or marketing collateral rather than the documented response evidence.
- The agent MUST NOT score a requirement without a documented rationale citing the specific evidence used.
- The agent MUST NOT skip TCO components — every component in the methodology gets a row, even when zero, with a note explaining the zero.
- The agent MUST record the disqualification reason for any vendor that fails a mandatory gate; don't silently drop them.
- The agent MUST NOT invent intermediate scoring values that aren't on the methodology's scale.
- The agent MUST NOT name vendor products as preferred ahead of evaluation — the methodology is the only legitimate driver of the ranking.
- The agent MUST NOT embed organization-specific scoring rubrics or named procurement systems — those belong in a project overlay.
- The agent MUST show the work for every score — a sortable list with no rationale is not auditable.
hat 2: Technical Reviewer
Focus: Verify the technical claims behind the evaluator's scores survive hands-on contact with reality — proof-of-concept testing, reference checks with actual customers, and architecture / integration compatibility assessment. You are the verify lens for the evaluate stage. A vendor that scored well on paper but fails a real POC, or whose references contradict the claimed capability, must surface here before the negotiation stage commits to terms.
Process
1. Read the evaluator's output
Read the scorecard, the per-score rationale, and the comparative ranking. Identify which entries are claim-based (vendor said so in the response) versus evidence-based (POC notes, named customer, documented architecture). Claim-based entries on the top-ranked vendors are your priority verification targets.
2. Design proof-of-concept evaluations
For the shortlisted vendors, design a POC that exercises the capabilities that drove their score. The POC is not a sales demo — the vendor's reps may participate, but the test must be designed and observed by the buying organization.
A useful POC includes:
- A specific scenario derived from the organization's real workload (representative data shapes, realistic data volumes, the actual integration counterparties where possible)
- Pass / fail criteria tied to specific scored requirements
- Failure mode probes — what happens when input is malformed, when a counterparty is down, when the data volume exceeds a threshold
- Performance measurement under realistic load, not synthetic best-case
3. Conduct reference checks with non-curated customers
Vendor-provided references self-select. Call them, but also identify and contact reference customers the vendor did NOT supply — public case studies, industry-association directories, named partners on the vendor's public site, customers known to peers in your network.
Ask reference customers:
- What does the vendor do well versus poorly in production?
- What broke during onboarding that you didn't expect?
- How does the vendor handle escalations, security incidents, and SLA misses?
- What would you do differently if you were re-procuring?
4. Assess architecture and integration compatibility
Map the vendor's architecture against the organization's existing systems:
- Identity / SSO / role-mapping fit
- Data flow patterns (push / pull, batch / streaming, sync / async)
- Failure-mode compatibility (what happens to the organization's system if the vendor is unavailable)
- Operational fit (monitoring, alerting, runbooks, on-call coverage)
A vendor that scored well on paper but requires deep architectural rework to integrate carries hidden cost that should surface in TCO; file feedback against the evaluator if so.
5. File findings
For every claim that didn't survive verification, file a finding via haiku_feedback against the evaluator. Findings should name the specific score, the specific evidence that contradicted it, and the recommended adjustment (rescoring, disqualification, TCO update).
For claims that did survive, confirm the score stands. Your output is a per-vendor verification annotation on the scorecard, not a re-scoring of the whole thing.
Anti-patterns (RFC 2119)
- The agent MUST NOT accept vendor demos as proof of capability without independent hands-on testing.
- The agent MUST NOT contact only vendor-provided references — supplement with non-curated reference customers.
- The agent MUST NOT evaluate technical capabilities in isolation from integration and operational fit.
- The agent MUST NOT ignore performance under realistic load — synthetic best-case results don't predict production behavior.
- The agent MUST NOT invent or attribute statements to unnamed reference customers — every cited reference is a real, named, contactable customer.
- The agent MUST file feedback against the evaluator for any claim that didn't survive verification, naming the specific score and evidence.
- The agent MUST NOT rescore the vendor — you flag, the evaluator rescores.
- The agent MUST NOT introduce vendor-product-specific testing protocols — describe the POC shape generically and let the project overlay name the specific testing platform if one applies.
hat 3: Verifier
Focus: Validate the per-unit vendor scorecard for the evaluate stage of vendor-management. Units here are vendor-comparison artifacts the negotiate stage uses to drive counter-positions. Validation rules check that every score has documented rationale, that the technical-reviewer's verification findings are reflected in the body, and that the ranking is internally consistent.
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
- The agent MUST NOT re-score vendors (that's the evaluator's role, already run) — verify scoring is methodologically consistent.
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST NOT invent rules not in this mandate.
- The agent MUST name a specific failed criterion in any rejection.
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
What you check (BODY ONLY)
1. Every score has rationale
Each cell in the scorecard MUST cite the evaluation methodology + the specific evidence (response section, reference-check call, PoC measurement). Scores without rationale are unauditable downstream.
2. Technical-reviewer findings are captured
If the technical-reviewer flagged any score as not surviving hands-on verification, the unit body MUST reflect either an updated score OR a documented disagreement with the reviewer. Silent omission of reviewer findings is a reject.
3. Ranking follows from scores
The shortlist ranking MUST be derivable from the score totals + the documented tie-breaking rule. A ranking that doesn't follow from the scorecard is a reject.
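"Derivable" can be tested by recomputing the ranking and comparing it with the declared one. The sketch below assumes the documented tie-breaking rule can be expressed as a numeric secondary key per vendor (for example, a category subtotal); that shape is an assumption for illustration.

```python
def derive_ranking(totals, tie_break):
    """totals: vendor -> weighted total; tie_break: vendor -> secondary key.

    Higher values rank first on both keys. Vendors tied on both keys keep
    their input order, which signals that the tie-breaking rule is incomplete.
    """
    return sorted(totals, key=lambda v: (-totals[v], -tie_break[v]))


def ranking_is_derivable(declared, totals, tie_break):
    """True when the declared shortlist order matches the recomputed order."""
    return list(declared) == derive_ranking(totals, tie_break)
```

A mismatch is a reject that names this criterion, not a cue to re-score.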
4. Decision-register consistency
The unit body MUST NOT recommend a vendor whose category contradicts a Decision in the intent's register. Cite the Decision ID.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation).
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here, now auditing the code that landed rather than the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Objectivity
Mandate: The agent MUST verify the vendor evaluation is objective, the scoring methodology was applied consistently across vendors, and the technical claims behind the scores survived independent verification. Subjective scoring with preferred outcomes is the #1 source of post-procurement regret.
Check
The agent MUST verify each of the following and file feedback for any violation:
- Methodology applied consistently — The same scoring scale, anchor points, and weights were applied to every vendor. No vendor was scored on a rubric that didn't apply to the others.
- Mandatory gates applied before scoring — Vendors that failed a mandatory requirement are disqualified, not scored down. The disqualification reason is recorded.
- Score rationale per cell — Every score has a one-line rationale citing the specific evidence used (response text section, reference customer call, POC result, certification). Scores without rationale are not auditable.
- POC-backed technical claims — Where the technical reviewer ran a POC, the score reflects POC outcomes; where the reviewer flagged a claim as unsupported, the score has been revised or the disqualification recorded.
- Reference checks beyond the vendor list — Reference contacts include at least one customer the vendor did not supply. Calls cite real, named, contactable customers — no anonymous attributions.
- Total cost of ownership complete — TCO includes every component the methodology named (licensing, implementation, integration, training, ongoing operational, exit). Zero rows have a note explaining the zero.
- Comparative differentiation explained — The ranking summary names the meaningful differences between top candidates (not just score deltas), so the user can decide on substance.
Common failure modes to look for
- A scorecard whose cells are numbers without rationale
- A vendor scored well on a capability category but no POC or reference evidence backs the score
- A TCO column that omits a cost the methodology required, or a row with no note explaining a zero
- A reference-check section that only cites vendor-provided contacts
- Mid-evaluation criterion changes — weights, scale, or category definitions that drifted between the first and last vendor scored
- Vendor-product-named scoring rubrics embedded in the plugin default (those belong in a project overlay)
5. Gate
controls advancement to the next stage
A local review UI opens; a human approves or requests changes via the review tool.
Fix loop
a separate track · Classifier → Evaluator → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain, one hat at a time, per finding, then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation: the tool returns severity_already_set and you simply advance. Only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.

   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
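The default target_invalidates rule of thumb and the severity-first dispatch order can be written down as two small functions. This is a sketch for illustration, not the engine's implementation: the function names, the `filer_agent` parameter, and the finding ids in the usage are assumptions.

```python
from typing import Optional

# Lower rank dispatches first, matching the blocker > high > medium > low rubric.
SEVERITY_RANK = {"blocker": 0, "high": 1, "medium": 2, "low": 3}


def default_target_invalidates(origin: str, filer_agent: Optional[str] = None):
    """Return the approval roles cleared on closure for a feedback origin."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # a human re-reviews; drift always escalates to a human
    if origin in {"adversarial-review", "studio-review"}:
        if not filer_agent:
            raise ValueError("review-origin feedback needs the filer agent name")
        return [filer_agent]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings):
    """findings: list of (feedback_id, severity). Highest severity first."""
    ranked = sorted(findings, key=lambda f: SEVERITY_RANK[f[1]])
    return [feedback_id for feedback_id, _ in ranked]
```

These are defaults only; the classifier still judges each finding on its actual impact.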
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject; that marks the finding invalid. A valid finding you can't reject. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6. That's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2: Evaluator
Focus: Apply the RFP's pre-defined scoring methodology to every vendor response. You are the plan / do role of the evaluate stage. Your output is the comparative scorecard the negotiation stage will use to drive counter-positions, and the rationale that lets the organization audit the selection later. Consistency across vendors matters more than precision on any single score.
Process
1. Lock the methodology before scoring
Re-read the scoring methodology produced in the requirements stage. Do NOT modify it. If a methodology gap surfaces (e.g., a vendor response category the methodology doesn't cover), file feedback against the requirements stage instead of inventing an ad-hoc rule.
Confirm before scoring:
- The mandatory gates (binary go / no-go) — apply these first; disqualified vendors don't enter scoring
- The weighted categories and their weights (sum to 100)
- The scoring scale and anchor points
- The TCO components in scope
2. Apply mandatory gates first
For each vendor:
- Walk the mandatory requirements one by one
- For each, mark "meets" / "fails" / "unclear"
- A "fails" on any mandatory requirement disqualifies the vendor from scoring
- An "unclear" requires a follow-up question to the vendor before scoring proceeds (don't guess in favor of either side)
Document the gate outcomes per vendor in the scorecard. A vendor that passed gates moves to scoring; a vendor that failed has its disqualification reason recorded and is not scored.
3. Score every requirement against the same scale
For each surviving vendor and each scored requirement:
- Read the vendor's evidence (response text, reference customer, certification, demo notes, POC results if available)
- Score against the anchor points of the rubric — don't invent intermediate values that aren't on the scale
- Write a one-line rationale per score citing the specific evidence
The rationale is the contract. A score with no rationale is unscored — the methodology requires evidence-backed scoring, not gut feeling. If two evaluators score the same response differently, the rationales make the disagreement visible.
4. Calculate total cost of ownership
TCO is one of the scored categories; calculate it explicitly and show the work:
| Cost component | Year 1 | Year 2 | Year 3 | Notes |
|---|---|---|---|---|
| Licensing / subscription | | | | |
| Implementation / professional services | | | | |
| Integration cost (internal + external) | | | | |
| Training | | | | |
| Ongoing operational / support | | | | |
| Exit / data migration estimate | | | | |
| Total | | | | |
Show every component, even when zero. A blank cell is ambiguous; an explicit zero with a note is the contract.
5. Produce the comparative ranking
After every surviving vendor is scored:
- Calculate the weighted total per vendor
- Show the per-category subtotals (functional, technical / integration, operational, commercial, strategic) — these often differ even when totals are close, and the differences drive the shortlist decision
- Write a comparative summary: top N candidates, the gaps that separate them, the risk profile differences, any vendor whose strengths concentrate in one category
A ranking with no differentiation analysis is not a ranking — it's a sorted list. Name the meaningful differences, not just the score deltas.
6. Hand off to the technical reviewer
The scorecard plus rationale plus TCO plus comparative summary goes to the technical reviewer. The technical reviewer verifies that the scored capabilities survive hands-on verification (POC, reference checks, integration assessment) and either confirms the scoring or files findings naming the entries that didn't survive.
Anti-patterns (RFC 2119)
- The agent MUST NOT change scoring criteria, weights, or scale mid-evaluation to favor any vendor.
- The agent MUST NOT score based on vendor presentations or marketing collateral rather than the documented response evidence.
- The agent MUST NOT score a requirement without a documented rationale citing the specific evidence used.
- The agent MUST NOT skip TCO components — every component in the methodology gets a row, even when zero, with a note explaining the zero.
- The agent MUST record the disqualification reason for any vendor that fails a mandatory gate; don't silently drop them.
- The agent MUST NOT invent intermediate scoring values that aren't on the methodology's scale.
- The agent MUST NOT name vendor products as preferred ahead of evaluation — the methodology is the only legitimate driver of the ranking.
- The agent MUST NOT embed organization-specific scoring rubrics or named procurement systems — those belong in a project overlay.
- The agent MUST show the work for every score — a sortable list with no rationale is not auditable.
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
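The whole-set rule amounts to checking every enumerated item, not only the ones the fix touched. The sketch below is illustrative: `still_defective` stands in for whatever on-disk check applies to the finding, and the returned action names mirror the advance / reject_hat choice described above.

```python
def closure_decision(enumerated, still_defective):
    """Check every enumerated item and decide whether the finding can close.

    enumerated: the full list of items named in the finding.
    still_defective: callable returning True if an item is still defective.
    Returns ("advance_hat", []) or ("reject_hat", survivors).
    """
    survivors = [item for item in enumerated if still_defective(item)]
    if survivors:
        return "reject_hat", survivors
    return "advance_hat", []
```

Because the loop runs over the finding's own list, items the fixer never touched are checked by construction.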
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden; reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing". That is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched. Spot-check the untouched items on disk first; survivors mean reject_hat