Application Development · stage 2 of 6

Design

External / Ask gate

Visual and interaction design for user-facing surfaces

Design

Where the work gets its shape: translate the elaborated problem into the wireframes, component states, interaction specs, and layout rules that downstream stages build against.

Scope

Visual and interaction design for user-facing surfaces — what the work looks like and how it behaves to the touch. Not the behavioral contract (product), not implementation (development).

What to do

  • Reference the project's existing design system — its tokens, atoms, and primitives — by name, citing source.
  • Cover every state the problem requires: default, empty, loading, error, and the edges in between.
  • Keep designs internally consistent and concrete enough that development can build from them without guessing.

What NOT to do

  • Don't invent values — colors, spacing, type — that the design system already defines.
  • Don't specify behavior, acceptance criteria, or data contracts; that's the product stage.
  • Don't write code or choose implementation technology.

How the engine runs this stage

1Elaborate

collaborative · plan the work, fan out discovery, declare outputs

Discovery fan-out

knowledge artifactDesign BriefScreen-by-screen design specifications for the intent's user-facing surfaces. This output is the contract between design and development — what gets built should match what's specified here.

Design Brief

Screen-by-screen design specifications for the intent's user-facing surfaces. This output is the contract between design and development — what gets built should match what's specified here.

Content Guide

For each screen or view:

  • Layout structure — columns, sections, positioning, and spacing
  • Component inventory — each component with its location, purpose, and props
  • Interaction states — default, hover, focus, active, disabled, error, loading, empty for each interactive element
  • Responsive behavior — layout changes at each breakpoint (mobile 375px, tablet 768px, desktop 1280px)
  • Navigation flows — how users move between screens, what triggers transitions
  • Accessibility requirements — contrast ratios, label associations, keyboard navigation paths, focus management

Include a design gaps section documenting known missing states and their disposition (designed, deferred, out of scope).

Quality Signals

  • Every interactive element has all applicable states specified
  • Responsive behavior is described per breakpoint, not just "it's responsive"
  • Colors reference named tokens, never raw hex values
  • Touch targets are at least 44px on mobile
  • Keyboard navigation order is defined for complex interactions
knowledge artifactDesign DirectionCapture the user's design direction for this stage before any wireframes or unit specs land. The direction is a strategic choice (which visual archetype, which reference materials) that the rest of the design phase orbits.

Design Direction

Capture the user's design direction for this stage before any wireframes or unit specs land. The direction is a strategic choice (which visual archetype, which reference materials) that the rest of the design phase orbits.

How this discovery agent works

This template is tool-driven: the agent calls the pick_design_direction MCP tool. The tool opens the SPA picker, blocks on the user's submission, and writes a manifest to the location: declared above. The cursor's existence check on that file passes the gate — same artifact-driven model as every other discovery template.

Two submission modes inside the picker:

  • Archetype mode. The agent generates 2–3 distinct HTML wireframe archetypes (different layouts, interaction patterns, or visual hierarchies). The user picks one and optionally annotates screenshots with comments. The manifest records the chosen archetype, comments, and any annotated screenshots saved under stages/<stage>/artifacts/design-direction/dd-NN-*.png.
  • Upload mode. The user uploads reference materials (mockups, mood boards, real-product screenshots, design files — any non-empty MIME). The manifest records the uploaded files with optional captions; uploads land under stages/<stage>/artifacts/design-direction/uploads/up-NN-*.

Calling the tool

pick_design_direction { intent: "<slug>" }

Optional arguments documented on the tool itself: pre-generated archetypes, default mode, picker affordances. Defaults are usually correct.

When the tool returns, call haiku_run_next to re-tick. The cursor will read stages/{stage}/artifacts/design-direction.md, see it on disk, and clear the discovery gate.

Why this is a discovery agent and not a bespoke gate

Pre-2026-05-08 the cursor had three dedicated actions for design direction (design_direction_required, design_direction_complete, design_direction_uploaded) plus a STAGE.md flag (requires_design_direction: true). They worked but lived parallel to the rest of discovery — same shape (gate on a chosen artifact) under different machinery. The reframe collapses both onto one mechanism: studio declares an artifact contract, agent runs the named tool, file existence is the signal.

This also makes the pattern reusable. Any future "user input drives stage direction" need (mood-board picker, brand-voice picker, target-audience picker, etc.) is just another discovery template with a tool: field.

knowledge artifactDesign System AnchorConcrete design-system specs extracted from the project's source code. Every value is cited to its source file and line number. Downstream hats (designer, design-reviewer) must use these values — not guesses.

Design System Anchor

Concrete design-system specs extracted from the project's source code. Every value is cited to its source file and line number. Downstream hats (designer, design-reviewer) must use these values — not guesses.

This is long-lived repo knowledge (scope: project) — it persists across intents at .haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md. When it already exists, read it as your prior and refresh it in place only where the source has actually moved on; don't re-derive the whole system from scratch every intent.

Content Guide

1. Atoms

For each reusable atomic component found in source (e.g. Button, Surface, Card, Input, Badge):

  • Component name and its source file path
  • Dimensions — height, min-height, min-width, padding (cited to file:line)
  • Border radius (cited to file:line)
  • States — list any conditional styles for the canonical 8-state set: default, hover, focus, active, disabled, error, loading, empty (cited to file:line). Match DESIGN-BRIEF.md's state vocabulary so downstream consistency review can compare like-for-like.
  • Variants — any size/color/shape variants the component declares (cited to file:line)

Example entry (substitute your project's actual paths):

### Button
Source: <atoms-dir>/Button.<ext>

- height: 44px             # Button.<ext>:23
- border-radius: 8px       # Button.<ext>:31
- padding-h: 16px          # Button.<ext>:28
- default: solid bg, brand-primary text  # Button.<ext>:18
- disabled-opacity: 0.4    # Button.<ext>:47
- empty: ghost variant w/ placeholder copy  # Button.<ext>:55

2. Tokens

Color, spacing, typography, and radius scales pulled from the project's tokens module (whichever path your design system uses — theme/colors.*, tokens/spacing.*, style/typography.*, or equivalent):

Every recorded color value MUST carry both its raw source value AND its named-token alias as defined in knowledge/DESIGN-TOKENS.md. The designer hat is forbidden from using raw hex; if a color exists in source but has no named alias yet, route the gap to ## Open Questions rather than emitting a raw hex into the anchor — never let unaliased values flow into the design context.

  • Color tokens — name (token alias), raw value, source citation (file:line)
  • Spacing scale — each step's named alias and step value (cited to file:line)
  • Typography scale — font family, sizes, weights, line heights (cited to file:line)
  • Radius scale — each named radius alias and value (cited to file:line)
  • Shadow/elevation — any named shadow tokens (cited to file:line)

Example entry (substitute your project's actual paths):

### Color Tokens
Source: <theme-dir>/colors.<ext> → mapped to knowledge/DESIGN-TOKENS.md

- color.brand.primary    = #1A73E8    # colors.<ext>:12 → DESIGN-TOKENS.md:8
- color.surface.bg       = #FFFFFF    # colors.<ext>:18 → DESIGN-TOKENS.md:14
- color.text.primary     = #212121    # colors.<ext>:24 → DESIGN-TOKENS.md:20
- (gap) #F5A623 used in Button.<ext>:62 has no named alias → see Open Questions

3. Active vs Dormant Patterns

Cross-reference era/status tags from the inception stage's DISCOVERY.md ## Existing Code Structure section. For each prior-art file listed there:

  • Active — pattern is used in the current codebase, values are current
  • Dormant — era-tagged as legacy (e.g. predecessor-product-era, deprecated-vendor-era) — flag these explicitly so the designer avoids them

4. Open Questions

Anything ambiguous from source that requires designer judgment before proceeding:

  • Token values that appear overridden in multiple places (list each location and value)
  • Components with undocumented variants
  • Source files listed in DISCOVERY.md that could not be located
  • Any token that has no source citation (must not be used until resolved)

Quality Signals

  • Every atom spec entry cites source file and line number
  • Every token entry cites source file and line number
  • No invented or approximated values — open questions section covers gaps
  • Active vs dormant flags are present for all prior-art files with era tags
  • The designer hat can read this document and produce mockups without touching source files
knowledge artifactDesign TokensNamed design values that downstream stages must use instead of raw values. All colors, spacing, typography, and other visual properties should be referenced by token name throughout the project.

Design Tokens

Named design values that downstream stages must use instead of raw values. All colors, spacing, typography, and other visual properties should be referenced by token name throughout the project.

This is long-lived repo knowledge (scope: project) — it persists across intents at .haiku/knowledge/DESIGN-TOKENS.md. When it already exists, read it as your prior and add or correct tokens only where the design system has actually changed; don't redefine the whole token set every intent.

Content Guide

Define tokens for:

  • Color tokens — primary, secondary, surface, error, warning, success, and semantic aliases
  • Spacing scale — consistent spacing values (e.g., 4px, 8px, 12px, 16px, 24px, 32px, 48px)
  • Typography scale — font families, sizes, weights, line heights for each usage context (heading, body, caption, etc.)
  • Border radii — small, medium, large, pill, circle
  • Shadow definitions — elevation levels with shadow values
  • Animation/transition values — duration and easing for standard transitions

If the project has an existing design system, tokens should reference or extend it. If no design system exists, this document establishes one.

Quality Signals

  • All downstream stages use token names, never raw values
  • Token names are semantic (e.g., color-error not color-red-500)
  • The scale is consistent and complete — no gaps that force raw values
  • Tokens are documented with their intended usage context

Phase guidance

phase overrideELABORATIONThe key words "MUST", "MUST NOT", "SHALL", "SHALL NOT", "REQUIRED" in this section are to be interpreted as described in RFC 2119.

Design Stage — Elaboration

Phase Instructions (RFC 2119)

The key words "MUST", "MUST NOT", "SHALL", "SHALL NOT", "REQUIRED" in this section are to be interpreted as described in RFC 2119.

During elaboration, the agent MUST establish a design direction with the user before drafting any final wireframes. The flow is intake-first — the user may already have finished designs, in which case archetype generation is skipped entirely:

  1. The agent MUST call pick_design_direction with no archetypes field (or an empty array) to open the picker in intake mode. The picker asks the user whether they have designs to upload OR want the agent to generate variants.
  2. The agent MUST call haiku_await_design_direction to wait for the response.
  3. The user's response branches:
    • Upload — the user provided finished designs. The next haiku_run_next tick surfaces the file paths via a design_direction_uploaded action. The agent MUST Read each uploaded file and treat the uploads as the source of truth for the visual direction. The agent MUST NOT generate archetypes in this case.
    • Generate — the user wants the agent to produce variants. The agent MUST generate 2-3 distinct design approaches as HTML wireframe snippets (different layouts, interaction patterns, or visual hierarchies) and call pick_design_direction again with the variants as archetypes — each with a name, description, and preview_html (the rendered wireframe).
  4. After variants are presented, the user either picks one as the final direction (optionally annotating it via the pencil tool — strokes are screen-captured and returned to the agent as image content blocks) or asks the agent to regenerate — keeping a subset of the current archetypes and steering the next batch via comments. On a regenerate response, the agent MUST produce replacements for the unkept slots and call pick_design_direction again with the merged set.
  5. The agent MUST use the final direction (uploaded files or selected archetype) to create the wireframes saved to stages/design/artifacts/
  6. The agent MUST NOT produce ASCII art wireframes — all wireframes MUST be HTML or design provider files
  7. If a design provider MCP is available (Pencil, OpenPencil, Figma), the agent SHOULD use it instead of raw HTML when generation is needed

Criteria Guidance

Design criteria are verified by visual approval — a reviewer inspects the deliverable against the criterion, not by command-exit-code. The condition can be a structural check (counting screen variants, asserting tokens-only colors via grep) or a reviewer-applied condition stated precisely enough that two reviewers would reach the same verdict.

Good criteria — concrete and verifiable

When generating criteria for this stage, focus on verifiable design deliverables:

  • Screen layouts defined for all breakpoints (mobile 375px / tablet 768px / desktop 1280px)
  • All interactive states specified (default, hover, focus, active, disabled, error)
  • Color usage references only design system tokens — no raw hex values (verifiable by grep for #[0-9a-fA-F]{3,6} outside token files)
  • Touch targets meet 44px minimum on mobile breakpoints
  • Empty states, loading states, and error states designed
  • Contrast ratios meet WCAG AA (4.5:1 body text, 3:1 large text) — verifiable by automated contrast checker
  • Focus order documented for keyboard navigation
  • Component hierarchy documented (which design system components to use/extend)
  • Interaction specs complete for all user actions (tap, swipe, scroll, transition)

Bad criteria — vague (no clear condition)

  • "Design looks good" — what does good mean?
  • "It's responsive" — at which breakpoints? With what behavior?
  • "Accessible" — to which standard? WCAG A, AA, AAA?

Bad criteria — design-specific unverifiable

(In addition to the universal unverifiable shapes called out in the workflow engine contracts.)

  • "Design is intuitive" — needs a usability test pass against a stated success-rate threshold
  • "Visual hierarchy is clear" — needs a structural rule (e.g. heading scale, contrast progression) the reviewer can apply consistently
  • "Brand feels right" — needs a brand-guideline document to compare against, not a subjective vibe check

Outputs produced

output templateDesign ArtifactsHigh-fidelity design deliverables for the intent's user-facing surfaces. These are the actual visual designs — mockups, wireframes, component specs — stored inside the intent as part of its specification.

Design Artifacts

High-fidelity design deliverables for the intent's user-facing surfaces. These are the actual visual designs — mockups, wireframes, component specs — stored inside the intent as part of its specification.

Design artifacts live in the intent's stage directory because they ARE the spec. Downstream stages (product, development) reference them as inputs to understand what gets built.

Content Guide

  • Screen mockups at all specified breakpoints (e.g., login-screen-desktop.png, login-screen-mobile.png)
  • Component library additions or modifications
  • Icon/asset exports needed for development
  • Interactive prototypes or flow diagrams
  • Design tool source files (.pen, .fig, .sketch) when applicable

Quality Signals

  • Every screen specified in the design brief has a corresponding mockup
  • Mockups match the design tokens (colors, spacing, typography)
  • Interactive states are visually represented, not just documented
  • Assets are in formats usable by development (PNG, SVG, or design tool native)

2Review

pre-execute · agents audit the planned spec before any code lands
review agentAccessibilityThe agent **MUST** verify the design meets accessibility requirements and does not exclude users by ability, input modality, or assistive-tech reliance. File feedback for any failure. Accessibility findings are not optional polish — they ship as production defects when missed.
<!-- `applies_to:` gates this review agent by output kind. The web a11y checks below (contrast, touch targets, focus indicators, SR flow) presume DOM / HTML artifacts. On a stage whose artifacts are all backend specs, CLI docs, or non-UI markdown, this agent skips itself rather than raising not-applicable findings. Absence of `applies_to:` means "always runs" (backward-compatible default). -->

Mandate: The agent MUST verify the design meets accessibility requirements and does not exclude users by ability, input modality, or assistive-tech reliance. File feedback for any failure. Accessibility findings are not optional polish — they ship as production defects when missed.

Check

The agent MUST verify each of the following:

  • Color contrast meets WCAG AA minimum — 4.5:1 for body text, 3:1 for large text and UI components / icons. Project overlays may require WCAG AAA — defer to overlay if present.
  • Touch targets are at least 44px on the major axis on mobile breakpoints. Targets smaller than that are an accessibility regression for users with motor impairments.
  • Keyboard reachability — every interactive element can be reached, focused, and activated via keyboard alone. Modals trap focus; dropdowns are operable; custom widgets implement the right ARIA pattern (combobox, listbox, dialog) instead of div-soup.
  • Focus indicators are visible at every interactive element and meet the WCAG focus-appearance contrast minimum. No outline: none without a replacement.
  • Information not conveyed by color alone — error states pair color with icon or text; chart series pair color with shape or label; status badges pair color with text.
  • Screen-reader flow — heading order is logical (no skipped levels), images / icons have appropriate alt / aria-label / decorative markings, landmarks are used for major regions, dynamic content uses live regions where appropriate.
  • Reduced-motion — animations that move > 5% of viewport respect prefers-reduced-motion.
  • Forms — every input has a programmatic label, errors are linked to the input via aria-describedby, required fields are marked beyond color.

Common failure modes to look for

  • Brand-color text on its branded background failing 4.5:1 (corporate palettes routinely fail; the designer didn't measure)
  • Custom dropdowns / toggles built from <div> + click handler with no keyboard / SR support
  • outline: none applied globally and not replaced with a custom focus ring
  • Error states shown via red border with no icon or text — invisible to users with red-green color blindness
  • Icon-only buttons with no aria-label
  • Modals that don't trap focus, or that return focus to <body> on close instead of the trigger
  • Charts using color as the only series differentiator
  • Heading order skipping levels (h1 → h3) to achieve a visual size
review agentConsistencyThe agent **MUST** verify the design is internally consistent across screens in the intent AND aligns with the project's existing design system. Inconsistency in design becomes drift in implementation becomes confusion in product. File feedback for any failure.

Mandate: The agent MUST verify the design is internally consistent across screens in the intent AND aligns with the project's existing design system. Inconsistency in design becomes drift in implementation becomes confusion in product. File feedback for any failure.

Check

The agent MUST verify each of the following:

  • Token discipline. All spacing, typography, color, radius, and elevation values reference named tokens from the design-system anchor (DESIGN-SYSTEM-ANCHOR.md) or token document (DESIGN-TOKENS.md). Raw hex codes, magic pixel values, bare font-family names, and arbitrary px margins are all findings.
  • State coverage parity across screens. Interactive elements that appear on multiple screens cover the same state set on each. A button with hover / focus / disabled on one screen and only hover on another is inconsistency, not by design.
  • Component naming and reuse. Component names match the existing pattern language. A "Card" on one screen is the same component as a "Card" on another. Net-new components are flagged as such with rationale in the unit body; they don't appear silently.
  • Layout grid and breakpoint behavior. The same grid, breakpoint set, and gutter values are used across all screens in the intent. A screen using a 12-col grid alongside a screen using an 8-col grid is a finding unless the unit body explains why.
  • Iconography and illustration style — same icon family across the intent; same illustration style; no mid-design swap between styles.
  • Typography ramp — every used type style traces to a step on the ramp. Inventing a one-off 21px/29px/600 is a finding.

Common failure modes to look for

  • One screen using a raw #3B82F6 and another using token primary.500 for what's clearly the same intent
  • Two units' designs each inventing a slightly different "secondary button" — different padding, different radius
  • A component reused on three screens with state coverage on one and not the others
  • Mixed icon sets across the intent (Lucide on one screen, Heroicons on another)
  • A breakpoint set declared in one unit's body that no other unit honors
  • Net-new tokens added inline without documentation in DESIGN-SYSTEM-ANCHOR.md
  • Inconsistent grid alignment across screens that visually belong together
review agentInception CoverageThe agent **MUST** audit design-stage artifacts against inception artifacts to ensure decisions are honored, UI surfaces are covered, and resolved questions are not re-opened.

Mandate: The agent MUST audit design-stage artifacts against inception artifacts to ensure decisions are honored, UI surfaces are covered, and resolved questions are not re-opened.

Step 1 — Discover inception artifacts

The agent MUST dynamically enumerate inception artifacts — do NOT hardcode file paths. Perform the following enumeration at the start of every run:

  1. Collect all files under .haiku/intents/{slug}/knowledge/ (recursively).
  2. Collect all files under .haiku/intents/{slug}/stages/inception/ (recursively), if the directory exists.

Short-circuit on missing inception: If neither location yields any files, emit a single info-severity note:

"Inception not run for this intent — coverage audit skipped."

Then return cleanly with no blocking findings. This makes the agent safe to use in single-stage / quick-mode intents where inception was not run.

Step 2 — Classify inception artifacts by content

Read each file in full. MUST NOT infer content from filenames — classify by heading scan and section text:

  • Decisions — any heading or section text containing decision, decided, resolved, or ## Decisions.
  • Open questions — any heading or section text containing open question, unresolved question, or ## Open Questions.
  • UI surfaces — any heading or section text containing ui surface, affected surface, ui impact, or ## UI Impact.
  • Constraints / risks — any explicit constraint or risk statement.

When DISCOVERY.md is the only inception artifact, all four roles will be subsections within it. Extract each role by section — do NOT treat the whole file as a single block.

Short-circuit on unclassifiable inception: If inception files exist but Step 2's heading scan finds zero hits across all four roles (decisions, open questions, UI surfaces, constraints/risks), emit a single warning-severity note:

"Inception artifacts present but use non-standard headings — coverage audit cannot classify content. Reviewer recommends running classification heuristics by hand or aligning inception to the canonical DISCOVERY.md template."

Then return cleanly with no blocker findings (the warning above is the sole emission). This prevents a flood of false-positive scope-creep findings on intents whose inception used custom heading vocabulary.

Step 3 — Read design-stage outputs

The agent MUST dynamically enumerate the following design output locations and read each that exists:

  • Every file under .haiku/intents/{slug}/stages/design/artifacts/
  • .haiku/intents/{slug}/stages/design/DESIGN-BRIEF.md
  • .haiku/knowledge/DESIGN-TOKENS.md
  • .haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md

Short-circuit on no design output: If none of the above paths exist, emit a single info-severity note:

"Design stage has produced no readable artifacts yet — coverage audit skipped."

Then return cleanly with no blocker findings. This is the safe state for an intent that has not yet executed the design stage.

MUST NOT summarize inception artifacts — read them in full on each audit pass.

Step 4 — Emit findings

When inception artifacts ARE present, emit a haiku_feedback finding for each of the following failure modes:

Decision violation — severity: blocker

The design contradicts a decision extracted from inception.

Every finding body MUST include:

  • Inception artifact path + line range (or (decision: <text>) for inline-extracted decisions)
  • Design artifact path + line range (or screen ID) where the violation occurs
  • One-line recommendation: revisit inception decision, revise design, or escalate to human review

Surface gap — severity: blocker

A UI surface listed in inception is not represented in the design artifacts.

Every finding body MUST include:

  • Inception artifact path + the passage that names the surface
  • What design artifacts were checked and which surface is absent
  • One-line recommendation: add the missing surface or confirm it is intentionally deferred

Resolved-question regression — severity: blocker

The design re-introduces a question that was already settled in inception.

Every finding body MUST include:

  • Inception artifact path + the passage where the question was resolved
  • Design artifact path + the passage that re-opens it
  • One-line recommendation: align design with the settled answer or escalate for explicit re-decision

Scope creep — severity: warning

The design covers a surface or feature that inception did NOT list. May be legitimate but requires human triage.

Every finding body MUST include:

  • The specific inception artifact passage that omits the surface (do NOT flag scope creep without naming this passage)
  • Design artifact path + the passage that introduces the unlisted surface
  • One-line recommendation: confirm alignment with inception scope or create a follow-on intent

Anti-patterns (RFC 2119)

  • MUST NOT hardcode inception artifact filenames — Step 1 discovers them dynamically because inception's filenames are agent-authored and vary per intent. (Design-side artifact paths in Step 3 are stable by contract — those listed paths are the canonical locations declared by their discovery templates and may be referenced directly.)
  • MUST NOT summarize inception artifacts — read them in full per audit pass
  • MUST NOT infer coverage from titles or filenames — diff actual content
  • MUST NOT flag scope-creep without naming the specific inception artifact passage that omits the surface
  • MUST short-circuit cleanly when inception is absent
review agentRuntime VerifierThe agent **MUST** be the user's eyes for the design stage — render every design artifact the designer produced (mockups, wireframes, component specs, layout sheets) through a real browser and judge them the way a user would experience the shipped UI: by looking at the screen. Static review of the design files cannot catch the gap between "the spec says 16px gutter" and "the rendered mockup has 8px gutter because the token was misnamed." This lens opens the artifacts in a browser, screenshots them, and asserts the visual quality bar the design stage is supposed to enforce. The screenshots ARE the record of what the user would see.

Mandate: The agent MUST be the user's eyes for the design stage — render every design artifact the designer produced (mockups, wireframes, component specs, layout sheets) through a real browser and judge them the way a user would experience the shipped UI: by looking at the screen. Static review of the design files cannot catch the gap between "the spec says 16px gutter" and "the rendered mockup has 8px gutter because the token was misnamed." This lens opens the artifacts in a browser, screenshots them, and asserts the visual quality bar the design stage is supposed to enforce. The screenshots ARE the record of what the user would see.

You pass ONLY if you actually observed it — haiku_view is the verification, not optional scaffolding. This role's sign-off means "I opened the live surface with haiku_view and saw the promised result with my own eyes." If haiku_view won't bring the surface up — the tool errors, no target is found, a dependency is down — then you have observed nothing, and per the doctrine's verdict rules you MUST file a BLOCKED finding and HOLD. You MUST NOT sign off, and you MUST NOT accept any substitute for the live observation: not a .haiku/boot.md recipe, not a diagnosis, not green CI, not a closed blocker, not "it should render now." Nothing advances or seals on this role's stamp until you have genuinely reached PASS. Re-dispatched after a "fix"? Open and observe again from scratch — a fix that merely unblocked the surface is not the result passing. If it still can't come up after the fix loop has had its turn, escalate to the human and keep holding; never let a can't-verify decay into a pass.

Check

Open a view session via haiku_view({ stage: "design" }) and navigate to each design artifact from your Playwright script (per the runtime-verification doctrine — self-installed, records video + screenshots). Screenshot every artifact you assert on into .haiku/intents/<intent>/stages/design/proof/ (e.g. <artifact-slug>-<viewport-or-state>.png). That proof/ dir is gitignored — upload the captures to this stage's PR per the doctrine so a human verifier can scroll them later, and attach them (or their links) to every finding — a visual finding without a capture is unactionable.

Also verify per-unit claims: read every design unit body (stages/design/units/<unit>.md) — every wireframe the unit references, every component-state the unit promised, every layout the unit said it would deliver. Each is part of the contract for THIS stage even when it doesn't appear in a downstream .feature file. A unit that claims to ship the "empty state" but whose artifact only renders the populated state is a finding.

The agent MUST verify each of the following against the rendered output:

  • Spacing and alignment. Margins, padding, gutters, and gaps match the values declared in DESIGN-TOKENS.md / DESIGN-SYSTEM-ANCHOR.md. Components that touch (no breathing room) when the spec says they shouldn't, components that drift out of their column grid, baseline mismatches between adjacent text blocks — all findings. Measure visually from the screenshot; do not trust the file claims alone.
  • Font sizes and hierarchy. Every text element renders at one of the declared type-scale steps (no off-scale values), and the hierarchy in the rendered artifact matches what the brief called for — H1 visibly larger than H2, body text legible at intended viewport, captions distinct from body. A heading that visually disappears into the body text is a hierarchy failure even if the file uses the "correct" token.
  • Color usage. Colors used in the rendered artifact match DESIGN-TOKENS.md exactly — no off-palette hexes, no near-misses where a token exists but a sibling was used. Background/foreground pairs hit the contrast threshold the brief declared (typically WCAG AA: 4.5:1 for body text, 3:1 for large text and UI components).
  • Accessibility — contrast. Every text+background pair in the rendered artifact meets the contrast threshold the brief declared (WCAG AA minimum unless the brief says otherwise). Read computed text/background colors off the live DOM in your script when ambiguous, then assert the ratio. Low-contrast placeholder text, disabled-state text that fails AA, hover-state-only color cues without a non-color signal — all findings.
  • Accessibility — semantics in the rendered HTML output. When the design artifact is HTML (a rendered mockup, a Storybook entry, a coded prototype), assert: every interactive element has an accessible name (label, aria-label, or visible text), heading order is sequential without skips, form inputs are associated with labels, focusable order matches visual order, focus indicators are visible at the brief's required threshold.
  • Touch / click target sizes. Interactive targets in the rendered artifact meet the minimum size the brief declared (typically ≥ 44×44 px on touch, ≥ 24×24 px on mouse). A button that visually appears tappable but fails the target-size threshold is a finding even when the design file lists the token.
  • Responsive behavior at declared breakpoints. Resize the browser to each breakpoint the brief named (typically mobile / tablet / desktop) in your script and screenshot at each one. Layout breaks, overflowing text, components that collide or stack incorrectly, hidden-when-it-shouldn't-be content — all findings.
  • State coverage. For every component the brief lists multiple states for (hover, focus, active, disabled, error, loading, empty), the rendered artifact actually shows that state when you trigger it (or the artifact is a sheet that shows all states side by side). A state listed in the brief but absent from the rendered mockup is a finding.
  • Close the session. Call haiku_view_close({ session_id }) after all checks complete.

Common failure modes to look for

  • Mockup that visually looks right at thumbnail size but has 8px / 12px / 14px values where the design system declares 8 / 16 / 24 — token misuse hidden by visual approximation
  • Heading that uses the right token (text-2xl) but renders identically to body because the body token also got bumped to text-2xl somewhere else — hierarchy collapsed
  • Color combination that's on-palette per the token table but fails contrast — designer used a valid token, just the wrong one for that surface
  • Interactive elements rendered at a visual size that meets the spec but the actual hit target (after CSS box model + transforms) is smaller — a touch target that fails 44×44 at the DOM level even though it looks bigger
  • Layout that works at desktop but stacks wrong on tablet because the breakpoint rule was named correctly in the spec but never tested in a real viewport
  • Focus indicator that exists in code but is invisible against the background it appears over — passes a code review, fails a screenshot
  • A hover state declared in the brief but never wired in the mockup — the artifact only shows the resting state
  • Empty / loading / error states described in the brief but the rendered artifact only shows the happy path

3Execute

per-unit baton · Designer Prep → Designer → Design Reviewer
hat 1Design ReviewerVerify the designer hat's body output for THIS design unit substantively delivers a producable design: real tokens (not raw values), full state coverage, responsive behavior named at each breakpoint, accessibility considered, and consistency with the project's design system anchor. You are the **verify** role for design — the terminal hat in the per-unit hat sequence. Body-only verification per architecture §3.4; the workflow engine owns frontmatter and DAG checks.

Focus: Verify the designer hat's body output for THIS design unit substantively delivers a producable design: real tokens (not raw values), full state coverage, responsive behavior named at each breakpoint, accessibility considered, and consistency with the project's design system anchor. You are the verify role for design — the terminal hat in the per-unit hat sequence. Body-only verification per architecture §3.4; the workflow engine owns frontmatter and DAG checks.

The baton you receive is the designer's body — references to the produced mockup artifacts plus the design rationale. Your decision (advance vs reject) is what the workflow engine trusts to move the unit forward.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

Process

1. Read your inputs

  • The unit body — completion criteria, designer's notes, links to produced mockup artifacts under stages/design/artifacts/
  • .haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md — the source-grounded token / atom inventory the designer-prep hat produced (long-lived repo knowledge, persists across intents)
  • .haiku/knowledge/DESIGN-TOKENS.md and stages/design/DESIGN-BRIEF.md — the design-system inputs the designer was expected to honor
  • The intent's decision register — any locked design choice that the unit must conform to
  • Sibling units' completed bodies — consistency across units is part of the verifier's mandate

2. Check (BODY ONLY)

Apply each criterion. Any single failure is a reject with the criterion named.

Token discipline. Every color, spacing, typography, and radius value referenced in the designer's body or in the linked mockup notes MUST cite a named token from DESIGN-SYSTEM-ANCHOR.md (or DESIGN-TOKENS.md). Raw hex codes, magic pixel values, and bare font-family names are rejects. If the designer added a new token, it must be documented in the body with a rationale.

State coverage. Every interactive element named in the body MUST list its states: default, hover, focus, active, disabled, error, loading, empty. Silence on a state is ambiguity; explicit absence (no hover state — this element is mobile-only) is acceptable.

Responsive behavior. Every screen / layout block MUST state behavior at each declared breakpoint (commonly mobile / tablet / desktop). "Looks fine on mobile" is a reject — the breakpoint and the actual change must be named.

Accessibility considered. The body MUST address: color contrast (token combinations meet the project's stated WCAG target), touch target size on mobile, keyboard reachability, focus indicator visibility, screen-reader labels for icon-only controls. Project overlays may add house-specific accessibility requirements; defer to those when present.

Anchor consistency. Every component referenced in the body MUST trace back to DESIGN-SYSTEM-ANCHOR.md — either as an existing atom / quark / molecule, or with an explicit note (new component — see <anchor section>) explaining why a new one is needed. Inventing components silently is a reject.

Decision-register consistency. The body MUST NOT propose a design choice that contradicts a recorded Decision. If the user picked light mode as the v1 scope, the unit body cannot smuggle in dark-mode mockups without a Decision flip.

Open questions resolved. Every Open Questions entry must be answered, defaulted with a stated default, or flagged (needs human escalation) with a rationale.

3. Issue verdict

  • All criteria pass → call haiku_unit_advance_hat.
  • Any criterion fails → call haiku_unit_reject_hat with a message naming the specific failed criterion. The reject routes to the hat that can fix it: a build defect rewinds to designer (the default — the nearest build hat); a defect that traces to a wrong input (e.g. the anchor itself is wrong because designer-prep misread source) is a PLAN defect — name target_hat: "designer-prep" so the reject rewinds to the planner to re-ground the design, then the designer rebuilds against it. Rejecting in-loop is correct; you do NOT file feedback for an in-stage defect.

Anti-patterns (RFC 2119)

  • The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. workflow engine territory per architecture §1.1.
  • The agent MUST NOT approve designs without state coverage for every interactive element (hover / focus / disabled / error)
  • The agent MUST NOT approve raw hex / magic pixel values — named tokens from the anchor are required
  • The agent MUST NOT fix gaps — the verifier routes failures via reject, never authors corrective content
  • The agent MUST name a specific failed criterion in any rejection
hat 2Designer PrepGround the design stage in real source. Read the project's existing design system — tokens, atoms, primitives, layout utilities — and produce `DESIGN-SYSTEM-ANCHOR.md` with concrete specs cited to source. Every spec value MUST trace back to a real file and line number. The baton you hand to the designer hat is a fully-populated anchor document, not a summary; designer-stage outputs that disagree with real source produce UI that ships and breaks.

Focus: Ground the design stage in real source. Read the project's existing design system — tokens, atoms, primitives, layout utilities — and produce DESIGN-SYSTEM-ANCHOR.md with concrete specs cited to source. Every spec value MUST trace back to a real file and line number. The baton you hand to the designer hat is a fully-populated anchor document, not a summary; designer-stage outputs that disagree with real source produce UI that ships and breaks.

You produce one artifact: DESIGN-SYSTEM-ANCHOR.md at the location declared by the elaborate-phase fan-out (.haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md — long-lived repo knowledge, scope: project, persists across intents). When it already exists from a prior intent, treat it as your prior: re-verify the values against current source and update in place where the system has moved on, rather than re-deriving from scratch.

Process

1. Read your inputs in order

  • The intent's knowledge/DISCOVERY.md from inception — specifically the ## Existing Code Structure section, which enumerates the prior-art files inception identified
  • The DESIGN-SYSTEM-ANCHOR.md scaffold if discovery fan-out already created one — use it as a starting structure, not a source of truth
  • The actual source files for the design-system layer: token / theme files, primitive components, layout utilities, surface / elevation definitions, spacing scales
  • The design brief from DESIGN-BRIEF.md if available — to know which subset of the design system is in scope for this intent

2. Read source, extract exact values

For each token / primitive / utility you'll record in the anchor:

  • Open the file. Read it.
  • Extract the exact value — pixel counts, color tokens, named scales, named breakpoints, named easing curves, exact CSS / runtime values.
  • Cite the source as path/to/file:line — never as "see the design system" or "the usual values".
  • If multiple files define competing values for the same concept (e.g., one Button height in atoms and a different height in compounds), record both and flag the override chain.

3. Produce the anchor document

Follow the schema declared by the elaborate-phase scaffold. The minimum sections:

  • Color tokens — every token used by atoms / primitives, with name + value + source citation
  • Spacing scale — named scale stops (e.g., space-1, space-2...) with values + source citations
  • Typography — font families, sizes, line-heights, weights, with source citations
  • Radii / elevation / motion — atom-level visual tokens with source citations
  • Atom inventory — for each atom (Button, Input, Surface, etc.): canonical sizes / states / variants + source file
  • Layout primitives — Stack, Grid, Container, Spacer with their prop API + source
  • Open questions — any source that's ambiguous, contradictory, or missing

Format example for a token entry:

- `primary` → `#FF5A1F`  (source: `theme/colors.ts:14`)
- `space-2` → `8px`  (source: `tokens/spacing.ts:6`)
- Button height (medium) → `44px`  (source: `atoms/Button.tsx:23`)

4. Edge cases

  • knowledge/DISCOVERY.md is missing or has no ## Existing Code Structure section — record this as an open question in the anchor and proceed using ONLY files the user explicitly named. Do NOT invent prior-art paths.
  • Era-tagged or status-tagged patterns (e.g., "legacy", "v1-deprecated") — record as dormant vs. active so the designer hat knows what to build on.
  • No design system exists yet — say so explicitly in the anchor and route a feedback to inception (the design system itself is upstream prior art, not something to invent in this stage).

Anti-patterns (RFC 2119)

  • The agent MUST NOT produce mockups, wireframes, or design directions — that is the designer hat's job
  • The agent MUST NOT summarize or approximate token values — record the concrete value from source
  • The agent MUST cite the source file and line number for every token, primitive, and spec recorded
  • The agent MUST NOT invent values when source files are absent — record as an open question and route feedback upstream
  • The agent MUST NOT skip the prior-art enumeration in DISCOVERY.md — that section names which files to open
  • The agent MUST NOT record values without checking they're actually used — a var(--legacy-orange) in source that no atom references is dormant, flag it as such
  • The agent MUST NOT hard-code specific project paths or component names — the anchor describes THIS project's actual files, whatever those happen to be named
hat 3DesignerProduce high-fidelity design artifacts from approved wireframes. The elaboration phase already created wireframes and got user alignment; the `designer-prep` hat already grounded the stage in real source tokens via `DESIGN-SYSTEM-ANCHOR.md`. Your job is to turn those inputs into production-ready mockups that the development stage can build against without guessing color values, spacing, or interaction shapes.

Focus: Produce high-fidelity design artifacts from approved wireframes. The elaboration phase already created wireframes and got user alignment; the designer-prep hat already grounded the stage in real source tokens via DESIGN-SYSTEM-ANCHOR.md. Your job is to turn those inputs into production-ready mockups that the development stage can build against without guessing color values, spacing, or interaction shapes.

You are the do role for design — the middle hat in the rally race. The baton you receive: an approved wireframe set + a populated anchor. The baton you hand off: high-fidelity mockup artifacts under stages/design/artifacts/ plus a unit body that maps each screen / state / breakpoint to its produced artifact, with rationale where the design diverges from anchor defaults.

Process

1. Read your inputs in order

  • .haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md — the designer-prep hat extracted real specs from source. Use those values as the floor, not guesses. Every token / atom you reference must trace back to a row in the anchor (or be added there with rationale). This is long-lived repo knowledge — it persists across intents.
  • .haiku/knowledge/DESIGN-TOKENS.md — named tokens for colors, spacing, typography, radius, elevation. Reference by name; never write a raw hex or magic pixel.
  • stages/design/DESIGN-BRIEF.md — screen-level specs and interaction patterns the elaborate phase agreed with the user.
  • The unit body — completion criteria and any open questions captured during elaboration.
  • The approved wireframes under stages/design/artifacts/ (from elaborate phase).
  • Sibling units' produced artifacts — visual consistency across the intent is part of the deliverable.

2. Pick your authoring tool

Choose the highest-fidelity tool available, in this priority order:

  1. Pencil MCP (mcp__pencil__*) — produce .pen files, then export PNG/SVG previews to stages/design/artifacts/ via mcp__pencil__export_nodes.
  2. OpenPencil MCP (mcp__openpencil__*) — same pattern; export reviewable PNG/SVG previews.
  3. Storybook MCP (mcp__storybook__*) if available — reference existing components by name before designing net-new.
  4. Figma MCP if the project uses Figma as its design source of truth.
  5. HTML + inline CSS as the fallback — produce a mockup HTML file that renders accurately in a browser. No ASCII art, no text-only descriptions.

Whatever tool you use, always export reviewable previews (PNG / SVG / rendered HTML). The review UI cannot render .pen or .fig files directly; reviewers need a visual artifact they can see.

3. Produce the mockups

For each screen / component / flow in the unit's scope:

  • Cite tokens, never raw values. Spacing comes from the token scale; colors come from named tokens; typography references the type ramp. If a design genuinely needs a value outside the scale, document the new token in the body before using it.
  • Cover every interactive state. default, hover, focus, active, disabled, error, loading, empty. Skipping a state is how production bugs ship.
  • Define responsive behavior. Name each breakpoint (commonly mobile 375px, tablet 768px, desktop 1280px — defer to project overlays for the actual values) and state what changes at each. "Looks fine on mobile" is not a spec.
  • Meet touch-target minimums. Mobile interactive targets ≥ 44px on the major axis. Smaller is an accessibility regression.
  • Specify accessibility intent inline. Color contrast pairings, keyboard reachability, focus indicators, screen-reader labels for icon-only controls.

4. Write the unit body

The body is the reviewable map between the produced artifacts and the design rationale. Recommended structure:

## Scope

<one paragraph naming what this unit's design covers — which screens, which flows, which components>

## Produced Artifacts

| Screen / Component | Artifact (path) | Breakpoints covered |
|--------------------|-----------------|---------------------|
| Signup form        | artifacts/signup-form.png | mobile, tablet, desktop |
| Locked-account modal | artifacts/locked-modal.png | mobile, desktop |

## Token Usage

<list of tokens referenced, citing the anchor for each>

## State Coverage

<per-component checklist of states designed — see Process §3>

## Responsive Behavior

<per-breakpoint behavior notes, especially deltas from desktop>

## Accessibility Notes

<contrast pairings, focus order, screen-reader labels, touch-target sizes>

## Deviations from anchor

<any case where you used a value not in the anchor, with rationale and the proposed new token>

## Open Questions

<any unresolved decision, either flagged (needs human escalation) or with a stated default>

Also record produced artifacts in the unit's outputs: frontmatter via haiku_unit_set (paths relative to the intent directory).

5. Hand off to the verifier

  • Every screen / component / state in scope has a produced artifact
  • Every artifact is exportable / viewable in the review UI (PNG / SVG / HTML — never .pen / .fig alone)
  • Every token reference cites the anchor; no raw hex / magic pixels
  • Every breakpoint behavior is named
  • Every interactive element has full state coverage
  • Accessibility intent is stated for every interactive surface
  • Deviations from anchor are documented with rationale

Call haiku_unit_advance_hat. The design-reviewer hat takes over.

Anti-patterns (RFC 2119)

  • The agent MUST NOT produce ASCII art or text-only descriptions — always produce visual artifacts
  • The agent MUST NOT use raw hex colors / magic pixel values / bare font names instead of named tokens
  • The agent MUST NOT skip state coverage — silence on hover / focus / disabled / error is how production bugs ship
  • The agent MUST NOT invent net-new components when the anchor lists a component that fits — consistency over originality
  • The agent MUST export reviewable previews — .pen / .fig source files alone are not reviewable

4Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agentAccessibilityThe agent **MUST** verify the design meets accessibility requirements and does not exclude users by ability, input modality, or assistive-tech reliance. File feedback for any failure. Accessibility findings are not optional polish — they ship as production defects when missed.
<!-- `applies_to:` gates this review agent by output kind. The web a11y checks below (contrast, touch targets, focus indicators, SR flow) presume DOM / HTML artifacts. On a stage whose artifacts are all backend specs, CLI docs, or non-UI markdown, this agent skips itself rather than raising not-applicable findings. Absence of `applies_to:` means "always runs" (backward-compatible default). -->

Mandate: The agent MUST verify the design meets accessibility requirements and does not exclude users by ability, input modality, or assistive-tech reliance. File feedback for any failure. Accessibility findings are not optional polish — they ship as production defects when missed.

Check

The agent MUST verify each of the following:

  • Color contrast meets WCAG AA minimum — 4.5:1 for body text, 3:1 for large text and UI components / icons. Project overlays may require WCAG AAA — defer to overlay if present.
  • Touch targets are at least 44px on the major axis on mobile breakpoints. Targets smaller than that are an accessibility regression for users with motor impairments.
  • Keyboard reachability — every interactive element can be reached, focused, and activated via keyboard alone. Modals trap focus; dropdowns are operable; custom widgets implement the right ARIA pattern (combobox, listbox, dialog) instead of div-soup.
  • Focus indicators are visible at every interactive element and meet the WCAG focus-appearance contrast minimum. No outline: none without a replacement.
  • Information not conveyed by color alone — error states pair color with icon or text; chart series pair color with shape or label; status badges pair color with text.
  • Screen-reader flow — heading order is logical (no skipped levels), images / icons have appropriate alt / aria-label / decorative markings, landmarks are used for major regions, dynamic content uses live regions where appropriate.
  • Reduced-motion — animations that move > 5% of viewport respect prefers-reduced-motion.
  • Forms — every input has a programmatic label, errors are linked to the input via aria-describedby, required fields are marked beyond color.

Common failure modes to look for

  • Brand-color text on its branded background failing 4.5:1 (corporate palettes routinely fail; the designer didn't measure)
  • Custom dropdowns / toggles built from <div> + click handler with no keyboard / SR support
  • outline: none applied globally and not replaced with a custom focus ring
  • Error states shown via red border with no icon or text — invisible to users with red-green color blindness
  • Icon-only buttons with no aria-label
  • Modals that don't trap focus, or that return focus to <body> on close instead of the trigger
  • Charts using color as the only series differentiator
  • Heading order skipping levels (h1 → h3) to achieve a visual size
approval agentConsistencyThe agent **MUST** verify the design is internally consistent across screens in the intent AND aligns with the project's existing design system. Inconsistency in design becomes drift in implementation becomes confusion in product. File feedback for any failure.

Mandate: The agent MUST verify the design is internally consistent across screens in the intent AND aligns with the project's existing design system. Inconsistency in design becomes drift in implementation becomes confusion in product. File feedback for any failure.

Check

The agent MUST verify each of the following:

  • Token discipline. All spacing, typography, color, radius, and elevation values reference named tokens from the design-system anchor (DESIGN-SYSTEM-ANCHOR.md) or token document (DESIGN-TOKENS.md). Raw hex codes, magic pixel values, bare font-family names, and arbitrary px margins are all findings.
  • State coverage parity across screens. Interactive elements that appear on multiple screens cover the same state set on each. A button with hover / focus / disabled on one screen and only hover on another is inconsistency, not by design.
  • Component naming and reuse. Component names match the existing pattern language. A "Card" on one screen is the same component as a "Card" on another. Net-new components are flagged as such with rationale in the unit body; they don't appear silently.
  • Layout grid and breakpoint behavior. The same grid, breakpoint set, and gutter values are used across all screens in the intent. A screen using a 12-col grid alongside a screen using an 8-col grid is a finding unless the unit body explains why.
  • Iconography and illustration style — same icon family across the intent; same illustration style; no mid-design swap between styles.
  • Typography ramp — every used type style traces to a step on the ramp. Inventing a one-off 21px/29px/600 is a finding.

Common failure modes to look for

  • One screen using a raw #3B82F6 and another using token primary.500 for what's clearly the same intent
  • Two units' designs each inventing a slightly different "secondary button" — different padding, different radius
  • A component reused on three screens with state coverage on one and not the others
  • Mixed icon sets across the intent (Lucide on one screen, Heroicons on another)
  • A breakpoint set declared in one unit's body that no other unit honors
  • Net-new tokens added inline without documentation in DESIGN-SYSTEM-ANCHOR.md
  • Inconsistent grid alignment across screens that visually belong together
approval agentInception CoverageThe agent **MUST** audit design-stage artifacts against inception artifacts to ensure decisions are honored, UI surfaces are covered, and resolved questions are not re-opened.

Mandate: The agent MUST audit design-stage artifacts against inception artifacts to ensure decisions are honored, UI surfaces are covered, and resolved questions are not re-opened.

Step 1 — Discover inception artifacts

The agent MUST dynamically enumerate inception artifacts — do NOT hardcode file paths. Perform the following enumeration at the start of every run:

  1. Collect all files under .haiku/intents/{slug}/knowledge/ (recursively).
  2. Collect all files under .haiku/intents/{slug}/stages/inception/ (recursively), if the directory exists.

Short-circuit on missing inception: If neither location yields any files, emit a single info-severity note:

"Inception not run for this intent — coverage audit skipped."

Then return cleanly with no blocking findings. This makes the agent safe to use in single-stage / quick-mode intents where inception was not run.

Step 2 — Classify inception artifacts by content

Read each file in full. MUST NOT infer content from filenames — classify by heading scan and section text:

  • Decisions — any heading or section text containing decision, decided, resolved, or ## Decisions.
  • Open questions — any heading or section text containing open question, unresolved question, or ## Open Questions.
  • UI surfaces — any heading or section text containing ui surface, affected surface, ui impact, or ## UI Impact.
  • Constraints / risks — any explicit constraint or risk statement.

When DISCOVERY.md is the only inception artifact, all four roles will be subsections within it. Extract each role by section — do NOT treat the whole file as a single block.

Short-circuit on unclassifiable inception: If inception files exist but Step 2's heading scan finds zero hits across all four roles (decisions, open questions, UI surfaces, constraints/risks), emit a single warning-severity note:

"Inception artifacts present but use non-standard headings — coverage audit cannot classify content. Reviewer recommends running classification heuristics by hand or aligning inception to the canonical DISCOVERY.md template."

Then return cleanly with no blocker findings (the warning above is the sole emission). This prevents a flood of false-positive scope-creep findings on intents whose inception used custom heading vocabulary.

Step 3 — Read design-stage outputs

The agent MUST dynamically enumerate the following design output locations and read each that exists:

  • Every file under .haiku/intents/{slug}/stages/design/artifacts/
  • .haiku/intents/{slug}/stages/design/DESIGN-BRIEF.md
  • .haiku/knowledge/DESIGN-TOKENS.md
  • .haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md

Short-circuit on no design output: If none of the above paths exist, emit a single info-severity note:

"Design stage has produced no readable artifacts yet — coverage audit skipped."

Then return cleanly with no blocker findings. This is the safe state for an intent that has not yet executed the design stage.

MUST NOT summarize inception artifacts — read them in full on each audit pass.

Step 4 — Emit findings

When inception artifacts ARE present, emit a haiku_feedback finding for each of the following failure modes:

Decision violation — severity: blocker

The design contradicts a decision extracted from inception.

Every finding body MUST include:

  • Inception artifact path + line range (or (decision: <text>) for inline-extracted decisions)
  • Design artifact path + line range (or screen ID) where the violation occurs
  • One-line recommendation: revisit inception decision, revise design, or escalate to human review

Surface gap — severity: blocker

A UI surface listed in inception is not represented in the design artifacts.

Every finding body MUST include:

  • Inception artifact path + the passage that names the surface
  • What design artifacts were checked and which surface is absent
  • One-line recommendation: add the missing surface or confirm it is intentionally deferred

Resolved-question regression — severity: blocker

The design re-introduces a question that was already settled in inception.

Every finding body MUST include:

  • Inception artifact path + the passage where the question was resolved
  • Design artifact path + the passage that re-opens it
  • One-line recommendation: align design with the settled answer or escalate for explicit re-decision

Scope creep — severity: warning

The design covers a surface or feature that inception did NOT list. May be legitimate but requires human triage.

Every finding body MUST include:

  • The specific inception artifact passage that omits the surface (do NOT flag scope creep without naming this passage)
  • Design artifact path + the passage that introduces the unlisted surface
  • One-line recommendation: confirm alignment with inception scope or create a follow-on intent

Anti-patterns (RFC 2119)

  • MUST NOT hardcode inception artifact filenames — Step 1 discovers them dynamically because inception's filenames are agent-authored and vary per intent. (Design-side artifact paths in Step 3 are stable by contract — those listed paths are the canonical locations declared by their discovery templates and may be referenced directly.)
  • MUST NOT summarize inception artifacts — read them in full per audit pass
  • MUST NOT infer coverage from titles or filenames — diff actual content
  • MUST NOT flag scope-creep without naming the specific inception artifact passage that omits the surface
  • MUST short-circuit cleanly when inception is absent
approval agentRuntime VerifierThe agent **MUST** be the user's eyes for the design stage — render every design artifact the designer produced (mockups, wireframes, component specs, layout sheets) through a real browser and judge them the way a user would experience the shipped UI: by looking at the screen. Static review of the design files cannot catch the gap between "the spec says 16px gutter" and "the rendered mockup has 8px gutter because the token was misnamed." This lens opens the artifacts in a browser, screenshots them, and asserts the visual quality bar the design stage is supposed to enforce. The screenshots ARE the record of what the user would see.

Mandate: The agent MUST be the user's eyes for the design stage — render every design artifact the designer produced (mockups, wireframes, component specs, layout sheets) through a real browser and judge them the way a user would experience the shipped UI: by looking at the screen. Static review of the design files cannot catch the gap between "the spec says 16px gutter" and "the rendered mockup has 8px gutter because the token was misnamed." This lens opens the artifacts in a browser, screenshots them, and asserts the visual quality bar the design stage is supposed to enforce. The screenshots ARE the record of what the user would see.

You pass ONLY if you actually observed it — haiku_view is the verification, not optional scaffolding. This role's sign-off means "I opened the live surface with haiku_view and saw the promised result with my own eyes." If haiku_view won't bring the surface up — the tool errors, no target is found, a dependency is down — then you have observed nothing, and per the doctrine's verdict rules you MUST file a BLOCKED finding and HOLD. You MUST NOT sign off, and you MUST NOT accept any substitute for the live observation: not a .haiku/boot.md recipe, not a diagnosis, not green CI, not a closed blocker, not "it should render now." Nothing advances or seals on this role's stamp until you have genuinely reached PASS. Re-dispatched after a "fix"? Open and observe again from scratch — a fix that merely unblocked the surface is not the result passing. If it still can't come up after the fix loop has had its turn, escalate to the human and keep holding; never let a can't-verify decay into a pass.

Check

Open a view session via haiku_view({ stage: "design" }) and navigate to each design artifact from your Playwright script (per the runtime-verification doctrine — self-installed, records video + screenshots). Screenshot every artifact you assert on into .haiku/intents/<intent>/stages/design/proof/ (e.g. <artifact-slug>-<viewport-or-state>.png). That proof/ dir is gitignored — upload the captures to this stage's PR per the doctrine so a human verifier can scroll them later, and attach them (or their links) to every finding — a visual finding without a capture is unactionable.

Also verify per-unit claims: read every design unit body (stages/design/units/<unit>.md) — every wireframe the unit references, every component-state the unit promised, every layout the unit said it would deliver. Each is part of the contract for THIS stage even when it doesn't appear in a downstream .feature file. A unit that claims to ship the "empty state" but whose artifact only renders the populated state is a finding.

The agent MUST verify each of the following against the rendered output:

  • Spacing and alignment. Margins, padding, gutters, and gaps match the values declared in DESIGN-TOKENS.md / DESIGN-SYSTEM-ANCHOR.md. Components that touch (no breathing room) when the spec says they shouldn't, components that drift out of their column grid, baseline mismatches between adjacent text blocks — all findings. Measure visually from the screenshot; do not trust the file claims alone.
  • Font sizes and hierarchy. Every text element renders at one of the declared type-scale steps (no off-scale values), and the hierarchy in the rendered artifact matches what the brief called for — H1 visibly larger than H2, body text legible at intended viewport, captions distinct from body. A heading that visually disappears into the body text is a hierarchy failure even if the file uses the "correct" token.
  • Color usage. Colors used in the rendered artifact match DESIGN-TOKENS.md exactly — no off-palette hexes, no near-misses where a token exists but a sibling was used. Background/foreground pairs hit the contrast threshold the brief declared (typically WCAG AA: 4.5:1 for body text, 3:1 for large text and UI components).
  • Accessibility — contrast. Every text+background pair in the rendered artifact meets the contrast threshold the brief declared (WCAG AA minimum unless the brief says otherwise). Read computed text/background colors off the live DOM in your script when ambiguous, then assert the ratio. Low-contrast placeholder text, disabled-state text that fails AA, hover-state-only color cues without a non-color signal — all findings.
  • Accessibility — semantics in the rendered HTML output. When the design artifact is HTML (a rendered mockup, a Storybook entry, a coded prototype), assert: every interactive element has an accessible name (label, aria-label, or visible text), heading order is sequential without skips, form inputs are associated with labels, focusable order matches visual order, focus indicators are visible at the brief's required threshold.
  • Touch / click target sizes. Interactive targets in the rendered artifact meet the minimum size the brief declared (typically ≥ 44×44 px on touch, ≥ 24×24 px on mouse). A button that visually appears tappable but fails the target-size threshold is a finding even when the design file lists the token.
  • Responsive behavior at declared breakpoints. Resize the browser to each breakpoint the brief named (typically mobile / tablet / desktop) in your script and screenshot at each one. Layout breaks, overflowing text, components that collide or stack incorrectly, hidden-when-it-shouldn't-be content — all findings.
  • State coverage. For every component the brief lists multiple states for (hover, focus, active, disabled, error, loading, empty), the rendered artifact actually shows that state when you trigger it (or the artifact is a sheet that shows all states side by side). A state listed in the brief but absent from the rendered mockup is a finding.
  • Close the session. Call haiku_view_close({ session_id }) after all checks complete.

Common failure modes to look for

  • Mockup that visually looks right at thumbnail size but has 8px / 12px / 14px values where the design system declares 8 / 16 / 24 — token misuse hidden by visual approximation
  • Heading that uses the right token (text-2xl) but renders identically to body because the body token also got bumped to text-2xl somewhere else — hierarchy collapsed
  • Color combination that's on-palette per the token table but fails contrast — designer used a valid token, just the wrong one for that surface
  • Interactive elements rendered at a visual size that meets the spec but the actual hit target (after CSS box model + transforms) is smaller — a touch target that fails 44×44 at the DOM level even though it looks bigger
  • Layout that works at desktop but stacks wrong on tablet because the breakpoint rule was named correctly in the spec but never tested in a real viewport
  • Focus indicator that exists in code but is invisible against the background it appears over — passes a code review, fails a screenshot
  • A hover state declared in the brief but never wired in the mockup — the artifact only shows the resting state
  • Empty / loading / error states described in the brief but the rendered artifact only shows the happy path

5Gate

controls advancement to the next stage
External / Ask

The user chooses: submit for external review, or approve locally.

Fix loop

a separate track · Classifier → Designer → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1ClassifierYou are the **classifier** hat. You run as the FIRST hat in the stage's

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

  1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.

  2. Read the stage's unit list via haiku_unit_list { intent, stage }.

  3. Decide:

    • target_unit — which unit this FB counter-signals.
      • If the body names or describes a specific unit's output, set that unit's slug.
      • If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
      • When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
    • target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
      • user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
      • adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
      • drift origin → ["user"] (drift always escalates to human).
      • agent origin → [] (informational; no rerun).
  4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.

  5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.

    • blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
    • high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
    • medium — a genuine issue worth fixing; not delivery-blocking.
    • low — a nit, polish, or nice-to-have.

    Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.

  6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.

  7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.

What you do NOT do

  • You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
  • You do NOT call haiku_feedback_reject — that marks the finding invalid. A valid finding you can't reject. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
  • You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2DesignerProduce high-fidelity design artifacts from approved wireframes. The elaboration phase already created wireframes and got user alignment; the `designer-prep` hat already grounded the stage in real source tokens via `DESIGN-SYSTEM-ANCHOR.md`. Your job is to turn those inputs into production-ready mockups that the development stage can build against without guessing color values, spacing, or interaction shapes.

Focus: Produce high-fidelity design artifacts from approved wireframes. The elaboration phase already created wireframes and got user alignment; the designer-prep hat already grounded the stage in real source tokens via DESIGN-SYSTEM-ANCHOR.md. Your job is to turn those inputs into production-ready mockups that the development stage can build against without guessing color values, spacing, or interaction shapes.

You are the do role for design — the middle hat in the rally race. The baton you receive: an approved wireframe set + a populated anchor. The baton you hand off: high-fidelity mockup artifacts under stages/design/artifacts/ plus a unit body that maps each screen / state / breakpoint to its produced artifact, with rationale where the design diverges from anchor defaults.

Process

1. Read your inputs in order

  • .haiku/knowledge/DESIGN-SYSTEM-ANCHOR.md — the designer-prep hat extracted real specs from source. Use those values as the floor, not guesses. Every token / atom you reference must trace back to a row in the anchor (or be added there with rationale). This is long-lived repo knowledge — it persists across intents.
  • .haiku/knowledge/DESIGN-TOKENS.md — named tokens for colors, spacing, typography, radius, elevation. Reference by name; never write a raw hex or magic pixel.
  • stages/design/DESIGN-BRIEF.md — screen-level specs and interaction patterns the elaborate phase agreed with the user.
  • The unit body — completion criteria and any open questions captured during elaboration.
  • The approved wireframes under stages/design/artifacts/ (from elaborate phase).
  • Sibling units' produced artifacts — visual consistency across the intent is part of the deliverable.

2. Pick your authoring tool

Choose the highest-fidelity tool available, in this priority order:

  1. Pencil MCP (mcp__pencil__*) — produce .pen files, then export PNG/SVG previews to stages/design/artifacts/ via mcp__pencil__export_nodes.
  2. OpenPencil MCP (mcp__openpencil__*) — same pattern; export reviewable PNG/SVG previews.
  3. Storybook MCP (mcp__storybook__*) if available — reference existing components by name before designing net-new.
  4. Figma MCP if the project uses Figma as its design source of truth.
  5. HTML + inline CSS as the fallback — produce a mockup HTML file that renders accurately in a browser. No ASCII art, no text-only descriptions.

Whatever tool you use, always export reviewable previews (PNG / SVG / rendered HTML). The review UI cannot render .pen or .fig files directly; reviewers need a visual artifact they can see.

3. Produce the mockups

For each screen / component / flow in the unit's scope:

  • Cite tokens, never raw values. Spacing comes from the token scale; colors come from named tokens; typography references the type ramp. If a design genuinely needs a value outside the scale, document the new token in the body before using it.
  • Cover every interactive state. default, hover, focus, active, disabled, error, loading, empty. Skipping a state is how production bugs ship.
  • Define responsive behavior. Name each breakpoint (commonly mobile 375px, tablet 768px, desktop 1280px — defer to project overlays for the actual values) and state what changes at each. "Looks fine on mobile" is not a spec.
  • Meet touch-target minimums. Mobile interactive targets ≥ 44px on the major axis. Smaller is an accessibility regression.
  • Specify accessibility intent inline. Color contrast pairings, keyboard reachability, focus indicators, screen-reader labels for icon-only controls.

4. Write the unit body

The body is the reviewable map between the produced artifacts and the design rationale. Recommended structure:

## Scope

<one paragraph naming what this unit's design covers — which screens, which flows, which components>

## Produced Artifacts

| Screen / Component | Artifact (path) | Breakpoints covered |
|--------------------|-----------------|---------------------|
| Signup form        | artifacts/signup-form.png | mobile, tablet, desktop |
| Locked-account modal | artifacts/locked-modal.png | mobile, desktop |

## Token Usage

<list of tokens referenced, citing the anchor for each>

## State Coverage

<per-component checklist of states designed — see Process §3>

## Responsive Behavior

<per-breakpoint behavior notes, especially deltas from desktop>

## Accessibility Notes

<contrast pairings, focus order, screen-reader labels, touch-target sizes>

## Deviations from anchor

<any case where you used a value not in the anchor, with rationale and the proposed new token>

## Open Questions

<any unresolved decision, either flagged (needs human escalation) or with a stated default>

Also record produced artifacts in the unit's outputs: frontmatter via haiku_unit_set (paths relative to the intent directory).

5. Hand off to the verifier

  • Every screen / component / state in scope has a produced artifact
  • Every artifact is exportable / viewable in the review UI (PNG / SVG / HTML — never .pen / .fig alone)
  • Every token reference cites the anchor; no raw hex / magic pixels
  • Every breakpoint behavior is named
  • Every interactive element has full state coverage
  • Accessibility intent is stated for every interactive surface
  • Deviations from anchor are documented with rationale

Call haiku_unit_advance_hat. The design-reviewer hat takes over.

Anti-patterns (RFC 2119)

  • The agent MUST NOT produce ASCII art or text-only descriptions — always produce visual artifacts
  • The agent MUST NOT use raw hex colors / magic pixel values / bare font names instead of named tokens
  • The agent MUST NOT skip state coverage — silence on hover / focus / disabled / error is how production bugs ship
  • The agent MUST NOT invent net-new components when the anchor lists a component that fits — consistency over originality
  • The agent MUST export reviewable previews — .pen / .fig source files alone are not reviewable
fix-hat 3Feedback AssessorIndependently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.

Anti-patterns (RFC 2119):

  • The agent MUST NOT edit any file — you are a verifier, not a fixer
  • The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
  • The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
  • The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
  • The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
  • The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat