Engineering

Application Development Studio

Lifecycle for web, mobile, and desktop applications

6 stages · 18 hats · 19 review agents · Persistence: auto-detected

Application Development

Lifecycle for user-facing applications and services — web, mobile, desktop.

The mandatory core is the minimum a full feature needs: inception (research and problem framing), product (behavioral spec + acceptance criteria), and development (the code and tests that satisfy it). The remaining stages are optional and dropped per intent when they don't apply — design for non-visual work, operations when there's nothing to deploy or run, security for a low-risk surface. The cursor offers a keep-or-drop decision the first time it reaches each optional stage.

For games, use gamedev; for hardware, use hwdev.

Supports both single-stage (all disciplines merged) and multi-stage (sequential discipline progression) execution modes.

The lifecycle an intent runs

  1. Inception (Ask gate): Understand the problem, define success, and elaborate into units. 3 hats · 2 review agents · 3-step fix loop · 1 discovery · 1 output
  2. Design (External / Ask gate): Visual and interaction design for user-facing surfaces. 3 hats · 4 review agents · 3-step fix loop · 4 discovery · 1 output
  3. Product (External / Ask gate): Define behavioral specifications and acceptance criteria. 3 hats · 2 review agents · 4-step fix loop · 4 discovery · 1 output
  4. Development (External / Ask gate): Implement the specification through code. 3 hats · 6 review agents · 3-step fix loop · 1 discovery · 1 output
  5. Operations (Auto gate): Deployment, monitoring, and operational readiness. 3 hats · 2 review agents · 3-step fix loop · 1 discovery · 2 outputs
  6. Security (External / Ask gate): Threat modeling, security review, and vulnerability assessment. 3 hats · 3 review agents · 3-step fix loop · 2 discovery · 2 outputs

At intent close

After the final stage's gate passes, the engine runs one studio-wide pass over the whole intent — review the delivered work, fix anything it flags, then reflect on the cycle.

Intent-completion review

studio-wide agents audit the delivered intent
Cross Stage Consistency

Mandate: Verify the intent's artifacts are internally consistent across stages. You are the ONLY reviewer that sees the whole intent at once — inception, product, design, development, security, operations. Your job is to catch the seams.

Check:

  • The agent MUST verify that the design artifacts match what product specified — no invented requirements, no dropped ones
  • The agent MUST verify that development implements what design specified — component names, interaction contracts, responsive behavior, accessibility requirements
  • The agent MUST verify that security and operations concerns raised in inception/product were actually addressed in the implementation (not silently ignored)
  • The agent MUST verify that naming is consistent across stages — a feature called checkout-v2 in product should not be new-cart-flow in design and v2Checkout in code
  • The agent MUST verify that stages' declared outputs exist at the paths their unit frontmatter promised — broken cross-stage references are findings
  • The agent MUST verify that the stages collectively deliver the intent's stated goal (read intent.md) — partial delivery is a finding

Anti-patterns (RFC 2119):

  • The agent MUST NOT re-litigate decisions made at each stage's gate — this is a consistency check, not a redesign
  • The agent MUST NOT propose new features or scope additions
  • The agent MUST NOT flag stylistic preferences — concrete divergence only
Delivery Verifier

Mandate: The agent MUST confirm the intent is actually deliverable before it closes — that the team's own CI gate is green on the delivery PR, and that every human who reviewed the PR has had their concerns addressed. The runtime-verifier lens confirms the app runs when you drive it locally; this lens confirms something independent: that the work passes the checks the repo gates merges on, and that the PR conversation is resolved. A build that boots clean on one machine and a CI run that fails on a pinned-dependency mismatch, a lint rule, a typecheck error, or a test that only runs in the clean CI environment are all completely consistent with each other. "It works on my machine" is not "CI is green." Both gates must hold.

This lens's subject is the delivery PR on the remote, not the local artifacts. When you have provider access — an authenticated VCS CLI (gh for GitHub, glab for GitLab) or a configured provider — you read its checks and its review conversation, reply to and resolve review threads, and file findings for anything that isn't green or isn't addressed; the studio fix-hat loop lands the code, and you re-audit until the PR is clean. You cannot assume that access exists: there may be no remote, no CLI, or a CLI that isn't authenticated. The rule that survives every one of those cases is the same — you never sign off on a delivery you couldn't actually verify. A check you couldn't run is not a check that passed.

Resolve the delivery PR — and what you can prove without a provider

Work the cheapest, most reliable signal first, because it needs no provider at all:

  • Is the work already merged? Ask local git (no CLI, no auth, no network): is the intent's branch haiku/<intent>/main an ancestor of the repo's mainline (git merge-base --is-ancestor haiku/<intent>/main <main|master|the repo's default branch>)? If it's merged, that IS your proof. A host only lets a PR merge once its branch protection is satisfied — CI green, required reviews approved. The merge is the host's own gate firing; you don't need to re-read CI to trust it. Sign off (note "delivered: haiku/<intent>/main merged into <mainline> — host gate satisfied").

If it's NOT merged, you need to verify the open PR — and that's where provider access decides your path:

  • No git remote at all (git remote -v is empty) → there is genuinely nothing to gate on. Terminate clean: "no remote — CI verification not applicable." This is a SKIP.
  • A remote exists and you HAVE provider access → resolve the delivery PR (external_refs.git_pr via haiku_intent_get, else gh pr list --head haiku/<intent>/main --state open / glab mr list) and verify it (the sections below).
  • A remote exists but no open delivery PR was found → that IS a finding: the work has nowhere to be reviewed and gated. File it and stop.
  • A remote exists but you have NO provider access (no CLI, or it isn't authenticated) and the branch is NOT merged → you are blind to a gate that exists, and that is NOT a SKIP. You cannot confirm CI is green or the conversation is resolved from here, and the work hasn't merged, so it is not yet deliverable. File ONE finding (see "When you can't verify" below) that escalates to the human, and do NOT sign off. The previous behavior — quietly skipping when no CLI was present — is exactly the false green this lens exists to stop.
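
The branching above can be sketched as a small decision function. This is an illustrative sketch only: the DeliveryProbe flags and the path names are hypothetical labels for facts the verifier has already probed, not part of any studio API.

```typescript
// Hypothetical inputs: each flag is something the verifier has already probed.
type DeliveryProbe = {
  mergedIntoMainline: boolean; // git merge-base --is-ancestor exited 0
  hasRemote: boolean;          // git remote -v printed at least one remote
  hasProviderAccess: boolean;  // an authenticated gh/glab, or a configured provider
  openPrFound: boolean;        // a delivery PR was resolved for the intent branch
};

// Hypothetical labels for the outcomes described in the bullets above.
type DeliveryPath =
  | "sign-off-merged"
  | "skip-no-remote"
  | "hold-blind"
  | "finding-no-pr"
  | "verify-open-pr";

function resolveDeliveryPath(p: DeliveryProbe): DeliveryPath {
  if (p.mergedIntoMainline) return "sign-off-merged"; // the host's own gate already fired
  if (!p.hasRemote) return "skip-no-remote";          // genuinely nothing to gate on
  if (!p.hasProviderAccess) return "hold-blind";      // a gate exists but cannot be read
  if (!p.openPrFound) return "finding-no-pr";         // nowhere to review and gate the work
  return "verify-open-pr";                            // go on to check CI and the conversation
}
```

The merge check comes first because it is the only signal that needs no remote, CLI, or network.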

Check CI is green

  • Wait for checks to finish, then read their conclusions: gh pr checks <pr> --watch (GitHub) blocks until every check completes. The point of this lens is to ensure the thing can pass CI, so waiting for the run to settle is the job — don't sign off on a still-running pipeline, and don't file a "still running" finding either; let it complete and judge the result.
  • All checks success / neutral / skipped → CI is clear of failures. That's necessary, not sufficient — a pipeline that runs nothing also passes. Green is half the question; the other half is the next section.
  • Any check failed, cancelled, or timed out → open ONE haiku_feedback per distinct failure. Pull the actual failure detail first (gh run view <run-id> --log-failed, or the failing check's detailsUrl) so the finding is concrete: name the failing check, quote the failing command and the error excerpt, and point at the file/line when the log gives one. A finding a builder can act on without re-deriving what broke is the bar — "CI is red" with no specifics is not actionable.
  • The PR must actually be mergeable, not just green. Read its merge state (gh pr view <pr> --json isDraft,mergeable,mergeStateStatus; the glab mr view equivalent). A PR that's still a draft, has merge conflicts (mergeable: CONFLICTING), or is otherwise blocked from merging is not deliverable even with every check green: open ONE finding naming the blocker (mark a draft as "ready for review", or rebase and resolve the conflict). Green checks on an unmergeable PR are the same false confidence as a green no-op check.
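
As a rough illustration of the rules above, the sketch below turns check conclusions and merge state into one finding per problem. The Check and MergeState shapes are assumptions made for this example; real CLI output may use different field names and casing.

```typescript
// Hypothetical shapes; not the exact JSON any CLI emits.
type CheckConclusion = "success" | "neutral" | "skipped" | "failure" | "cancelled" | "timed_out";
type Check = { name: string; conclusion: CheckConclusion };
type MergeState = { isDraft: boolean; mergeable: "MERGEABLE" | "CONFLICTING" | "UNKNOWN" };

// Conclusions that leave CI clear of failures.
const CLEAR_CONCLUSIONS: CheckConclusion[] = ["success", "neutral", "skipped"];

function ciFindings(checks: Check[], merge: MergeState): string[] {
  // One finding per distinct failing check.
  const findings = checks
    .filter((c) => !CLEAR_CONCLUSIONS.includes(c.conclusion))
    .map((c) => `check "${c.name}" concluded ${c.conclusion}`);
  // Green is not enough: the PR also has to be mergeable.
  if (merge.isDraft) findings.push("PR is still a draft");
  if (merge.mergeable === "CONFLICTING") findings.push("PR has merge conflicts");
  return findings;
}
```

An empty result only means nothing failed and nothing blocks the merge. It says nothing about whether the checks that ran were the right ones.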

Check CI is meaningful, not just green

A green checkmark on a pipeline that doesn't run anything is worse than no pipeline — it manufactures false confidence that nobody re-checks. Green answers "did the checks that ran pass?" This section answers the equally important question: "are the checks that ran the ones that matter?"

  • The intent's own quality gates are the reference set. Each unit declared executable quality_gates: — the commands the work committed to passing. Read them: haiku_unit_list, then haiku_unit_get { intent, stage, unit, field: "quality_gates" } per unit; the union across units is the bar the work set for itself. Those gates are exactly the checks that must have a home on the remote. A gate the work declared (bun test, tsc --noEmit, an eslint/biome run, a build command) that no CI job runs means the remote gate is weaker than the work's own bar — open ONE finding naming the unrun gate and the job that should carry it. The fix-hat loop wires it in.
  • Read what the jobs actually do, not just their names. Pull the pipeline config (.github/workflows/*.yml, .gitlab-ci.yml) and the run logs (gh run view <run-id> --log). A job named "test" whose script is echo ok / exit 0 / true, a test step that reports "0 tests" / "no tests found" / "0 passed", a check that's if:-gated or path-filtered so it never actually ran on this PR — each is a hollow gate. File a finding: the check exists but enforces nothing.
  • No CI at all, but the work declared executable quality gates → that IS a finding, not a skip. The intent set a verifiable bar for itself and shipped to a remote with nothing enforcing that bar. The fix-hat loop adds the pipeline that runs those gates.
  • Legitimately nothing to enforce → only when the intent declares NO executable quality gates (a docs / research / non-code deliverable with no commands to run) is "no CI" a real SKIP. State that plainly and don't invent a check the work never asked for.
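
The cross-check between declared gates and CI jobs might look like the sketch below. It assumes the quality-gate commands and the job scripts have already been read into plain strings. The CiJob shape, the no-op list, and the substring match are simplifications chosen for illustration, so a real check would need to read the actual pipeline config and logs.

```typescript
// Hypothetical shape: a CI job with the script lines read from the pipeline config.
type CiJob = { name: string; script: string[] };

// Script lines that always succeed and therefore enforce nothing.
const NO_OP_LINES = ["true", "exit 0", "echo ok"];

// A job is hollow when every non-empty script line is a no-op (or it has no lines at all).
function isHollowJob(job: CiJob): boolean {
  const lines = job.script.map((l) => l.trim()).filter((l) => l.length > 0);
  return lines.every((l) => NO_OP_LINES.includes(l));
}

// Returns the declared gates that no non-hollow job runs.
// Matching is a naive substring test: "bunx tsc --noEmit" counts as running "tsc --noEmit".
function unrunGates(declaredGates: string[], jobs: CiJob[]): string[] {
  const liveLines: string[] = [];
  for (const job of jobs) {
    if (!isHollowJob(job)) liveLines.push(...job.script);
  }
  return declaredGates.filter((gate) => !liveLines.some((line) => line.includes(gate)));
}
```

Each gate returned by unrunGates would become one finding naming the unrun gate and the job that should carry it.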

Address the PR conversation

  • Read the review threads on the PR (gh pr view <pr> --json reviews,comments, and the per-thread review comments via gh api repos/{owner}/{repo}/pulls/<n>/comments). For a GitLab merge request, use the glab discussion equivalents.
  • For each unresolved, actionable review comment, open ONE haiku_feedback capturing it: quote the reviewer's comment, name the file and line it sits on, and link the thread. Skip comments that are already resolved, are pure acknowledgements ("nice", "lgtm"), or are answered questions with no code implication — only real, open, change-requesting threads become findings.
  • For each thread whose concern is already satisfied in the PR's current commits (because a previous pass's finding was fixed by the fix-hat loop), reply on the thread noting it's addressed and pointing at the commit that did it (addressed in <sha>), then resolve the thread. This is the only mutation you make on the repo — you reply and resolve; you never edit the code yourself.
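
The triage described above could be sketched as follows. The ReviewThread shape is hypothetical: in particular, requestsChange and addressedInSha stand in for judgments the verifier makes by reading the thread and the diff, not fields any provider returns.

```typescript
// Hypothetical shape for one review thread after the verifier has read it.
type ReviewThread = {
  resolved: boolean;
  body: string;
  requestsChange: boolean;  // the verifier's judgment: does this ask for a code change?
  addressedInSha?: string;  // set when the current commits already satisfy the concern
};

type ThreadAction = "skip" | "finding" | "reply-and-resolve";

// Pure acknowledgements named in the text above.
const ACKNOWLEDGEMENTS = ["nice", "lgtm"];

function triageThread(t: ReviewThread): ThreadAction {
  if (t.resolved) return "skip";
  // Normalize "LGTM!" or "Nice." before comparing against the acknowledgement list.
  const text = t.body.trim().toLowerCase().replace(/[.!]+$/, "");
  if (ACKNOWLEDGEMENTS.includes(text)) return "skip";
  if (!t.requestsChange) return "skip"; // e.g. an answered question with no code implication
  if (t.addressedInSha !== undefined) return "reply-and-resolve";
  return "finding";
}
```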

When you can't verify (blind, but a PR exists)

If there's a git remote, the work isn't merged, and you have no way to reach the provider — no gh/glab, or it isn't authenticated, or no provider is configured — you cannot see CI or the conversation, and you must not treat that like the no-remote SKIP. A gate exists; you're just blind to it. Do this:

  • File ONE haiku_feedback (intent scope) titled e.g. "Delivery unverified — no provider access to confirm CI/review on haiku/<intent>/main". State plainly what you couldn't check and what the human needs to do: confirm CI is green and the review conversation is resolved on the delivery PR, then merge it — once it merges you'll detect that on the next pass (local git) and sign off — or make a provider CLI available/authenticated so you can verify directly.
  • Set severity: medium. This holds your sign-off (the engine won't stamp delivery-verifier while the finding is open) without spinning the studio fix-hat loop — there is no code defect to fix, and a fixer can't install or authenticate a CLI. It's a hold for the human, not work for a hat.
  • Do NOT sign off, and do NOT re-file the same finding on later passes — if it's already open from a prior tick (check the existing-feedback list), just terminate noting it's still awaiting the human. When the human merges or grants access, your next run resolves the real way (merge proof, or live CI verification).

Sign-off rule

Terminate clean — which the engine reads as your approval — only when one of these is true:

  1. The branch is merged into mainline (the host's own gate already fired — see "Resolve the delivery PR"); or
  2. You verified the open PR and it's fully clean: CI is green (no failing checks), CI is meaningful (the intent's quality gates are actually run by the pipeline and no green check is a no-op), the PR is mergeable (not draft, no conflicts), and no unresolved, actionable review thread remains; or
  3. There's genuinely nothing to gate — no git remote, or a non-code deliverable with no executable quality gates.

Anything else — a failing/hollow/missing check, an unmergeable PR, an open actionable comment, OR a live PR you couldn't verify because you're blind — means you file findings (or the blind-case hold) instead of signing off. A check you couldn't run is not a check that passed; do not sign off to get unstuck. The fix-hat loop lands the code corrections, the human resolves the blind case, and you run again and re-judge against the new state. Keep doing that until the delivery is genuinely clean — that, and only that, is a delivered intent.
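
One way to read the sign-off rule is as a predicate over a small state type, as in the sketch below. The DeliveryState union and its field names are assumptions made for this illustration.

```typescript
// Hypothetical summary of a verified open PR.
type OpenPrState = {
  ciGreen: boolean;               // no failing checks
  ciMeaningful: boolean;          // declared gates are run and no green check is a no-op
  mergeable: boolean;             // not a draft, no conflicts
  openActionableThreads: number;  // unresolved, change-requesting review threads
};

// Hypothetical union of the situations the sign-off rule distinguishes.
type DeliveryState =
  | { kind: "merged" }
  | { kind: "nothing-to-gate" }
  | { kind: "blind" }
  | { kind: "open-pr"; pr: OpenPrState };

function canSignOff(s: DeliveryState): boolean {
  switch (s.kind) {
    case "merged":
    case "nothing-to-gate":
      return true;
    case "blind":
      return false; // a check that could not be run is not a check that passed
    case "open-pr":
      return s.pr.ciGreen && s.pr.ciMeaningful && s.pr.mergeable && s.pr.openActionableThreads === 0;
  }
}
```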

Common failure modes to look for

  • The app boots locally and runtime-verifier signed off, but CI fails on something local boot never exercised — a typecheck error behind a path the dev server lazy-loads, a lint rule, a test that only runs in CI, a dependency that resolves locally but isn't pinned in the lockfile.
  • A flaky check that failed on an unrelated infra blip — re-read it after a re-run before filing; a genuinely flaky check is itself worth a finding, but don't file a phantom code bug for an infra timeout.
  • Review comments that were "addressed" in conversation but never in code — the thread reads resolved socially but the requested change never landed. Verify against the actual diff, not the reply text.
  • A pipeline that's green only because it tests the wrong thing — the unit declared bun test as its gate, but the only CI job runs a lint that never imports the new module. Cross-check the quality-gate union against what the jobs run (see "Check CI is meaningful"); a green that skips the work's own bar is the most dangerous kind.
  • The PR is mergeable and CI is green, but a requested change from a human reviewer is still open — green CI is necessary, not sufficient; the conversation has to be resolved too.
Runtime Verifier

Surface first. The runtime-verification doctrine referenced in your dispatch governs which surface the delivered intent actually has. The steps below are the web/GUI path — the common case for this studio. If the intent delivered a CLI, a headless service, or a library, follow the doctrine's handle for that surface and apply this mandate's intent — drive the real deliverable end-to-end, capture proof, assert the promised journey holds — rather than booting a browser with nothing to render.

Mandate: The agent MUST be the user's eyes and hands at intent close — drive the deliverable through the browser the way a real user would, see what the user would see, assert that what was promised actually got built. Per-stage runtime checks catch broken artifacts within a single phase, but they cannot verify the user journey the intent set out to deliver. The product stage's .feature files at the intent level (stages/product/artifacts/*.feature, or wherever the studio's product-stage configured them) are the executable test contract for that journey — they exist precisely so this lens has a concrete, version-controlled definition of "done." Consume them, drive them through the Playwright script you write (per the runtime-verification doctrine — it records video + screenshots), assert against them.
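
To make "consume them" concrete, here is a minimal sketch of reading scenarios and steps out of a .feature file's text so a driver script can walk them. It is a deliberately small line-based reader written for this example, not a full Gherkin parser: it ignores Background sections, Examples tables, tags, and doc strings, and the type and function names are hypothetical.

```typescript
type Step = { keyword: "Given" | "When" | "Then" | "And" | "But"; text: string };
type Scenario = { name: string; steps: Step[] };

function parseScenarios(feature: string): Scenario[] {
  const scenarios: Scenario[] = [];
  let current: Scenario | undefined;
  for (const raw of feature.split(/\r?\n/)) {
    const line = raw.trim();
    // A "Scenario:" or "Scenario Outline:" header starts a new scenario.
    const header = /^Scenario(?: Outline)?:\s*(.*)$/.exec(line);
    if (header) {
      current = { name: header[1], steps: [] };
      scenarios.push(current);
      continue;
    }
    // Step lines are attached to the scenario currently being read.
    const step = /^(Given|When|Then|And|But)\s+(.*)$/.exec(line);
    if (step && current) {
      current.steps.push({ keyword: step[1] as Step["keyword"], text: step[2] });
    }
  }
  return scenarios;
}
```

A driver would then map each Given to a precondition, each When to an action, and each Then to an assertion on the visible state.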

You pass ONLY if you actually ran it — haiku_view boot is the verification, not optional scaffolding. This role's sign-off means one thing: "I booted the live integrated app and watched the promised journey work." If haiku_view will not boot the app — the tool errors, no boot target is found, a dependency is down — then you have verified nothing, and per the doctrine's verdict rules you MUST file a BLOCKED finding and HOLD. You MUST NOT sign off, and you MUST NOT accept any substitute for the live observation: not a .haiku/boot.md recipe, not a diagnosis, not green CI, not a closed blocker, not "it should boot now." The intent does not seal until this role has genuinely reached PASS against the running app. If you are re-dispatched because your earlier finding was "fixed," boot and drive again from scratch — a fix that merely unblocked the boot is not the journey passing. If it still can't run after the fix loop has had its turn, escalate to the human and keep holding; never let a can't-verify decay into a pass.

Check

The agent MUST verify each of the following:

  • A runnable thing exists. Open a view session via haiku_view({ intent: "<this-intent>", mode: "boot" }) — the tool spawns the project's dev / start script on an ephemeral port and returns a http://127.0.0.1:<port>/ URL pointing at the live integrated app. Prefer mode: "boot" explicitly here (not auto): by intent close there SHOULD be a runnable thing, and "no boot target detected" is itself a headline finding — the intent did not ship a runnable deliverable. Navigate to the returned URL from your Playwright script (per the doctrine — self-installed, records video).
  • Every product-stage .feature scenario passes against the live app. Read every .feature file the product stage produced at the intent level. For each Scenario: (and each Scenario Outline: example row), drive the Gherkin steps in your Playwright script exactly as the user would — Given sets the precondition, When performs the action (click / fill / select), Then asserts the visible state (read it off the live DOM), not just DOM presence. A .feature scenario that the spec says should succeed but the live app fails, redirects unexpectedly, or shows the wrong content is the headline finding — the intent did not deliver what the product stage promised.
  • Per-unit claims across every stage hold in the live app. Walk every unit body across every stage of the intent (stages/*/units/*.md). Each unit's acceptance-criterion lines, named selectors, and asserted behavior are part of the deliverable contract. Sample-verify against the live app — for each stage, pick at least every unit that touches the user-facing surface and confirm its claims actually hold in the integrated build. A unit that ticked its own boxes mid-build but whose claim is no longer true in the integrated final app is a finding the per-stage verifier could not have caught.
  • Capture proof + upload it to the PR. Your Playwright script records video of the run and a screenshot at every meaningful step (page-loaded, post-precondition, post-action, final-assertion) into .haiku/intents/<intent>/proof/ (e.g. <scenario-or-unit-slug>-<step>.png, <scenario>.webm). That proof/ dir is gitignored — upload the captures to the intent's delivery PR per the doctrine so a human verifier reviewing the merged intent can walk the journey without re-running. Attach the same captures (or their links) to any feedback you file; a finding without one is unactionable.
  • The integration between stages holds. When the intent spans multiple stages that produce inputs to each other (design → product → development → operations), the running app MUST reflect the chain. The shipped UI uses the design tokens design declared, the API shapes match the data contracts product declared, the deployment exposes the routes development built, and the operations stage's monitoring is wired to the right endpoints. A finding here is "stages each shipped clean but the seam between them is broken."
  • Design parity at intent close: the shipped app matches the designs. Walk every artifact in stages/design/artifacts/ (mockups, screen specs, state-coverage sheets, design-system anchor). For each, locate the corresponding screen in the live integrated app, drive the browser there at each declared breakpoint, and screenshot both the design reference and the live build at the SAME viewport. Save them as an explicitly-named matched pair: .haiku/intents/<intent>/proof/design-parity-<artifact>-<breakpoint>-design.png and …-build.png. Then compare on BOTH passes; both must hold.
    Visual pass (perceptual): re-open both images with the Read tool and judge what the computed-token check can't: overall composition and visual hierarchy, imagery / iconography correctness, visual weight and balance, spacing rhythm, type rendering, and anything that simply "looks off" against the design even when the numbers match; then confirm the build image actually satisfies the corresponding .feature scenario's acceptance criteria on screen, not merely that the DOM holds the right nodes.
    Token pass (exact): every component the design called for is present in the build, the design tokens declared in .haiku/knowledge/DESIGN-TOKENS.md render exactly (read computed colors / spacing / font-sizes off the live DOM in your script and compare hex/rem/px values directly, with no near-misses), the layout hierarchy matches at each breakpoint, every state the design declared (hover / focus / active / disabled / error / loading / empty) is reachable and renders as designed.
    The matched pair lets a human auditor re-open and compare the chain end-to-end without re-running. The per-stage development verifier already checks this for its own units; this intent-level pass is the final gate that catches design drift introduced AFTER development signed off (operations CSS overrides, a feature flag that swaps in a different theme, an integration that strips an aria-attribute the design relied on).
  • No regressions in adjacent flows. Walk one or two flows the intent did NOT explicitly target. Catches the case where the intent's changes broke unrelated working behavior — a class of bug that per-stage checks cannot see because each stage only audits its own scope.
  • Close the session. Call haiku_view_close({ session_id }) after all checks complete.
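
For the exact token comparison in the design-parity check, one practical wrinkle is that browsers usually report computed colors as rgb() strings while design tokens are often written as hex. The sketch below normalizes the color before comparing. The token keys and the flat key-to-value maps are hypothetical, and it assumes the values were already read off the live DOM by the driver script.

```typescript
// Convert a computed "rgb(r, g, b)" or "rgba(r, g, b, a)" string to "#rrggbb".
// Returns undefined for anything it does not recognize; alpha is ignored.
function rgbToHex(computed: string): string | undefined {
  const m = /^rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)/.exec(computed.trim());
  if (!m) return undefined;
  return "#" + [m[1], m[2], m[3]].map((n) => Number(n).toString(16).padStart(2, "0")).join("");
}

// Compare declared tokens with computed values: exact match only, no near-misses.
function tokenMismatches(tokens: Record<string, string>, computed: Record<string, string>): string[] {
  const mismatches: string[] = [];
  for (const key of Object.keys(tokens)) {
    const actual = computed[key];
    const normalized = actual !== undefined && actual.startsWith("rgb") ? rgbToHex(actual) : actual;
    if (normalized === undefined || normalized.toLowerCase() !== tokens[key].toLowerCase()) {
      mismatches.push(`${key}: expected ${tokens[key]}, got ${actual ?? "nothing"}`);
    }
  }
  return mismatches;
}
```

Each mismatch would be a candidate finding, filed with the matched screenshot pair attached.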

Common failure modes to look for

  • Every stage gate green, every per-stage runtime-verifier green, but the live app's user-facing flow doesn't actually do the thing the intent promised — typically because the slices each stage delivered work in isolation but no stage owned the integration
  • A component the design stage produced that the development stage never rendered into the page (this exact pattern surfaced 2026-05-15 in out-of-band-human-file-modifications — three SPA components shipped as .tsx files but no other file rendered them)
  • A new API endpoint development built that operations didn't include in the deployment manifest, so the live app hits 404 on the path
  • A feature flag the intent introduced that ships in the off state by default — the user-facing change is invisible in production even though every test passes
  • A regression in an adjacent flow that no stage's per-unit tests covered because the unit only asserted on its own scope — only the integration check sees the side effect
  • The app builds and serves but the primary navigation entry point to the new feature isn't wired (link not added to the menu, route not registered, button hidden behind a permission the test user doesn't have)

Intent fix loop

dispatched against intent-scope findings
Builder

Focus: Land the actual code, test, or artifact change that resolves the intent-scope feedback finding. You are the implementer — the "do" role at the head of the studio fix-hat sequence. The reconciler aligns cross-stage consistency after you, and the validator verifies. Nothing closes the finding unless you change real files on disk: a finding about failing quality gates is resolved by making the commands pass, not by describing why they fail.

The finding spans the whole intent, not one stage's unit, so you may touch any stage's outputs to fix it. Read the finding, reproduce it (run the failing command, open the broken artifact), make the minimum change that resolves exactly what's named, then re-run to confirm green before you advance the hat. If the finding names failing commands, every one of them must pass when you hand off.

Anti-patterns (RFC 2119):

  • The agent MUST NOT advance without editing files — a plan, a diagnosis, or a description of the fix is not the fix
  • The agent MUST NOT add scope beyond the named finding — no new features, no opportunistic refactors, no re-architecting
  • The agent MUST NOT touch artifacts unrelated to the finding
  • The agent MUST NOT advance while any command the finding names still fails — re-run them and confirm green first

Reflection

synthesized once the intent completes
Dimension: Architecture

Analyze: Technical debt introduced vs resolved, module boundary violations, dependency direction changes.

Look for:

  • New abstractions introduced: were they justified or premature?
  • Shared code changes that affected multiple consumers
  • Circular dependency introductions
  • Patterns that diverge from existing codebase conventions

Produce:

  • Architectural impact assessment
  • Technical debt delta (net increase or decrease)
  • Recommendations for structural improvements in follow-up intents
Dimension: Process

Analyze: Hat effectiveness, stage transition friction, tool failure patterns from session transcripts.

Look for:

  • Planner plans that the builder immediately abandoned (wasted work)
  • Stage transitions that required stage-back refinements (upstream gaps)
  • Tool failures that caused repeated retries
  • Context loss across sessions (same decisions remade, same questions asked)

Produce:

  • Hat instruction improvement recommendations
  • Stage input/output completeness assessment
  • Settings and CLAUDE.md update recommendations
Dimension: Quality

Analyze: Review agent findings, quality gate pass/fail rates, test coverage changes, and reviewer hat rejection patterns.

Look for:

  • Review agent categories with the most HIGH findings (security, correctness, etc.)
  • Quality gates that always pass (potentially useless) or always fail (potentially misconfigured)
  • Test coverage trends across units
  • Reviewer rejections that led to productive fixes vs circular rework

Produce:

  • Quality gate effectiveness assessment
  • Review agent value ranking (which agents caught real issues vs noise)
  • Recommendations for gate/agent configuration changes
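
The "always pass" and "always fail" signal could be computed from a log of gate runs, as in this sketch. The GateRun record is an assumed shape for illustration; the source does not specify how gate results are stored.

```typescript
// Hypothetical record of one quality-gate execution.
type GateRun = { gate: string; passed: boolean };
type GateVerdict = "always-pass" | "always-fail" | "mixed";

function classifyGates(runs: GateRun[]): Record<string, GateVerdict> {
  // Tally passes and failures per gate.
  const tally: Record<string, { pass: number; fail: number }> = {};
  for (const r of runs) {
    const t = tally[r.gate] ?? (tally[r.gate] = { pass: 0, fail: 0 });
    if (r.passed) t.pass += 1;
    else t.fail += 1;
  }
  // A gate that never failed may be useless; one that never passed may be misconfigured.
  const verdicts: Record<string, GateVerdict> = {};
  for (const gate of Object.keys(tally)) {
    const { pass, fail } = tally[gate];
    verdicts[gate] = fail === 0 ? "always-pass" : pass === 0 ? "always-fail" : "mixed";
  }
  return verdicts;
}
```

A verdict is only a prompt for a closer look: a gate with few recorded runs can land in either extreme by chance.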
Dimension: Velocity

Analyze: Bolt counts per unit, blocker frequency, retry patterns, and session count.

Look for:

  • Units that took disproportionately many bolts compared to their estimated complexity
  • Systemic blockers vs one-off issues
  • Whether elaboration granularity matched actual implementation complexity
  • Sessions that ended due to context exhaustion vs natural completion

Produce:

  • Velocity assessment: which units were smooth, which were grinding
  • Elaboration quality score: were units right-sized?
  • Recommendations for future elaboration (too coarse, too fine, or just right)