Two Harnesses, One Shape
By Jason Waldrip
Anthropic published a post on harness design for long-running agent apps. I read it on a train, and the eerie thing was how much of the architecture we'd already converged on independently, in a different vocabulary.
Different teams, same problem, similar answers. That's the useful read. The places where we don't converge are the interesting part.
Where we landed in the same place
Seven patterns from their post map cleanly onto something we already have.
Match: Multi-agent specialization
Distinct personas — planner, generator, evaluator — each addressing a specific failure mode. The separation enables better error detection.
Hats in a plan → do → verify sequence (plugin/studios/ARCHITECTURE.md §3). Every stage ends in a verifier; adversarial review agents fire externally. Same shape, different vocabulary.
Match: State handoff through files
"One agent would write a file, another agent would read it and respond either within that file or with a new file." Decouples agents, enables clean session boundaries.
Our whole substrate is files — unit specs, feedback, the intent record, the stage outputs. Their state lives on disk too. We dropped a redundant index file along the way and let the artifacts themselves be the signal.
Match: Context reset over compaction
"Context resets — clearing the context window entirely and starting a fresh agent, combined with a structured handoff — address both compaction's drift and context anxiety."
Every hat fires as a fresh subagent invocation with a structured mandate. The orchestrator is the persistent thread; each hat starts cold with only the artifacts it needs.
Match: External evaluation, not self-eval
"When asked to evaluate work they've produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre."
Adversarial review agents, spec-conformance gate, intent-completion review agents. The feedback-assessor is a terminal hat that independently decides closure. Verifier ≠ author by construction.
Match: Contract negotiation before implementation
"The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing."
Our pre-execute dispatch_review walk. Engine roles (spec, continuity, cross-stage-consistency) audit the planned spec before any wave-ready hat dispatch.
Match: Active runtime validation
Evaluator drives the live app through Playwright MCP — navigates, screenshots, asserts. Catches runtime issues invisible in static review.
Runtime verifiers boot the project's real app, drive it in a headless browser, and save screenshots as proof. Static artifacts get specialized viewers — schematics, gerbers, 3D models, circuits — so the verifier can see what the human reviewer would see. The mandates live with each studio.
Match: Log-driven prompt tuning
"Read logs, identify divergence from expectation, adjust the agent's prompt, run again." Sustained signal turns into structured revisions.
Default-on reflection loop. At each stage close the agent writes a short observations note — what was ambiguous, what surprised it, what the feedback stream didn't already show. At intent close it synthesizes a reflection across every stage, lands the project-fixable findings as local overlays the engine picks up next run, and surfaces engine-class findings through a structured report channel. Everything commits with a marked prefix; PR review is the human gate.
Seven for seven on the load-bearing patterns. Different vocabularies, same shapes. That's the validation.
Where we go different directions
Diverge: How much to decompose
Don't over-specify at planning, because lock-in errors cascade. Their evaluator stays "focused on product context and high-level technical design rather than detailed technical implementation."
Our elaborate phase fans out subagents per discovery template and produces unit specs with depends_on DAGs, executable quality_gates, and criteria with verify commands. Different bet: the spec is the contract the verifier grades against.
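To make the depends_on idea concrete, here is a minimal Python sketch of how a dependency DAG can be grouped into waves, where every unit in a wave has all of its dependencies in earlier waves. The `waves` helper and the unit names are hypothetical illustrations, not the engine's actual code; only the `depends_on` field name comes from the description above.

```python
from graphlib import TopologicalSorter

def waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group units into waves that could be dispatched together.

    `depends_on` maps each unit id to the unit ids it depends on.
    A cycle raises graphlib.CycleError when the sorter is prepared.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    result = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies marked done.
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result

# Hypothetical unit specs, reduced to their dependency edges.
units = {
    "schema": [],
    "api": ["schema"],
    "docs": ["schema"],
    "ui": ["api"],
}
print(waves(units))
```

With these example edges, `schema` lands in the first wave, `api` and `docs` in the second, and `ui` in the third. The more edges the spec declares up front, the more ordering the plan commits to, which is the trade-off this section describes.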
Diverge: Removing scaffolding as models improve
Opus 4.5 → 4.6 lesson: "Every component encodes an assumption about what the model can't do on its own. Stress-test those assumptions, because they may be incorrect and they can quickly go stale as models improve."
The lesson holds, but the place to apply it isn't the hat list — each layer is structurally distinct work. Scaffolding accumulates inside hats: verbose anti-pattern lists, defensive validators, overlapping review-agent lenses, fanned-out discovery templates. The reflection loop is wired to surface exactly this signal at intent close, but it has not yet fired on a real intent. A manual audit pass trimmed seven of the longest mandates with all 1,673 tests still green. The pipe exists; nobody's pulled water through it yet.
What we have that their post doesn't address
Ours: One instruction at a time
Their post is candid: "But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time." They name two failure modes — context-window fill plus context anxiety, and self-evaluation bias — and patch each. Context resets fix the first. Separating the evaluator from the generator fixes the second. Both patches add "orchestration complexity, token overhead, and latency," and both leave the underlying shape intact: a single agent running an open-ended loop, deciding its own next move, sometimes for a very long time.
We never gave the agent that loop. haiku_run_next reads disk, walks the cursor, and returns exactly one structured action. The agent's job is to execute that one action and call haiku_run_next again. There is no "what should I do next?" question for the agent to answer; the cursor answers it from on-disk state every time.
This collapses both of their persistent failure modes structurally rather than as patches. Context fill can't compound because the agent doing the work has a one-turn horizon — there is no lengthy task from the agent's perspective, only from the cursor's. Self-evaluation drift can't happen because the verifier is a different hat in a different subagent. Their patches address the symptoms. Take away the open-ended loop and the symptoms don't arise. Long version here.
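A minimal Python sketch of the one-action-per-call idea follows. It is an illustration under stated assumptions, not the real haiku_run_next: the `run_next` name, the directory layout, the `.done` marker files, and the action dictionary shape are all invented for this example.

```python
from pathlib import Path

# Hypothetical hat sequence; assumed here to be the same for every unit.
HATS = ("plan", "do", "verify")

def run_next(intent_dir: Path) -> dict:
    """Derive exactly one next action from on-disk state.

    Assumed layout (illustrative only): one directory per unit, holding an
    empty marker file such as `plan.done` for each completed hat.
    """
    for unit_dir in sorted(p for p in intent_dir.iterdir() if p.is_dir()):
        finished = {marker.stem for marker in unit_dir.glob("*.done")}
        for hat in HATS:
            if hat not in finished:
                # The first unfinished hat of the first unfinished unit wins.
                return {"action": "dispatch_hat", "unit": unit_dir.name, "hat": hat}
    return {"action": "intent_complete"}
```

The function keeps no memory between calls. Each answer is recomputed from disk, so the caller never has to carry a plan in its context; it executes the returned action and asks again.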
Ours: Drift gate / premise-witness model
A runtime check that the agent's premise still matches reality before each tick. They don't discuss this at all. Long-running agents drift; ours notice.
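One way to picture a premise check is content hashing: record a fingerprint of the files a step depends on, then compare before the next tick. The sketch below is an assumption about how such a gate could work, not a description of ours; `fingerprint`, `drifted`, and the "absent" sentinel are hypothetical names.

```python
import hashlib
from pathlib import Path

def fingerprint(paths: list[Path]) -> dict[str, str]:
    """Record a SHA-256 digest per file; a missing file is recorded as 'absent'."""
    witness = {}
    for path in paths:
        if path.exists():
            witness[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            witness[str(path)] = "absent"
    return witness

def drifted(witness: dict[str, str]) -> list[str]:
    """Return the paths whose current content no longer matches the witness."""
    current = fingerprint([Path(p) for p in witness])
    return sorted(p for p, digest in witness.items() if current[p] != digest)
```

A gate built this way would call `drifted` before each tick and refuse to proceed, or trigger reconciliation, whenever the returned list is non-empty.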
Ours: Forward-only lifecycle
Completed units are never re-edited. Corrective work goes through new units authored by the upstream elaborate phase: the feedback-as-unit (FB-as-unit) fix loop. They describe agents revising files; we treat units as immutable once verified, with feedback as the only way back. Different stance, deliberately.
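The forward-only stance can be sketched with immutable records: feedback never mutates a verified unit, it appends a new one that points back at it. The `Unit` fields, the `apply_feedback` helper, and the id scheme below are hypothetical and exist only to illustrate the shape.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Unit:
    """A unit record; frozen, so assigning to a field raises an error."""
    id: str
    spec: str
    verified: bool = False
    fixes: Optional[str] = None  # id of the unit this one corrects, if any

def apply_feedback(units: tuple[Unit, ...], target_id: str, note: str) -> tuple[Unit, ...]:
    """Append a corrective unit instead of editing the target in place."""
    if not any(u.id == target_id for u in units):
        raise KeyError(target_id)
    new_unit = Unit(
        id=f"unit-{len(units) + 1:02d}",  # hypothetical id scheme
        spec=f"Address feedback on {target_id}: {note}",
        fixes=target_id,
    )
    # Existing units are carried over untouched; history only grows.
    return units + (new_unit,)
```

Because the original unit object is returned unchanged, what the verifier signed off on stays exactly as it was, and the correction is a separately verifiable piece of work.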
Ours: MCP-as-contract boundary
Workflow-managed files are guarded by a PreToolUse hook — agents go through MCP tools or get rejected with a redirect message. Their agents communicate through files but they don't describe a hard enforcement layer around who can edit what. We do, and the engine relies on it for invariants like "stage branch is always ahead of main."
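The decision such a guard has to make can be sketched as a pure function: given a tool name and a target path, allow the call or reject it with a redirect message. Everything specific here is an assumption for illustration: the guarded directory, the set of write-tool names, and the message text. Wiring the function into an actual PreToolUse hook is left out.

```python
import posixpath

# Hypothetical values; the real guarded paths and tool names may differ.
GUARDED_DIRS = (".haiku/intents",)
WRITE_TOOLS = {"Write", "Edit"}

def is_guarded(file_path: str) -> bool:
    """True if the normalized path sits inside a workflow-managed directory."""
    norm = posixpath.normpath(file_path)
    return any(norm == d or norm.startswith(d + "/") for d in GUARDED_DIRS)

def check_tool_call(tool_name: str, file_path: str) -> dict:
    """Allow reads and unguarded writes; reject direct writes to guarded files."""
    if tool_name in WRITE_TOOLS and is_guarded(file_path):
        return {
            "allow": False,
            "message": f"{file_path} is workflow-managed; use the MCP tools instead.",
        }
    return {"allow": True, "message": ""}
```

Normalizing the path first means a spelling like `./.haiku/intents/...` is treated the same as the plain form, so the check cannot be sidestepped by a cosmetic prefix.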
Ours: Cascade resolution for mandates
Hats and review agents resolve through a three-tier cascade: stage → studio → global, first hit wins, with project .haiku/ overriding plugin at each tier. Lets us share defaults and override locally without forking. Their setup looks flat per-task.
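The lookup order described above can be sketched in a few lines of Python. The directory layout, the `.md` extension, and the `resolve_mandate` name are assumptions made up for this example; only the ordering (stage, then studio, then global, with the project copy checked before the plugin copy at each tier) comes from the text.

```python
from pathlib import Path
from typing import Optional

def resolve_mandate(name: str, studio: str, stage: str,
                    project_root: Path, plugin_root: Path) -> Optional[Path]:
    """Return the first matching mandate file, or None if no tier has one."""
    # Most specific tier first. The layout is hypothetical.
    tiers = [
        Path("studios") / studio / "stages" / stage / "hats",  # stage
        Path("studios") / studio / "hats",                     # studio
        Path("hats"),                                          # global
    ]
    for tier in tiers:
        # Within a tier, the project's copy overrides the plugin's default.
        for root in (project_root, plugin_root):
            candidate = root / tier / f"{name}.md"
            if candidate.is_file():
                return candidate
    return None
```

Note that the tier is the outer loop: a plugin default at the stage tier still beats a project override at the studio tier, because the more specific tier is consulted first.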
Ours: Surface for non-developers and backpressure both ways
Every intent boots a local SPA. Review sessions render diffs, specs, annotations, and gate verbs in a browser tab. The browse view shows every intent in the repo with its current stage, gate, and open feedback. Non-engineers — design, PM, QA — interact with the work in the medium it lives in, and their feedback is a first-class input to the next tick. Drift flows the other direction: out-of-band file edits are detected pre-tick and reconciled before the agent moves. Anthropic's post is about the agent surface; it doesn't address the multi-skilled-team surface at all.
The convergence is reassuring: two teams chasing the same problem landed on the same shape across seven patterns. The interesting part isn't where we converged; it's where we made a different bet at the foundation. Anthropic's persistent failure modes (context drift on lengthy tasks, self-evaluation bias) are problems for any harness whose agent runs an open-ended loop. Patches help. But "the agent decides what to do next" is the seed those failure modes grow from, and we never planted it.
The cursor decides; the agent executes one step. The drift gate, the forward-only lifecycle, the MCP-contract boundary, the cascade resolution — they each follow from that one choice.