The Harness Series: What We Read, What We Took, What We're Trying to Do
By Jason Waldrip
The last month was a lot of reading. Anthropic shipped their harness-design post. Lex Christopherson's GSD-2 crossed 59K stars. Garry Tan's GSTACK crossed 71K. Jesse Vincent's Superpowers kept being passed around. Brian Suh wrote a piece on agents needing control flow. The Pulumi blog compared three of the four head-to-head.
Reading all four in sequence does something useful to your head. You stop thinking about each project as "the right answer" and start thinking about them as bets — different teams looking at the same failure modes and choosing different layers to put the constraint.
We wrote four side-by-side comparisons, one of H·AI·K·U against each.
Two Harnesses, One Shape
Where Anthropic's harness post and ours land on the same answers, and where the bets diverge. Read it.
GSD and H·AI·K·U
A spec-driven dispatch framework. The atomic-plan-per-fresh-window pattern is the cleanest context-rot mitigation we've read. Read it.
GSTACK and H·AI·K·U
A 23-role startup library. The strongest articulation of "roles as legibly distinct lenses" we've seen at the prompt layer. Read it.
Superpowers, Side by Side
A skills library with a real surface. Three things we adopted from it sit in our engine today. Read it.
This post is the anchor — where we ended up on the methodology, what we took from whom, and what we're trying to do.
The methodology, named
H·AI·K·U is a workflow engine that hands an agent one action at a time from a plan kept on disk. The agent is stateless between calls. The cursor is the persistent thread. The artifacts on disk are the memory. Every move the agent makes is a single tool call returning one structured action, which the agent then executes — and that's it for that turn. Next tick, fresh subagent, next action.
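To make the cursor idea concrete, here is a minimal sketch of picking the next action from persisted state alone. It is not the engine's code: the `Action` type, the `HATS` order, and the `hats_done` field are all hypothetical stand-ins for whatever the plan on disk actually records.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical hat order for a stage; the post only says every stage has at
# least three hats and that the last one is a verifier.
HATS = ["planner", "builder", "verifier"]


@dataclass(frozen=True)
class Action:
    unit: str
    hat: str


def next_action(units: dict[str, dict]) -> Optional[Action]:
    """Pick the next action from persisted state alone, with no model call.

    `units` maps a unit name to {"hats_done": [...]}, an assumed stand-in
    for the state that would be loaded from disk.
    """
    for name in sorted(units):  # stable order keeps the choice deterministic
        done = units[name].get("hats_done", [])
        for hat in HATS:
            if hat not in done:
                return Action(unit=name, hat=hat)
    return None  # every unit has passed its verifier


state = {
    "unit-01": {"hats_done": ["planner", "builder", "verifier"]},
    "unit-02": {"hats_done": ["planner"]},
}
print(next_action(state))  # Action(unit='unit-02', hat='builder')
```

The same state always yields the same action, which is the property the design depends on: the agent can be thrown away between ticks because nothing it remembers affects what comes next.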
The load-bearing claims:
- The agent shouldn't decide what's next. The cursor decides, deterministically, from on-disk state. There's no language model in the orchestrator's decision path. (One Instruction at a Time.)
- State lives on disk, not in context. Unit specs, feedback, intent frontmatter, stage outputs: everything material is artifact-shaped. Compaction, /clear, crashes, machine swaps don't lose the plan. (The Scavenger Hunt.)
- Plan → do → verify is the minimum hat sequence. Every stage has at least three hats; the last is a verifier; the verifier is a separate subagent from the doer. The doer never grades its own output.
- Specs are testable before any code lands. Acceptance criteria pair with executable quality_gates:. Non-zero exit blocks the advance. The toolchain settles what the toolchain can settle; judgment is reserved for the parts that actually need judgment.
- The lifecycle is forward-only. Completed units don't get re-edited. Corrective work flows through new units authored from upstream elaborate. Feedback is the only way back.
- Workflow shape is data, not prompts. Studios, stages, hats, review agents, gates, output templates — all markdown files the engine resolves through a three-tier cascade (stage → studio → global, project overlay wins at each tier). New domains ship by adding a studio; new lenses ship by adding a review-agent file.
- Drift gets noticed. A pre-tick check verifies the agent's premise still matches reality (file hashes against the agent's recorded premise). Long-running agents drift; ours notice before the next tick fires.
- The engine, not the agent, holds the contracts. A guardrail catches direct edits to workflow-managed files and tells the agent to use the right tool instead. Invariants like "stage branch is always ahead of main" aren't on the agent's checklist; they're hooked.
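The quality-gate claim in the list above ("non-zero exit blocks the advance") can be sketched in a few lines. This is an illustration under assumptions: `gates_pass` is a hypothetical helper, and the gate commands here are stand-ins that only exit with a chosen status code, not real test or lint commands.

```python
import subprocess
import sys


def gates_pass(commands: list[list[str]]) -> bool:
    """Run each gate command in order; any non-zero exit blocks the advance."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            return False
    return True


# Stand-in gates that just exit with a chosen status code (assumption: a real
# gate would be a test runner, linter, or type checker).
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]

print(gates_pass([ok, ok]))   # True
print(gates_pass([ok, bad]))  # False
```

Nothing in this check asks a model for an opinion. The exit code is the verdict, which is what "the toolchain settles what the toolchain can settle" amounts to in practice.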
That's the engine. The shape of work it runs is up to the studio.
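One way to read the three-tier cascade described in the list above is as an ordered lookup: the most specific tier first, with the project overlay checked before the shipped default at each tier. The sketch below assumes that reading. The `project/` and `builtin/` path prefixes and the `resolve` function are hypothetical, not the engine's actual layout.

```python
from typing import Optional

TIERS = ("stage", "studio", "global")  # most specific tier first
SOURCES = ("project", "builtin")       # project overlay beats the shipped default


def resolve(name: str, available: set[str]) -> Optional[str]:
    """Return the first matching path, walking tiers from specific to general."""
    for tier in TIERS:
        for source in SOURCES:
            candidate = f"{source}/{tier}/{name}"
            if candidate in available:
                return candidate
    return None


files = {
    "builtin/stage/review-agent.md",
    "builtin/global/review-agent.md",
    "project/studio/review-agent.md",
}
print(resolve("review-agent.md", files))  # builtin/stage/review-agent.md
```

Under this reading, a project overlay only wins against the default at the same tier. A stage-level default still beats a studio-level overlay, as the example shows.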
The surface, and why it matters
A workflow engine without a surface is a CLI for one user — the developer who installed it. That's not the work H·AI·K·U is built for.
Every intent boots a local SPA the engine drives. Review sessions render the diff, the spec, the feedback annotations, and the gate verbs (approve, leave feedback, submit files for drift, tick) in a browser tab. Questions the agent needs to ask route there. Direction requests for design route there. The browse view shows every intent in the repo with its current stage, current gate, current open feedback — a portfolio across an entire team's work, not just the one you're sitting on.
The surface is graded by mode. Same intent, same studio, same engine. The dial is per-stage, set at elaborate time.
In the loop
Every gate stops. You weigh in. The agent waits for your call before it advances.
On the loop
Work runs to a PR/MR boundary and stops there for external review. The merge is the approval.
Off the loop
Gates auto-advance until something blocks. You see the delivery PR, not the steps.
Mix and match
Interactive discovery, auto-advancing build, external-sign-off security gate — all in the same intent. You set it at elaborate time.
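The three modes can be pictured as a per-stage setting that changes only what a gate does. A minimal sketch, assuming a hypothetical `Mode` enum and `gate_decision` function; the stage names are illustrative and the real engine's configuration format is not shown in this post.

```python
from enum import Enum


class Mode(Enum):
    IN_THE_LOOP = "in"    # every gate stops and waits for a person
    ON_THE_LOOP = "on"    # runs to the PR/MR boundary; the merge is the approval
    OFF_THE_LOOP = "off"  # gates auto-advance until something blocks


def gate_decision(mode: Mode, at_pr_boundary: bool, blocked: bool) -> str:
    """Return 'wait' or 'advance' for one gate under the given mode."""
    if blocked:
        return "wait"
    if mode is Mode.IN_THE_LOOP:
        return "wait"
    if mode is Mode.ON_THE_LOOP:
        return "wait" if at_pr_boundary else "advance"
    return "advance"


# Hypothetical per-stage settings chosen at elaborate time.
intent = {
    "discovery": Mode.IN_THE_LOOP,
    "build": Mode.OFF_THE_LOOP,
    "security": Mode.ON_THE_LOOP,
}
for stage, mode in intent.items():
    print(stage, gate_decision(mode, at_pr_boundary=True, blocked=False))
```

The engine and the studio stay the same across all three stages. Only the value on the dial differs.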
The other half is backpressure that flows the other direction. A non-developer leaving a comment on a rendered design isn't filing a ticket into a queue someone reads later — they're adding a feedback file the engine treats as a first-class input at the next tick. The fix-loop dispatches against it. The earliest unaddressed stage gets revisited. A PM dropping a paragraph in the question modal pauses the agent until the answer lands.
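The "earliest unaddressed stage gets revisited" rule above reduces to a small selection over feedback records. The sketch below is an assumption-laden illustration: the `STAGES` order, the `stage` and `addressed` fields, and `stage_to_revisit` are hypothetical names, not the engine's feedback file schema.

```python
from typing import Optional

STAGES = ["discovery", "design", "build", "security"]  # hypothetical stage order


def stage_to_revisit(feedback: list[dict]) -> Optional[str]:
    """Return the earliest stage that still has unaddressed feedback, if any."""
    open_stages = {f["stage"] for f in feedback if not f["addressed"]}
    for stage in STAGES:
        if stage in open_stages:
            return stage
    return None


feedback = [
    {"stage": "build", "addressed": False},
    {"stage": "design", "addressed": False},
    {"stage": "discovery", "addressed": True},
]
print(stage_to_revisit(feedback))  # design
```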
Drift is the same mechanism pointed at the agent. The pre-tick gate hashes every file the agent stamped as a premise and compares each hash against the value recorded at stamping time. If a file changed out-of-band (a designer edited a mockup, a teammate fixed a typo, a script touched the schema), the agent is told before the next instruction fires. The premise is reconciled or the unit reroutes through elaborate. Long-running agents lose their footing; ours have to re-find it on every step.
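A premise check of this kind can be sketched with content hashes. The functions below (`record_premise`, `drifted`) are hypothetical, and SHA-256 is an assumed choice; the post does not say which hash the engine uses or how the premise is stored.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's contents (SHA-256 is an assumption, not a stated choice)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_premise(paths: list[Path]) -> dict[str, str]:
    """Stamp the files a unit depends on with their current content hashes."""
    return {str(p): file_hash(p) for p in paths}


def drifted(premise: dict[str, str]) -> list[str]:
    """List premise files that changed or disappeared since they were stamped."""
    changed = []
    for name, recorded in premise.items():
        p = Path(name)
        if not p.exists() or file_hash(p) != recorded:
            changed.append(name)
    return changed


with tempfile.TemporaryDirectory() as d:
    mockup = Path(d) / "mockup.txt"
    mockup.write_text("v1")
    premise = record_premise([mockup])
    print(drifted(premise))                    # []
    mockup.write_text("v2")                    # simulated out-of-band edit
    print(drifted(premise) == [str(mockup)])   # True
```

Running `drifted` before every tick is cheap relative to an agent call, which is why the check can sit in front of each instruction rather than at the end of a long run.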
Superpowers has a surface. Most don't. We think ours stands out for the work it's aimed at: multi-skilled teams where the designer, the PM, the QA lead, and the engineer all have something to weigh in on, and none of them should have to learn the CLI to do it.
What we're trying to do
If we had to write the H·AI·K·U mission in one sentence, it would be the same one we wrote in April:
Dark factory is the ceiling, not the mandate. Per-stage gates let you keep a human in the loop as much as the work calls for — interactive discovery, auto-advancing build, external sign-off security gate, all in the same intent. The point isn't to remove the human; it's to put the human's effort where it pays back — in the spec, not in babysitting the build.
We're betting the structural pattern holds across 24 domains, not just software. The studios share the engine; only the domain shape differs.
- Engineering surface. Software, gamedev, hwdev, libdev — code, components, hardware specs, libraries.
- Product surface. Product-strategy, customer-success, dev-evangelism, ideation — the work of shaping what's built and how it lands.
- Operations surface. Data-pipeline, incident-response, quality-assurance, migration, vendor-management, security-assessment, project-management — the work that keeps the lights on.
- People and business surface. HR, finance, legal, compliance, sales, marketing, training, executive-strategy, documentation — the work that runs the company around the build.
So far: we think it can do it all. The shape holds. The four comparison posts are us pressure-testing that belief against four serious projects working on the same problem with different bets. None of them broke the belief; some of them showed us where we have packaging work to do.
What we owe
We didn't get here in a vacuum. Specific debts:
Anthropic's harness post. Confirmed two things we'd already built (context resets, external evaluators) and named one structural gap we addressed differently. Their post is candid that "some problems remained persistent" (context drift on long tasks, self-evaluation bias) and that their patches add complexity and latency. Recognizing those problems as downstream symptoms of the open loop the agent runs gave us the clearest articulation we'd seen of why our single-step design pays off. We didn't invent that framing in response to their post; we'd shipped the cursor model first. But their post is what let us name the difference.
Jesse Vincent's Superpowers. Three concrete things. First: loud framing is load-bearing. <EXTREMELY-IMPORTANT> and <HARD-GATE> aren't theater; they're what survives the agent rationalizing past calm guidance. We adopted that defensive envelope at the orchestrator layer so every studio inherits it. Second: skill files as orientation aids — we don't ship flowchart files but the orchestrator renders the workflow shape on demand for the same reason. Third: the six-platform install matrix is a packaging discipline we don't yet match. The MCP protocol work is done; the one-click entry points for Cursor / Codex / Copilot CLI / Gemini / OpenCode are work-in-progress.
Lex Christopherson's GSD-2. The atomic-plan-per-fresh-window pattern is the cleanest articulation we've read of context-rot mitigation. We solve the same problem by never giving the agent the loop in the first place, but reading GSD's design reinforced that the per-step boundary is where the structural answer lives. The crash-recovery + stuck-loop detection patterns are also worth studying — our on-disk state survives crashes by construction, but GSD's explicit recovery paths show what an orchestrator-agent has to do when state isn't all on disk.
Garry Tan's GSTACK. The 23-role library is the strongest articulation of "roles as legibly distinct lenses" we've seen in the prompt-layer approach. Our hats embody the same intuition; reading GSTACK confirmed that the lens metaphor scales across domains, not just engineering. The next batch of studio cleanup we have queued (cross-stage consistency, FM-interpretation refactors) directly borrows the legibility discipline from GSTACK's role definitions.
Brian Suh's piece on agents needing control flow. Already linked from One Instruction at a Time. The line "if you've ever resorted to MANDATORY or DO NOT SKIP, you've hit the ceiling of prompting" is the clearest one-sentence justification for moving from prompts-as-runtime to engine-as-runtime that we've read.
AI-DLC, our predecessor. We built it. We watched it walk over its own MUSTs in production. The decision to rewrite it as H·AI·K·U with the agent-as-tool model came directly from those failure logs. The credit there is to the seven of us who kept hitting the same context-drift wall and finally said "the agent can't hold the plan; stop making it." That was the foundational bet — everything since follows.
What we still don't know
We think H·AI·K·U can do it all. The belief survived reading the four projects above. It hasn't yet been tested by running at scale: the reflection loop is wired but hasn't actually fired on a real intent. The studio audit we just commissioned found ~94 EXECUTION.md files describing a stale pre/post review model; that's drift we have to clean up. The "trim scaffolding as models improve" thesis from Anthropic and Superpowers lands on us too; we have hat-internal anti-pattern walls that we suspect are belt-and-suspenders on current frontier models.
So: we think the shape is right. We're less sure each layer is sized correctly. The next quarter is mostly cleanup work — let the reflection loop fire on real intents, watch what it surfaces, accept the trims, push back on the additions.
If you're choosing a harness today, read the comparisons before you choose us. We're not the right tool for every shape of work.
Diverge: Pick the bet that matches your work
GSD might be the right grab. Per-window plans, spec-driven dispatch, context-rot mitigation by construction.
GSTACK might be. 23 roles, opinionated startup stack, prompt-layer discipline.
Superpowers might be. Loud framing, skills as orientation, a surface non-developers can touch.
That's the shape we're building for.
The harness conversation is bigger than any one of us. The space is better for having four well-argued options. Read the comparisons. Pick the bet that matches your work.