Back to blog

H·AI·K·U Is a Harness

Anthropic's engineering team recently published Harness Design for Long-Running Application Development, a detailed account of building multi-agent orchestration systems for complex, sustained AI work. The post introduces the term harness to describe the scaffolding that coordinates AI agents through structured phases, manages context across long-running tasks, and encodes assumptions about what models cannot do alone.

Reading it felt like finding a name for something we had already built.

H·AI·K·U is a harness. Understanding it as one clarifies what it does, why it works, and where it is going.

What Is a Harness?

The Anthropic post describes a harness as a multi-agent orchestration system that decomposes complex work into specialized roles working in concert. Rather than asking a single model to handle everything end-to-end, a harness assigns distinct responsibilities to separate agents — a planner that transforms brief prompts into comprehensive specifications, a generator that executes the work, and an evaluator that assesses quality using concrete criteria and external tools.

Four design principles emerge from their experience.

Anthropic's four principles
  1. File-based communication for inter-agent handoffs. Agents communicate asynchronously through structured files rather than direct conversation. One writes; the next reads and responds. Persistent artifacts replace synchronous coordination.
  2. Generator-evaluator feedback loops. Their core insight: "separating the agent doing the work from the agent judging it proves to be a strong lever." Self-evaluation fails because models praise their own outputs. A skeptical separate evaluator drives iterative improvement.
  3. Context resets with structured handoffs. Clear the window entirely rather than compact in place. Resets solve coherence loss and "context anxiety" — the model prematurely wrapping up as it approaches its perceived limit. The price is good handoff artifacts.
  4. The assumption principle. "Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing." When Opus 4.6 arrived, the team removed sprint-based decomposition that had been essential for 4.5. The harness got simpler because the model got better.

The practice they recommend: remove one component at a time and measure impact, rather than radical restructuring.

The Mapping

H·AI·K·U and Anthropic's harnesses occupy the same design space. The concepts map directly:

Harness Concept (Anthropic)H·AI·K·U Equivalent
Specialized agent roles (planner, generator, evaluator)Hat-based workflow (planner, builder, reviewer)
File-based inter-agent communicationIntent specs, unit specs, operational plans as file artifacts
Generator-evaluator feedback loopsOperation, review, fail/advance cycle (bolts)
External evaluation driving qualityBackpressure hooks, completion criteria, Stop gates
Context resets with structured handoffs/haiku:haiku-pickup skill that reconstructs ephemeral state from file artifacts
"Every component encodes an assumption"Each hat and phase encodes a guardrail
Harness complexity should decrease with model capabilityRegular reassessment of which scaffolding remains necessary

The vocabulary differs. The architecture rhymes.

Specialized roles

Hats are roles by another name

Both systems separate planning, generating, and evaluating into distinct modes. Neither asks one undifferentiated agent to do all three at once.

File-based comms

Specs as the handoff medium

Intent specs, unit specs, scratchpads, iteration state. Persistent, structured state that survives context boundaries — the same shape Anthropic landed on.

Generator-evaluator loops

The bolt cycle

Operate, review, fail-or-advance. A generator-evaluator loop with explicit transitions humans can watch.

Context resets

`/haiku:haiku-pickup` as handoff

Reconstructs ephemeral session state entirely from file artifacts. The reset and the handoff in one move.

The alignment is not coincidental. These are convergent solutions to the same underlying problem: models cannot sustain complex work across long contexts without external structure.

Where H·AI·K·U Diverges

The mapping is close, but three differences matter.

Collaborative, Not Autonomous

DivergeWhere the human sits in the loop

Anthropic's harnesses

Autonomous agent-to-agent loops. Generator produces, evaluator grades, the loop repeats. Humans configure the system and review the final result. Execution between those endpoints is sealed.

H·AI·K·U

Centers human judgment. The human wears hats alongside the AI, approves phase transitions, can /haiku:haiku-refine mid-construction or /haiku:fail a review to send work back. Three modes dial autonomy up or down per intent.

The harness supports full autonomy but doesn't assume it.

This is a deliberate design choice. As we argued in Dark Factories and the Loop, the dark factory is a knob you turn, not a system you build. Some work deserves full autonomy. Some work deserves close oversight. The same harness should support both without architectural changes. A security-sensitive authentication flow and a routine dependency upgrade should not require different tools — just a different setting on the same tool.

Anthropic's harnesses optimize for throughput in autonomous execution. H·AI·K·U optimizes for adaptability across the full autonomy spectrum.

SDLC-Native Structure

The harnesses in the Anthropic post are task-generic. They built a game maker, a DAW, a website builder. The harness pattern applies to any domain where work can be decomposed into plan-generate-evaluate cycles.

H·AI·K·U is domain-specific. It maps to software development lifecycle concepts: intents are epics, units are tickets, bolts are sprints. This specificity is a feature, not a limitation. It gives the harness leverage that generic orchestration cannot have.

Consider quality evaluation. Anthropic's harness relies on an LLM evaluator to assess output quality — and the post is candid about how much work this requires. Out of the box, evaluators are lenient and test superficially. Achieving rigorous QA demanded iterative prompt refinement based on divergences between model judgments and human assessment. They converted subjective criteria into concrete, gradable standards and found that even specific phrasing (like "museum quality") meaningfully steered outputs.

H·AI·K·U sidesteps much of this problem by encoding software-specific quality gates as backpressure.

This does not eliminate the need for LLM-based evaluation entirely — completion criteria still require judgment about whether the intent has been satisfied. But the deterministic backpressure layer handles the mechanical quality questions, leaving the subjective evaluation layer to focus on higher-order concerns: does this implementation actually solve the problem? Is the API intuitive? Does the architecture make sense?

Domain specificity buys you deterministic quality gates. Generic harnesses have to solve quality evaluation entirely through LLM judgment.

Plugin Infrastructure, Not Agent SDK

DivergeWhere the harness lives

Anthropic's harnesses

Built on the Claude Agent SDK. Each agent is a distinct subprocess with its own context window, coordinated externally through file-based communication. The harness is a separate orchestration layer.

H·AI·K·U

Built on Claude's plugin system — skills, hooks, CLAUDE.md rules. Behavior shaped within a single Claude session. Skills provide commands, hooks enforce backpressure at tool-call boundaries, CLAUDE.md injects context. No separate process to install.

This means the harness is the development environment. There is no separate system to install, configure, or maintain. You install a plugin and your Claude session gains structured workflows, quality gates, and context management. The harness lives where the work happens.

The tradeoff is real. The Agent SDK approach gives you more control over agent orchestration — you can run agents in parallel, manage their lifecycle explicitly, and build custom communication protocols. The plugin approach gives you tighter integration with the developer experience — the harness is invisible until you invoke it, and it enhances rather than replaces the normal Claude workflow.

Both are valid architectures. They optimize for different things.

The Stress-Testing Principle

The post's most transferable insight deserves its own section: every component in a harness encodes an assumption about what the model cannot do alone. When the assumption stops being true, the component becomes overhead.

H·AI·K·U's components encode assumptions too.

What each piece of H·AI·K·U is compensating for
  1. Separate elaboration from implementation assumes the model will conflate planning with building if given the chance. Without the enforced separation, it starts writing code in the middle of the spec.
  2. A mandatory review phase assumes the model will skip validation if not told to stop and evaluate. Without the reviewer hat, it declares itself done prematurely.
  3. Backpressure hooks assume the model will cut corners without automated gates. Without hard stops on failing tests and lint errors, it "fixes later" and never comes back.
  4. Hat transitions assume the model loses focus without explicit role switches. Without the move from builder to reviewer, it reviews its own work while still in builder mode — and finds nothing wrong.

These assumptions were correct when H·AI·K·U was designed. They remain correct today, with current models. But "today" is a moving target.

Anthropic demonstrated this concretely. Sprint-based decomposition was essential scaffolding for Opus 4.5 — without it, the model could not sustain coherent work across large tasks. Opus 4.6 handled longer sequences natively, and the sprint decomposition became unnecessary overhead that actually degraded output quality. Removing it improved results.

The same will happen to H·AI·K·U's components. Some future model may plan and build coherently without an explicit hat transition. Some future model may self-evaluate rigorously without a separate review phase. Some future model may maintain quality standards without backpressure hooks.

When that happens, the correct response is not to defend the scaffolding. It is to remove it, measure the impact, and simplify. The harness should get lighter over time. If it is not getting lighter, either the model has not improved in the relevant dimension, or the team is not stress-testing their assumptions.

The Anthropic post recommends removing one component at a time and measuring. We endorse this approach. H·AI·K·U's modular structure — hats are independent files, hooks are individually configurable, workflows are composable YAML — was designed with exactly this kind of incremental simplification in mind.

What this means

Positioning H·AI·K·U as a harness does three things.

Why the naming matters
  1. It gives people an immediate mental model. If you've read Anthropic's post, you know what category H·AI·K·U belongs to. Orchestration scaffolding for long-running AI work, coordinating specialized roles through structured phases, with quality gates that fire automatically. You don't need the full methodology to know what kind of thing it is.
  2. It clarifies that the scaffolding is intentional and should evolve. The hats, the phases, the hooks — they aren't bureaucracy. Each is load-bearing structure compensating for a specific model limit. Each should be re-evaluated as models improve. Simplification is not regression; it's the system working as intended.
  3. It connects H·AI·K·U to a broader design space. Anthropic is publishing harness design patterns. Other teams are building their own. By naming H·AI·K·U as a harness, we join a conversation about what these systems should look like, how they should evolve, and what principles govern their design. The answers matter for everyone building here.

H·AI·K·U is a collaborative, SDLC-specialized harness built on Claude's plugin primitives. Understanding it as one is the clearest way to understand what it does — and to know when pieces of it should be simplified away.