Don't Review Broken Code
By Jason Waldrip
Running an adversarial security reviewer on code that fails tsc is theater. You're spending expensive model time and an elaborate multi-agent review cycle on something a 50-millisecond command could have told you was broken before any of it started. The security reviewer is going to find issues, sure — but the first and biggest issue is that the code doesn't compile, and you didn't need an adversarial agent to figure that out.
H·AI·K·U used to do that. It doesn't anymore.
The new order
The review phase now runs in two stages, hard checks before soft ones.
Two kinds of check, run in order
Quality gates: tests, lint, typecheck. Cheap. Fast. Machine-verifiable. Hard pass or hard fail. If anything's red, the agent bounces straight back to fix it, and no reviewer ever spawns.
Review agents: security, design, architecture. Expensive. Slow. Taste-driven. They only run once every gate is green. The reviewers spend their attention on judgement calls, not syntax errors.
Quality gates run first. If any of them fail, the orchestrator returns fix_quality_gates and the agent gets bounced back to fix them before a single adversarial reviewer is even spawned. Only when everything's green does the orchestrator advance to the review agents, where the expensive taste-driven work happens.
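The ordering can be sketched in a few lines of TypeScript. This is a minimal illustration, not H·AI·K·U's actual orchestrator: fix_quality_gates is the action named above, while the Gate shape, the GateRunner callback, the nextAction function, and the spawn_reviewers action are hypothetical names chosen for the example. It also assumes every gate runs before the failures are reported; stopping at the first red gate would be an equally reasonable choice.

```typescript
// Hypothetical shapes for illustration; only "fix_quality_gates" comes from the post.
type Gate = { name: string; command: string; dir?: string };
type GateRunner = (gate: Gate) => boolean; // true means the command exited 0
type Action =
  | { kind: "fix_quality_gates"; failed: string[] }
  | { kind: "spawn_reviewers" };

// Stage 1: run every hard check. Stage 2 is reached only if all of them pass.
function nextAction(gates: Gate[], run: GateRunner): Action {
  const failed = gates.filter((gate) => !run(gate)).map((gate) => gate.name);
  if (failed.length > 0) {
    return { kind: "fix_quality_gates", failed };
  }
  return { kind: "spawn_reviewers" };
}

// Usage with a stubbed runner: the typecheck gate fails, so no reviewer is spawned.
const exampleGates: Gate[] = [
  { name: "typecheck", command: "tsc --noEmit" },
  { name: "tests", command: "bun test" },
];
const exampleAction = nextAction(exampleGates, (gate) => gate.name !== "typecheck");
console.log(exampleAction.kind); // prints "fix_quality_gates"
```

Passing the runner in as a callback keeps the decision logic separate from process execution, so the ordering rule can be tested without tsc or bun installed.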
How it's configured
Quality gates live in the intent's frontmatter as a list of name / command / dir tuples:
quality_gates:
  - name: typecheck
    command: tsc --noEmit
    dir: packages/haiku
  - name: tests
    command: bun test
    dir: packages/haiku
  - name: lint
    command: biome check .
The dir field exists because monorepos exist. Commands run from the repo root by default, and dir scopes a gate to the subdirectory where it belongs. In the example above, typecheck and tests run in packages/haiku, while lint, which has no dir, runs at the root. One intent can touch multiple workspaces and each gate runs in the right place.
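As a rough sketch of how a gate's directory could be resolved and its command run, assuming Node's built-in child_process and path modules: the function names resolveGateDir and runGate are hypothetical, and treating a missing dir as the repo root is an assumption inferred from the lint gate in the example.

```typescript
import { spawnSync } from "node:child_process";
import * as path from "node:path";

type Gate = { name: string; command: string; dir?: string };

// Resolve the gate's working directory against the repo root.
// Assumption: a gate without `dir` runs at the root itself.
function resolveGateDir(repoRoot: string, gate: Gate): string {
  return path.resolve(repoRoot, gate.dir ?? ".");
}

// Run one gate through the shell in its own directory; exit code 0 is a pass.
function runGate(repoRoot: string, gate: Gate): { passed: boolean; output: string } {
  const result = spawnSync(gate.command, {
    cwd: resolveGateDir(repoRoot, gate),
    shell: true,
    encoding: "utf8",
  });
  return {
    passed: result.status === 0,
    output: `${result.stdout ?? ""}${result.stderr ?? ""}`,
  };
}
```

Calling runGate with the repo root and the typecheck gate from the example would run tsc --noEmit inside packages/haiku. The only thing the caller looks at is the exit status, which is why any command with an honest exit code fits the same shape.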
The elaborator now detects your tooling
You don't write this by hand. When an intent is elaborated, the agent inspects the project and populates quality_gates in intent.md automatically based on what it finds.
Node / TS project
package.json with a test script and a typecheck dep gets npm test and tsc --noEmit wired up by default.
Rust project
Cargo.toml gets cargo test and cargo clippy -- -D warnings. Tests and lint in one shot.
Python project
pyproject.toml with pytest gets pytest. Add ruff or mypy and they get picked up too.
Anything else
A custom command goes in the same name / command / dir shape. The orchestrator doesn't care what runs underneath; it cares that the exit code is honest.
If the elaborator gets it wrong, you fix it once and it's in the frontmatter forever. If it gets it right, you never thought about it.
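The detection rules above could look something like the following. This is a hedged sketch rather than the elaborator's real logic: detectGates is a hypothetical function, "a typecheck dep" is interpreted here as a typescript entry in dependencies or devDependencies, the ruff and mypy commands are assumed, and substring checks stand in for real TOML parsing.

```typescript
type Gate = { name: string; command: string; dir?: string };

// `files` maps file names found at the project root to their contents.
function detectGates(files: Record<string, string>): Gate[] {
  const detected: Gate[] = [];

  const packageJson = files["package.json"];
  if (packageJson !== undefined) {
    const pkg = JSON.parse(packageJson) as {
      scripts?: Record<string, string>;
      dependencies?: Record<string, string>;
      devDependencies?: Record<string, string>;
    };
    const deps = { ...pkg.dependencies, ...pkg.devDependencies };
    if (pkg.scripts?.test) detected.push({ name: "tests", command: "npm test" });
    // Assumption: the "typecheck dep" is the typescript package.
    if ("typescript" in deps) detected.push({ name: "typecheck", command: "tsc --noEmit" });
  }

  if (files["Cargo.toml"] !== undefined) {
    detected.push({ name: "tests", command: "cargo test" });
    detected.push({ name: "lint", command: "cargo clippy -- -D warnings" });
  }

  const pyproject = files["pyproject.toml"];
  if (pyproject !== undefined) {
    // Crude substring checks; a real detector would parse the TOML.
    if (pyproject.includes("pytest")) detected.push({ name: "tests", command: "pytest" });
    if (pyproject.includes("ruff")) detected.push({ name: "lint", command: "ruff check ." });
    if (pyproject.includes("mypy")) detected.push({ name: "typecheck", command: "mypy ." });
  }

  return detected;
}
```

Because the result is the same name / command / dir shape that lives in the frontmatter, a wrong guess is corrected by editing the list, not the detector.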
What happens when a gate fails
The agent doesn't get back "something broke, figure it out." It gets the gate name, the command that failed, and the first few lines of the failure output — enough to land on the specific error without re-running anything. The agent bounces back to execute, fixes the issue, ticks again. The orchestrator re-runs the gates — no human intervention, no review agents, no context switch — and if they pass, advancement resumes.
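A failure report of that shape might be built like this. The GateFailure type and summarizeFailure function are hypothetical names, the five-line cutoff is an assumed value for "the first few lines", and the sample tsc output is made up for illustration.

```typescript
// Hypothetical payload: gate name, failing command, and a short excerpt of its output.
type GateFailure = { gate: string; command: string; excerpt: string };

// Keep only the first few non-blank lines so the agent lands on the error quickly.
function summarizeFailure(
  gate: string,
  command: string,
  output: string,
  maxLines = 5, // assumed cutoff; the post only says "the first few lines"
): GateFailure {
  const lines = output.split(/\r?\n/).filter((line) => line.trim() !== "");
  return { gate, command, excerpt: lines.slice(0, maxLines).join("\n") };
}

// Usage with made-up compiler output.
const sampleOutput = [
  "src/index.ts(3,7): error TS2322: Type 'string' is not assignable to type 'number'.",
  "",
  "Found 1 error.",
].join("\n");
const sampleFailure = summarizeFailure("typecheck", "tsc --noEmit", sampleOutput);
console.log(`${sampleFailure.gate} failed (${sampleFailure.command}):\n${sampleFailure.excerpt}`);
```

Truncating the output keeps the payload small while still pointing at the specific file and line, which is what lets the agent fix the issue without re-running the gate itself.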
- Cheap failures stay cheap. A typecheck error should cost a typecheck run, not a full review cycle.
- Expensive reviews stay expensive for a reason. Taste-driven review is expensive because it's thinking about tradeoffs, not because it's rediscovering syntax errors.
- The feedback loop tightens. The agent learns about a broken type in seconds, not minutes. Time-to-fix drops because detection moved upstream.
- Review signal improves. When an adversarial reviewer finally does run, it's running on code that at least compiles. The issues it finds are real issues, not scaffolding noise.
The harness rule
This pattern — cheap checks before expensive ones — is just the orchestrator doing what orchestrators are for. Anthropic's harness design post makes the same point from a different angle: every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress-testing. The assumption here is that the agent could run tests itself, but won't reliably, and when it does, it won't reliably read the failures before moving on. So the harness runs them. The harness reads them. The harness decides.
There's a cost savings here. There's also a signal savings. But the real win is that review stops being a lossy compression of compilation errors and lint warnings. Review gets reserved for the things that actually require judgment. Taste is expensive. Spend it on code that works.