Harness Engineering
Models provide capability. Harnesses provide reliability.
Smarter models are not the bottleneck anymore. The gap between a demo that wows and a system that ships is almost always the harness — the scaffolding of context, tools, guardrails, and feedback that wraps the model. Models provide capability. Harnesses provide reliability. Treating the harness as an engineering discipline — rather than glue code around a prompt — is what separates agents that hold up in production from ones that quietly hallucinate at 3 a.m.
The four pillars wrapped around every reliable agent:
- Context: Provide the right information, state, and intent. The model can only act on what it can see.
- Tools: Give agents the capability to act in the world: typed, scoped, and auditable.
- Guardrails: Enforce safety, policies, and boundaries before, during, and after each step.
- Feedback: Observe, evaluate, and learn from every run to improve the next one.
Skip one pillar and the failure it would have caught becomes a customer-visible bug.
This post lays out what a harness is, why it matters, the eight planes that make one up, the delivery loop teams should run inside it, the anti-patterns we see most often, and the metrics that tell you whether yours is working.
1. Why harness engineering matters
A strong model with a weak harness produces a brittle agent. The same model wrapped in a disciplined harness becomes a system you can monitor, debug, and improve. The difference shows up in five places: output consistency, hallucination rate, safety, repeatability, and how fast you can learn from failures.
MODEL ALONE · Smart but unpredictable
- ✗ Inconsistent outputs: same prompt, different answers, hard to trust at scale
- ✗ Hallucinations: confident but incorrect answers that sound convincing
- ✗ Unsafe actions: may produce harmful, policy-violating, or out-of-scope behavior
- ✗ Hard to debug: issues are intermittent and lack the context to diagnose
- ✗ No memory of past failures: every run starts from zero
WITH A HARNESS · Reliable in predictable ways
- ✓ Reliable outputs: consistent, accurate answers you can depend on
- ✓ Safer behavior: built-in guardrails reduce risk and keep actions within bounds
- ✓ Repeatable workflows: structured steps and tools turn ad-hoc runs into pipelines
- ✓ Easier improvement: rich feedback and signals make issues visible and fixes measurable
- ✓ Compounding learning: every failure produces a structured artifact
Without a harness, smart models still fail in predictable ways. The harness is where determinism lives: the model is non-deterministic by design, so the harness is where you re-introduce the structure, contracts, and checkpoints that production systems require.
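To make "contracts and checkpoints" concrete, here is a minimal sketch of one such checkpoint: a wrapper that accepts a model reply only if it satisfies an output contract, and retries a bounded number of times otherwise. The `model` callable, the `REQUIRED_KEYS` contract, and both function names are hypothetical and chosen for illustration; a real harness would likely use a richer schema.

```python
import json
from typing import Callable

# Hypothetical output contract: every reply must be a JSON object with these keys.
REQUIRED_KEYS = {"action", "reason"}


def parse_reply(raw: str) -> dict:
    """Return the parsed reply, or raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def call_with_contract(
    model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until a reply satisfies the contract or attempts run out."""
    errors = []
    for _ in range(max_attempts):
        try:
            return parse_reply(model(prompt))
        except ValueError as exc:
            errors.append(str(exc))  # keep the evidence instead of discarding it
    raise RuntimeError(f"contract not met after {max_attempts} attempts: {errors}")
```

The model stays non-deterministic; what becomes predictable is the shape of what leaves the wrapper: either a dict that meets the contract or an error that lists every rejected attempt.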
2. The harness architecture
A production agent is a controlled system, not a prompt with extra steps. It has eight functional planes, and each plane has an owner, a failure mode, and a control strategy.
The harness architecture · 8 planes, each with an owner, a failure mode, and a control strategy
- Intent: Define the agent's goal and what success looks like. Bad intent quietly poisons every downstream step.
- Context: Provide the right information and state. Too little starves the model; too much drowns the signal.
- Tools: Equip the agent with actions it can take: typed, scoped, idempotent where possible.
- Execution: Orchestrate steps safely and reliably, with sandboxes, retries, timeouts, and structured outputs.
- Control: Enforce policies and guardrails in real time. Block, redirect, or escalate when bounds are crossed.
- Verification: Check outputs for quality and safety against tests, schemas, and policy before they ship.
- Observability: Instrument, log, and understand behavior through traces, evals, and human-readable run histories.
- Governance: Maintain compliance, ownership, and change management as the system evolves.
The four input planes — Intent, Context, Tools, Execution — define what the agent is trying to do and how it acts in the world. The four control planes — Control, Verification, Observability, Governance — define how the system stays in bounds, proves what it did, and adapts over time. Skip any plane and the failure mode it would have caught becomes a customer-visible bug.
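One lightweight way to hold a team to "each plane has an owner, a failure mode, and a control strategy" is to keep that mapping as data and check it for gaps. The sketch below assumes a hypothetical `PlaneRecord` type and made-up owners and strategies; only the eight plane names come from this post.

```python
from dataclasses import dataclass

# The eight plane names come from the post; everything else is illustrative.
PLANES = (
    "Intent", "Context", "Tools", "Execution",
    "Control", "Verification", "Observability", "Governance",
)


@dataclass(frozen=True)
class PlaneRecord:
    plane: str
    owner: str             # hypothetical team or person
    failure_mode: str
    control_strategy: str


def uncovered_planes(records: list[PlaneRecord]) -> list[str]:
    """Return planes that lack a record with all three fields filled in."""
    covered = {
        r.plane for r in records
        if r.owner and r.failure_mode and r.control_strategy
    }
    return [p for p in PLANES if p not in covered]


records = [
    PlaneRecord("Tools", "platform-team", "tool called outside its scope", "per-agent allowlist"),
    PlaneRecord("Verification", "", "schema drift", "contract tests"),  # no owner yet
]
print(uncovered_planes(records))
```

With these two records, the check reports every plane except Tools, because the Verification record has no owner. Running a check like this in CI is one possible way to keep the plane map from going stale.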
3. Guides and sensors: feed-forward vs feedback
Reliable agents combine two kinds of controls. Feed-forward controls prevent problems before they happen. Feedback controls detect and correct problems after the fact. Each comes in a computational flavor (deterministic, rule-checkable, machine-verified) and an inferential flavor (judgment-based, human or model-evaluated).
Guides + sensors · feed-forward and feedback controls, computational and inferential
| | Feed-forward | Feedback |
|---|---|---|
| Computational | Schemas · typed APIs · repo maps | Tests · linters · dependency rules |
| Inferential | Principles · examples · design taste | Review agents · human review · evals |
The right mix is not 50/50. Build computational controls first — they are cheap, fast, and never get tired. Reserve inferential review for the cases where rules cannot capture intent.
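As an illustration of the computational row of the table, the sketch below pairs a feed-forward guide (an argument schema checked before a tool runs) with a feedback sensor (deterministic checks on the output afterward). The `Finding` type, both function names, and the flat name-to-type schema are assumptions made for this example, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    control: str   # which control fired
    kind: str      # "feed-forward" or "feedback"
    message: str


def check_args(args: dict, schema: dict[str, type]) -> list[Finding]:
    """Feed-forward guide: reject malformed tool arguments before anything runs."""
    findings = []
    for name, expected in schema.items():
        if name not in args:
            findings.append(Finding("schema", "feed-forward", f"missing argument {name!r}"))
        elif not isinstance(args[name], expected):
            findings.append(
                Finding("schema", "feed-forward", f"{name!r} should be {expected.__name__}")
            )
    return findings


def check_result(result: str, checks: dict[str, Callable[[str], bool]]) -> list[Finding]:
    """Feedback sensor: run deterministic checks on the output after the fact."""
    return [
        Finding(name, "feedback", "check failed")
        for name, check in checks.items()
        if not check(result)
    ]
```

Both functions return findings rather than raising, so the harness can decide whether to block, retry, or escalate. Inferential controls such as a review agent could emit the same `Finding` shape, which would keep the two kinds of signal comparable.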
4. The practical delivery loop
Every agent task should run through the same six-step loop. The loop is shaped so that a failure at any step produces structured evidence the next step can act on.
The practical delivery loop · every task runs through the same six steps
- Step 1: Define goal, constraints, and definition of done in machine-readable form.
- Step 2: Identify affected files, services, and blast radius before any action is taken.
- Step 3: Produce an explicit plan the agent commits to: reviewable, diffable, revisable.
- Step 4: Execute changes in an isolated environment with full instrumentation.
- Step 5: Run tests, evals, and policy checks. On failure, return structured evidence to step 2.
- Step 6: Human or review agent confirms intent alignment before promoting to production.
The principle behind the loop: success should be quiet, failure should be verbose. A passing run produces a green checkmark and an artifact. A failing run produces a trace, a diff, a categorized error, and a candidate remediation — enough for the next iteration to make progress without re-deriving the context.
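The loop can be sketched as a small driver that runs the six steps in order and, on failure, records structured evidence and re-enters at step 2. The step names in `STEP_ORDER`, the `Evidence` and `RunState` types, and the choice to send every failure (not only a verification failure) back to step 2 are simplifying assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical names for the six steps, in the order the loop runs them.
STEP_ORDER = ["define", "scope", "plan", "execute", "verify", "review"]


@dataclass
class Evidence:
    step: str       # step that failed
    category: str   # e.g. "test_failure", "policy_violation"
    detail: str     # trace, diff, or candidate remediation


@dataclass
class RunState:
    goal: str
    evidence: list[Evidence] = field(default_factory=list)
    log: list[str] = field(default_factory=list)


# A step returns None on success (quiet) or Evidence on failure (verbose).
Step = Callable[[RunState], Optional[Evidence]]


def run_loop(state: RunState, steps: dict[str, Step], max_iterations: int = 3) -> bool:
    """Run the six steps; on failure, store evidence and re-enter at step 2."""
    start = 0
    for _ in range(max_iterations):
        failure = None
        for name in STEP_ORDER[start:]:
            state.log.append(name)
            failure = steps[name](state)
            if failure is not None:
                break
        if failure is None:
            return True
        state.evidence.append(failure)  # later steps can read this
        start = 1  # simplification: any failure re-enters at "scope"
    return False
```

Because each step receives the accumulated `state.evidence`, a second pass through scoping and planning can act on what went wrong instead of re-deriving the context. The iteration cap keeps a persistently failing task from looping forever; what happens after the cap (for example, escalation to a human) is left out of this sketch.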
5. Common anti-patterns
Most agent failures we audit are not model failures — they are harness failures. The same handful of anti-patterns show up in almost every system that is not reliable yet.
Common anti-patterns · most agent failures are really harness failures
- Giant instruction file: Everything dumped into one prompt. Fix: modular scoped instructions plus a purpose-built context layer.
- Unbounded tool access: Agent can do anything, anywhere. Fix: principle of least privilege, with typed, scoped, audited tools.
- Feedback without guides: Vague review leads to vague improvements. Fix: pair every feedback signal with a structured rubric.
- Self-review only: Models grade themselves and call it good. Fix: an independent verifier, such as tests, policies, or a second agent.
- Unversioned harness changes: Prompts, tools, and policies change as guesswork. Fix: version everything; treat the harness like code.
- No garbage collection: Old data, stale tools, and dead code piling up. Fix: prune context and tools the agent never uses.
The unifying theme: each anti-pattern collapses the harness back into the model. A giant instruction file pretends the model has perfect recall. Unbounded tool access pretends the model has perfect judgment. Self-review pretends the model has perfect calibration. The fix in every case is the same — push the work back into the harness, where it can be verified.
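As one example of pushing work back into the harness, the fix for unbounded tool access can be sketched as a registry that checks a per-agent allowlist and writes an audit entry on every call. The `ToolRegistry` class, the agent and tool names, and the tuple-based audit log are hypothetical; a production version would also validate arguments and persist the log.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolRegistry:
    tools: dict[str, Callable[..., object]] = field(default_factory=dict)
    grants: dict[str, set[str]] = field(default_factory=dict)  # agent -> allowed tools
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def call(self, agent: str, tool: str, **kwargs: object) -> object:
        """Run a tool only if this agent was explicitly granted it."""
        allowed = tool in self.tools and tool in self.grants.get(agent, set())
        # Record every attempt, including denied ones, before doing anything.
        self.audit_log.append((agent, tool, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{agent!r} may not call {tool!r}")
        return self.tools[tool](**kwargs)


registry = ToolRegistry(
    tools={
        "read_file": lambda path: f"contents of {path}",
        "delete_file": lambda path: f"deleted {path}",
    },
    grants={"review-agent": {"read_file"}},  # least privilege: read only
)
```

Here `registry.call("review-agent", "read_file", path="a.txt")` succeeds, while the same agent calling `delete_file` raises `PermissionError` and leaves a "denied" entry in `audit_log`. The decision no longer depends on the model's judgment, which is the point of the fix.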
6. Metrics that matter
You cannot improve a harness you cannot measure. Six metrics give a good first read on whether the system is healthy and where to invest next.
Metrics that matter · if you cannot measure the harness, you cannot improve it
Track these per task type and over time. A rising escalation rate or revert rate is an early warning that the harness is drifting away from the work. A growing context cost without a matching first-pass-success gain is a sign that you are paying for context the model is not actually using.
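A per-task-type rollup of the metrics named above (first-pass success, escalation rate, revert rate, and context cost) can be sketched as follows. The `Run` record, its field names, and the use of average context tokens as the cost proxy are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    task_type: str
    passed_first_try: bool
    escalated: bool
    reverted: bool
    context_tokens: int   # assumed proxy for context cost


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Aggregate run records into per-task-type rates and average context size."""
    groups: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        groups[run.task_type].append(run)

    summary = {}
    for task_type, items in groups.items():
        n = len(items)
        summary[task_type] = {
            "first_pass_success": sum(r.passed_first_try for r in items) / n,
            "escalation_rate": sum(r.escalated for r in items) / n,
            "revert_rate": sum(r.reverted for r in items) / n,
            "avg_context_tokens": sum(r.context_tokens for r in items) / n,
        }
    return summary
```

Computing this summary per time window and comparing windows would surface the warning signs described above, such as `avg_context_tokens` climbing while `first_pass_success` stays flat.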
7. The bottom line
Better models raise the ceiling. Better harnesses raise the floor. Most of the practical value in production AI today comes from raising the floor — making the median run reliable, observable, and improvable — not from chasing the last point of benchmark performance.
The bottom line · better models raise the ceiling, better harnesses raise the floor
- Agent = model + harness: Power comes from the combination. Improving either in isolation has diminishing returns.
- Prompts are not enough: Reliability requires structure, context, and guardrails. The harness is where reliability lives.
- Observability turns failure into learning: Measure, understand, improve, repeat. Quiet failures are the most expensive ones.
- Put humans at high-leverage checkpoints: Judgment is expensive. Spend it where it matters most, not on every step.
The agents that win the next two years will not be the ones built on the best model. They will be the ones built inside the best harness. Design the environment your agents work in, and the agents will start to look a lot smarter than the model card suggests.