I spent the first few weeks of building with AI doing what most people do: reading benchmarks. GPT-4 or Claude? Which one passed more coding tests? Which one was smarter?
It took me longer than I’d like to admit to realize I was solving the wrong problem.
The model is maybe 20% of the outcome. The harness is the other 80%. I had the ratio completely backwards.
What I mean by “the harness”
Update: Addy Osmani, Director at Google Cloud AI, put it directly in April 2026: “A decent model with a great harness beats a great model with a bad harness. The gap between what today’s models can do and what you see them doing is largely a harness gap.”
The harness is everything around the model. System prompts. Memory structure. Tool access. Stopping conditions. The description you write for each tool, which the model reads before deciding what to call. The SOUL.md file that tells an agent who it is. The AGENTS.md file that tells it what it’s supposed to do and what it should never touch.
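The tool description part is worth pausing on, because it is literally just prose the model reads before deciding whether to call the tool. Something like the sketch below, where the tool name and wording are invented for illustration, not taken from any particular framework:

```
Tool: search_notes
What it does: searches the local notes directory by keyword.
When to use it: the task refers to something we have already researched.
When not to use it: general web questions (use web_search for those).
Returns: up to five matching excerpts, each with a file path.
```

The “when to use it” and “when not to use it” lines carry most of the weight, because that is what the model is actually weighing when it decides what to call.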
You don’t see the harness when you demo a product. You see the model’s output. But the harness is where the work actually lives.
Viv Trivedy put it in a line that’s been stuck in my head since I first saw it: “If you’re not the model, you’re the harness.”
That’s your job. Not to pick the smartest model. To build the thing that makes any decent model work.
The evidence I can’t argue with
Here’s what moved me from “I think the harness matters” to “the harness is the whole game”:
On Terminal Bench 2.0, a standardized benchmark for how well agents complete terminal-based tasks, one team’s agent ranked in the top 30. They changed only their harness: same model, different system prompts, different tool structure. They jumped to the top 5.
Osmani documented a different example: one developer optimized their Claude Code harness using CLI and Insforge Skills. Same Claude model before and after. The workload moved from 10.4 million tokens and 10 errors at $9.21, to 3.7 million tokens and zero errors at $2.81. A 3.27x cost reduction. Zero model change.
The variable in both cases wasn’t which AI they picked. It was how they told the AI to work.
What this looked like for me
I have a repo called everything-claude-code — it’s a fork I made to study how Claude Code’s harness is structured. Then I built my own.
The system I run now has a Chief of Staff agent named Jim (more on that in Post 2) who manages a team of six specialized agents. Each one has a SOUL.md file (identity, principles, persona) and an AGENTS.md file (operational instructions). Jim reads both before dispatching any task.
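One way to picture it: a directory per agent, each holding the same two files. The tree below is a simplified sketch rather than a dump of my repo, and the specialist names are stand-ins:

```
team/
  jim/              # Chief of Staff: reads the team's files, dispatches tasks
    SOUL.md
    AGENTS.md
  researcher/       # one of the six specialists (name illustrative)
    SOUL.md         # identity, principles, persona
    AGENTS.md       # scope, allowed tools, stopping conditions
  writer/
    SOUL.md
    AGENTS.md
  ...               # the remaining specialists, same two files each
```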
I didn’t design that structure because I thought it was clever. I designed it because of specific things that went wrong without it.
Early on, I gave agents too much access and not enough context. They would drift, helpfully doing related things they weren’t asked to do. I’d ask for a research brief and get a research brief plus a draft blog post plus three suggested follow-up topics.
So I added a task boundary paragraph to every AGENTS.md. Four sentences: this agent handles X. It succeeds when Y. It fails when Z. It should never touch W.
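Filled in for the research-brief problem above, that paragraph reads something like this. The wording is illustrative; the shape is the point:

```
## Task boundary
This agent handles research briefs.
It succeeds when exactly one brief is delivered back to Jim.
It fails when it delivers anything other than the brief.
It should never draft posts, suggest follow-up topics, or touch files
outside its own working directory.
```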
That’s a harness decision. It solved a real problem. Every line in a good system prompt traces to a specific past failure. That habit is what Osmani calls “The Ratchet Pattern”: you don’t design the perfect harness up front. You ship v0.1, watch it fail, write down what failed, and ship v0.2.
Why most people don’t do this
The model is the exciting part. New model drops, the benchmark posts come out, the comparison threads run. Nobody writes threads about refining system prompts for three weeks.
But the product you ship is the harness. The model is infrastructure. You didn’t pick your cloud provider because it was the most interesting cloud; you picked it because the infrastructure works and you can build on top of it.
The mental model shift that helped me: stop asking “which AI is smarter” and start asking “what does this agent need to know to do this job well?”
One is a spec sheet comparison. The other is job design.
The pattern I keep seeing
When an agent produces bad output, the first question most people ask is “should I switch models?” In my experience, that’s almost never the right answer.
The right questions are:
- Does the agent have a clear task boundary?
- Does the tool description actually explain when and how to use the tool?
- Is there a stopping condition, or is the agent running until it decides it’s done?
- What’s in the memory file — and does it reflect what the agent actually learned from past sessions?
These are harness questions. They’re slower to answer than “use GPT-5 instead of Claude.” They’re also the ones that matter.
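On that last question, here is the kind of memory entry I mean. The file name and details below are invented for illustration; what matters is that each line records a specific failure and what changed because of it:

```
# memory/researcher.md   (illustrative entries, not real session notes)
- Briefs kept citing the same three blog posts. New rule: at least two
  primary sources per claim.
- Re-researched a topic we had already covered. Now: check this file for
  prior briefs before starting a new one.
```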
I’ve shipped five production projects in the last eight months. The model versions I used on each one don’t matter much to me now. The harness decisions I made, the files I wrote, the memory structure I built, the task boundaries I enforced: those I’d make almost the same way again.
This is the first post in a seven-part series about what I’ve actually learned building with AI tools over eight months: Claude Code, Cowork, Codex, Cursor, and the projects underneath them. The next post goes deeper into my specific Claude Code setup — what each file is for, why it exists, and what I’d do differently.
If you’re building with AI tools and the model keeps frustrating you, it might not be the model.
