Harness Engineering
The Craft of Guiding Models
"Sometimes the language model isn't not smart enough — it just isn't being guided well by humans."
If you've spent time with the herd, you've seen every model as a horse — body size for parameters, head for context, breed for capability. A horse is what the model is. But there's another half to the story: what humans do to steer it. That's the harness.
The model's training sets a ceiling on what it can do. How close you get to that ceiling depends almost entirely on the harness you put on it — the system prompt, the tools, the examples, the constraints, the evals. By the end of this page, the techniques you already know will look like pieces of one coherent craft: harness engineering.
A lecture worth watching
The phrase harness engineering comes from a recent lecture by Hung-yi Lee, a professor at National Taiwan University whose Chinese-language ML courses have been foundational for a generation of students. The premise is simple and unusually honest: most of what we read as "the model failed" is actually "we didn't guide it well." Watching this is the fastest way into the mindset.
A short history of the harness
Every technique you've heard of in prompting exists because an earlier one ran into its own limits. The harness grew as the models grew.
- Raw prompts (GPT-3 era). A text box and hope. Practitioners discovered that how you phrased the prompt changed everything — and called it prompt engineering half-jokingly, before realizing it wasn't a joke. Reynolds & McDonell, 2021 ↗
- System messages (ChatGPT, 2022). A formal place to set the model's stance before the conversation starts. The model now had a role, not just a question. OpenAI API docs ↗
- Few-shot and in-context learning. Show the model two or three examples of what you want. It pattern-matches instead of guessing. Brown et al., 2020 ↗
- Chain-of-thought (Wei et al., 2022). Ask the model to think step by step and it reasons better. The prompt shaped the process, not just the target. arXiv 2201.11903 ↗
- Tool use (function calling, 2023–2024). Models stopped guessing at facts when they could call the world — a calculator, a search index, a database — and read the answer. OpenAI, Jun 2023 ↗
- Structured output. JSON schemas and response formats meant downstream systems could actually consume what the model produced. OpenAI, Aug 2024 ↗
- Agent loops. Act, observe, act again. A harness for multi-step work where each step's output becomes the next step's input. Yao et al., 2022 (ReAct) ↗
- Evals as feedback. Testing stopped being a checklist and became the feedback signal that teaches the harness itself. OpenAI Evals ↗
Each era solved a limit of the one before it. None of them made the model smarter. All of them made the model more useful.
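Several of these pieces compose naturally. Below is a toy sketch of an agent loop with tool use folded in, assuming nothing beyond the Python standard library: the "model" is a hand-written stub that first requests a calculator tool and then answers, and both the stub and the tool are invented for illustration.

```python
# A toy agent loop: act, observe, act again. Each step's output
# becomes the next step's input. The stub model and calculator
# tool are invented for illustration, not a real API.

def calculator(expression: str) -> str:
    """A tool the agent calls instead of guessing arithmetic."""
    # Arithmetic only, with builtins stripped; a real harness would sandbox this.
    return str(eval(expression, {"__builtins__": {}}))

def stub_model(transcript: list[dict]) -> dict:
    """Stand-in for a real model call: request a tool until an observation arrives."""
    tool_turns = [t for t in transcript if t["role"] == "tool"]
    if tool_turns:
        return {"type": "answer", "content": f"The result is {tool_turns[-1]['content']}."}
    return {"type": "tool_call", "tool": "calculator", "input": "19 * 21"}

def run_agent(question: str, max_steps: int = 5) -> str:
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = stub_model(transcript)
        if step["type"] == "answer":              # loop ends when the model answers
            return step["content"]
        observation = calculator(step["input"])   # act...
        transcript.append({"role": "tool", "content": observation})  # ...observe
    return "Gave up."

answer = run_agent("What is 19 * 21?")
```

The loop itself is harness, not model: the stub never got smarter, but giving it a tool and feeding observations back made it useful.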
Anatomy of a harness
Four pieces of tack map tightly onto four techniques — each piece of equipment does, in horse terms, what the technique does in prompt terms.
- Saddle — the system prompt. The platform everything else rides on. Always present, sets the stance before a single word is exchanged. Example: "You are a careful technical reviewer. Explain tradeoffs before recommending."
- Bit — structured output. Sits in the mouth and shapes what comes out of it. Schemas constrain form the same way. Example: a JSON schema or response_format the model's output must conform to.
- Blinkers — guardrails. Narrow the field. Blinkers narrow vision; guardrails narrow the action space to what's allowed. Example: "Do not invent function names that aren't in the provided API."
- Saddlebags — context. What the horse carries along for the journey. Retrieved documents, prior turns, user preferences. Example: RAG over a knowledge base; a conversation history; a user profile loaded at session start.
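Fitted together, the four pieces of tack become the fields of a single chat request. Here is a minimal sketch in Python, assuming an OpenAI-style payload shape (the `messages` and `response_format` field names follow that convention; the reviewer persona, schema, and documents are invented for illustration):

```python
# Saddle: the system prompt sets the stance before the conversation starts.
SYSTEM_PROMPT = (
    "You are a careful technical reviewer. Explain tradeoffs before recommending. "
    # Blinkers: a guardrail narrows the action space to what's allowed.
    "Do not invent function names that aren't in the provided API."
)

# Bit: a JSON schema the model's output must conform to.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": ["approve", "request_changes"]},
        "tradeoffs": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["verdict", "tradeoffs"],
}

def build_request(question: str, retrieved_docs: list[str]) -> dict:
    """Assemble one harnessed request from the four pieces of tack."""
    # Saddlebags: retrieved context the model carries into the turn.
    context_block = "\n\n".join(retrieved_docs)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context:\n{context_block}\n\nQuestion: {question}"},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"schema": REVIEW_SCHEMA},
        },
    }

request = build_request(
    "Should we cache these API calls?",
    ["Doc: the API rate limit is 60 req/min."],
)
```

None of this touches the model's weights; it is all tack, fitted before the model says a word.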
Three more techniques matter just as much but don't fit neatly into any piece of tack:
- Few-shot prompting — less about constraining form, more about teaching through pattern: show two or three examples and the model follows the shape.
- Tool use — not steering but extending. The model gains new senses and reach: a search, a calculator, an API call to a real system.
- Evals — not a piece of gear at all. Evals are the feedback loop around the harness: run the model, score the output, iterate.
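The eval loop is small enough to show in miniature. A hedged sketch: the stub model and the two test cases below are invented, but real eval suites follow the same run-score-iterate shape.

```python
# Evals as the feedback loop around the harness: run the model over
# fixed cases, score each output, and use the pass rate to judge
# whether a harness change helped. Stub model and cases are invented.

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "approve" if "small" in prompt else "request_changes"

EVAL_CASES = [
    {"prompt": "Review this small typo fix.", "expected": "approve"},
    {"prompt": "Review this large schema migration.", "expected": "request_changes"},
]

def run_evals(model) -> float:
    """Score the model against every case; return the pass rate."""
    passed = sum(model(case["prompt"]) == case["expected"] for case in EVAL_CASES)
    return passed / len(EVAL_CASES)

score = run_evals(stub_model)
```

Change the system prompt, rerun, compare scores: that comparison, not any single output, is what teaches the harness.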
Why harnesses matter
Here's a reframe that changes how you read every AI story: when "AI failed at X," what usually failed was the harness, not the model. The same GPT-4 that looks brilliant in one product looks dumb in another because one product built a harness for the task and the other didn't.
Capability is the upper bound — set at training, not at use. Performance is where the harness actually lets you reach. The gap between a convincing demo and a shipped product is almost entirely harness work: the system prompt you iterated on for weeks, the five tools the model can call, the schema that keeps the output parseable, the eval suite that catches regressions.
A note on scope: this page uses harness engineering as a framing lens — the model / the harness split. In the broader industry usage, harness engineering is one subset of context engineering, the craft of managing everything the model knows. The lens still pays its rent: it's the sharpest way to see which failures you can actually fix.
The herd meets the harness
This primer is half of a pair. The other half — an interactive experience where you pick a horse from the herd, fit it with a harness, and run it — is coming soon.
Takeaways
- Capability sets the ceiling. Harness engineering decides how close you get to it.
- Prompts, tools, and evals are one craft. Every technique you already use is a piece of the same harness.
- Shipped AI is mostly harness work. What looks like intelligence in a product is usually guidance, not generation.