Harness Engineering
The Craft of Guiding Models
"Sometimes the language model isn't not smart enough — it just isn't being guided well by humans."
If you've spent time with the herd, you've seen every model as a horse — body size for parameters, head for context, breed for capability. A horse is what the model is. But there's another half to the story: what humans do to steer it. That's the harness.
The model's training sets a ceiling on what it can do. How close you get to that ceiling depends almost entirely on the harness you put on it — the system prompt, the tools, the examples, the constraints, the evals. By the end of this page, the techniques you already know will look like pieces of one coherent craft: harness engineering.
A lecture worth watching
The phrase harness engineering comes from a recent lecture by Hung-yi Lee, a professor at National Taiwan University whose Chinese-language ML courses have been foundational for a generation of students. The premise is simple and unusually honest: most of what we read as "the model failed" is actually "we didn't guide it well." Watching this is the fastest way into the mindset.
A short history of the harness
Every technique you've heard of in prompting exists because an earlier one ran into its own limits. The harness grew as the models grew.
- Raw prompts (GPT-3 era). A text box and hope. Practitioners discovered that how you phrased the prompt changed everything — and called it prompt engineering half-jokingly, before realizing it wasn't a joke. Reynolds & McDonell, 2021 ↗
- System messages (ChatGPT, 2022). A formal place to set the model's stance before the conversation starts. The model now had a role, not just a question. OpenAI API docs ↗
- Few-shot and in-context learning. Show the model two or three examples of what you want. It pattern-matches instead of guessing. Brown et al., 2020 ↗
- Chain-of-thought (Wei et al., 2022). Ask the model to think step by step and it reasons better. The prompt shaped the process, not just the target. arXiv 2201.11903 ↗
- Tool use (function calling, 2023–2024). Models stopped guessing at facts when they could call the world — a calculator, a search index, a database — and read the answer. OpenAI, Jun 2023 ↗
- Structured output. JSON schemas and response formats meant downstream systems could actually consume what the model produced. OpenAI, Aug 2024 ↗
- Agent loops. Act, observe, act again. A harness for multi-step work where each step's output becomes the next step's input. Yao et al., 2022 (ReAct) ↗
- Evals as feedback. Testing stopped being a checklist and became the feedback signal that teaches the harness itself. OpenAI Evals ↗
Each era solved a limit of the one before it. None of them made the model smarter. All of them made the model more useful.
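Several of these pieces compose naturally. Below is a toy sketch of an agent loop with tool use folded in, assuming nothing beyond the Python standard library: the "model" is a hand-written stub that first requests a calculator tool and then answers, and both the stub and the tool are invented for illustration.

```python
# A toy agent loop: act, observe, act again. Each step's output
# becomes the next step's input. The stub model and calculator
# tool are invented for illustration, not a real API.

def calculator(expression: str) -> str:
    """A tool the agent calls instead of guessing arithmetic."""
    # Arithmetic only, with builtins stripped; a real harness would sandbox this.
    return str(eval(expression, {"__builtins__": {}}))

def stub_model(transcript: list[dict]) -> dict:
    """Stand-in for a real model call: request a tool until an observation arrives."""
    tool_turns = [t for t in transcript if t["role"] == "tool"]
    if tool_turns:
        return {"type": "answer", "content": f"The result is {tool_turns[-1]['content']}."}
    return {"type": "tool_call", "tool": "calculator", "input": "19 * 21"}

def run_agent(question: str, max_steps: int = 5) -> str:
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = stub_model(transcript)
        if step["type"] == "answer":              # loop ends when the model answers
            return step["content"]
        observation = calculator(step["input"])   # act...
        transcript.append({"role": "tool", "content": observation})  # ...observe
    return "Gave up."

answer = run_agent("What is 19 * 21?")
```

The loop itself is harness, not model: the stub never got smarter, but giving it a tool and feeding observations back made it useful.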
Anatomy of a harness
Four pieces of tack map tightly onto four techniques — each piece of equipment does, in horse terms, what the technique does in prompt terms.
- Saddle — the system prompt. The platform everything else rides on. Always present, sets the stance before a single word is exchanged. Example: "You are a careful technical reviewer. Explain tradeoffs before recommending."
- Bit — structured output. Sits in the mouth and shapes what comes out of it. Schemas constrain form the same way. Example: a JSON schema or response_format the model's output must conform to.
- Blinkers — guardrails. Narrow the field. Blinkers narrow vision; guardrails narrow the action space to what's allowed. Example: "Do not invent function names that aren't in the provided API."
- Saddlebags — context. What the horse carries along for the journey. Retrieved documents, prior turns, user preferences. Example: RAG over a knowledge base; a conversation history; a user profile loaded at session start.
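Fitted together, the four pieces of tack become the fields of a single chat request. Here is a minimal sketch in Python, assuming an OpenAI-style payload shape (the `messages` and `response_format` field names follow that convention; the reviewer persona, schema, and documents are invented for illustration):

```python
# Saddle: the system prompt sets the stance before the conversation starts.
SYSTEM_PROMPT = (
    "You are a careful technical reviewer. Explain tradeoffs before recommending. "
    # Blinkers: a guardrail narrows the action space to what's allowed.
    "Do not invent function names that aren't in the provided API."
)

# Bit: a JSON schema the model's output must conform to.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": ["approve", "request_changes"]},
        "tradeoffs": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["verdict", "tradeoffs"],
}

def build_request(question: str, retrieved_docs: list[str]) -> dict:
    """Assemble one harnessed request from the four pieces of tack."""
    # Saddlebags: retrieved context the model carries into the turn.
    context_block = "\n\n".join(retrieved_docs)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context:\n{context_block}\n\nQuestion: {question}"},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"schema": REVIEW_SCHEMA},
        },
    }

request = build_request(
    "Should we cache these API calls?",
    ["Doc: the API rate limit is 60 req/min."],
)
```

None of this touches the model's weights; it is all tack, fitted before the model says a word.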
Three more techniques matter just as much but don't fit neatly into any piece of tack:
- Few-shot prompting — less about constraining form, more about teaching through pattern: show two or three examples and the model follows the shape.
- Tool use — not steering but extending. The model gains new senses and reach: a search, a calculator, an API call to a real system.
- Evals — not a piece of gear at all. Evals are the feedback loop around the harness: run the model, score the output, iterate.
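The eval loop is small enough to show in miniature. A hedged sketch: the stub model and the two test cases below are invented, but real eval suites follow the same run-score-iterate shape.

```python
# Evals as the feedback loop around the harness: run the model over
# fixed cases, score each output, and use the pass rate to judge
# whether a harness change helped. Stub model and cases are invented.

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "approve" if "small" in prompt else "request_changes"

EVAL_CASES = [
    {"prompt": "Review this small typo fix.", "expected": "approve"},
    {"prompt": "Review this large schema migration.", "expected": "request_changes"},
]

def run_evals(model) -> float:
    """Score the model against every case; return the pass rate."""
    passed = sum(model(case["prompt"]) == case["expected"] for case in EVAL_CASES)
    return passed / len(EVAL_CASES)

score = run_evals(stub_model)
```

Change the system prompt, rerun, compare scores: that comparison, not any single output, is what teaches the harness.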
Why harnesses matter
Here's a reframe that changes how you read every AI story: when "AI failed at X," what usually failed was the harness, not the model. The same GPT-4 that looks brilliant in one product looks dumb in another because one product built a harness for the task and the other didn't.
Capability is the upper bound — set at training, not at use. Performance is where the harness actually lets you reach. The gap between a convincing demo and a shipped product is almost entirely harness work: the system prompt you iterated on for weeks, the five tools the model can call, the schema that keeps the output parseable, the eval suite that catches regressions.
A note on scope: this page uses harness engineering as a framing lens — the model / the harness split. In the broader industry usage, harness engineering is one subset of context engineering, the craft of managing everything the model knows. The lens still pays its rent: it's the sharpest way to see which failures you can actually fix.
The herd meets the harness
This primer is half of a pair. The other half — an interactive experience where you pick a horse from the herd, fit it with a harness, and run it — is coming soon.
Takeaways
- Capability sets the ceiling. Harness engineering decides how close you get to it.
- Prompts, tools, and evals are one craft. Every technique you already use is a piece of the same harness.
- Shipped AI is mostly harness work. What looks like intelligence in a product is usually guidance, not generation.