Context Engineering
Managing What the Model Knows
Same GPT-4 looks brilliant in one product and dumb in another. Same weights, same API, same month. What actually changed is what the product gave the model to work with. That gap — between the model's ceiling and how close you get to it — is almost always context, not capability.
This page is about that gap. It covers context engineering at two scales: the project scale (what your team has accumulated in specs, plans, decisions, code) and the inference scale (what's in the token window of a single call). The two are the same craft at different altitudes — and most of the value shows up when you can see how they connect.
A term coined in public
"Context engineering" was named in public on X in June 2025. The name stuck because it captured a shift that practitioners were already feeling — from writing clever single prompts to designing the whole information environment the model operates in.
"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
"+1 for 'context engineering' over 'prompt engineering' ... the delicate art and science of filling the context window with just the right information for the next step."
Cognition had already used the phrase as a section heading in Don't Build Multi-Agents a week earlier, on June 12 — so the term was live in the engineering community before Lütke's post. Anthropic followed with an institutional piece in September, and Linear formalized the product-development framing in March 2026.
Where context lives
Most of your context isn't in the model's window. It's in the stuff your team has already written down — scattered across Slack, Notion, Linear, GitHub, and the team's collective memory. Under Linear's product-development framing, that accumulated memory breaks into seven kinds of artifact:
- Plans
- Discussions
- Specs
- Technical designs
- Decisions
- Summaries
- Code
These are what your team already produces — plans and discussions that surface intent, specs and technical designs that pin ideas to surfaces, decisions that freeze ambiguity, summaries that compress meetings, and the code itself. Each is a kind of artifact with a different half-life and a different audience. Together they form the accumulated memory the model needs to do anything beyond the generic.
Two failure modes dominate at this scale:
- Context exists but isn't reachable. The decision lives in a meeting note the agent can't see. So it hallucinates the decision and you burn a week undoing it.
- Context is stale. Last quarter's plan is still the onboarding doc. Agents (and new hires) cheerfully follow it into the wrong forest.
Context engineering at the project scale is the discipline of making your team's memory both reachable and fresh. Every product team building with AI ends up here. Linear: What's Next ↗
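Neither failure mode needs heavy tooling to start measuring. As a minimal sketch of the freshness half, assuming the team's plans, specs, and decisions live as markdown under a hypothetical docs/ directory, a few lines can flag anything nobody has touched in a quarter:

```python
from datetime import datetime, timedelta
from pathlib import Path

# Hypothetical layout: the team's written-down context lives under docs/.
DOCS_ROOT = Path("docs")
MAX_AGE = timedelta(days=90)

def stale_docs(root: Path = DOCS_ROOT, max_age: timedelta = MAX_AGE) -> list[Path]:
    """Return markdown artifacts nobody has touched within max_age."""
    cutoff = datetime.now() - max_age
    stale = []
    for path in root.rglob("*.md"):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified < cutoff:
            stale.append(path)
    return sorted(stale)

if __name__ == "__main__":
    for path in stale_docs():
        print(f"possibly stale: {path}")
```

Reachability is harder to script, but the test is the same: if the agent can't list the artifact, it can't use it.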
Why specs matter
"AI cannot replace thinking. It can only amplify the thinking you have done or the lack of thinking you have done."
Specs are compressed context. When a team writes a clear spec, they're surfacing the decisions, constraints, and intent that would otherwise live only in someone's head — and making that thinking reachable, both by humans reviewing code and by agents generating it.
This is where the knowledge-gap lens pays off. If your spec leaves out "we decided against approach X last quarter because of Y," the model will cheerfully re-propose X. You'll catch yourself yelling at the screen. The model didn't fail — your context did. The missing spec is the bug.
A few practical beats from Dex Horthy's work on coding agents:
- A bad line in a plan becomes a hundred bad lines of code, and a bad research pass can send the whole effort in the wrong direction entirely. Catching errors upstream is the highest-leverage move.
- Plans with file names and line snippets beat vague prose. The dumbest model in the world can follow a clear plan; no model can execute a blurry one.
- On-demand compressed context beats static onboarding docs. Documentation rots; regenerate the context for the slice you're working on, from the source of truth.
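A minimal sketch of that last beat, assuming an OpenAI-style chat client and hypothetical file paths. The detail that matters is that the brief is rebuilt from today's source, not read from a doc that was true last quarter:

```python
from pathlib import Path
from openai import OpenAI  # assumption: any chat-completion client works; OpenAI shown for concreteness

client = OpenAI()

def fresh_context(slice_paths: list[str], focus: str, model: str = "gpt-4.1") -> str:
    """Regenerate a working brief for one slice of the codebase, straight from source."""
    sources = "\n\n".join(f"=== {p} ===\n{Path(p).read_text()}" for p in slice_paths)
    prompt = (
        "Summarize how this code works today, for an engineer about to modify it.\n"
        f"Focus on: {focus}\n\n{sources}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage (hypothetical paths): rebuild the brief instead of trusting an aging onboarding doc.
# brief = fresh_context(["billing/invoice.py", "billing/tax.py"], "how invoices get totaled")
```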
The window you actually have
Zoom in. A single call to the model happens inside a fixed-size window — typically 128K or 200K tokens today, up to 1M or 2M on flagship models. Most of the field has intuited that more context means better answers. This turns out to be wrong.
Chroma's July 2025 research ("Context Rot") tested 18 models — including GPT-4.1, Claude 4, and Gemini 2.5 — and found that performance degrades non-uniformly as input length grows. Different tasks have different knees. But the pattern is consistent: models reason worse with more context, not better, past a certain threshold.
Performance degrades non-uniformly as input grows (Chroma Research, 2025). Dex Horthy puts the practical knee around 40% of the window.
Trajectory matters too. If the conversation history is "model tried → human corrected → model tried again → human corrected again," the next most likely token is another wrong attempt followed by another correction. The model pattern-matches on the whole trajectory, not just on facts. Once a chat goes sideways, starting fresh is almost always cheaper than pushing harder.
The practical implication: context isn't free, and more isn't better. Every token you put in the window costs both latency and attention. Good context engineering treats the window as a budget, not a bucket. Chroma: Context Rot ↗
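One way to make the budget concrete, as a sketch: price each candidate piece of context and admit it by priority until the budget is spent. The priorities, the budget number, and pack_context are illustrative rather than anyone's API; tiktoken is used only as a convenient counter.

```python
import tiktoken

# cl100k_base is a reasonable default encoding; exact counts vary by model.
ENC = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(ENC.encode(text))

def pack_context(items: list[tuple[int, str]], budget: int) -> list[str]:
    """Treat the window as a budget: admit (priority, text) items until it's spent.

    Lower priority numbers are more important and get first claim on the budget.
    """
    packed, spent = [], 0
    for _, text in sorted(items, key=lambda item: item[0]):
        cost = count_tokens(text)
        if spent + cost > budget:
            continue  # drop it, or summarize it in a separate pass
        packed.append(text)
        spent += cost
    return packed
```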
The techniques that hold up
Five techniques recur across the practitioners' writing. They're different shapes of the same move: curate before you accumulate.
- Intentional compaction. Before the window degrades, stop and ask the agent to compress the working context into a markdown file. Review it. Tag it. Start fresh with the compacted version as the new starting point. Compact proactively, not when Claude tells you it's getting confused (see the compaction sketch after this list). Dex Horthy talk ↗
- Sub-agents for context isolation, not role-play. Don't create a "QA sub-agent" and a "frontend sub-agent" to mimic job titles. Do create sub-agents when you want to hand off a bounded research task — the sub-agent burns its own window on the reading, then returns a short summary to a parent that's still in the smart zone (see the sub-agent sketch after this list).
- Research → Plan → Implement. Three phases, each with a smaller context and a clearer output. Research compresses truth. Planning compresses intent. Implementation executes against compressed intent. Each phase's artifact is the next phase's context.
- On-demand compressed context over static onboarding docs. Prefer a research pass that builds context from the current code to a hand-written onboarding doc that ages. The y-axis of any static doc's drift-from-truth chart is labeled "lies."
- Keep the trajectory clean. A model that's been corrected six times expects to be corrected a seventh. Start a new session with the same goal and a cleaner starting prompt rather than arguing your way to a right answer.
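A minimal sketch of intentional compaction, assuming an OpenAI-style client, a 200K window, and a 40% trigger. The constants and the handoff format are assumptions, and the review step flagged in the comments is the part that matters:

```python
from pathlib import Path
from openai import OpenAI  # assumption: any chat-completion client works

client = OpenAI()
WINDOW = 200_000   # assumed window size, in tokens
COMPACT_AT = 0.4   # compact proactively around the practical knee, not when things get confused

COMPACT_PROMPT = (
    "Compress this working session into a markdown handoff: the goal, the decisions made, "
    "the files touched, open questions, and the next concrete step. Drop the back-and-forth."
)

def estimated_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; swap in a real tokenizer if you have one

def should_compact(messages: list[dict]) -> bool:
    used = sum(estimated_tokens(m["content"]) for m in messages)
    return used > WINDOW * COMPACT_AT

def compact(messages: list[dict], out_path: Path = Path("handoff.md"), model: str = "gpt-4.1") -> list[dict]:
    response = client.chat.completions.create(
        model=model,
        messages=messages + [{"role": "user", "content": COMPACT_PROMPT}],
    )
    handoff = response.choices[0].message.content
    out_path.write_text(handoff)  # review it and tag it before trusting it
    # The fresh session starts from the compacted file, not the old transcript.
    return [{"role": "user", "content": f"Context from the previous session:\n\n{handoff}"}]
```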
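And a companion sketch of context isolation, under the same assumed client: the sub-agent burns its own window on the reading, and only the short summary crosses back into the parent's.

```python
from openai import OpenAI  # assumption: any chat-completion client works

client = OpenAI()

def research_subagent(task: str, sources: list[str], model: str = "gpt-4.1") -> str:
    """Hand off a bounded reading task; return only a short summary to the parent."""
    corpus = "\n\n".join(sources)
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{task}\n\nAnswer in under 300 words, citing file names.\n\n{corpus}",
        }],
    )
    return response.choices[0].message.content

# The parent never sees the corpus, only the summary, so it stays in the smart zone.
# parent_messages.append({"role": "user", "content": research_subagent(
#     "How does retry logic work in the payments module?", payment_files)})
```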
One meta-technique sits over all of these: don't outsource the thinking. A plan still has to be correct before it's executed. A tool that spits out a wall of markdown so you feel productive is not a context-engineering win — the thinking stayed on your plate, hidden under a nice scrollbar. The model amplifies care; it does not substitute for it.
Harness engineering as a lens
One useful subset of context engineering is what harness engineering names: the separation between the model (the thing you can't change at use time) and the harness (everything humans built around it — prompts, tools, schemas, guardrails, memory). The split is pedagogical rather than technical, but it's a sharp one. It answers a question every AI team asks: what can we actually fix?
If a behavior sits on the model side, your options are: pick a different model, wait for a better one, or fine-tune (expensive, often a bad idea). If it sits on the harness side, you own it. Context engineering is the craft of noticing which side matters for the problem you're looking at.
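A sketch of the split as plain data, purely illustrative rather than any real framework, just to make 'what can we actually fix?' concrete:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass(frozen=True)
class Model:
    """The part you can't change at use time: pick another, wait, or (rarely) fine-tune."""
    name: str

@dataclass
class Harness:
    """Everything built around the model; every field here is yours to fix."""
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)
    output_schema: dict = field(default_factory=dict)
    guardrails: list[Callable[[str], bool]] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)

# When a behavior is wrong, ask which object you would have to edit to fix it.
# If the answer is Harness, you own the fix; if it's Model, you're choosing or waiting.
```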
Takeaways
- Capability is the ceiling. Context is how close you get.
- Specs aren't bureaucracy — they're compressed context. Good ones make your team's thinking reachable by humans and agents alike.
- More context isn't better. Treat the window like a budget. Compact, isolate, and start over when trajectory drifts.