What is context engineering?

Context engineering is the practice of deliberately managing what an AI agent holds in its context window, keeping the relevant material in and the noise out so the agent stays accurate and fast across a long task. It covers curating what you load, ordering it well, compacting or resetting when needed, retrieving on demand, and caching stable content to control cost.

What is the context window?

The context window is the model's working memory: everything it can see when generating its next response, measured in tokens. For a coding agent it includes the system prompt, your rules, tool definitions, every file read, command output, and the whole conversation so far. It is finite, and quality degrades as it fills.

What is the lost-in-the-middle problem?

Language models attend most reliably to information at the start and end of their context and least reliably to information in the middle. A key instruction or fact buried in the centre of a long prompt is the most likely to be ignored, so you should place the most important material near the top and bottom of the context.

Does a bigger context window mean I do not have to worry about context?

No. Model quality degrades well before a window is technically full, and it degrades sharply at a performance cliff. A large maximum context is a ceiling, not a target. A smaller, well-curated context usually outperforms a larger, stuffed one, which is why context engineering still matters even with million-token windows.

What is compaction in a coding agent?

Compaction summarises a long conversation into a compact form so the agent can continue without the full history filling the window. Automatic compaction can quietly drop details you cared about, so it is best to steer it by telling the agent what to preserve, or to do a deliberate handover to a fresh session instead.

Context Engineering: Managing the Context Window

Q: How does prompt caching reduce cost?

You mark a stable prefix such as your system prompt, rules or a large reference document as cacheable, and later requests that start with the same exact bytes read it from cache instead of recomputing it. On the Claude API in 2026 a cache read costs about a tenth of a normal input token, so a reused prefix pays for itself within a couple of requests.

What the context window actually holds

The context window is everything the model can see when it generates its next response, measured in tokens (chunks of text, roughly four characters each). For a coding agent it fills with more than your latest message: the system prompt that defines the agent, your CLAUDE.md or AGENTS.md rules, the definitions of every connected tool and MCP server, every file the agent has read, every command output it has seen, and the entire conversation up to now. All of that competes for the same finite budget. The mental model that matters: context is a scarce resource you spend, and everything you load (a chatty MCP server, a giant file, a long back-and-forth) is budget the actual task no longer has. See the glossary on the context window for the formal definition.

The system prompt and your CLAUDE.md / AGENTS.md rules, reloaded every turn.
Tool and MCP server definitions, which is why connecting many servers is costly.
Every file read and every command output, which accumulates fast during a task.
The full conversation history; long sessions carry their whole past forward.

Why a full window hurts: the performance cliff

It is tempting to think a bigger context window means you can stop worrying, but the opposite is true: model quality degrades well before the window is technically full, and it degrades sharply. As the window fills with files, history and noise, the model has more to attend to and is likelier to lose the thread, contradict an earlier instruction, or forget a constraint from the top of the conversation. This is the "performance cliff," and it is why a 1M-token window does not mean you should pour 1M tokens into it. The practical takeaway is counterintuitive but reliable: a smaller, well-curated context usually outperforms a larger, stuffed one. Context engineering exists precisely to keep you on the good side of that cliff.

Quality drops before the window is full, and the drop is a cliff, not a gentle slope.
A stuffed window makes the model lose threads, contradict itself and drop constraints.
A large maximum context is a ceiling, not a target; do not fill it because you can.
A curated small context beats a bloated large one, the central rule of context engineering.

Lost in the middle

"Lost in the middle" is a well-documented behaviour of language models: they attend most reliably to information at the very start and the very end of their context, and least reliably to information buried in the middle. A crucial instruction or the one relevant fact, dropped into the centre of a long prompt or a long conversation, is the most likely thing to be ignored. The practical consequence shapes how you arrange context. Put the most important instructions and the most relevant material where the model looks: near the top (your standing rules) and near the bottom (the immediate task and the key file). Do not assume that because something is somewhere in the window, the model is using it. Position is leverage.

Models attend best to the start and end of context, worst to the middle.
A key instruction buried mid-prompt is the most likely to be ignored.
Put standing rules near the top and the immediate task and key file near the bottom.
Being in the window is not the same as being used; position determines attention.

Compaction, handovers and resets

When a session runs long, you need ways to shed weight without losing the thread. Three techniques do most of the work. Compaction summarises the conversation so far into a compact form and continues, freeing the window; the catch is that automatic compaction quietly drops details you cared about, so steer it by telling the agent what to preserve before it compacts. A handover ends one session and starts a fresh one with a clean, deliberate summary you write, which gives you a much tidier context than letting one session sprawl for hours. A reset throws away a context that has gone confused and starts over with a tight prompt, which is often faster than trying to argue a derailed agent back on track. Knowing when to reach for each is the practical core of the skill.

Compaction: summarise and continue to free the window; steer it so it keeps what matters.
Handover: end the session and start fresh with a clean summary you control.
Reset: discard a confused context and restart with a tight prompt rather than arguing.
Subagents also help: delegate noisy work so its output never lands in your main window.

Retrieval: bring in only what is needed

The opposite failure mode to a stuffed window is the right information never arriving at all. Retrieval is how you pull in just the relevant piece on demand instead of pre-loading everything. For a coding agent this is mostly concrete and unglamorous: let the agent search the codebase and read only the files a task touches, rather than pasting the whole repo; point it at the one doc page it needs; have it grep for the function instead of loading the directory. The principle behind retrieval-augmented patterns is the same whether it is a vector database or an agent running grep: fetch the specific thing the task needs, when it needs it, so the window holds signal rather than a hopeful pile of maybe-relevant material.

Pull in the specific file, doc or record the task needs, not everything that might be relevant.
Let a coding agent search and read on demand instead of pre-loading the whole repo.
Retrieval keeps the window full of signal, which keeps the model on the good side of the cliff.
The same idea scales up to vector search; the goal is always relevant-on-demand, not everything-just-in-case.

Prompt caching: control the cost of a big context

A large, stable context is expensive because the model re-processes every token of it on every request, and you pay for those input tokens each time. Prompt caching fixes the cost side: you mark a stable prefix (your system prompt, rules, tool definitions, a large reference document) as cacheable, and subsequent requests that begin with the same exact bytes read it from cache instead of recomputing it. On the Claude API the economics in June 2026 are clear: a cache write costs about 1.25x a normal input token for the default five-minute lifetime (or 2x for the one-hour option), and a cache read costs only about 0.1x, a tenth of the price. So a cached prefix pays for itself within a couple of reuses. The cache is a prefix cache, so order matters: put your stable content first and your changing content last, and a single changed token before the breakpoint forces a full re-write. Caching does not reduce how much context the model attends to, only what you pay to send it, so it complements curation rather than replacing it.

Caching reuses the encoded state of a stable prefix so it is not recomputed each request.
On the Claude API (June 2026): cache writes about 1.25x input (5-minute default, 2x for 1-hour), cache reads about 0.1x.
It is a prefix cache: keep stable content first and changing content last, or you force a re-write.
Caching cuts cost, not attention; you still curate the window. See the glossary on prompt caching.

A practical context-engineering checklist

Put the ideas together into habits you run without thinking. None of this requires special tooling; it is discipline about what you load and when you clean up. A future companion is the token and context estimator tool on this campus, which will let you paste text and see how much of a model window it fills before you send it; for now, the rules below carry you.

Keep standing rules in CLAUDE.md or AGENTS.md, and keep that file tight; it loads every turn.
Load the files the task needs, not the whole repo; let the agent retrieve on demand.
Put the most important instruction near the top and the immediate task near the bottom.
Compact, hand over or reset when a session gets long or confused; do not let it sprawl.
Cache large stable prefixes to control cost, keeping stable content first.
Delegate noisy side work to a subagent so its output stays out of your main window.

Frequently asked questions

Keep learning

Guide

Prompt Patterns for Coding Agents

Reusable prompt patterns for coding agents: role and spec, examples, decomposition, verification loops, and the anti-patterns to avoid. Practitioner guide for 2026.

Open Guide

How to Use Claude Code: Complete Beginner Guide (2026)

Learn how to use Claude Code from scratch in 2026: install it, start your first session, run the plan-edit-run-review loop, write a CLAUDE.md, and go deeper.

Open Guide

Claude Code Subagents Explained (with Examples)

What Claude Code subagents are, when to use them, and how to create one in .claude/agents with YAML frontmatter. Built-in subagents, examples and the /agents command.

Open Guide

What Is Agentic Engineering? The 2026 Pillar Guide

Agentic engineering is building software by directing AI coding agents that plan, edit and run code. What it is, how it differs, and how to learn it in 2026.

Open Term

Context Window

A context window is the maximum amount of text, measured in tokens, an AI model can consider at once, including the prompt, history and the answer it writes.

Open Term

Prompt Caching

Prompt caching stores the processed prefix of a prompt so repeated requests reuse it, cutting cost and latency. Cache reads can be about 90 percent cheaper.

Open Term

System Prompt

A system prompt is the standing instruction that sets an AI model role, rules and behaviour before any user message, shaping how it responds all session.

Open Term

Subagent

A subagent is a specialised AI agent a main agent delegates a task to, running in its own context window with its own prompt and tools, returning a summary.

Open Lesson

Context Engineering: Compaction, Handovers, Resets and Thinking Effort

Manage long-running agent work with deliberate compaction, clean handovers, well-timed resets and the right thinking effort

Open

What the context window actually holds

Why a full window hurts: the performance cliff

Lost in the middle

Compaction, handovers and resets

Retrieval: bring in only what is needed

Prompt caching: control the cost of a big context

A practical context-engineering checklist

Frequently asked questions

What is context engineering?

What is the context window?

What is the lost-in-the-middle problem?

Does a bigger context window mean I do not have to worry about context?

How does prompt caching reduce cost?

What is compaction in a coding agent?

Keep learning

Ready to put AI to work as a real workflow?

Better AI workflows, once a week.