Guides

Context Engineering: Managing the Context Window

Prompting9 min readUpdated June 13, 2026

Context engineering is the practice of deliberately managing what an AI agent is holding in its context window right now, so it stays accurate and fast across a long task instead of slowly drifting into confusion. The context window is the model's working memory: the system prompt, your rules, the files it has read, the tools available, the conversation so far. It is finite, and the single most important fact about it is that quality degrades as it fills, not gracefully but with a cliff. Context engineering is how you keep the right things in the window and the wrong things out: through compaction, retrieval, ordering, and prompt caching to control cost. This guide explains what fills the window, why a full window hurts, and the techniques that keep agentic work reliable. Everything here is current as of June 2026 and pairs with the Course 2 context-engineering lesson.

What the context window actually holds

The context window is everything the model can see when it generates its next response, measured in tokens (chunks of text, roughly four characters each). For a coding agent it fills with more than your latest message: the system prompt that defines the agent, your CLAUDE.md or AGENTS.md rules, the definitions of every connected tool and MCP server, every file the agent has read, every command output it has seen, and the entire conversation up to now. All of that competes for the same finite budget. The mental model that matters: context is a scarce resource you spend, and everything you load (a chatty MCP server, a giant file, a long back-and-forth) is budget the actual task no longer has. See the glossary on the context window for the formal definition.

  • The system prompt and your CLAUDE.md / AGENTS.md rules, reloaded every turn.
  • Tool and MCP server definitions, which is why connecting many servers is costly.
  • Every file read and every command output, which accumulates fast during a task.
  • The full conversation history; long sessions carry their whole past forward.

Why a full window hurts: the performance cliff

It is tempting to think a bigger context window means you can stop worrying, but the opposite is true: model quality degrades well before the window is technically full, and it degrades sharply. As the window fills with files, history and noise, the model has more to attend to and is likelier to lose the thread, contradict an earlier instruction, or forget a constraint from the top of the conversation. This is the "performance cliff," and it is why a 1M-token window does not mean you should pour 1M tokens into it. The practical takeaway is counterintuitive but reliable: a smaller, well-curated context usually outperforms a larger, stuffed one. Context engineering exists precisely to keep you on the good side of that cliff.

  • Quality drops before the window is full, and the drop is a cliff, not a gentle slope.
  • A stuffed window makes the model lose threads, contradict itself and drop constraints.
  • A large maximum context is a ceiling, not a target; do not fill it because you can.
  • A curated small context beats a bloated large one, the central rule of context engineering.

Lost in the middle

"Lost in the middle" is a well-documented behaviour of language models: they attend most reliably to information at the very start and the very end of their context, and least reliably to information buried in the middle. A crucial instruction or the one relevant fact, dropped into the centre of a long prompt or a long conversation, is the most likely thing to be ignored. The practical consequence shapes how you arrange context. Put the most important instructions and the most relevant material where the model looks: near the top (your standing rules) and near the bottom (the immediate task and the key file). Do not assume that because something is somewhere in the window, the model is using it. Position is leverage.

  • Models attend best to the start and end of context, worst to the middle.
  • A key instruction buried mid-prompt is the most likely to be ignored.
  • Put standing rules near the top and the immediate task and key file near the bottom.
  • Being in the window is not the same as being used; position determines attention.

Compaction, handovers and resets

When a session runs long, you need ways to shed weight without losing the thread. Three techniques do most of the work. Compaction summarises the conversation so far into a compact form and continues, freeing the window; the catch is that automatic compaction quietly drops details you cared about, so steer it by telling the agent what to preserve before it compacts. A handover ends one session and starts a fresh one with a clean, deliberate summary you write, which gives you a much tidier context than letting one session sprawl for hours. A reset throws away a context that has gone confused and starts over with a tight prompt, which is often faster than trying to argue a derailed agent back on track. Knowing when to reach for each is the practical core of the skill.

  • Compaction: summarise and continue to free the window; steer it so it keeps what matters.
  • Handover: end the session and start fresh with a clean summary you control.
  • Reset: discard a confused context and restart with a tight prompt rather than arguing.
  • Subagents also help: delegate noisy work so its output never lands in your main window.

Retrieval: bring in only what is needed

The opposite failure mode to a stuffed window is the right information never arriving at all. Retrieval is how you pull in just the relevant piece on demand instead of pre-loading everything. For a coding agent this is mostly concrete and unglamorous: let the agent search the codebase and read only the files a task touches, rather than pasting the whole repo; point it at the one doc page it needs; have it grep for the function instead of loading the directory. The principle behind retrieval-augmented patterns is the same whether it is a vector database or an agent running grep: fetch the specific thing the task needs, when it needs it, so the window holds signal rather than a hopeful pile of maybe-relevant material.

  • Pull in the specific file, doc or record the task needs, not everything that might be relevant.
  • Let a coding agent search and read on demand instead of pre-loading the whole repo.
  • Retrieval keeps the window full of signal, which keeps the model on the good side of the cliff.
  • The same idea scales up to vector search; the goal is always relevant-on-demand, not everything-just-in-case.

Prompt caching: control the cost of a big context

A large, stable context is expensive because the model re-processes every token of it on every request, and you pay for those input tokens each time. Prompt caching fixes the cost side: you mark a stable prefix (your system prompt, rules, tool definitions, a large reference document) as cacheable, and subsequent requests that begin with the same exact bytes read it from cache instead of recomputing it. On the Claude API the economics in June 2026 are clear: a cache write costs about 1.25x a normal input token for the default five-minute lifetime (or 2x for the one-hour option), and a cache read costs only about 0.1x, a tenth of the price. So a cached prefix pays for itself within a couple of reuses. The cache is a prefix cache, so order matters: put your stable content first and your changing content last, and a single changed token before the breakpoint forces a full re-write. Caching does not reduce how much context the model attends to, only what you pay to send it, so it complements curation rather than replacing it.

  • Caching reuses the encoded state of a stable prefix so it is not recomputed each request.
  • On the Claude API (June 2026): cache writes about 1.25x input (5-minute default, 2x for 1-hour), cache reads about 0.1x.
  • It is a prefix cache: keep stable content first and changing content last, or you force a re-write.
  • Caching cuts cost, not attention; you still curate the window. See the glossary on prompt caching.

A practical context-engineering checklist

Put the ideas together into habits you run without thinking. None of this requires special tooling; it is discipline about what you load and when you clean up. A future companion is the token and context estimator tool on this campus, which will let you paste text and see how much of a model window it fills before you send it; for now, the rules below carry you.

  • Keep standing rules in CLAUDE.md or AGENTS.md, and keep that file tight; it loads every turn.
  • Load the files the task needs, not the whole repo; let the agent retrieve on demand.
  • Put the most important instruction near the top and the immediate task near the bottom.
  • Compact, hand over or reset when a session gets long or confused; do not let it sprawl.
  • Cache large stable prefixes to control cost, keeping stable content first.
  • Delegate noisy side work to a subagent so its output stays out of your main window.

Frequently asked questions

Next step

Ready to put AI to work as a real workflow?

Start with the foundations course, keep your progress locally and sync everything to your free account whenever you like.