In short
A large language model does one thing remarkably well: it predicts the next token given everything it has seen. Once you understand tokens, the context window and the performance cliff that hits long inputs, working with any model stops feeling like guesswork. This guide explains all three in plain language, with the few numbers that actually matter in 2026, so you can drive any model well and stop blaming the tool for behaviour that is entirely predictable.
Tokens, not words
A model never sees words the way you do. Your text is first split into tokens, which are common chunks of characters. In English, one token is roughly four characters, or about three-quarters of a word. Two things are measured in tokens: the price you pay and the amount a model can hold at once. That is why a cheap model can become expensive on long documents, and why code or text in other languages often takes more tokens than the same idea in plain English. Pricing is quoted per million tokens and split into input and output, with output usually several times more expensive than input.
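To make the arithmetic concrete, here is a minimal sketch of a back-of-envelope estimator. It assumes the four-characters-per-token rule of thumb from above, which is only an approximation (real tokenizers vary by model and language). The function names and the prices of $3 and $15 per million tokens are hypothetical, chosen purely for illustration.

```python
import math

# Rule of thumb from the text: about four characters per token in English.
# This is an approximation; a real tokenizer will give different counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one call in dollars, with prices quoted per million tokens."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
document_tokens = estimate_tokens("x" * 200_000)  # a 200,000-character document
cost = estimate_cost(document_tokens, 1_000, 3.0, 15.0)
print(f"{document_tokens} input tokens, about ${cost:.3f} per call")
```

With these assumed prices, a 200,000-character document costs far more to read than the 1,000-token answer costs to write, even though output is priced five times higher per token.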
The context window
The context window is the maximum number of tokens a model can consider at once: your instructions, the files you pasted, the conversation history and the answer it is writing, all added together. Think of it as the model's desk. Everything relevant has to fit on the desk at the same time, and when the desk is full something has to come off: older material is typically dropped or cut down, and is effectively forgotten. This is one reason a long chat starts losing track of instructions you gave near the start. In 2026 a strong model typically has a window of around 200,000 tokens, with some advertising a million or more.
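The "desk" can be sketched as a simple budget. The example below is illustrative only: the class name, field names and token counts are hypothetical, and the 200,000-token window is the typical figure quoted above, not the limit of any particular model.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Everything that has to fit in the window at the same time (in tokens)."""

    window: int           # the model's context window
    instructions: int     # your system prompt and instructions
    files: int            # pasted documents
    history: int          # earlier turns of the conversation
    reserved_output: int  # room left for the answer being written

    def used(self) -> int:
        return self.instructions + self.files + self.history + self.reserved_output

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


# Hypothetical numbers for a long chat with a large pasted file.
budget = ContextBudget(
    window=200_000,
    instructions=2_000,
    files=150_000,
    history=45_000,
    reserved_output=8_000,
)
print(budget.used(), budget.remaining(), budget.fits())
```

In this made-up case the total is 5,000 tokens over the window, so something has to come off the desk before the model can answer.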
The performance cliff
Bigger context is not the same as better answers. As you fill a context window, quality tends to degrade well before you hit the hard limit. Models attend best to the start and end of a long input and get fuzzy in the middle, a pattern often called "lost in the middle". A million-token window sounds amazing, but answer quality on a packed window is often worse than on a tight, well-chosen prompt. This is the performance cliff, and the lesson is blunt: relevance beats volume.
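One way to act on the start-and-end pattern is to order your material so the most relevant pieces sit at the edges of the prompt and the least relevant end up in the middle. The sketch below assumes you already have the chunks ranked from most to least relevant; the function name is hypothetical, and this is one possible ordering heuristic, not a guaranteed fix.

```python
def place_at_edges(ranked_chunks: list[str]) -> list[str]:
    """Reorder chunks (given most relevant first) so the strongest
    ones land at the start and end, and the weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for index, chunk in enumerate(ranked_chunks):
        # Alternate: 1st goes to the front, 2nd to the back, 3rd to the front...
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the 2nd most relevant chunk is the very last one.
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
ordered = place_at_edges(ranked)
print(ordered)  # ['A', 'C', 'E', 'D', 'B']
```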
Why huge context windows disappoint
You will see models advertising enormous context windows, and it is tempting to assume they are strictly better. In practice they often disappoint, for exactly the reason above. A model can technically accept a million tokens and still answer worse than it would on a focused prompt, because quality falls as the window fills. Treat a huge window as occasional insurance for a genuinely large document, not as permission to stop curating what you send.
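Curating can be as simple as setting a token budget well below the window and filling it with the most relevant material first. The sketch below is a greedy selection under stated assumptions: the relevance scores are supplied by you (how you score relevance is outside its scope), the token count uses the rough four-characters-per-token rule, and all names are hypothetical.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token (an approximation)."""
    return math.ceil(len(text) / 4)


def select_chunks(chunks: list[tuple[str, float]], budget_tokens: int) -> list[str]:
    """Pick chunks in order of relevance until the token budget is spent.

    `chunks` is a list of (text, relevance_score) pairs; higher is more relevant.
    Chunks that would push the total over budget are skipped.
    """
    selected: list[str] = []
    used = 0
    for text, _score in sorted(chunks, key=lambda pair: pair[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected


# Hypothetical chunks with made-up relevance scores.
candidates = [
    ("x" * 400, 0.9),  # about 100 tokens, highly relevant
    ("y" * 800, 0.2),  # about 200 tokens, barely relevant
    ("z" * 200, 0.7),  # about 50 tokens, relevant
]
chosen = select_chunks(candidates, budget_tokens=160)
print([len(text) for text in chosen])  # [400, 200]
```

The large, barely relevant chunk is left out even though the model could technically accept it.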
How to use this in practice
The practical takeaways are simple. Send less, but send the right less. Start fresh conversations rather than piling onto long ones. When an answer is bad, your first two questions are whether your context is too big and whether the relevant information is actually near the top or bottom. On a workflow that runs thousands of times, trimming a bloated prompt can cut your bill dramatically and improve the answers at the same time.
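The effect on the bill is easy to estimate. The figures below are hypothetical (a 40,000-token prompt trimmed to 8,000 tokens, 10,000 runs a month, $3 per million input tokens) and cover input cost only; plug in your own numbers.

```python
def monthly_input_cost(
    tokens_per_run: int, runs_per_month: int, price_per_million: float
) -> float:
    """Monthly input cost in dollars, with the price quoted per million tokens."""
    return tokens_per_run * runs_per_month * price_per_million / 1_000_000


# Hypothetical workflow: same task, bloated prompt versus trimmed prompt.
before = monthly_input_cost(40_000, 10_000, 3.0)
after = monthly_input_cost(8_000, 10_000, 3.0)
saving = before - after
print(f"before ${before:.0f}, after ${after:.0f}, saving ${saving:.0f} a month")
```

Under these assumptions the trimmed prompt costs a fifth of the original, before counting any improvement in answer quality.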
Why this matters for your business
Tokens are money and context discipline is quality. A team that understands this writes tighter prompts, picks cheaper models for simple tasks, and gets more reliable output, which means less rework. Understanding the cliff is one of the highest-leverage things a non-technical founder can learn before spending on AI at scale, because it changes every downstream decision about models, prompts and agents.
