Lesson 1.1

How LLMs Actually Work: Tokens, Context Windows and the Performance Cliff

Build a correct mental model of tokens, context windows and why long prompts get worse, so you can drive any model well

22 min | Foundations - From Zero to Your First Shipped App | Available

What you learn

  • What a token is and why you are billed and limited by tokens, not words
  • How a context window works and what actually happens when it fills up
  • Why model quality degrades on long inputs - the performance cliff - and how to avoid it

Summary

A large language model does one thing astonishingly well: it predicts the next token given everything it has seen so far. Once you understand tokens, the context window and the performance cliff that hits long inputs, every later decision about which model to pick, how to prompt it and how to run agents stops feeling like guesswork. This lesson gives you that mental model in plain language, with the few numbers that actually matter in 2026.

What you will learn

You will learn what a token is, why you pay and get limited per token rather than per word, how the context window holds the whole conversation, and why dumping more text into a model often makes its answers worse rather than better. By the end you can read a model spec sheet and predict how a model will behave before you spend a cent on it.

Prerequisites

None. You do not need to code or to have used an AI tool before. If you have ever typed a message into ChatGPT, Claude or Gemini, you already have all the background you need. We link the deeper terminal and Git fundamentals later, when you actually need them.

The problem

Most people treat an LLM like a search engine or a person. They paste in a huge document, ask a vague question, and are surprised when the answer is shallow, wrong or ignores half of what they pasted. The model did not get lazy. It hit limits that are baked into how it works. Without a mental model of tokens and context you will keep blaming the tool for behaviour that is completely predictable.

Tokens, not words

A model never sees letters or words the way you do. Before anything happens, your text is split into tokens, which are common chunks of characters. A token is roughly four characters or about three quarters of a word in English. Common words are a single token; rare words, code symbols and other languages cost more. Two things are measured in tokens: the price you pay and the size of what a model can hold at once. That is why a "cheap" model can get expensive on long documents, and why German or code prompts cost more tokens than the same idea in plain English.

  • 1 token is about 4 characters or 0.75 English words.
  • 1,000 tokens is roughly 750 words, or about a page and a half of text.
  • Pricing is quoted per million tokens, split into input (what you send) and output (what the model writes back). Output usually costs several times more than input.
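  • You are billed for the WHOLE conversation every turn, not just your latest message, because the full history is sent back to the model as input each time it replies. Some providers discount repeated input, but the history still counts against the context window.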
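To make the arithmetic in the list above concrete, here is a minimal Python sketch. It uses the four-characters-per-token rule of thumb rather than a real tokenizer, and the prices are hypothetical placeholders, not any provider's actual rates. It also ignores any caching discounts a provider may offer.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from this lesson; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about one token per four characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def total_input_tokens(new_tokens_per_turn: list[int]) -> int:
    """Input tokens billed over a chat where every turn re-sends the full history."""
    total = 0
    history = 0
    for new_tokens in new_tokens_per_turn:
        history += new_tokens  # the conversation grows by this turn's tokens
        total += history       # and the whole history is sent again as input
    return total


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
print(estimate_tokens("a" * 3000))
print(cost_usd(20_000, 1_000, 3.0, 15.0))
print(total_input_tokens([1_000] * 10))
```

In this sketch, ten turns that each add 1,000 tokens are billed as 55,000 input tokens in total, not 10,000, because every turn pays for the whole history again.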

The context window

The context window is the maximum number of tokens a model can consider at once: your instructions, the files you pasted, the conversation history and the answer it is writing, all added together. In 2026 a typical strong model has a context window of around 200,000 tokens, with some models advertising 1,000,000 or more. Think of it as the model's desk. Everything relevant has to fit on the desk at the same time. When the desk is full, something has to come off: depending on the tool, the oldest messages are dropped or summarised, or the request is simply rejected. Whatever comes off, the model can no longer see. This is one reason a long chat starts losing track of instructions you gave near the start: those tokens may have fallen off the desk.
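The desk idea can be sketched in a few lines of Python. This is one hypothetical policy (drop the oldest messages first), not how any particular chat product works; the function names and the four-characters-per-token estimate are assumptions for illustration.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per four characters (not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def fit_to_window(messages: list[str], window_tokens: int,
                  reserve_for_answer: int) -> list[str]:
    """Keep the newest messages that fit on the desk, leaving room for the answer.

    Walks backwards from the latest message and stops at the first one
    that would overflow the budget, so the oldest messages fall off first.
    """
    budget = window_tokens - reserve_for_answer
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        tokens = estimate_tokens(message)
        if used + tokens > budget:
            break
        kept.append(message)
        used += tokens
    kept.reverse()  # restore chronological order
    return kept


chat = ["instruction given at the start " * 3, "some middle message", "latest question"]
print(fit_to_window(chat, window_tokens=20, reserve_for_answer=8))
```

Note that the answer needs space on the desk too, which is why the sketch reserves part of the window before counting the history.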

The performance cliff

Here is the part almost nobody tells beginners: bigger context is not the same as better answers. As you fill a context window, model quality degrades long before you hit the hard limit. Models attend best to the start and end of a long input and get fuzzy in the middle, a pattern often called "lost in the middle". A 1,000,000 token window sounds amazing, but in practice answer quality on a packed window can be noticeably worse than on a tight, well-chosen 20,000 token prompt. This is the performance cliff. The lesson is blunt: relevance beats volume. A short prompt with exactly the right context will usually beat a giant prompt that buries it.

  • Quality is highest when the window is mostly empty and every token earns its place.
  • Quality drops as the window fills, especially for information buried in the middle.
  • Huge advertised windows (1M+) rarely deliver their full quality at the top end - treat them as a safety margin, not a workspace.
  • When in doubt, start a fresh conversation rather than piling onto a long one.
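"Relevance beats volume" can be sketched as a small selection step that runs before you build a prompt. The Python below is a rough illustration only: it scores paragraphs by simple word overlap with the question, which is a crude stand-in for real retrieval, and every name in it is hypothetical.

```python
import math
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per four characters (not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def words(text: str) -> set[str]:
    """Lowercased set of the alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant(paragraphs: list[str], question: str,
                    max_tokens: int) -> list[str]:
    """Pick the paragraphs that share the most words with the question,
    within a token budget, and return them in their original order."""
    query = words(question)
    ranked = sorted(paragraphs, key=lambda p: len(query & words(p)), reverse=True)
    chosen: list[str] = []
    used = 0
    for paragraph in ranked:
        if not query & words(paragraph):
            break  # ranked by overlap, so everything after this shares no words
        tokens = estimate_tokens(paragraph)
        if used + tokens <= max_tokens:
            chosen.append(paragraph)
            used += tokens
    return [p for p in paragraphs if p in chosen]


document = [
    "The refund policy allows returns within 30 days.",
    "Our office is in Berlin.",
    "Shipping takes five days.",
]
print(select_relevant(document, "refund policy", max_tokens=100))
```

The budget is deliberately small compared with the window: the aim is a prompt where every paragraph earns its place, not one that merely fits.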

Step by step: see it for yourself

You can build intuition in ten minutes without writing code. Open any chat model and run this small experiment. The point is to feel how tokens, the window and the cliff show up in real answers.

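  • Ask the model: "How many tokens is the word internationalization, and why?" It will usually explain that the word splits into several tokens because it is long and relatively rare. Treat the exact count as an estimate, since a model cannot inspect its own tokenizer.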
  • Paste a long article (a few thousand words) and ask a question about one sentence in the exact middle. Then ask the same question about the first sentence. The middle answer is usually weaker.
  • In a very long chat, ask the model to repeat an instruction you gave near the top. Watch it struggle or invent - those early tokens are either off the desk or lost in the fuzzy middle.
  • Start a brand-new chat, paste only the relevant paragraph, and ask again. The answer is sharper. That is relevance beating volume.

Typical mistakes

The classic beginner error is the "dump everything" prompt: paste a 50-page PDF and ask one narrow question. The model drowns. The second mistake is the never-ending chat, where you keep one conversation open for days and wonder why it gets dumber. The third is assuming a bigger context window means you can be lazy about relevance. All three come from not respecting the cliff. The fix is always the same: send less, but send the right less, and reset often.

Business ROI

This is not academic. Tokens are money and context discipline is quality. A team that understands this writes tighter prompts, picks cheaper models for simple tasks, and gets more reliable output, which means less rework. On a real workflow that runs thousands of times, trimming a bloated prompt from 30,000 tokens to 5,000 cuts the input side of your bill by about 83 percent AND can improve the answers. Output tokens are billed separately, so the saving on the total bill depends on how much the model writes back. Understanding the cliff is the single highest-leverage thing a non-technical founder can learn before spending on AI at scale.
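Working through the 30,000 to 5,000 token example as a quick Python sketch: the run count and the price are hypothetical placeholders, and only input tokens are counted, so output costs would come on top.

```python
def input_cost_usd(tokens_per_run: int, runs: int, price_per_m: float) -> float:
    """Input-token cost for a workflow, with the price quoted per million tokens."""
    return tokens_per_run * runs * price_per_m / 1_000_000


RUNS = 10_000        # hypothetical: how often the workflow runs
PRICE_PER_M = 3.0    # hypothetical: dollars per million input tokens

before = input_cost_usd(30_000, RUNS, PRICE_PER_M)
after = input_cost_usd(5_000, RUNS, PRICE_PER_M)
saving = 1 - after / before  # fraction of the input bill saved

print(f"before: ${before:.2f}, after: ${after:.2f}, saving: {saving:.0%}")
```

The percentage does not depend on the price or the run count: going from 30,000 to 5,000 input tokens removes five sixths of the input cost at any price and any volume.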

Checklist

Before you move on, make sure you can answer these without looking back. If any answer is shaky, reread the relevant section - this model sits under everything else in the course.

  • Can you explain a token to a colleague in one sentence?
  • Do you know roughly how many words fit in a 200,000 token window?
  • Can you describe the performance cliff and why relevance beats volume?
  • Do you know why you are billed for the whole conversation each turn?

Resources

Keep the idea handy as you work: when an answer is bad, your first two questions are always "is my context too big?" and "is the right information actually near the top or bottom?" The fundamentals page on tokens goes deeper on tokenization if you want the underlying detail, and the model comparison in the next lesson builds directly on the numbers introduced here.

Your task

Run the four-step experiment above in a chat model of your choice and write down, in your own words, one sentence describing the moment you saw the model "forget" or get fuzzy. Keeping that concrete memory makes every prompting decision later in the course click into place.

Next lesson

Now that you know what a model is doing under the hood, the obvious next question is which model to use. The next lesson compares Haiku, Sonnet and Opus against GPT and Gemini, explains benchmarks honestly, and shows where to get strong models cheaply or for free.
