What is a Context Window?

A context window is the maximum amount of text, measured in tokens, that a large language model (LLM) can process in a single interaction. It defines the total span of information the model can attend to at one time, including the user's prompt, any attached documents, prior conversation history, and the model's own generated response. When a conversation or document exceeds the context window, earlier content is typically truncated or dropped, which can cause the model to "forget" details it was given just moments earlier.

How a context window works

Before text reaches an LLM, it is broken into tokens, the small chunks (roughly words or word pieces) the model actually reads. The context window is the fixed budget of tokens the model can hold in its working memory at once. If a model advertises a 128,000-token context window, then everything — system instructions, retrieved documents, the full chat history, and the reply being generated — must fit inside that 128,000-token envelope.
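
The budget arithmetic described above can be sketched in a few lines. This is only an illustration: the function name remaining_budget and its parameters are hypothetical, and the token counts are assumed to come from whatever tokenizer the model actually uses.

```python
CONTEXT_WINDOW = 128_000  # tokens; the example figure used in the text


def remaining_budget(system_tokens: int,
                     document_tokens: list[int],
                     history_tokens: list[int],
                     reserved_for_reply: int,
                     window: int = CONTEXT_WINDOW) -> int:
    """Return the tokens left in the window; a negative value means overflow."""
    used = (system_tokens
            + sum(document_tokens)
            + sum(history_tokens)
            + reserved_for_reply)  # the reply shares the same envelope
    return window - used


# Hypothetical request: instructions, two documents, prior chat, reply reserve.
left = remaining_budget(1_000, [50_000, 30_000], [20_000], 4_000)
print(left)
```

Reserving space for the reply up front is one possible design choice; it keeps a long prompt from crowding out the answer.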

Internally, the model uses a mechanism called attention to weigh the relationships between the tokens in that window. Because, in standard attention, each token is compared with the other tokens in the window, compute and memory costs grow roughly with the square of the window size, which is why extending context windows efficiently is an active area of research. Length also affects recall. In a "needle in a haystack" test, a specific fact is buried somewhere in a long document (for example, 200,000 tokens) and the model is then asked about it, which reveals whether it can still retrieve that fact at that length.
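
To make the quadratic growth concrete, the sketch below counts the token-to-token comparisons in naive full attention. It is a simplification under stated assumptions: it counts one head in one layer and ignores any optimizations a real system may apply. The name attention_pairs is hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Token-to-token scores in naive full self-attention (one head, one layer)."""
    return num_tokens * num_tokens


BASELINE = 8_000  # an arbitrary small window used for comparison

for n in (8_000, 128_000, 1_000_000):
    ratio = attention_pairs(n) / attention_pairs(BASELINE)
    print(f"{n:>9,} tokens -> {attention_pairs(n):.2e} pairs ({ratio:,.0f}x the 8K cost)")
```

Under this simple model, a window that is 16 times larger (8,000 to 128,000 tokens) needs 256 times as many comparisons.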

Why it matters

The context window is one of the most important constraints on what an LLM can do in a given turn. A small window forces users to chunk long documents, summarize earlier sections, or rely on retrieval-augmented generation (RAG) to feed in only the most relevant passages. A larger window lets a model ingest whole codebases, long legal contracts, full transcripts, or hours of conversation without losing track of earlier details.
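
As a rough illustration of chunking, the sketch below splits a long text into overlapping pieces. The function name chunk_words is hypothetical, and counting words instead of real tokens is an assumption made to keep the example self-contained; a production pipeline would measure chunks with the model's own tokenizer.

```python
def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the chunks.
    """
    if max_words <= 0 or not (0 <= overlap < max_words):
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_words("a b c d e", max_words=3, overlap=1))
```

Each chunk can then be summarized or indexed for retrieval separately, so that only the relevant pieces are sent to the model.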

For developers, the window size shapes architecture decisions: how retrieval pipelines are built, how chat memory is managed, and how prompts are designed to stay under the limit. For end users, it is the difference between pasting a chapter into a chatbot and pasting an entire book — and whether the model can still answer a question about page three by the time it reaches page fifty.

Key types and current sizes

  • Short context (2K–8K tokens): the early generation of consumer LLMs, roughly the length of a long email or a few pages of prose.
  • Standard context (32K–128K tokens): common in modern frontier models, enough to hold a moderate codebase, a long meeting transcript, or, at the upper end, a full novel.
  • Long context (200K–1M+ tokens): newer "long-context" models that can ingest entire books, multi-file repositories, or multi-hour conversations in one pass.
  • Effective vs. advertised context: the advertised window is the maximum number of tokens the model accepts, while the effective window is the portion over which the model reliably retrieves and reasons about information. Independent benchmarks often show the effective window is smaller than the advertised one.
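
One way to probe the effective window is a simple "needle in a haystack" harness. The sketch below only builds the test document and checks an answer; the names build_haystack and recalled are hypothetical, and sending the prompt to a model is left out because it depends on the API in use.

```python
def build_haystack(filler_sentence: str, needle: str,
                   total_sentences: int, depth: float) -> str:
    """Bury `needle` among repeated filler sentences.

    depth = 0.0 places the needle at the start, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler_sentence] * total_sentences
    sentences.insert(int(depth * total_sentences), needle)
    return " ".join(sentences)


def recalled(answer: str, expected: str) -> bool:
    """Crude check: did the model's answer mention the expected fact?"""
    return expected.lower() in answer.lower()


document = build_haystack("The sky was grey.", "The access code is 4812.",
                          total_sentences=4, depth=0.5)
print(document)
```

Repeating the test at several document lengths and depths gives a rough picture of where recall starts to degrade.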

Context windows have expanded dramatically since 2023, but bigger is not always better: longer windows cost more memory, run more slowly, and can dilute the model's focus. For most tasks, choosing a model with a context window that comfortably fits the input is more useful than chasing the largest number on the spec sheet.

Frequently Asked Questions

What happens when input exceeds the context window?
When input exceeds the context window, the model cannot see the excess text. Some APIs reject an oversized request with an error. Others, along with most chat interfaces, truncate the input, usually dropping the earliest content first. Some systems instead use summarization or retrieval to compress earlier parts of the conversation so the most recent information still fits.
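
A minimal sketch of the "drop the earliest content first" strategy follows. The names truncate_history and count_words are hypothetical, and word counting is an assumed stand-in for a real tokenizer.

```python
def count_words(text: str) -> int:
    """Stand-in for a real token counter (an assumption for this sketch)."""
    return len(text.split())


def truncate_history(messages: list[str], budget: int,
                     count_tokens=count_words) -> list[str]:
    """Keep the most recent messages that fit within `budget`, oldest dropped first."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


print(truncate_history(["a b c", "d e", "f"], budget=3))
```
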
How many words fit in a context window?
As a rough rule of thumb, one token is about three-quarters of an English word, so a 100,000-token window holds roughly 75,000 words — close to the length of a typical novel. Code and other languages tokenize differently and may consume more tokens per character.
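
The rule of thumb above works out as follows. The 0.75 ratio is the approximation quoted in the text, not a property of any particular tokenizer, and the function names are hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English-text heuristic from the text


def tokens_to_words(tokens: int) -> int:
    """Approximate how many English words fit in a token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate how many tokens a word count will consume (rounded up)."""
    return math.ceil(words / WORDS_PER_TOKEN)


print(tokens_to_words(100_000))
print(words_to_tokens(75_000))
```

For exact figures, the text has to be run through the tokenizer of the specific model.
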
Does a larger context window make a model smarter?
Not necessarily. A larger window lets a model consider more information at once, but reasoning quality, training, and the model's effective recall still matter. Independent tests such as the "needle in a haystack" benchmark often find that models retrieve information less reliably from the middle of very long windows than from the beginning or end.
How is a context window different from memory in a chatbot?
The context window is the model's working memory for a single request, while chatbot "memory" usually refers to features that store facts across sessions and inject them into the prompt. Anything stored externally only counts toward the context window when it is actually included in the current prompt.
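
The distinction can be sketched as follows. This is a hypothetical illustration, not how any particular chatbot implements memory: build_prompt, the prompt layout, and the stored facts are all made up for the example.

```python
def build_prompt(user_message: str, stored_facts: list[str], max_facts: int) -> str:
    """Inject a limited number of externally stored facts into the prompt.

    Only the facts selected here consume context-window tokens; the rest
    stay in external storage and are invisible to the model for this request.
    """
    selected = stored_facts[:max_facts]
    memory_block = "\n".join(f"- {fact}" for fact in selected)
    return f"Known facts about the user:\n{memory_block}\n\nUser: {user_message}"


facts = ["likes tea", "lives in Oslo"]  # hypothetical cross-session memory
print(build_prompt("Suggest a weekend plan.", facts, max_facts=1))
```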