A context window is the maximum amount of text, measured in tokens, that a large language model (LLM) can process in a single interaction. It defines the total span of information the model can attend to at one time, including the user's prompt, any attached documents, prior conversation history, and the model's own generated response. When a conversation or document exceeds the context window, earlier content is typically truncated or dropped, which can cause the model to "forget" details it was given just moments earlier.
How a context window works
Before text reaches an LLM, it is broken into tokens, the small chunks (roughly words or word pieces) the model actually reads. The context window is the fixed budget of tokens the model can hold in its working memory at once. If a model advertises a 128,000-token context window, then everything — system instructions, retrieved documents, the full chat history, and the reply being generated — must fit inside that 128,000-token envelope.
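The budgeting rule above can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: `estimate_tokens` uses the common rule of thumb of about four characters per token for English text, and both function names and the `reply_budget` parameter are hypothetical. A production system would count tokens with the tokenizer that ships with the specific model.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (heuristic, assumption)."""
    return len(text) // 4


def fits_in_window(parts: dict[str, str], reply_budget: int,
                   window: int = 128_000) -> tuple[bool, int]:
    """Check whether all prompt parts plus the reserved reply fit in the window.

    Returns (fits, remaining_tokens); remaining is negative when over budget.
    """
    used = sum(estimate_tokens(text) for text in parts.values()) + reply_budget
    return used <= window, window - used


prompt_parts = {
    "system": "You are a helpful assistant." * 10,
    "documents": "Retrieved passage text. " * 500,
    "history": "Earlier user and assistant turns. " * 100,
}
ok, remaining = fits_in_window(prompt_parts, reply_budget=1_000)
print(f"fits: {ok}, tokens left: {remaining}")
```

Reserving tokens for the reply up front matters because the generated output counts against the same window as the input.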
Internally, the model uses a mechanism called attention to weigh the relationships between the tokens in that window. Because each token can attend to every other token, the compute cost of standard attention grows roughly with the square of the window size (and, in naive implementations, so does the memory cost), which is why extending the context window efficiently is an active area of research. The practical limits can be measured with a "needle in a haystack" test: a specific fact is buried somewhere in a long document, for example 200,000 tokens of filler text, and the model is then asked to recall it. Whether it answers correctly shows how well the model actually uses its full window.
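To make the quadratic growth concrete, the sketch below counts the token-to-token scores that full attention would compute for a few window sizes. It assumes plain, unoptimized attention and ignores per-layer and per-head multipliers; `attention_pairs` is a hypothetical helper, and real systems use optimizations that reduce the practical cost.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of pairwise attention scores when every token attends to every token."""
    return n_tokens * n_tokens


baseline = attention_pairs(8_000)
for n in (8_000, 32_000, 128_000):
    pairs = attention_pairs(n)
    # Growing the window 4x multiplies the pair count by 16x.
    print(f"{n:>7} tokens -> {pairs:.2e} pairs ({pairs // baseline}x the 8K cost)")
```

Under these assumptions, a window 16 times longer needs 256 times as many pairwise scores, which is why long windows are disproportionately expensive.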
Why it matters
The context window is one of the most important constraints on what an LLM can do in a given turn. A small window forces users to chunk long documents, summarize earlier sections, or rely on retrieval-augmented generation (RAG) to feed in only the most relevant passages. A larger window lets a model ingest whole codebases, long legal contracts, full transcripts, or hours of conversation without losing track of earlier details.
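Chunking a long document for a small window might look like the following sketch. It assumes whitespace-separated words stand in for tokens, which only approximates a real tokenizer, and `chunk_tokens` is a hypothetical helper. The overlap between chunks is one common way to avoid cutting a sentence's context at a chunk boundary.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int = 0) -> list[list[str]]:
    """Split a token list into chunks of at most chunk_size, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Whitespace split is a stand-in for real tokenization (assumption).
words = "a long document that does not fit in a small context window".split()
for chunk in chunk_tokens(words, chunk_size=5, overlap=1):
    print(" ".join(chunk))
```

Each chunk can then be summarized or embedded separately, with only the relevant pieces sent to the model.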
For developers, the window size shapes architecture decisions: how retrieval pipelines are built, how chat memory is managed, and how prompts are designed to stay under the limit. For end users, it is the difference between pasting a chapter into a chatbot and pasting an entire book — and whether the model can still answer a question about page three by the time it reaches page fifty.
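One simple way to manage chat memory under a token limit is to keep the system instructions and drop the oldest turns first. The sketch below assumes that policy; `trim_history` and its word-count default for `count` are hypothetical, and many applications summarize old turns instead of discarding them.

```python
from typing import Callable


def trim_history(system: str, turns: list[str], max_tokens: int,
                 count: Callable[[str], int] = lambda s: len(s.split())) -> list[str]:
    """Keep the most recent turns that fit in max_tokens alongside the system prompt.

    `count` is a stand-in token counter (words by default, an assumption).
    """
    budget = max_tokens - count(system)
    kept: list[str] = []
    # Walk backwards from the newest turn so recent context survives.
    for turn in reversed(turns):
        cost = count(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["What is a token?", "A token is a small chunk of text.", "And a context window?"]
print(trim_history("Answer briefly.", history, max_tokens=14))
```

With this policy the model loses the oldest turns first, which matches the "forgetting" behavior users notice in long conversations.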
Key types and current sizes
- Short context (2K–8K tokens): the early generation of consumer LLMs, roughly the length of a long email or a few pages of prose.
- Standard context (32K–128K tokens): common in modern frontier models; at the upper end, enough to hold a novel-length text, a moderate codebase, or a long meeting transcript.
- Long context (200K–1M+ tokens): newer "long-context" models that can ingest entire books, multi-file repositories, or multi-hour conversations in one pass.
- Effective vs. advertised context: the advertised window is the maximum number of tokens the model accepts, while the effective window is the portion over which the model reliably retrieves and reasons about information. Independent benchmarks often show the effective window is smaller than the advertised one.
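A rough way to probe the effective window is to build "needle in a haystack" prompts with the fact placed at different depths and check whether the model's answer contains it. The sketch below covers only prompt construction and scoring; `build_haystack` and `recalled` are hypothetical helpers, the needle text is invented for illustration, and the call to an actual model is left out because it depends on the provider.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` into the filler sentences at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def recalled(answer: str, expected: str) -> bool:
    """Crude scoring: does the model's answer mention the expected fact?"""
    return expected.lower() in answer.lower()


filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The access code is 7421."  # invented fact for the test
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(filler, needle, depth) + " What is the access code?"
    # A real test would send `prompt` to the model and score its reply with recalled().
    print(depth, len(prompt.split()))
```

Repeating this across document lengths and depths gives a picture of where recall starts to degrade, which is what the effective window describes.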
Context windows have expanded dramatically since 2023, but bigger is not always better: longer windows cost more memory, run more slowly, and can dilute the model's focus. For most tasks, choosing a model with a context window that comfortably fits the input is more useful than chasing the largest number on the spec sheet.
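The "comfortably fits" advice can be expressed as a small selection rule: pick the smallest window that holds the input and the reply with some headroom. The model names, window sizes, and the 20% headroom in this sketch are all hypothetical placeholders, not recommendations.

```python
from typing import Optional


def pick_window(input_tokens: int, reply_tokens: int,
                windows: dict[str, int], headroom: float = 0.2) -> Optional[str]:
    """Return the name of the smallest window that fits the request, or None."""
    needed = int((input_tokens + reply_tokens) * (1 + headroom))
    fitting = {name: size for name, size in windows.items() if size >= needed}
    if not fitting:
        return None
    return min(fitting, key=fitting.get)


# Hypothetical catalog of models and their advertised windows.
catalog = {"small-model": 8_000, "standard-model": 32_000, "long-model": 128_000}
print(pick_window(input_tokens=20_000, reply_tokens=2_000, windows=catalog))
```

Here a 20,000-token input with a 2,000-token reply selects the 32,000-token option rather than the largest one available.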