A token is the smallest piece of text a language model actually works with. When you send a prompt to a model like GPT, Claude, or Llama, your text is first split into a sequence of tokens — typically whole words, common subwords, or single characters — and each token is then converted into a number the model can process. The model generates output the same way, predicting and emitting one token at a time until it decides to stop.
How tokens work
Tokens are produced by a tokenizer, a separate component that sits between your text and the model. The most common schemes are byte-pair encoding (BPE) and WordPiece, which start from individual characters (or bytes) and repeatedly merge adjacent pairs into longer units: BPE merges the most frequent pair, while WordPiece picks the merge that most improves the likelihood of the training data. The result is a fixed vocabulary, often 30,000 to 200,000 entries, that balances short common words with reusable subword pieces. A frequent word like "the" usually becomes a single token, while a rare or made-up word like "unbelievableness" is split into several, for example "un", "believ", "able", "ness" (the exact split depends on the tokenizer).
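The merge loop described above can be sketched in a few lines of Python. This is a toy, character-level illustration and not the implementation used by any production tokenizer; the function names `train_bpe`, `merge_pair`, and `encode` are made up for this example, and real tokenizers add byte-level handling, pre-tokenization, and much larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

On a toy corpus of ten copies of "the", three of "then", two of "other", and one of "that", the first two merges learned are ("t", "h") and ("th", "e"), so "the" becomes a single token while "then" is split into "the" and "n".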
Because English averages around four characters per token, a rough rule of thumb is that 100 tokens ≈ 75 English words, though this varies by tokenizer and language. Pricing, context limits, and generation speed are all measured in tokens, not words or characters. By the same rule of thumb, a model with a 200,000-token context window can hold roughly 150,000 English words in a single prompt, about the length of a long novel.
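The rule of thumb can be turned into quick back-of-the-envelope arithmetic. The sketch below hard-codes the two ratios quoted above (about four characters and 0.75 words per token); the constant and function names are illustrative, and the results are estimates only, since an exact count requires running the model's actual tokenizer.

```python
import math

# Rough English averages taken from the rule of thumb above; both are
# approximations that vary by tokenizer and language.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def words_that_fit(context_tokens: int) -> int:
    """Estimate how many English words fit in a context window of this size."""
    return int(context_tokens * WORDS_PER_TOKEN)
```

For example, `words_that_fit(200_000)` returns 150000, and `estimate_tokens_from_words(75)` returns 100.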
Why it matters
Tokens determine three things every user cares about: cost, capacity, and behavior. API providers charge per million tokens, so a prompt that tokenizes inefficiently costs more than it should. Context windows — the maximum amount of text a model can consider at once — are counted in tokens, which is why very long documents must be chunked before being fed in. Behavior is affected too: a tokenizer that splits a word differently can change how a model reasons about it, and some languages tokenize into far more pieces per word than English, which inflates costs and shortens effective context for non-English users.
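Cost and capacity both reduce to simple token arithmetic. The following sketch assumes a provider that bills input and output tokens at separate per-million rates and a document that has already been tokenized into a list of IDs; the function names, the prices passed in, and the optional overlap between chunks are all hypothetical choices for illustration, not any provider's API.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one request given per-million-token prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def chunk_tokens(token_ids, max_tokens, overlap=0):
    """Split a token sequence into chunks of at most `max_tokens` tokens.

    Consecutive chunks share `overlap` tokens so that context is not lost
    at chunk boundaries.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last chunk already reaches the end of the document
    return chunks
```

Chunking on token counts rather than characters or words keeps each chunk within the model's limit regardless of how efficiently the text happens to tokenize.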
Key token concepts
- Tokenization: the algorithm that splits text into tokens, usually via BPE, WordPiece, or Unigram.
- Vocabulary: the fixed list of tokens a model knows, with a unique integer ID for each entry.
- Special tokens: reserved symbols such as <BOS>, <EOS>, and padding markers that signal boundaries and structure rather than content.
- Context window: the maximum number of tokens a model can process in a single request, including both input and generated output.
- Token limits: hard caps imposed by providers on how many tokens a request may contain, often split into input and output limits.
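To see how these concepts fit together, here is a minimal sketch of a word-level vocabulary with special tokens and a context-window check. It is deliberately simplified: it splits on whitespace instead of using BPE or WordPiece, the `<UNK>` token for unknown words is an addition not mentioned above, and all function names are hypothetical.

```python
# Reserved entries get the lowest IDs; <UNK> (unknown word) is an extra
# assumption for this toy example.
SPECIAL_TOKENS = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]


def build_vocab(texts):
    """Assign a unique integer ID to every special token and every word seen."""
    vocab = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to token IDs, wrapped in <BOS> and <EOS> markers."""
    ids = [vocab["<BOS>"]]
    ids += [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]
    ids.append(vocab["<EOS>"])
    return ids


def decode(ids, vocab):
    """Convert token IDs back to text, dropping special tokens."""
    id_to_token = {i: token for token, i in vocab.items()}
    words = [id_to_token[i] for i in ids if id_to_token[i] not in SPECIAL_TOKENS]
    return " ".join(words)


def fits_context(input_ids, max_output_tokens, context_window):
    """Check that input plus requested output stays within the context window."""
    return len(input_ids) + max_output_tokens <= context_window
```

Note that `fits_context` counts the requested output against the same budget as the input, matching the definition of the context window above.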
For a deeper look at byte-pair encoding, Andrej Karpathy's minbpe repository is a practical starting point, and the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich, Haddow, and Birch, 2016) introduced the subword approach most modern tokenizers still build on.