A Transformer is a type of neural network designed to process sequences of data, most famously language, by comparing every element in the input to every other element at the same time. Instead of reading strictly left-to-right like older recurrent networks, it uses a mechanism called self-attention to learn which words, tokens, or positions matter most to one another, no matter how far apart they sit. This parallel design makes Transformers faster to train on modern hardware and better than recurrent networks at capturing long-range dependencies, which is why they now power nearly every state-of-the-art large language model.
How a Transformer works
At the heart of a Transformer is the self-attention operation. Every input token is projected into three vectors: a query, a key, and a value. To understand one token, the model compares its query against the keys of every token in the sequence (its own included), producing a set of attention scores that say "how much should I look at each of you?" Those scores are scaled by the square root of the key dimension and normalized with a softmax into weights that sum to 1, and a weighted sum of the value vectors becomes the new representation of that token. Multi-head attention runs several such comparisons in parallel, each with its own learned projections, letting the model track different kinds of relationships simultaneously, such as grammar, coreference, and sentiment.
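The single-head case described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names (`softmax`, `matvec`, `self_attention`), the toy token vectors, and the identity matrices standing in for learned projection weights are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))  # square root of the key dimension

    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every key, its own included.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # The new representation is the weighted sum of the value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy input: three tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
outputs, weights = self_attention(tokens, identity, identity, identity)
```

With these toy inputs, the first token attends equally to itself and to the third token, and less to the second, because its query has a larger dot product with those two keys. A multi-head layer would repeat this with separate projection matrices per head and concatenate the results.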
Stacks of these attention blocks, each followed by a position-wise feed-forward network and wrapped in residual connections and layer normalization, form the full model. A positional encoding is added to the inputs so the network knows token order, since attention on its own has no built-in notion of order: shuffling the input tokens simply shuffles the outputs in the same way. During training, a decoder-only Transformer predicts the next token in a sequence, with a causal mask preventing each position from attending to later ones; with enough data and parameters, this simple objective gives rise to the reasoning, translation, and code-generation abilities seen in systems like GPT.
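As one concrete example of a positional encoding, the sketch below uses the fixed sinusoidal scheme from the original "Attention Is All You Need" paper. Many later models use learned or rotary encodings instead, so treat this as one option among several. The helper names (`sinusoidal_encoding`, `add_positions`) and the toy embeddings are assumptions for illustration.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the encoding vector for one position (d_model must be even)."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair uses a different wavelength, set by i / d_model.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding

def add_positions(embeddings):
    """Add a position-dependent vector to each token embedding."""
    d_model = len(embeddings[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# Hypothetical input: the same word embedding at positions 0 and 1.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_positions(same_word_twice)
```

Although the two input embeddings are identical, the encoded vectors differ, which is what lets the attention layers distinguish the first occurrence of a word from the second.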
Why it matters
Before Transformers, recurrent neural networks (RNNs), including LSTMs, processed text one token at a time, an approach that was slow to train and struggled to retain information across long contexts. The Transformer's parallel attention let researchers scale models to billions of parameters trained on web-scale corpora, unlocking the capabilities of modern LLMs. The same architecture has since been adapted to images (vision transformers), audio, proteins, and reinforcement learning, making it the dominant paradigm of contemporary deep learning.
Key types
- Encoder-only Transformers — such as BERT, optimized for understanding tasks like classification, search ranking, and embeddings.
- Decoder-only Transformers — such as GPT and Llama, optimized for generating text one token at a time.
- Encoder-decoder Transformers — such as the original "Attention Is All You Need" model and T5, used for translation and sequence-to-sequence tasks.
- Vision Transformers (ViT) — apply self-attention to patches of an image instead of words.
- Mixture-of-Experts (MoE) Transformers — route each token to a subset of "expert" sub-networks, increasing capacity without proportional compute cost.
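To make the Mixture-of-Experts idea in the list above more concrete, here is a rough sketch of top-k routing for a single token. It assumes a linear router and gate weights renormalized over the chosen experts, which is one common design but not the only one. Real systems also route whole batches and usually add load-balancing terms, both omitted here. The function `moe_layer`, the toy experts, and the router weights are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and blend their outputs."""
    # One router logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # only chosen experts are evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Hypothetical experts: simple functions standing in for feed-forward sub-networks.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
    lambda v: [-x for x in v],
    lambda v: [0.0 for _ in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_layer([3.0, 1.0], router_weights, experts, top_k=2)
```

Here only two of the four experts run for the token, so the layer holds four experts' worth of parameters while paying roughly the compute cost of two.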
Since 2017, the Transformer has reshaped both AI research and product engineering, and most of the apps in the HyperStore catalog — chatbots, code assistants, image generators, and reasoning agents — are built on some variant of it. Read the original "Attention Is All You Need" paper for the foundational design, or the Illustrated Transformer guide for a step-by-step walkthrough.