What is a Transformer?

A Transformer is a neural network architecture that learns relationships in sequential data — such as text, code, or time series — by weighing every element of an input against every other element in parallel, using a mechanism called self-attention. Introduced in the 2017 paper "Attention Is All You Need," it underpins most modern large language models, including GPT, BERT, and Llama.

Unlike older recurrent networks, which read a sequence strictly left to right, a Transformer uses self-attention to learn which words, tokens, or positions matter most to one another, no matter how far apart they sit. Because these comparisons run in parallel, Transformers train faster on modern hardware and capture long-range dependencies far better, which is why they now power nearly every state-of-the-art large language model.

How a Transformer works

At the heart of a Transformer is the self-attention operation. Every input token is projected into three vectors: a query, a key, and a value. To understand one token, the model compares its query against the keys of every token in the sequence (its own included), producing a set of attention scores that say "how much should I look at each of you?" Those scores are scaled and passed through a softmax to become weights that sum to one, and the weighted sum of the value vectors becomes the new representation of that token. Multi-head attention runs several such comparisons in parallel, each with its own learned projections, letting the model track different kinds of relationships simultaneously, such as grammar, coreference, and sentiment.
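
The steps above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration rather than a production implementation: the softmax and the division by the square root of the key size follow the scaled dot-product attention of the original paper, while the helper names (softmax, matvec, self_attention), the toy token vectors, and the identity projection matrices are assumptions made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a list of token vectors."""
    # Project every token into a query, a key, and a value.
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    d_k = len(keys[0])

    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query with every key, its own included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The new representation is the weighted sum of the value vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy data (hypothetical): three 2-dimensional tokens and identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights over the whole sequence. A multi-head version would run this routine several times with different projection matrices and concatenate the results.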

Stacking these attention blocks, each followed by a position-wise feed-forward network and wrapped in residual connections and layer normalization, forms the full model. A positional encoding is added to the inputs so the network knows token order, since attention by itself is order-blind: shuffling the input tokens simply shuffles the outputs. During training, a decoder-only Transformer predicts the next token in a sequence; with enough data and parameters, this simple objective produces the reasoning, translation, and code-generation abilities seen in systems like GPT.
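
As one illustration of how order information can be injected, the sketch below builds the fixed sine and cosine positional encoding described in "Attention Is All You Need" and adds it to a set of token embeddings. It is only one of several schemes in use (many later models learn their position vectors or use rotary encodings instead), and the function names sinusoidal_positions and add_positions, along with the toy embeddings, are assumptions for this example.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return seq_len position vectors of size d_model (d_model must be even)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos dimensions")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each sin/cos pair uses its own wavelength, from short to very long.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

def add_positions(embeddings):
    """Add a position vector to each token embedding, element by element."""
    positions = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(emb, pos)]
        for emb, pos in zip(embeddings, positions)
    ]

# Toy embeddings (hypothetical): the same vector repeated at three positions.
same_token = [1.0, 1.0, 1.0, 1.0]
encoded = add_positions([same_token, same_token, same_token])
for row in encoded:
    print([round(x, 3) for x in row])
```

Although the three input embeddings are identical, the encoded rows differ, which is what lets the attention layers tell positions apart.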

Why it matters

Before Transformers, recurrent neural networks (RNNs), including LSTMs, processed text one token at a time, which made training slow and long contexts hard to model. The Transformer's parallel attention let researchers scale models to billions of parameters trained on web-scale corpora, unlocking the capabilities of modern LLMs. The same architecture has since been adapted to images (vision transformers), audio, proteins, and reinforcement learning, making it the dominant paradigm of contemporary deep learning.

Key types

  • Encoder-only Transformers — such as BERT, optimized for understanding tasks like classification, search ranking, and embeddings.
  • Decoder-only Transformers — such as GPT and Llama, optimized for generating text one token at a time.
  • Encoder-decoder Transformers — such as the original "Attention Is All You Need" model and T5, used for translation and sequence-to-sequence tasks.
  • Vision Transformers (ViT) — apply self-attention to patches of an image instead of words.
  • Mixture-of-Experts (MoE) Transformers — route each token to a subset of "expert" sub-networks, increasing capacity without proportional compute cost.
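
To make the last item in the list concrete, here is a rough sketch of Mixture-of-Experts routing for a single token. It assumes a linear gate that scores every expert, keeps the top two, and blends their outputs with softmax weights over the selected scores; that is one common design, not the only one. The names route_token and make_scaling_expert, the gate weights, and the toy experts (which merely scale their input) are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, gate_weights, experts, top_k=2):
    """Send one token to its top_k experts and blend their outputs."""
    # The gate assigns every expert a score for this token.
    gate_scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Keep only the indices of the top_k highest-scoring experts.
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Normalize the selected scores so the blend weights sum to 1.
    blend = softmax([gate_scores[i] for i in chosen])

    output = [0.0] * len(token)
    for weight, idx in zip(blend, chosen):
        expert_out = experts[idx](token)  # only the chosen experts ever run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen

def make_scaling_expert(factor):
    """Toy stand-in for an expert network: multiplies its input by a constant."""
    def expert(token):
        return [factor * x for x in token]
    return expert

# Toy setup (hypothetical): four experts and a hand-written gate.
experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
output, chosen = route_token([1.0, 0.0], gate_weights, experts)
print("experts used:", chosen)
print("output:", [round(x, 3) for x in output])
```

Only two of the four experts execute for this token, which is how the design adds parameters without a proportional increase in per-token compute.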

Since 2017, the Transformer has reshaped both AI research and product engineering, and most of the apps in the HyperStore catalog — chatbots, code assistants, image generators, and reasoning agents — are built on some variant of it. Read the original "Attention Is All You Need" paper for the foundational design, or the Illustrated Transformer guide for a step-by-step walkthrough.

Frequently Asked Questions

Who invented the Transformer architecture?
A team of eight researchers from Google Brain, Google Research, and the University of Toronto introduced the Transformer in the 2017 paper "Attention Is All You Need." The authors, credited as equal contributors, are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. They showed that self-attention alone could match or beat recurrent and convolutional models on translation tasks while training far faster on GPUs.
What is the difference between a Transformer and an LLM?
A Transformer is the underlying neural network architecture; an LLM (large language model) is a specific application of it, trained on massive text datasets to generate and reason about language. In other words, every modern LLM is built from Transformer blocks, but not every Transformer is an LLM — vision and audio models use the same architecture too.
Why did Transformers replace RNNs and LSTMs?
Transformers process entire sequences in parallel rather than one token at a time, making them far more efficient to train on modern hardware. Their self-attention also connects any two positions in a sequence directly, however far apart, whereas plain RNNs lose long-range signal to vanishing gradients and LSTMs only partly mitigate the problem. The result is faster training, larger models, and better performance on language tasks.
What are the main limitations of Transformers?
Self-attention scales quadratically with sequence length, so very long contexts (tens of thousands of tokens) become expensive in both memory and compute. Transformers also require large amounts of training data, are opaque in how they reach decisions, and can hallucinate confident but incorrect outputs. Active research on sparse attention, state-space models, and retrieval augmentation aims to address these trade-offs.
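
A quick back-of-the-envelope calculation shows what quadratic scaling means in practice. The sketch assumes a naive implementation that stores the full score matrix for one attention head in one layer at 2 bytes per score (16-bit floats); the function name attention_matrix_bytes and these defaults are illustrative assumptions, and optimized implementations can avoid storing the whole matrix even though the compute still grows quadratically.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=2, num_heads=1):
    """Memory needed to hold a full seq_len x seq_len score matrix per head."""
    return seq_len * seq_len * bytes_per_score * num_heads

for seq_len in (1_000, 10_000, 100_000):
    gigabytes = attention_matrix_bytes(seq_len) / 1e9
    print(f"{seq_len:>7} tokens: {gigabytes:g} GB per head per layer")
```

Every tenfold increase in context length multiplies this figure by one hundred, before accounting for multiple heads and layers.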