Retrieval-Augmented Generation (RAG) is a technique for building AI systems that lets a language model consult external documents before it answers a question. Instead of relying only on what was learned during training, a RAG pipeline first searches a knowledge base for passages relevant to the user's query, then feeds those passages to the model as context. The result is a generated response that is grounded in specific, citable sources rather than purely in the model's internal weights.
How Retrieval-Augmented Generation works
A typical RAG system has two main components: a retriever and a generator. The retriever is usually a vector search index built from a corpus of documents. When a document is added to the index, it is split into chunks and an embedding model converts each chunk into a numerical vector. At query time, the same model embeds the user's query, and a similarity search (commonly a nearest-neighbor lookup scored by cosine similarity or dot product) returns the chunks whose vectors are closest to the query vector. The top-ranked chunks are then inserted into the prompt that is sent to the large language model, often alongside instructions such as “answer using only the provided context.”
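The retrieve-then-prompt flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the sample chunks are invented, and the names `retrieve` and `build_prompt` are hypothetical helpers rather than any library's API. The final call to a language model is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse term-count vector.

    A real system would call a learned embedding model here (assumption).
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt sent to the generator."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Invented sample corpus, for illustration only.
CHUNKS = [
    "Parental leave policy: employees receive sixteen weeks of paid parental leave.",
    "Expense reports must be submitted within thirty days of purchase.",
    "The office is closed on public holidays.",
]

question = "What is our parental leave policy?"
top_chunks = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In practice the chunk vectors would be computed once at indexing time and stored, rather than recomputed for every query as this sketch does for brevity.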
For example, if a user asks an internal company assistant “What is our parental leave policy?”, the retriever finds the relevant section of the employee handbook, and the language model uses that section to compose a precise answer that quotes the policy. This pattern, introduced in the 2020 paper by Lewis et al., whose authors included researchers at Facebook AI Research, separates knowledge (stored in the index) from reasoning (performed by the model), which is why the knowledge base can be updated as source material changes without retraining the model.
Why it matters
RAG addresses three persistent problems with standalone language models. First, it can reduce hallucinations, because the model is anchored to retrieved text rather than improvising, although it does not eliminate them. Second, it lets a system reflect information that did not exist at the model's training cutoff, or that has changed since, simply by updating the index. Third, it makes the model's answers more verifiable: developers and users can inspect the retrieved chunks, cite them, and trace any claim back to a source document.
These properties make RAG a common pattern for enterprise question answering, customer support copilots, legal and compliance search, and AI assistants that need to operate over private or proprietary data without retraining the underlying model.
Key types and patterns
- Naive (or “Retrieve-and-Read”) RAG: a single retrieval step feeds the top-k chunks directly into the generator's prompt.
- Advanced RAG: adds query rewriting, re-ranking, and chunk-level filtering before generation to improve precision.
- Modular RAG: composes the pipeline from interchangeable components such as web search, SQL lookup, or API calls, and may loop between retrieval and generation.
- Graph RAG: builds a knowledge graph from the corpus and retrieves subgraphs of related entities, which can produce more contextual answers on connected data.
- Agentic RAG: lets the language model decide when and what to retrieve, often across multiple tools, before producing a final answer.
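To make the difference between the naive and advanced patterns in the list above concrete, the following sketch adds the three extra stages named for Advanced RAG (query rewriting, re-ranking, and filtering) around a simple first-pass retrieval. Everything here is an illustrative assumption: the `REWRITES` table stands in for an LLM-based query rewriter, Jaccard overlap stands in for a learned re-ranking model, the `min_score` threshold is arbitrary, and the sample chunks are invented.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set used by the toy scorers below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical abbreviation table standing in for an LLM-based query rewriter.
REWRITES = {"pto": "paid time off"}


def rewrite_query(query: str) -> str:
    """Expand known abbreviations so the query matches document wording."""
    return " ".join(REWRITES.get(word.lower(), word) for word in query.split())


def first_pass(query: str, chunks: list[str], k: int) -> list[str]:
    """Cheap candidate retrieval: rank chunks by number of shared terms."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)[:k]


def rerank_score(query: str, chunk: str) -> float:
    """Stand-in for a re-ranking model: Jaccard overlap of the word sets."""
    q, c = tokens(query), tokens(chunk)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def advanced_retrieve(
    query: str, chunks: list[str], k: int = 3, min_score: float = 0.2
) -> list[str]:
    """Rewrite the query, fetch candidates, re-rank them, drop weak matches."""
    rewritten = rewrite_query(query)
    candidates = first_pass(rewritten, chunks, k)
    scored = sorted(
        ((rerank_score(rewritten, c), c) for c in candidates), reverse=True
    )
    return [chunk for score, chunk in scored if score >= min_score]


# Invented sample corpus, for illustration only.
CHUNKS = [
    "Paid time off accrues at two days per month.",
    "Time sheets are due every Friday.",
    "The cafeteria opens at eight.",
]

result = advanced_retrieve("PTO accrual rate", CHUNKS)
print(result)
```

Without the rewriting step, the query "PTO accrual rate" shares no words with any chunk in this toy corpus, which illustrates why a naive single-pass retriever can miss relevant text that uses different wording. The filtering step then keeps loosely related candidates, such as the time sheet chunk, out of the prompt.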
By decoupling the storage of knowledge from the reasoning engine, RAG has become a foundational building block for production AI applications that need to be accurate, current, and auditable. The original research is described in Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (2020), and implementations of the patterns above are available in frameworks such as LlamaIndex and LangChain.