What is Chain-of-Thought Prompting?

Chain-of-thought prompting is a technique for steering a large language model to produce a step-by-step reasoning trace before giving a final answer, rather than responding in a single leap. It was introduced by Wei et al. in the 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," which showed that including a few worked examples with intermediate reasoning steps in the prompt substantially improves the performance of sufficiently large models on arithmetic, commonsense, and symbolic reasoning tasks.

Chain-of-thought prompting is a prompt-engineering technique in which a user instructs a large language model to work through a problem one step at a time, exposing the intermediate reasoning that leads to the final answer. Instead of jumping straight to a conclusion, the model writes out the logical steps in natural language, much like a student showing their work on a math test. The technique was popularized by Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" and has since become a foundation of modern prompt design.

How chain-of-thought prompting works

The core idea is deceptively simple. When a prompt contains one or more worked examples that spell out a reasoning chain ("first I do X, then I compute Y, therefore Z"), the model tends to imitate that structure on the new problem. This is known as few-shot chain-of-thought prompting, and it requires no changes to the model's weights; only the prompt changes.
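
To make the few-shot idea concrete, here is a minimal sketch of how such a prompt might be assembled. The `WorkedExample` class, the `build_few_shot_cot_prompt` function, the "Q:/A:" layout, and the bookshelf example are all illustrative assumptions, not a format prescribed by the original paper.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    """One hand-written demonstration (hypothetical structure)."""
    question: str
    reasoning: str  # the step-by-step chain the model should imitate
    answer: str


def build_few_shot_cot_prompt(examples: list[WorkedExample], question: str) -> str:
    """Join the worked examples, then pose the new question with an open 'A:' slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # The new question ends with a bare "A:" so the model continues with its own chain.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Invented example for illustration only.
examples = [
    WorkedExample(
        question="A shelf holds 3 boxes with 4 books each. 2 books are removed. How many books remain?",
        reasoning="3 boxes of 4 books is 3 * 4 = 12 books. Removing 2 leaves 12 - 2 = 10.",
        answer="10",
    )
]

prompt = build_few_shot_cot_prompt(
    examples,
    "A crate holds 5 bags with 6 apples each. 4 apples are eaten. How many apples remain?",
)
print(prompt)
```

The resulting string would be sent to whatever model is in use; nothing about the model itself changes, which is the point of the technique.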

A closely related variant, zero-shot chain-of-thought, was introduced by Kojima et al. (2022). It appends a single trigger phrase such as "Let's think step by step" to the question, with no worked examples, and that alone is often enough to coax the model into decomposing the problem. Both variants rely on the same observation: for sufficiently large language models, writing out intermediate steps as text before the final answer measurably improves accuracy on multi-step problems.
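
The zero-shot variant can be sketched in a few lines. This is an illustration under stated assumptions: `generate` stands for any function that sends a prompt to a model and returns its completion, `fake_generate` is a canned stand-in so the snippet runs without a model, and the wording of the answer-extraction suffix is an assumption rather than a quoted prompt.

```python
from typing import Callable

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Append the trigger phrase so the model starts its answer by reasoning."""
    return f"Q: {question}\nA: {trigger}"


def answer_with_zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Two passes: first elicit the reasoning, then ask for the final answer only."""
    reasoning_prompt = zero_shot_cot_prompt(question)
    reasoning = generate(reasoning_prompt)
    # Feed the reasoning back and ask the model to state the answer (assumed suffix).
    extraction_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(extraction_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    if prompt.endswith("Therefore, the answer is"):
        return " 10"
    return "3 boxes of 4 books is 12 books. Removing 2 leaves 10."


question = "A shelf holds 3 boxes with 4 books each. 2 books are removed. How many books remain?"
print(zero_shot_cot_prompt(question))
print(answer_with_zero_shot_cot(question, fake_generate))
```

Splitting reasoning and answer extraction into two calls makes the final answer easier to parse, at the cost of a second model call.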

Why it matters

Chain-of-thought prompting matters because it directly attacks one of the most visible failure modes of LLMs: confidently wrong direct answers on multi-step problems. By prompting the model to externalize its reasoning, the technique reduces arithmetic errors, improves performance on commonsense benchmarks, and makes model behavior easier to audit because a human can inspect each step. It is now a building block for more advanced methods such as self-consistency (sampling many chains and voting on the answer), tree-of-thought search, and the reasoning traces produced by modern reasoning models.

Key variants

  • Few-shot CoT: The prompt includes several hand-written examples that demonstrate step-by-step reasoning before the real question. Usually the most reliable approach for smaller models.
  • Zero-shot CoT: Simply add "Let's think step by step" (or a similar trigger) to any prompt. Cheap and surprisingly effective on capable models.
  • Self-consistency: Sample many independent chains of thought and pick the most common final answer, trading compute for accuracy.
  • Tree-of-Thought: Let the model branch and explore multiple reasoning paths, then backtrack or prune weak ones — useful for puzzles and planning tasks.
  • Reasoning-model traces: Newer models such as OpenAI's o-series and DeepSeek-R1 are explicitly trained to produce long chain-of-thought reasoning by default, without a special prompt.
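
Of the variants above, self-consistency is the easiest to sketch in code. The example below assumes each sampled chain ends with a sentence of the form "The answer is N."; that convention, the regular expression, and the function names are illustrative assumptions, and the sampled chains are hard-coded rather than drawn from a model.

```python
import re
from collections import Counter
from typing import Optional

# Assumed convention: every chain states its result as "The answer is <number>".
ANSWER_PATTERN = re.compile(r"The answer is\s*(-?\d+(?:\.\d+)?)")


def extract_answer(chain: str) -> Optional[str]:
    """Pull the final numeric answer out of one chain, or None if it is missing."""
    match = ANSWER_PATTERN.search(chain)
    return match.group(1) if match else None


def majority_answer(chains: list[str]) -> Optional[str]:
    """Vote over the extracted answers; chains without a parsable answer are ignored."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    # On a tie, Counter.most_common keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


# Hard-coded stand-ins for chains sampled at a nonzero temperature.
sampled_chains = [
    "3 * 4 = 12. 12 - 2 = 10. The answer is 10.",
    "3 + 4 = 7. 7 - 2 = 5. The answer is 5.",
    "Twelve books minus two is ten. The answer is 10.",
]
print(majority_answer(sampled_chains))  # prints 10
```

The vote only looks at final answers, so two chains that reach the same number by different routes still count as agreeing.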

Chain-of-thought prompting turned "show your work" from a classroom rule into a powerful, general-purpose tool for getting more reliable answers out of large language models.

Frequently Asked Questions

Do all large language models benefit from chain-of-thought prompting?
The benefit scales with model size. The original 2022 paper found meaningful gains only on models with roughly 100B+ parameters, while smaller models often produced fluent but incorrect reasoning. Modern frontier models, including most released since 2023, respond well to chain-of-thought prompting across a wide range of tasks.
What is the difference between chain-of-thought prompting and chain-of-thought training?
Chain-of-thought prompting is a technique applied at inference time: the user simply asks the model to reason step by step, and no training occurs. Chain-of-thought training, sometimes called fine-tuning on reasoning traces, involves updating the model's weights on datasets of worked solutions so it produces step-by-step reasoning by default. The two are complementary and often combined.
Is chain-of-thought prompting the same as letting the model "think out loud"?
Functionally, yes, but the distinction matters for evaluation. "Thinking out loud" describes any free-form monologue, while chain-of-thought is a specific structured approach that has been measured against baselines and shown to improve accuracy on benchmarks such as GSM8K for math and StrategyQA for commonsense reasoning. The key is that the chain is decomposed into discrete, verifiable steps rather than left as a single fluid paragraph.
Does chain-of-thought prompting always make models more accurate?
No. It helps most on tasks that require multi-step arithmetic, logical deduction, or commonsense reasoning. For simple factual lookups, single-step classification, or creative writing, adding "think step by step" can add verbosity without improving performance, and it occasionally hurts. It also does not guarantee correctness: a chain of thought can be confidently wrong, which is why techniques like self-consistency and verification steps are often layered on top.
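
As one illustration of a verification step, the sketch below re-checks simple arithmetic claims inside a chain of thought. It is a deliberately narrow example: it assumes steps are written as "a op b = c" with exactly two operands, and the function name, pattern, and tolerance are hypothetical choices. Longer expressions such as "2 + 3 + 4 = 9" are outside what it handles correctly.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

NUMBER = r"-?\d+(?:\.\d+)?"
# Matches binary steps such as "12 - 2 = 10" (assumed step format).
STEP_PATTERN = re.compile(rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})")


def check_arithmetic_steps(chain: str, tol: float = 1e-9) -> list[tuple[str, bool]]:
    """Return each 'a op b = c' step found in the chain with whether it holds."""
    results = []
    for match in STEP_PATTERN.finditer(chain):
        left, op, right, claimed = match.groups()
        if op == "/" and float(right) == 0:
            # Division by zero can never be a valid step.
            results.append((match.group(0), False))
            continue
        actual = OPS[op](float(left), float(right))
        results.append((match.group(0), abs(actual - float(claimed)) <= tol))
    return results


chain = "3 * 4 = 12. 12 - 2 = 9. The answer is 9."
for step, ok in check_arithmetic_steps(chain):
    print(f"{step}: {'ok' if ok else 'WRONG'}")
```

A check like this catches only arithmetic slips; it says nothing about whether the steps were the right ones to take, so it complements rather than replaces methods such as self-consistency.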