Do all large language models benefit from chain-of-thought prompting?

No. The benefit scales with model size. Wei et al. (2022), the paper that popularized the technique, found meaningful gains only on models with roughly 100B parameters or more, while smaller models often produced fluent but incorrect reasoning. Modern frontier models, including most released since 2023, respond well to chain-of-thought prompting across a wide range of tasks.

What is the difference between chain-of-thought prompting and chain-of-thought training?

Chain-of-thought prompting is a technique applied at inference time: the user simply asks the model to reason step by step, and no training occurs. Chain-of-thought training, sometimes called fine-tuning on reasoning traces, involves updating the model's weights on datasets of worked solutions so it produces step-by-step reasoning by default. The two are complementary and often combined.

Is chain-of-thought prompting the same as letting the model "think out loud"?

Functionally, yes, but the distinction matters for evaluation. "Thinking out loud" describes any free-form monologue, while chain-of-thought is a specific structured approach that has been measured against baselines and shown to improve accuracy on benchmarks such as GSM8K for math and StrategyQA for commonsense reasoning. The key is that the chain is decomposed into discrete, verifiable steps rather than left as a single fluid paragraph.

Does chain-of-thought prompting always make models more accurate?

No. It helps most on tasks that require multi-step arithmetic, logical deduction, or commonsense reasoning. For simple factual lookups, single-step classification, or creative writing, adding "think step by step" can add verbosity without improving performance, and occasionally hurts it. It also does not guarantee correctness: a chain of thought can be confidently wrong, which is why techniques like self-consistency and verification steps are often layered on top.
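To make "verification steps" concrete, here is a minimal sketch of one possible check: re-computing every simple arithmetic statement of the form `a op b = c` that appears in a chain of thought. The function name `find_arithmetic_errors` and the regular expression are assumptions made for this illustration, not a published method, and the check only catches arithmetic slips, not flawed logic.

```python
import operator
import re

# Hypothetical helper for illustration: maps operator symbols to functions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

# Matches statements such as "12 - 6 = 7" or "2 * 3 = 6".
NUMBER = r"-?\d+(?:\.\d+)?"
STEP = re.compile(rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})")


def find_arithmetic_errors(chain: str) -> list[str]:
    """Return a description of every 'a op b = c' statement that does not hold."""
    errors = []
    for match in STEP.finditer(chain):
        a, op, b, claimed = match.groups()
        if op == "/" and float(b) == 0:
            errors.append(f"{match.group(0)} (division by zero)")
            continue
        actual = OPS[op](float(a), float(b))
        if abs(actual - float(claimed)) > 1e-9:
            errors.append(f"{match.group(0)} (expected {actual:g})")
    return errors


chain = "Two cakes use 2 * 3 = 6 eggs. That leaves 12 - 6 = 7 eggs."
print(find_arithmetic_errors(chain))
```

A chain flagged this way could be discarded or re-sampled; what to do with a flagged chain is a design choice this sketch leaves open.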

What Is Chain-of-Thought Prompting? A Beginner's Guide

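Chain-of-thought prompting is a prompt engineering technique in which the user instructs a large language model to work through a problem step by step, making explicit the intermediate reasoning that leads to the final answer. Instead of jumping straight to a conclusion, the model writes out its logical steps in natural language, like a student showing their work on a math test. The technique was popularized by Wei et al. (2022) in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" and has since become a foundation of modern prompt design.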

How Chain-of-Thought Prompting Works

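The core idea is surprisingly simple. When the prompt includes one or more worked examples in which the reasoning chain is written out, such as "first do X, then compute Y, therefore Z," the model tends to imitate that structure on new problems. This is called few-shot chain-of-thought prompting. It requires no change to the model's weights, only to the prompt.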
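As a rough illustration, a few-shot chain-of-thought prompt can be assembled with plain string formatting. The `WorkedExample` class, the `build_few_shot_cot_prompt` function, the "Q:/A:" layout, and the sample problems are all assumptions made up for this sketch; the resulting string would be passed to whichever model API you use.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    """One hand-written demonstration (hypothetical structure for this sketch)."""
    question: str
    reasoning: str  # intermediate steps, written out in natural language
    answer: str


def build_few_shot_cot_prompt(examples: list[WorkedExample], question: str) -> str:
    """Place the worked examples first, then the new question with an open answer slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    WorkedExample(
        question="A baker has 12 eggs and uses 3 eggs per cake for 2 cakes. How many eggs are left?",
        reasoning="Two cakes use 2 * 3 = 6 eggs. 12 - 6 = 6 eggs remain.",
        answer="6",
    )
]
prompt = build_few_shot_cot_prompt(
    examples,
    "A shelf holds 4 boxes of 5 books each, and 3 books are removed. How many books remain?",
)
print(prompt)
```

Because the prompt ends with an open "A:", the model is nudged to continue in the same pattern: reasoning first, then a final answer.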

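A newer variation, zero-shot chain-of-thought, was proposed by Kojima et al. (2022). Appending a single "magic" phrase such as "Let's think step by step" to a problem prompts the model to break the problem down, with no worked examples needed. Both variations rely on the same underlying capability: sufficiently large language models have learned internal procedures for arithmetic and logic, and surfacing those procedures as text measurably improves answer accuracy.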
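The zero-shot variant needs even less scaffolding. The sketch below appends the trigger phrase to a question and, separately, pulls a numeric answer out of a completion. The function names and the "last number in the text" heuristic are assumptions for illustration only; real answer extraction is usually task-specific.

```python
import re
from typing import Optional

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Append the trigger so the model begins its answer with reasoning."""
    return f"Q: {question}\nA: {trigger}"


def extract_final_number(completion: str) -> Optional[str]:
    """Heuristic (an assumption of this sketch): treat the last number as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


print(zero_shot_cot_prompt("A baker has 12 eggs and uses 3 per cake for 2 cakes. How many are left?"))

# A canned completion stands in for a real model response.
completion = "Two cakes use 2 * 3 = 6 eggs. 12 - 6 = 6. The answer is 6."
print(extract_final_number(completion))
```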

Why It Matters

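Chain-of-thought prompting matters because it directly addresses the most visible failure mode of LLMs: confident but wrong answers to multi-step problems, produced in a single pass. Having the model externalize its reasoning reduces arithmetic errors, improves performance on commonsense benchmarks, and makes the model's behavior easier to audit, because a human can check each step. It is now a building block for more advanced techniques, including self-consistency (sampling many reasoning chains and choosing the answer by majority vote), Tree-of-Thought search, and the reasoning traces generated by the latest reasoning models.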

Key Variations

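Few-shot CoT: The prompt includes several hand-written examples that demonstrate step-by-step reasoning, placed before the actual question. Usually the most reliable approach for smaller models.
Zero-shot CoT: Simply append "Let's think step by step" (or a similar trigger) to any prompt. Cheap, and surprisingly effective on capable models.
Self-consistency: Sample several independent chains of thought and select the most frequent final answer. Trades extra compute for accuracy.
Tree-of-Thought: Let the model branch into and explore multiple reasoning paths, backtracking from or pruning the weak ones. Useful for puzzles and planning tasks.
Reasoning-model traces: Newer models such as OpenAI's o-series and DeepSeek-R1 are explicitly trained to produce long chain-of-thought reasoning natively, by default.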
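Of these variations, self-consistency is the easiest to sketch in a few lines: sample several chains, extract each final answer, and take a majority vote. In the sketch below, `sample_chain` is a hypothetical stand-in for a model call made at non-zero temperature, and the "The answer is N" pattern is an assumed output format; neither comes from a specific library.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Pull the number after 'The answer is' (an assumed output format)."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def self_consistency(sample_chain: Callable[[], str], n_samples: int = 5) -> Optional[str]:
    """Sample n_samples chains and return the most frequent final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain())
        if answer is not None:  # skip chains with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Canned chains stand in for five independent model samples.
canned = iter([
    "Two cakes use 2 * 3 = 6 eggs. 12 - 6 = 6. The answer is 6.",
    "3 + 2 = 5 eggs are used. 12 - 5 = 7. The answer is 7.",
    "Each cake needs 3 eggs, so 6 in total. The answer is 6.",
    "12 - 3 = 9. The answer is 9.",
    "Two cakes use 6 eggs, leaving 6. The answer is 6.",
])
result = self_consistency(lambda: next(canned), n_samples=5)
print(result)
```

The vote is taken over final answers only, so chains that reach the same answer by different routes still reinforce each other. The cost is one model call per sample.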

Chain-of-thought prompting turned the classroom rule "show your work" into a powerful, general-purpose tool for eliciting more reliable answers from large language models.
