Do all large language models benefit from chain-of-thought prompting?

Not equally: the benefit scales with model size. The original 2022 paper (Wei et al.) found meaningful gains only on models with roughly 100B+ parameters, while smaller models often produced fluent but incorrect reasoning. Modern frontier models, including most released since 2023, respond well to chain-of-thought prompting across a wide range of tasks.

What is the difference between chain-of-thought prompting and chain-of-thought training?

Chain-of-thought prompting is a technique applied at inference time: the user asks the model to reason step by step, or shows it a few worked examples, and no weights are updated. Chain-of-thought training, sometimes called fine-tuning on reasoning traces, involves updating the model's weights on datasets of worked solutions so it produces step-by-step reasoning by default. The two are complementary and often combined.
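
To make the contrast concrete, here is a minimal sketch of what each approach operates on. The helper names and the `prompt`/`completion` field names are assumptions chosen for illustration; they are not the schema of any particular fine-tuning API.

```python
import json


def build_cot_prompt(question: str) -> str:
    """Prompting: the reasoning request lives in the input text; no weights change."""
    return f"Q: {question}\nA: Let's think step by step."


def build_training_record(question: str, reasoning_steps: list[str], answer: str) -> str:
    """Training: a worked solution becomes the target text the model is tuned to produce.

    The "prompt"/"completion" keys are hypothetical; real pipelines define their own schema.
    """
    completion = " ".join(reasoning_steps) + f" The answer is {answer}."
    record = {"prompt": f"Q: {question}\nA:", "completion": " " + completion}
    return json.dumps(record)


question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"

# Inference-time prompting: one string sent to an unchanged model.
print(build_cot_prompt(question))

# Training: one line of a (hypothetical) JSONL dataset of worked solutions.
print(build_training_record(
    question,
    ["12 pens is 12 / 3 = 4 groups of 3.", "Each group costs $2, so 4 * 2 = 8."],
    "$8",
))
```

In the first case the step-by-step behavior has to be requested on every call; in the second it is baked into the data the model learns from.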

Is chain-of-thought prompting the same as letting the model "think out loud"?

Functionally, yes, but the distinction matters for evaluation. "Thinking out loud" describes any free-form monologue, while chain-of-thought is a specific structured approach that has been measured against baselines and shown to improve accuracy on benchmarks such as GSM8K for math and StrategyQA for commonsense reasoning. The key is that the chain is decomposed into discrete, verifiable steps rather than left as a single fluid paragraph.

Does chain-of-thought prompting always make models more accurate?

No. It helps most on tasks that require multi-step arithmetic, logical deduction, or commonsense reasoning. For simple factual lookups, single-step classification, or creative writing, adding "think step by step" can add verbosity without improving performance, and occasionally hurts it. It also does not guarantee correctness: a chain of thought can be confidently wrong, which is why techniques like self-consistency and verification steps are often layered on top.
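
As an illustration of a verification step, the sketch below re-checks the simple arithmetic claims inside a chain of thought. It is a deliberately narrow example: the function name and the regular expression are assumptions, and it only recognizes claims of the form `a op b = c`.

```python
import operator
import re

# Matches simple claims such as "12 / 3 = 4" or "4 * 2 = 8".
NUMBER = r"-?\d+(?:\.\d+)?"
STEP_PATTERN = re.compile(rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})")

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def check_arithmetic_steps(chain: str, tol: float = 1e-9) -> list[tuple[str, bool]]:
    """Return each 'a op b = c' claim found in the chain with whether it holds."""
    results = []
    for match in STEP_PATTERN.finditer(chain):
        a, op, b, claimed = match.groups()
        a, b, claimed = float(a), float(b), float(claimed)
        if op == "/" and b == 0:
            # Division by zero can never be a valid step.
            results.append((match.group(0), False))
            continue
        results.append((match.group(0), abs(OPS[op](a, b) - claimed) <= tol))
    return results


chain = "There are 12 / 3 = 4 groups. Each costs $2, so 4 * 2 = 9 dollars."
for step, ok in check_arithmetic_steps(chain):
    print(f"{step!r}: {'ok' if ok else 'WRONG'}")
```

Here the chain reads fluently, but the second step is flagged as wrong. A check like this catches only arithmetic slips; it says nothing about whether the steps are the right ones to take.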

What Is Chain-of-Thought Prompting? A Beginner's Guide

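Chain-of-thought prompting is a prompt engineering technique in which the user instructs a large language model to work through a problem one step at a time, exposing the intermediate reasoning that leads to the final answer. Instead of leaping straight to a conclusion, the model lays out its logical steps in natural language, much like a student showing their work on a math exam. The technique was popularized by Wei et al. (2022) in the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, and it has since become a cornerstone of modern prompt design.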

How Chain-of-Thought Prompting Works

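The core idea is deceptively simple. When a model receives a prompt containing one or more worked examples that show a reasoning chain (such as "first do X, then compute Y, so the result is Z"), it tends to imitate that structure on new problems. This is called few-shot chain-of-thought prompting, and it is implemented through the prompt alone, with no change to the model's weights.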
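
A few-shot chain-of-thought prompt can be assembled with plain string formatting. The sketch below is one possible layout, not a canonical template: the `WorkedExample` class, the function name, the `Q:`/`A:` labels, and the example problem are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    question: str
    reasoning: str  # the step-by-step chain shown to the model
    answer: str


def build_few_shot_cot_prompt(examples: list[WorkedExample], new_question: str) -> str:
    """Concatenate worked examples, then leave the final answer slot open."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # The new question ends with a bare "A:" so the model continues in the same style.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


examples = [
    WorkedExample(
        question="A shop sells pens at 3 for $2. How much do 12 pens cost?",
        reasoning="12 pens is 12 / 3 = 4 groups of 3. Each group costs $2, so 4 * 2 = 8.",
        answer="$8",
    ),
]

print(build_few_shot_cot_prompt(
    examples,
    "A box holds 6 eggs. How many boxes are needed for 30 eggs?",
))
```

The resulting string is sent to the model unchanged; everything the model learns about the desired format comes from the examples in the prompt.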

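Zero-shot chain-of-thought, proposed by Kojima et al. (2022), is a slightly later variant. Appending a single, almost magical phrase such as Let's think step by step to the end of a question is often enough to get the model to break the problem down before answering. Both variants rely on the same underlying capability: a sufficiently large language model has learned internal procedures for arithmetic and logic, and having it write those procedures out as text measurably improves answer accuracy.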
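
In code, zero-shot chain-of-thought amounts to appending the trigger phrase and, optionally, making a second call that asks for the final answer so it is easy to parse. The sketch below assumes a hypothetical `generate` callable that maps a prompt string to the model's continuation; `fake_generate` is a canned stand-in so the example runs offline, and the exact wording of the answer-extraction prompt is an assumption.

```python
from typing import Callable

TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> tuple[str, str]:
    """Elicit a reasoning chain, then ask for the final answer in a second call."""
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = generate(reasoning_prompt).strip()

    # Feed the reasoning back and ask the model to state only the answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    answer = generate(answer_prompt).strip()
    return reasoning, answer


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical canned outputs)."""
    if prompt.endswith("Therefore, the answer is"):
        return " 8"
    return " 12 pens is 4 groups of 3. Each group costs $2, so 4 * 2 = 8."


reasoning, answer = zero_shot_cot(
    "A shop sells pens at 3 for $2. How much do 12 pens cost?",
    fake_generate,
)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Replacing `fake_generate` with a real model client is the only change needed to try this against an actual model.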

Why It Matters

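Chain-of-thought prompting matters because it directly addresses one of the most visible failure modes of LLMs: confidently wrong answers to multi-step problems, produced in a single shot with no intermediate work. Forcing the model to externalize its reasoning reduces arithmetic errors, improves performance on commonsense benchmarks, and makes model behavior easier to audit, because a human can review each step. It has also become a building block for more advanced techniques such as self-consistency (sampling several reasoning chains and choosing the answer by majority vote), tree-of-thought search, and the reasoning traces generated by recent reasoning models.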

Key Variants

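Few-shot CoT: Include several hand-written examples that demonstrate step-by-step reasoning in the prompt before asking the actual question. This is generally the most reliable approach on smaller models.
Zero-shot CoT: Simply append "Let's think step by step" (or a similar trigger phrase) to any prompt. It is cheap and surprisingly effective on capable models.
Self-consistency: Sample many independent reasoning chains and pick the final answer that appears most often, trading extra compute for higher accuracy.
Tree-of-Thought: Let the model branch out and explore several reasoning paths, then backtrack from or prune the weak ones. Useful for puzzles and planning tasks.
Reasoning-model traces: Recent models such as OpenAI's o-series and DeepSeek-R1 are explicitly trained to generate long chain-of-thought reasoning natively, by default.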
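
Of these variants, self-consistency is the easiest to sketch in a few lines: sample several chains, extract each final answer, and take a majority vote. The code below assumes a hypothetical `sample_chain` callable that returns one sampled reasoning chain per call, and it assumes each chain ends with a phrase like "The answer is 8"; both are illustrative conventions, not part of any specific API.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional


def extract_final_answer(chain: str) -> Optional[str]:
    """Pull the number after 'answer is', or None if the chain has no such phrase."""
    match = re.search(r"answer is\s*\$?(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def self_consistency(
    question: str,
    sample_chain: Callable[[str], str],
    n_samples: int = 5,
) -> Optional[str]:
    """Sample n_samples chains and return the most frequent final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_final_answer(sample_chain(question))
        if answer is not None:  # chains without a parsable answer do not vote
            answers.append(answer)
    if not answers:
        return None
    # On a tie, Counter.most_common keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


# Canned chains standing in for samples drawn at nonzero temperature.
canned = itertools.cycle([
    "4 groups at $2 each: 4 * 2 = 8. The answer is 8.",
    "12 pens at $0.50 each is 6. The answer is 6.",
    "12 / 3 = 4, and 4 * 2 = 8. The answer is 8.",
    "I am not sure.",
    "Each group costs $2, so the total is 8. The answer is 8.",
])

print(self_consistency("How much do 12 pens cost at 3 for $2?", lambda q: next(canned)))
```

With the canned samples above, three of the four parsable chains agree on 8, so the vote overrides the one chain that reasoned its way to 6. The cost is one model call per sample.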

Chain-of-thought prompting has turned the classroom maxim "show your work" into a powerful, general-purpose tool for getting more reliable answers from large language models.
