Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
连接思维提示在大语言模型中激发推理能力
Jason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter Fei Xia Ed H. Chi Quoc V. Le Denny Zhou
韦杰森 王学志 Dale Schuurmans Maarten Bosma Brian Ichter Fei Xia Ed H. Chi Quoc V. Le 周邓尼
Google Research, Brain Team {jasonwei,dennyzhou}@google.com
谷歌研究,Brain团队 {jasonwei,dennyzhou}@google.com
Abstract
摘要
We explore how generating a chain of thought-a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-ofthought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.
我们探讨了生成思维链——一系列中间推理步骤——如何显著提高大语言模型执行复杂推理的能力。特别是,我们展示了这种推理能力如何通过一种称为思维链提示 (chain-of-thought prompting) 的简单方法在足够大的语言模型中自然涌现,其中提供了一些思维链示例作为提示中的范例。
Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
在三个大语言模型上的实验表明,链式思维提示(chain-of-thought prompting)在一系列算术、常识和符号推理任务上提高了性能。经验性收益可能是显著的。例如,仅用八个链式思维示例提示 PaLM 540B,在数学文字问题的 GSM8K 基准测试中达到了最先进的准确率,超过了甚至带有验证器的微调 GPT-3。


Figure 1: Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.
图 1: 链式思维提示使大语言模型能够处理复杂的算术、常识和符号推理任务。链式思维推理过程被突出显示。
1 Introduction
1 引言
The NLP landscape has recently been revolutionized by language models (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020, inter alia). Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020; Brown et al., 2020, inter alia). However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., 2021).
自然语言处理领域最近被语言模型 (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020, 等) 所革新。扩大语言模型的规模已被证明带来了一系列好处,例如性能提升和样本效率提高 (Kaplan et al., 2020; Brown et al., 2020, 等)。然而,仅靠扩大模型规模并不足以在诸如算术、常识推理和符号推理等具有挑战性的任务上取得高性能 (Rae et al., 2021)。
This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer. Prior work has given models the ability to generate natural language intermediate steps by training from scratch (Ling et al., 2017) or finetuning a pretrained model (Cobbe et al., 2021), in addition to neuro-symbolic methods that use formal languages instead of natural language (Roy and Roth, 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Second, large language models offer the exciting
这项工作探讨了通过一种简单的方法,如何解锁大语言模型的推理能力,该方法受到两个想法的启发。首先,算术推理技术可以从生成自然语言推理过程受益,这些过程最终导向正确答案。先前的工作已经使模型具备了从头训练 (Ling et al., 2017) 或微调预训练模型 (Cobbe et al., 2021) 来生成自然语言中间步骤的能力,此外还有使用形式语言而非自然语言的神经符号方法 (Roy 和 Roth, 2015; Chiang 和 Chen, 2019; Amini et al., 2019; Chen et al., 2019)。其次,大语言模型提供了令人兴奋的机会

Figure 2: PaLM 540B uses chain-ofthought prompting to achieve new stateof-the-art performance on the GSM8K benchmark of math word problems. Finetuned GPT-3 and prior best are from Cobbe et al. (2021).

图 2: PaLM 540B 使用链式思维提示来实现数学文字题的 GSM8K 基准测试的新最先进性能。微调的 GPT-3 和之前的最佳结果来自 Cobbe 等人 (2021) 。
prospect of in-context few-shot learning via prompting. That is, instead of finetuning a separate language model checkpoint for each new task, one can simply “prompt’ the model with a few input-output exemplars demonstrating the task. Remarkably, this has been successful for a range of simple question-answering tasks (Brown et al., 2020).
通过提示实现少样本学习的前景。也就是说,不必为每个新任务微调单独的大语言模型检查点,只需用几个输入-输出示例来“提示”模型以展示任务即可。令人惊讶的是,这种方法在一系列简单的问答任务中取得了成功 (Brown et al., 2020)。
Both of the above ideas, however, have key limitations. For rationale-augmented training and finetuning methods, it is costly to create a large set of high quality rationales, which is much more complicated than simple input-output pairs used in normal machine learning. For the traditional fewshot prompting method used in Brown et al. (2020), it works poorly on tasks that require reasoning abilities, and often does not improve substantially with increasing language model scale (Rae et al., 2021). In this paper, we combine the strengths of these two ideas in a way that avoids their limitations. Specifically, we explore the ability of language models to perform few-shot prompting for reasoning tasks, given a prompt that consists of triples: <input, chain of thought, output). A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting. An example prompt is shown in Figure 1.
然而,上述两种方法都有关键的局限性。对于理由增强的训练和微调方法,创建大量高质量的理由成本高昂,这比正常机器学习中使用的简单输入输出对要复杂得多。对于 Brown 等人 (2020) 使用的传统少样本提示方法,在需要推理能力的任务上表现不佳,并且随着语言模型规模的增加通常不会显著改善 (Rae 等人, 2021)。在本文中,我们以一种避免这些局限性的方式结合了这两种方法的优点。具体来说,我们探索了语言模型在给定由三元组组成的提示时执行少样本提示进行推理任务的能力:<输入, 思维链, 输出>。思维链是一系列导致最终输出的中间自然语言推理步骤,我们将这种方法称为思维链提示。图 1 显示了一个示例提示。
We present empirical evaluations on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree. Figure 2 illustrates one such result—on the GSM8K benchmark of math word problems (Cobbe et al., 2021), chain-of-thought prompting with PaLM 540B outperforms standard prompting by a large margin and achieves new state-of-the-art performance. A prompting only approach is important because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality. This work underscores how large language models can learn via a few examples with natural language data about the task (c.f. automatically learning the patterns underlying inputs and outputs via a large training dataset).
我们展示了在算术、常识和符号推理基准上的实证评估,表明链式思维提示 (chain-of-thought prompting) 优于标准提示,有时差距显著。图 2 展示了其中一个结果——在 GSM8K 数学应用题基准(Cobbe 等,2021)上,使用 PaLM 540B 的链式思维提示大幅优于标准提示,并实现了新的最先进性能。仅依赖提示的方法很重要,因为它不需要大型训练数据集,并且单个模型检查点可以在不损失通用性的情况下执行许多任务。这项工作强调了大语言模型如何通过几个自然语言任务示例进行学习(例如,自动学习输入和输出模式的基础规律,而不是通过大型训练数据集)。
2 Chain-of-Thought Prompting
2 思维链提示 (Chain-of-Thought Prompting)
Consider one's own thought process when solving a complicated reasoning task such as a multi-step math word problem. It is typical to decompose the problem into intermediate steps and solve each before giving the final answer: “After Jane gives 2 fowers to her mom she has 10 ... then after she gives 3 to her dad she will have 7 ... so the answer is 7." The goal of this paper is to endow language models with the ability to generate a similar chain of thoughta coherent series of intermediate reasoning steps that lead to the final answer for a problem. We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.
考虑在解决复杂的推理任务(如多步骤的数学文字题)时的思考过程。通常会将问题分解为中间步骤并逐一解决,然后再给出最终答案:“简在给了她妈妈2朵花后还剩10朵……然后在给了她爸爸3朵后还剩7朵……所以答案是7。”本文的目标是赋予大语言模型生成类似思维链的能力,即一系列连贯的中间推理步骤,最终得出问题的答案。我们将展示,如果在少样本提示的范例中提供思维链推理的演示,足够大的大语言模型可以生成思维链。
Figure 1 shows an example of a model producing a chain of thought to solve a math word problem that it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution and can interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it mimics a step-by-step thought process for arriving at the answer (and also, solutions/explanations typically come after the final answer (Narang et al., 2020; Wiegreffe et al., 2022; Lampinen et al., 2022,inter alia)).
图 1: 展示了一个模型生成的思维链,用于解决一个它原本会答错的数学应用题。在这种情况下,思维链类似于一个解题过程,并可以被解释为一个解题过程,但我们仍然选择称之为思维链,以更好地捕捉它模仿逐步思考过程以得出答案的概念(此外,解题过程/解释通常出现在最终答案之后 (Narang et al., 2020; Wiegreffe et al., 2022; Lampinen et al., 2022, inter alia))。
Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.
思维链提示有若干吸引人的特性,作为促进语言模型中推理的方法。
In empirical experiments, we will observe the utility of chain-of-thought prompting for arithmetic reasoning (Section 3), commonsense reasoning (Section 4), and symbolic reasoning (Section 5).
在实证实验中,我们将观察链式思维提示在算术推理(第 3 节)、常识推理(第 4 节)和符号推理(第 5 节)中的实用性。
3 Arithmetic Reasoning
3 算术推理
We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chainof-thought prompting when used with the 540B parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).
我们首先考虑图 1 所示形式的数学应用题,这些题目用于测量语言模型的算术推理能力。尽管对人类来说很简单,但算术推理是语言模型经常遇到困难的任务 (Hendrycks et al., 2021; Patel et al., 2021, 等)。值得注意的是,当使用 540B 参数的语言模型时,链式思维提示在多个任务上表现与特定任务微调的模型相当,甚至在具有挑战性的 GSM8K 基准测试中达到了新的最先进水平 (Cobbe et al., 2021)。
图 1:
3.1 Experimental Setup
3.1 实验设置
We explore chain-of-thought prompting for various language models on multiple benchmarks.
我们探索了链式思维提示在多个基准测试中的各种语言模型上的应用。
Benchmarks. We consider the following five math word problem benchmarks: (1) the GSM8K benchmark of math word problems (Cobbe et al., 2021), (2) the SVAMP dataset of math word problems with varying structures (Patel et al., 2021), (3) the AsDiv dataset of diverse math word problems (Miao et al., 2020), (4) the AQuA dataset of algebraic word problems, and (5) the MAWPS benchmark (Koncel-Kedziorski et al., 2016). Example problems are given in Appendix Table 12.
基准测试。我们考虑以下五个数学文字题基准测试:(1) Cobbe 等人 (2021) 提出的 GSM8K 基准测试,(2) Patel 等人 (2021) 提出的具有不同结构的数学文字题数据集 SVAMP,(3) Miao 等人 (2020) 提出的多样化数学文字题数据集 AsDiv,(4) 代数文字题数据集 AQuA,以及 (5) Koncel-Kedziorski 等人 (2016) 提出的 MAWPS 基准测试。示例问题见附录表 12。
Standard prompting. For the baseline, we consider standard few-shot prompting, popularized by Brown et al. (2020), in which a language model is given in-context exemplars of input-output pairs before outputting a prediction for a test-time example. Exemplars are formatted as questions and answers. The model gives the answer directly, as shown in Figure 1 (left).
标准提示。对于基准,我们考虑标准的少样本提示,由 Brown 等人 (2020) 流行化,在这种提示中,大语言模型在输出测试示例的预测之前,会先给出输入-输出对的上下文示例。示例格式为问题和答案。模型直接给出答案,如图 1 (左) 所示。
Chain-of-thought prompting. Our proposed approach is to augment each exemplar in few-shot prompting with a chain of thought for an associated answer, as illustrated in Figure 1 (right). As most of the datasets only have an evaluation split, we manually composed a set of eight few-shot exemplars with chains of thought for prompting—-Figure 1 (right) shows one chain of thought exemplar, and the full set of exemplars is given in Appendix Table 20. (These particular exemplars did not undergo prompt engineering; robustness is studied in Section 3.4 and Appendix A.2.) To investigate whether chain-of-thought prompting in this form can successfully elicit successful reasoning across a range of math word problems, we used this single set of eight chain of thought exemplars for all benchmarks except AQuA, which is multiple choice instead of free response. For AQuA, we used four exemplars and solutions from the training set, as given in Appendix Table 21.
思维链提示。我们提出的方法是在少样本提示中的每个示例中添加一个与答案相关联的思维链,如图 1 (右) 所示。由于大多数数据集只有评估集,我们手动编写了一组八个带有思维链的少样本示例用于提示——图 1 (右) 显示了一个思维链示例,完整的示例集见附录表 20。(这些特定示例未经过提示工程;鲁棒性在第 3.4 节和附录 A.2 中研究。)为了调查这种形式的思维链提示是否能成功地在一系列数学应用题中引发成功的推理,我们使用了这单一组八个思维链示例进行所有基准测试,除了 AQuA(AQuA 是选择题而不是自由回答)。对于 AQuA,我们使用了训练集中给出的四个示例及其解答,如附录表 21 所示。

Figure 3: Examples of (input, chain of thought, output) triples for arithmetic, commonsense, and symbolic reasoning benchmarks. Chains of thought are highlighted. Full prompts in Appendix G.

图 3: 算术、常识和符号推理基准的 (输入, 思维链, 输出) 三元组示例。思维链被高亮显示。完整提示见附录 G。
Language models. We evaluate five large language models. The first is GPT-3 (Brown et al., 2020), for which we use text-ada-001, text-babbage-001, text-curie-001, and text-davinci-002, which presumably correspond to Instruct GP T models of 350M, 1.3B, 6.7B, and 175B parameters (Ouyang et al., 2022).The second is LaMDA (Thoppilan et al., 2022), which has models of 422M, 2B, 8B, 68B, and 137B parameters. The third is PaLM, which has models of 8B, 62B, and 540B parameters. The fourth is UL2 20B (Tay et al., 2022), and the fifth is Codex (Chen et al., 2021, code-davinci-002 in the OpenAI API). We sample from the models via greedy decoding (though follow-up work shows chain-of-thought prompting can be improved by taking the majority final answer over many sampled generations (Wang et al., 2022a)). For LaMDA, we report averaged results over five random seeds, where each seed had a different randomly shuffled order of exemplars. As LaMDA experiments did not show large variance among different seeds, to save compute we report results for a single exemplar order for all other models.
大语言模型。我们评估了五个大语言模型。第一个是 GPT-3 (Brown 等, 2020),我们使用 text-ada-001, text-babbage-001, text-curie-001 和 text-davinci-002,这些模型可能对应于参数量分别为 3.5 亿、13 亿、67 亿和 1750 亿的 Instruct GPT 模型 (Ouyang 等, 2022)。第二个是 LaMDA (Thoppilan 等, 2022),其模型参数量分别为 4.22 亿、20 亿、80 亿、680 亿和 1370 亿。第三个是 PaLM,其模型参数量分别为 80 亿、620 亿和 5400 亿。第四个是 UL2 20B (Tay 等, 2022),第五个是 Codex (Chen 等, 2021, OpenAI API 中的 code-davinci-002)。我们通过贪心解码从模型中采样(尽管后续工作表明,通过多数最终答案可以改进链式思维提示的效果 (Wang 等, 2022a))。对于 LaMDA,我们在五组不同的随机种子上报告平均结果,每个种子都有不同的随机打乱的示例顺序。由于 LaMDA 实验在不同种子之间没有表现出较大差异,为了节省计算资源,我们对所有其他模型报告单个示例顺序的结果。
3.2 Results
3.2 结果
The strongest results of chain-of-thought prompting are summarized in Figure 4, with all experimental outputs for each model collection, model size, and benchmark shown in Table 2 in the Appendix. There are three key takeaways. First, Figure 4 shows that chain-of-thought prompting is an emergent ability of model scale (Wei et al., 2022b). That is, chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of $\mathord{\sim}100\mathrm{B}$ parameters. We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting.
最强的链式思维提示结果总结在图 4 中,附录中的表 2 显示了每个模型集合、模型大小和基准的所有实验输出。有三个关键要点。首先,图 4 显示链式思维提示是模型规模 (model scale) 的一种新兴能力 (Wei et al., 2022b)。也就是说,链式思维提示对小模型的性能没有正面影响,只有在使用参数量约为 ~100B 的模型时才会带来性能提升。我们定性地发现,较小规模的模型生成的链式思维虽然流畅但不合逻辑,导致其性能低于标准提示。
Second, chain-of-thought prompting has larger performance gains for more-complicated problems. For instance, for GSM8K (the dataset with the lowest baseline performance), performance more than doubled for the largest GPT and PaLM models. On the other hand, for SingleOp, the easiest subset of MAWPS which only requires a single step to solve, performance improvements were either negative or very small (see Appendix Table 3).
其次,链式思维提示对更复杂的问题有更大的性能提升。例如,对于 GSM8K (基线性能最低的数据集),最大规模的 GPT 和 PaLM 模型的性能提升了超过一倍。另一方面,对于 SingleOp,这是 MAWPS 中最简单的子集,只需要一个步骤就能解决,性能改进要么是负的,要么非常小(见附录表 3)。
Third, chain-of-thought prompting via GPT-3 175B and PaLM 540B compares favorably to prior state of the art, which typically finetunes a task-specific model on a labeled training dataset. Figure 4 shows how PaLM 540B uses chain-ofthought prompting to achieve new state of the art on GSM8K, SVAMP, and MAWPS (though note that standard prompting already passed the prior best for SVAMP). On the other two datasets, AQuA and ASDiv, PaLM with chain-of-thought prompting reaches within $2%$ of the state of the art (Appendix Table 2).
第三,通过 GPT-3 175B 和 PaLM 540B 进行的链式思维提示 (chain-of-thought prompting) 比之前的最先进技术更优,后者通常在一个标记的训练数据集上微调特定任务的模型。图 4 显示了 PaLM 540B 如何使用链式思维提示在 GSM8K、SVAMP 和 MAWPS 上达到新的最先进水平(尽管需要注意的是,标准提示已经超过了 SVAMP 的此前最佳水平)。在另外两个数据集 AQuA 和 ASDiv 上,PaLM 使用链式思维提示达到了最先进水平的 2% 以内(附录表 2)。
To better understand why chain-of-thought prompting works, we manually examined modelgenerated chainsof thought by LaMDA 137B for GSM8K. Of 50 random examples where the model returned the correct final answer, all of the generated chains of thought were also logically and mathematically correct except two that coincidentally arrived at the correct answer (see Appendix D.1, and Table 8 for examples of correct model-generated chains of thought). We also randomly examined 50 random samples for which the model gave the wrong answer. The summary of this analysis is that $46%$ ofthe chains of thought were almost correct, barring minor mistakes (calculator error, symbol map
为了更好地理解为什么链式思维提示有效,我们手动检查了 LaMDA 137B 为 GSM8K 生成的链式思维。在模型返回正确最终答案的 50 个随机示例中,所有生成的链式思维在逻辑和数学上都是正确的,除了两个巧合地得出正确答案的情况(见附录 D.1 和表 8 中的正确模型生成链式思维的例子)。我们还随机检查了 50 个模型给出错误答案的样本。此分析的总结是 $46%$ 的链式思维几乎正确,只是存在一些小错误(计算器错误、符号映射错误等)。

Figure 4: Chain-of-thought prompting enables large language models to solve challenging math problems. Notably, chain-of-thought reasoning is an emergent ability of increasing model scale. Prior best numbers are from Cobbe et al. (2021) for GSM8K, Jie et al. (2022) for SVAMP, and Lan et al. (2021) for MAWPS. Model scale (# parameters in billions)
图 4: 链式思维提示使大语言模型能够解决具有挑战性的数学问题。值得注意的是,链式思维推理是随着模型规模增加而出现的能力。此前的最佳结果分别来自 Cobbe 等 (2021) 的 GSM8K,Jie 等 (2022) 的 SVAMP,以及 Lan 等 (2021) 的 MAWPS。模型规模(参数量,以十亿计)
ping error, or one reasoning step missing), and that the other $54%$ of the chains of thought had major errors in semantic understanding or coherence (see Appendix D.2). To provide a small insight into why scaling improves chain-of-thought reasoning ability, we performed a similar analysis of errors made by PaLM 62B and whether those errors were fixed by scaling to PaLM 540B. The summary is that scaling PaLM to 540B fixes a large portion of one-step missing and semantic understanding errors in the 62B model (see Appendix A.1).
推理链中存在错误的 46% 是由于单步推理缺失或理解错误(例如:ping 错误,或一个推理步骤缺失),而其他的 54% 的推理链在语义理解和连贯性上存在重大错误(见附录 D.2)。为了深入了解为什么扩展规模可以提高推理能力,我们对 PaLM 62B 的错误进行了类似的分析,并研究了这些错误是否在扩展到 PaLM 540B 后得到修复。总结是,将 PaLM 扩展到 540B 可以修复 62B 模型中很大一部分单步缺失和语义理解错误(见附录 A.1)。
3.3Ablation Study
3.3 消融研究 (Ablation Study)
The observed benefits of using chain-of-thought prompting raises the natural question of whether the same performance improvements can be conferred via other types of prompting. Figure 5 shows an ablation study with three variations of chain of thought described below.
使用链式思维提示所观察到的好处引发了是否可以通过其他类型的提示获得相同的性能提升这一自然问题。图 5 显示了以下描述的链式思维的三种变体的消融研究。
Equation only. One reason for why chain-of-thought prompting might help is that it produces the mathematical equation to be evaluated, and so we test a variation where the model is prompted to output only a mathematical equation before giving the answer. Figure 5 shows that equation only prompting does not help much for GSM8K, which implies that the semantics of the questions in GSM8K are too challenging to directly translate into an equation without the natural language reasoning steps in chain of thought. For datasets of one-step or two-step problems, however, we find that equation only prompting does improve performance, since the equation can be easily derived from the question (see Appendix Table 6).
仅方程。链式思维提示可能有所帮助的一个原因是它生成了要计算的数学方程,因此我们测试了一种变体,在这种变体中,模型被提示仅输出数学方程然后再给出答案。图 5 显示,对于 GSM8K,仅方程提示帮助不大,这意味着 GSM8K 中问题的语义过于复杂,无法直接转换为方程,而不需要链式思维中的自然语言推理步骤。然而,对于一步或两步问题的数据集,我们发现仅方程提示确实提高了性能,因为方程可以从问题中轻松推导出来(见附录表 6)。
Variable compute only. Another intuition is that chain of thought allows the model to spend more computation (i.e., intermediate tokens) on harder problems. To isolate the effect of variable computation from chain-of-thought reasoning, we test a configuration where the model is prompted to output a only sequence of dots $(\cdot\cdot\cdot)$ equal to the number of characters in the equation needed to solve the problem. This variant performs about the same as the baseline, which suggests that variable computation by itself is not the reason for the success of chainof-thought prompting, and that there appears to be utility from expressing intermediate steps via natural language.
仅可变计算。另一种直觉是,思维链允许模型在更难的问题上花费更多的计算资源(即,中间 Token)。为了将可变计算的效果与思维链推理隔离开来,我们测试了一种配置,其中模型被提示输出一个仅由点 $(\cdot\cdot\cdot)$ 组成的序列,该序列的长度等于解决问题所需的方程中的字符数。这种变体的表现与基线大致相同,这表明可变计算本身并不是思维链提示成功的原因,并且通过自然语言表达中间步骤似乎是有用的。
Chain of thought after answer. Another potential benefit of chain-of-thought prompting could simply be that such prompts allow the model to better access relevant knowledge acquired during pre training. Therefore, we test an alternative configuration where the chain of thought prompt is only given after the answer, isolating whether the model actually depends on the produced chain of thought to give the final answer. This variant performs about the same as the baseline, which suggests that the sequential reasoning embodied in the chain of thought is useful for reasons beyond just activating knowledge.
回答后的思考链。思考链提示的另一个潜在好处可能是,这样的提示可以让模型更好地访问预训练期间获得的相关知识。因此,我们测试了一种替代配置,在这种配置中,思考链提示仅在答案之后给出,以隔离模型是否实际上依赖生成的思考链来给出最终答案。这个变体的表现与基线差不多,这表明思考链中体现的顺序推理在激活知识之外还有其他用途。

Figure 5: Ablation study for different variations of prompting using LaMDA 137B and PaLM 540B. Results for other datasets are given in Appendix Table 6 and Table 7.

图 5: 使用 LaMDA 137B 和 PaLM 540B 对不同提示变体的消融研究。其他数据集的结果见附录表 6 和表 7。
3.4 Robustness of Chain of Thought
3.4 思维链的鲁棒性
Sensitivity to exemplars is a key consideration of prompting approaches——for instance, varying the permutation of few-shot exemplars can cause the accuracy of GPT-3 on SST-2 to range from near chance $(54.3%)$ to near state of the art $(93.4%)$ (Zhao et al., 2021). In this final subsection, we evaluate robustness to chains of thought written by different annotators. In addition to the results above, which used chains of thought written by an Annotator A, two other co-authors of this paper (Annotators B and C) independently wrote chains of thought for the same few-shot exemplars (shown in Appendix H). Annotator A also wrote another chain of thought that was more concise than the original, following the style of solutions given in Cobbe et al. (2021).1
对示例的敏感性是提示方法的一个关键考虑因素——例如,改变少样本示例的排列可以导致 GPT-3 在 SST-2 上的准确率从接近随机水平 (54.3%) 到接近最先进水平 (93.4%) [20]。在本节的最后一部分,我们评估由不同标注者编写的思维链的鲁棒性。除了上述结果中使用的由标注者 A 编写的思维链外,本文的另外两位合著者(标注者 B 和 C)独立为相同的少样本示例编写了思维链(见附录 H)。标注者 A 还编写了一个比原始版本更简洁的思维链,遵循 Cobbe 等人 (2021) 给出的解决方案风格。
Figure 6 shows these results for LaMDA 137B on GSM8K and MAWPS (ablation results for other datasets are given in Appendix Table 6 / Table 7). Although there is variance among different chain of thought annotations, as would be expected when using exemplar-based prompting (Le Scao and Rush, 2021; Reynolds and McDonell, 2021; Zha0 et al., 2021), all sets of chain of thought prompts outperform the standard baseline by a large margin. This result implies that successful use of chain of thought does not depend on a particular linguistic style.
图 6: 显示了 LaMDA 137B 在 GSM8K 和 MAWPS 上的结果(其他数据集的消融结果见附录表 6 / 表 7)。尽管不同的思维链标注之间存在差异,这在使用基于范例的提示时是可以预期的 (Le Scao 和 Rush, 2021; Reynolds 和 McDonell, 2021; Zha0 等, 2021),但所有思维链提示集都大大优于标准基线。这一结果表明,成功使用思维链并不依赖于特定的语言风格。

Figure 6: Chain-of-thought prompting has variance for different prompt examples (as expected) but outperforms standard prompting for various annotators as well as for different exemplars.
图 6: 链式思维提示对不同提示示例有差异(如预期),但优于标准提示,适用于不同的标注者以及不同的示例。
source (examples in this dataset already included reasoning steps like a chain of thought).2 Figure 6 shows that these prompts performed comparably with our manually written exemplars, also substantially outperforming standard prompting.
数据集中的示例已经包含了类似思维链的推理步骤)。图 6 显示,这些提示词的表现与我们手动编写的示例相当,并且显著优于标准提示词。
In addition to robustness to annotators, independently-written chains of thought, different exemplars and various language models, we also find that chain-of-thought prompting for arithmetic reasoning is robust to different exemplar orders and varying numbers of exemplars (see Appendix A.2).
除了对标注者的鲁棒性、独立撰写的思维链、不同的示例和各种语言模型外,我们还发现算术推理的思维链提示对不同的示例顺序和不同数量的示例具有鲁棒性(见附录 A.2)。
4 Commonsense Reasoning
4 常识推理 (Commonsense Reasoning)
Although chain of thought is particularly suitable for math word problems, the language-based nature of chain of thought actually makes it applicable to a broad class of commonsense reasoning problems, which involve reasoning about physical and human interactions under the presumption of general background knowledge. Commonsense reasoning is key for interacting with the world and is still beyond the reach of current natural language understanding systems (Talmor et al., 2021).
尽管思维链特别适用于数学文字题,但思维链的语言特性实际上使其适用于广泛的一类常识推理问题。这类问题涉及在一般背景知识的假设下,对物理和人类互动进行推理。常识推理是与世界交互的关键,仍然超出了当前自然语言理解系统的能力范围 (Talmor et al., 2021)。
Benchmarks. We consider five datasets covering a diverse range of commonsense reasoning types. The popular CSQA (Talmor et al., 2019) asks commonsense questions about the world involving complex semantics that often require prior knowledge. StrategyQA (Geva et al., 2021) requires models to infer a multi-hop strategy to answer questions. We choose two specialized evaluation sets from the BIG-bench effort (BIG-bench collaboration, 2021): Date Understanding, which involves inferring a date from a given context, and Sports Understanding, which involves determining whether a sentence relating to sports is plausible or implausible. Finally, the SayCan dataset (Ahn et al., 2022) involves mapping a natural language instruction to a sequence of robot actions from a discrete set. Figure 3 shows examples with chain of thought annotations for all datasets.
基准测试。我们考虑了五个涵盖多种常识推理类型的数据库。流行的 CSQA (Talmor 等, 2019) 提问涉及复杂语义的常识问题,通常需要先验知识。StrategyQA (Geva 等, 2021) 要求模型推断多步策略来回答问题。我们选择了 BIG-bench 项目 (BIG-bench 合作, 2021) 中的两个专门评估集:日期理解,涉及从给定上下文中推断日期;体育理解,涉及判断与体育相关的句子是否合理或不合理。最后,SayCan 数据集 (Ahn 等, 2022) 涉及将自然语言指令映射到离散集合中的机器人动作序列。图 3 显示了所有数据集的带有思维链注释的示例。
Prompts. We follow the same experimental setup as the prior section. For CSQA and StrategyQA, we randomly selected examples from the training set and manually composed chains of thought for them to use as few-shot exemplars. The two BIG-bench tasks do not have training sets, so we selected the first ten examples as exemplars in the evaluation set as few-shot exemplars and report numbers on the rest of the evaluation set. For SayCan, we use six examples from the training set used in Ahn et al. (2022) and also manually composed chains of thought.
提示。我们遵循与前一部分相同的实验设置。对于 CSQA 和 StrategyQA,我们从训练集中随机选择了示例,并手动编写了思考链以用作少样本示例。这两个 BIG-bench 任务没有训练集,因此我们选择了评估集中的前十个示例作为少样本示例,并在评估集的其余部分报告结果。对于 SayCan,我们使用了 Ahn et al. (2022) 中使用的训练集中的六个示例,并且也手动编写了思考链。
Results. Figure 7 highlights these results for PaLM (full results for LaMDA, GPT-3, and different model scales are shown in Table 4). For all tasks, scaling up model size improved the performance of standard prompting; chain-of-thought prompting led to further gains, with improvements appearing to be largest for PaLM 540B. With chain-of-thought prompting, PaLM 540B achieved strong performance relative to baselines, outperforming the prior state of the art on StrategyQA $75.6%$ Vs $69.4%$ and outperforming an unaided sports enthusiast on sports understanding $95.4%$ VS $84%$ These results demonstrate that chain-of-thought prompting can also improve performance on tasks requiring a range of commonsense reasoning abilities (though note that gain was minimal on CsQA).
结果. 图 7: 突出了 PaLM 的这些结果(LaMDA、GPT-3 和不同模型规模的完整结果见表 4)。对于所有任务,增加模型规模改善了标准提示的效果;链式思维提示带来了进一步的提升,改进似乎在 PaLM 540B 上最为显著。使用链式思维提示时,PaLM 540B 相对于基线表现出色,在 StrategyQA 上超越了之前的最先进水平 75.6% 对 69.4%,并在体育理解方面超过了未辅助的体育爱好者 95.4% 对 84%。这些结果表明,链式思维提示也可以提高需要一系列常识推理能力的任务表现(尽管注意在 CsQA 上的增益很小)。


图 1: 模型架构示例
在本研究中,我们提出了一种新的方法来改进大语言模型 (LLM) 的性能。该方法结合了零样本和少样本学习技术,旨在提高模型的泛化能力。具体来说,我们的贡献包括:
- 设计了一个基于 Transformer 的新架构,能够更有效地处理长文本。
- 引入了一种新的训练策略,可以在不增加计算成本的情况下显著提升模型的表现。
- 在多个基准测试中验证了我们方法的有效性,结果表明我们的模型在各种任务上均取得了显著的改进。
表 1: 不同模型的性能对比
| 模型名称 | 准确率 (%) | F1 分数 |
|---|---|---|
| 基线模型 | 85.2 | 84.7 |
| 我们的模型 | 90.5 | 89.3 |
通过上述改进,我们的研究为未来的大语言模型发展提供了新的方向。
Figure 7: Chain-of-thought prompting also improves the commonsense reasoning abilities of language models. The language model shown here is PaLM. Prior best numbers are from the leader boards of CSQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021) (single-model only, as of May 5, 2022). Additional results using various sizes of LaMDA, GPT-3, and PaLM are shown in Table 4.
图 7: 思维链提示也提高了语言模型的常识推理能力。这里展示的语言模型是 PaLM。之前的最佳数据来自 CSQA (Talmor et al., 2019) 和 StrategyQA (Geva et al., 2021) 的排行榜(仅限单个模型,截至 2022 年 5 月 5 日)。使用不同规模的 LaMDA、GPT-3 和 PaLM 获得的其他结果见表 4。
5 Symbolic Reasoning
5 符号推理 (Symbolic Reasoning)
Our final experimental evaluation considers symbolic reasoning, which is simple for humans but potentially challenging for language models. We show that chain-ofthought prompting not only enables language models to perform symbolic reasoning tasks that are challenging in the standard prompting setting, but also facilitates length generalization to inference-time inputs longer than those seen in the few-shot exemplars.
我们最后的实验评估考虑了符号推理,这对人类来说很简单,但对语言模型来说可能具有挑战性。我们展示了链式思维提示不仅使语言模型能够在标准提示设置中执行具有挑战性的符号推理任务,还促进了长度泛化,使其能够处理推理时输入比少样本示例中看到的更长的输入。
Tasks. We use the following two toy tasks.
任务。我们使用以下两个玩具任务。
·Last letter concatenation.This task asks the model to concatenate the last letters of words in a name (e.g., "AmyBrown' $\dot{{\bf\Phi}}\rightarrow,"{\bf y}n,"$ . It is a more challenging version of first letter concatenation, which language models can already perform without chain of thought.? We generate full names by randomly concatenating names from the top one-thousand first and last names from name census data (https: / /namecensus.com/).
最后一字母拼接。此任务要求模型将名字中每个单词的最后一字母拼接起来(例如:"AmyBrown" → "yn")。这是首字母拼接的更具挑战性的版本,而首字母拼接是语言模型已经可以在不进行链式思考的情况下完成的任务。我们通过随机组合来自姓名普查数据 (https://namecensus.com/) 中排名前一千的名和姓来生成全名。
· Coin flip. This task asks the model to answer whether a coin is still heads up after people either flip or don't flip the coin (e.g., “A coin is heads up. Phoebe fips the coin. Osvaldo does not fip the coin. Is the coin still heads up?' $\rightarrow\ ^{\ast}{n o}^{,\ast};$
· 硬币翻转。此任务要求模型回答在人们翻转或不翻转硬币后,硬币是否仍然正面朝上(例如:“硬币正面朝上。Phoebe 翻转了硬币。Osvaldo 没有翻转硬币。硬币是否仍然正面朝上?” $\rightarrow\ ^{\ast}{n o}^{,\ast};$
As the construction of these symbolic reasoning tasks is well-defined, for each task we consider an in-domain test set for which examples had the same number of steps as the training/few-shot exemplars, as well as an out-of-domain (OOD) test set, for which evaluation examples had more steps than those in the exemplars. For last letter concatenation, the model only sees exemplars of names with two words, and then performs last letter concatenation on names with 3 and 4 words.4 We do the same for the number of potential flips in the coin flip task. Our experimental setup uses the same methods and models as in the prior two sections. We again manually compose chains of thought for the few-shot exemplars for each task, which are given in Figure 3.
由于这些符号推理任务的构建是明确定义的,对于每个任务,我们考虑一个领域内测试集,其中示例具有与训练/少样本示例相同的步骤数,以及一个领域外 (OOD) 测试集,其中评估示例的步骤数多于示例中的步骤数。对于最后一个字母连接任务,模型只看到包含两个单词的名字示例,然后对包含 3 和 4 个单词的名字执行最后一个字母连接。我们对硬币翻转任务中的潜在翻转次数也做同样的处理。我们的实验设置使用了与前两节相同的方法和模型。我们再次手动为每个任务的少样本示例组成思考链,这些思考链如图 3 所示。

Model scale (# parameters in billions) Figure 8: Using chain-of-thought prompting facilitates generalization to longer sequences in two symbolic reasoning tasks.
模型规模(数十亿参数) 图 8: 使用链式思维提示有助于在两个符号推理任务中推广到更长的序列。
Results. The results of these in-domain and OOD evaluations are shown in Figure 8 for PaLM, with results for LaMDA shown in Appendix Table 5. With PaLM 540B, chain-of-thought prompting leadsto almost $100%$ solve rates (note that standard prompting already solves coin flip with PaLM 540, though not for LaMDA 137B). Note that these in-domain evaluations are “toy tasks" in the sense that perfect solution structures are already provided by the chains of thought in the few-shot exemplars; all the model has to do is repeat the same steps with the new symbols in the test-time example. And yet, small models still fail—the ability to perform abstract manipulations on unseen symbols for these three tasks only arises at the scale of 1ooB model parameters.
结果。这些领域内和领域外评估的结果如图 8 所示,针对 PaLM 的结果,而 LaMDA 的结果见附录表 5。使用 PaLM 540B,链式思维提示导致几乎 100% 的解决率(请注意,标准提示已经可以解决 PaLM 540 中的硬币翻转问题,但对于 LaMDA 137B 则不行)。需要注意的是,这些领域内评估是“玩具任务”,因为完美的解决方案结构已经在少样本示例中的链式思维中提供;模型只需要在测试时的例子中重复相同的步骤,并使用新的符号。然而,小规模模型仍然失败——对于这三个任务,在未见过的符号上执行抽象操作的能力只在模型参数达到 100B 规模时才会出现。
图 8:
附录表 5:
As for the OOD evaluations, standard prompting fails for both tasks. With chain-of-thought prompting, language models achieve upward scaling curves (though performance is lower than in the in-domain setting). Hence, chain-of-thought prompting facilitates length generalization beyond seen chains of thought for language models of sufficient scale.
对于OOD评估,标准提示对两个任务都失败了。使用链式思维提示 (chain-of-thought prompting),大语言模型实现了向上扩展的性能曲线(尽管性能低于域内设置)。因此,链式思维提示有助于大语言模型在足够规模下实现超越已见思维链条长度的泛化。
6 Discussion
6 讨论
We have explored chain-of-thought prompting as a simple mechanism for eliciting multi-step reasoning behavior in large language models. We first saw that chain-of-thought prompting improves performance by a large margin on arithmetic reasoning, yielding improvements that are much stronger than ablations and robust to different annotators, exemplars, and language models (Section 3). Next, experiments on commonsense reasoning underscored how the linguistic nature of chain-of-thought reasoning makes it generally applicable (Section 4). Finally, we showed that for symbolic reasoning, chain-of-thought prompting facilitates OOD generalization to longer sequence lengths (Section 5). In all experiments, chain-of-thought reasoning is elicited simply by prompting an off-the-shelf language model. No language models were finetuned in the process of writing this paper.
我们探索了链式思维提示作为在大语言模型中激发多步推理行为的简单机制。我们首先发现,链式思维提示在算术推理方面显著提高了性能,带来的改进远超消融实验,并且对不同的标注者、示例和大语言模型 (Section 3) 都具有鲁棒性。接下来,常识推理实验强调了链式思维推理的语言特性使其具有广泛的适用性 (Section 4)。最后,我们展示了对于符号推理,链式思维提示有助于实现更长序列长度的OOD泛化 (Section 5)。在所有实验中,链式思维推理仅通过提示现成的大语言模型来激发。撰写本文的过程中没有对任何大语言模型进行微调。
The emergence of chain-of-thought reasoning as a result of model scale has been a prevailing theme (Wei et al., 2022b). For many reasoning tasks where standard prompting has a flat scaling curve, chainof-thought prompting leads to dramatically increasing scaling curves. Chain-of-thought prompting appears to expand the set of tasks that large language models can perform successfully—in other words, our work underscores that standard prompting only provides a lower bound on the capabilities of large language models. This observation likely raises more questions than it answers-for instance, how much more can we expect reasoning ability to improve with a further increase in model scale? What other prompting methods might expand the range of tasks that language models can solve?
链式思维推理作为模型规模的结果已经是一个普遍的主题 (Wei et al., 2022b)。对于许多标准提示具有平坦扩展曲线的推理任务,链式思维提示导致显著增加的扩展曲线。链式思维提示似乎扩大了大语言模型可以成功执行的任务集——换句话说,我们的工作强调标准提示仅提供了大语言模型能力的下限。这一观察可能提出了更多问题而不是答案——例如,随着模型规模的进一步增加,我们可以期望推理能力提高多少?还有哪些其他提示方法可能会扩大语言模型可以解决的任务范围?
As for limitations, we first qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually “reasoning,”’ which we leave as an open question. Second, although the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for finetuning (though this could potentially be surmounted with synthetic data generation, or zero-shot generalization). Third, there is no guarantee of correct reasoning paths, which can lead to both correct and incorrect answers; improving factual generations of language models is an open direction for future work (Rashkin et al., 2021; Ye and Durrett, 2022; Wiegreffe et al., 2022, inter alia). Finally, the emergence of chain-of-thought reasoning only at large model scales makes it costly to serve in real-world applications; further research could explore how to induce reasoning in smaller models.
关于局限性,我们首先说明,尽管思维链模拟了人类推理者的思维过程,但这并不回答神经网络是否真正“推理”的问题,我们将其作为一个开放问题。第二,虽然在少样本设置中手动为示例添加思维链的成本极低,但这种标注成本在微调时可能会变得难以承受(尽管这可以通过合成数据生成或零样本泛化来克服)。第三,无法保证推理路径的正确性,这可能导致正确和错误的答案;改进语言模型的事实生成是一个未来工作的开放方向(Rashkin 等,2021;Ye 和 Durrett,2022;Wiegreffe 等,2022,等)。最后,思维链推理仅在大模型规模下出现,使其在实际应用中的部署成本高昂;进一步的研究可以探索如何在较小的模型中诱导推理能力。
7 Related Work
7 相关工作
This work is inspired by many research areas, which we detail in an extended related work section (Appendix C). Here we describe two directions and associated papers that are perhaps most relevant.
这项工作受到许多研究领域的启发,我们在扩展的相关工作部分 (附录 C) 中详细说明。这里我们描述两个可能最相关的方向和相关论文。
The first relevant direction is using intermediate steps to solve reasoning problems. Ling et al. (2017) pioneer the idea of using natural language rationales to solve math word problems through a series of intermediate steps. Their work is a remarkable contrast to the literature using formal languages to reason (Roy et al., 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Cobbe et al. (2021) extend Ling et al. (2017) by creating a larger dataset and using it to finetune a pretrained language model rather than training a model from scratch. In the domain of program synthesis, Nye et al. (2021) leverage language models to predict the final outputs of Python programs via first line-to-line predicting the intermediate computational results, and show that their step-by-step prediction method performs better than directly predicting the final outputs.
第一个相关方向是使用中间步骤来解决推理问题。Ling 等 (2017) 开创性地提出了使用自然语言推理来通过一系列中间步骤解决数学文字问题的想法。他们的工作与使用形式语言进行推理的文献形成了鲜明对比(Roy 等,2015;Chiang 和 Chen,2019;Amini 等,2019;Chen 等,2019)。Cobbe 等 (2021) 通过创建一个更大的数据集并用它来微调预训练的语言模型而不是从头训练模型,扩展了 Ling 等 (2017) 的工作。在程序合成领域,Nye 等 (2021) 利用语言模型通过逐行预测中间计算结果来预测 Python 程序的最终输出,并表明他们的逐步预测方法比直接预测最终输出表现更好。
Naturally, this paper also relates closely to the large body of recent work on prompting. Since the popularization of few-shot prompting as given by Brown et al. (2020), several general approaches have improved the prompting ability of models, such as automatically learning prompts (Lester et al., 2021) or giving models instructions describing a task (Wei et al., 2022a; Sanh et al., 2022; Ouyang et al., 2022). Whereas these approaches improve or augment the input part of the prompt (e.g., instructions that are prepended to inputs), our work takes the orthogonal direction of augmenting the outputs of language models with a chain of thought.
当然,本文也与近期关于提示的大量工作密切相关。自从 Brown 等人 (2020) 普及了少样本提示以来,几种通用方法已经提高了模型的提示能力,例如自动学习提示 (Lester 等,2021) 或给模型提供描述任务的指令 (Wei 等,2022a;Sanh 等,2022;Ouyang 等,2022)。虽然这些方法改进或增强了提示的输入部分(例如,附加在输入前的指令),我们的工作则采取了正交的方向,通过思维链来增强大语言模型的输出。
8 Conclusions
8 结论
We have explored chain-of-thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. Through experiments on arithmetic, symbolic, and commonsense reasoning, we find that chain-of-thought reasoning is an emergent property of model scale that allows sufficiently large language models to perform reasoning tasks that otherwise have flat scaling curves. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.
我们已经探索了链式思维提示作为增强语言模型推理能力的一种简单且广泛应用的方法。通过在算术、符号和常识推理上的实验,我们发现链式思维推理是模型规模的一个突现属性,使得足够大的语言模型能够执行其他情况下具有平坦扩展曲线的推理任务。扩大语言模型可以执行的推理任务范围,希望能够激发对基于语言的推理方法的进一步研究。
Acknowledgements
致谢
We thank Jacob Devlin, Claire Cui, Andrew Dai, and Ellie Pavlick for providing feedback on the paper. We thank Jacob Austin, Yuhuai Wu, Henryk Micha lewski, Aitor Lewkowycz, Charles Sutton, and Aakanksha Chowdhery for helpful discussions. We thank Sid Maxwell for notifying us about a mistake in the manual error analysis in the original manuscript.
我们感谢 Jacob Devlin、Claire Cui、Andrew Dai 和 Ellie Pavlick 为本文提供反馈。我们感谢 Jacob Austin、Yuhuai Wu、Henryk Michalewski、Aitor Lewkowycz、Charles Sutton 和 Aakanksha Chowdhery 的有益讨论。我们感谢 Sid Maxwell 通知我们原手稿中的手动错误分析存在错误。
References
参考文献
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopal a krishna n, Karol Hausman, Alex Herzog, et al. 2022. Do as I can, not as I say: Grounding language in robotic afford ances. arXiv preprint arXiv:2204.01691.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopal a krishna n, Karol Hausman, Alex Herzog, 等. 2022. 按我能做的来做,而不是我说的:将语言扎根于机器人能力中. arXiv preprint arXiv:2204.01691.
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpret able math word problem solving with operationbased formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics.
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, 和 Hannaneh Hajishirzi. 2019. MathQA: 朝着基于操作的形式化方法实现可解释的数学文字问题求解迈进。在 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota。Association for Computational Linguistics。
Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. EMNLP.
Daniel Andor, Luheng He, Kenton Lee, 和 Emily Pitler. 2019. 给 BERT 配备计算器:通过阅读理解找到操作和参数. EMNLP.
Jacob Andreas, Dan Klein, and Sergey Levine. 2018. Learning with latent language. NAACL.
Jacob Andreas, Dan Klein, 和 Sergey Levine. 2018. 学习带有潜在语言 (learning with latent language)。NAACL。
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Micha lewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, 等. 2021. 使用大语言模型进行程序合成. arXiv预印本 arXiv:2108.07732.
BIG-bench collaboration. 2021. Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation.
BIG-bench 合作. 2021. 超越模仿游戏:测量和外推语言模型的能力. 准备中.
Kaj Bostrom, Xinyu Zhao, Swarat Chaudhuri, and Greg Durrett. 2021. Flexible generation of natural language deductions. EMNLP.
Kaj Bostrom, 赵欣宇, Swarat Chaudhuri, 和 Greg Durrett. 2021. 自然语言推理的灵活生成. EMNLP.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neel a kant an, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. NeurIPS.
汤姆·布朗,本杰明·曼,尼克·赖德,梅兰妮·苏比亚,贾里德·D·卡普兰,普拉富拉·达里瓦尔,阿文丁·尼尔坎坦,普拉纳夫·希亚姆,吉里什·萨斯特里,阿曼达·阿斯科尔,桑迪尼·阿加沃尔,阿里埃尔·赫伯特-沃斯,格雷琴·克鲁格,汤姆·亨尼根,瑞文·蔡尔德,阿迪蒂亚·拉梅什,丹尼尔·齐格勒,杰弗里·吴,克莱门斯·温特,克里斯·赫塞,马克·陈,埃里克·西格勒,马特乌什·利特温,斯科特·格雷,本杰明·切斯,杰克·克拉克,克里斯托弗·伯纳,山姆·麦肯德里什,阿莱克·拉德福德,伊利亚·苏茨凯弗,和 达里奥·阿莫代伊。2020。大语言模型是少样本学习者。NeurIPS。
Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. ICLR.
乔纳森·蔡,理查德·申,和黎明·宋。2017。通过递归使神经编程架构泛化。ICLR。
Oana-Maria Camburu, Tim Rock t as chel, Thomas L ukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. NeurIPS.
Oana-Maria Camburu, Tim Rock t as chel, Thomas L ukasiewicz, 和 Phil Blunsom. 2018. e-SNLI: 带有自然语言解释的自然语言推理 (Natural language inference with natural language explanations). NeurIPS.
Howard Chen, Jacqueline He, Karthik Narasimhan, and Danqi Chen. 2022. Can rationalization improve robustness?NAACL.
Howard Chen, Jacqueline He, Karthik Narasimhan, 和 Danqi Chen. 2022. 理由化能否提高鲁棒性?NAACL.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
陈马克, Tworek Jerry, Jun Heewoo, Yuan Qiming, Henrique Ponde de Oliveira Pinto, Kaplan Jared, Edwards Harri, Burda Yuri, Joseph Nicholas, Brockman Greg, 等. 2021. 评估训练于代码上的大语言模型 (Large Language Models Trained on Code). arXiv预印本 arXiv:2107.03374.
Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. 2019. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. ICLR.
陈昕云,梁晨,Adams Wei Yu,Denny Zhou,Dawn Song,和 Quoc V. Le. 2019. 神经符号阅读器:分布式表示和符号表示在阅读理解中的可扩展集成. ICLR.
Ting-Rui Chiang and Yun-Nung Chen. 2019. Semantically-aligned equation generation for solving and reasoning math word problems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2656-2668, Minneapolis, Minnesota. Association for Computational Linguistics.
Ting-Rui Chiang 和 Yun-Nung Chen. 2019. 语义对齐的方程生成用于求解和推理数学文字题. 在 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 的第 2656-2668 页,明尼阿波利斯,明尼苏达州. Association for Computational Linguistics.
Gabriel Recchia. 2021. Teaching auto regressive language models complex tasks by demonstration. arXiv preprint arXiv:2109.02102.
Gabriel Recchia. 2021. 通过演示教授自回归语言模型复杂任务. arXiv preprint arXiv:2109.02102.
Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. 2022. A recipe for arbitrary text style transfer with large language models. ACL.
艾米丽·雷夫,达芙妮·伊波利托,安·袁,安迪·科亨,克里斯·卡利森-伯奇,和杰森·韦。2022。使用大语言模型进行任意文本风格迁移的方法。ACL。
Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CH1 Conference on Human Factors in Computing Systems.
拉里亚·雷诺兹和凯尔·麦克唐纳。2021。大语言模型的提示编程:超越少样本范式。2021 年 CH1 人类因素与计算系统会议扩展摘要。
Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. EMNLP.
Subhro Roy 和 Dan Roth. 2015. 解决一般算术文字题. EMNLP.
Checklist
检查清单
- For all authors...
对于所有作者...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
摘要和引言中的主要主张是否准确反映了论文的贡献和范围? [是]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them?[Yes]
(d) 您是否已阅读伦理审查指南并确保您的论文符合这些指南?[是]
- If you are including theoretical results..
- 如果您包含理论结果..。
(a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]
(a) 你是否陈述了所有理论结果的完整假设集?[N/A] (b) 你是否包含了所有理论结果的完整证明?[N/A]
- If you ran experiments..
- 如果你运行了实验..
- If you are using existing assets (e.g., code, data, models) or curating/releasing new assets..
- 如果您正在使用现有资产(例如,代码、数据、模型)或整理/发布新资产..。
- If you used crowd sourcing or conducted research with human subjects...
- 如果您使用了众包或进行了有人类受试者参与的研究...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable?[N/A]
(a) 是否包含了提供给参与者的完整文本指令和截图(如适用)?[N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(b) 您是否描述了任何潜在的参与者风险,并链接到机构审查委员会 (IRB) 批准,如适用?[不适用]
(t) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]
(t)您是否包含了支付给参与者的预计小时工资以及参与者补偿的总金额?[不适用]
A Frequently Asked Questions
常见问题解答
A.1 Why does increasing model scale improve chain-of-thought prompting?
A.1 为什么增加模型规模能改善链式思维提示?
The finding that successful chain-of-thought reasoning predictably emerges only at certain model scales is intriguing. Scaling up language models has been shown to confer benefits such as improved performance and sample efficiency (Kaplan et al., 2020), but chain-of-thought reasoning is emergent in the sense that its success cannot be predicted only by extrapolating the performance of small scale models, as chain of thought actually hurts performance for most models smaller than 10B parameters.
成功的思想链推理只在某些模型规模上可预测地出现这一发现非常有趣。扩大语言模型的规模已被证明可以带来诸如性能提升和样本效率提高等好处 (Kaplan et al., 2020),但思想链推理的出现具有突发性,因为其成功不能仅通过外推小规模模型的表现来预测,实际上对于大多数参数量小于 10B 的模型来说,思想链反而会损害性能。
The question of why model scale improves chain-of-thought prompting is certainly multi-faceted, and we made a preliminary attempt to shed insight into it via error analysis. This small analysis involved manually reading 45 errors made by PaLM 62B and categorizing them into semantic understanding (20 errors), one step missing (18 errors), and other errors (7 errors). The “other category” included hallucinations, repetitive outputs, and symbol mapping errors. This categorization is a coarse one borrowed from the initial error analysis done on LaMDA in Appendix D.2, for which categories were conceived based on what improvements were needed to make the chain of thought correct.
为什么模型规模能改善链式思维提示这个问题无疑是多方面的,我们通过错误分析进行了初步尝试以揭示其原因。这个小规模的分析包括手动阅读 PaLM 62B 所犯的 45 个错误,并将它们分类为语义理解错误 (20 个错误)、缺少一步推理错误 (18 个错误) 和其他错误 (7 个错误)。"其他类别" 包括幻觉、重复输出和符号映射错误。这种分类是粗略的,借鉴了附录 D.2 中对 LaMDA 进行的初始错误分析,其中的类别是根据需要哪些改进来使链式思维正确而构思的。
As shown in Figure 9, scaling PaLM to 540B parameters fixed a substantial portion of errors in all three categories. Examples of semantic understanding and one-step missing errors that were fixed by scaling PaLM to 540B are given in Figure 10. This result appears consistent with a hypothesis that language models acquire a range of semantic understanding and logical reasoning skills as a function of model scale (though note that model scale is often conflated with other factors, such as amount of training compute).
如图 9 所示,将 PaLM 扩展到 540B 参数修复了所有三个类别中的大量错误。图 10 给出了通过将 PaLM 扩展到 540B 参数而修复的语义理解和一步缺失错误的示例。这一结果似乎与以下假设一致:大语言模型随着模型规模的增加获得了广泛的语义理解和逻辑推理能力(尽管需要注意的是,模型规模通常与其他因素混淆,例如训练计算量)。
图 9:
图 10:

Figure 9: Error analysis of 45 problems that $\mathrm{PaLM},62\mathrm{B}$ got incorrect. These errors were categorized that semantic understanding, one step missing, and other. The other category includes hallucinations, repetitive outputs, and symbol mapping errors. Scaling PaLM to 540B fixed a substantial portion of errors in all categories.
图 9: 对 $\mathrm{PaLM},62\mathrm{B}$ 出错的 45 个问题进行错误分析。这些错误被归类为语义理解、一步缺失和其他。其他类别包括幻觉、重复输出和符号映射错误。将 PaLM 扩展到 540B 修复了所有类别中的大量错误。
There are also three notable points regarding why small language models fail. The first observation is that small language models fail at even relatively easy symbol mapping tasks. As demonstrated in Section 5, for even symbolic reasoning tasks that only require generalization to new examples using the same chain of thought logical structure that was given in the few-shot exemplars, small language models still failed. The second observation is that small language models seem to have inherently weaker arithmetic abilities, as shown by Brown et al. (2020), the ability to do simple arithmetic operations (without semantic understanding) requires sufficient model scale. Finally, we noticed qualitatively that small language models often did not generate a final answer that could be parsed, due to either repetitions or logic that never arrived at a final answer.
关于小型语言模型为何失败,有三个值得注意的点。第一点观察是,小型语言模型在相对简单的符号映射任务上也会失败。正如第 5 节所示,即使是只需要使用与少样本示例中相同的思维逻辑结构进行泛化的符号推理任务,小型语言模型仍然失败了。第二点观察是,小型语言模型似乎天生具有较弱的算术能力,正如 Brown 等人 (2020) 所展示的,执行简单算术运算(不涉及语义理解)的能力需要足够的模型规模。最后,我们定性地注意到,小型语言模型经常未能生成可以解析的最终答案,这可能是由于重复或从未得出最终答案的逻辑。
In summary, the success of chain-of-thought reasoning as a result of model scale is a complicated phenomena that likely involves a variety of emergent abilities (semantic understanding, symbol mapping, staying on topic, arithmetic ability, faithfulness, etc). Future work could more thoroughly investigate what properties of pre training data, model architecture, and optimization objective causally enable such reasoning capabilities.
总之,由于模型规模而导致的链式思维推理的成功是一种复杂的现象,可能涉及多种涌现能力(语义理解、符号映射、保持话题一致性、算术能力、忠实性等)。未来的工作可以更彻底地研究预训练数据、模型架构和优化目标的哪些属性因果性地使这些推理能力成为可能。


图中内容无法直接转换为文本,请提供具体需要翻译的文字内容。
Figure 10: Examples of semantic understanding and one-step missing errors that were fixed by scaling PaLM from 62B to 540B.
图 10: 语义理解及一步缺失错误的示例,这些错误通过将 PaLM 从 62B 扩展到 540B 得以修正。
A.2 What is the role of prompt engineering?
A.2 提示工程的作用是什么?
One of the key considerations of prompting is sensitivity to the exact prompt. There is no shortage of work showing that prompts affect language models in unexpected ways (Min et al., 2022). The general way that we created chain of thought annotations was by taking eight exemplars from the training set and decomposing the reasoning process into multiple steps leading to the final answer. Examples of chain of thought annotations are provided in Figure 3, with full prompts given in Appendix G. To analyze how sensitive chain of thought is to prompt engineering, we performed robustness experiments with respect to various factors.
提示的一个关键考虑因素是对确切提示的敏感性。不乏有研究表明提示以意想不到的方式影响语言模型 (Min et al., 2022)。我们创建思维链注释的一般方法是从训练集中选取八个示例,并将推理过程分解为多个步骤,最终得出答案。思维链注释的示例见图 3,完整的提示语请参阅附录 G。为了分析思维链对提示工程的敏感性,我们针对各种因素进行了鲁棒性实验。
· Different annotators. We first analyze robustness to three different annotators (Section 3.4 and Figure 6). Although there is notable variance in performance (which we will discuss later), chain of thought performed better than the baseline by a large margin for all three annotators on eight datasets in arithmetic, commonsense, and symbolic reasoning (Table 6 and Table 7). Similar to the annotation process in Cobbe et al. (2021), annotators were not given specific instructions about how to write the chain of thought annotations other than to simply write the step-by-step reasoning process that led to the final answer. Thus, the annotations were written in each annotator's own linguistic “chain of thought" writing style.
不同的标注者。我们首先分析对三位不同标注者的鲁棒性(第 3.4 节和图 6)。尽管性能存在显著差异(我们将在后文讨论),但在算术、常识和符号推理的八个数据集上,链式思维的表现均大幅优于基线模型(表 6 和表 7)。与 Cobbe 等人 (2021) 的标注过程类似,标注者没有获得关于如何编写链式思维标注的具体指示,只是要求写出导致最终答案的逐步推理过程。因此,标注是以每位标注者自己的语言“链式思维”写作风格编写的。
· Annotators without machine learning background. The GSM8K dataset (Cobbe et al., 2021) conveniently provides a training set with reasoning chains written by crowd compute workers, which enables us to investigate whether chain of thought still works with reasoning chains from an independent source without a background in machine learning. So we randomly sampled three sets of eight exemplars with chains of thought from GSM8K. These chain of thought annotations also outperformed the baseline by a large margin for all four arithmetic datasets (Table 6), indicating that chain of thought is not dependent on a particular set of annotators.
没有机器学习背景的标注者。GSM8K 数据集 (Cobbe et al., 2021) 方便地提供了一个由众包计算工作者编写的推理链训练集,这使我们能够调查来自没有机器学习背景的独立来源的推理链是否仍然有效。因此,我们随机抽取了三组各包含八个带有思考链的样本来自 GSM8K。这些思考链标注在所有四个算术数据集上也大大优于基线(表 6),表明思考链不依赖于特定的一组标注者。
· Different exemplars. The different GSM8K exemplars experiment above (Table 6) also shows that chain-of-thought prompting works for different sets of exemplars. Notably, we test every set of exemplars on all four arithmetic datasets (instead of picking exemplars from the training set for each dataset), which suggests that the exemplars do not necessarily have to come from the same dataset distribution as the test examples.
不同的示例。上述不同的 GSM8K 示例实验 (表 6) 也表明,链式思维提示对不同的示例集有效。值得注意的是,我们在所有四个算术数据集上测试每个示例集(而不是从每个数据集的训练集中挑选示例),这表明示例不一定必须来自与测试示例相同的数据集分布。
· Different order of exemplars. Prior work has shown that in some cases (e.g., classification) even the order of prompts matter—-varying the permutation of few-shot exemplars can cause the accuracy of GPT-3 on SST-2 to range from near chance $(54.3%)$ tonearSOTA $(93.4%)$ (Zha0 et al., 2021). We show the standard deviation of performance from different exemplars in Table 6 and Table 7. Standard deviations with respect to prompt order are relatively minimal in almost all cases. The one exception is the coin flip task, for which exemplar orders have high standard deviation, likely for the reason cited in Zhao et al. (2021)—-for classification, many exemplars of the same category in a row biases the model outputs).
不同的示例顺序。先前的工作已经表明,在某些情况下(例如,分类),提示的顺序也很重要——改变少样本示例的排列可以导致 GPT-3 在 SST-2 上的准确率从接近随机水平 (54.3%) 到接近最先进水平 (93.4%) 不等 (Zhao et al., 2021)。我们在表 6 和表 7 中展示了不同示例的标准差。几乎所有情况下,相对于提示顺序的标准差都相对较小。唯一的例外是硬币翻转任务,对于该任务,示例顺序具有较高的标准差,这可能是因为 Zhao 等人 (2021) 提到的原因——对于分类,连续出现的同一类别示例会偏向模型输出。
表 6:
表 7:
标准差与提示顺序有关的情况在几乎所有情况下都相对较小。唯一的例外是硬币翻转任务,对于该任务,示例顺序具有较高的标准差,这可能是因为 Zhao 等人 (2021) 提到的原因——对于分类,连续出现的同一类别示例会偏向模型输出。
· Different number of exemplars. We also found that gains from chain-of-thought prompting generally still held when there was a varying number of few-shot exemplars. This is shown for five datasets in Figure 11 (we did not have the compute to run this for all datasets). We also found in preliminary experiments that further increasing the number of exemplars in standard prompting did not lead to significant gains (e.g., increasing from 8 to 16 exemplars did not improve the performance of standard prompting enough to catch up with chain-of-thought prompting).
不同的示例数量。我们还发现,当有不同数量的少样本示例时,链式思维提示 (chain-of-thought prompting) 的收益通常仍然存在。这在图 11 中展示了五个数据集的结果(我们没有足够的计算资源来对所有数据集运行此实验)。在初步实验中,我们还发现进一步增加标准提示中的示例数量并未带来显著的改进(例如,从 8 个增加到 16 个示例并没有使标准提示的性能提升到赶上链式思维提示的程度)。
· Different language models. Another interesting question is whether certain prompts that work better for one model work better for other large language models. We find that with the same prompts, chain-of-thought prompting improves performance across all three models (LaMDA, GPT-3, and PaLM) for all datasets except CSQA and StrategyQA for GPT-3 (Table 1, Table 4, Table 5). The fact that gains from chain of thought did not transfer perfectly among models is a limitation; further work could investigate why how different pre-training datasets and model architectures affect the performance gain from chain-of-thought prompting.
不同的大语言模型。另一个有趣的问题是,某些对一个模型更有效的提示是否对其他大语言模型也更有效。我们发现,使用相同的提示,链式思维提示在所有三个模型 (LaMDA, GPT-3, 和 PaLM) 上提高了所有数据集的性能,除了 GPT-3 的 CSQA 和 StrategyQA (表 1, 表 4, 表 5)。链式思维的收益未能在不同模型之间完全转移这一事实是一个局限性;进一步的研究可以调查不同的预训练数据集和模型架构如何影响链式思维提示的性能提升。
Prompt engineering still matters, though. Although the results are relatively robust to the prompt for arithmetic reasoning, we want to be clear that prompt engineering still does matter, and can improve performance significantly in many cases. Though most chain of thought annotations outperform standard prompting, there is large variation in many cases. For instance, for the coin flip task, the performance varied from $99.6%$ for Annotator A to $71.4%$ for Annotator C, though both were above standard prompting $=50.0%$ (see Table 7). There are even tasks where prompt engineering is a requirement for good performance. In preliminary experiments, we tried using chain of thought to enable language models to reverse the order of a list of 5 items. While two co-authors were not able to write chain of thought prompts that solved the task despite their best attempts, a third co-author was able to write a chain of thought that perfectly solved the task.
提示工程仍然很重要。尽管算术推理的结果对提示相对稳健,我们希望明确指出,提示工程仍然很重要,并且在许多情况下可以显著提高性能。虽然大多数思维链注释优于标准提示,但在许多情况下存在很大差异。例如,在硬币翻转任务中,性能从注释者 A 的 $99.6%$ 到注释者 C 的 $71.4%$ 不等,尽管两者都高于标准提示的 $=50.0%$ (见表 7)。甚至有些任务需要良好的提示工程才能取得好成绩。在初步实验中,我们尝试使用思维链来使大语言模型能够反转一个包含 5 个项目的列表顺序。尽管两位合著者尽了最大努力,但未能编写出能解决该任务的思维链提示,而第三位合著者成功编写了一个完美解决该任务的思维链提示。
How to generate chain of thought annotations in a robust fashion could be an interesting direction for future work. For instance, an idea here could be to use a large language model to automatically generate chains of thought via prompting (and potentially optimize this over a validation set).
如何以稳健的方式生成思维链注释可能是一个有趣的研究方向。例如,可以使用大语言模型通过提示自动生成思维链(并可能在验证集上优化这一过程)。
A.3 Will chain-of-thought prompting improve performance for my task of interest?
A.3 思维链提示是否会提高我对感兴趣任务的性能?
While chain-of-thought prompting is in principle applicable for any text-to-text task, it is more helpful for some tasks than others. Based on the experiments in this paper, our intuition is that chain of thought helps the most when three conditions are met: (1) the task is challenging and requires
虽然链式思维提示在原则上适用于任何文本到文本的任务,但对某些任务的帮助比其他任务更大。根据本文的实验,我们的直觉是,当满足以下三个条件时,链式思维帮助最大:(1) 任务具有挑战性且需要
These intuitions are perhaps supported by the arithmetic reasoning results. The performance gain from chain-of-thought prompting is largest for PaLM 540B on GSM8K (challenging multi-step problems, flat scaling curve), which meets these conditions. The performance gain is small for the subsets of MAWPS that only require one or two steps (SingleOP, SingleEq, and AddSub), for which PaLM 540B already achieves performance of $90%$ or higher (and it is also generally true that there is less headroom for improvement when performance is already strong).
这些直觉可能得到了算术推理结果的支持。对于 GSM8K(具有挑战性的多步骤问题,平坦的扩展曲线),PaLM 540B 的思维链提示带来的性能提升最大,这符合这些条件。对于只需要一到两个步骤的 MAWPS 子集(SingleOP,SingleEq 和 AddSub),PaLM 540B 已经达到了 90% 或更高的性能(并且通常情况下,当性能已经很强时,改进的空间较小)。
Although in this paper we focused on multi-step reasoning tasks (arithmetic, commonsense, and symbolic), chain-of-thought prompting can potentially be applied to any task for which humans use a "chain of thought? to solve (at least in principle). We leave the empirical evaluation of chain-of-thought prompting on such diverse tasks (e.g., machine translation, etc.) to future work.
虽然在本文中我们专注于多步推理任务(算术、常识和符号推理),但链式思维提示 potentially 可以应用于人类使用“思维链”来解决的任何任务(至少在原则上)。我们将不同任务(例如,机器翻译等)上的链式思维提示的实证评估留待未来的工作。
A.4 Why is prompting with the equation only not enough for some arithmetic reasoning datasets?
A.4 为什么仅用方程提示对于某些算术推理数据集是不够的?
Prompting with the equation only as an intermediate step does help on many datasets, especially when the datasets only require a few reasoning steps (SVAMP, ASDiv, MAWPS). For GSM8K, however, using the equation only did not improve performance substantially. Based on qualitative analysis, we believe that these questions are too semantically challenging for the model to directly translate them into a math equation. Consider this example from LaMDA 137B:
仅用方程作为中间步骤进行提示在许多数据集上确实有帮助,特别是在数据集只需要少量推理步骤的情况下(SVAMP、ASDiv、MAWPS)。然而,对于 GSM8K,仅使用方程并没有显著提高性能。根据定性分析,我们认为这些问题的语义挑战太大,模型无法直接将它们转换为数学方程。考虑这个来自 LaMDA 137B 的例子:
QUEsTIoN: Mike plays ping pong for 40 minutes. In the first 20 minutes, he scores 4 points. In the second 20 minutes, he scores $25%$ more points. How many total points did he score?
问题:Mike 打乒乓球 40 分钟。在前 20 分钟,他得了 4 分。在后 20 分钟,他得分提高了 25% 。他总共得了多少分?
在后 20 分钟,他得分为 4 分的 125% ,即 4 × 1.25 = 5 分。
因此,他总共得分为 4 + 5 = 9 分。
EQUATION ONLY (WRONG ANSWER): $(4+20*0.25)=6$ The answer is 6. CHAIN OF THOUGHT (CORRECT): Mike played ping pong for 40 minutes. In the first 20 minutes, he scored 4 points. In the second 20 minutes, he scored $25%$ more points. So he scored $25%$ more in the second 20 minutes. $4\mathrm{x}1.25=5$ .So he scored 5 points in the second 20 minutes. So he scored 9 points in total. The answer is 9.
仅公式 (错误答案):$(4+20*0.25)=6$ 答案是 6。
思考过程 (正确):Mike 打了 40 分钟乒乓球。在前 20 分钟,他得了 4 分。在后 20 分钟,他得分提高了 $25%$ 。所以他后 20 分钟得分提高了 $25%$ 。$4 \mathrm{x} 1.25 = 5$ 。所以他后 20 分钟得了 5 分。因此,他总共得了 9 分。答案是 9。
It is hard for the model to directly translate all of the semantics into a single equation, but chain of thought allows it to better reason about each part of the question via intermediate steps in natural language.
模型很难直接将所有的语义翻译成一个单一的公式,但通过链式思维 (chain of thought),它能够通过自然语言中的中间步骤更好地对问题的每个部分进行推理。
B All Experimental Results
B 所有实验结果
This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.
本节包含所有基准测试中不同模型和模型大小的实验结果表格,对比了标准提示与思维链提示的效果。
For the arithmetic reasoning benchmarks, some chains of thought (along with the equations produced) were correct, except the model performed an arithmetic operation incorrectly. A similar observation was made in Cobbe et al. (2021). Hence, we can further add a Python program as an external calculator (using the Python eval function) to all the equations in the generated chain of thought. When there are multiple equations in a chain of thought, we propagate the external calculator results from one equation to the following equations via string matching. As shown in Table 1, we see that adding a calculator significantly boosts performance of chain-of-thought prompting on most tasks.
对于算术推理基准测试,一些思维链(以及产生的方程)是正确的,只是模型在执行算术运算时出现了错误。Cobbe等人 (2021) 也做出了类似的观察。因此,我们可以进一步添加一个Python语言程序作为外部计算器(使用Python语言的eval函数)来处理生成的思维链中的所有方程。当思维链中有多个方程时,我们通过字符串匹配将外部计算器的结果从一个方程传播到后续的方程。如表 1 所示,我们发现添加计算器显著提高了大多数任务中思维链提示的表现。
表 1: 性能对比表
Table 1: Chain of thought prompting outperforms standard prompting for various large language models on five arithmetic reasoning benchmarks. All metrics are accuracy $(%)$ . Ext. calc.: post-hoc external calculator for arithmetic computations only. Prior best numbers are from the following. $a$ Cobbe et al. (2021). b & e: Pi et al. (2022), $c:$ Lan et al. (2021), $d$ : Piekos et al. (2021).
表 1: 思维链提示在五个算术推理基准上优于标准提示的各种大语言模型。所有指标均为准确率 (%) 。Ext. calc.: 仅用于算术计算的事后外部计算器。此前的最佳数字来自以下文献。a Cobbe et al. (2021)。b & e: Pi et al. (2022),c: Lan et al. (2021),d: Piekos et al. (2021)。
| Prompting | GSM8K | SVAMP | ASDiv | AQuA | MAWPS | |
|---|---|---|---|---|---|---|
| Prior best | N/A (微调) | 55a | 57.4b | 75.3c | 37.9d | 88.4e |
| UL2 20B | 标准思维链 4.4 (+0.3) + ext. calc | 4.1 6.9 | 10.1 28.3 | 16.0 12.5 (+2.4) 16.9 (+0.9) 23.6 (+3.1) 34.3 | 20.5 23.6 | 16.6 19.1 (+2.5) 42.7 |
| LaMDA 137B | 标准思维链 14.3 (+7.8) + ext. calc | 6.5 17.8 | 29.5 42.1 | 40.1 37.5 (+8.0) 46.6 (+6.5) 20.6 (-4.9) 53.4 | 25.5 20.6 | 43.2 57.9 (+14.7) 69.3 |
| GPT-3175B (text-davinci-002) | 标准思维链 46.9 (+31.3) 68.9 (+3.2) + ext. calc | 15.6 49.6 | 65.7 70.3 | 70.3 71.1 | 24.8 71.3 (+1.0) 35.8 (+11.0) | 72.7 87.1 (+14.4) |
| Codex (code-davinci-002) | 标准思维链 63.1 (+43.4) 76.4 (+6.5) | 19.7 | 69.9 | 74.0 | 35.8 29.5 | 87.5 78.7 80.4 (+6.4) 45.3 (+15.8) 92.6 (+13.9) |
| PaLM540B | + ext.calc 标准思维链 56.9 (+39.0) + ext. calc | 65.4 17.9 58.6 | 77.0 69.4 79.0 (+9.6) 79.8 | 80.0 72.1 73.9 (+1.8) 35.8 (+10.6) 93.3 (+14.2) 72.6 | 45.3 25.2 35.8 | 93.3 79.2 93.5 |
Table 2: Standard prompting versus chain of thought prompting on five arithmetic reasoning benchmarks. Note that chain of thought prompting is an emergent ability of model scale—-it does not positively impact performance until used with a model of sufficient scale.
表 2: 标准提示与思维链提示在五个算术推理基准上的对比。注意,思维链提示是模型规模的新兴能力——它不会对性能产生正面影响,直到与足够规模的模型一起使用。
| 模型 | 参数量 | GSM8K | SVAMP | ASDiv | AQuA | MAWPS | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| standard | CoT | standard | CoT | standard | CoT | standard | CoT | standard | CoT | ||
| UL2 | 20B | 4.1 | 4.4 | 10.1 | 12.5 | 16.0 | 16.9 | 20.5 | 23.6 | 16.6 | 19.1 |
| LaMDA | 420M | 2.6 | 0.4 | 2.5 | 1.6 | 3.2 | 0.8 | 23.5 | 8.3 | 3.2 | 0.9 |
| 2B | 3.6 | 1.9 | 3.3 | 2.4 | 4.1 | 3.8 | 22.9 | 17.7 | 3.9 | 3.1 | |
| 8B | 3.2 | 1.6 | 4.3 | 3.4 | 5.9 | 5.0 | 22.8 | 18.6 | 5.3 | 4.8 | |
| 68B | 5.7 | 8.2 | 13.6 | 18.8 | 21.8 | 23.1 | 22.3 | 20.2 | 21.6 | 30.6 | |
| 137B | 6.5 | 14.3 | 29.5 | 37.5 | 40.1 | 46.6 | 25.5 | 20.6 | 43.2 | 57.9 | |
| GPT | 350M | 2.2 | 0.5 | 1.4 | 0.8 | 2.1 | 0.8 | 18.1 | 8.7 | 2.4 | 1.1 |
| 1.3B | 2.4 | 0.5 | 1.5 | 1.7 | 2.6 | 1.4 | 12.6 | 4.3 | 3.1 | 1.7 | |
| 6.7B | 4.0 | 2.4 | 6.1 | 3.1 | 8.6 | 3.6 | 15.4 | 13.4 | 8.8 | 3.5 | |
| 175B | 15.6 | 46.9 | 65.7 | 68.9 | 70.3 | 71.3 | 24.8 | 35.8 | 72.7 | 87.1 | |
| Codex | 19.7 | 63.1 | 69.9 | 76.4 | 74.0 | 80.4 | 29.5 | 45.3 | 78.7 | 92.6 | |
| PaLM | 8B | 4.9 | 4.1 | 15.1 | 16.8 | 23.7 | 25.2 | 19.3 | 21.7 | 26.2 | 30.5 |
| 62B | 9.6 | 29.9 | 48.2 | 46.7 | 58.7 | 61.9 | 25.6 | 22.4 | 61.8 | 80.3 | |
| 540B | 17.9 | 56.9 | 69.4 | 79.0 | 72.1 | 73.9 | 25.2 35.8 | 79.2 | 93.3 |
Table 3: Standard prompting versus chain of thought prompting on the four subsets of the MAWPS benchmark. The point of stratifying the MAWPS benchmark is to show that performance gains are minimal on easy one-step or two-step problems where large language models already achieve high performance (e.g., SingleOp, SingleEq, and AddSub).
表 3: 标准提示与思维链提示在 MAWPS 基准的四个子集上的对比。对 MAWPS 基准进行分层的目的是显示在大语言模型已经取得高表现的简单一步或两步问题上 (例如 SingleOp, SingleEq 和 AddSub),性能提升是有限的。
| 模型 | 参数量 | SingleOp | SingleEq | AddSub | MultiArith | ||||
|---|---|---|---|---|---|---|---|---|---|
| Model | standard | CoT | standard | CoT | standard | CoT | standard | CoT | |
| UL2 | 20B | 24.9 | 27.2 | 18.0 | 20.2 | 18.5 | 18.2 | 5.0 | 10.7 |
| LaMDA | 420M | 2.8 | 1.0 | 2.4 | 0.4 | 1.9 | 0.7 | 5.8 | 1.5 |
| 2B | 4.6 | 4.1 | 2.4 | 3.3 | 2.7 | 3.2 | 5.8 | 1.8 | |
| 8B | 8.0 | 7.0 | 4.5 | 4.4 | 3.4 | 5.2 | 5.2 | 2.4 | |
| 68B | 36.5 | 40.8 | 23.9 | 26.0 | 17.3 | 23.2 | 8.7 | 32.4 | |
| 137B | 73.2 | 76.2 | 48.8 | 58.7 | 43.0 | 51.9 | 7.6 | 44.9 | |
| GPT | 350M | 3.2 | 1.8 | 2.0 | 0.2 | 2.0 | 1.5 | 2.3 | 0.8 |
| 1.3B | 5.3 | 3.0 | 2.4 | 1.6 | 2.3 | 1.5 | 2.2 | 0.5 | |
| 6.7B | 13.5 | 3.9 | 8.7 | 4.9 | 8.6 | 2.5 | 4.5 | 2.8 | |
| 175B | 90.9 | 88.8 | 82.7 | 86.6 | 83.3 | 81.3 | 33.8 | 91.7 | |
| Codex | 93.1 | 91.8 | 86.8 | 93.1 | 90.9 | 89.1 | 44.0 | 96.2 | |
| PaLM | 8B | 41.8 | 46.6 | 29.5 | 28.2 | 29.4 | 31.4 | 4.2 | 15.8 |
| 62B | 87.9 | 85.6 | 77.2 | 83.5 | 74.7 | 78.2 | 7.3 | 73.7 | |
| 540B | 94.1 94.1 | 86.5 | 92.3 | 93.991.9 | 42.2 | 94.7 |
Table 4: Standard prompting versus chain of thought prompting on five commonsense reasoning benchmarks. Chain of thought prompting is an emergent ability of model scale—it does not positively impact performance until used with a model of sufficient scale.
表 4: 标准提示与思维链提示在五个常识推理基准上的对比。思维链提示是模型规模的一种新兴能力——它不会对性能产生积极影响,直到与足够规模的模型一起使用。
| 模型 | 参数量 | CSQA | StrategyQA | Date | Sports | SayCan | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| standard | CoT | standard | CoT | standard | CoT | standard | CoT | standard | CoT | ||
| Model UL2 | 20B | 34.2 | 51.4 | 59.0 | 53.3 | 13.5 | 14.0 | 57.9 | 65.3 | 20.0 | 41.7 |
| LaMDA | 420M | 20.1 | 19.2 | 46.4 | 24.9 | 1.9 | 1.6 | 50.0 | 49.7 | 7.5 | 7.5 |
| 2B | 20.2 | 19.6 | 52.6 | 45.2 | 8.0 | 6.8 | 49.3 | 57.5 | 8.3 | 8.3 | |
| 8B | 19.0 | 20.3 | 54.1 | 46.8 | 9.5 | 5.4 | 50.0 | 52.1 | 28.3 | 33.3 | |
| 68B | 37.0 | 44.1 | 59.6 | 62.2 | 15.5 | 18.6 | 55.2 | 77.5 | 35.0 | 42.5 | |
| 137B | 53.6 | 57.9 | 62.4 | 65.4 | 21.5 | 26.8 | 59.5 | 85.8 | 43.3 | 46.6 | |
| GPT | 350M | 14.7 | 15.2 | 20.6 | 0.9 | 4.3 | 0.9 | 33.8 | 41.6 | 12.5 | 0.8 |
| 1.3B | 12.0 | 19.2 | 45.8 | 35.7 | 4.0 | 1.4 | 0.0 | 26.9 | 20.8 | 9.2 | |
| 6.7B | 19.0 | 24.0 | 53.6 | 50.0 | 8.9 | 4.9 | 0.0 | 4.4 | 17.5 | 35.0 | |
| 175B | 79.5 | 73.5 | 65.9 | 65.4 | 43.8 | 52.1 | 69.6 | 82.4 | 81.7 | 87.5 | |
| Codex | - | 82.3 | 77.9 | 67.1 | 73.2 | 49.0 | 64.8 | 71.7 | 98.5 | 85.8 | 88.3 |
| PaLM | 8B | 19.8 | 24.9 | 55.6 | 53.5 | 12.9 | 13.1 | 55.1 | 75.2 | 34.2 | 40.0 |
| 62B | 65.4 | 68.1 | 58.4 | 63.4 | 29.8 | 44.7 | 72.1 | 93.6 | 65.8 | 70.0 | |
| 540B | 78.1 | 79.9 | 68.6 77.8 | 49.0 | 65.3 | 80.5 | 95.4 | 80.8 | 91.7 |
Table 5: Standard prompting versus chain of thought prompting enables length generalization to longer inference examples on two symbolic manipulation tasks.
表 5: 标准提示与思维链提示在两个符号操作任务上的长度泛化对比,使推理示例更长。
| 模型 | 参数量 | 最后一个字母连接 (Last Letter Concatenation) | 硬币翻转 (状态跟踪) (Coin Flip (state tracking)) | ||||
|---|---|---|---|---|---|---|---|
| 2 | OOD:3 | OOD:4 | 2 | OOD:3 | OOD:4 | ||
| 标准 | CoT 标准 | CoT 标准 | CoT | 标准 | CoT 标准 | ||
| Model UL2 | 20B | 0.6 18.8 | 0.0 0.2 | 0.0 | 0.0 | 70.4 | 67.1 |
| LaMDA420M | 0.3 1.6 | 0.0 0.0 | 0.0 | 0.0 | 52.9 | 49.6 | |
| 2B | 2.3 6.0 | 0.0 0.0 | 0.0 | 0.0 | 54.9 | 55.3 | |
| 8B | 1.5 11.5 | 0.0 0.0 | 0.0 | 0.0 | 52.9 | 55.5 | |
| 68B | 4.4 52.0 | 0.0 0.8 | 0.0 | 2.5 | 56.2 | 83.2 | |
| 137B | 5.8 77.5 | 0.0 34.4 | 0.0 | 13.5 | 49.0 | 99.6 | |
| PaLM | 8B | 2.6 18.8 | 0.0 0.0 | 0.0 | 0.2 | 60.0 | 74.4 |
| 62B | 6.8 85.0 | 0.0 59.6 | 0.0 | 0.0 13.4 | 91.4 | 96.8 | |
| 540B | 7.6 99.4 | 0.2 94.8 | 0.0 | 0.0 63.0 | 98.1 100.0 |
Table 6: Ablation and robustness results for arithmetic reasoning datasets. Chain of thought generally outperforms ablations by a large amount. “Equation only" performs in between standard prompting and chain of thought prompting, as it allows for intermediate reasoning steps via equations but does not leverage natural language. Chain of thought prompting has variance (as expected) when used with prompts written by different annotators or when using other exemplars, but still outperforms standard prompting by a large margin. Standard deviation shown is for different order of few-shot prompting exemplars, with five different random seeds. Results here are shown for LaMDA 137B, as additional queries for GPT-3 and PaLM are both limited and expensive.
表 6: 算术推理数据集的消融和鲁棒性结果。链式思维 (Chain of thought) 通常比消融实验有更大的优势。“仅方程”表现介于标准提示和链式思维提示之间,因为它允许通过方程进行中间推理步骤,但不利用自然语言。链式思维提示在使用不同注释者编写的提示或使用其他示例时存在差异(如预期),但仍以较大优势超过标准提示。显示的标准差是针对少样本提示示例的不同顺序,使用五个不同的随机种子。这里的结果显示为 LaMDA 137B,因为 GPT-3 和 PaLM 的额外查询既有限又昂贵。
| GSM8K | SVAMP | ASDiv | MAWPS | ||
|---|---|---|---|---|---|
| 标准提示 | 6.5 ± 0.4 | 29.5 ± 0.6 | 40.1 ± 0.6 | 43.2 ± 0.9 | |
| 链式思维提示 | 14.3 ± 0.4 | 36.7 ± 0.4 | 46.6 ± 0.7 | 57.9 ± 1.5 | |
| 消融实验 | |||||
| · 仅方程 | 5.4 ± 0.2 | 35.1 ± 0.4 | 45.9 ± 0.6 | 50.1 ± 1.0 | |
| . 仅变量计算 | 6.4 ± 0.3 | 28.0 ± 0.6 | 39.4 ± 0.4 | 41.3 ± 1.1 | |
| · 回答后的推理 | 6.1 ± 0.4 | 30.7 ± 0.9 | 38.6 ± 0.6 | 43.6 ± 1.0 | |
| 鲁棒性 | |||||
| . 不同注释者 (B) | 15.5 ± 0.6 | 35.2 ± 0.4 | 46.5 ± 0.4 | 58.2 ± 1.0 | |
| . 不同注释者 (C) | 17.6 ± 1.0 | 37.5 ± 2.0 | 48.7 ± 0.7 | 60.1 ± 2.0 | |
| : 故意简洁风格 | 11.1 ± 0.3 | 38.7 ± 0.8 | 48.0 ± 0.3 | 59.6 ± 0.7 | |
| · 来自 GSM8K 的示例 (Q) | 12.6 ± 0.6 | 32.8 ± 1.1 | 44.1 ± 0.9 | 53.9 ± 1.1 | |
| : 来自 GSM8K 的示例 (β) | 12.7 ± 0.5 | 34.8 ± 1.1 | 46.9 ± 0.6 | 60.9 ± 0.8 | |
| · 来自 GSM8K 的示例 () | 12.6 ± 0.7 | 35.6 ± 0.5 | 44.4 ± 2.6 | 54.2 ± 4.7 |
Table 7: Ablation and robustness results for four datasets in commonsense and symbolic reasoning. Chain of thought generally outperforms ablations by a large amount. Chain of thought prompting has variance (as expected) when used with prompts written by different annotators or when using other exemplars, but still outperforms standard prompting by a large margin. Standard deviation shown is for different order of few-shot prompting exemplars, with five different random seeds. Results here are shown for LaMDA 137B, as additional queries for GPT-3 and PaLM are both limited and expensive. The exception is that we run SayCan using PaLM here, as the SayCan evaluation set is only 120 examples and therefore less expensive to run multiple times.
表 7: 四个数据集在常识推理和符号推理方面的消融和鲁棒性结果。思维链 (Chain of thought) 通常以较大优势胜过消融实验。当使用不同注释者编写的提示或使用其他示例时,思维链提示存在差异(如预期),但仍以较大优势胜过标准提示。显示的标准差是针对少样本提示示例的不同顺序,使用五个不同的随机种子。这里的结果显示为 LaMDA 137B,因为 GPT-3 和 PaLM 的额外查询既有限又昂贵。例外情况是我们在这里使用 PaLM 运行 SayCan,因为 SayCan 评估集只有 120 个示例,因此多次运行的成本较低。
| 常识推理 | 符号推理 | ||||
|---|---|---|---|---|---|
| Date | Sports | SayCan | Concat | Coin | |
| 标准提示 | 21.5 ± 0.6 | 59.5 ± 3.0 | 80.8 ± 1.8 | 5.8 ± 0.6 | |
| 思维链提示 | 26.8 ± 2.1 | 85.8 ± 1.8 | 91.7 ± 1.4 | 77.5 ± 3.8 | |
| 消融实验 | |||||
| 只有可变计算 | 21.3 ± 0.7 | 61.6 ± 2.2 | 74.2 ± 2.3 | 7.2 ± 1.6 | |
| · reasoningafteranswer | 20.9 ± 1.0 | 63.0 ± 2.0 | 83.3 ± 0.6 | 0.0 ± 0.0 | |
| 鲁棒性 | |||||
| · 不同注释者 (B) | 27.4 ± 1.7 | 75.4 ± 2.7 | 88.3 ± 1.4 | 76.0 ± 1.9 | |
| . 不同注释者 (C) | 25.5 ± 2.5 | 81.1 ± 3.6 | 85.0 ± 1.8 | 68.1 ± 2.2 |
C Extended Related Work
C 扩展相关工作
Chain-of-thought prompting is a general approach that is inspired by several prior directions: prompting, natural language explanations, program synthesis/execution, numeric and logical reasoning, and intermediate language steps.
思维链提示是一种通用方法,受到以下几个先前方向的启发:提示、自然语言解释、程序合成/执行、数值和逻辑推理以及中间语言步骤。
C.1 Prompting
C.1 提示词设计 (Prompting)
The recent success of large-scale language models has led to growing interest in improving their capability to perform tasks via prompting (Brown et al. (2020), and see Liu et al. (2021) for a survey). This paper falls in the category of general prompting approaches, whereby input prompts are optimized to allow a single large language model to better perform a variety of tasks (Li and Liang, 2021; Lester et al., 2021; Reif et al., 2022, inter alia).
大规模语言模型的近期成功引发了通过提示改进其任务执行能力的兴趣增加 (Brown 等 (2020),综述见 Liu 等 (2021))。本文属于通用提示方法类别,通过优化输入提示使单一的大语言模型能够更好地执行各种任务 (Li 和 Liang, 2021; Lester 等, 2021; Reif 等, 2022, 等)。
One recent line of work aims to improve the ability of language models to perform a task by providing instructions that describe the task (Raffel et al., 2020; Wei et al., 2022a; Ouyang et al., 2022; Sanh et al., 2022; Wang et al., 2022b). This line of work is related because it also augments input-output pairs with meta-data. But whereas an instruction augments the input to a task (instructions are typically prepended to the inputs), chain-of-thought prompting augments the outputs of language models. Another related direction is sequentially combining the outputs of language models; human-computer interaction (HCI) work (Wu et al., 2022a,b) has shown that combining sequential generations of language models improves task outcomes in a 20-person user study.
最近的一项研究工作旨在通过提供描述任务的指令来提高语言模型执行任务的能力 (Raffel et al., 2020; Wei et al., 2022a; Ouyang et al., 2022; Sanh et al., 2022; Wang et al., 2022b)。这条研究路线与此相关,因为它也通过元数据增强了输入-输出对。然而,指令是增强任务的输入(通常将指令添加到输入之前),而思维链提示则是增强语言模型的输出。另一个相关的方向是顺序组合语言模型的输出;人机交互 (HCI) 研究 (Wu et al., 2022a,b) 表明,在一项20人的用户研究中,组合语言模型的顺序生成可以改善任务结果。
C.2 Natural language explanations
C.2 自然语言解释
Another closely related direction uses natural language explanations (NLEs), often with the goal of improving model interpret ability (Zhou et al., 2020; Wiegreffe and Marasovic, 2021, inter alia). That line of work typically focuses on natural language inference (Camburu et al., 2018; Yordanov et al., 2021; Bostrom et al., 2021), and produces explanations either simultaneously to or after the final prediction (Narang et al., 2020; Majumder et al., 2021; Wiegreffe et al., 2021, 2022). By contrast, the chain of thought processing considered in this paper occurs before the final answer. And while NLE aims mostly to improve neural network interpret ability (Rajagopal et al., 2021), the goal of chain-of-thought prompting is to allow models to decompose multi-hop reasoning tasks into multiple steps—-interpret ability is just a side effect. Marasovic et al. (2022) show that prompt-based finetuning with NLE improves NLI and classification performance, though they largely focus on evaluating explanation plausibility. In comparison, our work focuses on a range of arithmetic, commonsense, and symbolic tasks that require multi-hop reasoning.
另一个密切相关的方法使用自然语言解释 (NLEs),通常目的是提高模型的可解释性 (Zhou et al., 2020; Wiegreffe 和 Marasovic, 2021, 等)。这类研究通常集中在自然语言推理 (Camburu et al., 2018; Yordanov et al., 2021; Bostrom et al., 2021),并在最终预测的同时或之后生成解释 (Narang et al., 2020; Majumder et al., 2021; Wiegreffe et al., 2021, 2022)。相比之下,本文考虑的思维链处理发生在最终答案之前。虽然 NLE 主要旨在提高神经网络的可解释性 (Rajagopal et al., 2021),但思维链提示的目标是让模型将多步推理任务分解为多个步骤——可解释性只是一个副作用。Marasovic et al. (2022) 显示,基于提示的微调与 NLE 可以提高自然语言推理和分类性能,尽管他们主要关注评估解释的合理性。相比之下,我们的工作专注于需要多步推理的算术、常识和符号任务。
C.3 Program synthesis and execution
C.3 程序合成与执行
Using intermediate reasoning steps has a long history in program synthesis and execution (Zaremba and Sutskever, 2014, inter alia). Recent work along in this direction has included a number of architectural innovations (Cai et al., 2017; Dong et al., 2019; Yan et al., 2020), as well as the use of large language models (Chen et al., 2021; Austin et al., 2021). The program execution work closest to ours is perhaps Nye et al. (2021), which show that large language models can perform up to 10-digit addition, evaluate polynomials, and execute python programs. Whereas generating a program and then executing it can be viewed as a type of reasoning, our work generalizes such domain-specific primitives to natural language, which is open-domain and relevant to any text-to-text NLP task in principle.
使用中间推理步骤在程序合成和执行中有着悠久的历史 (Zaremba 和 Sutskever, 2014, 等)。近期在这方面的工作包括一系列架构创新 (Cai 等, 2017; Dong 等, 2019; Yan 等, 2020),以及大语言模型的使用 (Chen 等, 2021; Austin 等, 2021)。与我们工作最接近的程序执行研究或许是 Nye 等 (2021),他们展示了大语言模型可以执行多达 10 位数的加法、评估多项式和执行 Python语言 程序。虽然生成程序然后执行它可被视为一种推理形式,但我们的工作将此类特定领域的原语推广到自然语言,使其成为开放领域的,并原则上适用于任何文本到文本的 NLP 任务。
C.4 Numeric and logical reasoning
C.4 数值和逻辑推理
Numeric and logical reasoning has been a long-studied task in machine learning and natural language processing (Lev et al., 2004, inter alia). Recent work has also aimed to inject numeric reasoning abilities in language models in various ways, such as augmenting BERT with a predefined set of executable operations (Andor et al., 2019), including a graph neural network (Ran et al., 2019), and using specialized training procedures (Piekos et al., 2021). Another line of work aims to enable language models to perform logical or formal reasoning, often by vera b li zing the rules in natural language formal rules using language (Clark et al., 2020; Saeed et al., 2021; Liang et al., 2021).
数值和逻辑推理一直是机器学习和自然语言处理中的长期研究任务 (Lev et al., 2004, 等)。近期的研究也尝试以各种方式将数值推理能力注入语言模型,例如通过为 BERT 增加一组预定义的可执行操作 (Andor et al., 2019),引入图神经网络 (Ran et al., 2019),以及使用专门的训练程序 (Piekos et al., 2021)。另一条研究路线旨在使语言模型能够进行逻辑或形式推理,通常是通过将规则用自然语言形式化来实现 (Clark et al., 2020; Saeed et al., 2021; Liang et al., 2021)。
Perhaps the most-related work here is Recchia (2021), which shows that finetuning enables longhand module operations, which has previously been difficult for performers. Whereas work in this direction is often task-specific and uses finetuning, we show that chain-of-thought prompting works for a broad range of tasks without any finetuning.
也许这里最相关的工作是 Recchia (2021),该研究表明微调使得长格式模块操作成为可能,而这以前对于执行者来说是困难的。而这一方向上的工作通常特定于任务并使用微调,我们则表明链式思维提示(chain-of-thought prompting)可以在不需要任何微调的情况下应用于广泛的任务。
C.5 Intermediate language steps
C.5 中间语言步骤
Extensive prior work has shown the benefits of endowing neural networks with the ability to produce intermediate steps via training or finetuning confers various benefits in a range of scenarios. As examples, it has been shown that natural language intermediate steps can improve performance (Zaidan et al., 2007; Yao et al., 2021; Hase and Bansal, 2022; Gu et al., 2022), improve robustness (Chen et al., 2022), speed up training (Hancock et al., 2018), mitigate bias (Dua et al., 2020), and even help in image and reinforcement learning settings (Andreas et al., 2018). To endow models with the ability to produce intermediate steps, prior work typically finetunes models on either manually annotated training datasets (Camburu et al., 2018; Rajani et al., 2019, inter alia) or generates synthetic datasets (Talmor et al., 2020; Zelikman et al., 2022). Compared with these training or finetuning methods, our work shows that various natural language reasoning abilities can be elicited in off-theshelf language models of sufficient scale simply via prompting. This prompting setup is important because it allows for intermediate step reasoning without a large number of labeled annotations, and because a single model can perform a range of reasoning tasks without any gradient updates.
大量先前的工作表明,赋予神经网络通过训练或微调生成中间步骤的能力在各种场景中带来了诸多好处。例如,研究表明自然语言中间步骤可以提高性能 (Zaidan 等, 2007; Yao 等, 2021; Hase 和 Bansal, 2022; Gu 等, 2022),提高鲁棒性 (Chen 等, 2022),加速训练 (Hancock 等, 2018),减轻偏差 (Dua 等, 2020),甚至在图像和强化学习环境中也有帮助 (Andreas 等, 2018)。为了使模型具备生成中间步骤的能力,以往的工作通常通过对人工标注的训练数据集进行微调 (Camburu 等, 2018; Rajani 等, 2019, 等等) 或生成合成数据集 (Talmor 等, 2020; Zelikman 等, 2022)。与这些训练或微调方法相比,我们的工作表明,通过提示,可以在现成的大规模语言模型中激发各种自然语言推理能力。这种提示设置非常重要,因为它允许在不需要大量标注的情况下进行中间步骤推理,并且单个模型可以在不进行任何梯度更新的情况下执行多种推理任务。
D Appendix: Additional Analysis
D 附录:附加分析
D.1 Correct Chain of Thought Analysis
D.1 正确的思维链分析
As mentioned in the main text, we analyze 50 chains of thought from LaMDA 137B that led to correct answers in the GSM8K dataset. Of these 50, only one arrived at the correct answer through incorrect reasoning (shown in Table 9: “correct by chance"). The other 49 had correct logic and math, with examples shown in Table 8. Five had minor imperfections while maintaining coherent and understandable logic:
如正文所述,我们分析了来自 LaMDA 137B 的 50 条思维链,这些链导致了 GSM8K 数据集中的正确答案。在这 50 条中,只有一条通过错误的推理得出了正确答案(见表 9: “正确出于偶然”)。其他 49 条具有正确的逻辑和数学推理,示例见表 8。其中有五条存在轻微瑕疵,但仍然保持连贯且可理解的逻辑:
Table 8: Example of correct chains of thought produced by the model for the GSM8K dataset.
表 8: 模型为 GSM8K 数据集生成的正确思维链示例。
likely that models could arrive at the correct answer via an incorrect reasoning path (e.g., all the com mm on sense reasoning datasets we evaluate). This is a limitation, and future work should perform an analysis of the factuality of such chains of thought.
模型可能会通过错误的推理路径得出正确答案(例如,我们评估的所有常识推理数据集)。这是一个局限性,未来的工作应该对这类思维链的事实性进行分析。
Table 9: Of 50 examples that the model got correct, there were 7 salient cases where the chain of thought the model generated was imperfect. We found the other 43 to be equivalent to what a human would potentially generate as a ground truth.
表 9: 在模型正确的 50 个例子中,有 7 个显著案例的推理链是不完美的。我们发现其他 43 个例子等同于人类可能生成的真实答案。
QUEsTIoN: There are 36 penguins sunbathing in the snow. One-third of them jump in and swim in the ocean. Another one-third go inside the cave to eat their dinner. How many penguins are still left sunbathing?
问题:有 36 只企鹅在雪地中晒太阳。其中三分之一跳进海里游泳。另外三分之一进入洞穴吃晚餐。还有多少只企鹅在晒太阳?
MODEL ANsWER (CORRECT BY CHANCE): There are 36 penguins. One-third of them jump in and swim in the ocean. So that is 12 penguins. Another one-third go inside the cave to eat their dinner. So that is 12 penguins. The answeris12.
模型答案(偶然正确):有 36 只企鹅。其中三分之一跳进海里游泳,所以是 12 只企鹅。另外三分之一进入洞穴吃晚餐,所以也是 12 只企鹅。答案是 12 。
QUEsTIoN: John has 2 houses with 3 bedrooms each. Each bedroom has 2 windows each. There are an additional 4 windows in each house not connected to bedrooms. How many total windows are there between the houses?
问题:John 有 2 栋房子,每栋房子有 3 间卧室,每间卧室有 2 扇窗户。每栋房子还有额外的 4 扇窗户不在卧室里。这两栋房子总共有多少扇窗户?
MoDEL ANsWER (CORRECT BUT UNDER SPECIFIED STATEMENTS): There are 2 houses with 3 bedrooms each. Each bedroom has 2 windows each. So there are $2\mathrm{x}3=6$ windows in each house. There are an additional 4 windows in each house not connected to bedrooms. So there are $4+4=8$ windows in each house. So there are $6\mathrm{~x~}2=12$ windows in each house. So there are $12+8=20$ windows in both houses. The answer is 20.
模型答案(正确但描述不完全):有 2 栋房子,每栋房子有 3 间卧室。每间卧室有 2 扇窗户。所以每栋房子里有 $2\mathrm{x}3=6$ 扇窗户。每栋房子里还有额外的 4 扇窗户不属于卧室。所以每栋房子里有 $4+4=8$ 扇窗户。因此每栋房子里有 $6\mathrm{~x~}2=12$ 扇窗户。所以两栋房子里总共有 $12+8=20$ 扇窗户。答案是 20。
QUEsTIoN: If Ann is 9 years old and her brother is twice her age, how old will her brother be in 3 years?
问题:如果安 9 岁,她哥哥是她的两倍年龄,那么在 3 年后她哥哥将会是多少岁?
MoDEL ANSWER (CORRECT BUT UNRELATED STATEMENT): Ann is 9 years old. Her brother is twice her age. So her brother is 18 years old. In 3 years, she will be 12. So her brother will be $18+3=21$ yearsold.The answeris21.
模型答案(正确但不相关陈述):安妮 9 岁。她哥哥是她年龄的两倍。所以她哥哥是 18 岁。在 3 年后,她将会是 12 岁。所以她哥哥将会是 18+3=21 岁。答案是 21。
QUEsTION: A local gas station is selling gas for $\mathbb{S}3.00$ a gallon. An app company is offering. $\mathbb{S}.20$ cash back per gallon if you fill up at this station. If someone buys 10 gallons of gas, how much with their gas be, after the cash back rewards?
问题:当地一家加油站以每加仑 $\mathbb{S}3.00$ 的价格出售汽油。一家应用程序公司如果在此加油站加油,提供每加仑 $\mathbb{S}.20$ 的现金返还。如果某人购买了 10 加仑的汽油,在享受现金返还优惠后,他们需要支付多少?
解答:原价为 10 加仑 × $3.00 = $30.00。现金返还为 10 加仑 × $0.20 = $2.00。因此,最终支付金额为 $30.00 - $2.00 = $28.00。
MoDEL ANSWER (CORRECT BUT IMPERFECT EQUATION): The gas is originally 3 dollars per gallon. If you buy 10 gallons,that is $3\mathrm{x}10=30$ dollars. If you get 20 cents off per gallon, that is $20\mathrm{x}10=2$ dollars.So the total is $30-2=28$ dollars. The answer is 28 dollars.
气体原本是每加仑 3 美元。如果你购买 10 加仑,那是 $3\mathrm{x}10=30$ 美元。如果你每加仑可以节省 20 美分,那是 $20\mathrm{x}10=2$ 美元。所以总额是 $30-2=28$ 美元。答案是 28 美元。
QUEsTIoN: Nissa hires 60 seasonal workers to play elves in her department store's Santa village. A third of the elves quit after children vomit on them, then 10 of the remaining elves quit after kids kick their shins. How many elves are left?
问题:Nissa 雇佣了 60 名季节性工人在她百货公司的圣诞村扮演精灵。三分之一的精灵在孩子呕吐在他们身上后辞职了,然后剩下的精灵中有 10 名在孩子踢他们的胫骨后辞职。还剩下多少名精灵?
MoDEL ANSWER (CORRECT BUT IMPERFECT EQUATION): Nissa hires 6O seasonal workers. A third of themquit.So $60-1/3=40$ elves are left. Then 10 elves quit. So $40-10=30$ elves are left. The answer is 30 elves.
模型答案(正确但不完美的方程):Nissa 雇佣了 60 名季节性工人。其中三分之一的人离职了,所以 $60-1/3*60=40$ 名工人留下。然后 10 名工人离职。所以 $40-10=30$ 名工人留下。答案是 30 名工人。
QUEsTIoN: Grandma Jones baked 5 apple pies for the fireman's luncheon. She cut each pie into 8 pieces and set the five pies out on the buffet table for the guests to serve themselves. At the end of the evening, after the guests had taken and eaten their pieces of pie, there were 14 pieces of pie remaining. How many pieces were taken by the guests?
问题:琼斯奶奶为消防员午宴烤了 5 个苹果派。她将每个派切成 8 块,并把五个派放在自助餐桌上供客人自取。晚上结束时,客人们取走并吃了一些派后,剩下 14 块派。问客人取走了多少块派?
D.2 Incorrect Chain of Thought Analysis
D.2 错误的思维链分析
We also manually analyze 50 randomly sampled outputs of the model that were incorrect on GSM8K for LaMDA 137B. There are many ways that a chain of thought can be incorrect, making the design of error categorization non-trivial. We decided to categorize errors into what changes are needed to make the chain of thought correct, with the goal of elucidating how the model can be improved in the future.
我们还手动分析了 LaMDA 137B 在 GSM8K 上随机抽取的 50 个错误输出。思维链出错的方式有很多种,使得错误分类的设计变得不 trivial (非平凡)。我们决定根据需要进行哪些更改以使思维链正确来进行分类,目的是阐明如何在未来改进模型。
We found that many chains of thought can be made correct with one of the following three classes of modification.
我们发现许多思维链可以通过以下三类修改之一来修正。
· Calculator error only. We found that $8%$ of the chains of thought were completely correct except for a calculator error-in other words, applying an external calculator to equations, as done in Cobbe et al. (2021), would make the chain of thought correct. An example of this type of error is shown in Table 10: “calculator error only'". Indeed, the solve rate of chain-of-thought prompting on for LaMDA 137B GSM8K went up from $14.3%$ to $17.3%$ when we added a Python program as an external calculator, as shown in Table 2. Also, $34%$ of the examples contained calculator errors in addition to other types of errors. However, we perform the rest of the error categorization independently of calculator errors.
仅计算器错误。我们发现有 $8%$ 的思维链完全正确,除了存在计算器错误——换句话说,在方程中应用外部计算器,如 Cobbe 等人 (2021) 所做的那样,会使思维链正确。这种类型的错误示例如 表 10: “仅计算器错误”。事实上,当我们添加 Python语言 程序作为外部计算器时,LaMDA 137B 在 GSM8K 上的思维链提示解决率从 $14.3%$ 提高到了 $17.3%$ ,如 表 2 所示。此外,$34%$ 的示例中除了其他类型的错误外还包含计算器错误。然而,我们独立于计算器错误进行其余的错误分类。
· Symbol mapping error. We next found that $16%$ percent of the chains of thought were correct except for what we call symbol mapping errors. We define a symbol mapping error as when the chain of thought is correct except for the number symbols, and it could be made totally correct by modifying only the equations and not the words. As one might argue that they could simply place the correct final equation in any chain of thought, we constrain this category to chains of thought where the chain of thought can be modified to be a completely correct reasoning process (not just final answer). An example of this error category is shown in Table 10: “symbol mapping error'.
符号映射错误。我们接下来发现,$16%$ 的思维链是正确的,除了我们称之为符号映射错误的部分。我们将符号映射错误定义为:当思维链除数字符号外都是正确的,并且仅通过修改方程而不是文字就可以使其完全正确。有人可能会认为他们可以在任何思维链中简单地放置正确的最终方程,因此我们将这一类别限制为可以修改为完全正确推理过程的思维链(而不仅仅是最终答案)。表 10: “符号映射错误”展示了此类错误的一个例子。
· One step missing error. Our next category of error is chains of thought which were correct except that they were missing a single step. In other words, these chains of thoughts could be rewritten to be correct by adding in an additional reasoning step that was missed by the model. An example of this error category is shown in Table 10: “one step missing error". We found that $22%$ percentof the errors fell into this category.
一类错误是缺少一个步骤。我们下一类错误是思维链除了缺少一个步骤外都是正确的。换句话说,通过添加模型遗漏的额外推理步骤,这些思维链可以被改写为正确。表 10: “缺少一个步骤的错误”展示了这一错误类别的一个例子。我们发现 22% 的错误属于这一类别。
Table 10: Example of incorrect chains of thought, categorized as described in Appendix D.2.
表 10: 错误思维链的例子,分类详见附录 D.2。
step (in this case, adding 20 and 40).
步骤(在这种情况下,添加 20 和 40)。
We found that the remaining chains of thought (27 of 50; $54%$ would require substantial edits to make into a correct chain of thought. Almost all cases here involved some error in semantic understanding (see Table 11: “semantic understanding error'"), and 8 of the 27 also had incoherent chain of thoughts, meaning that some statements in the generated chain of thought did not follow from prior ones or violated basic world knowledge (see Table 11: “incoherent chain of thought error").
我们发现剩余的思维链中有 27 条(共 50 条;$54%$)需要进行大量修改才能成为正确的思维链。几乎所有这些案例都涉及语义理解方面的错误(见表 11: “语义理解错误”),并且在 27 条中有 8 条还存在不连贯的思维链,这意味着生成的思维链中某些陈述不符合之前的陈述或违反了基本的世界知识(见表 11: “不连贯的思维链错误”)。
Table 11: Example of incorrect chains of thought, categorized as described in Appendix D.2. These chains of thought had multiple mistakes that could not be fixed with minor edits.
表 11: 错误的思维链示例,分类详见附录 D.2。这些思维链存在多个无法通过小修小改解决的错误。
QUEsTIoN: Gretchen has 110 coins. There are 30 more gold coins than silver coins. How many gold coins does Gretchen have?
问题:格雷琴有 110 枚硬币。黄金硬币比银硬币多 30 枚。格雷琴有多少枚黄金硬币?
MODEL ANSWER (INCORRECT; SEMANTIC UNDERSTANDING ERROR): Gretchen has 11O coins. There are 30 more gold coins than silver coins. So there are $110-30=80\$ silver coins. So there are 80 silver coins and $110\cdot80=30\$ gold coins. The answer is 30. x
MODEL ANSWER (INCORRECT; SEMANTIC UNDERSTANDING ERROR): 格雷琴有 110 枚硬币。黄金硬币比银币多 30 枚。所以银币有 $110-30=80$ 枚。所以有 80 枚银币和 $110-80=30$ 枚黄金硬币。答案是 30。 x
注意:原文中的公式计算有误,$110 \cdot 80$ 应为 $110 - 80$。
ExPLANATION FOR ERROR CATEGORY: The model does not understand that 11O coins are in total and that it must solve an equation to get the number of gold and silver coins.
错误类别解释:模型不理解 110 枚硬币是总数,且必须解一个方程来得到金币和银币的数量。
QUEsTIoN: In a dance class of 20 students, $20%$ enrolled in contemporary dance, $25%$ of the remaining enrolled in jazz dance, and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hopdance?
问题:在一个有 20 名学生的舞蹈班中,20% 的学生报名参加了当代舞,剩余学生中有 25% 报名参加了爵士舞,其余的学生报名参加了嘻哈舞。报名参加嘻哈舞的学生占全部学生的百分比是多少?
MODEL ANSWER (INCORRECT; INCOHERENT CHAIN OF THOUGHT ERROR): $20%$ of the students enrolled in contemporary dance. $25%$ of the remaining students enrolled in jazz dance. The rest enrolled in hip-hop dance. So the percentage of the entire students enrolled in hip-hop dance is the percentage of the students enrolled in hip-hop dance, minus the percentage of the students enrolled in contemporary dance, minus the percentage of the students enrolled in jazz dance. So the percentage of the entire students enrolled in hip-hop dance is $(25+20)$ $(25+20)=100%$ . The answer is $100\bar{%}$ $\pmb{x}$
MODEL ANSWER (INCORRECT; INCOHERENT CHAIN OF THOUGHT ERROR): $20%$ 的学生报名参加了当代舞。 $25%$ 的剩余学生报名参加了爵士舞。其余的学生报名参加了嘻哈舞。所以整个学生中报名参加嘻哈舞的百分比是学生报名参加嘻哈舞的百分比,减去学生报名参加当代舞的百分比,减去学生报名参加爵士舞的百分比。所以整个学生中报名参加嘻哈舞的百分比是 $(25+20)$ $(25+20)=100%$ 。答案是 $100\bar{%}$ $\pmb{x}$
注:此答案逻辑错误,计算方式不正确。
EXPLANATION FOR ERROR CATEGORY: This chain of thought is incoherent in that the percent of entire students enrolled in hip-hope dance cannot be the percent of student enrolled in hip-hop dance minus another term.
错误类别解释:这种思维方式是不连贯的,因为参加嘻哈舞蹈的全部学生百分比不能等于参加嘻哈舞蹈的学生百分比减去另一个项。
Overall, there are no guarantees that the reasoning processes generated by large language models are coherent or factually correct, as underscored by the recent work evaluating the factuality of language model generations and explanations (Maynez et al., 2020; Rashkin et al., 2021; Ye and Durrett, 2022; Marasovic et al., 2022; Wiegreffe et al., 2022). Incorrect reasoning processes can lead to both incorrect final answers as well as accidentally correct final answers (with accidentally correct final answers being more likely for tasks such as binary classification as opposed to free response). Improving the factuality of language model generations with respect to context and world knowledge is an important direction open problems in language model research and could also be expected to potentially improve multi-step reasoning abilities of language models. One potential method for improving the quality of decoding could involve generating multiple reasoning paths and scoring each of them with a verifier, though this requires training the verifier (Cobbe et al., 2021; Shen et al., 2021; Thoppilan et al., 2022).
总体而言,大语言模型生成的推理过程并不保证是连贯或事实正确的,正如最近评估语言模型生成和解释的事实正确性的研究工作所强调的那样 (Maynez et al., 2020; Rashkin et al., 2021; Ye and Durrett, 2022; Marasovic et al., 2022; Wiegreffe et al., 2022)。不正确的推理过程可能导致最终答案错误,也可能导致偶然正确的最终答案(对于二元分类等任务,偶然正确的最终答案比自由回答任务更可能出现)。提高语言模型生成内容在上下文和世界知识方面的事实正确性是语言模型研究中的一个重要开放问题,并且有望潜在地改善语言模型的多步推理能力。一种可能改进解码质量的方法是生成多个推理路径,并使用验证器对每个路径进行评分,但这需要训练验证器 (Cobbe et al., 2021; Shen et al., 2021; Thoppilan et al., 2022)。
D.3 Additional Robustness Analysis
D.3 附加鲁棒性分析
As the experiments in the main paper use a fixed number of few-shot exemplars (8; as constrained by the input length of 1024 tokens), we verify that the chain-of-thought prompting is robust to various numbers of few-shot exemplars. We run experiments for LaMDA 137B, comparing chain-of-thought prompting with standard prompting for the five datasets where standard prompting had a mostly flat scaling curve (the largest model did not achieve high performance). As shown in Figure 11, the improvement of chain-of-thought prompting over standard prompting remains robust to varying the number of few-shot exemplars in the prompt.
如主论文中的实验使用了固定数量的少样本示例 (8;受 1024 Token 输入长度的限制),我们验证了链式思维提示对不同数量的少样本示例具有鲁棒性。我们对 LaMDA 137B 进行实验,比较链式思维提示与标准提示在五个标准提示具有相对平坦的学习曲线的数据集上的表现 (最大的模型未能取得高性能)。如图 11 所示,链式思维提示相对于标准提示的改进在提示中少样本示例数量变化时仍然保持鲁棒性。
图 11: 链式思维提示相对于标准提示的改进在少样本示例数量变化时的鲁棒性

Figure 11: The improvement of chain of thought prompting over standard prompting appears robust to varying the number of few-shot exemplars in the prompt.
图 11: 思维链提示相对于标准提示的改进在提示中变化少数样本示例的数量时表现稳健。
Table 12: Summary of math word problem benchmarks we use in this paper with examples. $N$ number of evaluation examples.
表 12: 我们在本文中使用的数学文字题基准总结及示例。$N$ 表示评估示例的数量。
| 数据集 | N 示例问题 |
|---|---|
| GSM8K | 1,319 Josh 决定尝试翻新一所房子。他以 80,000 美元购买了一所房子,然后投入了 50,000 美元进行修缮。这使房子的价值增加了 150%。他赚了多少利润? |
| SVAMP | 1,000 每包 DVD 成本为 76 美元。如果每包有 25 美元的折扣。你买每包需要支付多少? |
| ASDiv | 2,096 Ellen 比 Marin 多六个球。Marin 有九个球。Ellen 有多少个球? |
| AQuA | 254 一辆汽车正以直线和匀速驶向一座垂直塔的底部。从汽车上观察到塔顶,并且在过程中,仰角从 45° 变为 60° 花了 10 分钟。这辆汽车还需要多少时间才能到达塔底?答案选项:(a) 5√3 + 1 (b) 6V3 + √2 (c) 7√3 - 1 (d) 8√3 - 2 (e) 以上都不是 |
| MAWPS: SingleOp | 562 |
| MAWPS: SingleEq | 508 盒子里有多少瓶盖?Benny 买了 2 美元的软饮料和 5 根巧克力棒。他总共花费了 27 美元。每根巧克力棒的价格是多少? |
| MAWPS:AddSub | 395 花瓶里有 6 朵玫瑰。Mary 从她的花园里剪了一些玫瑰。现在花瓶里有 16 朵玫瑰。她剪了多少朵玫瑰? |
| MAWPS:MultiArith600 | 学校食堂为学生午餐订购了 42 个红苹果和 7 个绿苹果。但是,只有 9 名学生想要水果,食堂最终多出了多少水果? |
E Additional Details
E 附加详情
Version Control
版本控制
$\mathbf{V}2\to\mathbf{V}3$ . Added GPT-3 results. Added SVAMP and AQuA eval datasets for math. Added SayCan eval for commonsense. Added Extended Related Work section (Appendix C). Added ablations for Commonsense and Symbolic Reasoning (Table 7). Added FAQ section (Appendix A). Added raw results in Appendix B.
$\mathbf{V}2\to\mathbf{V}3$ 。添加了 GPT-3 的结果。添加了 SVAMP 和 AQuA 评估数据集用于数学任务。添加了 SayCan 评估用于常识推理。添加了扩展的相关工作部分(附录 C)。添加了常识和符号推理的消融实验(表 7)。添加了常见问题解答部分(附录 A)。在附录 B 中添加了原始结果。
$\mathbf{V1}\to\mathbf{V}2$ . Added PaLM results (V1 only had LaMDA).
$\mathbf{V1}\to\mathbf{V}2$ 。添加了 PaLM 结果(V1 仅有 LaMDA)。
E.1 Reproducibility Statement
E.1 可复现性声明
As our results make use of two sets of large language models that is not publicly available, we take the following actions to facilitate reproducibility. First, we provide the exact input prompts for all tasks in Table 20-Table 27 in Appendix G (and emphasize that we do not perform any finetuning and only apply prompting to off-the-shelf language models). Second, we conduct experiments using the publicly available GPT-3 API for four model scales text-ada-001, text-babbage-001, text-curie-001, text-davinci-002). Finally, we make exact inputs, targets, and predictions for LaMDA 137B for each task available as a zip file in the supplementary material.
由于我们的结果使用了两组不可公开获取的大语言模型,我们采取以下措施以促进可重复性。首先,我们在附录 G 的表 20-表 27 中提供了所有任务的确切输入提示(并强调我们不进行任何微调,仅对现成的语言模型应用提示)。其次,我们使用公开可用的 GPT-3 API 对四种模型规模(text-ada-001, text-babbage-001, text-curie-001, text-davinci-002)进行了实验。最后,我们将每个任务的确切输入、目标和预测(针对 LaMDA 137B)作为补充材料中的 zip 文件提供。
E.2 Computational Resources
E.2 计算资源
For all three language models we evaluated, we did prompting-based inference only. No finetuning was done for this paper. For inference on LaMDA 137B we use TPU v3 (8x8 configuration, 64 chips / 128 cores), and for inference on PaLM 540B we use TPU v4 (4x4x12 configuration, 192 chips / 384 cores). GPT-3 experiments were done using the public API.5
对于所有三个大语言模型,我们仅进行了基于提示的推理。本文没有进行任何微调。对于 LaMDA 137B 的推理,我们使用 TPU v3 (8x8 配置,64 个芯片 / 128 个核心),而对于 PaLM 540B 的推理,我们使用 TPU v4 (4x4x12 配置,192 个芯片 / 384 个核心)。GPT-3 实验是通过公共 API 进行的。
E.3Dataset Details and Licenses
E.3 数据集详情和许可
We list the details and licenses for all arithmetic and commonsense datasets used in this paper. The symbolic reasoning datasets were created synthetically, as described in Section 4.
我们列出了本文使用的所有算术和常识数据集的详细信息和许可证。符号推理数据集是按照第 4 节所述方法合成创建的。
Arithmetic reasoning
算术推理
Commonsense reasoning
常识推理
· CSQA (Talmor et al., 2019): https: //www.tau-nlp.org/commonsense qa, https: // github.com/jonathan her zig/commonsense qa.
CSQA (Talmor et al., 2019): https://www.tau-nlp.org/commonsense_qa, https://github.com/jonathan_herszig/commonsense_qa.
F Appendix: Input/Output Examples
F 附录:输入/输出示例
Table 13: Examples of correct and incorrect chains of thought produced by LaMDA 137B on the letter concatenation task.
表 13: LaMDA 137B 在字母连接任务上产生的正确和错误的思维链示例。
| 问题: 取 “Waldo Schmidt” 中每个单词的最后一个字母并将它们连接起来。 |
|---|
| 模型答案 (正确): “Waldo” 的最后一个字母是 “o”。"Schmidt" 的最后一个字母是 “t”。将它们连接起来是 “ot”。所以答案是 ot。 |
| 问题: 取 “Daniel Friedman” 中每个单词的最后一个字母并将它们连接起来。 |
| 模型答案 (错误): “Daniel” 的最后一个字母是 “l”。"Friedman" 的最后一个字母是 “n”。将它们连接起来是 “ln”。所以答案是 ln。 |
Table 14: Examples of correct and incorrect chains of thought produced by LaMDA 137B on the coinflip task.
表 14: LaMDA 137B 在硬币翻转任务中产生的正确和错误的思维链示例。
Table 15: Examples of correct and incorrect chains of thought produced by LaMDA 137B on CSQA.
表 15: LaMDA 137B 在 CSQA 上产生的正确和错误的思维链示例。
| 问题: 当一个人开始工作时,他们在建立什么? |
|---|
| 答案选项:(a) 时间 (b) 完成 (c) 工作 (d) 动力 (e) 疲劳 模型答案(正确):答案必须是某种被建立的东西。在上述选项中,只有动力是被建立的。所以答案是 (d)。 |
| 问题:你可能在哪里找到汉堡包? 答案选项:(a) 快餐店 (b) 披萨 (c) 磨碎的死牛 (d) 口腔 (e) 牛尸 |
|---|
| 模型答案(正确):答案必须是一个能找到汉堡包的地方。在上述选项中,只有快餐店提供汉堡包。所以答案是 (a)。 |
| 问题:除了水和营养,你的狗还需要什么? 答案选项:(a) 骨头 (b) 迷人 (c) 抚摸 (d) 大量的关注 (e) 散步 |
|---|
| 模型答案(错误):答案必须是狗需要的东西。在上述选项中,只有骨头是狗需要的东西。所以答案是 (a)。x (正确答案为 (d))。 |
| 问题:当你在接待区时,你在旁边等待的是什么? |
|---|
| 答案选项:(a) 汽车旅馆 (b) 椅子 (c) 医院 (d) 人 (e) 酒店 模型答案(错误):答案必须是在接待区与你一起等待的东西。在上述选项中,只有人在接待区与你一起等待。所以答案是 (e)。x (正确答案为 (d))。 |
Table 16: Examples of correct and incorrect chains of thought produced by LaMDA 137B on StrategyQA.
表 16: LaMDA 137B 在 StrategyQA 上产生的正确和错误的思维链示例。
| 问题:《驱魔人》会刺激边缘系统吗? | 模型答案 (正确): 《驱魔人》是一部恐怖电影。恐怖电影会让人感到害怕。边缘系统与恐惧有关。因此,《驱魔人》会刺激边缘系统。所以答案是:是的。√ |
|---|---|
| 问题: 杰克逊·波洛克是否由莱昂纳多·达·芬奇训练过? | 模型答案 (正确): 莱昂纳多·达·芬奇生活在15世纪。杰克逊·波洛克生活在20世纪。因此,杰克逊·波洛克不可能由莱昂纳多·达·芬奇训练。所以答案是:不是。√ |
| 问题: 哈利·波特能否预订韩亚航空的航班? | 模型答案 (错误): 哈利·波特是一个虚构的角色。因此,哈利·波特可以做任何事情。所以答案是:是的。× (实际答案为“否”。) |
| 问题: 一个诡辩家会使用花剑吗? | 模型答案 (错误): 诡辩家是指擅长说服艺术的人。花剑是一种剑。因此,诡辩家可能会使用花剑。所以答案是:是的。× (实际答案为“否”。) |
| 问题: 对于简而言,1992年5月6日就像昨天一样,但实际上那是十年前的事了。一个月前的日期是什么 (MM/DD/YYYY)? |
| 是 2002 年 4 月 6 日。所以答案是 04/06/2002.√ |
| 问题: 这是1899年的最后一天。24小时后的日期是什么 (MM/DD/YYYY)? |
| 模型答案 (正确): 今天是 12/31/1899。24 小时后将是 01/01/1900。所以答案是 01/01/1900.√ |
| 问题: 简安排了明天 (1972 年 7 月 9 日,星期二) 与 5 个人的 3 次预约。从今天起一周前的日期是什么 (MM/DD/YYYY)? |
| 模型答案 (错误): 明天是 7/9/1972。从今天起一周前是 7/2/1972。所以答案是 7/2/1972.x (真实答案是 07/01/1972.) |
| 问题: 今天是 2020 年的回文日,因为日期的 MMDDYYYY 格式正读和反读都相同 |
| 模型答案 (错误): 今天是 02/29/2020。明天将是 03/01/2020。所以答案是 03/01/2020.× (真实答案是 “02/03/2020".) |
Table 18: Examples of correct and incorrect chains of thought produced by LaMDA 137B on Sports Understanding.
表 18: LaMDA 137B 在体育理解方面生成的正确和错误的推理链示例。
| 问题 | 模型回答 |
|---|---|
| 问题: 以下句子是否合理?“Joao Moutinho 在三垒出局。”模型回答 (正确): Joao Moutinho 是一名足球运动员。在三垒出局是棒球的一部分,而不是足球。所以答案是否定的。 | |
| 问题: 以下句子是否合理?“Malcolm Brogdon 在 NBA 总决赛中欧洲步上篮。”模型回答 (正确): Malcolm Brogdon 是一名篮球运动员。欧洲步上篮是篮球的一部分。所以答案是肯定的。 | |
| 问题: 以下句子是否合理?“Derrick White 反手击球。”模型回答 (错误): Derrick White 是一名篮球运动员。反手击球是网球或乒乓球的一部分。所以答案是否定的。 | |
| 问题: 以下句子是否合理?“Nazem Kadri 在本垒出局。”模型回答 (错误): Nazem Kadri 是一名冰球运动员。在本垒出局是冰球的一部分。所以答案是肯定的。(事实是“否”,在本垒出局是棒球的一部分。) |
Table 19: Examples of correct and incorrect chains of thought produced by PaLM 540B on SayCan robot planning tasks.
表 19: PaLM 540B 在 SayCan 机器人规划任务中产生的正确和错误的思维链示例。
G Appendix: Full Prompts
G 附录:完整提示语
Table 20: Few-shot exemplars for full chain of thought prompt for math word problems. This set of exemplars was used for all math word problem datasets except AQuA.
表 20: 数学文字题的少样本示例,用于完整的思考链提示。这组示例用于所有数学文字题数据集,除 AQuA 外。
PROMPT FOR AQUA ALGEBRAIC WORD PROBLEMS
代数文字题的提示词
Table 22: Few-shot exemplars for full chain of thought prompt for the last letter concatenation task.
表 22: 少样本示例用于最后一个字母连接任务的完整思考链提示。
PROMPT FOR LAST LETTER CONCATENATION
最后一个字母连接提示
r RUMP IF UK LUIN r LIP
r RUMP IF UK LUIN r LIP
Table 24: Few-shot exemplars for full chain of thought prompt for CSQA. There are newlines between the answer choices that are omitted in the table for space reasons.
表 24: CSQA 的少样本全链路思考提示示例。由于空间原因,表格中省略了答案选项之间的新行。
PROMPT FOR CSQA
CSQA 的提示语
Table 25: Few-shot exemplars for full chain of thought prompt for StrategyQA
表 25: 少样本示例用于 StrategyQA 的完整思考链提示
I KUMrI rUK OIKAIEGIQA
我无法识别该标题并将其翻译为有意义的中文,因此保留原样:I KUMrI rUK OIKAIEGIQA
Table 26: Few-shot exemplars for full chain of thought prompt for Date Understanding.
表 26: 少样本示例用于日期理解的完整思考链提示。
PROMPT FOR DATE UNDERSTANDING
用于日期理解的提示 (PROMPT FOR DATE UNDERSTANDING)
PROMPT FOR SPORTS UNDERSTANDING
体育理解提示
H Appendix: Alternate Annotators for MWP
H 附录:MWP 的替代标注者
Table 29: Few-shot exemplars for full chain of thought prompt for math word problems. These exemplars are the same as in Table 20, except that the chains of thought were written by a different annotator ("Annotator B" instead of “Annotator A"). Annotators were co-authors and familiar with the goal of chain of thought prompting.
表 29: 数学文字题的少样本全链思考提示示例。这些示例与表 20 中的相同,不同之处在于链思考由不同的注释者编写(“Annotator B”而不是“Annotator A”)。注释者是共同作者,并熟悉链思考提示的目标。
PROMPT FOR MATH WORD PROBLEMS
数学文字题提示
Table 30: Few-shot exemplars for full chain of thought prompt for math word problems. These exemplars are the same as in Table 20, except that the chains of thought were written by a different annotator("Annotator $\mathbf{C}^{\bullet}$ instead of “Annotator A").
表 30: 数学文字题的少样本全链思考提示示例。这些示例与表 20 中的相同,不同之处在于链式思考由不同的标注者 (“Annotator $\mathbf{C}^{\bullet}$ 而不是 “Annotator A”) 编写。
