Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter Fei Xia Ed H. Chi Quoc V. Le Denny Zhou
Google Research, Brain Team {jasonwei,dennyzhou}@google.com
Abstract
We explore how generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.
Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
Figure 1: Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.
1 Introduction
The NLP landscape has recently been revolutionized by language models (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020, inter alia). Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020; Brown et al., 2020, inter alia). However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., 2021).
This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer. Prior work has given models the ability to generate natural language intermediate steps by training from scratch (Ling et al., 2017) or finetuning a pretrained model (Cobbe et al., 2021), in addition to neuro-symbolic methods that use formal languages instead of natural language (Roy and Roth, 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Second, large language models offer the exciting prospect of in-context few-shot learning via prompting. That is, instead of finetuning a separate language model checkpoint for each new task, one can simply "prompt" the model with a few input-output exemplars demonstrating the task. Remarkably, this has been successful for a range of simple question-answering tasks (Brown et al., 2020).
Figure 2: PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems. Finetuned GPT-3 and prior best are from Cobbe et al. (2021).
Both of the above ideas, however, have key limitations. For rationale-augmented training and finetuning methods, it is costly to create a large set of high-quality rationales, which is much more complicated than the simple input-output pairs used in normal machine learning. The traditional few-shot prompting method used in Brown et al. (2020) works poorly on tasks that require reasoning abilities, and often does not improve substantially with increasing language model scale (Rae et al., 2021). In this paper, we combine the strengths of these two ideas in a way that avoids their limitations. Specifically, we explore the ability of language models to perform few-shot prompting for reasoning tasks, given a prompt that consists of triples: <input, chain of thought, output>. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting. An example prompt is shown in Figure 1.
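To make the prompt format concrete, the sketch below assembles such triples into a few-shot prompt. It is an illustrative reconstruction rather than the paper's released code: the exemplar follows the style of Figure 1, and `build_standard_prompt` shows the baseline format that simply omits the chain of thought.

```python
# Illustrative sketch of few-shot prompting with <input, chain of thought, output>
# triples (not the paper's released code). The exemplar mirrors the Figure 1 style.

EXEMPLARS = [
    {
        "input": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                 "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "chain_of_thought": "Roger started with 5 balls. 2 cans of 3 tennis balls "
                            "each is 6 tennis balls. 5 + 6 = 11.",
        "output": "The answer is 11.",
    },
    # ... the actual experiments use eight such exemplars ...
]

def build_cot_prompt(question: str) -> str:
    """Prepend chain-of-thought exemplars to a new test question."""
    parts = [f"Q: {ex['input']}\nA: {ex['chain_of_thought']} {ex['output']}"
             for ex in EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def build_standard_prompt(question: str) -> str:
    """Standard prompting: identical exemplars, but the answers omit the reasoning."""
    parts = [f"Q: {ex['input']}\nA: {ex['output']}" for ex in EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```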
We present empirical evaluations on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree. Figure 2 illustrates one such result: on the GSM8K benchmark of math word problems (Cobbe et al., 2021), chain-of-thought prompting with PaLM 540B outperforms standard prompting by a large margin and achieves new state-of-the-art performance. A prompting-only approach is important because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality. This work underscores how large language models can learn via a few examples with natural language data about the task (cf. automatically learning the patterns underlying inputs and outputs via a large training dataset).
2 Chain-of-Thought Prompting
Consider one's own thought process when solving a complicated reasoning task such as a multi-step math word problem. It is typical to decompose the problem into intermediate steps and solve each before giving the final answer: "After Jane gives 2 flowers to her mom she has 10 ... then after she gives 3 to her dad she will have 7 ... so the answer is 7." The goal of this paper is to endow language models with the ability to generate a similar chain of thought: a coherent series of intermediate reasoning steps that lead to the final answer for a problem. We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.
Figure 1 shows an example of a model producing a chain of thought to solve a math word problem that it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution and can be interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it mimics a step-by-step thought process for arriving at the answer (and also, solutions/explanations typically come after the final answer (Narang et al., 2020; Wiegreffe et al., 2022; Lampinen et al., 2022, inter alia)).
Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.
In empirical experiments, we will observe the utility of chain-of-thought prompting for arithmetic reasoning (Section 3), commonsense reasoning (Section 4), and symbolic reasoning (Section 5).
3 Arithmetic Reasoning
We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-of-thought prompting when used with the 540B parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).
3.1 Experimental Setup
We explore chain-of-thought prompting for various language models on multiple benchmarks.
Benchmarks. We consider the following five math word problem benchmarks: (1) the GSM8K benchmark of math word problems (Cobbe et al., 2021), (2) the SVAMP dataset of math word problems with varying structures (Patel et al., 2021), (3) the ASDiv dataset of diverse math word problems (Miao et al., 2020), (4) the AQuA dataset of algebraic word problems (Ling et al., 2017), and (5) the MAWPS benchmark (Koncel-Kedziorski et al., 2016). Example problems are given in Appendix Table 12.
Standard prompting. For the baseline, we consider standard few-shot prompting, popularized by Brown et al. (2020), in which a language model is given in-context exemplars of input-output pairs before outputting a prediction for a test-time example. Exemplars are formatted as questions and answers. The model gives the answer directly, as shown in Figure 1 (left).
Chain-of-thought prompting. Our proposed approach is to augment each exemplar in few-shot prompting with a chain of thought for an associated answer, as illustrated in Figure 1 (right). As most of the datasets only have an evaluation split, we manually composed a set of eight few-shot exemplars with chains of thought for prompting. Figure 1 (right) shows one chain of thought exemplar, and the full set of exemplars is given in Appendix Table 20. (These particular exemplars did not undergo prompt engineering; robustness is studied in Section 3.4 and Appendix A.2.) To investigate whether chain-of-thought prompting in this form can successfully elicit reasoning across a range of math word problems, we used this single set of eight chain of thought exemplars for all benchmarks except AQuA, which is multiple choice instead of free response. For AQuA, we used four exemplars and solutions from the training set, as given in Appendix Table 21.
Figure 3: Examples of (input, chain of thought, output) triples for arithmetic, commonsense, and symbolic reasoning benchmarks. Chains of thought are highlighted. Full prompts in Appendix G.
Language models. We evaluate five large language models. The first is GPT-3 (Brown et al., 2020), for which we use text-ada-001, text-babbage-001, text-curie-001, and text-davinci-002, which presumably correspond to InstructGPT models of 350M, 1.3B, 6.7B, and 175B parameters (Ouyang et al., 2022). The second is LaMDA (Thoppilan et al., 2022), which has models of 422M, 2B, 8B, 68B, and 137B parameters. The third is PaLM, which has models of 8B, 62B, and 540B parameters. The fourth is UL2 20B (Tay et al., 2022), and the fifth is Codex (Chen et al., 2021, code-davinci-002 in the OpenAI API). We sample from the models via greedy decoding (though follow-up work shows chain-of-thought prompting can be improved by taking the majority final answer over many sampled generations (Wang et al., 2022a)). For LaMDA, we report averaged results over five random seeds, where each seed had a different randomly shuffled order of exemplars. As LaMDA experiments did not show large variance among different seeds, to save compute we report results for a single exemplar order for all other models.
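As a hedged illustration of the decoding setup, the sketch below contrasts the greedy decoding used here with the follow-up majority-vote idea (Wang et al., 2022a). The answer-extraction regex and the `sample_model` callable are assumptions for illustration, not part of this paper's pipeline.

```python
import re
from collections import Counter
from typing import Optional

def extract_final_answer(generation: str) -> Optional[str]:
    """Heuristically take the last number in a generated chain of thought as the answer."""
    numbers = re.findall(r"-?\d+\.?\d*", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def greedy_answer(prompt: str, sample_model) -> Optional[str]:
    # Greedy decoding (temperature 0, a single generation), as used in this paper.
    return extract_final_answer(sample_model(prompt, temperature=0.0))

def majority_answer(prompt: str, sample_model, n: int = 40) -> Optional[str]:
    # Follow-up idea: sample several reasoning paths and keep the most common final answer.
    answers = [extract_final_answer(sample_model(prompt, temperature=0.7)) for _ in range(n)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```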
3.2 Results
The strongest results of chain-of-thought prompting are summarized in Figure 4, with all experimental outputs for each model collection, model size, and benchmark shown in Table 2 in the Appendix. There are three key takeaways. First, Figure 4 shows that chain-of-thought prompting is an emergent ability of model scale (Wei et al., 2022b). That is, chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ~100B parameters. We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting.
Second, chain-of-thought prompting has larger performance gains for more-complicated problems. For instance, for GSM8K (the dataset with the lowest baseline performance), performance more than doubled for the largest GPT and PaLM models. On the other hand, for SingleOp, the easiest subset of MAWPS which only requires a single step to solve, performance improvements were either negative or very small (see Appendix Table 3).
Third, chain-of-thought prompting via GPT-3 175B and PaLM 540B compares favorably to prior state of the art, which typically finetunes a task-specific model on a labeled training dataset. Figure 4 shows how PaLM 540B uses chain-of-thought prompting to achieve new state of the art on GSM8K, SVAMP, and MAWPS (though note that standard prompting already passed the prior best for SVAMP). On the other two datasets, AQuA and ASDiv, PaLM with chain-of-thought prompting reaches within 2% of the state of the art (Appendix Table 2).
To better understand why chain-of-thought prompting works, we manually examined model-generated chains of thought by LaMDA 137B for GSM8K. Of 50 random examples where the model returned the correct final answer, all of the generated chains of thought were also logically and mathematically correct except two that coincidentally arrived at the correct answer (see Appendix D.1, and Table 8 for examples of correct model-generated chains of thought). We also randomly examined 50 random samples for which the model gave the wrong answer. The summary of this analysis is that 46% of the chains of thought were almost correct, barring minor mistakes (calculator error, symbol mapping error, or one reasoning step missing), and that the other 54% of the chains of thought had major errors in semantic understanding or coherence (see Appendix D.2). To provide a small insight into why scaling improves chain-of-thought reasoning ability, we performed a similar analysis of errors made by PaLM 62B and whether those errors were fixed by scaling to PaLM 540B. The summary is that scaling PaLM to 540B fixes a large portion of one-step missing and semantic understanding errors in the 62B model (see Appendix A.1).
Figure 4: Chain-of-thought prompting enables large language models to solve challenging math problems. Notably, chain-of-thought reasoning is an emergent ability of increasing model scale. Prior best numbers are from Cobbe et al. (2021) for GSM8K, Jie et al. (2022) for SVAMP, and Lan et al. (2021) for MAWPS.
3.3 Ablation Study
The observed benefits of using chain-of-thought prompting raise the natural question of whether the same performance improvements can be conferred via other types of prompting. Figure 5 shows an ablation study with three variations of chain of thought described below.
Equation only. One reason for why chain-of-thought prompting might help is that it produces the mathematical equation to be evaluated, and so we test a variation where the model is prompted to output only a mathematical equation before giving the answer. Figure 5 shows that equation only prompting does not help much for GSM8K, which implies that the semantics of the questions in GSM8K are too challenging to directly translate into an equation without the natural language reasoning steps in chain of thought. For datasets of one-step or two-step problems, however, we find that equation only prompting does improve performance, since the equation can be easily derived from the question (see Appendix Table 6).
Variable compute only. Another intuition is that chain of thought allows the model to spend more computation (i.e., intermediate tokens) on harder problems. To isolate the effect of variable computation from chain-of-thought reasoning, we test a configuration where the model is prompted to output only a sequence of dots (...) equal to the number of characters in the equation needed to solve the problem. This variant performs about the same as the baseline, which suggests that variable computation by itself is not the reason for the success of chain-of-thought prompting, and that there appears to be utility from expressing intermediate steps via natural language.
Chain of thought after answer. Another potential benefit of chain-of-thought prompting could simply be that such prompts allow the model to better access relevant knowledge acquired during pretraining. Therefore, we test an alternative configuration where the chain of thought prompt is only given after the answer, isolating whether the model actually depends on the produced chain of thought to give the final answer. This variant performs about the same as the baseline, which suggests that the sequential reasoning embodied in the chain of thought is useful for reasons beyond just activating knowledge.
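To make the three ablation formats concrete, here is a minimal sketch of how each exemplar could be rewritten. The dictionary fields, including a bare `equation` string for each exemplar, are illustrative assumptions rather than the paper's released prompts.

```python
# Minimal sketch of the three ablation prompt formats, assuming each exemplar dict
# has "input", "chain_of_thought", "output", and a bare "equation" string.

def equation_only(ex: dict) -> str:
    # The exemplar answer contains only the equation to be evaluated, then the answer.
    return f"Q: {ex['input']}\nA: {ex['equation']} {ex['output']}"

def variable_compute_only(ex: dict) -> str:
    # A run of dots with as many characters as the needed equation, so the model can
    # spend intermediate tokens without any natural-language reasoning.
    dots = "." * len(ex["equation"])
    return f"Q: {ex['input']}\nA: {dots} {ex['output']}"

def chain_of_thought_after_answer(ex: dict) -> str:
    # The reasoning appears only after the answer, so it cannot inform the prediction.
    return f"Q: {ex['input']}\nA: {ex['output']} {ex['chain_of_thought']}"
```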
Figure 5: Ablation study for different variations of prompting using LaMDA 137B and PaLM 540B. Results for other datasets are given in Appendix Table 6 and Table 7.
3.4 Robustness of Chain of Thought
Sensitivity to exemplars is a key consideration of prompting approaches; for instance, varying the permutation of few-shot exemplars can cause the accuracy of GPT-3 on SST-2 to range from near chance (54.3%) to near state of the art (93.4%) (Zhao et al., 2021). In this final subsection, we evaluate robustness to chains of thought written by different annotators. In addition to the results above, which used chains of thought written by an Annotator A, two other co-authors of this paper (Annotators B and C) independently wrote chains of thought for the same few-shot exemplars (shown in Appendix H). Annotator A also wrote another chain of thought that was more concise than the original, following the style of solutions given in Cobbe et al. (2021).
Figure 6 shows these results for LaMDA 137B on GSM8K and MAWPS (ablation results for other datasets are given in Appendix Table 6 / Table 7). Although there is variance among different chain of thought annotations, as would be expected when using exemplar-based prompting (Le Scao and Rush, 2021; Reynolds and McDonell, 2021; Zhao et al., 2021), all sets of chain of thought prompts outperform the standard baseline by a large margin. This result implies that successful use of chain of thought does not depend on a particular linguistic style.
Figure 6: Chain-of-thought prompting has variance for different prompt examples (as expected) but outperforms standard prompting for various annotators as well as for different exemplars.
We also ran experiments with exemplars randomly sampled from the GSM8K training set, an independent source (examples in this dataset already included reasoning steps like a chain of thought). Figure 6 shows that these prompts performed comparably with our manually written exemplars, also substantially outperforming standard prompting.
In addition to robustness to annotators, independently-written chains of thought, different exemplars and various language models, we also find that chain-of-thought prompting for arithmetic reasoning is robust to different exemplar orders and varying numbers of exemplars (see Appendix A.2).
4 Commonsense Reasoning
Although chain of thought is particularly suitable for math word problems, the language-based nature of chain of thought actually makes it applicable to a broad class of commonsense reasoning problems, which involve reasoning about physical and human interactions under the presumption of general background knowledge. Commonsense reasoning is key for interacting with the world and is still beyond the reach of current natural language understanding systems (Talmor et al., 2021).
Benchmarks. We consider five datasets covering a diverse range of commonsense reasoning types. The popular CSQA (Talmor et al., 2019) asks commonsense questions about the world involving complex semantics that often require prior knowledge. StrategyQA (Geva et al., 2021) requires models to infer a multi-hop strategy to answer questions. We choose two specialized evaluation sets from the BIG-bench effort (BIG-bench collaboration, 2021): Date Understanding, which involves inferring a date from a given context, and Sports Understanding, which involves determining whether a sentence relating to sports is plausible or implausible. Finally, the SayCan dataset (Ahn et al., 2022) involves mapping a natural language instruction to a sequence of robot actions from a discrete set. Figure 3 shows examples with chain of thought annotations for all datasets.
Prompts. We follow the same experimental setup as the prior section. For CSQA and StrategyQA, we randomly selected examples from the training set and manually composed chains of thought for them to use as few-shot exemplars. The two BIG-bench tasks do not have training sets, so we selected the first ten examples in the evaluation set as few-shot exemplars and report numbers on the rest of the evaluation set. For SayCan, we use six examples from the training set used in Ahn et al. (2022) and also manually composed chains of thought.
Results. Figure 7 highlights these results for PaLM (full results for LaMDA, GPT-3, and different model scales are shown in Table 4). For all tasks, scaling up model size improved the performance of standard prompting; chain-of-thought prompting led to further gains, with improvements appearing to be largest for PaLM 540B. With chain-of-thought prompting, PaLM 540B achieved strong performance relative to baselines, outperforming the prior state of the art on StrategyQA (75.6% vs. 69.4%) and outperforming an unaided sports enthusiast on sports understanding (95.4% vs. 84%). These results demonstrate that chain-of-thought prompting can also improve performance on tasks requiring a range of commonsense reasoning abilities (though note that gain was minimal on CSQA).
Figure 7: Chain-of-thought prompting also improves the commonsense reasoning abilities of language models. The language model shown here is PaLM. Prior best numbers are from the leaderboards of CSQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021) (single-model only, as of May 5, 2022). Additional results using various sizes of LaMDA, GPT-3, and PaLM are shown in Table 4.
5 Symbolic Reasoning
Our final experimental evaluation considers symbolic reasoning, which is simple for humans but potentially challenging for language models. We show that chain-of-thought prompting not only enables language models to perform symbolic reasoning tasks that are challenging in the standard prompting setting, but also facilitates length generalization to inference-time inputs longer than those seen in the few-shot exemplars.
Tasks. We use the following two toy tasks.
· Last letter concatenation. This task asks the model to concatenate the last letters of words in a name (e.g., "Amy Brown" → "yn"). It is a more challenging version of first letter concatenation, which language models can already perform without chain of thought. We generate full names by randomly concatenating names from the top one-thousand first and last names from name census data (https://namecensus.com/).
· Coin flip. This task asks the model to answer whether a coin is still heads up after people either flip or don't flip the coin (e.g., "A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?" → "no").
As the construction of these symbolic reasoning tasks is well-defined, for each task we consider an in-domain test set for which examples had the same number of steps as the training/few-shot exemplars, as well as an out-of-domain (OOD) test set, for which evaluation examples had more steps than those in the exemplars. For last letter concatenation, the model only sees exemplars of names with two words, and then performs last letter concatenation on names with 3 and 4 words. We do the same for the number of potential flips in the coin flip task. Our experimental setup uses the same methods and models as in the prior two sections. We again manually compose chains of thought for the few-shot exemplars for each task, which are given in Figure 3.
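The sketch below generates examples for both toy tasks, including the longer out-of-domain variants. The small name lists are placeholder stand-ins for the top one-thousand census names, so this is illustrative rather than the exact generation script.

```python
import random

# Placeholder stand-ins for the top-1000 first and last names from census data.
FIRST_NAMES = ["Amy", "Phoebe", "Osvaldo", "Elon", "Lady"]
LAST_NAMES = ["Brown", "Smith", "Musk", "Gaga", "Nguyen"]

def last_letter_concat_example(num_words: int):
    """num_words=2 is the in-domain setting; 3 or 4 words is the OOD setting."""
    words = [random.choice(FIRST_NAMES)] + random.choices(LAST_NAMES, k=num_words - 1)
    name = " ".join(words)
    question = f'Take the last letters of the words in "{name}" and concatenate them.'
    answer = "".join(w[-1] for w in words)
    return question, answer

def coin_flip_example(num_people: int):
    """Track whether a coin stays heads up after a sequence of possible flips."""
    people = random.sample(FIRST_NAMES, k=num_people)
    heads = True
    sentences = ["A coin is heads up."]
    for person in people:
        flips = random.random() < 0.5
        if flips:
            heads = not heads
        sentences.append(f"{person} {'flips' if flips else 'does not flip'} the coin.")
    sentences.append("Is the coin still heads up?")
    return " ".join(sentences), "yes" if heads else "no"
```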
Figure 8: Using chain-of-thought prompting facilitates generalization to longer sequences in two symbolic reasoning tasks (x-axis: model scale, # parameters in billions).
Results. The results of these in-domain and OOD evaluations are shown in Figure 8 for PaLM, with results for LaMDA shown in Appendix Table 5. With PaLM 540B, chain-of-thought prompting leads to almost 100% solve rates (note that standard prompting already solves coin flip with PaLM 540B, though not for LaMDA 137B). Note that these in-domain evaluations are "toy tasks" in the sense that perfect solution structures are already provided by the chains of thought in the few-shot exemplars; all the model has to do is repeat the same steps with the new symbols in the test-time example. And yet, small models still fail: the ability to perform abstract manipulations on unseen symbols for these tasks only arises at the scale of 100B model parameters.
As for the OOD evaluations, standard prompting fails for both tasks. With chain-of-thought prompting, language models achieve upward scaling curves (though performance is lower than in the in-domain setting). Hence, chain-of-thought prompting facilitates length generalization beyond seen chains of thought for language models of sufficient scale.
6 Discussion
We have explored chain-of-thought prompting as a simple mechanism for eliciting multi-step reasoning behavior in large language models. We first saw that chain-of-thought prompting improves performance by a large margin on arithmetic reasoning, yielding improvements that are much stronger than ablations and robust to different annotators, exemplars, and language models (Section 3). Next, experiments on commonsense reasoning underscored how the linguistic nature of chain-of-thought reasoning makes it generally applicable (Section 4). Finally, we showed that for symbolic reasoning, chain-of-thought prompting facilitates OOD generalization to longer sequence lengths (Section 5). In all experiments, chain-of-thought reasoning is elicited simply by prompting an off-the-shelf language model. No language models were finetuned in the process of writing this paper.
The emergence of chain-of-thought reasoning as a result of model scale has been a prevailing theme (Wei et al., 2022b). For many reasoning tasks where standard prompting has a flat scaling curve, chain-of-thought prompting leads to dramatically increasing scaling curves. Chain-of-thought prompting appears to expand the set of tasks that large language models can perform successfully; in other words, our work underscores that standard prompting only provides a lower bound on the capabilities of large language models. This observation likely raises more questions than it answers. For instance, how much more can we expect reasoning ability to improve with a further increase in model scale? What other prompting methods might expand the range of tasks that language models can solve?
As for limitations, we first qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually "reasoning," which we leave as an open question. Second, although the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for finetuning (though this could potentially be surmounted with synthetic data generation, or zero-shot generalization). Third, there is no guarantee of correct reasoning paths, which can lead to both correct and incorrect answers; improving factual generations of language models is an open direction for future work (Rashkin et al., 2021; Ye and Durrett, 2022; Wiegreffe et al., 2022, inter alia). Finally, the emergence of chain-of-thought reasoning only at large model scales makes it costly to serve in real-world applications; further research could explore how to induce reasoning in smaller models.
7 Related Work
This work is inspired by many research areas, which we detail in an extended related work section (Appendix C). Here we describe two directions and associated papers that are perhaps most relevant.
The first relevant direction is using intermediate steps to solve reasoning problems. Ling et al. (2017) pioneer the idea of using natural language rationales to solve math word problems through a series of intermediate steps. Their work is a remarkable contrast to the literature using formal languages to reason (Roy and Roth, 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Cobbe et al. (2021) extend Ling et al. (2017) by creating a larger dataset and using it to finetune a pretrained language model rather than training a model from scratch. In the domain of program synthesis, Nye et al. (2021) leverage language models to predict the final outputs of Python programs by first predicting the intermediate computational results line by line, and show that their step-by-step prediction method performs better than directly predicting the final outputs.
Naturally, this paper also relates closely to the large body of recent work on prompting. Since the popularization of few-shot prompting as given by Brown et al. (2020), several general approaches have improved the prompting ability of models, such as automatically learning prompts (Lester et al., 2021) or giving models instructions describing a task (Wei et al., 2022a; Sanh et al., 2022; Ouyang et al., 2022). Whereas these approaches improve or augment the input part of the prompt (e.g., instructions that are prepended to inputs), our work takes the orthogonal direction of augmenting the outputs of language models with a chain of thought.
8 Conclusions
We have explored chain-of-thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. Through experiments on arithmetic, symbolic, and commonsense reasoning, we find that chain-of-thought reasoning is an emergent property of model scale that allows sufficiently large language models to perform reasoning tasks that otherwise have flat scaling curves. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.
Acknowledgements
We thank Jacob Devlin, Claire Cui, Andrew Dai, and Ellie Pavlick for providing feedback on the paper. We thank Jacob Austin, Yuhuai Wu, Henryk Michalewski, Aitor Lewkowycz, Charles Sutton, and Aakanksha Chowdhery for helpful discussions. We thank Sid Maxwell for notifying us about a mistake in the manual error analysis in the original manuscript.
References
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics.
Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. EMNLP.
Jacob Andreas, Dan Klein, and Sergey Levine. 2018. Learning with latent language. NAACL.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
BIG-bench collaboration. 2021. Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation.
Kaj Bostrom, Xinyu Zhao, Swarat Chaudhuri, and Greg Durrett. 2021. Flexible generation of natural language deductions. EMNLP.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. NeurIPS.
Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. ICLR.
Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. NeurIPS.
Howard Chen, Jacqueline He, Karthik Narasimhan, and Danqi Chen. 2022. Can rationalization improve robustness? NAACL.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. 2019. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. ICLR.
Ting-Rui Chiang and Yun-Nung Chen. 2019. Semantically-aligned equation generation for solving and reasoning math word problems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2656-2668, Minneapolis, Minnesota. Association for Computational Linguistics.
Gabriel Recchia. 2021. Teaching autoregressive language models complex tasks by demonstration. arXiv preprint arXiv:2109.02102.
Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. 2022. A recipe for arbitrary text style transfer with large language models. ACL.
Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems.
Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. EMNLP.
Checklist
- For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
- If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
- If you ran experiments...
- If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
- If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]
A Frequently Asked Questions
A.1 Why does increasing model scale improve chain-of-thought prompting?
The finding that successful chain-of-thought reasoning predictably emerges only at certain model scales is intriguing. Scaling up language models has been shown to confer benefits such as improved performance and sample efficiency (Kaplan et al., 2020), but chain-of-thought reasoning is emergent in the sense that its success cannot be predicted only by extrapolating the performance of small scale models, as chain of thought actually hurts performance for most models smaller than 10B parameters.
The question of why model scale improves chain-of-thought prompting is certainly multi-faceted, and we made a preliminary attempt to shed insight into it via error analysis. This small analysis involved manually reading 45 errors made by PaLM 62B and categorizing them into semantic understanding (20 errors), one step missing (18 errors), and other errors (7 errors). The "other" category included hallucinations, repetitive outputs, and symbol mapping errors. This categorization is a coarse one borrowed from the initial error analysis done on LaMDA in Appendix D.2, for which categories were conceived based on what improvements were needed to make the chain of thought correct.
As shown in Figure 9, scaling PaLM to 540B parameters fixed a substantial portion of errors in all three categories. Examples of semantic understanding and one-step missing errors that were fixed by scaling PaLM to 540B are given in Figure 10. This result appears consistent with a hypothesis that language models acquire a range of semantic understanding and logical reasoning skills as a function of model scale (though note that model scale is often conflated with other factors, such as amount of training compute).
Figure 9: Error analysis of 45 problems that PaLM 62B got incorrect. These errors were categorized as semantic understanding, one step missing, and other. The other category includes hallucinations, repetitive outputs, and symbol mapping errors. Scaling PaLM to 540B fixed a substantial portion of errors in all categories.
There are also three notable points regarding why small language models fail. The first observation is that small language models fail at even relatively easy symbol mapping tasks. As demonstrated in Section 5, even for symbolic reasoning tasks that only require generalization to new examples using the same chain of thought logical structure that was given in the few-shot exemplars, small language models still failed. The second observation is that small language models seem to have inherently weaker arithmetic abilities: as shown by Brown et al. (2020), the ability to do simple arithmetic operations (without semantic understanding) requires sufficient model scale. Finally, we noticed qualitatively that small language models often did not generate a final answer that could be parsed, due to either repetitions or logic that never arrived at a final answer.
In summary, the success of chain-of-thought reasoning as a result of model scale is a complicated phenomenon that likely involves a variety of emergent abilities (semantic understanding, symbol mapping, staying on topic, arithmetic ability, faithfulness, etc.). Future work could more thoroughly investigate what properties of pretraining data, model architecture, and optimization objective causally enable such reasoning capabilities.
Figure 10: Examples of semantic understanding and one-step missing errors that were fixed by scaling PaLM from 62B to 540B.
A.2 What is the role of prompt engineering?
One of the key considerations of prompting is sensitivity to the exact prompt. There is no shortage of work showing that prompts affect language models in unexpected ways (Min et al., 2022). The general way that we created chain of thought annotations was by taking eight exemplars from the training set and decomposing the reasoning process into multiple steps leading to the final answer. Examples of chain of thought annotations are provided in Figure 3, with full prompts given in Appendix G. To analyze how sensitive chain of thought is to prompt engineering, we performed robustness experiments with respect to various factors.
· Different annotators. We first analyze robustness to three different annotators (Section 3.4 and Figure 6). Although there is notable variance in performance (which we will discuss later), chain of thought performed better than the baseline by a large margin for all three annotators on eight datasets in arithmetic, commonsense, and symbolic reasoning (Table 6 and Table 7). Similar to the annotation process in Cobbe et al. (2021), annotators were not given specific instructions about how to write the chain of thought annotations other than to simply write the step-by-step reasoning process that led to the final answer. Thus, the annotations were written in each annotator's own linguistic "chain of thought" writing style.
· Annotators without machine learning background. The GSM8K dataset (Cobbe et al., 2021) conveniently provides a training set with reasoning chains written by crowd compute workers, which enables us to investigate whether chain of thought still works with reasoning chains from an independent source without a background in machine learning. So we randomly sampled three sets of eight exemplars with chains of thought from GSM8K. These chain of thought annotations also outperformed the baseline by a large margin for all four arithmetic datasets (Table 6), indicating that chain of thought is not dependent on a particular set of annotators.
· Different exemplars. The different GSM8K exemplars experiment above (Table 6) also shows that chain-of-thought prompting works for different sets of exemplars. Notably, we test every set of exemplars on all four arithmetic datasets (instead of picking exemplars from the training set for each dataset), which suggests that the exemplars do not necessarily have to come from the same dataset distribution as the test examples.
· Different order of exemplars. Prior work has shown that in some cases (e.g., classification) even the order of prompts matter—-varying the permutation of few-shot exemplars can cause the accuracy of GPT-3 on SST-2 to range from near chance $(54.3%)$ tonearSOTA $(93.4%)$ (Zha0 et al., 2021). We show the standard deviation of performance from different exemplars in Table 6 and Table 7. Standard deviations with respect to prompt order are relatively minimal in almost all cases. The one exception is the coin flip task, for which exemplar orders have high standard deviation, likely for the reason cited in Zhao et al. (2021)—-for classification, many exemplars of the same category in a row biases the model outputs).
不同的示例顺序。先前的工作已经表明,在某些情况下(例如,分类),提示的顺序也很重要——改变少样本示例的排列可以导致 GPT-3 在 SST-2 上的准确率从接近随机水平 (54.3%) 到接近最先进水平 (93.4%) 不等 (Zhao et al., 2021)。我们在表 6 和表 7 中展示了不同示例的标准差。几乎所有情况下,相对于提示顺序的标准差都相对较小。唯一的例外是硬币翻转任务,对于该任务,示例顺序具有较高的标准差,这可能是因为 Zhao 等人 (2021) 提到的原因——对于分类,连续出现的同一类别示例会偏向模型输出。
· Different number of exemplars. We also found that gains from chain-of-thought prompting generally still held as the number of few-shot exemplars varied. This is shown for five datasets in Figure 11 (we did not have the compute to run this for all datasets); the robustness sketch after this list also varies the exemplar count. We also found in preliminary experiments that further increasing the number of exemplars in standard prompting did not lead to significant gains (e.g., increasing from 8 to 16 exemplars did not improve the performance of standard prompting enough to catch up with chain-of-thought prompting).
不同的示例数量。我们还发现,当有不同数量的少样本示例时,链式思维提示 (chain-of-thought prompting) 的收益通常仍然存在。这在图 11 中展示了五个数据集的结果(我们没有足够的计算资源来对所有数据集运行此实验)。在初步实验中,我们还发现进一步增加标准提示中的示例数量并未带来显著的改进(例如,从 8 个增加到 16 个示例并没有使标准提示的性能提升到赶上链式思维提示的程度)。
· Different language models. Another interesting question is whether certain prompts that work better for one model also work better for other large language models. We find that with the same prompts, chain-of-thought prompting improves performance across all three models (LaMDA, GPT-3, and PaLM) for all datasets except CSQA and StrategyQA for GPT-3 (Table 1, Table 4, Table 5). The fact that gains from chain of thought did not transfer perfectly among models is a limitation; further work could investigate how different pre-training datasets and model architectures affect the performance gain from chain-of-thought prompting.
不同的大语言模型。另一个有趣的问题是,某些对一个模型更有效的提示是否对其他大语言模型也更有效。我们发现,使用相同的提示,链式思维提示在所有三个模型 (LaMDA, GPT-3, 和 PaLM) 上提高了所有数据集的性能,除了 GPT-3 的 CSQA 和 StrategyQA (表 1, 表 4, 表 5)。链式思维的收益未能在不同模型之间完全转移这一事实是一个局限性;进一步的研究可以调查不同的预训练数据集和模型架构如何影响链式思维提示的性能提升。
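The order and count checks described in the bullets above can be expressed as a short evaluation loop. The Python sketch below is illustrative only: `build_prompt` is the helper from the earlier sketch, and the `model` callable, the `eval_set` format, and the `extract_final_answer` heuristic are assumptions rather than the paper's actual evaluation code.

```python
import random
import statistics

def extract_final_answer(text):
    """Heuristic: take whatever follows the last 'The answer is' as the prediction."""
    return text.rsplit("The answer is", 1)[-1].strip().rstrip(".")

def accuracy(exemplars, eval_set, model):
    """Build a prompt from the exemplars, query the (assumed) model on each
    evaluation example, and return the fraction of correct final answers."""
    correct = 0
    for example in eval_set:
        prompt = build_prompt(exemplars, example["question"])  # helper from the earlier sketch
        correct += extract_final_answer(model(prompt)) == example["answer"]
    return correct / len(eval_set)

def order_robustness(exemplars, eval_set, model, num_orders=5, seed=0):
    """Re-evaluate the same exemplars under several random orderings and report the
    mean and standard deviation of accuracy (cf. Table 6 and Table 7)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_orders):
        shuffled = exemplars[:]
        rng.shuffle(shuffled)
        scores.append(accuracy(shuffled, eval_set, model))
    return statistics.mean(scores), statistics.stdev(scores)

def count_robustness(exemplars, eval_set, model, counts=(2, 4, 6, 8)):
    """Evaluate with a varying number of few-shot exemplars (cf. Figure 11)."""
    return {k: accuracy(exemplars[:k], eval_set, model) for k in counts}
```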
Prompt engineering still matters, though. Although the results are relatively robust to the prompt for arithmetic reasoning, we want to be clear that prompt engineering still does matter, and can improve performance significantly in many cases. Though most chain of thought annotations outperform standard prompting, there is large variation in many cases. For instance, for the coin flip task, the performance varied from 99.6% for Annotator A to 71.4% for Annotator C, though both were above the standard prompting baseline of 50.0% (see Table 7). There are even tasks where prompt engineering is a requirement for good performance. In preliminary experiments, we tried using chain of thought to enable language models to reverse the order of a list of 5 items. While two co-authors were not able to write chain of thought prompts that solved the task despite their best attempts, a third co-author was able to write a chain of thought that perfectly solved the task.
提示工程仍然很重要。尽管算术推理的结果对提示相对稳健，我们希望明确指出，提示工程仍然很重要，并且在许多情况下可以显著提高性能。虽然大多数思维链注释优于标准提示，但在许多情况下存在很大差异。例如，在硬币翻转任务中，性能从注释者 A 的 99.6% 到注释者 C 的 71.4% 不等，尽管两者都高于标准提示的 50.0%（见表 7）。甚至有些任务需要良好的提示工程才能取得好成绩。在初步实验中，我们尝试使用思维链来使大语言模型能够反转一个包含 5 个项目的列表顺序。尽管两位合著者尽了最大努力，但未能编写出能解决该任务的思维链提示，而第三位合著者成功编写了一个完美解决该任务的思维链提示。
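For intuition about why such prompts can differ so much in quality, the snippet below shows one way a step-by-step exemplar for the list-reversal task could be written. It is an illustrative guess at the kind of decomposition that makes the task solvable, not the co-author's actual prompt.

```python
# Illustrative (hypothetical) chain-of-thought exemplar for reversing a 5-item list.
# This is not the prompt used in the paper, only an example of step-by-step decomposition.
reversal_exemplar = (
    "Q: Reverse the order of the following list: pen, book, cup, key, coin.\n"
    "A: The last item is coin, so it comes first. The second-to-last item is key, "
    "so it comes second. The middle item is cup, so it comes third. The second item "
    "is book, so it comes fourth. The first item is pen, so it comes last. "
    "The answer is coin, key, cup, book, pen."
)
```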
How to generate chain of thought annotations in a robust fashion could be an interesting direction for future work. For instance, an idea here could be to use a large language model to automatically generate chains of thought via prompting (and potentially optimize this over a validation set).
如何以稳健的方式生成思维链注释可能是一个有趣的研究方向。例如,可以使用大语言模型通过提示自动生成思维链(并可能在验证集上优化这一过程)。
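As one concrete reading of this idea, the sketch below asks a model to draft rationales for training questions, keeps only drafts whose final answer matches the gold label, and picks the exemplar set that scores best on a validation split. The `model` callable is an assumption, and `accuracy` and `extract_final_answer` refer to the earlier robustness sketch; none of this is an implementation from the paper.

```python
def generate_candidate_chain(model, question, gold_answer, num_drafts=4):
    """Hypothetical sketch: ask the model to draft step-by-step rationales for a
    training question and keep the first draft whose final answer matches the label."""
    request = (f"Q: {question}\nWrite out the reasoning step by step, then finish with "
               f"'The answer is <answer>.'\nA:")
    for _ in range(num_drafts):
        draft = model(request)
        if extract_final_answer(draft) == gold_answer:
            return draft
    return None  # no usable rationale was produced for this question

def select_exemplar_set(candidate_sets, validation_set, model):
    """Pick the candidate exemplar set with the highest validation accuracy."""
    return max(candidate_sets, key=lambda s: accuracy(s, validation_set, model))
```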
A.3 Will chain-of-thought prompting improve performance for my task of interest?
A.3 思维链提示是否会提高我感兴趣的任务上的性能？
While chain-of-thought prompting is in principle applicable to any text-to-text task, it is more helpful for some tasks than others. Based on the experiments in this paper, our intuition is that chain of thought helps the most when three conditions are met: (1) the task is challenging and requires multi-step reasoning, (2) a sufficiently large language model is used, and (3) the model's scaling curve is relatively flat.
虽然链式思维提示在原则上适用于任何文本到文本的任务，但对某些任务的帮助比其他任务更大。根据本文的实验，我们的直觉是，当满足以下三个条件时，链式思维帮助最大：(1) 任务具有挑战性且需要多步推理，(2) 使用了足够大的语言模型，(3) 模型的扩展曲线相对平坦。
These intuitions are perhaps supported by the arithmetic reasoning results. The performance gain from chain-of-thought prompting is largest for PaLM 540B on GSM8K (challenging multi-step problems, flat scaling curve), which meets these conditions. The performance gain is small for the subsets of MAWPS that only require one or two steps (SingleOP, SingleEq, and AddSub), for which PaLM 540B already achieves performance of 90% or higher (and it is also generally true that there is less headroom for improvement when performance is already strong).
这些直觉可能得到了算术推理结果的支持。对于 GSM8K(具有挑战性的多步骤问题,平坦的扩展曲线),PaLM 540B 的思维链提示带来的性能提升最大,这符合这些条件。对于只需要一到两个步骤的 MAWPS 子集(SingleOP,SingleEq 和 AddSub),PaLM 540B 已经达到了 90% 或更高的性能(并且通常情况下,当性能已经很强时,改进的空间较小)。
Although in this paper we focused on multi-step reasoning tasks (arithmetic, commonsense, and symbolic), chain-of-thought prompting can, at least in principle, be applied to any task that humans solve using a "chain of thought". We leave the empirical evaluation of chain-of-thought prompting on such diverse tasks (e.g., machine translation) to future work.
虽然在本文中我们专注于多步推理任务（算术、常识和符号推理），但链式思维提示至少在原则上可以应用于人类借助“思维链”来解决的任何任务。我们将链式思维提示在此类不同任务（例如机器翻译等）上的实证评估留待未来工作。
A.4 Why is prompting with the equation only not enough for some arithmetic reasoning datasets?
A.4 为什么仅用方程提示对于某些算术推理数据集是不够的?
Prompting with the equation only as an intermediate step does help on many datasets, especially when the datasets only require a few reasoning steps (SVAMP, ASDiv, MAWPS). For GSM8K, however, using the equation only did not improve performance substantially. Based on qualitative analysis, we believe that these questions are too semantically challenging for the model to directly translate them into a math equation. Consider this example from LaMDA 137B:
仅用方程作为中间步骤进行提示在许多数据集上确实有帮助,特别是在数据集只需要少量推理步骤的情况下(SVAMP、ASDiv、MAWPS)。然而,对于 GSM8K,仅使用方程并没有显著提高性能。根据定性分析,我们认为这些问题的语义挑战太大,模型无法直接将它们转换为数学方程。考虑这个来自 LaMDA 137B 的例子:
QUESTION: Mike plays ping pong for 40 minutes. In the first 20 minutes, he scores 4 points. In the second 20 minutes, he scores 25% more points. How many total points did he score?
问题:Mike 打乒乓球 40 分钟。在前 20 分钟,他得了 4 分。在后 20 分钟,他得分提高了 25% 。他总共得了多少分?
EQUATION ONLY (WRONG ANSWER): (4 + 20 * 0.25) = 6. The answer is 6.
CHAIN OF THOUGHT (CORRECT): Mike played ping pong for 40 minutes. In the first 20 minutes, he scored 4 points. In the second 20 minutes, he scored 25% more points. So he scored 25% more in the second 20 minutes. 4 x 1.25 = 5. So he scored 5 points in the second 20 minutes. So he scored 9 points in total. The answer is 9.
仅公式 (错误答案)：(4 + 20 * 0.25) = 6。答案是 6。
思维链 (正确)：Mike 打了 40 分钟乒乓球。在前 20 分钟，他得了 4 分。在后 20 分钟，他得分提高了 25%。所以他后 20 分钟得分提高了 25%。4 × 1.25 = 5。所以他后 20 分钟得了 5 分。因此，他总共得了 9 分。答案是 9。
It is hard for the model to directly translate all of the semantics of such a question into a single equation, whereas chain-of-thought prompting lets it reason about each part of the question through intermediate steps in natural language.
模型很难直接将这类问题的全部语义转换为一个方程，而思维链提示让它能够通过自然语言的中间步骤对问题的每个部分进行推理。
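To make the contrast between the two prompt formats concrete, the snippet below shows how an annotator-written exemplar for this question would differ under equation-only versus chain-of-thought prompting. The exemplar wording is illustrative and not taken from the paper's prompts.

```python
# Contrast of the two exemplar formats discussed above: equation-only compresses the
# reasoning into a single equation, while chain of thought spells out the semantics
# in natural language before any arithmetic.
question = ("Mike plays ping pong for 40 minutes. In the first 20 minutes, he scores "
            "4 points. In the second 20 minutes, he scores 25% more points. "
            "How many total points did he score?")

equation_only_exemplar = f"Q: {question}\nA: 4 + 4 * 1.25 = 9. The answer is 9."

chain_of_thought_exemplar = (
    f"Q: {question}\n"
    "A: In the first 20 minutes, Mike scores 4 points. In the second 20 minutes, he scores "
    "25% more, which is 4 * 1.25 = 5 points. In total he scores 4 + 5 = 9 points. "
    "The answer is 9."
)
```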