Towards Reasoning in Large Language Models: A Survey
Jie Huang    Kevin Chen-Chuan Chang
Department of Computer Science, University of Illinois at Urbana-Champaign
{jeffhj, kcchang}@illinois.edu
Abstract
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and it has been observed that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
1 Introduction
Reasoning is a cognitive process that involves using evidence, arguments, and logic to arrive at conclusions or make judgments. It plays a central role in many intellectual activities, such as problem solving, decision making, and critical thinking. The study of reasoning is important in fields like psychology (Wason and Johnson-Laird, 1972), philosophy (Passmore, 1961), and computer science (Huth and Ryan, 2004), as it helps individuals make decisions, solve problems, and think critically.
Recently, large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Chung et al., 2022; OpenAI, 2022, inter alia) such as ChatGPT have made significant advancements in natural language processing and related fields. It has been shown that these models exhibit emergent behaviors, including the ability to “reason”, when they are large enough (Wei et al., 2022a). For example, by providing the models with “chain of thoughts”, i.e., reasoning exemplars, or a simple prompt “Let’s think step by step”, these models are able to answer questions with explicit reasoning steps (Wei et al., 2022b; Kojima et al., 2022), e.g., “all whales are mammals, all mammals have kidneys; therefore, all whales have kidneys.” This has sparked considerable interest in the community, since reasoning ability is a hallmark of human intelligence that is frequently considered to be missing in current artificial intelligence systems (Marcus, 2020; Russin et al., 2020; Mitchell, 2021; Bommasani et al., 2021).
However, despite the strong performance of LLMs on certain reasoning tasks, it remains unclear whether LLMs are actually reasoning and to what extent they are capable of reasoning. For example, Kojima et al. (2022) claim that “LLMs are decent zero-shot reasoners (p. 1)”, while Valmeekam et al. (2022) conclude that “LLMs are still far from achieving acceptable performance on common planning/reasoning tasks which pose no issues for humans to do (p. 2).” This limitation is also stated by Wei et al. (2022b):
“we qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually reasoning (p. 9).”
Therefore, in this paper, we aim to provide a comprehensive overview and engage in an insightful discussion on the current state of knowledge on this fast-evolving topic. We initiate our exploration with a clarification of the concept of reasoning (§2). Subsequently, we turn our attention to the techniques for enhancing/eliciting reasoning in LLMs (§3), the methods and benchmarks for evaluating reasoning in LLMs (§4), and the key findings and implications in this field (§5). Finally, we reflect on and discuss the current state of the field (§6).
Figure 1: The structure of the paper.
2 What is Reasoning?
Reasoning is the process of thinking about something in a logical and systematic way, using evidence and past experiences to reach a conclusion or make a decision (Wason and Johnson-Laird, 1972; Wason, 1968; Galotti, 1989; Fagin et al., 2004; McHugh and Way, 2018). Reasoning involves making inferences, evaluating arguments, and drawing logical conclusions based on available information. Although “reasoning” is a term that is commonly used in literature and daily life, it is also an abstract concept that can refer to many things. To help the reader better understand this concept, we summarize several main categories of reasoning that are commonly recognized:
Deductive reasoning. Deductive reasoning is a type of reasoning in which a conclusion is drawn based on the truth of the premises. In deductive reasoning, the conclusion must necessarily follow from the premises, meaning that if the premises are true, the conclusion must also be true. For example:
• Premise: All mammals have kidneys.
• Premise: All whales are mammals.
• Conclusion: All whales have kidneys.
Inductive reasoning. Inductive reasoning is a type of reasoning in which a conclusion is drawn based on observations or evidence. The conclusion is likely to be true based on the available evidence, but it is not necessarily certain. For example:
• Observation: Every time we see a creature with wings, it is a bird.
• Observation: We see a creature with wings.
• Conclusion: The creature is likely to be a bird.
Abductive reasoning. Abductive reasoning is a type of reasoning in which a conclusion is drawn based on the best explanation for a given set of observations. The conclusion is the most likely explanation based on the available evidence, but it is not necessarily certain. For example:
• Observation: The car cannot start and there is a puddle of liquid under the engine.
• Conclusion: The most likely explanation is that the car has a leak in the radiator.
Other types of reasoning include analogical reasoning, which involves making comparisons between two or more things in order to make inferences or arrive at conclusions; causal reasoning, which involves identifying and understanding the causes and effects of events or phenomena; and probabilistic reasoning, which involves making decisions or arriving at conclusions based on the likelihood or probability of certain outcomes.
Formal Reasoning vs Informal Reasoning. Formal reasoning is a systematic and logical process that follows a set of rules and principles, often used in mathematics and logic. Informal reasoning is a less structured approach that relies on intuition, experience, and common sense to draw conclusions and solve problems, and is often used in everyday life. Formal reasoning is more structured and reliable, while informal reasoning is more adaptable and open-ended, but may also be less reliable. We refer the reader to Galotti (1989); Bronkhorst et al. (2020) for a detailed distinction between them.
Reasoning in Language Models. The concept of reasoning in language models has been around for some time, but there is not a clear definition of what it entails. In the literature, the term “reasoning” is often used to refer to informal reasoning, although it is not always explicitly stated that it is informal (Cobbe et al., 2021; Wei et al., 2022b, inter alia). Different forms of reasoning may be used depending on the task, benchmark, or method being used, e.g., deductive reasoning (Cobbe et al., 2021; Creswell et al., 2022; Han et al., 2022b, inter alia), inductive reasoning (Yang et al., 2022; Misra et al., 2022, inter alia) or abductive reasoning (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022, inter alia). In this paper, we encompass various forms of reasoning, with a particular focus on “informal deductive reasoning” in large language models since it is a widely used form in which the conclusion is guaranteed to be true as long as the premises are true.
3 Towards Reasoning in Large Language Models
Reasoning, particularly multi-step reasoning, is often seen as a weakness in language models and other NLP models (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). Recent research has suggested that reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters (Wei et al., 2022a,b; Cobbe et al., 2021). In this paper, we follow Wei et al. (2022a) in considering reasoning as an ability that is rarely present in small-scale models like GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2019), and therefore focus on techniques applicable to improving or eliciting “reasoning” in LLMs such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022).
3.1 Fully Supervised Finetuning
Before discussing reasoning in large language models, it is worth mentioning that there is research on eliciting/improving reasoning in small language models through fully supervised finetuning on specific datasets. For example, Rajani et al. (2019) finetune a pretrained GPT model (Radford et al., 2018) to generate rationales that explain model predictions using the constructed CoS-E dataset, and find that models trained with explanations perform better on commonsense question answering tasks (Talmor et al., 2019). Talmor et al. (2020) train RoBERTa (Liu et al., 2019) to perform reasoning/inference based on both implicit pre-trained knowledge and explicit free-text statements. Hendrycks et al. (2021) finetune pretrained language models to solve competition mathematics problems by generating full step-by-step solutions, though the accuracy is relatively low. Nye et al. (2022) train language models to do multi-step reasoning for program synthesis/execution by generating “scratchpads”, i.e., intermediate computations, before producing the final answers. We refer the reader to the surveys of Helwe et al. (2021) and Bhargava and Ng (2022) for more studies in this line.
There are two major limitations of fully supervised finetuning. First, it requires a dataset containing explicit reasoning, which can be difficult and time-consuming to create. Second, the model is trained only on a specific dataset, which limits its application to a specific domain and may result in the model relying on artifacts in the training data rather than actual reasoning to make predictions.
3.2 Prompting & In-Context Learning
Large language models such as GPT-3 (Brown et al., 2020) have demonstrated remarkable few-shot performance across a variety of tasks through in-context learning. These models can be prompted with a question and a few ⟨input, output⟩ exemplars to potentially solve a problem through “reasoning”, either implicitly or explicitly. However, research has shown that these models still fall short when it comes to tasks that require multiple steps of reasoning to solve (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). This may be due to a lack of exploration into the full capabilities of these models, as recent studies have suggested.
3.2.1 Chain of Thought and Its Variants
To encourage LLMs to engage in reasoning rather than simply providing answers directly, we may guide LLMs to generate “reasoning” explicitly. One approach for doing this is chain-of-thought prompting, proposed by Wei et al. (2022b). This approach involves providing a few examples of “chain of thought” (CoT), which are intermediate natural language reasoning steps, in the prompt to LLMs (Figure 2). Specifically, in CoT prompting, ⟨input, output⟩ demonstrations are replaced with ⟨input, chain of thought, output⟩ triples, e.g., “[input] Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? [chain of thought] Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. $5+6=11$. [output] The answer is 11.” In this way, given a target question, the model learns to generate an explicit rationale before producing the final answer. Experimental results show that this simple idea can improve LLMs’ few-shot performance on arithmetic, symbolic, and commonsense reasoning tasks, sometimes to a striking degree.
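To make the format concrete, below is a minimal sketch of how a few-shot CoT prompt can be assembled from ⟨input, chain of thought, output⟩ triples. It is illustrative only: the exemplar is the one quoted above, and `generate` stands in for any LLM text-completion call rather than a specific API.

```python
# Minimal sketch of few-shot chain-of-thought prompting (Wei et al., 2022b).
# `generate` is a placeholder for any LLM completion call; the exemplar follows
# the <input, chain of thought, output> format described above.
from typing import Callable, List, Tuple

Exemplar = Tuple[str, str, str]  # (input, chain of thought, output)

EXEMPLARS: List[Exemplar] = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.",
        "The answer is 11.",
    ),
]

def build_cot_prompt(question: str, exemplars: List[Exemplar] = EXEMPLARS) -> str:
    """Concatenate (question, rationale, answer) exemplars before the target question."""
    blocks = [f"Q: {q}\nA: {cot} {ans}" for q, cot, ans in exemplars]
    blocks.append(f"Q: {question}\nA:")  # the model continues with its own rationale + answer
    return "\n\n".join(blocks)

def answer_with_cot(question: str, generate: Callable[[str], str]) -> str:
    """Return the model's rationale and answer as raw text."""
    return generate(build_cot_prompt(question))
```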
Figure 2: An illustration of Chain-of-Thought Prompting and Rationale Engineering, where the asterisk (*) denotes the target problem to be solved.
Several variants of chain-of-thought prompting have been proposed in the literature, either taking a different form or targeting a specific problem or setting.
Different Form: Kojima et al. (2022) introduce Zero-shot-CoT, in which LLMs are simply prompted with the phrase “Let’s think step by step” after the input, in order to elicit reasoning without the need for few-shot demonstrations. Madaan et al. (2022); Gao et al. (2022); Chen et al. (2022) find that LLMs trained with code, e.g., Codex (Chen et al., 2021), can achieve better performance on reasoning tasks by framing reasoning as code generation. Wang et al. (2022a) propose to iteratively prompt chain of thought. He et al. (2023) attempt to retrieve external knowledge in CoT to improve the faithfulness of reasoning.
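As a rough sketch, Zero-shot-CoT can be implemented as two completion calls: one that appends “Let’s think step by step” to elicit a rationale, and one that extracts the final answer conditioned on that rationale. The prompt wording and the `generate` placeholder are assumptions, not the authors’ exact implementation.

```python
# Sketch of Zero-shot-CoT (Kojima et al., 2022): elicit a rationale first,
# then ask for the final answer conditioned on that rationale.
from typing import Callable

def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = generate(reasoning_prompt)  # model writes out intermediate steps
    answer_prompt = f"{reasoning_prompt} {rationale}\nTherefore, the answer is"
    return generate(answer_prompt).strip()  # second call extracts the answer
```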
Specific Problem/Setting: Before chain of thought, Nye et al. (2022) also try to use intermediate computations, named “scratchpads”, to improve language models’ reasoning performance in both finetuning and few-shot regimes, with a particular focus on programs. Shi et al. (2022) attempt to solve multilingual reasoning tasks with CoT in the native language, CoT in English (regardless of the problem language), and CoT in English (with the problem translated to English). Chen (2022) apply CoT to table-based reasoning, finding that LLMs can achieve strong performance on table tasks with only one exemplar. Prystawski et al. (2022) demonstrate that CoT can improve LLMs’ performance on paraphrase selection for metaphors. Lu et al. (2022) apply chain of thought to solve multimodal science questions.
3.2.2 Rationale Engineering
The original version of chain-of-thought prompting, proposed by Wei et al. (2022b), relies on manually crafted examples of intermediate reasoning steps and applies greedy decoding during generation. Rationale engineering aims to more effectively elicit or utilize reasoning in LLMs. This can be achieved through rationale refinement, which involves creating more effective examples of reasoning steps, or through rationale exploration and rationale verification, which involve exploring and verifying the rationales produced by LLMs. A summary of rationale engineering is illustrated in Figure 2.
Rationale refinement. The choice of exemplars can significantly affect the few-shot performance of LLMs, as demonstrated by research such as Liu et al. (2022b); the same effect appears in chain-of-thought prompting. Rationale refinement aims to create and refine rationale examples that better elicit reasoning in LLMs. Fu et al. (2022b) propose complexity-based prompting to create rationales with more reasoning steps. Their experiments show that the performance of LLMs improves with increased rationale complexity. Similarly, Zhou et al. (2022c) propose algorithmic prompting, which suggests that providing more thorough examples of solutions can help improve reasoning performance on some simple math calculations. Zhang et al. (2022b) design Auto-CoT to automatically construct exemplars by partitioning questions from a given dataset into clusters and then using Zero-Shot-CoT (Kojima et al., 2022) to generate the rationale for a representative question from each cluster. Their analysis shows that making exemplars diverse is important in prompting LLMs to produce better rationales.
Rationale exploration. In addition to providing better exemplars, we can allow LLMs to fully explore various ways of reasoning to improve their performance on reasoning tasks, named rationale exploration. Based on the idea that complex problems often admit multiple ways of thinking that can lead to their unique correct answer, Wang et al. (2022c) present a decoding strategy called self-consistency to improve upon the traditional greedy decoding used in chain-of-thought prompting. This strategy involves sampling a diverse set of rationales, rather than just the greedy one, and selecting the most consistent answer by marginalizing out the sampled rationales. The idea is also used in Fu et al. (2022b) to vote over the top complex rationales. To further improve performance, Li et al. (2022b) suggest providing different demonstrations for each question by sampling exemplars from an exemplar base, in order to increase the diversity of the sampled rationales.
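A minimal sketch of this idea is shown below: sample several rationales with a stochastic decoder, parse out each final answer, and return the majority answer. The `sample_rationale` and `extract_answer` callables are placeholders for model sampling and answer parsing, which the original work implements differently.

```python
# Sketch of self-consistency decoding (Wang et al., 2022c): marginalize over
# sampled reasoning paths by majority-voting their final answers.
from collections import Counter
from typing import Callable

def self_consistency(
    question: str,
    sample_rationale: Callable[[str], str],  # one stochastic (e.g., temperature) sample
    extract_answer: Callable[[str], str],    # parse the final answer from a rationale
    num_samples: int = 20,
) -> str:
    answers = [extract_answer(sample_rationale(question)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]  # the most consistent answer wins
```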
Rationale verification. Ensuring that the rationales produced by LLMs are valid is critical, as incorrect rationales can lead to incorrect final predictions (Ye and Durrett, 2022). To address this issue, the process of rationale verification aims to verify whether the rationales produced by LLMs lead to the correct final answers. Cobbe et al. (2021) propose augmenting LLMs with a trained verifier that assigns a score to each rationale and solution generated by the LLM, selecting the highest-ranked solution as the final answer when solving math word problems. Li et al. (2022b) also use this technique to guide rationale selection, in conjunction with the process of rationale exploration. Different from the above methods that train an external verifier to verify the rationales, Weng et al. (2022) suggest using LLMs themselves as the verifiers.
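Schematically, verifier-based selection reduces to re-ranking sampled candidates with a scoring function, as in the hedged sketch below; `verifier_score` stands in for a trained verifier (Cobbe et al., 2021) or an LLM acting as its own verifier (Weng et al., 2022), and is not their actual code.

```python
# Sketch of rationale verification: score each sampled (rationale, answer)
# candidate and keep the highest-ranked one as the final prediction.
from typing import Callable, List, Tuple

def select_by_verifier(
    question: str,
    candidates: List[Tuple[str, str]],                 # sampled (rationale, answer) pairs
    verifier_score: Callable[[str, str, str], float],  # higher = more likely correct
) -> Tuple[str, str]:
    return max(candidates, key=lambda c: verifier_score(question, c[0], c[1]))
```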
3.2.3 Problem Decomposition
Chain-of-thought prompting, while effective for eliciting reasoning in LLMs, can struggle with complex tasks, e.g., tasks that require compositional generalization (Lake and Baroni, 2018; Keysers et al., 2020). To solve a complex problem, it is helpful to first break it down into smaller, more manageable subproblems. By solving each of these subproblems, we can effectively solve the complex problem. This technique is called problem decomposition or divide and conquer (Talmor and Berant, 2018; Min et al., 2019; Perez et al., 2020).
Based on this idea, Zhou et al. (2022a) propose least-to-most prompting, which consists of two steps: decomposing the complex problem into subproblems and solving these subproblems in a specific order, with each subproblem being facilitated by the answers obtained from previously solved subproblems. As follow-up work, Drozdov et al. (2022) introduce dynamic least-to-most prompting, which is designed to solve more realistic semantic parsing problems by decomposing the problems with prompting-based syntactic parsing and dynamically selecting exemplars based on the decomposition. In addition, Khot et al. (2022) design decomposed prompting, which breaks down a complex problem into subproblems that can be handled by a shared library of prompting-based LLMs, each specialized in a particular subproblem. Furthermore, Dua et al. (2022) develop successive prompting, which iteratively decomposes a complex problem into a simple problem, with the next subproblem prediction having access to the answers to the previous subproblems. While the above methods decompose or solve compositional questions with multiple forward passes, Press et al. (2022) suggest decomposing and solving the input question in one forward pass using CoT prompting. Overall, these techniques show promise for helping LLMs to solve complex tasks by decomposing the problem into more manageable subproblems.
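The two-stage structure of least-to-most prompting can be sketched as follows. The decomposition prompt wording, the one-subquestion-per-line assumption, and the `generate` callable are all illustrative placeholders rather than the published prompts.

```python
# Sketch of least-to-most prompting (Zhou et al., 2022a): decompose the problem,
# then solve subproblems in order, feeding earlier answers into later prompts.
from typing import Callable, List

def decompose(question: str, generate: Callable[[str], str]) -> List[str]:
    prompt = f"{question}\nTo solve this, we need to first answer the following subquestions:"
    # Assume the model lists one subquestion per line.
    return [line.strip("- ").strip() for line in generate(prompt).splitlines() if line.strip()]

def least_to_most(question: str, generate: Callable[[str], str]) -> str:
    context, answer = question, ""
    for sub in decompose(question, generate):
        answer = generate(f"{context}\nQ: {sub}\nA:").strip()
        context += f"\nQ: {sub}\nA: {answer}"  # later subproblems see earlier answers
    return answer  # the last subproblem targets the original question
```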
3.2.4 Others
There are other techniques that have been developed to facilitate reasoning in LLMs for specific tasks or settings. For instance, Creswell et al. (2022); Creswell and Shanahan (2022) introduce a selection-inference framework that uses LLMs as modules to select and infer reasoning steps from a set of facts that culminate in the final answer. Kazemi et al. (2022) suggest using backward chaining, i.e., working from the goal to the set of facts that support it, instead of forward chaining as in Creswell et al. (2022); Creswell and Shanahan (2022). In addition, Jung et al. (2022) propose a method for solving binary questions by prompting LLMs abductively and recursively to rationalize each option. Zhou et al. (2022b) design a technique for performing numerical reasoning on complex numbers by replacing the complex numbers with simple numbers to produce simpler expressions, and then using these expressions to perform calculations on the complex numbers. There are also efforts to distill reasoning from LLMs into smaller models, such as the work by Li et al. (2022a); Shridhar et al. (2022); Magister et al. (2022). Finally, we refer the reader to Dohan et al. (2022)’s position paper on language model cascades, which presents a unifying framework for understanding chain-of-thought prompting and research in this line.
3.3 Hybrid Method
While “prompting” techniques can help elicit or better utilize reasoning in large language models to solve reasoning tasks, they do not actually improve the reasoning capabilities of the LLMs themselves, as the parameters of the models remain unchanged. In contrast, the “hybrid approach” aims to simultaneously improve the reasoning capabilities of LLMs and make better use of these models in order to solve complex problems. This approach involves both enhancing the reasoning capabilities of the LLMs and using techniques such as prompting to effectively utilize these capabilities.
3.3.1 Reasoning-Enhanced Training and Prompting
One approach to improving the reasoning capabilities of LLMs is to pretrain or finetune the models on datasets that include “reasoning”. Lewkowycz et al. (2022); Taylor et al. (2022) find that LLMs trained on datasets containing scientific and mathematical data can achieve better performance on reasoning tasks like quantitative reasoning problems when using CoT prompting. Pi et al. (2022) show that continually pretraining with SQL data can boost the performance of language models, e.g., T5 (Raffel et al., 2020), on natural language reasoning such as numerical reasoning and logical reasoning. Furthermore, Chung et al. (2022) develop Flan models by finetuning PaLM (Chowdhery et al., 2022) and T5 (Raffel et al., 2020) with 1.8k finetuning tasks, including CoT data, and find that CoT data are critical to maintaining reasoning abilities. Similarly, Yu et al. (2022) finetune OPT (Zhang et al., 2022a) on 10 reasoning datasets and observe that it can improve some reasoning capabilities of LLMs. Anil et al. (2022) study the length generalization abilities of LLMs, i.e., whether LLMs trained on short problem instances can generalize to long ones. They discover that the combination of few-shot scratchpad (or chain-of-thought) finetuning and scratchpad prompting results in a significant improvement in LLMs’ ability to generalize to longer problems, while this phenomenon is not observed in the standard fully supervised finetuning paradigm.
3.3.2 Bootstrapping & Self-Improving
Instead of finetuning LLMs on pre-built datasets that include reasoning, some studies have explored the idea of using LLMs to self-improve their reasoning abilities through a process known as bootstrapping. One example of this is the Self-Taught Reasoner (STaR) introduced by Zelikman et al. (2022), in which an LLM is iteratively trained and refined on its own output. Specifically, with CoT prompting, the model first generates initial rationales. The model is then finetuned on rationales that lead to correct answers. This process can be repeated, with each iteration resulting in an improved model that can generate better training data, which in turn leads to further improvements. As a follow-up to this work, Huang et al. (2022a) show that LLMs are able to self-improve their reasoning abilities without the need for supervised data by leveraging the self-consistency of reasoning (Wang et al., 2022c).
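The core loop can be summarized schematically as below (omitting details such as STaR’s rationalization with answer hints); `cot_generate`, `is_correct`, and `finetune` are placeholder callables, not the released implementation.

```python
# Schematic sketch of STaR-style bootstrapping (Zelikman et al., 2022): generate
# rationales, keep those that reach the correct answer, finetune on them, repeat.
from typing import Callable, List, Tuple

def bootstrap_reasoner(
    model,
    dataset: List[Tuple[str, str]],                          # (question, gold answer) pairs
    cot_generate: Callable[[object, str], Tuple[str, str]],  # -> (rationale, predicted answer)
    is_correct: Callable[[str, str], bool],
    finetune: Callable[[object, List[Tuple[str, str, str]]], object],
    iterations: int = 3,
):
    for _ in range(iterations):
        kept = []
        for question, gold in dataset:
            rationale, prediction = cot_generate(model, question)
            if is_correct(prediction, gold):
                kept.append((question, rationale, gold))  # keep only rationales that worked
        model = finetune(model, kept)  # the refined model generates better data next round
    return model
```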
4 Measuring Reasoning in Large Language Models
We summarize methods and benchmarks for evaluating reasoning abilities of LLMs in this section.
4.1 End Task Performance
One way to measure reasoning abilities of LLMs is to report their performance, e.g., accuracy, on end tasks that require reasoning. We list some common benchmarks as follows.
Arithmetic Reasoning. Arithmetic reasoning is the ability to understand and apply mathematical concepts and principles in order to solve problems involving arithmetic operations. This involves using logical thinking and mathematical principles to determine the correct course of action when solving mathematical problems. Representative benchmarks for arithmetic reasoning include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MathQA (Amini et al., 2019), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), AQuA (Ling et al., 2017), and MAWPS (Roy and Roth, 2015). It is worth mentioning that Anil et al. (2022) generate the Parity Datasets and the Boolean Variable Assignment Dataset for analyzing the length generalization capabilities of LLMs (§3.3.1).
Commonsense Reasoning. Commonsense reasoning is the use of everyday knowledge and understanding to make judgments and predictions about new situations. It is a fundamental aspect of human intelligence that enables us to navigate our environment, understand others, and make decisions with incomplete information. Benchmarks that can be used for testing commonsense reasoning abilities of LLMs include CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and ARC (Clark et al., 2018). We refer the reader to Bhargava and Ng (2022)’s survey for more work in this domain.
Symbolic Reasoning. Symbolic reasoning is a form of reasoning that involves the manipulation of symbols according to formal rules. In symbolic reasoning, we use abstract symbols to represent concepts and relationships, and then manipulate those symbols according to precise rules in order to draw conclusions or solve problems. Two benchmarks of symbolic reasoning are presented in Wei et al. (2022b), including Last Letter Concatenation and Coin Flip.
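Because both tasks are procedurally defined, evaluation examples can be generated directly; the sketch below is one possible generator, with phrasing chosen for illustration rather than copied from the original benchmarks.

```python
# Minimal generators for the two symbolic reasoning tasks of Wei et al. (2022b).
import random

def last_letter_concatenation(names):
    """e.g. ["Elon", "Musk"] -> answer "nk"."""
    question = 'Take the last letters of the words in "' + " ".join(names) + '" and concatenate them.'
    answer = "".join(name[-1] for name in names)
    return question, answer

def coin_flip(num_people: int = 4, seed: int = 0):
    """People either flip or do not flip a coin that starts heads up."""
    rng = random.Random(seed)
    flips = [rng.choice([True, False]) for _ in range(num_people)]
    question = "A coin is heads up. " + " ".join(
        f"Person {i + 1} {'flips' if f else 'does not flip'} the coin." for i, f in enumerate(flips)
    ) + " Is the coin still heads up?"
    answer = "yes" if sum(flips) % 2 == 0 else "no"  # an even number of flips keeps it heads up
    return question, answer
```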
Others. In practice, there are many benchmarks that can be used to evaluate the reasoning abilities of LLMs (indirectly), as long as the downstream task involves reasoning. BIG-bench (Srivastava et al., 2022), for example, includes over 200 tasks that test a range of reasoning skills, including tasks like Date Understanding, Word Sorting, and Causal Judgement. Other benchmarks, such as SCAN (Lake and Baroni, 2018) and the one proposed by Anil et al. (2022), focus on evaluating generalization ability. LLMs can also be tested on their table reasoning abilities using benchmarks such as WikiTableQA (Pasupat and Liang, 2015) and FetaQA (Nan et al., 2022), as suggested by Chen (2022). In addition, there are benchmarks for evaluating LLMs’ generative relational reasoning abilities, such as CommonGen (Lin et al., 2020; Liu et al., 2022a) and Open Relation Modeling (Huang et al., 2022b,d).
4.2 Analysis on Reasoning
Although LLMs have demonstrated impressive performance on various reasoning tasks, the extent to which their predictions are based on true reasoning or simple heuristics is not always clear. This is because most existing evaluations focus on their accuracy on end tasks, rather than directly assessing their reasoning steps. While some error analysis has been conducted on the generated rationales of LLMs (Wei et al., 2022b; Kojima et al., 2022, inter alia), this analysis has often been limited in depth.
There have been some efforts to develop metrics and benchmarks that enable a more formal and deeper analysis of reasoning in LLMs. Golovneva et al. (2022) design ROSCOE, a set of interpretable, detailed step-by-step evaluation metrics covering various perspectives including semantic alignment, logical inference, semantic similarity, and language coherence. Saparov and He (2022) create a synthetic dataset called PrOntoQA that is generated from real or fictional ontologies. Each example in the dataset has a unique proof, which can be converted to simple sentences and back again, allowing for a formal analysis of each reasoning step. Han et al. (2022a) introduce a dataset called FOLIO to test the first-order logic reasoning capabilities of LLMs. FOLIO contains first-order logic reasoning problems that require models to determine the correctness of conclusions given a set of premises. In addition, Wang et al. (2022b) conduct ablation experiments on CoT and find that LLMs may also perform reasoning when prompted with invalid rationales. Their study also suggests that being relevant to the query and correctly ordering the reasoning steps are important for CoT prompting.
In summary, most existing studies primarily report the performance of the models on downstream reasoning tasks, without a detailed examination of the quality of the rationales produced. This leaves open the question of whether the models are actually able to reason in a way that is similar to human reasoning, or whether they are simply able to achieve good performance on the tasks through other means. Further research is needed to more formally analyze the reasoning abilities of LLMs.
5 Findings and Implications
In this section, we summarize the important findings and implications of studies on reasoning in large language models.
Reasoning seems an emergent ability of LLMs. Wei et al. (2022a,b); Suzgun et al. (2022) show that reasoning ability appears to emerge only in large language models like GPT-3 175B, as evidenced by significant improvements in performance on reasoning tasks at a certain scale (e.g., 100 billion parameters). This suggests that it may be more effective to utilize large models for general reasoning problems rather than training small models for specific tasks. However, the reason for this emergent ability is not yet fully understood. We refer the reader to Wei et al. (2022a); Fu et al. (2022a) for some potential explanations.
Chain of thought elicits “reasoning” of LLMs. The use of chain-of-thought (CoT) prompts (Wei et al., 2022b) has been shown to improve the performance of LLMs on various reasoning tasks, as demonstrated in the experiments of Wei et al. (2022a,b); Suzgun et al. (2022). Additionally, Saparov and He (2022) (§4.2) find that, when using CoT prompts, LLMs are able to produce valid individual proof steps, even when the synthetic ontology is fictional or counterfactual. However, they may sometimes choose the wrong steps when multiple options are available, leading to incomplete or incorrect proofs. Moreover, for many reasoning tasks where the performance of standard prompting grows smoothly with model scale, chain-of-thought prompting can lead to dramatic performance improvement. In addition to these benefits, the use of CoT prompts has been shown to improve the out-of-distribution robustness of LLMs (Wei et al., 2022b; Zhou et al., 2022a; Anil et al., 2022, inter alia), an advantage that is not typically observed with standard prompting or fully supervised finetuning paradigms.
LLMs show human-like content effects on reasoning. According to Dasgupta et al. (2022), LLMs exhibit reasoning patterns that are similar to those of humans as described in the cognitive literature. For example, the models’ predictions are influenced by both prior knowledge and abstract reasoning, and their judgments of logical validity are impacted by the believability of the conclusions. These findings suggest that, although language models may not always perform well on reasoning tasks, their failures often occur in situations that are challenging for humans as well. This provides some evidence that language models may “reason” in a way that is similar to human reasoning.
LLMs are still unskilled at complex reasoning. Although LLMs seem to possess impressive reasoning capabilities with the techniques described in §3, they still struggle with more complex reasoning tasks or those involving implicature, according to studies such as Valmeekam et al. (2022); Han et al. (2022a); Ruis et al. (2022). For instance, Valmeekam et al. (2022) find that even in relatively simple commonsense planning domains that humans would have no trouble navigating, LLMs such as GPT-3 (Brown et al., 2020) and BLOOM (Scao et al., 2022) struggle to perform effectively. These findings suggest that existing benchmarks may be too simple to accurately gauge the true reasoning abilities of LLMs, and that more challenging tasks may be needed to fully evaluate their abilities in this regard.
6 Reflection, Discussion, and Future Directions
Why reasoning? Reasoning is the process of thinking about something in a logical and systematic way, and it is a key aspect of human intelligence. By incorporating reasoning capabilities into language models, we can enable them to perform tasks that require more complex and nuanced thinking, such as problem solving, decision making, and planning (Huang et al., 2022e,f; Song et al., 2022). This can improve the performance of these models on downstream tasks and increase their out-of-distribution robustness (Wei et al., 2022a,b; Suzgun et al., 2022; Zhou et al., 2022a; Anil et al., 2022). In addition, reasoning can make language models more explainable and interpretable, as it provides explicit rationales for their predictions.
Right task/application? As Valmeekam et al. (2022) point out, current benchmarks may not adequately reflect the reasoning capabilities of LLMs. In addition, tasks such as solving simple math problems and concatenating letters in strings (§4.1) are artificial and do not accurately reflect real-world situations. To truly understand the reasoning ability of LLMs, it is important to consider more realistic and meaningful applications such as decision making (Edwards, 1954), legal reasoning (Levi, 2013), and scientific reasoning (Zimmerman, 2000). Our ultimate goal should not be to enable LLMs to solve simple math problems, which can be simply done with other programs. When conducting relevant research, it is essential to ask whether the specific task being tackled is meaningful and whether the proposed method can be generalized to more realistic tasks and applications.
Are language models really able to reason? There are several indications that LLMs are able to reason, including 1) high performance on various tasks requiring reasoning (Suzgun et al., 2022);
2) the ability to reason step-by-step with chain-of-thought prompting (Wei et al., 2022b); and 3) the reflection of human-like content effects on reasoning (Dasgupta et al., 2022). However, these findings are not sufficient to conclude that LLMs can truly reason. For 1), it is not clear whether the models are making predictions based on reasoning or heuristics (Patel et al., 2021). In fact, for many existing reasoning benchmarks, we can design a program with heuristic rules that achieves very high performance, and we usually do not consider a program relying on heuristic rules to be capable of reasoning. For 2), although the models seem to reason step-by-step, the generated rationales may be incorrect and inconsistent; the models may be “generating reasoning-like responses” rather than “reasoning step-by-step”. For 3), while LLMs display some human-like reasoning patterns, this does not necessarily mean that they behave like humans.
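To illustrate the point about heuristics, the Coin Flip task from §4.1 can be answered perfectly by a rule-based program that merely counts flips, with no reasoning involved; the sketch below assumes the “flips the coin” / “does not flip the coin” phrasing and is purely illustrative.

```python
# A purely heuristic "solver" for the Coin Flip task (§4.1): an even number of
# flips leaves the coin heads up. High accuracy here says nothing about reasoning.
def coin_flip_heuristic(question: str) -> str:
    flips = question.count("flips the coin")  # "does not flip" does not match "flips"
    return "yes" if flips % 2 == 0 else "no"
```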
Additionally, there are several observations that suggest LLMs may not be capable of reasoning: 1) LLMs still struggle with tasks that require complex reasoning (Valmeekam et al., 2022; Han et al., 2022a; Ruis et al., 2022). If LLMs are really decent reasoners, they should handle tasks that can be simply solved by humans through reasoning; 2) LLMs make mistakes in their reasoning, as explained above; 3) The performance of LLMs on downstream tasks has been found to be sensitive to the frequency of certain terms, such as numbers, in the training data (Razeghi et al., 2022; Jung et al., 2022), which would not be expected if the models were solving mathematical problems through reasoning; 4) Language models have been found to struggle with associating relevant information that they have memorized (Huang et al., 2022c).
Overall, it is still too early to draw a conclusion about the proposed question. In fact, there is also an ongoing debate about whether language models can actually understand language or capture meaning (Bender and Koller, 2020; Li et al., 2021; Manning, 2022; Piantadosi and Hill, 2022). Further in-depth analysis of factors such as training data, model architecture, and optimization objectives is needed, as well as the development of better benchmarks for measuring the reasoning capabilities of LLMs. However, it is clear that the current models are not yet capable of robust reasoning.
Improving reasoning capabilities of LLMs.
While techniques like chain-of-thought prompting (Wei et al., 2022b) may help to elicit reasoning abilities in large language models, they cannot enable the models to solve tasks beyond their current capabilities. To truly enhance reasoning in LLMs, we need to utilize training data, model architectures, and optimization objectives that are designed to encourage reasoning. For example, finetuning a model on a dataset including CoT data has been shown to improve reasoning (Chung et al., 2022), and models can also self-improve through the process of bootstrapping their reasoning (Zelikman et al., 2022; Huang et al., 2022a). There is still much research that needs to be done in this area, and we look forward to future progress in improving reasoning in large language models.
7 Conclusion
In this paper, we have provided a detailed and up-to-date review of the current state of knowledge on reasoning in large language models. We have discussed techniques for improving and eliciting reasoning in LLMs, methods and benchmarks for evaluating reasoning abilities, and the findings and implications of previous studies on this topic. While LLMs have made significant progress in natural language processing and related fields, it remains unclear to what extent they are capable of true reasoning or whether they are simply using memorized patterns and heuristics to solve problems. Further research is needed to fully understand the reasoning abilities of LLMs, improve their reasoning capabilities, and determine their potential for use in a variety of applications. We hope that this paper will serve as a useful overview of the current state of the field and stimulate further discussion and research on this interesting and important topic.
Limitations
In this paper, we provide an overview of the current state of knowledge on reasoning in large language models. Reasoning is a broad concept that encompasses various forms, making it impractical to summarize all related work in a single paper. Therefore, we focus on deductive reasoning, as it is the most commonly studied in the literature. Other forms of reasoning such as inductive reasoning (Yang et al., 2022; Misra et al., 2022, inter alia) and abductive reasoning (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022, inter alia) may not be discussed in depth.
Additionally, given the rapid evolution and significance of reasoning within large language models, it is crucial to note that new contributions may have emerged in the field concurrent with the writing of this paper. An additional resource to consider is a parallel survey by Qiao et al. (2022), which emphasizes reasoning via language model prompting. Our coverage may not extend to papers released during or after 2023 such as evaluation on ChatGPT (Bang et al., 2023; Zheng et al., 2023). As such, we recommend readers to check the papers that cite this survey for a more comprehensive and updated understanding of this field.
Acknowledgements
We would like to thank Jason Wei (OpenAI) and Denny Zhou (Google DeepMind) for their valuable advice and constructive feedback on this work. This material is based upon work supported by the National Science Foundation IIS 16-19302 and IIS 16-33755, Zhejiang University ZJU Research 083650, IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) and IBM-Illinois Discovery Accelerator Institute (IIDAI), gift grants from eBay and Microsoft Azure, UIUC OVCR CCIL Planning Grant 434S34, UIUC CSBS Small Grant 434C8U, and UIUC New Frontiers Initiative. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.
References
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.
Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. 2022. Exploring length generalization in large language models.