[论文翻译]大语言模型中的推理能力研究综述


原文地址:https://arxiv.org/pdf/2212.10403


Towards Reasoning in Large Language Models: A Survey

大语言模型中的推理能力研究综述

Jie Huang, Kevin Chen-Chuan Chang
Department of Computer Science, University of Illinois at Urbana-Champaign
{jeffhj, kcchang}@illinois.edu

黄杰 Kevin Chen-Chuan Chang
伊利诺伊大学厄巴纳-香槟分校计算机科学系
{jeffhj, kcchang}@illinois.edu

Abstract

摘要

Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.1

推理是人类智能的基本方面,在问题解决、决策制定和批判性思维等活动中起着关键作用。近年来, 大语言模型(LLM)在自然语言处理领域取得了显著进展,有观察表明当这些模型规模足够大时可能展现出推理能力。然而,目前尚不清楚LLM具备何种程度的推理能力。本文全面概述了LLM推理研究的现状,包括改进和激发模型推理的技术、评估推理能力的方法与基准、该领域先前研究的发现与启示,以及对未来方向的建议。我们的目标是提供这一主题的详细最新综述,并促进有意义的讨论和未来工作。[1]

1 Introduction

1 引言

Reasoning is a cognitive process that involves using evidence, arguments, and logic to arrive at conclusions or make judgments. It plays a central role in many intellectual activities, such as problem solving, decision making, and critical thinking. The study of reasoning is important in fields like psychology (Wason and Johnson-Laird, 1972), philosophy (Passmore, 1961), and computer science (Huth and Ryan, 2004), as it helps individuals make decisions, solve problems, and think critically.

推理是一种认知过程,涉及利用证据、论据和逻辑来得出结论或做出判断。它在许多智力活动中扮演核心角色,例如解决问题、决策制定和批判性思维。推理研究在心理学 (Wason and Johnson-Laird, 1972)、哲学 (Passmore, 1961) 和计算机科学 (Huth and Ryan, 2004) 等领域具有重要意义,因为它能帮助个人做出决策、解决问题并进行批判性思考。

Recently, large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Chung et al., 2022; OpenAI, 2022, inter alia) such as ChatGPT have made significant advancements in natural language processing and related fields. It has been shown that these models exhibit emergent behaviors, including the ability to “reason”, when they are large enough (Wei et al., 2022a). For example, by providing the models with “chain of thoughts”, i.e., reasoning exemplars, or a simple prompt “Let’s think step by step”, these models are able to answer questions with explicit reasoning steps (Wei et al., 2022b; Kojima et al., 2022), e.g., “all whales are mammals, all mammals have kidneys; therefore, all whales have kidneys.” This has sparked considerable interest in the community since reasoning ability is a hallmark of human intelligence that is frequently considered missed in current artificial intelligence systems (Marcus, 2020; Russin et al., 2020; Mitchell, 2021; Bommasani et al., 2021).

近来,以ChatGPT为代表的大语言模型(LLM)(Brown et al., 2020; Chowdhery et al., 2022; Chung et al., 2022; OpenAI, 2022等)在自然语言处理及相关领域取得重大突破。研究表明,当模型规模足够大时,它们会展现出包括"推理"能力在内的涌现行为(Wei et al., 2022a)。例如,通过向模型提供"思维链"(即推理示例)或简单提示"让我们逐步思考",这些模型能够以显式推理步骤回答问题(Wei et al., 2022b; Kojima et al., 2022),如"所有鲸鱼都是哺乳动物,所有哺乳动物都有肾脏;因此所有鲸鱼都有肾脏"。这一发现引发了学界广泛关注,因为推理能力作为人类智能的标志性特征,在当前人工智能系统中常被认为有所欠缺(Marcus, 2020; Russin et al., 2020; Mitchell, 2021; Bommasani et al., 2021)。

However, despite the strong performance of LLMs on certain reasoning tasks, it remains unclear whether LLMs are actually reasoning and to what extent they are capable of reasoning. For example, Kojima et al. (2022) claim that “LLMs are decent zero-shot reasoners (p. 1)”, while Valmeekam et al. (2022) conclude that “LLMs are still far from achieving acceptable performance on common planning/reasoning tasks which pose no issues for humans to do (p. 2).” This limitation is also stated by Wei et al. (2022b):

然而,尽管大语言模型(LLM)在某些推理任务上表现优异,但其是否真正具备推理能力以及推理能力的边界仍不明确。例如,Kojima等人(2022)声称"大语言模型是优秀的零样本推理者(p.1)",而Valmeekam等人(2022)则得出结论"大语言模型在人类轻松完成的常规规划/推理任务上仍远未达到可接受水平(p.2)"。Wei等人(2022b)同样指出这一局限性:

“we qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually reasoning (p. 9).”

我们强调,尽管思维链 (chain of thought) 模拟了人类推理者的思维过程,但这并不能回答神经网络是否真的在进行推理 (p. 9)。

Therefore, in this paper, we aim to provide a comprehensive overview and engage in an insightful discussion on the current state of knowledge on this fast-evolving topic. We initiate our exploration with a clarification of the concept of reasoning (§2). Subsequently, we turn our attention to the techniques for enhancing/eliciting reasoning in LLMs (§3), the methods and benchmarks for evaluating reasoning in LLMs (§4), and the key findings and implications in this field (§5). Finally, we reflect on and discuss the current state of the field (§6).

因此,本文旨在对这一快速演进领域的研究现状进行全面综述与深度探讨。我们首先厘清推理(Reasoning)的概念定义(§2),随后系统梳理大语言模型中推理能力的增强/激发技术(§3)、评估方法及基准测试(§4)、核心研究发现与启示(§5),最后对该领域现状进行反思与展望(§6)。


Figure 1: The structure of the paper.

图 1: 论文结构。

2 What is Reasoning?

2 什么是推理?

Reasoning is the process of thinking about something in a logical and systematic way, using evidence and past experiences to reach a conclusion or make a decision (Wason and Johnson-Laird, 1972; Wason, 1968; Galotti, 1989; Fagin et al., 2004; McHugh and Way, 2018). Reasoning involves making inferences, evaluating arguments, and drawing logical conclusions based on available information. Although “reasoning” is a term that is commonly used in literature and daily life, it is also an abstract concept that can refer to many things. To help the reader better understand this concept, we summarize several main categories of reasoning that are commonly recognized:

推理是以逻辑和系统化的方式思考某事物的过程,通过证据和过往经验得出结论或做出决策 (Wason and Johnson-Laird, 1972; Wason, 1968; Galotti, 1989; Fagin et al., 2004; McHugh and Way, 2018)。推理涉及根据现有信息进行推断、评估论点并得出逻辑结论。尽管"推理"是文献和日常生活中常用的术语,但它也是一个抽象概念,可以指代许多事物。为帮助读者更好地理解这一概念,我们总结了几种被广泛认可的推理主要类别:

Deductive reasoning. Deductive reasoning is a type of reasoning in which a conclusion is drawn based on the truth of the premises. In deductive reasoning, the conclusion must necessarily follow from the premises, meaning that if the premises are true, the conclusion must also be true. For example:

演绎推理。演绎推理是一种基于前提真实性得出结论的推理方式。在演绎推理中,结论必然由前提得出,这意味着如果前提为真,则结论也必定为真。例如:

• Premise: All mammals have kidneys.
• Premise: All whales are mammals.
• Conclusion: All whales have kidneys.

• 前提:所有哺乳动物都有肾脏。
• 前提:所有鲸鱼都是哺乳动物。
• 结论:所有鲸鱼都有肾脏。

Inductive reasoning. Inductive reasoning is a type of reasoning in which a conclusion is drawn based on observations or evidence. The conclusion is likely to be true based on the available evidence, but it is not necessarily certain. For example:

归纳推理。归纳推理是一种基于观察或证据得出结论的推理方式。根据现有证据,结论很可能为真,但并不一定确定。例如:

• Observation: Every time we see a creature with wings, it is a bird.
• Observation: We see a creature with wings.
• Conclusion: The creature is likely to be a bird.

• 观察:每次我们看到有翅膀的生物,都是鸟类。
• 观察:我们看到了一个有翅膀的生物。
• 结论:该生物很可能是一只鸟。

Abductive reasoning. Abductive reasoning is a type of reasoning in which a conclusion is drawn based on the best explanation for a given set of observations. The conclusion is the most likely explanation based on the available evidence, but it is not necessarily certain. For example:

溯因推理。溯因推理是一种根据对给定观察结果的最佳解释得出结论的推理方式。该结论是基于现有证据最可能的解释,但并不一定确定。例如:

• Observation: The car cannot start and there is a puddle of liquid under the engine.
• Conclusion: The most likely explanation is that the car has a leak in the radiator.

• 观察:汽车无法启动,发动机下方有一滩液体。
• 结论:最可能的解释是汽车的散热器 (radiator) 存在泄漏。

Other types of reasoning include analogical reasoning, which involves making comparisons between two or more things in order to make inferences or arrive at conclusions; causal reasoning, which involves identifying and understanding the causes and effects of events or phenomena; and probabilistic reasoning, which involves making decisions or arriving at conclusions based on the likelihood or probability of certain outcomes.

其他类型的推理包括类比推理(analogical reasoning),即通过比较两个或多个事物来做出推断或得出结论;因果推理(causal reasoning),即识别和理解事件或现象的因果关系;以及概率推理(probabilistic reasoning),即根据特定结果的可能性或概率做出决策或得出结论。

Formal Reasoning vs Informal Reasoning. Formal reasoning is a systematic and logical process that follows a set of rules and principles, often used in mathematics and logic. Informal reasoning is a less structured approach that relies on intuition, experience, and common sense to draw conclusions and solve problems, and is often used in everyday life. Formal reasoning is more structured and reliable, while informal reasoning is more adaptable and open-ended, but may also be less reliable. We refer the reader to Galotti (1989); Bronkhorst et al. (2020) for a detailed distinction between them.

形式推理与非形式推理。形式推理是遵循一系列规则和原则的系统化逻辑过程,常用于数学和逻辑领域。非形式推理则是一种较少结构化的方法,依赖直觉、经验和常识来得出结论和解决问题,常见于日常生活。形式推理更具结构性和可靠性,而非形式推理则更灵活开放,但可靠性可能较低。关于两者的详细区分,我们建议读者参考 Galotti (1989) 和 Bronkhorst et al. (2020)。

Reasoning in Language Models. The concept of reasoning in language models has been around for some time, but there is not a clear definition of what it entails. In the literature, the term “reasoning” is often used to refer to informal reasoning, although it is not always explicitly stated that it is informal (Cobbe et al., 2021; Wei et al., 2022b, inter alia). Different forms of reasoning may be used depending on the task, benchmark, or method being used, e.g., deductive reasoning (Cobbe et al., 2021; Creswell et al., 2022; Han et al., 2022b, inter alia), inductive reasoning (Yang et al., 2022; Misra et al., 2022, inter alia) or abductive reasoning (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022, inter alia). In this paper, we encompass various forms of reasoning, with a particular focus on “informal deductive reasoning” in large language models since it is a widely used form in which the conclusion is guaranteed to be true as long as the premises are true.

语言模型中的推理。语言模型的推理概念由来已久,但其具体内涵尚未有明确定义。在文献中,"推理"一词通常指非形式化推理 (informal reasoning) ,尽管并不总是明确说明其非形式化特性 (Cobbe et al., 2021; Wei et al., 2022b 等)。根据任务、基准或方法的不同,可能采用不同形式的推理,例如演绎推理 (deductive reasoning) (Cobbe et al., 2021; Creswell et al., 2022; Han et al., 2022b 等)、归纳推理 (inductive reasoning) (Yang et al., 2022; Misra et al., 2022 等) 或溯因推理 (abductive reasoning) (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022 等)。本文涵盖多种推理形式,尤其关注大语言模型中的"非形式化演绎推理",因为这是一种广泛使用的推理形式——只要前提为真,结论必然为真。

3 Towards Reasoning in Large Language Models

3 迈向大语言模型中的推理

Reasoning, particularly multi-step reasoning, is often seen as a weakness in language models and other NLP models (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). Recent research has suggested that reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters (Wei et al., 2022a,b; Cobbe et al., 2021). In this paper, we follow Wei et al. (2022a) in considering reasoning as an ability that is rarely present in small-scale models like GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2019), and therefore focus on techniques applicable to improving or eliciting “reasoning” in LLMs such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022).

推理,尤其是多步推理,常被视为语言模型和其他NLP模型的弱点 (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022)。最新研究表明,推理能力可能在大规模语言模型中涌现,例如参数量超过1000亿的模型 (Wei et al., 2022a,b; Cobbe et al., 2021)。本文遵循Wei et al. (2022a) 的观点,将推理视为GPT-2 (Radford et al., 2019) 和BERT (Devlin et al., 2019) 等小规模模型普遍缺失的能力,因此专注于适用于改进或激发GPT-3 (Brown et al., 2020) 和PaLM (Chowdhery et al., 2022) 等大语言模型中"推理"能力的技术。

3.1 Fully Supervised Finetuning

3.1 全监督微调

Before discussing reasoning in large language models, it is worth mentioning that there is research working on eliciting/improving reasoning in small language models through fully supervised finetuning on specific datasets. For example, Rajani et al. (2019) finetune a pretrained GPT model (Radford et al., 2018) to generate rationales that explain model predictions, using the CoS-E dataset they construct, and find that models trained with explanations perform better on commonsense question answering tasks (Talmor et al., 2019). Talmor et al. (2020) train RoBERTa (Liu et al., 2019) to perform reasoning/inference based on both implicit pre-trained knowledge and explicit free-text statements. Hendrycks et al. (2021) finetune pretrained language models to solve competition mathematics problems by generating full step-by-step solutions, though the accuracy is relatively low. Nye et al. (2022) train language models to do multi-step reasoning for program synthesis/execution by generating “scratch pads”, i.e., intermediate computations, before producing the final answers. We refer the reader to the surveys of Helwe et al. (2021) and Bhargava and Ng (2022) for more studies in this line.

在探讨大语言模型中的推理能力之前,值得提及一些研究通过全监督微调特定数据集来激发/改进小语言模型的推理能力。例如,Rajani等人 (2019) 对预训练的GPT模型 (Radford等人, 2018) 进行微调,利用构建的CoS-E数据集生成解释模型预测的推理依据,并发现带有解释训练的模型在常识问答任务 (Talmor等人, 2019) 上表现更优。Talmor等人 (2020) 训练RoBERTa (Liu等人, 2019) 基于隐式预训练知识和显式自由文本陈述进行推理/推断。Hendrycks等人 (2021) 通过生成完整分步解决方案微调预训练语言模型以解决竞赛数学问题,尽管准确率相对较低。Nye等人 (2022) 训练语言模型通过生成"草稿纸"(即中间计算步骤)来实现程序合成/执行的多步推理,再输出最终答案。更多相关研究可参阅Helwe等人 (2021) 以及Bhargava和Ng (2022) 的综述。

There are two major limitations of fully supervised finetuning. First, it requires a dataset containing explicit reasoning, which can be difficult and time-consuming to create. Additionally, the model is only trained on a specific dataset, which limits its application to a specific domain and may result in the model relying on artifacts in the training data rather than actual reasoning to make predictions.

完全监督微调存在两大主要限制。首先,它需要包含显式推理的数据集,这类数据集的创建可能既困难又耗时。此外,模型仅在特定数据集上训练,这会限制其应用范围至特定领域,并可能导致模型依赖训练数据中的伪影而非实际推理来进行预测。

3.2 Prompting & In-Context Learning

3.2 提示与上下文学习

Large language models such as GPT-3 (Brown et al., 2020) have demonstrated remarkable few-shot performance across a variety of tasks through in-context learning. These models can be prompted with a question and a few ⟨input, output⟩ exemplars to potentially solve a problem through “reasoning”, either implicitly or explicitly. However, research has shown that these models still fall short when it comes to tasks that require multiple steps of reasoning to solve (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). This may be due to a lack of exploration into the full capabilities of these models, as recent studies have suggested.

诸如GPT-3 (Brown等人,2020) 这样的大语言模型 (Large Language Model) 已通过上下文学习在各种任务中展现出卓越的少样本 (few-shot) 性能。这些模型可以通过输入一个问题及少量⟨输入,输出⟩示例,以隐式或显式"推理"方式潜在解决问题。然而研究表明 (Bommasani等人,2021;Rae等人,2021;Valmeekam等人,2022),当面对需要多步推理的任务时,这些模型仍存在不足。最新研究指出,这可能是由于对这些模型全部能力的探索仍不充分所致。

3.2.1 Chain of Thought and Its Variants

3.2.1 思维链及其变体

To encourage LLMs to engage in reasoning rather than simply providing answers directly, we may guide LLMs to generate “reasoning” explicitly. One approach for doing this is chain-of-thought prompting, proposed by Wei et al. (2022b). This approach involves providing a few examples of “chain of thought” (CoT), which are intermediate natural language reasoning steps, in the prompt to LLMs (Figure 2). Specifically, in CoT prompting, ⟨input, output⟩ demonstrations are replaced with ⟨input, chain of thought, output⟩ triples, e.g., “[input] Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? [chain of thought] Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. $5+6=11$ . [output] The answer is 11.” In this way, given a target question, the model learns to generate explicit rationale before producing the final answer. Experimental results show that this simple idea can improve LLMs’ few-shot performance on arithmetic, symbolic, and commonsense reasoning tasks, sometimes to a striking degree.

为了鼓励大语言模型进行推理而非直接给出答案,我们可以引导模型显式生成"推理过程"。Wei等人(2022b)提出的思维链提示(chain-of-thought prompting)就是其中一种方法。这种方法通过在提示词中提供少量"思维链"(CoT)示例——即中间的自然语言推理步骤(图2)。具体而言,在思维链提示中,输入-输出演示被替换为⟨输入,思维链,输出⟩三元组,例如:"[输入]Roger有5个网球。他又买了2罐网球,每罐有3个网球。他现在一共有多少个网球?[思维链]Roger一开始有5个球。2罐每罐3个网球就是6个网球。$5+6=11$。[输出]答案是11。"通过这种方式,当给定目标问题时,模型学会在生成最终答案前先输出明确的推理依据。实验结果表明,这个简单想法能提升大语言模型在算术、符号和常识推理任务中的少样本表现,有时提升幅度非常显著。
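To make the format concrete, the prompt construction described above can be sketched in a few lines of Python; the helper name and formatting choices are illustrative rather than any standard API, and only the Roger exemplar comes from the paper.

为使该格式更直观,上述提示构造过程可用几行Python勾勒;其中函数名与排版方式仅为示意,并非某个标准API,只有Roger示例取自原文。

```python
# Sketch: build a few-shot chain-of-thought prompt from
# <input, chain of thought, output> triples (format from Wei et al., 2022b).
def build_cot_prompt(exemplars, question):
    parts = []
    for inp, cot, out in exemplars:
        # Each exemplar shows the rationale before the final answer.
        parts.append(f"Q: {inp}\nA: {cot} The answer is {out}.")
    parts.append(f"Q: {question}\nA:")  # the LLM continues from here
    return "\n\n".join(parts)

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11.",
    "11",
)]
prompt = build_cot_prompt(
    exemplars,
    "If there are 3 cars and each car has 4 wheels, how many wheels are there?")
```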


Figure 2: An illustration of Chain-of-Thought Prompting and Rationale Engineering, where asterisk (*) denotes the target problem to be solved.


图 2: 思维链提示 (Chain-of-Thought Prompting) 与原理工程 (Rationale Engineering) 的示意图,其中星号 (*) 表示待解决的目标问题。

There are several variants of chain-of-thought prompting that have been proposed in the literature, in a different form or to solve a specific problem.

文献中提出了几种思维链提示的变体,它们以不同形式出现或用于解决特定问题。

Different Form: Kojima et al. (2022) introduce Zero-shot-CoT, in which LLMs are simply prompted with the phrase “Let’s think step by step” after the input, in order to elicit reasoning without the need for few-shot demonstrations. Madaan et al. (2022); Gao et al. (2022); Chen et al. (2022) find that LLMs trained with code, e.g., Codex (Chen et al., 2021), can achieve better performance on reasoning tasks by framing reasoning as code generation. Wang et al. (2022a) propose to iteratively prompt chain of thought. He et al. (2023) attempt to retrieve external knowledge in CoT to improve faithfulness of reasoning.

不同形式:Kojima等人(2022)提出了Zero-shot-CoT方法,该方法只需在输入后添加"让我们逐步思考"的提示语,就能引导大语言模型进行推理而无需少样本演示。Madaan等人(2022)、Gao等人(2022)和Chen等人(2022)发现,经过代码训练的大语言模型(如Codex(Chen等人,2021))通过将推理任务转化为代码生成任务可以获得更好的性能。Wang等人(2022a)提出了迭代式提示思维链的方法。He等人(2023)尝试在思维链中检索外部知识以提高推理的可信度。
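A minimal sketch of the two-stage Zero-shot-CoT pipeline, assuming a stand-in `fake_llm` in place of a real model call:

以下是两阶段Zero-shot-CoT流程的最小示意,用假设的 `fake_llm` 代替真实模型调用:

```python
# Sketch of two-stage Zero-shot-CoT (Kojima et al., 2022). `fake_llm`
# is a hard-coded stand-in for a real LLM API call, used only to make
# the control flow concrete and runnable.
def fake_llm(prompt):
    if prompt.endswith("Let's think step by step."):
        # Stage 1 would sample a free-form rationale from the model.
        return "2 cans of 3 balls is 6 balls. 5 + 6 = 11."
    return "11"  # stage 2: the model completes "the answer is ..."

def zero_shot_cot(question, llm=fake_llm):
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = llm(reasoning_prompt)          # 1st pass: elicit reasoning
    answer_prompt = (f"{reasoning_prompt} {rationale}\n"
                     "Therefore, the answer is")
    return rationale, llm(answer_prompt)       # 2nd pass: extract the answer

rationale, answer = zero_shot_cot(
    "Roger has 5 balls and buys 2 cans of 3 balls each. How many balls?")
```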

Specific Problem/Setting: Before chain of thought, Nye et al. (2022) also try to use intermediate computations, named “scratch pads”, to improve language models’ reasoning performance in both finetuning and few-shot regimes, with a particular focus on programs. Shi et al. (2022) attempt to solve multilingual reasoning tasks with CoT in the native language, CoT in English (regardless of the problem language), and CoT in English (with the problem translated to English). Chen (2022) apply CoT to table-based reasoning, finding that LLMs can achieve strong performance on table tasks with only one exemplar. Prystawski et al. (2022) demonstrate that CoT can improve LLMs’ performance on paraphrase selection for metaphors. Lu et al. (2022) apply chain of thought to solve multimodal science questions.

特定问题/场景:在思维链 (chain of thought) 提出前,Nye等人 (2022) 也尝试使用名为"草稿纸 (scratch pads)"的中间计算步骤来提升语言模型在微调 (finetuning) 和少样本 (few-shot) 场景下的推理性能,尤其关注程序推理任务。Shi等人 (2022) 尝试用三种方式解决多语言推理任务:使用问题原生语言的思维链、使用英语思维链 (无论问题使用何种语言) ,以及将问题翻译成英语后使用英语思维链。Chen (2022) 将思维链应用于基于表格的推理,发现大语言模型 (LLM) 仅需一个示例就能在表格任务上取得强劲表现。Prystawski等人 (2022) 证明思维链可以提升大语言模型在隐喻复述选择任务中的表现。Lu等人 (2022) 应用思维链解决多模态科学问题。

3.2.2 Rationale Engineering

3.2.2 原理工程

The original version of chain-of-thought prompting, proposed by Wei et al. (2022b), relies on manually crafted examples of intermediate reasoning steps and applies greedy decoding in the generation. Rationale engineering aims to more effectively elicit or utilize reasoning in LLMs. This can be achieved through rationale refinement, which involves creating more effective examples of reasoning steps, or through rationale exploration and rationale verification, which involve exploring and verifying the rationales produced by LLMs. A summary of rationale engineering is illustrated in Figure 2.

思维链提示(chain-of-thought prompting)的原始版本由Wei等人(2022b)提出,依赖于人工构建的中间推理步骤示例,并在生成过程中采用贪婪解码策略。原理工程(rationale engineering)旨在更有效地激发或利用大语言模型中的推理能力。这可以通过原理优化(rationale refinement)实现,即创建更有效的推理步骤示例;也可以通过原理探索(rationale exploration)和原理验证(rationale verification)实现,即探索和验证大语言模型生成的推理过程。图2展示了原理工程的总体框架。

Rationale refinement. The choice of exemplars can significantly affect the few-shot performance of LLMs, as demonstrated by Liu et al. (2022b), and this effect also holds in chain-of-thought prompting. Rationale refinement aims to create and refine rationale examples that are better able to elicit reasoning in LLMs. Fu et al. (2022b) propose complexity-based prompting to create rationales with more reasoning steps. Their experiments show that the performance of LLMs improves with the increased rationale complexity. Similarly, Zhou et al. (2022c) propose algorithmic prompting, which suggests that providing more thorough examples of solutions can help improve reasoning performance on some simple math calculations. Zhang et al. (2022b) design Auto-CoT to automatically construct exemplars by partitioning questions from a given dataset into clusters and then using Zero-shot-CoT (Kojima et al., 2022) to generate the rationale for a representative question from each cluster. The analysis shows that making exemplars diverse is important in prompting LLMs to produce better rationales.

原理精炼。范例的选择会显著影响大语言模型的少样本表现,如Liu等人(2022b)的研究所示,该现象在思维链提示中也存在。原理精炼旨在创建和完善能更好激发大语言模型推理能力的原理示例。Fu等人(2022b)提出基于复杂度的提示方法,通过增加推理步骤来构建原理。其实验表明,大语言模型的表现会随原理复杂度的提升而改善。类似地,Zhou等人(2022c)提出的算法提示指出,提供更详尽的解题示例有助于提升简单数学计算的推理性能。Zhang等人(2022b)设计的Auto-CoT能自动构建范例:先将给定数据集的问题聚类,再使用零样本思维链(Kojima等人,2022)为每类代表性问题生成原理。分析表明,提升范例多样性对促使大语言模型产生更优原理至关重要。
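A toy sketch of complexity-based exemplar selection; counting newline-separated lines is one simple proxy for the number of reasoning steps, and the candidate pool below is invented for illustration.

以下是基于复杂度的范例选择的玩具示意;按换行分隔的行数只是推理步数的一种简单近似,候选池为虚构示例。

```python
# Sketch of complexity-based exemplar selection in the spirit of
# Fu et al. (2022b): prefer exemplar rationales with more reasoning steps.
def num_steps(rationale):
    # Proxy: one reasoning step per non-empty line of the rationale.
    return sum(1 for line in rationale.split("\n") if line.strip())

pool = [
    "5 + 6 = 11.",
    "2 cans of 3 balls is 6.\n5 + 6 = 11.",
    "Roger starts with 5 balls.\n2 cans of 3 balls is 6.\n5 + 6 = 11.",
]
# Keep the two most complex rationales as prompt exemplars.
chosen = sorted(pool, key=num_steps, reverse=True)[:2]
```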

Rationale exploration. In addition to providing better exemplars, we can allow LLMs to fully explore various ways of reasoning to improve their performance on reasoning tasks, named rationale exploration. Based on the idea that complex problems often admit multiple ways of thinking that can lead to their unique correct answer, Wang et al. (2022c) present a decoding strategy called self-consistency to improve upon the traditional greedy decoding used in chain-of-thought prompting. This strategy involves sampling a diverse set of rationales, rather than just the greedy one, and selecting the most consistent answer by marginalizing out the sampled rationales. The idea is also used in Fu et al. (2022b) to vote over the top complex rationales. To further improve performance, Li et al. (2022b) suggest providing different demonstrations for each question by sampling exemplars from an exemplar base, in order to increase the diversity of the sampled rationales.

原理探索。除了提供更好的示例外,我们可以让大语言模型充分探索各种推理方式以提升其在推理任务中的表现,这种方法被称为原理探索。基于"复杂问题通常存在多种思维方式都能得出唯一正确答案"的理念,Wang等人(2022c)提出了一种名为自洽性的解码策略,改进了思维链提示中使用的传统贪心解码。该策略通过采样一组多样化的推理路径(而非仅采用贪心路径),并通过边缘化采样到的推理路径来选择最一致的答案。Fu等人(2022b)同样运用了这一理念,对顶级复杂推理路径进行投票选择。为进一步提升性能,Li等人(2022b)建议通过从示例库中采样,为每个问题提供不同的演示示例,从而增加采样推理路径的多样性。
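The voting step of self-consistency can be sketched as follows, with a toy `sample_answer` standing in for one sampled decode from the model (its answer distribution is invented for illustration):

自洽性的投票步骤可如下勾勒,其中玩具函数 `sample_answer` 代替模型的一次采样解码(其答案分布为虚构示例):

```python
import random
from collections import Counter

# Sketch of self-consistency decoding (Wang et al., 2022c): sample many
# rationales at nonzero temperature and majority-vote over their final
# answers. Here most sampled reasoning paths, but not all, reach the
# correct answer "11".
def sample_answer(rng):
    return rng.choices(["11", "10"], weights=[0.9, 0.1])[0]

def self_consistency(n_samples=25, seed=0):
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n_samples)]
    # Marginalize out the rationales: keep the most frequent answer.
    return Counter(answers).most_common(1)[0][0]
```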

Rationale verification. Ensuring that the rationales produced by LLMs are valid is critical, as incorrect rationales can lead to incorrect final predictions (Ye and Durrett, 2022). To address this issue, the process of rationale verification aims to verify whether the rationales produced by LLMs lead to the correct final answers. Cobbe et al. (2021) propose augmenting LLMs with a trained verifier that assigns a score to each rationale and solution generated by the LLM, selecting the highest-ranked solution as the final answer when solving math word problems. Li et al. (2022b) also use this technique to guide rationale selection, in conjunction with the process of rationale exploration. Different from the above methods that train an external verifier to verify the rationales, Weng et al. (2022) suggest using LLMs themselves as the verifiers.

理由验证。确保大语言模型(LLM)生成的推理过程有效至关重要,因为错误的推理会导致最终预测错误 (Ye and Durrett, 2022)。为解决这一问题,理由验证流程旨在检验大语言模型生成的推理是否能得出正确最终答案。Cobbe等人(2021)提出通过训练验证器来增强大语言模型,该验证器会对模型生成的每个推理步骤和解决方案进行评分,在解决数学应用题时选择评分最高的方案作为最终答案。Li等人(2022b)同样采用该技术指导推理选择,并与推理探索流程结合使用。与上述训练外部验证器的方法不同,Weng等人(2022)建议直接使用大语言模型自身作为验证器。
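A toy sketch of verifier-guided selection; the real verifier in Cobbe et al. (2021) is a trained model, while the scoring rule here is only an illustrative consistency check, and the candidates are made up.

以下是验证器引导选择的玩具示意;Cobbe等人(2021)的真实验证器是训练得到的模型,此处的打分规则仅为示意性的自洽检查,候选项为虚构。

```python
# Sketch of rationale verification: score each candidate
# (rationale, answer) pair with a verifier and keep the top-ranked one.
def toy_verifier_score(rationale, answer):
    # Toy rule: reward rationales whose last computed value matches
    # the stated answer. A real verifier is a trained scoring model.
    return 1.0 if rationale.rstrip(".").endswith(f"= {answer}") else 0.0

candidates = [
    ("2 cans of 3 balls is 6. 5 + 6 = 11.", "11"),
    ("2 cans of 3 balls is 5. 5 + 5 = 10.", "11"),  # inconsistent rationale
]
best_rationale, best_answer = max(
    candidates, key=lambda pair: toy_verifier_score(*pair))
```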

3.2.3 Problem Decomposition

3.2.3 问题分解

Chain-of-thought prompting, while effective for eliciting reasoning in LLMs, can struggle with complex tasks, e.g., tasks that require compositional generalization (Lake and Baroni, 2018; Keysers et al., 2020). To solve a complex problem, it is helpful to first break it down into smaller, more manageable subproblems. By solving each of these subproblems, we can effectively solve the complex problem. This technique is called problem decomposition or divide and conquer (Talmor and Berant, 2018; Min et al., 2019; Perez et al., 2020).

思维链提示 (Chain-of-thought prompting) 虽然能有效激发大语言模型的推理能力,但在处理复杂任务(例如需要组合泛化的任务)时仍存在困难 (Lake and Baroni, 2018; Keysers et al., 2020)。解决复杂问题时,先将其分解为更小、更易处理的子问题往往很有帮助。通过逐个解决这些子问题,我们就能有效攻克复杂问题。这种技术被称为问题分解 (problem decomposition) 或分治法 (divide and conquer) (Talmor and Berant, 2018; Min et al., 2019; Perez et al., 2020)。

Based on this idea, Zhou et al. (2022a) propose least-to-most prompting, which consists of two steps: decomposing the complex problem into subproblems and solving these subproblems in a specific order, with each subproblem being facilitated by the answers obtained from previously solved subproblems. As follow-up work, Drozdov et al. (2022) introduce dynamic least-to-most prompting, which is designed to solve more realistic semantic parsing problems by decomposing the problems with prompting-based syntactic parsing and dynamically selecting exemplars based on the decomposition. In addition, Khot et al. (2022) design decomposed prompting, which breaks down a complex problem into subproblems that can be handled by a shared library of prompting-based LLMs, each specialized in a particular subproblem. Furthermore, Dua et al. (2022) develop successive prompting, which iteratively decomposes a complex problem into a simple problem, with the next subproblem prediction having access to the answers to the previous subproblems. While the above methods decompose or solve compositional questions with multiple forward passes, Press et al. (2022) suggest decomposing and solving the input question in one forward pass using CoT prompting. Overall, these techniques show promise for helping LLMs to solve complex tasks by decomposing the problem into more manageable subproblems.

基于这一思路,Zhou等人 (2022a) 提出了最小到最多提示法 (least-to-most prompting),该方法包含两个步骤:将复杂问题分解为子问题,并按特定顺序解决这些子问题,其中每个子问题的解决都得益于先前已解子问题的答案。作为后续工作,Drozdov等人 (2022) 提出了动态最小到最多提示法 (dynamic least-to-most prompting),旨在通过基于提示的句法解析来分解问题,并根据分解结果动态选择示例,从而解决更现实的语义解析问题。此外,Khot等人 (2022) 设计了分解提示法 (decomposed prompting),该方法将复杂问题分解为可由基于提示的大语言模型共享库处理的子问题,每个模型专门处理特定的子问题。更进一步,Dua等人 (2022) 开发了连续提示法 (successive prompting),该方法迭代地将复杂问题分解为简单问题,并在预测下一个子问题时能够访问先前子问题的答案。虽然上述方法通过多次前向传递来分解或解决组合问题,但Press等人 (2022) 建议使用思维链提示 (CoT prompting) 在一次前向传递中分解并解决输入问题。总体而言,这些技术通过将问题分解为更易管理的子问题,展现出了帮助大语言模型解决复杂任务的潜力。
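The two steps of least-to-most prompting can be sketched as follows; both `decompose` and `solve_subproblem` are hard-coded stand-ins for LLM calls, included only to make the control flow runnable.

最小到最多提示法的两个步骤可如下勾勒;`decompose` 与 `solve_subproblem` 均为硬编码的大模型调用替身,仅用于展示控制流程。

```python
# Sketch of least-to-most prompting (Zhou et al., 2022a): decompose a
# problem into subproblems, then solve them in order, appending each
# solved subproblem to the context used for the next one.
def decompose(problem):
    # A real system would prompt an LLM to produce this decomposition.
    return [
        "How many balls are in 2 cans of 3 balls each?",
        "Starting from 5 balls, how many balls after adding those?",
    ]

def solve_subproblem(subproblem, context):
    # A real system would prompt the LLM with the solved context + subproblem.
    if subproblem.startswith("How many balls are in 2 cans"):
        return "6"
    previous_answer = int(context[-1][1])  # reuse the earlier answer
    return str(5 + previous_answer)

def least_to_most(problem):
    context = []
    for sub in decompose(problem):
        context.append((sub, solve_subproblem(sub, context)))
    return context[-1][1]  # answer to the final subproblem = final answer
```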

3.2.4 Others

3.2.4 其他

There are other techniques that have been developed to facilitate reasoning in LLMs for specific tasks or settings. For instance, Creswell et al. (2022); Creswell and Shanahan (2022) introduce a selection-inference framework that uses LLMs as modules to select and infer reasoning steps from a set of facts that culminate in the final answer. Kazemi et al. (2022) suggest using backward chaining, i.e., from goal to the set of facts that support it, instead of forward chaining like Creswell et al. (2022); Creswell and Shanahan (2022). In addition, Jung et al. (2022) propose a method for solving binary questions by prompting LLMs abductively and recursively to rationalize each option. Zhou et al. (2022b) design a technique for performing numerical reasoning on complex numbers by replacing the complex numbers with simple numbers to produce simpler expressions, and then using these expressions to perform calculations on the complex numbers. There are also efforts to distill reasoning from LLMs into smaller models, such as the work by Li et al. (2022a); Shridhar et al. (2022); Magister et al. (2022). Finally, we refer the reader to Dohan et al. (2022)’s position paper on language model cascade, which presents a unifying framework for understanding chain-of-thought prompting and research in this line.

为促进大语言模型在特定任务或场景中的推理能力,已发展出多种技术。例如,Creswell等人(2022)和Creswell与Shanahan(2022)提出了选择-推理框架,将大语言模型作为模块,从一系列事实中选择并推断出最终答案的推理步骤。Kazemi等人(2022)建议采用反向链式推理(即从目标回溯支持事实),而非Creswell等人(2022)和Creswell与Shanahan(2022)采用的正向链式推理。此外,Jung等人(2022)提出通过溯因递归提示大语言模型来合理化每个选项,从而解决二元问题的方法。Zhou等人(2022b)设计了一种对复数进行数值推理的技术,通过用简单数字替换复数生成更简单的表达式,再利用这些表达式对复数进行计算。也有研究致力于将大语言模型的推理能力蒸馏到更小模型中,如Li等人(2022a)、Shridhar等人(2022)和Magister等人(2022)的工作。最后,我们推荐读者参阅Dohan等人(2022)关于语言模型级联的立场论文,该文提出了理解思维链提示及相关研究的统一框架。

3.3 Hybrid Method

3.3 混合方法

While “prompting” techniques can help elicit or better utilize reasoning in large language models to solve reasoning tasks, they do not actually improve the reasoning capabilities of the LLMs themselves, as the parameters of the models remain unchanged. In contrast, the “hybrid approach” aims to simultaneously improve the reasoning capabilities of LLMs and make better use of these models in order to solve complex problems. This approach involves both enhancing the reasoning capabilities of the LLMs and using techniques such as prompting to effectively utilize these capabilities.

虽然"提示"(prompting)技术可以帮助激发或更好地利用大语言模型(LLM)的推理能力来解决推理任务,但这些技术实际上并未提升大语言模型本身的推理能力,因为模型的参数保持不变。相比之下,"混合方法"(hybrid approach)旨在同时提升大语言模型的推理能力,并更有效地利用这些模型来解决复杂问题。该方法既包括增强大语言模型的推理能力,也涉及使用提示等技术来有效利用这些能力。

3.3.1 Reasoning-Enhanced Training and Prompting

3.3.1 推理增强训练与提示

One approach to improving the reasoning capabilities of LLMs is to pretrain or finetune the models on datasets that include “reasoning”. Lewkowycz et al. (2022); Taylor et al. (2022) find that LLMs trained on datasets containing scientific and mathematical data can achieve better performance on reasoning tasks like quantitative reasoning problems when using CoT prompting. Pi et al. (2022) show that continually pretraining with SQL data can boost the performance of language models, e.g., T5 (Raffel et al., 2020), on natural language reasoning such as numerical reasoning and logical reasoning. Furthermore, Chung et al. (2022) develop Flan models by finetuning PaLM (Chowdhery et al., 2022) and T5 (Raffel et al., 2020) with 1.8k finetuning tasks, including CoT data, and find that CoT data are critical to keeping reasoning abilities. Similarly, Yu et al. (2022) finetune OPT (Zhang et al., 2022a) on 10 reasoning datasets and observe that it can improve some reasoning capabilities of LLMs. Anil et al. (2022) study the length generalization abilities of LLMs, i.e., whether LLMs learned with short problem instances can generalize to long ones. They discover that the combination of few-shot scratchpad (or chain of thought) finetuning and scratchpad prompting results in a significant improvement in LLMs’ ability to generalize to longer problems, while this phenomenon is not observed in the standard fully supervised finetuning paradigm.

提升大语言模型(LLM)推理能力的一种方法是在包含"推理"的数据集上进行预训练或微调。Lewkowycz等人(2022)和Taylor等人(2022)发现,在包含科学和数学数据的数据集上训练的大语言模型,当使用思维链(CoT)提示时,能够在定量推理等任务上获得更好的表现。Pi等人(2022)表明,持续使用SQL数据进行预训练可以提升语言模型(如T5 [Raffel等人, 2020])在自然语言推理(如数值推理和逻辑推理)方面的性能。此外,Chung等人(2022)通过使用1.8k个微调任务(包括CoT数据)对PaLM(Chowdhery等人, 2022)和T5(Raffel等人, 2020)进行微调,开发了Flan模型,并发现CoT数据对于保持推理能力至关重要。类似地,Yu等人(2022)在10个推理数据集上对OPT(Zhang等人, 2022a)进行微调,观察到这可以提升大语言模型的某些推理能力。Anil等人(2022)研究了大语言模型的长度泛化能力,即用短问题实例学习的大语言模型是否能泛化到长实例。他们发现,少样本草稿(或思维链)微调与草稿提示 (scratchpad prompting) 相结合,能显著提升大语言模型对更长问题的泛化能力,而这种现象在标准全监督微调范式中未被观察到。

3.3.2 Bootstrapping & Self-Improving

3.3.2 自助法 (Bootstrapping) 与自我改进 (Self-Improving)

Instead of finetuning LLMs on pre-built datasets that include reasoning, there are studies that have explored the idea of using LLMs to self-improve their reasoning abilities through a process known as bootstrapping. One example of this is the Self-Taught Reasoner (STaR) introduced by Zelikman et al. (2022), in which an LLM is trained and refined on its own output iteratively. Specifically, with CoT prompting, the model first generates initial rationales. Then, the model is finetuned on rationales that lead to correct answers. This process can be repeated, with each iteration resulting in an improved model that can generate better training data, which in turn leads to further improvements. As a follow-up to this work, Huang et al. (2022a) show that LLMs are able to self-improve their reasoning abilities without the need for supervised data by leveraging the self-consistency of reasoning (Wang et al., 2022c).

与在包含推理的预构建数据集上微调大语言模型不同,有研究探索了通过自举(bootstrapping)过程让大语言模型自我提升推理能力的方法。Zelikman等人(2022)提出的STaR(SelfTaught Reasoner)就是其中一例,该方法通过迭代训练大语言模型并优化其自身输出。具体而言,通过思维链(CoT)提示,模型首先生成初始推理过程,随后针对那些得出正确答案的推理过程进行微调。这一过程可重复进行,每次迭代都能产生能生成更优质训练数据的改进模型,从而形成持续优化。作为后续研究,Huang等人(2022a)证明大语言模型无需监督数据,仅需利用推理的自洽性(Wang等人,2022c)即可实现推理能力的自我提升。
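The STaR-style loop described above can be sketched in a few lines. Everything here is schematic: `model.generate` and `model.finetune` are hypothetical stand-ins for sampling a CoT rationale plus final answer and for a finetuning step, neither of which is specified by the survey.

```python
def star_bootstrap(model, problems, n_iters=3):
    """Schematic STaR loop: sample a rationale per problem, keep only
    rationales whose final answer matches the gold label, finetune on
    the kept (question, rationale, answer) triples, and repeat."""
    for _ in range(n_iters):
        kept = []
        for question, gold in problems:
            rationale, answer = model.generate(question)  # CoT sampling
            if answer == gold:  # filter by final-answer correctness only
                kept.append((question, rationale, gold))
        if kept:  # each round trains on its own filtered generations
            model = model.finetune(kept)
    return model
```

The original method additionally uses "rationalization" (re-sampling with the gold answer given as a hint) for problems the model gets wrong; that refinement is omitted in this sketch.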

4 Measuring Reasoning in Large Language Models

4 大语言模型中的推理能力测量

We summarize methods and benchmarks for evaluating reasoning abilities of LLMs in this section.

本节总结了大语言模型 (LLM) 推理能力的评估方法和基准。

4.1 End Task Performance

4.1 终端任务性能

One way to measure reasoning abilities of LLMs is to report their performance, e.g., accuracy, on end tasks that require reasoning. We list some common benchmarks as follows.

衡量大语言模型推理能力的一种方法是报告其在需要推理的终端任务上的表现,例如准确率。以下列举了一些常见基准测试:

Arithmetic Reasoning. Arithmetic reasoning is the ability to understand and apply mathematical concepts and principles in order to solve problems involving arithmetic operations. This involves using logical thinking and mathematical principles to determine the correct course of action when solving mathematical problems. Representative benchmarks for arithmetic reasoning include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MathQA (Amini et al., 2019), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), AQuA (Ling et al., 2017), and MAWPS (Roy and Roth, 2015). It is worth mentioning that Anil et al. (2022) generate the Parity Datasets and the Boolean Variable Assignment Dataset for analyzing the length generalization capabilities of LLMs (§3.3.1).

算术推理。算术推理是指理解和应用数学概念及原理以解决涉及算术运算问题的能力。这需要运用逻辑思维和数学原理来确定解决数学问题的正确方法。代表性的算术推理基准包括 GSM8K (Cobbe et al., 2021)、MATH (Hendrycks et al., 2021)、MathQA (Amini et al., 2019)、SVAMP (Patel et al., 2021)、ASDiv (Miao et al., 2020)、AQuA (Ling et al., 2017) 和 MAWPS (Roy and Roth, 2015)。值得一提的是,Anil et al. (2022) 生成了奇偶校验数据集 (Parity Datasets) 和布尔变量赋值数据集 (Boolean Variable Assignment Dataset) 用于分析大语言模型的长度泛化能力 (§3.3.1)。
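Reporting accuracy on arithmetic benchmarks like these typically requires extracting a final numeric answer from the model's free-form generation. A common, admittedly crude heuristic, sketched below as an illustration rather than any benchmark's official scorer, is to take the last number in the output:

```python
import re

def extract_final_number(generation):
    """Return the last number in the model's output, or None.
    A crude but widely used heuristic for arithmetic benchmarks."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", generation)
    return nums[-1] if nums else None

def accuracy(generations, golds):
    """Fraction of generations whose extracted answer matches gold."""
    hits = sum(extract_final_number(g) == str(a)
               for g, a in zip(generations, golds))
    return hits / len(golds)
```

This extraction step matters in practice: a correct rationale followed by a stray trailing number is scored as wrong, which is one reason reported accuracies only approximate reasoning ability.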

Commonsense Reasoning. Commonsense reasoning is the use of everyday knowledge and understanding to make judgments and predictions about new situations. It is a fundamental aspect of human intelligence that enables us to navigate our environment, understand others, and make decisions with incomplete information. Benchmarks that can be used for testing commonsense reasoning abilities of LLMs include CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and ARC (Clark et al., 2018). We refer the reader to Bhargava and Ng (2022)'s survey for more work in this domain.

常识推理 (Commonsense Reasoning)。常识推理是利用日常知识和理解对新情况做出判断和预测的能力。这是人类智能的基础方面,使我们能够在信息不完整的情况下适应环境、理解他人并做出决策。可用于测试大语言模型常识推理能力的基准包括CSQA (Talmor et al., 2019)、StrategyQA (Geva et al., 2021)和ARC (Clark et al., 2018)。更多该领域研究可参阅Bhargava和Ng (2022)的综述。

Symbolic Reasoning. Symbolic reasoning is a form of reasoning that involves the manipulation of symbols according to formal rules. In symbolic reasoning, we use abstract symbols to represent concepts and relationships, and then manipulate those symbols according to precise rules in order to draw conclusions or solve problems. Two benchmarks of symbolic reasoning are presented in Wei et al. (2022b), including Last Letter Concatenation and Coin Flip.

符号推理。符号推理是一种通过形式规则操纵符号的推理方式。在符号推理中,我们使用抽象符号表示概念和关系,然后根据精确规则操作这些符号以得出结论或解决问题。Wei等人 (2022b) 提出了两个符号推理基准任务:末字母拼接和抛硬币模拟。
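Both tasks have trivially checkable ground truth, which is what makes them useful probes. A minimal sketch of the two reference solutions follows; this is our rendering of the task definitions, not Wei et al.'s code:

```python
def last_letter_concat(words):
    """Last Letter Concatenation: the answer is the concatenation of
    the final letter of each word, e.g. ["Elon", "Musk"] -> "nk"."""
    return "".join(w[-1] for w in words)

def coin_flip_heads_up(starts_heads_up, flip_actions):
    """Coin Flip: each True in flip_actions flips the coin once; when
    the coin starts heads up, it ends heads up iff the flip count is
    even (and conversely when it starts tails up)."""
    return starts_heads_up == (sum(flip_actions) % 2 == 0)
```

Because the ground truth is computed by a two-line program, arbitrarily long out-of-distribution instances can be generated to test generalization.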

Others. In practice, there are many benchmarks that can be used to evaluate reasoning abilities of LLMs (indirectly), as long as the downstream task involves reasoning. BIG-bench (Srivastava et al., 2022), for example, includes over 200 tasks that test a range of reasoning skills, including tasks like Date Understanding, Word Sorting, and Causal Judgement. Other benchmarks, such as SCAN (Lake and Baroni, 2018) and the one proposed by Anil et al. (2022), focus on evaluating generalization ability. LLMs can also be tested on their table reasoning abilities using benchmarks such as WikiTableQA (Pasupat and Liang, 2015) and FetaQA (Nan et al., 2022), as suggested by Chen (2022). In addition, there are benchmarks for evaluating LLMs’ generative relational reasoning abilities, such as CommonGen (Lin et al., 2020; Liu et al., 2022a) and Open Relation Modeling (Huang et al., 2022b,d).

其他。实际上,只要下游任务涉及推理,就有许多基准可用于(间接)评估大语言模型的推理能力。例如,BIG-bench (Srivastava et al., 2022) 包含200多项测试各类推理技能的任务,如日期理解、单词排序和因果判断等。其他基准如SCAN (Lake and Baroni, 2018) 和Anil等人 (2022) 提出的基准,则侧重于评估泛化能力。根据Chen (2022) 的建议,大语言模型还可以在WikiTable QA (Pasupat and Liang, 2015)、FetaQA (Nan et al., 2022) 等基准上测试表格推理能力。此外,还有评估大语言模型生成式关系推理能力的基准,如CommonGen (Lin et al., 2020; Liu et al., 2022a) 和Open Relation Modeling (Huang et al., 2022b,d)。

4.2 Analysis on Reasoning

4.2 推理分析

Although LLMs have demonstrated impressive performance on various reasoning tasks, the extent to which their predictions are based on true reasoning or simple heuristics is not always clear. This is because most existing evaluations focus on their accuracy on end tasks, rather than directly assessing their reasoning steps. While some error analysis has been conducted on the generated rationales of LLMs (Wei et al., 2022b; Kojima et al., 2022, inter alia), this analysis has often been limited in depth.

尽管大语言模型(LLM)在各种推理任务中展现出卓越性能,但其预测究竟基于真实推理还是简单启发式方法仍不明确。这是因为现有评估大多关注最终任务准确率,而非直接检验推理步骤。虽然已有研究对大语言模型生成的推理过程进行错误分析(Wei et al., 2022b; Kojima et al., 2022等),但这些分析往往缺乏深度。

There have been some efforts to develop metrics and benchmarks that enable a more formal/deep analysis of reasoning in LLMs. Golovneva et al. (2022) design ROSCOE, a set of interpretable, detailed step-by-step evaluation metrics covering various perspectives including semantic alignment, logical inference, semantic similarity, and language coherence. Saparov and He (2022) create a synthetic dataset called PrOntoQA that is generated from real or fictional ontologies. Each example in the dataset has a unique proof, which can be converted to simple sentences and back again, allowing for a formal analysis of each reasoning step. Han et al. (2022a) introduce a dataset called FOLIO to test the first-order logic reasoning capabilities of LLMs. FOLIO contains first-order logic reasoning problems that require models to determine the correctness of conclusions given a set of premises. In addition, Wang et al. (2022b) conduct ablation experiments on CoT and find that LLMs may also perform reasoning while prompting with invalid rationales. Their study also suggests that being relevant to the query and correctly ordering the reasoning steps are important for CoT prompting.

为对大语言模型(LLM)的推理能力进行更正式/深入的分析,学界已开展了一些指标与基准的开发工作。Golovneva等人(2022)设计了ROSCOE评估体系,这套可解释的细粒度分步评估指标涵盖语义对齐、逻辑推理、语义相似性和语言连贯性等多个维度。Saparov和He(2022)构建了基于真实或虚构本体的合成数据集PrOntoQA,其中每个样本都配有可转换为自然语言句式的唯一形式化证明,支持对每个推理步骤进行严格分析。Han等人(2022a)提出测试大语言模型一阶逻辑推理能力的FOLIO数据集,该数据集包含需要模型根据给定前提判断结论正确性的一阶逻辑推理问题。此外,Wang等人(2022b)对思维链(CoT)进行消融实验发现,大语言模型在无效推理提示下仍可能执行推理,其研究同时表明:推理步骤与查询的相关性及正确排序对CoT提示至关重要。

In summary, most existing studies primarily report the performance of the models on downstream reasoning tasks, without a detailed examination of the quality of the rationales produced. This leaves open the question of whether the models are actually able to reason in a way that is similar to human reasoning, or whether they are simply able to achieve good performance on the tasks through other means. Further research is needed to more formally analyze the reasoning abilities of LLMs.

总结而言,现有研究大多仅报告模型在下游推理任务中的性能表现,而未能深入分析其生成推理过程的质量。这导致一个关键问题悬而未决:这些模型究竟是真正实现了类人推理能力,还是通过其他方式在任务中取得优异表现。需要进一步研究来更系统地分析大语言模型的推理能力。

5 Findings and Implications

5 研究发现与启示

In this section, we summarize the important findings and implications of studies on reasoning in large language models.

在本节中,我们总结了大语言模型中关于推理研究的重要发现和意义。

Reasoning seems an emergent ability of LLMs. Wei et al. (2022a,b); Suzgun et al. (2022) show that reasoning ability appears to emerge only in large language models like GPT-3 175B, as evidenced by significant improvements in performance on reasoning tasks at a certain scale (e.g., 100 billion parameters). This suggests that it may be more effective to utilize large models for general reasoning problems rather than training small models for specific tasks. However, the reason for this emergent ability is not yet fully understood. We refer the reader to Wei et al. (2022a); Fu et al. (2022a) for some potential explanations.

推理似乎是大语言模型的一种涌现能力。Wei等人(2022a,b)和Suzgun等人(2022)的研究表明,推理能力似乎只在像GPT-3 175B这样的大语言模型中才会出现,这体现在当模型规模达到一定程度(例如1000亿参数)时,推理任务的性能会有显著提升。这表明对于通用推理问题,利用大模型可能比针对特定任务训练小模型更有效。然而,这种涌现能力的原因尚未完全明了。关于一些可能的解释,读者可以参考Wei等人(2022a)和Fu等人(2022a)的研究。

Chain of thought elicits “reasoning” of LLMs. The use of chain-of-thought (CoT) prompts (Wei et al., 2022b) has been shown to improve the performance of LLMs on various reasoning tasks, as demonstrated in the experiments of Wei et al. (2022a,b); Suzgun et al. (2022). Additionally, Saparov and He (2022) $(\S4.2)$ find that, when using CoT prompts, LLMs are able to produce valid individual proof steps, even when the synthetic ontology is fictional or counterfactual. However, they may sometimes choose the wrong steps when multiple options are available, leading to incomplete or incorrect proofs. Moreover, for many reasoning tasks where the performance of standard prompting grows smoothly with model scale, chain-of-thought prompting can lead to dramatic performance improvement. In addition to these benefits, the use of CoT prompts has been shown to improve the out-of-distribution robustness of LLMs (Wei et al., 2022b; Zhou et al., 2022a; Anil et al., 2022, inter alia), an advantage that is not typically observed with standard prompting or fully supervised finetuning paradigms.

思维链激发了大语言模型的"推理"能力。研究表明,使用思维链 (chain-of-thought,CoT) 提示 (Wei et al., 2022b) 能显著提升大语言模型在各种推理任务中的表现 (Wei et al., 2022a,b; Suzgun et al., 2022)。此外,Saparov和He (2022) ( §4.2 ) 发现,在使用CoT提示时,即使面对虚构或反事实的合成本体论,大语言模型也能生成有效的独立证明步骤。不过当存在多个选项时,模型有时会选择错误的步骤,导致证明不完整或不正确。值得注意的是,对于许多标准提示性能随模型规模平稳增长的推理任务,思维链提示能带来显著的性能提升。除上述优势外,CoT提示还被证明能增强大语言模型的分布外鲁棒性 (Wei et al., 2022b; Zhou et al., 2022a; Anil et al., 2022等),这是标准提示或全监督微调范式通常无法实现的优势。
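Mechanically, a CoT prompt simply interleaves worked rationales with answers in the few-shot exemplars before posing the target question. A minimal sketch follows; the exact formatting, e.g. the "The answer is" suffix, is a common convention rather than a fixed specification:

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt from
    (question, rationale, answer) exemplars, ending with the target
    question so the model continues with its own rationale."""
    blocks = [f"Q: {q}\nA: {r} The answer is {a}."
              for q, r, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Dropping the rationale strings from each exemplar recovers standard few-shot prompting, which is exactly the ablation the cited experiments compare against.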

LLMs show human-like content effects on reasoning. According to Dasgupta et al. (2022), LLMs exhibit reasoning patterns that are similar to those of humans as described in the cognitive literature. For example, the models’ predictions are influenced by both prior knowledge and abstract reasoning, and their judgments of logical validity are impacted by the believability of the conclusions. These findings suggest that, although language models may not always perform well on reasoning tasks, their failures often occur in situations that are challenging for humans as well. This provides some evidence that language models may “reason” in a way that is similar to human reasoning.

大语言模型在推理中表现出类人的内容效应。根据Dasgupta等人 (2022) 的研究,大语言模型展现出与认知文献中描述的人类推理模式相似的特征。例如,模型的预测同时受到先验知识和抽象推理的影响,其对逻辑有效性的判断也会受到结论可信度的干扰。这些发现表明,尽管语言模型在推理任务上表现并不总是出色,但其失败场景往往也是人类容易出错的场景。这为"语言模型可能以类人方式进行推理"的观点提供了部分证据。

LLMs are still unskilled at complex reasoning. Although LLMs seem to possess impressive reasoning capabilities with the techniques described in $\S3$ , they still struggle with more complex reasoning tasks or those involving implicature, according to studies such as Valmeekam et al. (2022);

大语言模型在复杂推理方面仍不熟练。尽管通过 $\S3$ 中描述的技术,大语言模型展现出令人印象深刻的推理能力,但根据 Valmeekam 等人 (2022) 等研究显示,它们在处理更复杂的推理任务或涉及隐含含义的任务时仍存在困难;

Han et al. (2022a); Ruis et al. (2022). For instance, Valmeekam et al. (2022) find that even in relatively simple commonsense planning domains that humans would have no trouble navigating, LLMs such as GPT-3 (Brown et al., 2020) and BLOOM (Scao et al., 2022) struggle to perform effectively. These findings suggest that existing benchmarks may be too simple to accurately gauge the true reasoning abilities of LLMs, and that more challenging tasks may be needed to fully evaluate their abilities in this regard.

Han等人 (2022a); Ruis等人 (2022)。例如,Valmeekam等人 (2022) 发现,即使在人类能够轻松应对的相对简单的常识规划领域中,像 GPT-3 (Brown等人, 2020) 和 BLOOM (Scao等人, 2022) 这样的大语言模型也难以有效执行任务。这些研究结果表明,现有基准测试可能过于简单,无法准确衡量大语言模型的真实推理能力,可能需要更具挑战性的任务来全面评估它们在这方面的能力。

6 Reflection, Discussion, and Future Directions

6 反思、讨论与未来方向

Why reasoning? Reasoning is the process of thinking about something in a logical and systematic way, and it is a key aspect of human intelligence. By incorporating reasoning capabilities into language models, we can enable them to perform tasks that require more complex and nuanced thinking, such as problem solving, decision making, and planning (Huang et al., 2022e,f; Song et al., 2022). This can improve the performance of these models on downstream tasks and increase their out-of-distribution robustness (Wei et al., 2022a,b; Suzgun et al., 2022; Zhou et al., 2022a; Anil et al., 2022). In addition, reasoning can make language models more explainable and interpretable, as it provides explicit rationales for their predictions.

为何需要推理能力?推理是以逻辑化、系统化的方式思考事物的过程,是人类智能的核心要素。通过为大语言模型赋予推理能力,我们可以使其执行需要更复杂、更细致思维的任务,例如问题解决、决策制定和规划 (Huang et al., 2022e,f; Song et al., 2022)。这将提升模型在下游任务中的表现,并增强其分布外鲁棒性 (Wei et al., 2022a,b; Suzgun et al., 2022; Zhou et al., 2022a; Anil et al., 2022)。此外,推理能力还能使大语言模型更具可解释性,因为它为预测结果提供了明确的逻辑依据。

Right task/application? As Valmeekam et al. (2022) point out, current benchmarks may not adequately reflect the reasoning capabilities of LLMs. In addition, tasks such as solving simple math problems and concatenating letters in strings (§4.1) are artificial and do not accurately reflect real-world situations. To truly understand the reasoning ability of LLMs, it is important to consider more realistic and meaningful applications such as decision making (Edwards, 1954), legal reasoning (Levi, 2013), and scientific reasoning (Zimmerman, 2000). Our ultimate goal should not be to enable LLMs to solve simple math problems, which can be simply done with other programs. When conducting relevant research, it is essential to ask whether the specific task being tackled is meaningful and whether the proposed method can be generalized to more realistic tasks and applications.

正确的任务/应用?正如Valmeekam等人(2022)指出的,当前基准测试可能无法充分反映大语言模型的推理能力。此外,解决简单数学问题和连接字符串中的字母(§4.1)等任务都是人为设计的,不能准确反映现实情况。要真正理解大语言模型的推理能力,必须考虑更现实且有意义的应用场景,例如决策制定(Edwards, 1954)、法律推理(Levi, 2013)和科学推理(Zimmerman, 2000)。我们的终极目标不应是让大语言模型解决简单数学问题——这些任务用其他程序就能轻松完成。在进行相关研究时,必须思考:正在解决的具体任务是否有意义?所提出的方法能否推广到更现实的任务和应用中?

Are language models really able to reason? There are several indications that LLMs are able to reason, including 1) high performance on various tasks requiring reasoning (Suzgun et al., 2022);

语言模型真的具备推理能力吗?有多个迹象表明大语言模型能够进行推理,包括:1) 在各类需要推理的任务上表现优异 (Suzgun et al., 2022);


2) the ability to reason step-by-step with chain-of-thought prompting (Wei et al., 2022b); and 3) the reflection of human-like content effects on reasoning (Dasgupta et al., 2022). However, these findings are not sufficient to conclude that LLMs can truly reason. For 1), it is not clear whether the models are making predictions based on reasoning or heuristics (Patel et al., 2021). For many existing benchmarks on reasoning, actually, we can design a program with heuristic rules to achieve very high performance. We usually do not think a program relying on heuristic rules is capable of reasoning. For 2), although the models seem to reason step-by-step, the generated rationales may be incorrect and inconsistent. It is possible that the models are “generating reasoning-like responses” rather than “reasoning step-by-step”. For 3), while LLMs display some human-like reasoning patterns, this does not necessarily mean that they behave like humans.

2) 通过思维链提示 (chain-of-thought prompting) 进行逐步推理的能力 (Wei et al., 2022b);以及 3) 在推理过程中反映类人内容效应的能力 (Dasgupta et al., 2022)。然而,这些发现不足以证明大语言模型具备真正的推理能力。对于第1点,目前尚不清楚模型是基于推理还是启发式规则进行预测 (Patel et al., 2021)。实际上,针对许多现有的推理基准测试,我们完全可以设计一个基于启发式规则的程序来获得极高分数,但通常不会认为依赖启发式规则的程序具备推理能力。对于第2点,尽管模型看似在进行逐步推理,但其生成的逻辑依据可能存在错误或不一致,这更可能是模型在"生成类推理的响应"而非"逐步推理"。对于第3点,虽然大语言模型展现出某些类人推理模式,但这并不必然意味着它们具有类人的推理行为。
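To make the heuristics point concrete: on a templated task like Coin Flip (§4.1), a program that never models the coin at all can score perfectly by exploiting surface regularities. The hypothetical sketch below assumes the benchmark's templated phrasing, where each flip is described as "X flips the coin" and non-flips as "X does not flip the coin"; it simply counts word occurrences, and no one would call it a reasoner.

```python
import re

def coin_flip_by_word_count(problem_text):
    """Surface heuristic with no state tracking: answer from the parity
    of occurrences of the word 'flips' alone. Sentences like
    'X does not flip the coin' use 'flip', so they are not counted."""
    n_flips = len(re.findall(r"\bflips\b", problem_text))
    return "yes" if n_flips % 2 == 0 else "no"
```

High accuracy from such a program says nothing about reasoning, which is why end-task accuracy alone cannot settle the question.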

Additionally, there are several observations that suggest LLMs may not be capable of reasoning: 1) LLMs still struggle with tasks that require complex reasoning (Valmeekam et al., 2022; Han et al., 2022a; Ruis et al., 2022). If LLMs are really decent reasoners, they should handle tasks that can be simply solved by humans through reasoning; 2) LLMs make mistakes in their reasoning, as explained above; 3) The performance of LLMs on downstream tasks has been found to be sensitive to the frequency of certain terms, such as numbers, in the training data (Razeghi et al., 2022; Jung et al., 2022), which would not be expected if the models were solving mathematical problems through reasoning; 4) Language models have been found to struggle with associating relevant information that they have memorized (Huang et al., 2022c).

此外,有几点观察表明大语言模型可能不具备推理能力:1) 大语言模型在需要复杂推理的任务上仍然表现不佳 (Valmeekam et al., 2022; Han et al., 2022a; Ruis et al., 2022)。如果大语言模型确实擅长推理,它们应该能处理人类通过简单推理就能解决的任务;2) 如前所述,大语言模型在推理过程中会犯错;3) 研究发现大语言模型在下游任务中的表现对训练数据中某些术语(如数字)的出现频率非常敏感 (Razeghi et al., 2022; Jung et al., 2022),如果模型是通过推理来解决数学问题,这种情况就不应该出现;4) 研究发现语言模型难以关联它们已记忆的相关信息 (Huang et al., 2022c)。

Overall, it is still too early to draw a conclusion about the proposed question. In fact, there is also an ongoing debate about whether language models can actually understand language or capture meaning (Bender and Koller, 2020; Li et al., 2021; Manning, 2022; Piantadosi and Hill, 2022). Further in-depth analysis of factors such as training data, model architecture, and optimization objectives is needed, as well as the development of better benchmarks for measuring the reasoning capabilities of LLMs. However, it is clear that the current models are not yet capable of robust reasoning.

总体而言,现在对提出的问题下结论还为时过早。事实上,关于语言模型是否能真正理解语言或捕捉意义 (Bender and Koller, 2020; Li et al., 2021; Manning, 2022; Piantadosi and Hill, 2022) 仍存在持续争论。我们需要对训练数据、模型架构和优化目标等因素进行更深入的分析,并开发更好的基准来衡量大语言模型的推理能力。但可以明确的是,当前模型尚未具备稳健的推理能力。

Improving reasoning capabilities of LLMs.

提升大语言模型的推理能力。

While techniques like chain-of-thought prompting (Wei et al., 2022b) may help to elicit reasoning abilities in large language models, they cannot enable the models to solve tasks beyond their current capabilities. To truly enhance reasoning in LLMs, we need to utilize training data, model architecture, and optimization objectives that are designed to encourage reasoning. For example, finetuning a model with a dataset including CoT data has been shown to improve reasoning (Chung et al., 2022), and models can also self-improve through the process of bootstrapping their reasoning (Zelikman et al., 2022; Huang et al., 2022a). There is still much research that needs to be done in this area, and we look forward to future progress in improving reasoning in large language models.

虽然思维链提示 (chain-of-thought prompting) 等技术 (Wei et al., 2022b) 可能有助于激发大语言模型的推理能力,但它们无法让模型解决超出当前能力范围的任务。要真正增强大语言模型的推理能力,我们需要利用专门设计用于促进推理的训练数据、模型架构和优化目标。例如,使用包含思维链数据的数据集对模型进行微调已被证明可以提升推理能力 (Chung et al., 2022),模型还能通过自举推理 (bootstrapping) 过程实现自我改进 (Zelikman et al., 2022; Huang et al., 2022a)。该领域仍有许多研究亟待开展,我们期待未来在提升大语言模型推理能力方面取得进展。

7 Conclusion

7 结论

In this paper, we have provided a detailed and up-to-date review of the current state of knowledge on reasoning in large language models. We have discussed techniques for improving and eliciting reasoning in LLMs, methods and benchmarks for evaluating reasoning abilities, and the findings and implications of previous studies on this topic. While LLMs have made significant progress in natural language processing and related fields, it remains unclear to what extent they are capable of true reasoning or whether they are simply using memorized patterns and heuristics to solve problems. Further research is needed to fully understand the reasoning abilities of LLMs, improve LLMs’ reasoning capabilities, and determine their potential for use in a variety of applications. We hope that this paper will serve as a useful overview of the current state of the field and stimulate further discussion and research on this interesting and important topic.

本文全面综述了大语言模型推理能力的研究现状,系统探讨了提升和激发大语言模型推理能力的技术方法、评估推理能力的基准体系,以及该领域已有研究成果的发现与启示。尽管大语言模型在自然语言处理及相关领域取得显著进展,但其是否具备真正的推理能力,抑或仅依靠记忆模式和启发式方法解决问题,目前仍无定论。未来研究需要深入探索大语言模型的推理机制、持续提升其推理性能,并评估其在各类应用场景中的潜力。希望本文能为相关领域研究者提供系统性参考,推动这一重要议题的深入探讨与研究发展。

Limitations

局限性

In this paper, we provide an overview of the current state of knowledge on reasoning in large language models. Reasoning is a broad concept that encompasses various forms, making it impractical to summarize all related work in a single paper. Therefore, we focus on deductive reasoning, as it is the most commonly studied in the literature. Other forms of reasoning such as inductive reasoning (Yang et al., 2022; Misra et al., 2022, inter alia) and abductive reasoning (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022, inter alia) may not be discussed in depth.

本文概述了当前关于大语言模型推理能力的研究现状。推理是一个宽泛概念,涵盖多种形式,因此难以在一篇论文中全面总结所有相关工作。我们主要聚焦演绎推理(deductive reasoning),因为这是文献中最常研究的形式。其他推理形式如归纳推理(inductive reasoning) (Yang et al., 2022; Misra et al., 2022等)和溯因推理(abductive reasoning) (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022等)可能不会深入讨论。

Additionally, given the rapid evolution and significance of reasoning within large language models, it is crucial to note that new contributions may have emerged in the field concurrent with the writing of this paper. An additional resource to consider is a parallel survey by Qiao et al. (2022), which emphasizes reasoning via language model prompting. Our coverage may not extend to papers released during or after 2023 such as evaluation on ChatGPT (Bang et al., 2023; Zheng et al., 2023). As such, we recommend readers to check the papers that cite this survey for a more comprehensive and updated understanding of this field.

此外,鉴于大语言模型 (Large Language Model) 中推理能力的快速发展和重要性,必须注意到在本文撰写期间该领域可能已有新成果涌现。Qiao 等人 (2022) 同期发表的另一篇综述重点关注基于语言模型提示的推理方法,可作为补充参考。本文可能未涵盖 2023 年及之后发表的研究(例如针对 ChatGPT 的评估工作 [Bang 等人, 2023; Zheng 等人, 2023])。因此,建议读者查阅引用本综述的文献以获取该领域更全面、最新的研究进展。

Acknowledgements

致谢

We would like to thank Jason Wei (OpenAI) and Denny Zhou (Google DeepMind) for their valuable advice and constructive feedback on this work. This material is based upon work supported by the National Science Foundation IIS 16-19302 and IIS 16-33755, Zhejiang University ZJU Research 083650, IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) and IBM-Illinois Discovery Accelerator Institute (IIDAI), gift grants from eBay and Microsoft Azure, UIUC OVCR CCIL Planning Grant 434S34, UIUC CSBS Small Grant 434C8U, and UIUC New Frontiers Initiative. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.

我们要感谢Jason Wei (OpenAI) 和 Denny Zhou (Google DeepMind) 对本工作提出的宝贵建议和建设性反馈。本材料基于以下资助项目的研究成果:美国国家科学基金会 IIS 16-19302 和 IIS 16-33755、浙江大学 ZJU Research 083650、IBM-伊利诺伊认知计算系统研究中心 (C3SR) 和 IBM-伊利诺伊发现加速器研究所 (IIDAI)、eBay 和 Microsoft Azure 的捐赠资助、伊利诺伊大学厄巴纳-香槟分校 OVCR CCIL Planning Grant 434S34、UIUC CSBS Small Grant 434C8U 以及 UIUC New Frontiers Initiative。本出版物中表达的任何观点、发现、结论或建议均为作者个人观点,并不代表资助机构的立场。

References

参考文献

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Aida Amini、Saadia Gabriel、Shanchuan Lin、Rik Koncel-Kedziorski、Yejin Choi和Hannaneh Hajishirzi。2019. MathQA:基于运算形式化的可解释数学应用题求解。载于《2019年北美计算语言学协会会议论文集:人类语言技术(长论文与短论文)》第1卷,第2357-2367页,明尼苏达州明尼阿波利斯。计算语言学协会。

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. 2022. Exploring length generalization in large language models. ArXiv preprint, abs/2207.04901.

Cem Anil、Yuhuai Wu、Anders Andreassen、Aitor Lewkowycz、Vedant Misra、Vinay Ramasesh、Ambrose Slone、Guy Gur-Ari、Ethan Dyer 和 Behnam Neyshabur。2022。探索大语言模型中的长度泛化能力。ArXiv预印本,abs/2207.04901。

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. ArXiv preprint, abs/2302.04023.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung等. 2023. ChatGPT在推理、幻觉和交互性方面的多任务、多语言、多模态评估. ArXiv预印本, abs/2302.04023.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.

Emily M. Bender 和 Alexander Koller. 2020. 攀登自然语言理解之峰:数据时代的意义、形式与理解. 见《第58届计算语言学协会年会论文集》, 第5185–5198页, 线上会议. 计算语言学协会.

Prajjwal Bhargava and Vincent Ng. 2022. Commonsense knowledge reasoning and generation with pretrained language models: A survey. Proceedings of the AAAI Conference on Artificial Intelligence.

Prajjwal Bhargava 和 Vincent Ng. 2022. 基于预训练语言模型的常识知识推理与生成综述. Proceedings of the AAAI Conference on Artificial Intelligence.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. ArXiv preprint, abs/2108.07258.

Rishi Bommasani、Drew A Hudson、Ehsan Adeli、Russ Altman、Simran Arora、Sydney von Arx、Michael S Bernstein、Jeannette Bohg、Antoine Bosselut、Emma Brunskill 等。2021。基础模型的机遇与风险。ArXiv预印本,abs/2108.07258。

Hugo Bronkhorst, Gerrit Roorda, Cor Suhre, and Martin Goedhart. 2020. Logical reasoning in formal and everyday reasoning tasks. International Journal of Science and Mathematics Education, 18(8):1673– 1694.

Hugo Bronkhorst、Gerrit Roorda、Cor Suhre 和 Martin Goedhart。2020。形式推理与日常推理任务中的逻辑思维。国际科学与数学教育杂志,18(8):1673–1694。

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. 大语言模型是少样本学习者。见《神经信息处理系统进展 33: 2020年神经信息处理系统年会》(NeurIPS 2020),2020年12月6-12日,线上会议。

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374.

Mark Chen、Jerry Tworek、Heewoo Jun、Qiming Yuan、Henrique Ponde de Oliveira Pinto、Jared Kaplan、Harri Edwards、Yuri Burda、Nicholas Joseph、Greg Brockman 等。2021. 评估基于代码训练的大语言模型。ArXiv预印本,abs/2107.03374。

Wenhu Chen. 2022. Large language models are few(1)-shot table reasoners. ArXiv preprint, abs/2210.06710.

Wenhu Chen. 2022. 大语言模型是少样本 (few-shot) 表格推理者. ArXiv预印本, abs/2210.06710.

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. ArXiv preprint, abs/2211.12588.

Wenhu Chen、Xueguang Ma、Xinyi Wang和William W Cohen。2022。思维编程提示:数值推理任务中计算与推理的解耦。ArXiv预印本,abs/2211.12588。

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. ArXiv preprint, abs/2204.02311.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann 等. 2022. PaLM: 基于Pathways扩展的语言建模. ArXiv预印本, abs/2204.02311.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416.

Hyung Won Chung、Le Hou、Shayne Longpre、Barret Zoph、Yi Tay、William Fedus、Eric Li、Xuezhi Wang、Mostafa Dehghani、Siddhartha Brahma 等. 2022. 规模化指令微调语言模型. ArXiv预印本, abs/2210.11416.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv preprint, abs/1803.05457.

Peter Clark、Isaac Cowhey、Oren Etzioni、Tushar Khot、Ashish Sabharwal、Carissa Schoenick 和 Oyvind Tafjord。2018。自认为已解决问答问题?试试 ARC (AI2 推理挑战赛)。ArXiv 预印本,abs/1803.05457。

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168.

Karl Cobbe、Vineet Kosaraju、Mohammad Bavarian、Jacob Hilton、Reiichiro Nakano、Christopher Hesse 和 John Schulman。2021。训练验证器解决数学应用题。ArXiv预印本,abs/2110.14168。

Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models. ArXiv preprint, abs/2208.14271.

Antonia Creswell 和 Murray Shanahan. 2022. 基于大语言模型的可靠推理. ArXiv 预印本, abs/2208.14271.

Antonia Creswell, Murray Shanahan, and Irina Higgins. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning. ArXiv preprint, abs/2205.09712.

Antonia Creswell、Murray Shanahan和Irina Higgins。2022。选择-推断:利用大语言模型实现可解释的逻辑推理。ArXiv预印本,abs/2205.09712。

Ishita Dasgupta, Andrew K Lampinen, Stephanie CY Chan, Antonia Creswell, Dharshan Kumaran, James L McClelland, and Felix Hill. 2022. Language models show human-like content effects on reasoning. ArXiv preprint, abs/2207.07051.

Ishita Dasgupta、Andrew K Lampinen、Stephanie CY Chan、Antonia Creswell、Dharshan Kumaran、James L McClelland 和 Felix Hill。2022。语言模型在推理中表现出类人类的内容效应。ArXiv预印本,abs/2207.07051。

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jacob Devlin、Ming-Wei Chang、Kenton Lee 和 Kristina Toutanova。2019. BERT:用于语言理解的深度双向Transformer预训练。载于《2019年北美计算语言学协会人类语言技术会议论文集(长文与短文)》第1卷,第4171–4186页,明尼苏达州明尼阿波利斯市。计算语言学协会。

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. 2022. Language model cascades. ArXiv preprint, abs/2207.10342.

David Dohan、Winnie Xu、Aitor Lewkowycz、Jacob Austin、David Bieber、Raphael Gontijo Lopes、Yuhuai Wu、Henryk Michalewski、Rif A Saurous、Jascha Sohl-Dickstein 等。2022。语言模型级联。ArXiv预印本,abs/2207.10342。

Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. 2022. Compositional semantic parsing with large language models. ArXiv preprint, abs/2209.15003.

Andrew Drozdov、Nathanael Schärli、Ekin Akyürek、Nathan Scales、Xinying Song、Xinyun Chen、Olivier Bousquet 和 Denny Zhou。2022。基于大语言模型的组合式语义解析。ArXiv预印本,abs/2209.15003。

Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. 2022. Successive prompting for decomposing complex questions. ArXiv preprint, abs/2212.04092.

Dheeru Dua、Shivanshu Gupta、Sameer Singh和Matt Gardner。2022。通过连续提示分解复杂问题。ArXiv预印本,abs/2212.04092。

Ward Edwards. 1954. The theory of decision making. Psychological bulletin, 51(4):380.

Ward Edwards. 1954. 决策理论. Psychological bulletin, 51(4):380.

Ronald Fagin, Joseph Y Halpern, Yoram Moses, and Moshe Vardi. 2004. Reasoning about knowledge. MIT press.

Ronald Fagin、Joseph Y Halpern、Yoram Moses 和 Moshe Vardi。2004。知识推理 (Reasoning about knowledge)。MIT press。

Yao Fu, Hao Peng, and Tushar Khot. 2022a. How does GPT obtain its ability? Tracing emergent abilities of language models to their sources.

Yao Fu、Hao Peng 和 Tushar Khot。2022a。GPT如何获得其能力?追踪大语言模型涌现能力的来源。

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022b. Complexity-based prompting for multi-step reasoning. ArXiv preprint, abs/2210.00720.

Yao Fu、Hao Peng、Ashish Sabharwal、Peter Clark 和 Tushar Khot。2022b。基于复杂度的多步推理提示方法。ArXiv预印本,abs/2210.00720。

Kathleen M Galotti. 1989. Approaches to studying formal and everyday reasoning. Psychological bulletin, 105(3):331.

Kathleen M Galotti. 1989. 研究形式推理与日常推理的方法。Psychological bulletin, 105(3):331.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. PAL: Program-aided language models. ArXiv preprint, abs/2211.10435.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig. 2022. PAL: 程序辅助语言模型. ArXiv预印本, abs/2211.10435.

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346– 361.

Mor Geva、Daniel Khashabi、Elad Segal、Tushar Khot、Dan Roth和Jonathan Berant。2021。亚里士多德用过笔记本电脑吗?一个蕴含隐性推理策略的问答基准。计算语言学协会汇刊,9:346–361。

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2022. ROSCOE: A suite of metrics for scoring step-by-step reasoning. ArXiv preprint, abs/2212.07919.

Olga Golovneva、Moya Chen、Spencer Poff、Martin Corredor、Luke Zettlemoyer、Maryam Fazel-Zarandi 和 Asli Celikyilmaz。2022。Roscoe: 逐步推理评分指标套件。ArXiv预印本,abs/2212.07919。

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, et al. 2022a. FOLIO: Natural language reasoning with first-order logic. ArXiv preprint, abs/2209.00840.

Simeng Han、Hailey Schoelkopf、Yilun Zhao、Zhenting Qi、Martin Riddell、Luke Benson、Lucy Sun、Ekaterina Zubova、Yujie Qiao、Matthew Burtell等。2022a。Folio: 使用一阶逻辑的自然语言推理。ArXiv预印本,abs/2209.00840。

Simon Jerome Han, Keith Ransom, Andrew Perfors, and Charles Kemp. 2022b. Human-like property induction is a challenge for large language models.

Simon Jerome Han、Keith Ransom、Andrew Perfors 和 Charles Kemp。2022b。类人属性归纳对大语言模型构成挑战。

Hangfeng He, Hongming Zhang, and Dan Roth. 2023. Rethinking with retrieval: Faithful large language model inference. ArXiv preprint, abs/2301.00303.

Hangfeng He, Hongming Zhang 和 Dan Roth. 2023. 基于检索的再思考: 可信大语言模型推理. ArXiv 预印本, abs/2301.00303.

Chadi Helwe, Chloé Clavel, and Fabian M Suchanek. 2021. Reasoning with transformer-based models: Deep learning, but shallow reasoning. In 3rd Conference on Automated Knowledge Base Construction.

Chadi Helwe、Chloé Clavel 和 Fabian M Suchanek。2021。基于Transformer模型的推理:深度学习与浅层推理。第三届自动化知识库构建会议。

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1.

Dan Hendrycks、Collin Burns、Saurav Kadavath、Akul Arora、Steven Basart、Eric Tang、Dawn Song 和 Jacob Steinhardt。2021。使用数学数据集衡量数学问题求解能力。见《神经信息处理系统数据集与基准跟踪会议论文集》第1卷。

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022a. Large language models can self-improve. ArXiv preprint, abs/2210.11610.

Jiaxin Huang、Shixiang Shane Gu、Le Hou、Yuexin Wu、Xuezhi Wang、Hongkun Yu 和 Jiawei Han。2022a。大语言模型 (Large Language Model) 能够自我改进。ArXiv 预印本,abs/2210.11610。

Jie Huang, Kevin Chang, Jinjun Xiong, and Wen-mei Hwu. 2022b. Open relation modeling: Learning to define relations between entities. In Findings of the Association for Computational Linguistics: ACL 2022, pages 297–308, Dublin, Ireland. Association for Computational Linguistics.

Jie Huang、Kevin Chang、Jinjun Xiong 和 Wen-mei Hwu。2022b。开放关系建模:学习定义实体间关系。载于《计算语言学协会发现集:ACL 2022》,第297–308页,爱尔兰都柏林。计算语言学协会。

Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022c. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2038–2047, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

黄杰、邵涵印与Kevin Chen-Chuan Chang。2022c。大型预训练语言模型是否泄露了您的个人信息?载于《计算语言学协会发现:EMNLP 2022》,第2038–2047页,阿拉伯联合酋长国阿布扎比。计算语言学协会。

Jie Huang, Kerui Zhu, Kevin Chen-Chuan Chang, Jinjun Xiong, and Wen-mei Hwu. 2022d. DEER: Descriptive knowledge graph for explaining entity relationships. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6686–6698, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jie Huang、Kerui Zhu、Kevin Chen-Chuan Chang、Jinjun Xiong 和 Wen-mei Hwu。2022d。DEER: 用于解释实体关系的描述性知识图谱。载于《2022年自然语言处理实证方法会议论文集》,第6686–6698页,阿拉伯联合酋长国阿布扎比。计算语言学协会。

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022e. Language models as zeroshot planners: Extracting actionable knowledge for embodied agents. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9118–9147. PMLR.

Wenlong Huang、Pieter Abbeel、Deepak Pathak 和 Igor Mordatch。2022e。作为零样本规划器的大语言模型:为具身智能体提取可执行知识。载于《第39届国际机器学习会议论文集》,第162卷《机器学习研究论文集》,第9118–9147页。PMLR。

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 2022f. Inner monologue: Embodied reasoning through planning with language models. In 2022 Conference on Robot Learning.

Wenlong Huang、Fei Xia、Ted Xiao、Harris Chan、Jacky Liang、Pete Florence、Andy Zeng、Jonathan Tompson、Igor Mordatch、Yevgen Chebotar等. 2022f. 内心独白: 通过大语言模型规划实现具身推理. 收录于2022机器人学习大会.

Michael Huth and Mark Ryan. 2004. Logic in Computer Science: Modelling and reasoning about systems. Cambridge university press.

Michael Huth 和 Mark Ryan。2004。计算机科学中的逻辑:系统建模与推理。剑桥大学出版社。

Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. 2022. Maieutic prompting: Logically consistent reasoning with recursive explanations. The 2022 Conference on Empirical Methods in Natural Language Processing.

Jaehun Jung、Lianhui Qin、Sean Welleck、Faeze Brahman、Chandra Bhagavatula、Ronan Le Bras 和 Yejin Choi。2022。产婆式提示:通过递归解释实现逻辑一致性推理。2022年自然语言处理实证方法会议。

Seyed Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, and Deepak Ramachandran. 2022. Lambada: Backward chaining for automated reasoning in natural language. ArXiv preprint, abs/2212.13894.

Seyed Mehran Kazemi、Najoung Kim、Deepti Bhatia、Xin Xu和Deepak Ramachandran. 2022. Lambada: 自然语言自动推理中的反向链式推理. ArXiv预印本, abs/2212.13894.

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Daniel Keysers、Nathanael Schärli、Nathan Scales、Hylke Buisman、Daniel Furrer、Sergii Kashubin、Nikola Momchev、Danila Sinopalnikov、Lukasz Stafiniak、Tibor Tihon、Dmitry Tsarkov、Xiao Wang、Marc van Zee 和 Olivier Bousquet。2020。测量组合泛化能力:基于现实数据的综合方法。发表于第八届国际学习表征会议 (ICLR 2020),2020年4月26-30日,埃塞俄比亚亚的斯亚贝巴。OpenReview.net。

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. ArXiv preprint, abs/2210.02406.

Tushar Khot、Harsh Trivedi、Matthew Finlayson、Yao Fu、Kyle Richardson、Peter Clark 和 Ashish Sabharwal。2022。分解式提示 (Decomposed Prompting):解决复杂任务的模块化方法。ArXiv预印本,abs/2210.02406。

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.

Takeshi Kojima、Shixiang Shane Gu、Machel Reid、Yutaka Matsuo 和 Yusuke Iwasawa。2022。大语言模型是零样本推理器。载于《神经信息处理系统进展》。

Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2879–2888. PMLR.

Brenden M. Lake 和 Marco Baroni。2018。无系统性的泛化:论序列到序列循环网络的组合技能。载于《第35届国际机器学习会议论文集》(ICML 2018),瑞典斯德哥尔摩Stockholmsmässan会展中心,2018年7月10-15日,机器学习研究论文集第80卷,第2879–2888页。PMLR。

Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. 2022. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022.

Andrew K Lampinen、Ishita Dasgupta、Stephanie CY Chan、Kory Matthewson、Michael Henry Tessler、Antonia Creswell、James L McClelland、Jane X Wang 和 Felix Hill。2022. 语言模型能否从上下文解释中学习?见《计算语言学协会发现:EMNLP 2022》。

Edward H Levi. 2013. An introduction to legal reasoning. University of Chicago Press.

Edward H Levi. 2013. 法律推理导论. University of Chicago Press.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. ArXiv preprint, abs/2206.14858.

Aitor Lewkowycz、Anders Andreassen、David Dohan、Ethan Dyer、Henryk Michalewski、Vinay Ramasesh、Ambrose Slone、Cem Anil、Imanol Schlag、Theo Gutman-Solo 等。2022。利用语言模型解决定量推理问题。ArXiv预印本,abs/2206.14858。

Belinda Z. Li, Maxwell Nye, and Jacob Andreas. 2021. Implicit representations of meaning in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1813–1827, Online. Association for Computational Linguistics.

Belinda Z. Li、Maxwell Nye和Jacob Andreas。2021。神经语言模型中意义的隐式表征。见《第59届计算语言学协会年会暨第11届自然语言处理国际联合会议论文集(第一卷:长论文)》,第1813–1827页,线上。计算语言学协会。

Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, et al. 2022a. Explanations from large language models make small reasoners better. ArXiv preprint, abs/2210.06726.

Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao 等. 2022a. 大语言模型 (Large Language Model) 的解释让小推理模型表现更好. ArXiv 预印本, abs/2210.06726.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2022b. On the advance of making language models better reasoners. ArXiv preprint, abs/2206.02336.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2022b. 论提升语言模型推理能力的研究进展. ArXiv preprint, abs/2206.02336.

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2020. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online. Association for Computational Linguistics.

Bill Yuchen Lin、Wangchunshu Zhou、Ming Shen、Pei Zhou、Chandra Bhagavatula、Yejin Choi 和 Xiang Ren。2020. CommonGen: 面向生成式常识推理的受限文本生成挑战。载于《计算语言学协会发现:EMNLP 2020》,第1823–1840页,线上。计算语言学协会。

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.

王凌、Dani Yogatama、Chris Dyer 和 Phil Blunsom。2017. 通过生成推理实现程序归纳:学习求解和解释代数应用题。载于《第55届计算语言学协会年会论文集(第一卷:长论文)》,第158-167页,加拿大温哥华。计算语言学协会。

Chenzhengyi Liu, Jie Huang, Kerui Zhu, and Kevin Chen-Chuan Chang. 2022a. DimonGen: Diversified generative commonsense reasoning for explaining concept relationships. ArXiv preprint, abs/2212.10545.

Chenzhengyi Liu、Jie Huang、Kerui Zhu 和 Kevin Chen-Chuan Chang。2022a。DimonGen:用于解释概念关系的多样化生成式常识推理。ArXiv 预印本,abs/2212.10545。

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022b. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.

Jiachang Liu、Dinghan Shen、Yizhe Zhang、Bill Dolan、Lawrence Carin 和 Weizhu Chen。2022b。什么构成了GPT-3的良好上下文示例?见《深度学习内外会议论文集》(DeeLIO 2022):第三届深度学习架构知识提取与集成研讨会,第100-114页,爱尔兰都柏林及在线。计算语言学协会。

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint, abs/1907.11692.

Yinhan Liu、Myle Ott、Naman Goyal、Jingfei Du、Mandar Joshi、Danqi Chen、Omer Levy、Mike Lewis、Luke Zettlemoyer 和 Veselin Stoyanov。2019。RoBERTa: 一种稳健优化的 BERT 预训练方法。ArXiv 预印本,abs/1907.11692。

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems.

Pan Lu、Swaroop Mishra、Tony Xia、Liang Qiu、Kai-Wei Chang、Song-Chun Zhu、Oyvind Tafjord、Peter Clark 和 Ashwin Kalyan。2022。学会解释:通过思维链实现科学问答的多模态推理。载于《神经信息处理系统进展》。

Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. 2022. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Aman Madaan、Shuyan Zhou、Uri Alon、Yiming Yang和Graham Neubig。2022. 代码语言模型是少样本常识学习者。载于《2022年自然语言处理实证方法会议论文集》(EMNLP)。

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. ArXiv preprint, abs/2212.08410.

Lucie Charlotte Magister、Jonathan Mallinson、Jakub Adamek、Eric Malmi和Aliaksei Severyn。2022。教小语言模型推理。ArXiv预印本,abs/2212.08410。

Christopher D Manning. 2022. Human language understanding & reasoning. Daedalus, 151(2):127–138.

Christopher D Manning. 2022. 人类语言理解与推理. Daedalus, 151(2):127–138.

Gary Marcus. 2020. The next decade in ai: four steps towards robust artificial intelligence. ArXiv preprint, abs/2002.06177.

Gary Marcus. 2020. 人工智能的下一个十年:迈向稳健人工智能的四步. ArXiv预印本, abs/2002.06177.

Conor McHugh and Jonathan Way. 2018. What is reasoning? Mind, 127(505):167–196.

Conor McHugh和Jonathan Way。2018。什么是推理?《心智》,127(505):167–196。

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.

Shen-yun Miao、Chao-Chun Liang和Keh-Yih Su。2020. 用于评估和开发英语数学应用题求解器的多样化语料库。载于《第58届计算语言学协会年会论文集》,第975–984页,线上会议。计算语言学协会。

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.

Sewon Min、Victor Zhong、Luke Zettlemoyer 和 Hannaneh Hajishirzi。2019. 通过问题分解与重评分的多跳阅读理解。见《第57届计算语言学协会年会论文集》,第6097-6109页,意大利佛罗伦萨。计算语言学协会。

Kanishka Misra, Julia Taylor Rayz, and Allyson Ettinger. 2022. A property induction framework for neural language models. ArXiv preprint, abs/2205.06910.

Kanishka Misra、Julia Taylor Rayz和Allyson Ettinger。2022。神经语言模型的属性归纳框架。ArXiv预印本,abs/2205.06910。

Melanie Mitchell. 2021. Abstraction and analogymaking in artificial intelligence. Annals of the New York Academy of Sciences, 1505(1):79–101.

Melanie Mitchell. 2021. 人工智能中的抽象与类比构建。Annals of the New York Academy of Sciences, 1505(1):79–101.

Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryscinski, Hailey Schoelkopf, Riley Kong, Xiangru Tang, Mutethia Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, and Dragomir Radev. 2022. FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49.

Linyong Nan、Chiachun Hsieh、Ziming Mao、Xi Victoria Lin、Neha Verma、Rui Zhang、Wojciech Kryscinski、Hailey Schoelkopf、Riley Kong、Xiangru Tang、Mutethia Mutuma、Ben Rosand、Isabel Trindade、Renusree Bandaru、Jacob Cunningham、Caiming Xiong 和 Dragomir Radev。2022。FeTaQA: 自由形式表格问答。计算语言学协会汇刊,10:35–49。

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2022. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop.

Maxwell Nye、Anders Johan Andreassen、Guy Gur-Ari、Henryk Michalewski、Jacob Austin、David Bieber、David Dohan、Aitor Lewkowycz、Maarten Bosma、David Luan、Charles Sutton 和 Augustus Odena。2022。展示你的工作:用于大语言模型中间计算的草稿本。见于《代码深度学习研讨会》。

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. OpenAI.

OpenAI. 2022. ChatGPT: 对话优化的语言模型. OpenAI.

John Arthur Passmore. 1961. Philosophical reasoning.

John Arthur Passmore. 1961. 哲学推理。

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470– 1480, Beijing, China. Association for Computational Linguistics.

Panupong Pasupat和Percy Liang。2015。半结构化表格的组合语义解析。载于《第53届计算语言学协会年会暨第7届自然语言处理国际联合会议(第一卷:长论文)》,第1470–1480页,中国北京。计算语言学协会。

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

Arkil Patel、Satwik Bhattamishra 和 Navin Goyal。2021。NLP 模型真的能解决简单数学应用题吗?(Are NLP models really able to solve simple math word problems?) 载于《2021年北美计算语言学协会人类语言技术会议论文集》(Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies),第2080–2094页,线上会议。计算语言学协会(Association for Computational Linguistics)。

Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8864–8880, Online. Association for Computational Linguistics.

Ethan Perez、Patrick Lewis、Wen-tau Yih、Kyunghyun Cho 和 Douwe Kiela。2020. 无监督问题分解在问答中的应用。载于《2020年自然语言处理实证方法会议论文集》(EMNLP),第8864–8880页,线上。计算语言学协会。

Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, and Weizhu Chen. 2022. Reasoning like program executors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xinyu Pi、Qian Liu、Bei Chen、Morteza Ziyadi、Zeqi Lin、Yan Gao、Qiang Fu、Jian-Guang Lou 和 Weizhu Chen。2022。像程序执行器一样推理。载于《2022年自然语言处理实证方法会议论文集》(EMNLP)。

Steven T Piantadosi and Felix Hill. 2022. Meaning without reference in large language models. ArXiv preprint, abs/2208.02957.

Steven T Piantadosi 和 Felix Hill。2022。大语言模型中的无指涉意义。ArXiv 预印本,abs/2208.02957。

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. Measuring and narrowing the compositionality gap in language models. ArXiv preprint, abs/2210.03350.

Ofir Press、Muru Zhang、Sewon Min、Ludwig Schmidt、Noah A Smith和Mike Lewis。2022。测量并缩小语言模型中的组合性差距。ArXiv预印本,abs/2210.03350。

Ben Prystawski, Paul Thibodeau, and Noah Goodman. 2022. Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models. ArXiv preprint, abs/2209.08141.

Ben Prystawski、Paul Thibodeau和Noah Goodman。2022。大语言模型中基于心理学的思维链提示用于隐喻理解。ArXiv预印本,abs/2209.08141。

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2022. Reasoning with language model prompting: A survey. ArXiv preprint, abs/2212.09597.

Shuofei Qiao、Yixin Ou、Ningyu Zhang、Xiang Chen、Yunzhi Yao、Shumin Deng、Chuanqi Tan、Fei Huang 和 Huajun Chen。2022. 基于大语言模型提示的推理方法综述。ArXiv预印本,abs/2212.09597。

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.

Alec Radford、Karthik Narasimhan、Tim Salimans、Ilya Sutskever 等. 2018. 通过生成式预训练提升语言理解能力.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Alec Radford、Jeffrey Wu、Rewon Child、David Luan、Dario Amodei、Ilya Sutskever 等. 2019. 语言模型是无监督多任务学习者. OpenAI博客, 1(8):9.

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. ArXiv preprint, abs/2112.11446.

Jack W Rae、Sebastian Borgeaud、Trevor Cai、Katie Millican、Jordan Hoffmann、Francis Song、John Aslanides、Sarah Henderson、Roman Ring、Susannah Young 等. 2021. 扩展语言模型:训练Gopher的方法、分析与洞见. ArXiv预印本, abs/2112.11446.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67.

Colin Raffel、Noam Shazeer、Adam Roberts、Katherine Lee、Sharan Narang、Michael Matena、Yanqi Zhou、Wei Li、Peter J Liu等. 2020. 探索迁移学习的极限:基于统一文本到文本Transformer的研究. J. Mach. Learn. Res., 21(140):1–67.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy. Association for Computational Linguistics.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. 自我解释! 利用语言模型进行常识推理. 见: 第57届计算语言学协会年会论文集, 第4932–4942页, 意大利佛罗伦萨. 计算语言学协会.

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. ArXiv preprint, abs/2202.07206.

Yasaman Razeghi、Robert L Logan IV、Matt Gardner和Sameer Singh。2022。预训练词频对少样本推理的影响。ArXiv预印本,abs/2202.07206。

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal. Association for Computational Linguistics.

Subhro Roy和Dan Roth。2015。解决通用算术文字问题。2015年自然语言处理经验方法会议论文集,第1743–1752页,葡萄牙里斯本。计算语言学协会。

Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. 2022. Large language models are not zero-shot communicators. ArXiv preprint, abs/2210.14986.

Laura Ruis、Akbir Khan、Stella Biderman、Sara Hooker、Tim Rocktäschel 和 Edward Grefenstette。2022。大语言模型并非零样本沟通者。ArXiv预印本,abs/2210.14986。

Jacob Russin, Randall C O’Reilly, and Yoshua Bengio. 2020. Deep learning needs a prefrontal cortex. Work Bridging AI Cogn Sci, 107:603–616.

Jacob Russin、Randall C O'Reilly和Yoshua Bengio。2020. 深度学习需要前额叶皮层。《桥接人工智能与认知科学的工作》,107:603–616。

Abulhair Saparov and He He. 2022. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. ArXiv preprint, abs/2210.01240.

Abulhair Saparov 和 He He. 2022. 语言模型是贪婪推理者: 对思维链的系统性形式分析. ArXiv 预印本, abs/2210.01240.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176B-parameter open-access multilingual language model. ArXiv preprint, abs/2211.05100.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé 等. 2022. Bloom: 一个1760亿参数的开源多语言大语言模型. ArXiv预印本, abs/2211.05100.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022. Language models are multilingual chain-of-thought reasoners. ArXiv preprint, abs/2210.03057.

Freda Shi、Mirac Suzgun、Markus Freitag、Xuezhi Wang、Suraj Srivats、Soroush Vosoughi、Hyung Won Chung、Yi Tay、Sebastian Ruder、Denny Zhou等。2022。语言模型是多语言思维链推理器。ArXiv预印本,abs/2210.03057。

Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. 2022. Distilling multi-step reasoning capabilities of large language models into smaller models via semantic decompositions. ArXiv preprint, abs/2212.00193.

Kumar Shridhar、Alessandro Stolfo 和 Mrinmaya Sachan. 2022. 通过语义分解将大语言模型的多步推理能力蒸馏至小模型. ArXiv 预印本, abs/2212.00193.

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. 2022. Llm-planner: Few-shot grounded planning for embodied agents with large language models. ArXiv preprint, abs/2212.04088.

Chan Hee Song、Jiaman Wu、Clayton Washington、Brian M Sadler、Wei-Lun Chao 和 Yu Su。2022。LLM-Planner:基于大语言模型的具身智能体少样本接地规划。ArXiv预印本,abs/2212.04088。

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ArXiv preprint, abs/2206.04615.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso 等. 2022. 超越模仿游戏:量化与推演大语言模型的能力. ArXiv 预印本, abs/2206.04615.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging BIG-bench tasks and whether chain-of-thought can solve them. ArXiv preprint, abs/2210.09261.

Mirac Suzgun、Nathan Scales、Nathanael Schärli、Sebastian Gehrmann、Yi Tay、Hyung Won Chung、Aakanksha Chowdhery、Quoc V Le、Ed H Chi、Denny Zhou等。2022。挑战Big-Bench任务及思维链(Chain-of-Thought)能否解决它们。ArXiv预印本,abs/2210.09261。

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, New Orleans, Louisiana. Association for Computational Linguistics.

Alon Talmor 和 Jonathan Berant. 2018. 网络作为回答复杂问题的知识库. 见《2018年北美计算语言学协会人类语言技术会议论文集》(长篇论文), 第1卷, 第641–651页, 美国路易斯安那州新奥尔良. 计算语言学协会.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsense QA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Alon Talmor、Jonathan Herzig、Nicholas Lourie 和 Jonathan Berant。2019. Commonsense QA: 一个针对常识知识的问答挑战。载于《2019年北美计算语言学协会人类语言技术会议论文集》第1卷(长文与短文), 第4149–4158页, 明尼苏达州明尼阿波利斯。计算语言学协会。

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems.

Alon Talmor、Oyvind Tafjord、Peter Clark、Yoav Goldberg 和 Jonathan Berant。2020。思维跳跃:让预训练模型系统化推理隐含知识。载于《神经信息处理系统进展》。

Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi. 2022. Reframing human-AI collaboration for generating free-text explanations. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 632–658, Seattle, United States. Association for Computational Linguistics.

Sarah Wiegreffe、Jack Hessel、Swabha Swayamdipta、Mark Riedl 和 Yejin Choi。2022。重构人机协作以生成自由文本解释。载于《2022年北美计算语言学协会人类语言技术会议论文集》,第632-658页,美国西雅图。计算语言学协会。
