DEMONSTRATE–SEARCH–PREDICT: Composing retrieval and language models for knowledge-intensive NLP
Omar Khattab 1 Keshav Santhanam 1 Xiang Lisa Li 1 David Hall 1 Percy Liang 1 Christopher Potts 1 Matei Zaharia 1
Abstract
Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple “retrieve-then-read” pipelines in which the RM retrieves passages that are inserted into the LM prompt. To begin to fully realize the potential of frozen LMs and RMs, we propose DEMONSTRATE–SEARCH–PREDICT (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably. We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37–120%, 8–39%, and 80–290% relative gains against the vanilla LM (GPT-3.5), a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively. We release DSP at https://github.com/stanfordnlp/dsp.
1. Introduction
In-context learning adapts a frozen language model (LM) to tasks by conditioning the LM on a textual prompt including task instructions and a few demonstrating examples (McCann et al., 2018; Radford et al., 2019; Brown et al., 2020). For knowledge-intensive tasks such as question answering, fact checking, and information-seeking dialogue, retrieval models (RM) are increasingly used to augment prompts with relevant information from a large corpus (Lazaridou et al., 2022; Press et al., 2022; Khot et al., 2022).
Figure 1. A comparison between three systems based on GPT-3.5 (text-davinci-002). On its own, the LM often makes false assertions. An increasingly popular retrieve-then-read pipeline fails when simple search can’t find an answer. In contrast, a task-aware DSP program successfully decomposes the problem and produces a correct response. Texts edited for presentation.
Recent work has shown such retrieval-augmented in-context learning to be effective in simple “retrieve-then-read” pipelines: a query is fed to the RM and the retrieved passages become part of a prompt that provides context for the LM to use in its response. In this work, we argue that the fact that both LMs and RMs consume (and generate or retrieve) natural language texts creates an opportunity for much more sophisticated interactions between them. Fully realizing this would be transformative: frozen LMs and RMs could serve as infrastructure across tasks, enabling ML- and domain-experts alike to rapidly build grounded AI systems at a high level of abstraction and with lower deployment overheads and annotation costs.
Figure 1 begins to illustrate the power of retrieval-augmented in-context learning, but also the limitations of “retrieve-then-read” (Lazaridou et al., 2022; Izacard et al., 2022). Our query is “How many storeys are in the castle David Gregory inherited?” When prompted to answer this, GPT-3.5 (text-davinci-002; Ouyang et al. 2022) makes up a fictitious castle with incorrect attributes, highlighting the common observation that knowledge stored in LM parameters is often unreliable (Shuster et al., 2021; Ishii et al., 2022). Introducing an RM component helps, as the LM can ground its responses in retrieved passages, but a rigid retrieve-then-read strategy fails because the RM cannot find passages that directly answer the question.
Figure 2. A toy example of a DSP program for multi-hop question answering. Given an input question and a 2-shot training set, the DEMONSTRATE stage programmatically annotates intermediate transformations on the training examples using a form of weak supervision. Learning from a resulting demonstration, the SEARCH stage decomposes the complex input question and retrieves supporting information over two retrieval hops. Finally, the PREDICT stage uses the demonstration and retrieved passages to answer the question.
We introduce the DEMONSTRATE–SEARCH–PREDICT (DSP) framework for in-context learning, which relies entirely on passing natural language text (and scores) between a frozen RM and LM. DSP introduces a number of composable functions that bootstrap training examples (DEMONSTRATE), gather information from a knowledge corpus (SEARCH), and generate grounded outputs (PREDICT), using them to systematically unify techniques from the retrieval-augmented NLP and the in-context learning literatures (Lee et al., 2019; Khattab et al., 2021a; Anantha et al., 2020; Gao et al., 2022; Izacard et al., 2022; Dohan et al., 2022; Zelikman et al., 2022; Zhang et al., 2022). We use DSP to suggest powerful strategies for knowledge-intensive tasks with compositions of these techniques. This reveals new conceptual possibilities for in-context learning in general (§2), and it allows us to present rich programs that set new state-of-the-art results (§3).
Figure 1 shows the path that a DSP program might take to arrive at an answer, and Figure 2 illustrates how a deliberate program achieves this. Instead of asking the LM to answer this complex question, the program’s SEARCH stage uses the LM to generate a query “Which castle did David Gregory inherit?” The RM retrieves a passage saying Gregory inherited the Kinnairdy Castle. After a second search “hop” finds the castle’s number of storeys, the PREDICT stage queries the LM with these passages to answer the original question. Although this program implements behaviors such as query generation, it requires no hand-labeled examples of these intermediate transformations (i.e., of the queries and passages of both retrieval hops). Instead, the DEMONSTRATE stage uses labeled question–answer pairs to implement a form of weak supervision that programmatically annotates the transformations invoked within SEARCH and PREDICT.
We evaluate several DSP programs on answering questions in open-domain, multi-hop, and conversational settings. In them, we implement novel and reusable transformations such as bootstrapping annotations for all of our pipelines with weak supervision (§2.3), reliably rewriting questions to resolve conversational dependencies and iteratively decompose complex queries with summarization of intermediate hops (§2.4), and generating grounded responses from multiple passages with self-consistency (§2.5). We report preliminary results on Open-SQuAD, HotPotQA, and QReCC using the frozen LM GPT-3.5 and RM ColBERTv2 (Khattab & Zaharia, 2020; Santhanam et al., 2022b) with no fine-tuning. Our DSP programs deliver 37–120%, 8–39%, and 80–290% relative gains against corresponding vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline (Press et al., 2022), respectively. Future versions of this report will include additional test tasks and LM choices.
In summary, this work makes the following contributions. First, we argue that simple task-agnostic pipelines for in-context learning should give way to deliberate, task-aware strategies. Second, we show that this shift need not be a burden: with DSP, such strategies can be easily expressed as short programs using composable operators. Third, this composability spawns powerful capacities, like automatically annotating demonstrations for complex pipelines from end-task labels. Fourth, for three knowledge-intensive tasks, we implement rich programs that establish state-of-the-art results for in-context learning.
2. DEMONSTRATE–SEARCH–PREDICT
We now introduce the DSP framework and show its expressive power by suggesting a number of strategies in which the LM and RM can come together to tackle complex problems effectively. We show in §3 that such strategies outperform existing in-context learning methods. We begin by discussing the LM and RM foundation modules on which DSP is built (§2.1) and then the datatypes and control flow within DSP (§2.2). Subsequently, we discuss each of the three inference stages: DEMONSTRATE (§2.3), SEARCH (§2.4), and PREDICT (§2.5).
2.1. Pretrained Modules: LM and RM
A DSP program defines the communication between the language model LM and the retrieval model RM.
Language Model We invoke a frozen language model LM to conditionally generate (or score) text. For each invocation, the program prepares a prompt that adapts the LM to a specific function (e.g., answering questions or generating queries). A prompt often includes instructions, a few demonstrations of the desired behavior, and an input query to be answered.
As in Figure 2, the LM generates not only: (i) the final answer to the input question (in the PREDICT stage), but also (ii) intermediate “hop” queries to find useful information for the input question (SEARCH) as well as (iii) exemplar queries that illustrate how to produce queries for questions in the training set (DEMONSTRATE). This systematic use of the LM is a hallmark of DSP programs.
Retrieval Model DSP programs also invoke a frozen retrieval model RM to retrieve the top k most “relevant” text sequences for a given query. The RM can index a massive set of pre-defined passages for scalable search, and those passages can be updated without changing the retrieval parameters. The RM accepts free-form textual inputs and specializes in estimating the relevance (or similarity) of a text sequence to a query.
As in Figure 2, the RM is responsible for retrieving (i) passages for each query generated by the LM (during the SEARCH stage), but also (ii) passages that are used within demonstrations (DEMONSTRATE). In the latter case, the RM’s contributions are less about providing directly relevant information to the input question and more about helping the LM adapt to the domain and task.
Though not utilized in this example, the RM is also used in DSP for functions like retrieving “nearest-neighbor” demonstrations from task training data (DEMONSTRATE) and selecting well-grounded generated sequences from the LM (PREDICT).
2.2. Datatypes and Control Flow
We have implemented the DSP framework in Python. The present section introduces the core data types and composable functions provided by the framework. We use illustrative code snippets to ground the examples, and to convey the power that comes from being able to express complex interactions between the LM and RM in simple programs.
The Example Datatype To conduct a task, a DSP program manipulates one or more instances of the Example datatype. An Example behaves like a Python dictionary with multiple fields. The program is typically provided with a few training examples. The code snippet below illustrates this for multi-hop question answering.
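A minimal sketch of such a training set follows; the exact wording of the first question is illustrative, based on the discussion below.

train = [
    # illustrative training pairs; the first question's exact wording is approximate
    Example(question="When was the discoverer of Palomar born?",
            answer="November 20, 1889"),
    Example(question="In which city did Akeem Ellis play in 2017?",
            answer="Ellesmere Port"),
]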
This snippet contains two labeled examples, each with a multi-hop question (e.g., “In which city did Akeem Ellis play in 2017?”) and its short answer (“Ellesmere Port”). Arbitrary keys and values are allowed within an Example, though typical values are strings or lists of strings.
In this task, we are unlikely to find an individual passage that provides the answer to any question. For example, the first training example can probably be resolved only by first answering the question of who discovered Palomar (“Edwin Hubble”) and then addressing the question of Hubble’s birth date using different evidence passages. We typically assume that the human-labeled training data do not include labels for intermediate transformations (e.g., queries for individual hops) that would be useful for following these steps, and so it is the job of the DSP program to discover these strategies via in-context learning.
A DSP Program The following code snippet is a complete program for resolving multi-hop questions like those in Figure 1, with help from train examples like those above.
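For concreteness, a minimal sketch of such a program is shown below; the exact listing is illustrative, and line numbers are included because the surrounding discussion refers to them.

1 def multihop_program(question: str) -> str:
2     x = Example(question=question, train=train)
3     x = multihop_demonstrate(x)
4     x = multihop_search(x)
5     x = multihop_predict(x)
6     return x.answer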
In the rest of this section, we invoke and compose DSP primitives (i.e., built-in functions) to build the DEMONSTRATE, SEARCH, and PREDICT transformations that define the program.
Transformations A transformation is a function that takes an Example as input and returns an Example, populating new fields (or modifying existing fields) in it. This program invokes three developer-defined transformations, namely, multihop_demonstrate, multihop_search, and multihop_predict. Transformations may themselves invoke other transformations, and they act analogously to layers in standard deep neural network (DNN) programming frameworks such as PyTorch, except that they pass text data instead of tensors between each other and do not involve backpropagation.
We categorize transformations according to their behavior (or purpose) under one of the DEMONSTRATE, SEARCH, and PREDICT stages. That said, DSP does not impose this categorization and allows us to define functions that may blend these stages. We will discuss each of the three stages next.
The annotate primitive takes a user-defined transformation fn and applies it over a list of training examples. Whenever fn returns an example (rather than None), annotate caches the intermediate predictions (i.e., the generated queries and retrieved passages). These predictions serve as successful demonstrations for the pipeline transformations. In simple uses, fn may attempt to answer the example “zero-shot” one or more times. This is typically done by invoking the SEARCH and PREDICT stages of the program. When an answer is produced, if fn assesses it as correct, it returns a populated example in which the intermediate predictions are present.
Case Study The snippet below defines the function multihop_demonstrate, called in Line 3 of multihop_program, and illustrates the usage of annotate.
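A minimal sketch consistent with the line references in §2.3 is shown below; helper names such as the exact-match check EM are assumptions.

1 def attempt_example(d: Example) -> Example:
2     # attempt the training example "zero-shot", keeping its gold
3     # answer aside as the label to check against
4     d, label = d.copy(demos=[]), d.answer
5     d = multihop_search(d)
6     d = multihop_predict(d)
7     if EM(d.answer, label):  # assumed exact-match metric
8         return d             # implicitly returns None otherwise
9 def multihop_demonstrate(x: Example) -> Example:
10    x.demos = annotate(attempt_example)(x.train)
11    return x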
2.3. DEMONSTRATE
It is known that including examples of the desired behavior from the LM in its prompt typically leads to better performance (Brown et al., 2020). In DSP, a demonstration is a training example that has been prepared to illustrate specific desired behaviors from the LM. A DEMONSTRATE transformation takes as input x of type Example and prepares a list of demonstrations in x.demos, typically by selecting a subset of the training examples in x.train and bootstrapping new fields in them.
Bootstrapping Demonstrations Examples in the training set typically consist of the input text and the target output of the task. The DEMONSTRATE stage can augment a training example by programmatically bootstrapping annotations for intermediate transformations. In our running “multi-hop” example, the demonstrations illustrate three LM-based transformations: (i) how to break down the input question in order to gather information for answering it (i.e., first-hop retrieval), (ii) how to use information gathered in an earlier “hop” to ask follow-up questions, and (iii) how to use the information gathered to answer complex questions.
In Line 10, multihop_demonstrate invokes annotate, which bootstraps missing fields in training examples by caching annotations from attempt_example. The transformation attempt_example takes a training example d and attempts to answer it in a zero-shot fashion: it creates a copy of d with no demonstrations (Line 4; i.e., zero-shot) and invokes the multi-hop search and predict pipeline (Lines 5 and 6). Each transformation returns an updated version of d with additional fields populated. If the pipeline answers correctly (Line 7), the updated d is returned.
Figure 2 illustrates this behavior. DEMONSTRATE transforms a training question–answer pair to a fully-populated demonstration, including fields such as hop1 and hop2 (i.e., queries for multi-hop search) as well as psg1 and psg2. When the LM is later invoked to conduct a transformation, say, generating a “second-hop” query during SEARCH, the psg1 field serves as context and the hop2 field serves as a label for this particular training example.
Discussion This simple case study illustrates the power of composition in the DSP abstraction. Because the pipeline is a well-defined program in which transformations communicate by passing text attached to Examples, a simple map-and-filter strategy can leverage the LM and RM to bootstrap annotations for a full pipeline from end-task labels. This is an extensible strategy, but even in its simplest form it generalizes the approaches explored recently by Zelikman et al. (2022), Wei et al. (2022), Zhang et al. (2022), and Huang et al. (2022) in which an LM self-generates chain-of-thought rationales for an individual prompt.
By bootstrapping pipelines, DEMONSTRATE makes it easy to explore complex strategies in SEARCH and PREDICT without writing examples for every transformation. This includes strategies that are challenging to explore without custom annotations in traditional retrieval-augmented NLP. For instance, Khattab et al. (2021a) introduces a pipeline for multi-hop reasoning that is trained with weak supervision, extending work by Lee et al. (2019) and Khattab et al. (2021b). In it, the target 3 or 4 passages that need to be retrieved must be labeled, but the system discovers the best order of “hops” automatically.
In contrast, DSP allows us to build complex pipelines without labels for intermediate steps, because we can compose programs out of small transformations. If LM and RM can accurately process such transformations “zero-shot” (i.e., without demonstrations) on at least one or two examples, these examples can be discovered with end-task labels and used as demonstrations.
To draw on our earlier analogy with DNN frameworks like PyTorch, DEMONSTRATE aims to replace the function of backpropagation in extensible ways by simulating the behavior of the program (corresponding to a “forward” pass) and programmatically learning from errors. In doing this with frozen models and with only end-task labels, DEMONSTRATE introduces a high degree of modularity. In particular, without hand-labeling intermediate transformations, developers may swap the training domain, update the training examples, or modify the program’s strategy, and use annotate to automatically populate all of the intermediate fields for demonstrations.
Selecting Demonstrations It is not always possible to fit all of the training examples in the context window of the LM. DSP provides three primitives for selecting a subset of training examples, namely, sample, knn, and crossval.
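For instance, with a training set of 100 examples, the third of these primitives might be invoked along the following lines (the exact signature shown is an assumption).

x.demos = crossval(evaluate, x.train, n=n, k=5)  # hypothetical signature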
Such a call would select n subsets of k=5 examples each, and return the set with which a transformation evaluate performs best on the remaining 95 examples.
Compositions & Extensions By manipulating demonstrations and higher-order transformations, these simple selection and bootstrapping primitives can be combined to conduct larger novel strategies. If the training set is very large (e.g., |train|=100,000), we can conduct knn to find the nearest k=16 examples and only annotate these, arriving at a system that learns incrementally in real-time. If the training set is moderately large (e.g., |train|=1000), we can conduct crossval and cache the performance of all prompts it evaluates on each training example. At test time, we can use knn to find k=50 similar examples to the test input and select the prompt that performs best on these k examples, producing an adaptive system that is informed by the quality of its pipeline on different types of examples.
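A rough sketch of the first of these strategies, reusing the bootstrapping function from the earlier case study (the function name and the knn signature are assumptions):

def incremental_demonstrate(x: Example, k: int = 16) -> Example:
    # with a very large training set, annotate only the k training
    # examples nearest to the test input, discovered at query time
    nearest = knn(x, x.train, k=k)
    x.demos = annotate(attempt_example)(nearest)
    return x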
2.4. SEARCH
The SEARCH stage gathers passages to support transformations conducted by the LM. We assume a large knowledge corpus—e.g., a snippet of Web, Wikipedia, or arXiv—that is divided into text passages. Providing passages to the LM facilitates factual responses, enables updating the knowledge store without retraining, and presents a transparency contract: when in doubt, users can check whether the system has faithfully used a reliable source in making a prediction.
In the simplest scenarios, SEARCH can directly query the RM, requesting the top k passages (from a pre-defined index) that match an input question. This baseline instantiation of SEARCH simulates retrieval in most open-domain question answering systems, which implement a “retrieve-then-read” pipeline, like Lee et al. (2019), Khattab et al. (2021b), Lazaridou et al. (2022), and many others.
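A minimal sketch of this baseline, in which retrieve stands for the primitive that queries the frozen RM (the function name is illustrative):

def retrieve_then_read_search(x: Example, k: int = 5) -> Example:
    # query the RM directly with the input question
    x.context = retrieve(x.question, k=k)
    return x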
More sophisticated pipelines for multi-hop retrieval and conversational search (Del Tredici et al., 2021; Raposo et al., 2022) have received much attention. These systems are typically fine-tuned with many hand-labeled query “rewrites” (Anantha et al., 2020), “decompositions” (Geva et al., 2021; Min et al., 2019), or target hops (Yang et al., 2018; Jiang et al., 2020). Supported with automatic annotations from DEMONSTRATE, the SEARCH stage allows us to simulate many such strategies and many others in terms of passing queries, passages, and demonstrations between the RM and LM. More importantly, SEARCH facilitates our vision of advanced strategies in which the LM and RM cooperate to incrementally plan a research path for which the RM gathers information and the LM identifies next steps.
Case Study Let us build on our running multi-hop example as a case study. We can define multihop_search_v2 (Line 4 in our core program), a slightly more advanced version of the SEARCH transformation from Figure 2. This transformation simulates the iterative retrieval component of fine-tuned retrieval-augmented systems like IRRR (Qi et al., 2020), which reads a retrieved passage in every hop and generates a search query (or a termination condition to stop hopping), and Baleen (Khattab et al., 2021a), which summarizes the information from many passages in each hop for inclusion in subsequent hops.
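A sketch of this transformation is shown below. Line numbers are included because the discussion that follows refers to them; hop_template, retrieve, and the tuple-unpacking convention shown for generate are assumptions rather than the exact interface.

1  def multihop_search_v2(x: Example, max_hops=2, k=5) -> Example:
2      x.background = []   # summaries accumulated across hops
3      x.context = []
4      for hop in range(max_hops):
5          # ask the LM for (i) a summary of the context so far and
6          # (ii) a search query for the next hop, via the hop_template
7          summary, query = generate(hop_template)(x)
8          x.background = x.background + [summary]
9          # the LM may signal that it has gathered enough information
10         if query == "N/A": break
11         # retrieve k passages and set the context for the next hop
12         passages = retrieve(query, k=k)
13         x.context = x.background + passages
14     return x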
In multihop_search_v2, Line 7 calls the generate primitive, which invokes the LM to produce a query for each retrieval hop. The LM is conditioned on a prompt that is prepared using the hop_template template. (We discuss prompt templates and the generate primitive in §2.5.) Here, this template may be designed to generate a prompt that has the following format (e.g., for the second hop).
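An illustrative format (the instructions and field descriptions are indicative only; demonstrations prepared by DEMONSTRATE would precede these fields):

Context: ${summaries of earlier hops, followed by the passages retrieved so far}
Question: ${the complex question being answered}
Summary: ${a short summary of what the context reveals about the question}
Search Query: ${a simple follow-up query that seeks the missing information}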
As shown, the LM is instructed to read the context retrieved in earlier hops and a complex question. It is then prompted to write: (i) a summary of the supplied context and (ii) a search query that gathers information for answering that question. The generated text will be extracted and assigned to the summary and query variables in multihop_search_v2 (Line 7). On Line 10, we terminate the hops if the query is “N/A”. Otherwise, Line 12 retrieves k=5 passages using the query and Line 13 assigns the context for the subsequent hop (or for PREDICT), setting that to include the summary of all previous hops as well as the passages retrieved in the final hop so far.
Comparison with self-ask It may be instructive to contrast this multi-hop DSP program with the recent “self-ask” (Press et al., 2022) prompting technique, which we compare against in §3. Self-ask can be thought of as a simple instantiation of DSP’s SEARCH stage. In it, the LM asks one or more “follow-up questions”, which are intercepted and sent to a search engine. The search engine’s answers are concatenated into the prompt and are used to answer the question. This is essentially a simplified simulation of IRRR (Qi et al., 2020).
As a general framework, DSP can express ideas like self-ask and many other, more sophisticated pipelines as we discuss in the present section. More importantly, DSP offers a number of intrinsic advantages that lead to large empirical gains: 80–290% over self-ask. For instance, DSP programs are deeply modular, which among other things means that DSP programs will annotate and construct their own demonstrations. Thus, they can be developed without labeling any of the intermediate transformations (e.g., the queries generated). In addition, the LM prompts constructed by DSP get automatically updated to align with the training data and retrieval corpus provided. In contrast, approaches like self-ask rely on a hand-written prompt with hard-coded examples.
Moreover, DSP assigns the control flow to an explicit program and facilitates design patterns that invoke the LM (or RM) to conduct small transformations. This allows us to build steps that are dedicated to generating one or more retrieval queries, summarizing multiple passages per hop, and answering questions. These steps are individually simpler than the self-ask prompt, yet our multi-hop DSP program deliberately composes them to build richer pipelines that are thus more reliable. In contrast, self-ask delegates the control flow to the LM completions, maintaining state within the prompt itself and intercepting follow-up questions to conduct search. We find that this paradigm leads to a “self-distraction” problem (§3.5) that DSP programs avoid.
Fusing Retrieval Results For improved recall and robustness, we can also fuse the retrieval across multiple generated queries. Fusion has a long history in information retrieval (Fox & Shaw, 1994; Xue & Croft, 2013; Kurland & Culpepper, 2018) and sequentially processing multiple queries was explored recently by Gao et al. (2022) for retroactively attributing text generated by LMs to citations. Inspired by these, we include a fused retrieval primitive to DSP to offer a versatile mechanism for interacting with frozen retrievers. It accepts an optional fusion function that maps multiple retrieval lists into one. By default, DSP uses a variant of CombSUM (Fox & Shaw, 1994), assigning each passage the sum of its probabilities across retrieval lists.
To illustrate, the modification below generates n=10 queries for the transformation multihop_search_v2.
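A sketch of this modification within each hop of multihop_search_v2 (the n= keyword of generate, the per-completion query field, and the fused_retrieval helper are assumptions):

# sample several queries for the hop and fuse their retrieved lists
completions = generate(hop_template, n=10)(x)   # n=10 sampled completions
queries = [c.query for c in completions]
passages = fused_retrieval(queries, k=5)        # CombSUM-style fusion by default
x.context = x.background + passages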
Compositions & Extensions To illustrate a simple composition, we can equip a chatbot with the capacity for conversational multi-hop search by combining a query rewriting step, which produces a query that encompasses all of the relevant conversational context, with the multi-hop transformation, as follows.
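A sketch of this composition (rewrite_template and the field it extracts are assumptions):

def conversational_multihop_search(x: Example) -> Example:
    # rewrite the latest user turn into a self-contained query that
    # resolves references to the conversation history, then search
    x.question = generate(rewrite_template)(x).query
    return multihop_search_v2(x)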
Similar approaches can be used for correcting spelling mistakes or implementing pseudo-relevance feedback (Cao et al., 2008; Wang et al., 2022a), in which retrieved passages are used to inform a better search query, though this has not been attempted with pretrained LMs to our knowledge.
2.5. PREDICT
The PREDICT stage generates the system output using demonstrations (e.g., in x.demos) and passages (e.g., in x.context). PREDICT tackles the challenges of reliably solving the downstream task, which integrates much of the work on in-context learning in general. Within DSP, it also has the more specialized function of systematically aggregating information across a large number of demonstrations, passages, and candidate predictions.
Generating Candidates Generally, PREDICT has to produce one or more candidate predictions for the end-task. To this end, the basic primitive in PREDICT is generate, which accepts a Template and (via currying) an Example and queries the LM to produce one or more completions, as explored earlier in §2.4. A corresponding primitive that uses the RM in this stage is rank, which accepts a query and one or more passages and returns their relevance scores.
A Template is an object that can produce prompts, that is, map an Example to a string, and extract fields out of completions. For instance, we can map an example x that has a question and retrieved passages to the following prompt:
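For instance, the resulting prompt might lay out the following fields (the wording is illustrative; task instructions and demonstrations would precede them):

Context: ${the retrieved passages}
Question: ${the question to be answered}
Rationale: Let's think step by step. ${a step-by-step deduction leading to the answer}
Answer: ${a short factoid answer}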
As this illustrates, the LM will be asked to generate a chain-of-thought rationale (CoT; Wei et al. 2022; Kojima et al. 2022) and an answer, and the generated text will be extracted back into the rationale and answer keys of each completion.
Each invocation to the LM can sample multiple candidate predictions. Selecting a “best” prediction is the subject of much work on decoding (Wiher et al., 2022; Li et al., 2022), but a frozen and general-purpose LM may not support custom modifications to decoding. Within these constraints, we present several high-level strategies for selecting predictions and aggregating information in DSP via the LM and RM.
Selecting Predictions Among multiple candidates, we can simply extract the most popular prediction. When a CoT is used to arrive at the answer, this is the self-consistency method of Wang et al. (2022c), which seeks to identify predictions at which multiple distinct rationales arrive.
1 from dsp import branch
2
3 def pipeline(x):
4     return multihop_predict(multihop_search_v2(x))
5
6 def PoT_program(question: str) -> str:
7     x = Example(question=question, train=train)
8     x = multihop_demonstrate(x)
9
10    candidates = branch(pipeline, n=5, t=0.7)(x)
11    return x.copy(answer=majority(candidates).answer)
In the snippet above, Line 10 invokes the primitive branch which samples n different PoTs with a high temperature (e.g., t = 0.7) and accumulates their intermediate and final predictions. In this example, our pipeline invokes multihop_search_v2 (§2.4), which applies a variable number of retrieval hops depending on the questions generated, before doing PREDICT. That is, PoT_program potentially invokes multiple distinct paths in the program (i.e., with different multi-hop queries and number of hops in each) across branches. It then selects the majority answer overall.
DSP generalizes self-consistency in a second way. When sampling our CoTs or PoTs provides multiple candidates, we can select the top k (e.g., top-4) predictions and then compare them directly. For instance, we may prompt the LM to compare these choices as MCQ candidates, a transformation for which DEMONSTRATE can automatically prepare exemplars. This effectively simulates the LM recursion of Levine et al. (2022), though unlike their approach it does not require a large training set or updating any (prompt-tuning) weights. One such implementation is illustrated in openqa_predict below.
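A rough sketch of such an implementation (qa_template, mcq_template, and the most_common helper are assumptions):

def openqa_predict(x: Example) -> Example:
    # sample several chain-of-thought completions for the question
    completions = generate(qa_template, n=20, t=0.7)(x)
    # keep the most frequent answers (e.g., top-4) as candidate choices
    x.choices = most_common([c.answer for c in completions], k=4)
    # ask the LM to compare the candidates as multiple-choice options;
    # DEMONSTRATE can bootstrap demonstrations for this comparison step
    x.answer = generate(mcq_template)(x).answer
    return x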
To deal with a larger number of demonstrations or passages, we can branch in parallel to process individual subsets of the passages or demonstrations and then aggregate the individual answers using one of the scoring methods presented earlier. Indeed, Lewis et al. (2020) and Lazaridou et al. (2022) have explored marginalization as a way to combine scores across passages and Le et al. (2022) ensemble prompts across demonstrations, which can be expressed in this way.
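As a sketch of the parallel variant (qa_template and majority as assumed above; the subset size is arbitrary):

def predict_over_subsets(x: Example, size: int = 5) -> Example:
    # answer over each subset of the passages in a separate branch,
    # then aggregate the candidate answers by majority vote
    subsets = [x.context[i:i + size] for i in range(0, len(x.context), size)]
    candidates = [generate(qa_template)(x.copy(context=s)) for s in subsets]
    return x.copy(answer=majority(candidates).answer)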
An alternative aggregation strategy is to accumulate information across passages sequentially, rather than independently. This is effectively how our multi-hop approach works (§2.4). Such a strategy has also been employed recently by Gao et al. (2022) for retroactively attributing text generated by LMs to citations. They generate many queries but instead of fusion (§2.4) , they run their pipeline on each query and use its outputs to alter the input to subsequent queries.1
3. Evaluation
We now consider how to implement DSP programs for three diverse knowledge-intensive NLP tasks: open-domain question answering (QA), multi-hop QA, and conversational QA. All of these tasks are “open-domain”, in the sense that systems are given a short question or participate in a multi-turn conversation without being granted access to context that answers these questions.
We build and evaluate intuitive compositions of the functions explored in §2 for each task. We show that, despite low development effort, the resulting DSP programs exhibit strong quality and deliver considerable empirical gains over vanilla in-context learning and a standard retrieve-then-read pipeline with in-context learning.