Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been investigated only for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG): models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open-domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

1 Introduction

Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data [47]. They can do so without any access to an external memory, as a parameterized implicit knowledge base [51, 52]. While this development is exciting, such models do have downsides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce “hallucinations” [38]. Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories [20, 26, 48] can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted. REALM [20] and ORQA [31], two recently introduced models that combine masked language models [8] with a differentiable retriever, have shown promising results, but have only explored open-domain extractive question answering. Here, we bring hybrid parametric and non-parametric memory to the “workhorse of NLP,” i.e. sequence-to-sequence (seq2seq) models.

Figure 1: Overview of our approach. We combine a pre-trained retriever (Query Encoder + Document Index) with a pre-trained seq2seq model (Generator) and fine-tune end-to-end. For query $x$, we use Maximum Inner Product Search (MIPS) to find the top-K documents $z_{i}$. For final prediction $y$, we treat $z$ as a latent variable and marginalize over seq2seq predictions given different documents.

We endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach which we refer to as retrieval-augmented generation (RAG). We build RAG models where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We combine these components in a probabilistic model trained end-to-end (Fig. 1). The retriever (Dense Passage Retriever [26], henceforth DPR) provides latent documents conditioned on the input, and the seq2seq model (BART [32]) then conditions on these latent documents together with the input to generate the output. We marginalize the latent documents with a top-K approximation, either on a per-output basis (assuming the same document is responsible for all tokens) or a per-token basis (where different documents are responsible for different tokens). Like T5 [51] or BART, RAG can be fine-tuned on any seq2seq task, whereby both the generator and retriever are jointly learned.

There has been extensive previous work proposing architectures to enrich systems with non-parametric memory which are trained from scratch for specific tasks, e.g. memory networks [64, 55], stack-augmented networks [25] and memory layers [30]. In contrast, we explore a setting where both parametric and non-parametric memory components are pre-trained and pre-loaded with extensive knowledge. Crucially, by using pre-trained access mechanisms, the ability to access knowledge is present without additional training.

Our results highlight the benefits of combining parametric and non-parametric memory with generation for knowledge-intensive tasks, that is, tasks that humans could not reasonably be expected to perform without access to an external knowledge source. Our RAG models achieve state-of-the-art results on open Natural Questions [29], Web Questions [3] and CuratedTrec [2] and strongly outperform recent approaches that use specialised pre-training objectives on TriviaQA [24]. Despite these being extractive tasks, we find that unconstrained generation outperforms previous extractive approaches. For knowledge-intensive generation, we experiment with MS-MARCO [1] and Jeopardy question generation, and we find that our models generate responses that are more factual, specific, and diverse than a BART baseline. For FEVER [56] fact verification, we achieve results within 4.3% of state-of-the-art pipeline models which use strong retrieval supervision. Finally, we demonstrate that the non-parametric memory can be replaced to update the models’ knowledge as the world changes.

2 Methods

We explore RAG models, which use the input sequence $x$ to retrieve text documents $z$ and use them as additional context when generating the target sequence $y$. As shown in Figure 1, our models leverage two components: (i) a retriever $p_{\eta}(z|x)$ with parameters $\eta$ that returns (top-K truncated) distributions over text passages given a query $x$ and (ii) a generator $p_{\theta}(y_{i}|x,z,y_{1:i-1})$ parametrized by $\theta$ that generates a current token based on a context of the previous $i-1$ tokens $y_{1:i-1}$, the original input $x$ and a retrieved passage $z$.

To train the retriever and generator end-to-end, we treat the retrieved document as a latent variable. We propose two models that marginalize over the latent documents in different ways to produce a distribution over generated text. In one approach, RAG-Sequence, the model uses the same document to predict each target token. The second approach, RAG-Token, can predict each target token based on a different document. In the following, we formally introduce both models and then describe the $p_{\eta}$ and $p_{\theta}$ components, as well as the training and decoding procedure.

2.1 Models

RAG-Sequence Model. The RAG-Sequence model uses the same retrieved document to generate the complete sequence. Technically, it treats the retrieved document as a single latent variable that is marginalized to get the seq2seq probability $p(y|x)$ via a top-K approximation. Concretely, the top-K documents are retrieved using the retriever, and the generator produces the output sequence probability for each document; these probabilities are then marginalized:

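$$p_{\text{RAG-Sequence}}(y|x)\approx\sum_{z\in\text{top-}k(p(\cdot|x))}p_{\eta}(z|x)\,p_{\theta}(y|x,z)=\sum_{z\in\text{top-}k(p(\cdot|x))}p_{\eta}(z|x)\prod_{i}^{N}p_{\theta}(y_{i}|x,z,y_{1:i-1})$$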

RAG-Token Model. In the RAG-Token model we can draw a different latent document for each target token and marginalize accordingly. This allows the generator to choose content from several documents when producing an answer. Concretely, the top-K documents are retrieved using the retriever, and then the generator produces a distribution for the next output token for each document, before marginalizing, and repeating the process with the following output token. Formally, we define:

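$$p_{\text{RAG-Token}}(y|x)\approx\prod_{i}^{N}\sum_{z\in\text{top-}k(p(\cdot|x))}p_{\eta}(z|x)\,p_{\theta}(y_{i}|x,z,y_{1:i-1})$$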

Finally, we note that RAG can be used for sequence classification tasks by considering the target class as a target sequence of length one, in which case RAG-Sequence and RAG-Token are equivalent.
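
The two marginalizations in this section differ only in whether the sum over documents sits outside or inside the product over tokens. The following is a minimal sketch of that difference, assuming the retriever priors and the per-token generator probabilities are already available as plain Python lists; the function names and the toy numbers are illustrative and do not come from the paper.

```python
import math

def rag_sequence_prob(doc_priors, token_probs):
    """RAG-Sequence: sum over documents of p(z|x) * prod_i p(y_i | x, z, y_{1:i-1}).

    doc_priors:  K retriever probabilities p(z|x), one per top-K document.
    token_probs: token_probs[k][i] is the generator probability of the i-th
                 target token given document k (toy values, assumed given).
    """
    return sum(
        prior * math.prod(per_doc)
        for prior, per_doc in zip(doc_priors, token_probs)
    )

def rag_token_prob(doc_priors, token_probs):
    """RAG-Token: prod over tokens of sum_z p(z|x) * p(y_i | x, z, y_{1:i-1})."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(prior * per_doc[i] for prior, per_doc in zip(doc_priors, token_probs))
        for i in range(n_tokens)
    )

# Two documents, two target tokens: each document explains a different token well.
doc_priors = [0.6, 0.4]
token_probs = [[0.9, 0.1], [0.2, 0.8]]
seq = rag_sequence_prob(doc_priors, token_probs)  # 0.6*0.09 + 0.4*0.16
tok = rag_token_prob(doc_priors, token_probs)     # 0.62 * 0.38

# For a length-one target (e.g. a class label) the two formulations coincide.
one_seq = rag_sequence_prob(doc_priors, [[0.9], [0.2]])
one_tok = rag_token_prob(doc_priors, [[0.9], [0.2]])
```

In this toy case RAG-Token assigns the target a higher probability because it can draw each token from the document that supports it best, which is the behaviour the RAG-Token description points to.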

2.2 Retriever: DPR

The retrieval component $p_{\eta}(z|x)$ is based on DPR [26]. DPR follows a bi-encoder architecture:

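$$p_{\eta}(z|x)\propto\exp\left(\mathbf{d}(z)^{\top}\mathbf{q}(x)\right),\qquad\mathbf{d}(z)=\mathrm{BERT}_{d}(z),\quad\mathbf{q}(x)=\mathrm{BERT}_{q}(x)$$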

where $\mathbf{d}(z)$ is a dense representation of a document produced by a BERT$_{\text{BASE}}$ document encoder [8], and $\mathbf{q}(x)$ a query representation produced by a query encoder, also based on BERT$_{\text{BASE}}$. Calculating top-k$(p_{\eta}(\cdot|x))$, the list of $k$ documents $z$ with highest prior probability $p_{\eta}(z|x)$, is a Maximum Inner Product Search (MIPS) problem, which can be approximately solved in sub-linear time [23]. We use a pre-trained bi-encoder from DPR to initialize our retriever and to build the document index. This retriever was trained to retrieve documents which contain answers to TriviaQA [24] questions and Natural Questions [29]. We refer to the document index as the non-parametric memory.
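
As an illustration of the retrieval step, the sketch below scores every document by its inner product with the query and keeps the $k$ best. It is a brute-force (exact, linear-time) stand-in for the approximate sub-linear search mentioned above, and it assumes the embeddings are already computed. Renormalizing the prior with a softmax over only the retained documents is an assumption made for this example; `retrieve_top_k` is a hypothetical name.

```python
import math

def inner_product(u, v):
    """Dot product d(z)^T q(x) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, doc_vecs, k):
    """Exact MIPS by exhaustive scoring.

    Returns [(doc_index, prior), ...] for the k highest-scoring documents,
    where prior is a softmax over the k retained inner products
    (an assumed top-k truncation of p(z|x)).
    """
    scores = [inner_product(d, query_vec) for d in doc_vecs]
    top = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)
    return [(i, w / total) for i, w in zip(top, weights)]

# Toy 2-dimensional embeddings; real DPR vectors would come from the two encoders.
query = [1.0, 0.0]
docs = [[0.5, 0.5], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
top2 = retrieve_top_k(query, docs, k=2)
```

Here the inner products are 0.5, 2.0, 1.0 and -1.0, so documents 1 and 2 are returned in that order, and their priors sum to one.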

2.3 Generator: BART

The generator component $p_{\theta}(y_{i}|x,z,y_{1:i-1})$ could be modelled using any encoder-decoder. We use BART-large [32], a pre-trained seq2seq transformer [58] with 400M parameters. To combine the input $x$ with the retrieved content $z$ when generating from BART, we simply concatenate them. BART was pre-trained using a denoising objective and a variety of different noising functions. It has obtained state-of-the-art results on a diverse set of generation tasks and outperforms comparably-sized T5 models [32]. We refer to the BART generator parameters $\theta$ as the parametric memory henceforth.

2.4 Training

We jointly train the retriever and generator components without any direct supervision on what document should be retrieved. Given a fine-tuning training corpus of input/output pairs $(x_{j},y_{j})$, we minimize the negative marginal log-likelihood of each target, $\sum_{j}-\log p(y_{j}|x_{j})$, using stochastic gradient descent with Adam [28]. Updating the document encoder $\mathrm{BERT}_{d}$ during training is costly as it requires the document index to be periodically updated as REALM does during pre-training [20]. We do not find this step necessary for strong performance, and keep the document encoder (and index) fixed, only fine-tuning the query encoder $\mathrm{BERT}_{q}$ and the BART generator.
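
To make the objective concrete, the sketch below evaluates the negative marginal log-likelihood for a single training pair under the RAG-Sequence form, working in log space with a log-sum-exp for numerical stability. It only computes the loss value; gradients and the optimizer are out of scope, and the function name and inputs are assumptions made for illustration.

```python
import math

def neg_marginal_log_likelihood(doc_log_priors, seq_log_likelihoods):
    """Return -log sum_z p(z|x) * p(y|x,z) for one (x, y) pair.

    doc_log_priors:      log p(z|x) for each of the top-K documents.
    seq_log_likelihoods: log p(y|x,z) for the same documents, i.e. the sum
                         of per-token generator log-probabilities.
    """
    terms = [lp + ll for lp, ll in zip(doc_log_priors, seq_log_likelihoods)]
    m = max(terms)  # log-sum-exp trick: factor out the largest term
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))

# Same toy numbers as a two-document, two-token example:
# p(z|x) = (0.6, 0.4) and p(y|x,z) = (0.09, 0.16).
loss = neg_marginal_log_likelihood(
    [math.log(0.6), math.log(0.4)],
    [math.log(0.09), math.log(0.16)],
)
```

The loss equals $-\log(0.6\cdot0.09+0.4\cdot0.16)=-\log 0.118$. Because the document encoder and index are kept fixed, only the query side of the prior and the generator would receive updates from this quantity.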

2.5 Decoding

At test time, RAG-Sequence and RAG-Token require different ways to approximate $\arg\max_{y}p(y|x)$.

RAG-Token. The RAG-Token model can be seen as a standard, autoregressive seq2seq generator with transition probability: $p_{\theta}^{\prime}(y_{i}|x,y_{1:i-1})=\sum_{z\in\text{top-}k(p(\cdot|x))}p_{\eta}(z|x)\,p_{\theta}(y_{i}|x,z,y_{1:i-1})$. To decode, we can plug $p_{\theta}^{\prime}(y_{i}|x,y_{1:i-1})$ into a standard beam decoder.
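
The transition probability above is a prior-weighted mixture of the per-document next-token distributions. A minimal sketch of a single decoding step follows, assuming each distribution is a dictionary over the same small vocabulary; `rag_token_step` is a hypothetical helper, and the beam decoder that would consume its output is not shown.

```python
def rag_token_step(doc_priors, next_token_dists):
    """Mix per-document next-token distributions into p'(y_i | x, y_{1:i-1}).

    doc_priors:       p(z|x) for each top-K document.
    next_token_dists: one dict per document mapping token -> p(y_i | x, z, y_{1:i-1});
                      all dicts are assumed to share the same keys.
    """
    vocab = next_token_dists[0].keys()
    return {
        tok: sum(p * dist[tok] for p, dist in zip(doc_priors, next_token_dists))
        for tok in vocab
    }

mixed = rag_token_step(
    [0.6, 0.4],
    [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}],
)
```

The mixed distribution is itself a proper distribution over the vocabulary, which is why it can be handed to an ordinary beam search without further changes.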

RAG-Sequence. For RAG-Sequence, the likelihood $p(y|x)$ does not break into a conventional per-token likelihood, hence we cannot solve it with a single beam search. Instead, we run beam search for each document $z$, scoring each hypothesis using $p_{\theta}(y_{i}|x,z,y_{1:i-1})$. This yields a set of hypotheses $Y$, some of which may not have appeared in the beams of all documents. To estimate the probability of a hypothesis $y$, we run an additional forward pass for each document $z$ for which $y$ does not appear in the beam, multiply the generator probability by $p_{\eta}(z|x)$, and then sum the probabilities across beams for the marginals. We refer to this decoding procedure as “Thorough Decoding.” For longer output sequences, $|Y|$ can become large, requiring many forward passes. For more efficient decoding, we can make a further approximation that $p_{\theta}(y|x,z_{i})\approx0$ where $y$ was not generated during beam search from $x,z_{i}$. This avoids the need to run additional forward passes once the candidate set $Y$ has been generated. We refer to this decoding procedure as “Fast Decoding.”
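
The difference between the two procedures can be sketched as a single scoring function with a switch. The code below assumes the per-document beam searches have already run and returned a probability for each hypothesis they produced; `score_hypotheses` and the `rescore` callback (standing in for the extra forward pass) are hypothetical names introduced for this example.

```python
def score_hypotheses(doc_priors, beams, rescore, thorough=True):
    """Estimate p(y|x) for every hypothesis found in any per-document beam.

    doc_priors: p(z|x) for each top-K document.
    beams:      beams[k] maps hypothesis -> p(y | x, z_k) from beam search on document k.
    rescore:    callable (k, y) -> p(y | x, z_k); models one additional forward pass.
    thorough:   True for Thorough Decoding, False for Fast Decoding.
    """
    candidates = set().union(*beams)  # the hypothesis set Y
    scores = {}
    for y in candidates:
        total = 0.0
        for k, (prior, beam) in enumerate(zip(doc_priors, beams)):
            if y in beam:
                gen_prob = beam[y]
            elif thorough:
                gen_prob = rescore(k, y)  # extra forward pass for a missing hypothesis
            else:
                gen_prob = 0.0            # Fast Decoding: treat the missing term as 0
            total += prior * gen_prob
        scores[y] = total
    return scores

# Toy setup: each document's beam found a different answer.
priors = [0.6, 0.4]
beams = [{"Paris": 0.7}, {"Lyon": 0.5}]
missing = {(0, "Lyon"): 0.1, (1, "Paris"): 0.3}  # what a forward pass would return

thorough_scores = score_hypotheses(priors, beams, lambda k, y: missing[(k, y)])
fast_scores = score_hypotheses(priors, beams, lambda k, y: missing[(k, y)], thorough=False)
```

With these numbers, Thorough Decoding scores "Paris" at 0.6·0.7 + 0.4·0.3 and Fast Decoding at 0.6·0.7 only, so the fast variant underestimates every hypothesis that is missing from some beam but needs no extra forward passes.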

3 Experiments

We experiment with RAG in a wide range of knowledge-intensive tasks. For all experiments, we use a single Wikipedia dump for our non-parametric knowledge source. Following Lee et al. [31] and Karpukhin et al. [26], we use the December 2018 dump. Each Wikipedia article is split into disjoint 100-word chunks, to make a total of 21M documents. We use the document encoder to compute an embedding for each document, and build a single MIPS index using FAISS [23] with a Hierarchical Navigable Small World approximation for fast retrieval [37]. During training, we retrieve the top $k$ documents for each query. We consider $k\in\{5,10\}$ for training and set $k$ for test time using dev data. We now discuss experimental details for each task.
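
The chunking step can be sketched in a few lines. The version below assumes that "word" means a whitespace-separated token and that the final chunk of an article may be shorter than 100 words; neither detail is stated in the text, and `split_into_chunks` is a hypothetical helper. Embedding the chunks and building the FAISS index are not shown.

```python
def split_into_chunks(text, chunk_words=100):
    """Split an article into disjoint chunks of at most chunk_words words each."""
    words = text.split()  # whitespace tokenization (an assumption)
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]

# A synthetic 250-word "article" yields chunks of 100, 100 and 50 words.
article = " ".join(f"w{i}" for i in range(250))
chunks = split_into_chunks(article)
chunk_lengths = [len(c.split()) for c in chunks]
```

Because the chunks are disjoint, joining them with single spaces recovers the original word sequence, so no text is duplicated or lost across documents in the index.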

3.1 Open-domain Question Answering

Open-domain question answering (QA) is an important real-world application and common testbed for knowledge-intensive tasks [20]. We treat questions and answers as input-output text pairs $(x,y)$ and train RAG by directly minimizing the negative log-likelihood of answers. We compare RAG to the popular extractive QA paradigm [5, 7, 31, 26], where answers are extracted spans from retrieved documents, relying primarily on non-parametric knowledge. We also compare to “Closed-Book QA” approaches [52], which, like RAG, generate answers, but which do not exploit retrieval, instead relying purely on parametric knowledge. We consider four popular open-domain QA datasets: Natural Questions (NQ) [29], TriviaQA (TQA) [24], Web Questions (WQ) [3] and CuratedTrec (CT) [2]. As CT and WQ are small, we follow DPR [26] by initializing CT and WQ models with our NQ RAG model. We use the same train/dev/test splits as prior work [31, 26] and report Exact Match (EM) scores. For TQA, to compare with T5 [52], we also evaluate on the TQA Wiki test set.
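
For readers unfamiliar with the metric, the following sketch shows one common way to compute Exact Match: a prediction scores 1 if, after normalization, it equals any reference answer. The normalization used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) follows the widely used SQuAD-style convention and is an assumption; the text does not specify which normalization was applied.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """Return 1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))

em_hit = exact_match("The Eiffel Tower.", ["eiffel tower"])
em_miss = exact_match("Paris", ["London", "Berlin"])
```

A dataset-level EM score would then be the mean of this value over all test questions.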

3.2 Abstractive Question Answering

RAG models can go beyond simple extractive QA and answer questions with free-form, abstractive text generation. To test RAG’s natural language generation (NLG) in a knowledge-intensive setting, we use the MSMARCO NLG task v2.1 [43]. The task consists of questions, ten gold passages retrieved from a search engine for each question, and a full sentence answer annotated from the retrieved passages. We do not use the supplied passages, only the questions and answers, to treat MSMARCO as an open-domain abstractive QA task. MSMARCO has some questions that cannot be answered in a way that matches the reference answer without access to the gold passages, such as “What is the weather in Volcano, CA?”, so performance will be lower without using gold passages. We also note that some MSMARCO questions cannot be answered using Wikipedia alone. Here, RAG can rely on parametric knowledge to generate reasonable responses.

3.3 Jeopardy Question Generation

To evaluate RAG’s generation abilities in a non-QA setting, we study open-domain question generation. Rather than use questions from standard open-domain QA tasks, which typically consist of short, simple questions, we propose the more demanding task of generating Jeopardy questions. Jeopardy is an unusual format that consists of trying to guess an entity from a fact about that entity. For example, “The World Cup” is the answer to the question “In 1986 Mexico scored as the first country to host this international sports competition twice.” As Jeopardy questions are precise, factual statements, generating Jeopardy questions conditioned on their answer entities constitutes a challenging knowledge-intensive generation task.

We use the splits from SearchQA [10], with 100K train, 14K dev, and 27K test examples. As this is a new task, we train a BART model for comparison. Following [67], we evaluate using the SQuAD-tuned Q-BLEU-1 metric [42]. Q-BLEU is a variant of BLEU with a higher weight for matching entities and has higher correlation with human judgment for question generation than standard metrics. We also perform two human evaluations, one to assess generation factuality, and one for specificity. We define factuality as whether a statement can be corroborated by trusted external sources, and specificity as high mutual dependence between the input and output [33]. We follow best practice and use pairwise comparative evaluation [34]. Evaluators are shown an answer and two generated questions, one from BART and one from RAG. They are then asked to pick one of four options: question A is better, question B is better, both are good, or neither is good.

3.4 Fact Verification

FEVER [56] requires classifying whether a natural language claim is supported or refuted by Wikipedia, or whether there is not enough information to decide. The task requires retrieving evidence from Wikipedia relating to the claim and then reasoning over this evidence to classify whether the claim is true, false, or unverifiable from Wikipedia alone. FEVER is a retrieval problem coupled with a challenging entailment reasoning task. It also provides an appropriate testbed for exploring the RAG models’ ability to handle classification rather than generation. We map FEVER class labels (supports, refutes, or not enough info) to single output tokens and directly train with claim-class pairs. Crucially, unlike most other approaches to FEVER, we do not use supervision on retrieved evidence. In many real-world applications, retrieval supervision signals aren’t available, and models that do not require such supervision will be applicable to a wider range of tasks. We explore two variants: the standard 3-way classification task (supports/refutes/not enough info) and the 2-way (supports/refutes) task studied in Thorne and Vlachos [57]. In both cases we report label accuracy.

4 Results

4.1 Open-domain Question Answering

Table 1 shows results for RAG along with state-of-the-art models. On all four open-domain QA tasks, RAG sets a new state of the art (only on the T5-comparable split for TQA). RAG combines the generation flexibility of the “closed-book” (parametric only) approaches and the performance of “open-book” retrieval-based approaches. Unlike REALM and T5+SSM, RAG enjoys strong results without expensive, specialized “salient span masking” pre-training [20]. It is worth noting that RAG’s retriever is initialized using DPR’s retriever, which uses retrieval supervision on Natural Questions and TriviaQA. RAG compares favourably to the DPR QA system, which uses a BERT-based “cross-encoder” to re-rank documents, along with an extractive reader. RAG demonstrates that neither a re-ranker nor extractive reader is necessary for state-of-the-art performance.

There are several advantages to generating answers even when it is possible to extract them. Documents that contain clues about the answer but do not contain the answer verbatim can still contribute towards a correct answer being generated, which is not possible with standard extractive approaches, leading to more effective marginalization over documents. Furthermore, RAG can generate correct answers even when the correct answer is not in any retrieved document, achieving 11.8% accuracy in such cases for NQ, where an extractive model would score 0%.

Table 1: Open-Domain QA Test Scores. For TQA, left column uses the standard test set for Open-Domain QA, right column uses the TQA-Wiki test set. See Appendix D for further details.

	Model	NQ	TQA	WQ	CT
Closed Book	T5-11B [52]	34.5	-/50.1	37.4	-
Closed Book	T5-11B+SSM [52]	36.6	-/60.5	44.7	-
Open Book	REALM [20]	40.4	-/-	40.7	46.8
Open Book	DPR [26]	41.5	57.9/-	41.1	50.6
Open Book	RAG-Token	44.1	55.2/66.1	45.5	50.0
Open Book	RAG-Seq.	44.5	56.8/68.0	45.2	52.2

Table 2: Generation and classification Test Scores. MS-MARCO SotA is [4], FEVER-3 is [68] and FEVER-2 is [57]. *Uses gold context/evidence. Best model without gold access underlined.

Model	Jeopardy B-1	Jeopardy QB-1	MSMARCO	FVR3	FVR2
SotA
BART	15.1	19.7
RAG-Tok.	17.3	22.2

4.2 Abstractive Question Answering

As shown in Table 2, RAG-Sequence outperforms BART on Open MS-MARCO NLG by 2.6 BLEU points and 2.6 ROUGE-L points. RAG approaches state-of-the-art model performance, which is impressive given that (i) those models access gold passages with specific information required to generate the reference answer, (ii) many questions are unanswerable without the gold passages, and (iii) not all questions are answerable from Wikipedia alone. Table 3 shows some generated answers from our models. Qualitatively, we find that RAG models hallucinate less and generate factually correct text more often than BART. Later, we also show that RAG generations are more diverse than BART generations (see §4.5).

4.3 Jeopardy Question Generation

Table 2 shows that RAG-Token performs better than RAG-Sequence on Jeopardy question generation, with both models outperforming BART on Q-BLEU-1. Table 4 shows human evaluation results, over 452 pairs of generations from BART and RAG-Token. Evaluators indicated that BART was more factual than RAG in only 7.1% of cases, while RAG was more factual in 42.7% of cases, and both RAG and BART were factual in a further 17% of cases, clearly demonstrating the effectiveness of RAG on the task over a state-of-the-art generation model. Evaluators also find RAG generations to be more specific by a large margin. Table 3 shows typical generations from each model.

Jeopardy questions often contain two separate pieces of information, and RAG-Token may perform best because it can generate responses that combine content from several documents. Figure 2 shows an example. When generating “Sun”, the posterior is high for document 2, which mentions “The Sun Also Rises”. Similarly, document 1 dominates the posterior when “A Farewell to Arms” is generated. Intriguingly, after the first token of each book is generated, the document posterior flattens. This observation suggests that the generator can complete the titles without depending on specific documents. In other words, the model’s parametric knowledge is sufficient to complete the titles. We find evidence for this hypothesis by feeding the BART-only baseline with the partial decoding ‘"The Sun’. BART completes the generation as ‘"The Sun Also Rises" is a novel by this author of "The Sun Also Rises"’, indicating that the title “The Sun Also Rises” is stored in BART’s parameters. Similarly, BART will complete the partial decoding ‘"The Sun Also Rises" is a novel by this author of "A’ with ‘"The Sun Also Rises" is a novel by this author of "A Farewell to Arms"’. This example shows how parametric and non-parametric memories work together: the non-parametric component helps to guide the generation, drawing out specific knowledge stored in the parametric memory.

4.4 Fact Verification

Table 2 shows our results on FEVER. For 3-way classification, RAG scores are within 4.3% of state-of-the-art models, which are complex pipeline systems with domain-specific architectures and substantial engineering, trained using intermediate retrieval supervision, which RAG does not require.

Document 1: his works are considered classics of American literature ... His wartime experiences formed the basis for his novel “A Farewell to Arms” (1929) ...

Document 2: ... artists of the 1920s “Lost Generation” expatriate community. His debut novel, “The Sun Also Rises”, was published in 1926.

Figure 2: RAG-Token document posterior $p(z_{i}|x,y_{i},y_{-i})$ for each generated token for the input “Hemingway” for Jeopardy generation with 5 retrieved documents. The posterior for document 1 is high when generating “A Farewell to Arms” and for document 2 when generating “The Sun Also Rises”.
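The per-token document posterior plotted in Figure 2 can be illustrated with Bayes' rule. The sketch below assumes the posterior for a token is proportional to the retriever prior $p(z|x)$ multiplied by the generator's probability of that token given document $z$ and the previously generated tokens. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def token_document_posterior(prior, token_likelihood):
    """Posterior over documents for one generated token.

    prior[k]            -- assumed retriever probability p(z_k | x)
    token_likelihood[k] -- assumed generator probability of the token given
                           document z_k and the previously generated tokens
    """
    if len(prior) != len(token_likelihood):
        raise ValueError("prior and token_likelihood must have the same length")
    # Unnormalized posterior: prior times likelihood, per document.
    joint = [p * l for p, l in zip(prior, token_likelihood)]
    total = sum(joint)
    if total == 0:
        raise ValueError("token has zero probability under every document")
    return [j / total for j in joint]


# Hypothetical numbers for two retrieved documents.
prior = [0.5, 0.5]
# A token that is much more likely when conditioning on document 2 ...
peaked = token_document_posterior(prior, [0.01, 0.09])
# ... and a token the generator predicts equally well from either document.
flat = token_document_posterior(prior, [0.8, 0.8])
print(peaked, flat)
```

The second call mirrors the flattening described in the text: when the generator predicts a token equally well from every document, the posterior falls back to the prior.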

Table 3: Examples from generation tasks. RAG models generate more specific and factually accurate responses. ‘?’ indicates factually incorrect responses, * indicates partially correct responses.

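Task	Input	Model	Generation
MS-MARCO	define middle ear	BART	? The middle ear is the part between the ear and the nose.
MS-MARCO	define middle ear	RAG-T	The middle ear is the part internal to the eardrum.
MS-MARCO	define middle ear	RAG-S	The middle ear includes the tympanic cavity and the three ossicles.
MS-MARCO	what currency is needed in Scotland	BART	The currency needed in Scotland is the pound sterling.
MS-MARCO	what currency is needed in Scotland	RAG-T	The currency needed in Scotland is the pound sterling.
MS-MARCO	what currency is needed in Scotland	RAG-S	The currency needed in Scotland is the pound sterling.
Jeopardy Question Generation	Washington	BART	? This state has the largest number of counties in the U.S.
Jeopardy Question Generation	Washington	RAG-T	It is the only state named for a U.S. president.
Jeopardy Question Generation	Washington	RAG-S	It is the state where you can find Mount Rainier National Park.
Jeopardy Question Generation	The Divine Comedy	BART	* This epic poem by Dante is divided into three parts: the Inferno, the Purgatorio and the Purgatorio
Jeopardy Question Generation	The Divine Comedy	RAG-T	Dante's "Inferno" is the first part of this epic poem
Jeopardy Question Generation	The Divine Comedy	RAG-S	This 14th-century work is divided into three sections: "Inferno", "Purgatorio" and "Paradiso"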

For 2-way classification, we compare against Thorne and Vlachos [57], who train RoBERTa [35] to classify the claim as true or false given the gold evidence sentence. RAG achieves an accuracy within 2.7% of this model, despite being supplied with only the claim and retrieving its own evidence. We also analyze whether documents retrieved by RAG correspond to documents annotated as gold evidence in FEVER. We calculate the overlap in article titles between the top $k$ documents retrieved by RAG and gold evidence annotations. We find that the top retrieved document is from a gold article in 71% of cases, and a gold article is present in the top 10 retrieved articles in 90% of cases.

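The title-overlap analysis described above can be sketched as a top-$k$ hit rate. The function name, the data layout (a ranked list of retrieved article titles and a set of gold titles per claim) and the toy titles are assumptions made for illustration; the paper does not specify its implementation.

```python
def gold_title_hit_rate(retrieved_titles, gold_titles, k):
    """Fraction of claims whose top-k retrieved titles include a gold title.

    retrieved_titles -- one ranked list of article titles per claim
    gold_titles      -- one set of gold evidence article titles per claim
    """
    if len(retrieved_titles) != len(gold_titles):
        raise ValueError("need one gold title set per claim")
    if not retrieved_titles:
        raise ValueError("no claims given")
    hits = sum(
        1
        for ranked, gold in zip(retrieved_titles, gold_titles)
        if any(title in gold for title in ranked[:k])
    )
    return hits / len(retrieved_titles)


# Hypothetical retrieval results for two claims.
retrieved = [
    ["Ernest Hemingway", "The Sun Also Rises"],
    ["Paris", "Lost Generation"],
]
gold = [{"Ernest Hemingway"}, {"Lost Generation"}]
top1 = gold_title_hit_rate(retrieved, gold, k=1)
top2 = gold_title_hit_rate(retrieved, gold, k=2)
print(top1, top2)
```

With `k=1` this corresponds to the "top retrieved document is from a gold article" statistic, and with `k=10` to the top-10 statistic reported in the text.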

4.5 Additional Results

Generation Diversity. Section 4.3 shows that RAG models are more factual and specific than BART for Jeopardy question generation. Following recent work on diversity-promoting decoding [33, 59, 39], we also investigate generation diversity by calculating the ratio of distinct n-grams to total n-grams generated by different models. Table 5 shows that RAG-Sequence’s generations are more diverse than RAG-Token’s, and both are significantly more diverse than BART without needing any diversity-promoting decoding.

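The diversity measure described above, the ratio of distinct to total n-grams, can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenization and pooling of n-grams over all generations of one model; the paper does not state these details, and the function name is hypothetical.

```python
def distinct_ngram_ratio(generations, n=3):
    """Ratio of distinct n-grams to total n-grams over a list of generated texts."""
    total = 0
    distinct = set()
    for text in generations:
        tokens = text.split()  # assumption: simple whitespace tokenization
        for i in range(len(tokens) - n + 1):
            distinct.add(tuple(tokens[i:i + n]))
            total += 1
    # Texts shorter than n tokens contribute no n-grams.
    return len(distinct) / total if total else 0.0


# Two toy generations sharing the tri-gram "the sun also".
outputs = ["the sun also rises", "the sun also sets"]
ratio = distinct_ngram_ratio(outputs, n=3)
print(ratio)  # 3 distinct tri-grams out of 4 in total
```

A ratio close to 1 means almost no n-gram is repeated across a model's outputs; repetitive generations push the ratio down.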

Retrieval Ablations. A key feature of RAG is learning to retrieve relevant information for the task. To assess the effectiveness of the retrieval mechanism, we run ablations where we freeze the retriever during training. As shown in Table 6, learned retrieval improves results for all tasks.

We compare RAG’s dense retriever to a word overlap-based BM25 retriever [53]. Here, we replace RAG’s retriever with a fixed BM25 system, and use BM25 retrieval scores as logits when calculating $p(z|x)$. Table 6 shows the results. For FEVER, BM25 performs best, perhaps since FEVER claims are heavily entity-centric and thus well-suited for word overlap-based retrieval. Differentiable retrieval improves results on all other tasks, especially for Open-Domain QA, where it is crucial.

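Using retrieval scores as logits for $p(z|x)$ can be read as applying a softmax over the scores of the retrieved documents. The sketch below assumes exactly that reading; the function name and the scores are hypothetical, and no particular BM25 library is implied.

```python
import math


def scores_to_document_prior(scores):
    """Softmax over retrieval scores, treating them as logits for p(z|x)."""
    if not scores:
        raise ValueError("need at least one score")
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical BM25 scores for three retrieved documents.
bm25_scores = [12.0, 10.0, 10.0]
prior = scores_to_document_prior(bm25_scores)
print(prior)
```

Because the BM25 system is fixed, these probabilities are not updated during training, unlike the scores of the learned dense retriever.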

Index hot-swapping. An advantage of non-parametric memory models like RAG is that knowledge can be easily updated at test time. Parametric-only models like T5 or BART need further training to update their behavior as the world changes. To demonstrate, we build an index using the DrQA [5] Wikipedia dump from December 2016 and compare outputs from RAG using this index to the newer index from our main results (December 2018). We prepare a list of 82 world leaders who had changed between these dates and use a template “Who is {position}?” (e.g. “Who is the President of Peru?”) to query our NQ RAG model with each index. RAG answers 70% correctly using the 2016 index for 2016 world leaders and 68% using the 2018 index for 2018 world leaders. Accuracy with mismatched indices is low (12% with the 2018 index and 2016 leaders, 4% with the 2016 index and 2018 leaders). This shows we can update RAG’s world knowledge by simply replacing its non-parametric memory.

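The evaluation protocol above, querying one fixed model with two different indices and scoring against two sets of gold answers, can be sketched with a toy stand-in. Everything here is hypothetical: the `answer` function merely looks up the index and is not a RAG model, and the question, country and leader names are made up.

```python
def answer(question, index):
    """Toy stand-in for a retrieval-augmented model: the answer comes only
    from the index, so swapping the index changes the answer."""
    return index.get(question, "unknown")


def accuracy(index, questions, gold_answers):
    """Fraction of questions answered correctly with the given index."""
    correct = sum(answer(q, index) == gold for q, gold in zip(questions, gold_answers))
    return correct / len(questions)


# Hypothetical, made-up data (not from the paper).
questions = ["Who is the President of Examplia?"]
index_2016 = {questions[0]: "Leader A"}
index_2018 = {questions[0]: "Leader B"}
gold_2016 = ["Leader A"]
gold_2018 = ["Leader B"]

matched = accuracy(index_2016, questions, gold_2016)     # 2016 index, 2016 leaders
mismatched = accuracy(index_2016, questions, gold_2018)  # 2016 index, 2018 leaders
swapped = accuracy(index_2018, questions, gold_2018)     # 2018 index, 2018 leaders
print(matched, mismatched, swapped)
```

The `answer` function never changes between runs; only the index passed to it does, which is the property the hot-swapping experiment relies on.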

Table 4: Human assessments for the Jeopardy Question Generation Task.

	Factuality	Specificity
BART better	7.1%	16.8%
RAG better	42.7%	37.4%
Both good	11.7%	11.8%
Both poor	17.7%	6.9%
No majority	20.8%	20.1%

Table 5: Ratio of distinct to total tri-grams for generation tasks.

	MSMARCO	Jeopardy QGen
Gold	89.6%	90.0%
BART	70.7%	32.4%
RAG-Token	77.8%	46.8%
RAG-Seq.	83.5%	53.8%

Table 6: Ablations on the dev set. As FEVER is a classification task, both RAG models are equivalent.

Model	NQ (Exact Match)	TQA (Exact Match)	WQ (Exact Match)	CT (Exact Match)	Jeopardy-QGen B-1	Jeopardy-QGen QB-1	MSMarco R-L	FVR-3	FVR-2
RAG-Token-BM25	29.7	41.5	32.1	33.1	17.5	22.3	55.5
RAG-Sequence-BM25	31.8	44.1	36.6	33.8	11.1	19.5	56.5
RAG-Token-Frozen	37.8	50.1	37.1	51.1	16.7	21.7	55.9
RAG-Sequence-Frozen	41.2	52.1	41.8	52.6	11.8	19.6	56.7
RAG-Token	43.5	54.8	46.5	51.9	17.9	22.6	56.2
RAG-Sequence	44.0	55.8	44.9	53.4	15.3	21.5	57.2

Effect of Retrieving More Documents. Models are trained with either 5 or 10 retrieved latent documents, and we do not observe significant differences in performance between them. We have the flexibility to adjust the number of retrieved documents at test time, which can affect performance and runtime. Figure 3 (left) shows that retrieving more documents at test time monotonically improves Open-domain QA results for RAG-Sequence, but performance peaks for RAG-Token at 10 retrieved documents. Figure 3 (right) shows that retrieving more documents leads to higher Rouge-L for RAG-Token at the expense of Bleu-1, but the effect is less pronounced for RAG-Sequence.

Figure 3: Left: NQ performance as more documents are retrieved. Center: Retrieval recall performance in NQ. Right: MS-MARCO Bleu-1 and Rouge-L as more documents are retrieved.

5 Related Work

Single-Task Retrieval. Prior work has shown that retrieval improves performance across a variety of NLP tasks when considered in isolation. Such tasks include open-domain question answering [5, 29], fact checking [56], fact completion [48], long-form question answering [12], Wikipedia article generation [36], dialogue [41, 65, 9, 13], translation [17], and language modeling [19, 27]. Our work unifies previous successes in incorporating retrieval into individual tasks, showing that a single retrieval-based architecture is capable of achieving strong performance across several tasks.

General-Purpose Architectures for NLP. Prior work on general-purpose architectures for NLP tasks has shown great success without the use of retrieval. A single, pre-trained language model has been shown to achieve strong performance on various classification tasks in the GLUE benchmarks [60, 61] after fine-tuning [49, 8]. GPT-2 [50] later showed that a single, left-to-right, pre-trained language model could achieve strong performance across both discriminative and generative tasks. For further improvement, BART [32] and T5 [51, 52] propose a single, pre-trained encoder-decoder model that leverages bi-directional attention to achieve stronger performance on discriminative and generative tasks. Our work aims to expand the space of possible tasks with a single, unified architecture, by learning a retrieval module to augment pre-trained, generative language models.

Learned Retrieval. There is significant work on learning to retrieve documents in information retrieval, more recently with pre-trained, neural language models [44, 26] similar to ours. Some work optimizes the retrieval module to aid in a specific, downstream task such as question answering, using search [46], reinforcement learning [6, 63, 62], or a latent variable approach [31, 20] as in our work. These successes leverage different retrieval-based architectures and optimization techniques to achieve strong performance on a single task, while we show that a single retrieval-based architecture can be fine-tuned for strong performance on a variety of tasks.

Memory-based Architectures. Our document index can be seen as a large external memory for neural networks to attend to, analogous to memory networks [64, 55]. Concurrent work [14] learns to retrieve a trained embedding for each entity in the input, rather than to retrieve raw text as in our work. Other work improves the ability of dialog models to generate factual text by attending over fact embeddings [15, 13]. A key feature of our memory is that it is comprised of raw text rather than distributed representations, which makes the memory both (i) human-readable, lending a form of interpretability to our model, and (ii) human-writable, enabling us to dynamically update the model’s memory by editing the document index. This approach has also been used in knowledge-intensive dialog, where generators have been conditioned on retrieved text directly, albeit obtained via TF-IDF rather than end-to-end learnt retrieval [9].

Original paper: https://arxiv.org/pdf/2005.11401v4
