Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer
Zhengbao Jiang*, Luyu Gao*, Jun Araki, Haibo Ding, Zhiruo Wang⊙, Jamie Callan♥, Graham Neubig♥
Abstract
Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers. Retrievers and readers are usually modeled separately, which necessitates a cumbersome implementation and is hard to train and adapt in an end-to-end fashion. In this paper, we revisit this design and eschew the separate architecture and training in favor of a single Transformer that performs Retrieval as Attention (ReAtt), with end-to-end training solely based on supervision from the end QA task. We demonstrate for the first time that a single model trained end-to-end can achieve both competitive retrieval and QA performance, matching or slightly outperforming state-of-the-art separately trained retrievers and readers. Moreover, end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings, making our model a simple and adaptable solution for knowledge-intensive tasks. Code and models are available at https://github.com/jzbjyb/ReAtt.
1 Introduction
Knowledge-intensive tasks such as question answering (QA), fact checking, and dialogue generation require models to gather relevant information from potentially enormous knowledge corpora (e.g., Wikipedia) and generate answers based on the gathered evidence. A widely used solution is to first retrieve a small number of relevant documents from the corpus with a bi-encoder architecture, which encodes queries and documents independently for efficiency, and then read the retrieved documents in a more careful and expansive way with a cross-encoder architecture, which encodes queries and documents jointly (Lee et al., 2019; Guu et al., 2020; Lewis et al., 2020; Izacard et al., 2022). The distinction between retrieval and reading leads to the widely adopted paradigm of treating retrievers and readers separately: they are usually two separate models with heterogeneous architectures and different training recipes, which makes them cumbersome to train. Even though the two models can be combined in an ad-hoc way for downstream tasks, this separation hinders effective end-to-end learning and adaptation to new domains.
There have been several attempts to connect reader and retriever training (Lee et al., 2019; Guu et al., 2020; Lewis et al., 2020; Sachan et al., 2021; Lee et al., 2021a; Izacard et al., 2022). However, retrievers in these works are not learned in a fully end-to-end way. They require either initialization from existing dense retrievers trained with retrieval supervision (Lewis et al., 2020) or expensive unsupervised retrieval pretraining as warm-up (Lee et al., 2019; Guu et al., 2020; Sachan et al., 2021; Lee et al., 2021a; Izacard et al., 2022). The reliance on retrieval-specific warm-up and the ad-hoc combination of retrievers and readers make them less of a unified solution and potentially hinder their domain adaptation ability. With the ultimate goal of facilitating downstream tasks, retriever and reader should instead be fused more organically and learned in a fully end-to-end way.
In this paper, we focus on one of the most important knowledge-intensive tasks, open-domain QA. We ask the following question: is it possible to perform both retrieval and reading within a single Transformer model, and train the model in a fully end-to-end fashion to achieve competitive performance from both perspectives? Such a single-model end-to-end solution eliminates the need for retrieval-specific annotation and warm-up and simplifies retrieval-augmented training, making adaptation to new domains easier. Based on the analogy between self-attention which relates different tokens in a single sequence (Vaswani et al., 2017) and the goal of retrieval which is to relate queries with relevant documents, we hypothesize that self-attention could be a natural fit for retrieval, and it allows an organic fusion of retriever and reader within a single Transformer.
Specifically, we start from an encoder-decoder T5 (Raffel et al., 2020) and use it as both retriever and reader. We use the first $B$ encoder layers as a bi-encoder to encode queries and documents independently, and the attention scores at layer $B{+}1$ (denoted as retrieval attention) to compute relevance scores, as shown in Fig. 1. We found that directly using self-attention for retrieval underperforms strong retrievers, which we conjecture is because self-attention pretrained on local context is not sufficient to identify relevant information in the large representation space of a whole corpus. To solve this, we propose to compute retrieval attention between a query and a large number of documents and to adjust the retrieval attention across documents. For each query, we compute retrieval attention over both close documents, which potentially contain positives and hard negatives, and documents of other queries in the same batch, which serve as random negatives. The retrieval attention is adjusted by minimizing its discrepancy from the cross-attention between the decoder and encoder (denoted as target attention), which is indicative of the usefulness of each document in generating answers (Izacard and Grave, 2021a). The resulting Retrieval as Attention model (ReAtt) is a single T5 trained based on only QA annotations that simultaneously learns to promote useful documents through cross-document adjustment.
We train ReAtt on the Natural Questions dataset (NQ) (Kwiatkowski et al., 2019) in a fully end-to-end manner. It achieves both competitive retrieval and QA performance, matching or slightly outperforming the state-of-the-art retriever ColBERT-NQ (Khattab et al., 2020), trained with explicit retrieval annotations, and the strong QA model FiD (Izacard and Grave, 2021b,a), demonstrating for the first time that end-to-end training can produce a competitive retriever and reader within a single model. To further test ReAtt's generalization and end-to-end adaptation ability, we conduct zero-shot, supervised, and unsupervised adaptation experiments on 7 datasets from the BEIR benchmark (Thakur et al., 2021). In all settings, end-to-end adaptation improves retrieval performance, usually by a large margin, achieving comparable or superior performance to strong retrieval adaptation and pretraining methods.

Figure 1: Illustration of Retrieval as Attention (ReAtt) with the first $B{=}2$ encoder layers as bi-encoder (i.e., retriever) and the remaining $L{-}B{=}2$ layers as cross-encoder. During training, the retrieval attention between a query $q_1$ and documents $d_{11}, d_{12}, d_{13}$ is adjusted by minimizing its discrepancy from the target attention. For simplicity, we use a single arrow to represent the attention of a single head between multiple tokens.
2 Retrieval as Attention (ReAtt)
With the goal of developing a single Transformer that can perform both retrieval and reading, and based on the analogy between retrieval and self-attention, we first introduce architecture changes that allow retrieval as attention (§2.2), then examine how well attention as-is can be directly used to perform retrieval (§2.3).
2.1 Formal Definition
We first briefly define the tasks of retrieval and question answering. As mentioned in the introduction, queries and documents need to be represented independently for efficient retrieval, which implies a bi-encoder architecture with no interaction between queries and documents. Without loss of generality, we use $\pmb{E}_d = \mathrm{biencoder}(d)$ to denote one or multiple representations generated by a bi-encoder based on a document from a corpus $d \in \mathcal{D}$, and likewise $\pmb{E}_q = \mathrm{biencoder}(q)$ to denote query representations. The top-$k$ documents most relevant to a query are retrieved by $\mathcal{D}_q^{\mathrm{ret}} = \mathrm{argtopk}_{d \in \mathcal{D}}\, r(\pmb{E}_q, \pmb{E}_d)$, where the function $r$ computes relevance based on query and document representations (which can be as simple as a dot product if queries and documents are encoded into single vectors), and $\mathcal{D}_q^{\mathrm{ret}}$ stands for the returned documents. We consider encoder-decoder-based generative question answering in this paper, which jointly represents a query and a retrieved document with the encoder $\pmb{E}_{q,d} = \mathrm{crossencoder}(q, d)$, and generates the answer $a$ autoregressively with the decoder $P_{\mathrm{gen}}(a|q,d) = P_{\mathrm{gen}}(a|\pmb{E}_{q,d})$. To handle multiple retrieved documents, we follow the fusion-in-decoder model (FiD) (Izacard and Grave, 2021b), which encodes each query-document pair independently and fuses these representations in the decoder through cross-attention: $P_{\mathrm{gen}}(a|q,\mathcal{D}_q^{\mathrm{ret}}) = P_{\mathrm{gen}}(a|\pmb{E}_{q,d_1}, \ldots, \pmb{E}_{q,d_{|\mathcal{D}_q^{\mathrm{ret}}|}})$. Negative log-likelihood (NLL) is used in optimization: $\mathcal{L}_{\mathrm{QA}} = -\log P_{\mathrm{gen}}(a|q,\mathcal{D}_q^{\mathrm{ret}})$.
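To make the notation concrete, here is a minimal sketch of the bi-encoder retrieval step under the simplest assumption of single-vector representations (the random tensors stand in for real biencoder outputs, and the reading step is only indicated in comments):

```python
import torch

def retrieve(E_q: torch.Tensor, E_docs: torch.Tensor, k: int):
    """Top-k retrieval with r(E_q, E_d) as a dot product between
    single-vector query and document representations."""
    scores = E_docs @ E_q              # r(E_q, E_d) for every d in D
    top = torch.topk(scores, k=k)      # argtopk_d r(E_q, E_d)
    return top.indices, top.values

# Toy corpus of 5 documents with 8-dimensional representations.
E_docs = torch.randn(5, 8)             # stacked E_d = biencoder(d)
E_q = torch.randn(8)                   # E_q = biencoder(q)
doc_ids, doc_scores = retrieve(E_q, E_docs, k=3)
# The retrieved documents D_q^ret would then be jointly encoded with the
# query by the cross-encoder, and the decoder fuses them (FiD-style).
```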
2.2 Leveraging Attention for Retrieval
Next, we introduce our method that directly uses self-attention between queries and documents as retrieval scores.
Putting the Retriever into Transformers As illustrated in Fig. 1, we choose T5 (Raffel et al., 2020) as our base model, use the first $B$ layers of the encoder as the bi-encoder "retriever" by disabling self-attention between queries and documents, and the remaining $L{-}B$ layers as the cross-encoder "reader". We use the self-attention paid from query tokens to document tokens at the $(B{+}1)$-th layer as the retrieval score, which is denoted as retrieval attention (green arrows in Fig. 1). It is computed based on the independent query and document contextual representations from the last ($B$-th) layer of the bi-encoder (green blocks in Fig. 1). Formally, for an $H$-head Transformer, document and query representations are:

$$\pmb{E}_d = \{\pmb{K}_d^{B+1,h}\}_{h=1}^{H}, \quad \pmb{K}_d^{B+1,h} \in \mathbb{R}^{|d| \times e}; \qquad \pmb{E}_q = \{\pmb{Q}_q^{B+1,h}\}_{h=1}^{H}, \quad \pmb{Q}_q^{B+1,h} \in \mathbb{R}^{|q| \times e},$$
where $\pmb{K}$ and $\pmb{Q}$ are the key and query vectors of the token sequence used in self-attention, $|d|$ and $|q|$ are the document and query lengths, and $e$ is the dimensionality of each head. The retrieval attention matrix from query tokens to document tokens before softmax for one head is computed by:

$$\pmb{A}_{q,d}^{B+1,h} = \pmb{Q}_q^{B+1,h} \left(\pmb{K}_d^{B+1,h}\right)^\top \in \mathbb{R}^{|q| \times |d|}.$$
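As a concrete rendering, the sketch below computes per-head retrieval attention from independent bi-encoder outputs; the projection matrices `W_K`/`W_Q` are random placeholders for the $(B{+}1)$-th layer's pretrained key/query projections, and any scaling convention of the underlying model is omitted:

```python
import torch

H, e = 16, 64                      # heads and per-head dim (placeholders)
d_model = H * e

# Independent contextual representations from the B-th (last bi-encoder)
# layer for one document (|d|=50 tokens) and one query (|q|=7 tokens).
doc_repr = torch.randn(50, d_model)
qry_repr = torch.randn(7, d_model)

# Per-head key/query projections of the (B+1)-th self-attention layer.
W_K = torch.randn(H, d_model, e)
W_Q = torch.randn(H, d_model, e)
K_d = torch.einsum('td,hde->hte', doc_repr, W_K)   # [H, |d|, e]
Q_q = torch.einsum('td,hde->hte', qry_repr, W_Q)   # [H, |q|, e]

# Pre-softmax retrieval attention from query tokens to document tokens.
A = torch.einsum('hqe,hke->hqk', Q_q, K_d)         # [H, |q|, |d|]
```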
Directly using attention for retrieval not only leverages its ability to identify relatedness; it is also a natural and simple way to achieve both retrieval and reading in a single Transformer with minimal architectural changes, which facilitates our final goal of end-to-end learning.
From Token Attention to Document Relevance Given the token-level attention scores $\pmb{A}_{q,d}^{B+1,h}$, the relevance between $q$ and $d$ is computed by avg-max aggregation: choosing the most relevant document token for each query token (i.e., max), then averaging across query tokens:

$$r^h(q,d) = \mathrm{avg}_0\!\left(\mathrm{max}_1\!\left(\pmb{A}_{q,d}^{B+1,h}\right)\right), \quad (1)$$
where 1 and 0 refer to the dimensions over which the operations are applied. This is similar to the MaxSim and sum operators used in ColBERT (Khattab and Zaharia, 2020), with the intuition that a relevant document should match as many query tokens as possible with its best-matching token. The final relevance is a weighted sum over all heads:

$$r(q,d) = \sum_{h=1}^{H} P^{\mathrm{head}}_h\, r^h(q,d),$$
where $P^{\mathrm{head}}_h$ is a learnable weight that sums to one across heads. As explained in the next section, we empirically find only a few attention heads with non-random retrieval performance, and among them one particular head is significantly better than the others. Given this observation, we introduce a low temperature $\tau$ to promote sparsity, $P^{\mathrm{head}}_h = \frac{\exp(w_h/\tau)}{\sum_{h'} \exp(w_{h'}/\tau)}$, which in practice always converges to a single head holding the great majority of the weight, denoted as the retrieval head $h^*$. As a result, the learned head weights practically act as a head selector, a fact that can also be exploited to make test-time retrieval more efficient.
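The avg-max aggregation of Eq. 1 and the temperature-sharpened head weighting can be sketched as follows (continuing the `A` tensor above; treating $P^{\mathrm{head}}_h$ as a softmax over learnable logits `w` is our reading of the text):

```python
import torch

H = 16
A = torch.randn(H, 7, 50)            # [H, |q|, |d|] retrieval attention

# Eq. 1: max over document tokens, then average over query tokens.
r_per_head = A.max(dim=2).values.mean(dim=1)       # [H]

# Head weights: low-temperature softmax over learnable logits w, which
# concentrates nearly all mass on a single retrieval head h*.
w = torch.randn(H, requires_grad=True)
P_head = torch.softmax(w / 0.001, dim=0)
r = (P_head * r_per_head).sum()      # final relevance r(q, d)
h_star = int(P_head.argmax())        # effectively a head selector
```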
End-to-end Retrieval with Attention To perform retrieval over a corpus, we first generate the key vectors $\pmb{K}^{B+1,h^*}$ of the retrieval head for all document tokens offline and index them with the FAISS library (Johnson et al., 2021). For each query token, we issue its vector $\pmb{Q}_q^{B+1,h^*}$ to the index to retrieve the top-$K'$ document tokens, which yields a filtered set of documents, each of which has at least one token retrieved by a query token. We then fetch all tokens of the filtered documents, compute relevance scores following Eq. 1, and return the top-$K$ documents with the highest scores $r^{h^*}(q,d)$. This is similar to the two-stage retrieval in ColBERT (Khattab and Zaharia, 2020), and we reuse their successful practice in index compression and search approximation to make test-time retrieval efficient; we refer to Santhanam et al. (2021) for details.
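A compact sketch of this two-stage procedure with an exact FAISS index follows (toy sizes; `tok2doc` maps token rows to document ids, and stage 2 re-scores only the retrieved tokens instead of re-fetching all tokens of the filtered documents, so it only approximates Eq. 1):

```python
import faiss
import numpy as np

e = 64
doc_token_keys = np.random.randn(10_000, e).astype('float32')  # K^{B+1,h*}
tok2doc = np.random.randint(0, 500, size=10_000)               # token -> doc id

index = faiss.IndexFlatIP(e)          # exact inner-product index
index.add(doc_token_keys)

def retrieve(query_vecs: np.ndarray, k_tok: int = 2048, k_doc: int = 100):
    scores, tok_ids = index.search(query_vecs, k_tok)   # stage 1, per token
    doc_scores: dict = {}
    n_q = query_vecs.shape[0]
    for qi in range(n_q):
        best: dict = {}                                 # max over doc tokens
        for s, t in zip(scores[qi], tok_ids[qi]):
            d = int(tok2doc[t])
            best[d] = max(best.get(d, float('-inf')), float(s))
        for d, s in best.items():                       # avg over query tokens
            doc_scores[d] = doc_scores.get(d, 0.0) + s / n_q
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k_doc]

top_docs = retrieve(np.random.randn(7, e).astype('float32'))  # Q_q^{B+1,h*}
```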
Figure 2: Illustration of approximate attention over the corpus with $|\mathcal{Q}|{=}4$ queries in a batch and $K{=}3$ close documents per query. We use $q_1$ as an example to illustrate the required computation: close documents require both retrieval and target attention, while random documents only require retrieval attention.
2.3 How Good is Attention As-is?
To examine this question, we take T5-large and test queries from the Natural Questions dataset (NQ), retrieve 100 documents with BM25, compute relevance scores $r^h(q,d)$ with half of the layers ($B{=}12$) as bi-encoder, and measure their correlation with the gold binary relevance annotations. We found that among the $H{=}24$ heads, 4 heads have non-trivial correlations of 0.137, 0.097, 0.082, and 0.059. We further perform end-to-end retrieval over Wikipedia using the best head, achieving a top-10 retrieval accuracy of 43.5, inferior to BM25's 55.5. This demonstrates that there are indeed heads that can relate queries with relevant documents, but they are not competitive. We hypothesize that because self-attention is usually trained by comparing and relating tokens in a local context (512/1024 tokens), it cannot effectively identify relevant tokens in the enormous representation space of a corpus with millions of documents. This discrepancy motivates us to compute retrieval attention between queries and potentially all documents (i.e., attention over the corpus), and to adjust attention across documents to promote useful ones.
3 Learning Retrieval as Attention
We first approximate attention over the corpus at training time by sub-sampling, for each query, a manageable number of documents containing both potentially relevant and random documents (§3.1). Next, we introduce our end-to-end training objective, which optimizes a standard QA loss while also adding supervision that promotes attention over documents useful for the end task (§3.2).
3.1 Approximate Attention over the Corpus
Encoding the entire corpus and computing attention between the query and all documents is very expensive. To make it practical, we propose to subsample a small set of documents for each query to approximate the whole corpus. Inspired by negative sampling methods used in dense retriever training (Karpukhin et al., 2020; Xiong et al., 2021; Khattab and Zaharia, 2020), we sub-sample both (1) documents close to queries that can be either relevant or hard negatives, and (2) random documents that are most likely to be easy negatives. This allows the model to distinguish between relevant and hard negative documents, while simultaneously preventing it from losing its ability to distinguish easy negatives, which form the majority of the corpus.
Iterative Close Document Sub-sampling To sample documents close to a query, $\mathcal{D}_q^{\mathrm{close}}$, we start from the widely used lexical retriever BM25 (Robertson and Zaragoza, 2009) to retrieve $K{=}100$ documents, as shown by the orange blocks in Fig. 2. We set $K$ to a relatively large number to better approximate the local region, inspired by Izacard and Grave (2021b)'s finding that QA performance increases as more documents are used.
This fixed set of close documents can become outdated and no longer close to the query as the retrieval attention gets better. To provide dynamic close sub-samples, we re-index the corpus and retrieve a new set of $K$ documents using the current retrieval attention after each iteration. This is similar in spirit to the hard negative mining methods used in Karpukhin et al. (2020) and Khattab et al. (2020), with the major difference that we do not manually or heuristically annotate documents but instead learn from the end loss with cross-document adjustment, which will be explained in §3.2.
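Schematically, the resulting training schedule is the following loop (all helpers are trivial stand-ins for the real BM25 retriever, one ReAtt training iteration, and re-indexing plus retrieval with the current model):

```python
import random

def bm25_retrieve(query: str, k: int) -> list:
    return random.sample(range(1000), k)       # stand-in for BM25

def reatt_retrieve(model, query: str, k: int) -> list:
    return random.sample(range(1000), k)       # stand-in: re-index + search

def train_one_iteration(model, close_docs):    # stand-in for 8K QA+KL steps
    return model

queries = ['who wrote hamlet', 'where is the louvre']
model, K = object(), 100

# Iteration 1 uses BM25 close documents; afterwards, the corpus is
# re-indexed and close documents refreshed with the current model.
close_docs = {q: bm25_retrieve(q, K) for q in queries}
for _ in range(4):
    model = train_one_iteration(model, close_docs)
    close_docs = {q: reatt_retrieve(model, q, K) for q in queries}
```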
In-batch Random Document Sub-sampling We use the close documents of other queries in the same batch as the random documents of the current query: $\mathcal{D}_q^{\mathrm{random}} = \cup_{q' \in \mathcal{Q} \wedge q' \neq q}\, \mathcal{D}_{q'}^{\mathrm{close}}$, where $\mathcal{Q}$ contains all queries in a batch, as shown by the green blocks in Fig. 2. This has the advantage of reusing document representations across queries. It is similar to the in-batch negatives used in DPR (Karpukhin et al., 2020), with the major difference that we reuse token representations ($\pmb{K}_d^{B+1,h}$, $1 \leq h \leq H$) across queries instead of a single-vector document representation.
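For illustration, assembling each query's random documents from the other queries' close documents in a batch might look like this:

```python
def in_batch_random_docs(close: dict) -> dict:
    """close maps each query in the batch to its close document ids;
    each query's random documents are the union of all other queries'
    close documents, so token representations are computed once and
    reused across queries."""
    return {q: [d for q2, docs in close.items() if q2 != q for d in docs]
            for q in close}

batch = {'q1': [11, 12, 13], 'q2': [21, 22, 23], 'q3': [31, 32, 33]}
print(in_batch_random_docs(batch)['q1'])   # [21, 22, 23, 31, 32, 33]
```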
3.2 Cross-document Adjustment with Decoder-to-Encoder Attention Distillation
Given the sub-sampled $|\mathcal{Q}| \times K$ documents $\mathcal{D}_q = \mathcal{D}_q^{\mathrm{close}} \cup \mathcal{D}_q^{\mathrm{random}}$ for each query $q$, we compute the retrieval-attention-based relevance scores $r(q,d)$ and adjust them across the multiple documents $d \in \mathcal{D}_q$, relying only on end-task supervision. Since retrieval is simply a means to achieve the downstream task, documents useful for the downstream task should be promoted by retrieval. Inspired by reader-to-retriever distillation (Izacard and Grave, 2021a; Yang and Seo, 2020), we measure document usefulness based on the cross-attention between decoder and encoder, and minimize the retrieval attention's discrepancy from it through distillation. In contrast to Izacard and Grave (2021a), which learns two models iteratively and alternately, we optimize the QA and distillation losses in a single model simultaneously.
Minimizing KL-divergence Between Retrieval and Target Attention Specifically, we denote the cross-attention before softmax at the first position/token of the last decoder layer as the target attention $\pmb{C}_{a,q,\mathcal{D}_q} \in \mathbb{R}^{H \times |\mathcal{D}_q| \times (|d|+|q|)}$, where $a$ is the answer, $|\mathcal{D}_q|$ is the number of sub-sampled documents to be fused by the decoder (§2.1), and $|d|$ is the document length. To aggregate the token-level target attention into a document-level distribution $P^{\mathrm{tgt}}(a,q,\mathcal{D}_q) \in \mathbb{R}^{|\mathcal{D}_q|}$, we first perform softmax over all tokens in all query-document pairs ($|\mathcal{D}_q| \times (|d|+|q|)$ values), sum over the tokens of each query-document pair ($|d|+|q|$), then average across the $H$ heads:

$$P^{\mathrm{tgt}}(a,q,\mathcal{D}_q) = \mathrm{avg}_{H}\!\left(\mathrm{sum}_{|d|+|q|}\!\left(\mathrm{softmax}\!\left(\pmb{C}_{a,q,\mathcal{D}_q}\right)\right)\right).$$
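In tensor form, this aggregation reads as follows (toy shapes; `C` stands in for the pre-softmax decoder-to-encoder attention at the first decoded token):

```python
import torch

H, n_docs, pair_len = 16, 4, 57        # heads, |D_q|, |d| + |q| (toy)
C = torch.randn(H, n_docs, pair_len)   # target attention, pre-softmax

# Softmax over all tokens of all query-document pairs, ...
P = torch.softmax(C.reshape(H, -1), dim=-1).reshape(H, n_docs, pair_len)
# ... sum over the tokens of each pair, then average across heads.
P_tgt = P.sum(dim=-1).mean(dim=0)      # [|D_q|], a distribution over docs
assert torch.allclose(P_tgt.sum(), torch.tensor(1.0))
```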
Given the relevance scores obtained from retrieval attention, the final cross-document adjustment loss is the KL-divergence between the relevance distribution $P^{\mathrm{ret}}(q,\mathcal{D}_q) = \mathrm{softmax}_{d \in \mathcal{D}_q}\left(r(q,d)\right)$ and the target distribution $P^{\mathrm{tgt}}$:

$$\mathcal{L}_{\mathrm{KL}} = \mathrm{KL}\!\left(\overline{P^{\mathrm{tgt}}(a,q,\mathcal{D}_q)} \,\Big\|\, P^{\mathrm{ret}}(q,\mathcal{D}_q)\right),$$
where the overline indicates stopping gradient back-propagation to the target distribution. Our final loss combines the QA loss and the cross-document adjustment loss with $\alpha$ as the combination weight:

$$\mathcal{L} = \mathcal{L}_{\mathrm{QA}} + \alpha\, \mathcal{L}_{\mathrm{KL}}. \quad (3)$$
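A minimal rendering of the combined objective is sketched below, including the zero-target assumption for in-batch random documents introduced in the next paragraph (toy shapes; the small epsilon guards the log at the zero entries):

```python
import torch
import torch.nn.functional as F

n_close, n_random = 4, 8
r = torch.randn(n_close + n_random, requires_grad=True)   # r(q,d), d in D_q

# Target distribution: aggregated decoder-to-encoder attention for close
# documents; assumed zero for in-batch random documents, so no
# cross-encoder/decoder pass is needed for them.
P_tgt = torch.cat([torch.softmax(torch.randn(n_close), dim=0),
                   torch.zeros(n_random)]).detach()       # stop-gradient

log_P_ret = F.log_softmax(r, dim=0)
loss_kl = (P_tgt * ((P_tgt + 1e-12).log() - log_P_ret)).sum()

loss_qa = torch.tensor(1.0, requires_grad=True)  # stand-in generation NLL
alpha = 8.0
loss = loss_qa + alpha * loss_kl                 # Eq. 3
loss.backward()
```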
Zero Target Attention for Random Documents For a batch with $|\mathcal{Q}|$ queries, we would need to compute retrieval attention and target attention between $|\mathcal{Q}| \times |\mathcal{Q}| \times K$ query-document pairs. This is both computation- and memory-intensive when the batch size is large, especially for the target attention, because it requires $L{-}B$ layers of joint encoding of query-document pairs in the cross-encoder. To alleviate this, we make a simple and effective assumption that in-batch random documents are not relevant to the current query and thus have zero target attention: $P^{\mathrm{tgt}}(a,q,\mathcal{D}_q^{\mathrm{random}}) \in \mathbb{R}^{|\mathcal{D}_q^{\mathrm{random}}|} \leftarrow \mathbf{0}$. As a result, we only need to run the cross-encoder and decoder for the $K$ close documents of each query, as shown in Fig. 2. In Appendix A we introduce our efficient implementation, which makes it possible to run a large batch size on a limited number of GPUs.
3.3 Domain Adaptation Methods
One of the major benefits of a single end-to-end trainable model is that given a new corpus from a new domain, possibly without retrieval annotations, we can easily adapt it by end-to-end training. This section describes how we adapt ReAtt under different setups.
We consider adapting ReAtt with (1) QA supervision, (2) information retrieval (IR) supervision, or (3) unsupervised adaptation, where we only have access to the document corpus. Although our goal is to learn retrieval through downstream tasks instead of retrieval supervision, being able to consume retrieval annotations is helpful when retrieval supervision is indeed available. To do so, we convert a retrieval task with annotations in the form of query-document-relevance triples $\langle q, d, l \rangle$ into a generative task: given a query, the target is to generate its relevant document and the corresponding relevance in the format "relevance: $l$. $d$". If a query has multiple relevant documents, we follow Izacard and Grave (2021b) and randomly sample one of them. For unsupervised adaptation, with simplicity as our primary goal, we randomly choose one sentence from a document and mask one entity, which is treated as the "query", and have our model generate the masked entity as the "answer", similar to the salient span masking (SSM) used in Guu et al. (2020).
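Both adaptation data formats can be sketched as follows; the exact target template and the entity list (which would come from an off-the-shelf tagger) are illustrative assumptions:

```python
import random

def ir_triple_to_seq2seq(query: str, doc: str, relevance: int) -> dict:
    """Convert a <q, d, l> retrieval triple into a generative example:
    read the query, generate the relevance label and the document."""
    return {'source': query, 'target': f'relevance: {relevance}. {doc}'}

def ssm_example(document: str, entities: list) -> dict:
    """Unsupervised adaptation: pick a sentence containing an entity,
    mask the entity as the 'query', and generate it as the 'answer'."""
    sentences = [s for s in document.split('. ')
                 if any(e in s for e in entities)]
    sentence = random.choice(sentences)
    answer = random.choice([e for e in entities if e in sentence])
    return {'source': sentence.replace(answer, '<mask>'), 'target': answer}

print(ir_triple_to_seq2seq('what is bm25', 'BM25 is a ranking function', 2))
print(ssm_example('Marie Curie won the Nobel Prize. She was born in Warsaw',
                  ['Marie Curie', 'Nobel Prize', 'Warsaw']))
```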
4 In-domain Experiments
In this section, we examine whether training ReAtt end-to-end with only QA supervision yields both competitive retrieval and QA performance.
Datasets, Baselines, and Metrics We train our model on the Natural Questions dataset (NQ). We compare retrieval performance with the lexical model BM25 (Robertson and Zaragoza, 2009), passage-level dense retrievers DPR, ANCE, coCondenser, FiD-KD, and YONO (with and without retrieval pretraining) (Karpukhin et al., 2020; Oguz et al., 2021; Xiong et al., 2021; Gao and Callan, 2022; Izacard and Grave, 2021a; Lee et al., 2021a), and token/phrase-level dense retrievers DensePhrase, ColBERT, and ColBERT-NQ (Lee et al., 2021b; Khattab and Zaharia, 2020; Khattab et al., 2020). Among them, ColBERT-NQ, FiD-KD, and YONO are the most fair-to-compare baselines because of either similar token-level retrieval granularity (ColBERT-NQ) or similar end-to-end training settings (FiD-KD and YONO). We report top-$k$ retrieval accuracy (R@$k$), the fraction of queries with at least one retrieved document containing the answer. We compare QA performance with ORQA, REALM, RAG, FiD, EMDR$^2$, YONO, UnitedQA, and R2-D2 (Lee et al., 2019; Guu et al., 2020; Lewis et al., 2020; Izacard and Grave, 2021b,a; Sachan et al., 2021; Lee et al., 2021a; Cheng et al., 2021; Fajcik et al., 2021) using exact match (EM), among which FiD, EMDR$^2$, and YONO are the most fair-to-compare baselines because they have similar model sizes and training settings.
Implementation Details of ReAtt ReAtt is based on T5-large with $B{=}12$ encoder layers as bi-encoder and temperature $\tau{=}0.001$ to select the best retrieval head. We retrieve $K{=}100$ close documents for each query, and use a batch size of $|\mathcal{Q}|{=}64$ queries to obtain in-batch random documents. We use $\alpha{=}8$ to combine the cross-document adjustment loss with the QA loss. We use AdamW with a learning rate of 5e-5, 10% warm-up steps, and linear decay. We first warm up the cross-attention's ability to distinguish documents by using only the QA loss for 3K steps, then train with the combined losses (Eq. 3) for 4 iterations, where the first iteration uses close documents returned by BM25 and the following 3 iterations use close documents returned by the previous ReAtt model (denoted as ReAtt_BM25). Each iteration has 8K update steps and takes ${\sim}1.5$ days on a single node with $8\times$ A100 GPUs with 80GB memory. Since DPR (Karpukhin et al., 2020) achieves stronger performance than BM25, training with close documents returned by DPR can potentially reduce training time. We experimented with training on close documents from DPR for a single iteration with 16K steps (denoted as ReAtt_DPR). Since both approaches achieve similar performance (Tab. 1 and Tab. 2) and ReAtt_DPR is cheaper to train, we use it in the other experimental settings.
Table 1: Retrieval performance on NQ. PT is retrieval pretraining. Fair-to-compare baselines are highlighted with background color. Best performance is in bold.
| Model | R@1 | R@5 | R@20 | R@100 | #Params |
|---|---|---|---|---|---|
| *Supervised retrievers* | | | | | |
| BM25 | 23.9 | 45.9 | 63.8 | 78.9 | - |
| DPR | 45.9 | 68.1 | 80.0 | 85.9 | 220M |
| DPRnew | 52.5 | 72.2 | 81.3 | 87.3 | 220M |
| DPR-PAQ | - | 74.2 | 84.0 | 89.2 | 220M |
| ANCE | - | - | 81.9 | 87.5 | 220M |
| coCondenser | - | 75.8 | 84.3 | 89.0 | 220M |
| DensePhrase | 51.1 | 69.9 | 78.7 | - | 330M |
| ColBERT | - | - | 79.1 | - | 110M |
| ColBERT-NQ | 54.3 | 75.7 | 85.6 | 90.0 | 110M |
| *Semi-/unsupervised retrievers* | | | | | |
| FiD-KD | 49.4 | 73.8 | 84.3 | 89.3 | 220M |
| YONO w/o PT | - | 72.3 | 82.2 | - | 165M |
| YONO w/ PT | - | 75.3 | 85.2 | 90.2 | 165M |
| ReAtt_DPR | 54.6 | 77.2 | **86.1** | **90.7** | 165M |
| ReAtt_BM25 | **55.8** | **77.4** | 86.0 | 90.4 | 165M |
At test time, we save the key vectors of all tokens in the corpus and use an exact index from FAISS (i.e., faiss.IndexFlatIP) to perform inner-product search. We retrieve $K'{=}2048$ document tokens for each query token and return the top-100 documents with the highest aggregated scores (Eq. 1) to generate answers. We found that compressing the index with the clustering and quantization proposed by Santhanam et al. (2021) can greatly reduce search latency and index size with a minor retrieval accuracy loss.
4.1 Overall Results
We compare ReAtt with various retrievers and readers in Tab. 1 and Tab. 2. ReAtt achieves slightly better retrieval performance than the strongest retriever baseline ColBERT-NQ (Khattab et al., 2020) and QA performance comparable to the strong reader baseline FiD-KD (Izacard and Grave, 2021a) on NQ, demonstrating for the first time that fully end-to-end training using QA supervision can produce both competitive retrieval and QA performance. Compared to another single-model architecture, YONO (Lee et al., 2021a), ReAtt offers better performance without cumbersome pretraining to warm up retrieval.
Table 2: QA performance on NQ. PT is retrieval pretraining. Fair-to-compare baselines are highlighted. Best performance is in bold.
| Model | EM | #Params |
|---|---|---|
| ORQA (Lee et al., 2019) | 33.3 | 330M |
| REALM (Guu et al., 2020) | 40.4 | 330M |
| RAG (Lewis et al., 2020) | 44.5 | 220M |
| FiD (Izacard and Grave, 2021b) | 51.4 | 990M |
| FiD-KD (Izacard and Grave, 2021a) | 54.4 | 990M |
| EMDR$^2$ (Sachan et al., 2021) | 52.5 | 440M |
| YONO w/o PT (Lee et al., 2021a) | 42.4 | 440M |
| YONO w/ PT (Lee et al., 2021a) | 53.2 | 440M |
| UnitedQA (Cheng et al., 2021) | 54.7 | 1.870B |
| R2-D2 (Fajcik et al., 2021) | **55.9** | 1.290B |
| ReAtt_DPR | 54.0 | 770M |
| ReAtt_BM25 | 54.7 | 770M |
4.2 Ablations
We perform ablation experiments to understand the contribution of each component. Due to resource limitations, all ablations are trained with 2K steps per iteration. We use ReAtt trained with $B{=}12$ bi-encoder layers, $|\mathcal{Q}|{=}16$ batch size, and $\alpha{=}8$ cross-document loss weight as the baseline, and remove one component or modify one hyperparameter at a time to investigate its effect. As shown in Tab. 3, we found:
1. Only using the QA loss without cross-document adjustment (#2) improves retrieval performance over the original T5 (#3), but cross-document adjustment is necessary to achieve further improvement (#1).
2. Iteratively retrieving close documents with the current model is helpful (#5 vs. #1).
3. In-batch random documents are beneficial (#4 vs. #1), and a larger batch size leads to larger improvements (#8-11).
4. A larger weight on the cross-document adjustment loss improves retrieval performance but hurts QA performance, with $\alpha$ between 4 and 8 achieving a good trade-off (#12-15).
5. A small number of bi-encoder layers (#6) significantly hurts retrieval, while a large number (#7) significantly hurts QA, suggesting equal numbers of layers in the bi-encoder and cross-encoder.

Table 3: Ablations, removing one component or changing one hyperparameter from the ReAtt baseline.

| # | Ablation | R@1 | R@5 | R@20 | R@100 | EM |
|---|---|---|---|---|---|---|
| 1 | ReAtt baseline ($B{=}12$, $\lvert\mathcal{Q}\rvert{=}16$, $\alpha{=}8$) | 41.9 | 68.8 | 82.5 | | |
| 2 | − cross-document loss | 21.7 | 49.0 | 71.5 | 83.5 | 46.0 |
| 3 | − QA loss (= T5) | 13.2 | 33.7 | 53.6 | 67.7 | 3.0 |
| 4 | − in-batch loss | 38.1 | 66.0 | 80.3 | 87.6 | 46.7 |
| 5 | − iteration | 41.2 | 68.3 | 82.0 | 88.4 | 45.0 |
| 6 | $B{=}6$ | 19.1 | 42.1 | 62.4 | 78.1 | 40.3 |
| 7 | $B{=}18$ | 38.2 | 63.8 | 79.3 | 87.4 | 35.2 |
| 8 | $\lvert\mathcal{Q}\rvert{=}4$ | 39.4 | 66.1 | 80.7 | | |
| 9 | $\lvert\mathcal{Q}\rvert{=}8$ | 40.7 | 67.1 | 82.1 | | |
| 10 | $\lvert\mathcal{Q}\rvert{=}32$ | 43.6 | 69.4 | 82.8 | | |
| 11 | $\lvert\mathcal{Q}\rvert{=}64$ | 45.5 | 71.0 | 83.3 | | |
| 12 | $\alpha{=}1$ | 37.4 | 65.4 | 80.9 | 88.0 | 47.3 |
| 13 | $\alpha{=}2$ | 39.7 | 66.9 | 81.7 | 88.4 | 47.4 |
| 14 | $\alpha{=}4$ | 40.9 | 68.0 | 82.1 | 88.8 | 46.9 |
| 15 | $\alpha{=}16$ | 42.0 | 68.8 | 82.5 | 88.8 | 45.5 |
5 Out-of-domain Generalization and Adaptation
In this section, we examine both zero-shot retrieval performance on out-of-domain datasets and ReAtt’s end-to-end adaptability in supervised (QA, IR) and unsupervised settings.
5.1 Datasets, Baselines, and Metrics
We choose 7 datasets from BEIR (Thakur et al., 2021), a benchmark covering diverse domains and tasks. On each dataset we compare ReAtt with different types of retrievers including BM25, DPR, and ColBERT. We consider 2 QA datasets (BioASQ and FiQA (Tsatsaronis et al., 2015; Maia et al., 2018)) and one IR dataset (MS MARCO (Nguyen et al., 2016)) to evaluate supervised adaptation capability, and 4 other datasets (CQADupStack, TREC-COVID, SCIDOCS, and SciFact (Hoogeveen et al., 2015; Voorhees et al., 2020; Cohan et al., 2020; Wadden et al., 2020)) to evaluate unsupervised adaptation capability. Detailed statistics are listed in Tab. 8. We report nDCG@10 to measure retrieval performance and EM to measure QA performance. We group all baselines into three categories and denote them accordingly in the following tables:
• Supervised adaptation models are trained with downstream task supervision, including RAG trained on BioASQ, Contriever fine-tuned on FiQA, and docT5query, ANCE, ColBERT, and Contriever fine-tuned on MS MARCO (Nogueira and Lin, 2019; Xiong et al., 2021; Khattab and Zaharia, 2020; Izacard et al., 2021).
• Unsupervised adaptation models are trained on the domain corpus in an unsupervised way, such as contrastive learning or pseudo-query generation, including SimCSE and TSDAE + GPL (Gao et al., 2021c; Wang et al., 2021a,b).
• Pretraining models are trained on corpora without direct exposure to the target domain, such as Contriever (Izacard et al., 2021) trained with contrastive learning on Wikipedia and CCNet.
Table 4: nDCG@10 of zero-shot and supervised adaptation experiments on two QA datasets and one IR dataset. Baselines fall into the three categories listed above (pretraining, unsupervised adaptation, and supervised adaptation). The improvement of ReAtt over its zero-shot performance is shown in parentheses.

| Model | BioASQ (QA) | FiQA (QA) | MS MARCO (IR) |
|---|---|---|---|
| *Zero-shot* | | | |
| BM25 | 68.1 | 23.6 | 22.8 |
| DPR | 14.1 | 11.2 | 17.7 |
| ColBERT-NQ | 65.5 | 23.8 | 32.8 |
| ReAtt | 71.1 | 30.1 | 32.3 |
| *With additional training* | | | |
| Contriever | | 32.9 | |
| SimCSE | 58.1 | 31.4 | |
| TSDAE+GPL | 61.6 | 34.4 | |
| Contriever w/ FT | | 38.1 | |
| docT5query | | | |
| ANCE | | | |
| ColBERT | | | |
| ReAtt | 76.9 (+5.8) | 38.6 (+8.5) | 39.9 (+7.6) |
Table 5: RAG and ReAtt on BioASQ. Each indent indicates fine-tuning one more component than its parent, with the performance difference shown in parentheses. "pipe" denotes fine-tuning conducted sequentially instead of jointly with the current component.
| # | Setting | nDCG@1 | nDCG@5 | EM |
|---|---|---|---|---|
| 1 | RAG | 14.6 | 13.0 | 1.3 |
| 2 | ·· + reader | 14.6 | 13.0 | 27.5 (+26.2) |
| 3 | ···· + qry enc (e2e) | 0.0 | 0.0 (−13.0) | 25.7 (−1.9) |
| 4 | ·· + doc/qry enc | 29.4 | 27.1 (+14.1) | 5.0 (+3.7) |
| 5 | ···· + reader (pipe) | 29.4 | 27.1 | 27.8 (+22.8) |
| 6 | ······ + qry enc | 23.3 | 23.2 (−4.0) | 26.2 (−1.6) |
| 7 | T5 | 49.2 | 47.7 | 0.0 |
| 8 | ·· + e2e | 75.2 | 73.5 (+25.7) | 44.4 (+44.4) |
| 9 | ReAtt | 72.8 | 70.1 | 17.2 |
| 10 | ·· + e2e | 77.4 | 75.4 (+5.3) | 47.2 (+30.0) |
Table 6: nDCG@10 of zero-shot and unsupervised adaptation on four datasets. Format is similar to Tab. 4.

| Method | CQADupStack | TREC-COVID | SCIDOCS | SciFact |
|---|---|---|---|---|
| *Zero-shot* | | | | |
| BM25 | 29.9 | 65.6 | 15.8 | 66.5 |
| DPR | 15.3 | 33.2 | 7.7 | 31.8 |
| ANCE | 29.6 | 65.4 | 12.2 | 50.7 |
| ColBERT-NQ | 33.9 | 48.9 | 15.6 | 65.3 |
| ReAtt | 33.3 | 62.6 | 14.8 | 71.0 |
| *With additional training* | | | | |
| Contriever | 34.5 | 59.6 | 16.5 | 67.7 |
| SimCSE | 29.0 | 68.3 | | 55.0 |
| TSDAE+GPL | 35.1 | 74.6 | | 68.9 |
| ReAtt | 36.6 (+3.3) | 76.0 (+13.4) | 15.8 (+1.0) | 71.2 (+0.2) |
We highlight baselines in the same category as ReAtt in the following tables since comparison between them is relatively fair. Details of adaptation of ReAtt can be found in Appendix B.
5.2 Experimental Results
Results of supervised and unsupervised adaptation are listed in Tab. 4, Tab. 5, and Tab. 6 respectively.
Zero-shot Generalization Ability As shown in Tab. 4 and Tab. 6, the zero-shot performance of ReAtt is significantly better than other zero-shot baselines on the two QA datasets and the fact-checking dataset (+3.0/+6.5/+4.5 on BioASQ/FiQA/SciFact over the second best), and overall comparable on the remaining datasets (−0.5/−0.6/−3.0/−1.0 on MS MARCO/CQADupStack/TREC-COVID/SCIDOCS relative to the best baseline, usually BM25), demonstrating that our end-to-end training with QA loss on NQ produces a robust retriever. We conjecture that the superior performance on the QA datasets can be attributed to our end-to-end training using the QA loss, which learns retrieval that aligns better with the end task than training with retrieval annotations.
Retrieval Adaptation with QA Supervision As shown on the left-hand side of Tab. 4, end-to-end adaptation with QA supervision significantly improves ReAtt's retrieval performance, by 5.8/8.5 points on BioASQ/FiQA, achieving performance similar to Contriever fine-tuned on FiQA and better than other unsupervised methods, confirming the end-to-end adaptability of our method.
End-to-end QA Adaptation We perform end-to-end adaptation on BioASQ and compare with RAG as a baseline, which combines DPR as retriever and BART as reader; DPR has a query encoder and a document encoder. Since updating the document encoder requires re-indexing the corpus, it is kept fixed during fine-tuning. We found that end-to-end fine-tuning fails for RAG. To understand why, we conduct a rigorous experiment that breaks down each component of RAG to find the failure point in Tab. 5.
Starting from the initial model trained on NQ (#1), we first fine-tune the reader while fixing the query encoder (#2), and as expected QA performance improves. However, fine-tuning both the query encoder and the reader (end-to-end, #3) makes the retriever collapse, with zero relevant documents returned, indicating that end-to-end fine-tuning does not work for RAG on new domains. In order to improve both retrieval and QA, we need to fine-tune RAG in a pipelined manner: first fine-tune the retriever (both query and document encoders) similarly to DPR using retrieval annotations (#4), then fine-tune the reader (#5). Even with the DPR-like fine-tuned retriever, end-to-end fine-tuning of the query encoder and reader still fails (#6), although the retriever does not completely collapse.
End-to-end fine-tuning of ReAtt improves retrieval and QA simultaneously. Fine-tuning starting from ReAtt trained on NQ is better than starting from T5, indicating that the capability learned on NQ transfers to BioASQ. Comparing RAG and ReAtt, we identify several keys that enable end-to-end adaptation: (1) ReAtt, relying on token-level attention, has strong initial performance; (2) cross-document adjustment over both close and random documents in ReAtt provides a better gradient estimate than only using retrieved documents as in RAG; (3) the distillation-based loss in ReAtt might be more effective than multiplying the retrieval probability into the final generation probability.