[论文翻译]REALM: 检索增强的语言模型预训练


原文地址:https://arxiv.org/pdf/2002.08909v1


REALM: Retrieval-Augmented Language Model Pre-Training

REALM: 检索增强的语言模型预训练

Kelvin Guu * 1 Kenton Lee * 1 Zora Tung 1 Panupong Pasupat 1 Ming-Wei Chang


Abstract

摘要

Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.

语言模型预训练已被证明能捕获大量世界知识,这对问答等自然语言处理任务至关重要。然而,这些知识隐式存储在神经网络参数中,需要不断扩大网络规模以涵盖更多事实。为使知识获取更具模块化和可解释性,我们通过潜在知识检索器增强语言模型预训练,使模型能在预训练、微调和推理阶段检索并关注来自维基百科等大型语料库的文档。我们首次展示了如何以无监督方式预训练此类知识检索器:使用掩码语言建模作为学习信号,并通过考虑数百万文档的检索步骤进行反向传播。通过在开放域问答(Open-QA)任务上的微调,我们验证了检索增强型语言模型预训练(REALM)的有效性。在三个主流Open-QA基准测试中,我们与显式和隐式知识存储的先进模型进行对比,发现以显著优势(绝对准确率提升4-16%)超越所有现有方法,同时具备可解释性和模块化等定性优势。


Figure 1. REALM augments language model pre-training with a neural knowledge retriever that retrieves knowledge from a textual knowledge corpus, $\mathcal{Z}$ (e.g., all of Wikipedia). Signal from the language modeling objective backpropagates all the way through the retriever, which must consider millions of documents in $\mathcal{Z}$ —a significant computational challenge that we address.

图 1: REALM通过神经知识检索器增强语言模型预训练,该检索器从文本知识库$\mathcal{Z}$(例如整个维基百科)中获取知识。语言建模目标的信号通过检索器反向传播,后者需要处理$\mathcal{Z}$中数百万份文档——我们解决了这一重大计算挑战。

1. Introduction

1. 引言

Recent advances in language model pre-training have shown that models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2019) store a surprising amount of world knowledge, acquired from the massive text corpora they are trained on (Petroni et al., 2019). For example, BERT is able to correctly predict the missing word in the following sentence: “The ___ is the currency of the United Kingdom” (answer: “pound”).

语言模型预训练的最新进展表明,BERT (Devlin et al., 2018)、RoBERTa (Liu et al., 2019) 和 T5 (Raffel et al., 2019) 等模型存储了惊人的世界知识量,这些知识源自它们训练所使用的大规模文本语料库 (Petroni et al., 2019)。例如,BERT能够正确预测以下句子中缺失的单词:"The ___ is the currency of the United Kingdom"(答案:"pound")。

In these language models, the learned world knowledge is stored implicitly in the parameters of the underlying neural network. This makes it difficult to determine what knowledge is stored in the network and where. Furthermore, storage space is limited by the size of the network—to capture more world knowledge, one must train ever-larger networks, which can be prohibitively slow or expensive.

在这些语言模型中,学习到的世界知识被隐式地存储在底层神经网络的参数中。这使得难以确定网络中存储了哪些知识以及存储位置。此外,存储空间受限于网络规模——要捕获更多世界知识,就必须训练越来越大的网络,这可能导致训练过程极其缓慢或成本高昂。

To capture knowledge in a more interpretable and modular way, we propose a novel framework, Retrieval-Augmented Language Model (REALM) pre-training, which augments language model pre-training algorithms with a learned textual knowledge retriever. In contrast to models that store knowledge in their parameters, this approach explicitly exposes the role of world knowledge by asking the model to decide what knowledge to retrieve and use during inference. Before making each prediction, the language model uses the retriever to retrieve documents from a large corpus such as Wikipedia, and then attends over those documents to help inform its prediction. Learning this model end-to-end requires backpropagating through a retrieval step that considers an entire corpus of textual knowledge, as shown in Figure 1.

为了以更可解释和模块化的方式捕捉知识,我们提出了一种新颖框架——检索增强语言模型(REALM)预训练,该框架通过学习的文本知识检索器来增强语言模型预训练算法。与将知识存储在参数中的模型不同,这种方法通过让模型在推理过程中决定检索和使用哪些知识,明确揭示了世界知识的作用。在进行每次预测之前,语言模型使用检索器从维基百科等大型语料库中检索文档,然后关注这些文档以辅助其预测。如图1所示,端到端学习该模型需要通过考虑整个文本知识语料库的检索步骤进行反向传播。

The key intuition of REALM is to train the retriever using a performance-based signal from unsupervised text: a retrieval that improves the language model’s perplexity is helpful and should be rewarded, while an uninformative retrieval should be penalized. For example, in Figure 1, if the model needs to fill the blank in “the ___ at the top of the pyramid”, the retriever should be rewarded for selecting a document containing “The pyramidion on top allows for less material higher up the pyramid”. We achieve this behavior by modeling our retrieve-then-predict approach as a latent variable language model and optimizing the marginal likelihood.

REALM的核心直觉是通过无监督文本中的性能信号来训练检索器:能够降低语言模型困惑度的检索是有益的,应当给予奖励,而无信息量的检索则应受到惩罚。例如,在图1中,如果模型需要填补"the ___ at the top of the pyramid"中的空白,检索器若选择了包含"The pyramidion on top allows for less material higher up the pyramid"的文档就应获得奖励。我们通过将检索-预测方法建模为隐变量语言模型并优化边际似然来实现这一行为。

Incorporating a large-scale neural retrieval module during pre-training constitutes a significant computational challenge, since the retriever must consider millions of candidate documents for each pre-training step, and we must backpropagate through its decisions. To address this, we structure the retriever such that the computation performed for each document can be cached and asynchronously updated, and selection of the best documents can be formulated as Maximum Inner Product Search (MIPS).

在预训练阶段融入大规模神经检索模块是一项重大的计算挑战,因为检索器必须在每个预训练步骤中处理数百万候选文档,且需要对其决策进行反向传播。为解决这一问题,我们对检索器进行结构化设计:首先缓存并异步更新各文档的计算结果,其次将最优文档选择问题转化为最大内积搜索 (MIPS) 任务。

Numerous prior works have demonstrated the benefit of adding a discrete retrieval step to neural networks (Miller et al., 2016; Chen et al., 2017), but did not apply the framework to language model pre-training and employed non-learned retrievers to handle large-scale document collections. In the language modeling literature, the $k$ -Nearest Neighbor Language Model (Khandelwal et al., 2019) (kNN-LM) retrieves similar LM examples to improve memorization. However, kNN-LM was not finetuned for downstream tasks, perhaps because it is unclear how to adapt the retrieval mechanism: a $k\mathbf{NN}$ can only use examples labeled for the target task—during fine-tuning, this precludes LM examples, which contain the desired world knowledge. In contrast, REALM’s retriever is designed to transfer to other tasks, and the retrieval is just text, not a labeled example.

众多先前研究已证明在神经网络中加入离散检索步骤的优势 (Miller et al., 2016; Chen et al., 2017),但未将该框架应用于语言模型预训练,且使用非学习型检索器处理大规模文档集。在语言建模领域,$k$-最近邻语言模型 (Khandelwal et al., 2019) (kNN-LM) 通过检索相似的语言模型样本来增强记忆能力。然而kNN-LM未针对下游任务进行微调,这可能是因为其检索机制难以适配:$k\mathbf{NN}$只能使用带目标任务标签的样本——在微调过程中,这会排除包含所需世界知识的语言模型样本。相比之下,REALM的检索器专为任务迁移设计,且检索对象仅为文本而非带标签样本。

We evaluate our approach by fine-tuning the models pre-trained with REALM on the task of Open-domain Question Answering (Open-QA), one of the most knowledge-intensive tasks in natural language processing. We evaluate on three popular Open-QA benchmarks (NATURAL QUESTIONS-OPEN, WEB QUESTIONS, and

我们通过在开放域问答(Open-QA)任务上微调REALM预训练模型来评估我们的方法,这是自然语言处理中最需要知识的任务之一。我们在三个流行的开放域问答基准测试(NATURAL QUESTIONS-OPEN、WEB QUESTIONS和...)上进行评估。

CURATEDTREC) and compare to state-of-the-art Open-QA models, including both extremely large models that store knowledge implicitly (such as T5) as well as previous approaches that also use a knowledge retriever to access external knowledge, but implement retrieval in a more heuristic fashion (Lee et al., 2019; Min et al., 2019a; Asai et al., 2019). REALM achieves new state-of-the-art results on all three benchmarks, significantly outperforming all previous systems by 4-16% absolute accuracy. We also demonstrate qualitative benefits of REALM, including interpretability and modularity.

CURATEDTREC) 并与最先进的开放问答模型进行比较,包括那些隐式存储知识的超大规模模型(如T5),以及同样使用知识检索器访问外部知识但以更启发式方式实现检索的先前方法 (Lee et al., 2019; Min et al., 2019a; Asai et al., 2019)。REALM在所有三个基准测试中均取得了新的最先进成果,绝对准确率显著超越所有先前系统4-16%。我们还展示了REALM的定性优势,包括可解释性和模块化。

2. Background

2. 背景

Language model pre-training The goal of language model pre-training is to learn useful representations of language, usually from unlabeled text corpora. The resulting pre-trained model can then be further trained (fine-tuned) for a downstream task of primary interest (in our case, Open-QA), often leading to better generalization than training from scratch (Dai & Le, 2015; Radford et al., 2019).

语言模型预训练
语言模型预训练的目标是从通常无标注的文本语料库中学习语言的有用表征。得到的预训练模型可以针对主要关注的下游任务(在我们的案例中是开放域问答)进行进一步训练(微调),这通常比从头开始训练能带来更好的泛化性能 (Dai & Le, 2015; Radford et al., 2019)。

We focus on the masked language model (MLM) variant of pre-training popularized by BERT (Devlin et al., 2018). In its basic form, an MLM is trained to predict the missing tokens in an input text passage. Given an unlabeled pre-training corpus $\mathcal{X}$ (e.g., Wikipedia text), a training example $(x,y)$ can be generated by randomly masking tokens in a sampled piece of text (e.g., $x=$ “The [MASK] is the currency [MASK] the UK”; $y=$ (“pound”, “of”)). The model uses its representation of the masked input $x$ to predict the token that should go in each mask. A good MLM must learn to encode syntactic and semantic information (e.g., to predict “of”) as well as some world knowledge (e.g., to predict “pound”).

我们专注于由BERT (Devlin et al., 2018) 推广的掩码语言模型 (masked language model, MLM) 预训练变体。其基本形式是训练MLM预测输入文本段落中缺失的token。给定无标注的预训练语料$\mathcal{X}$(例如维基百科文本),可通过随机掩码采样文本中的token生成训练样本$(x,y)$(例如$x=$"The [MASK] is the currency [MASK] the UK";$y=$("pound", "of"))。模型利用其对掩码输入$x$的表征来预测每个掩码位置应填入的token。优秀的MLM必须学会编码句法和语义信息(例如预测"of")以及部分世界知识(例如预测"pound")。

Open-domain question answering (Open-QA) To measure a model’s ability to incorporate world knowledge, we need a downstream task where world knowledge is critical. Perhaps one of the most knowledge-intensive tasks in natural language processing is open-domain question answering (Open-QA): given a question $x$ such as “What is the currency of the UK?”, a model must output the correct answer string $y$ , “pound”. The “open” part of Open-QA refers to the fact that the model does not receive a pre-identified document that is known to contain the answer, unlike traditional reading comprehension (RC) tasks such as SQuAD (Rajpurkar et al., 2016; 2018). While RC models comprehend a single document, Open-QA models must retain knowledge from millions of documents, since a question could be about any of them.

开放域问答 (Open-QA)
为衡量模型整合世界知识的能力,我们需要一个依赖世界知识的下游任务。自然语言处理中最具知识密集性的任务之一或许是开放域问答:给定问题$x$(例如"英国的货币是什么?"),模型必须输出正确答案字符串$y$("英镑")。Open-QA中的"开放"指模型不会像SQuAD (Rajpurkar et al., 2016; 2018)等传统阅读理解(RC)任务那样接收已知包含答案的预设文档。阅读理解模型只需理解单个文档,而开放域问答模型必须保留来自数百万文档的知识,因为问题可能涉及其中任意内容。

We focus on Open-QA systems that utilize a textual knowledge corpus $\mathcal{Z}$ as the knowledge source. Many of these systems employ a retrieval-based approach: given a question $x$ , retrieve potentially relevant documents $z$ from the corpus $\mathcal{Z}$ , and then extract an answer $y$ from the documents (Brill et al., 2002; Chen et al., 2017; Lee et al., 2019). Our approach, REALM, is inspired by this paradigm and extends it to language model pre-training. Alternatively, some recent work has proposed generation-based systems that apply a sequence-to-sequence model on $x$ to directly generate $y$ token-by-token (Lewis et al., 2019; Raffel et al., 2019). We will compare against state-of-the-art systems from both paradigms in our experiments.

我们关注以文本知识库$\mathcal{Z}$作为知识源的开放问答系统。这类系统大多采用基于检索的方法:给定问题$x$,从知识库$\mathcal{Z}$中检索潜在相关文档$z$,然后从文档中提取答案$y$ (Brill et al., 2002; Chen et al., 2017; Lee et al., 2019)。我们的REALM方法受此范式启发,并将其扩展至语言模型预训练领域。另一些近期研究提出了基于生成的系统,它们对$x$应用序列到序列模型直接逐token生成$y$ (Lewis et al., 2019; Raffel et al., 2019)。实验中我们将对比这两种范式下的前沿系统。

3. Approach

3. 方法

We start by formalizing REALM’s pre-training and fine-tuning tasks as a retrieve-then-predict generative process in Section 3.1. Then in Section 3.2, we describe the model architectures for each component of that process. In Section 3.3, we show how to implement REALM pre-training and fine-tuning by maximizing the likelihood of REALM’s generative process. En route, we address important computational challenges, explain why training works, and also discuss strategies for injecting useful inductive biases. The overall framework is illustrated in Figure 2.

我们首先在第3.1节将REALM的预训练和微调任务形式化为一个检索-预测生成过程。接着在第3.2节,我们描述了该流程各组成部分的模型架构。第3.3节展示了如何通过最大化REALM生成过程的似然来实现其预训练与微调。在此过程中,我们解决了关键的计算挑战,阐释了训练原理,并讨论了注入有用归纳偏置的策略。整体框架如图2所示。

3.1. REALM’s generative process

3.1. REALM的生成过程

For both pre-training and fine-tuning, REALM takes some input $x$ and learns a distribution $p(y\mid x)$ over possible outputs $y$ . For pre-training, the task is masked language modeling: $x$ is a sentence from a pre-training corpus $\mathcal{X}$ with some tokens masked out, and the model must predict the value of those missing tokens, $y$ . For fine-tuning, the task is Open-QA: $x$ is a question, and $y$ is the answer.

在预训练和微调阶段,REALM都会接收输入$x$并学习可能输出$y$的分布$p(y\mid x)$。预训练任务采用掩码语言建模:$x$是预训练语料库$\mathcal{X}$中一条部分token被遮蔽的句子,模型需预测这些缺失token的值$y$。微调任务采用开放域问答(Open-QA):$x$为问题,$y$为答案。

REALM decomposes $p(y\mid x)$ into two steps: retrieve, then predict. Given an input $x$ , we first retrieve possibly helpful documents $z$ from a knowledge corpus $\mathcal{Z}$ . We model this as a sample from the distribution $p(z\mid x)$ . Then, we condition on both the retrieved $z$ and the original input $x$ to generate the output $y$ —modeled as $p(y\mid z,x)$ . To obtain the overall likelihood of generating $y$ , we treat $z$ as a latent variable and marginalize over all possible documents $z$ , yielding

REALM将$p(y\mid x)$分解为两个步骤:检索,然后预测。给定输入$x$,我们首先从知识库$\mathcal{Z}$中检索可能有帮助的文档$z$,将其建模为分布$p(z\mid x)$的采样。接着,基于检索到的$z$和原始输入$x$生成输出$y$——建模为$p(y\mid z,x)$。为计算生成$y$的总体似然,我们将$z$视为隐变量并对所有可能的文档$z$进行边缘化,得到

$$
p(y\mid x)=\sum_{z\in{\mathcal{Z}}}p(y\mid z,x)p(z\mid x).
$$


3.2. Model architecture

3.2. 模型架构

We now describe the two key components: the neural knowledge retriever, which models $p(z\mid x)$ , and the knowledge-augmented encoder, which models $p(y\mid z,x)$ .

我们现在描述两个关键组件:神经知识检索器(建模 $p(z\mid x)$ )和知识增强编码器(建模 $p(y\mid z,x)$ )。

Knowledge Retriever The retriever is defined using a dense inner product model:

知识检索器
该检索器采用密集内积模型定义:

$$
\begin{aligned}
p(z \mid x) &= \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}, \\
f(x, z) &= \mathrm{Embed}_{\mathrm{input}}(x)^{\top}\,\mathrm{Embed}_{\mathrm{doc}}(z),
\end{aligned}
$$


where $\mathrm{Embed}_{\mathrm{input}}$ and $\mathrm{Embed}_{\mathrm{doc}}$ are embedding functions that map $x$ and $z$ respectively to $d$-dimensional vectors. The relevance score $f(x,z)$ between $x$ and $z$ is defined as the inner product of the vector embeddings. The retrieval distribution is the softmax over all relevance scores.

其中 $\mathrm{Embed}_{\mathrm{input}}$ 和 $\mathrm{Embed}_{\mathrm{doc}}$ 是将 $x$ 和 $z$ 分别映射到 $d$ 维向量的嵌入函数。$x$ 和 $z$ 之间的相关性分数 $f(x,z)$ 定义为向量嵌入的内积。检索分布是所有相关性分数的 softmax。
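As a minimal illustration (not the paper's code), the retrieval distribution can be computed from precomputed query and document embeddings as below; the embedding functions themselves are described next:

```python
import numpy as np

def retrieval_distribution(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """p(z | x): softmax over inner-product relevance scores f(x, z)."""
    scores = doc_embs @ query_emb          # f(x, z) for every document z
    scores = scores - scores.max()         # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy example: a 4-dimensional query against 3 candidate documents.
query_emb = np.array([0.1, 0.3, -0.2, 0.5])
doc_embs = np.random.randn(3, 4)
print(retrieval_distribution(query_emb, doc_embs))  # three probabilities summing to 1
```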

We implement the embedding functions using BERT-style Transformers (Devlin et al., 2018). Following standard practices, we join spans of text by applying wordpiece tokenization, separating them with [SEP] tokens, prefixing a [CLS] token, and appending a final [SEP] token.

我们使用 BERT 风格的 Transformer (Devlin et al., 2018) 来实现嵌入函数。按照标准做法,我们通过应用 wordpiece tokenization 来连接文本片段,用 [SEP] token 分隔它们,前缀一个 [CLS] token,并附加一个最终的 [SEP] token。

$$
\begin{aligned}
\mathrm{join}_{\mathrm{BERT}}(x) &= [\mathrm{CLS}]\, x\, [\mathrm{SEP}] \\
\mathrm{join}_{\mathrm{BERT}}(x_1, x_2) &= [\mathrm{CLS}]\, x_1\, [\mathrm{SEP}]\, x_2\, [\mathrm{SEP}]
\end{aligned}
$$


As in Devlin et al. (2018), we pass this into a Transformer, which produces one vector for each token, including the vector corresponding to [CLS] which is used as a “pooled” representation of the sequence (denoted $\mathrm{BERT}_{\mathrm{CLS}}$). Finally, we perform a linear projection to reduce the dimensionality of the vector, denoted as a projection matrix $\mathbf{W}$:

如 Devlin 等人 (2018) 所述,我们将其输入 Transformer,该模型会为每个 Token 生成一个向量,包括对应 [CLS] 的向量(用作序列的"池化"表示,记为 $\mathrm{BERT}_{\mathrm{CLS}}$)。最后,我们执行线性投影以降维,该投影矩阵记为 $\mathbf{W}$:

$$
\begin{aligned}
\mathrm{Embed}_{\mathrm{input}}(x) &= \mathbf{W}_{\mathrm{input}}\,\mathrm{BERT}_{\mathrm{CLS}}(\mathrm{join}_{\mathrm{BERT}}(x)) \\
\mathrm{Embed}_{\mathrm{doc}}(z) &= \mathbf{W}_{\mathrm{doc}}\,\mathrm{BERT}_{\mathrm{CLS}}(\mathrm{join}_{\mathrm{BERT}}(z_{\mathrm{title}}, z_{\mathrm{body}}))
\end{aligned}
$$


where $z_{\mathrm{title}}$ is the document’s title and $z_{\mathrm{body}}$ is its body. We let $\theta$ denote all parameters associated with the retriever, which include the Transformer and projection matrices.

其中 $z_{\mathrm{title}}$ 表示文档标题,$z_{\mathrm{body}}$ 表示文档正文。我们用 $\theta$ 表示检索器所有相关参数,包括 Transformer 和投影矩阵。
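For illustration only, the joining convention and the CLS projection might look like the sketch below; `bert_cls` is a stand-in for a real BERT encoder (here it just returns a pseudo-random vector seeded by the text), and the projection matrices are randomly initialized:

```python
import numpy as np

def join_bert(x: str, x2: str | None = None) -> str:
    """join_BERT: wrap one or two wordpiece-tokenized spans with [CLS]/[SEP] markers."""
    return f"[CLS] {x} [SEP]" if x2 is None else f"[CLS] {x} [SEP] {x2} [SEP]"

def bert_cls(text: str, hidden: int = 768) -> np.ndarray:
    """Stand-in for BERT_CLS: the pooled [CLS] vector of a Transformer encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(hidden)

def embed_input(x: str, w_input: np.ndarray) -> np.ndarray:
    return w_input @ bert_cls(join_bert(x))

def embed_doc(z_title: str, z_body: str, w_doc: np.ndarray) -> np.ndarray:
    return w_doc @ bert_cls(join_bert(z_title, z_body))

d, hidden = 128, 768
W_input = np.random.randn(d, hidden)
W_doc = np.random.randn(d, hidden)
print(embed_input("What is the currency of the UK?", W_input).shape)  # (128,)
```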

Knowledge-Augmented Encoder Given an input $x$ and a retrieved document $z$ , the knowledge-augmented encoder defines $p(y\mid z,x)$ . We join $x$ and $z$ into a single sequence that we feed into a Transformer (distinct from the one used in the retriever). This allows us to perform rich cross-attention between $x$ and $z$ before predicting $y$ . See Figure 1 for a concrete example.

知识增强编码器
给定输入 $x$ 和检索到的文档 $z$,知识增强编码器定义了 $p(y\mid z,x)$。我们将 $x$ 和 $z$ 拼接为单一序列后输入到 Transformer (与检索器中使用的 Transformer 不同) 中。这样可以在预测 $y$ 之前,对 $x$ 和 $z$ 进行丰富的交叉注意力计算。具体示例见图 1。

At this stage, the architectures for pre-training and fine-tuning differ slightly. For the masked language model pre-training task, we must predict the original value of each [MASK] token in $x$ . To do so, we use the same masked language modeling (MLM) loss as in Devlin et al. (2018):

在当前阶段,预训练和微调的架构略有不同。对于掩码语言模型预训练任务,我们需要预测$x$中每个[MASK] token的原始值。为此,我们采用与Devlin等人(2018)相同的掩码语言建模(MLM)损失函数:


Figure 2. The overall framework of REALM. Left: Unsupervised pre-training. The knowledge retriever and knowledge-augmented encoder are jointly pre-trained on the unsupervised language modeling task. Right: Supervised fine-tuning. After the parameters of the retriever $(\theta)$ and encoder $(\phi)$ have been pre-trained, they are then fine-tuned on a task of primary interest, using supervised examples.

图 2: REALM的整体框架。左:无监督预训练。知识检索器和知识增强编码器在无监督语言建模任务上联合预训练。右:监督微调。当检索器 $(\theta)$ 和编码器 $(\phi)$ 的参数完成预训练后,它们会在目标任务上使用监督样本进行微调。

$$
\begin{aligned}
p(y \mid z, x) &= \prod_{j=1}^{J_x} p(y_j \mid z, x) \\
p(y_j \mid z, x) &\propto \exp\left(w_j^{\top}\,\mathrm{BERT}_{\mathrm{MASK}(j)}(\mathrm{join}_{\mathrm{BERT}}(x, z_{\mathrm{body}}))\right)
\end{aligned}
$$


where $\mathrm{BERT}_{\mathrm{MASK}(j)}$ denotes the Transformer output vector corresponding to the $j^{\mathrm{th}}$ masked token, $J_{x}$ is the total number of [MASK] tokens in $x$ , and $w_{j}$ is a learned word embedding for token $y_j$.

其中 $\mathrm{BERT}_{\mathrm{MASK}(j)}$ 表示与第 $j$ 个被掩码 token 对应的 Transformer 输出向量,$J_{x}$ 是 $x$ 中 [MASK] token 的总数,$w_{j}$ 是 token $y_j$ 的学习词嵌入。
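A hedged numpy sketch of this MLM head; `mask_outputs` is assumed to hold the Transformer output vectors at the [MASK] positions and `word_embs` the learned output word embeddings, neither of which is computed here:

```python
import numpy as np

def mlm_log_likelihood(mask_outputs: np.ndarray,       # (J_x, hidden): BERT_MASK(j) vectors
                       target_ids: np.ndarray,         # (J_x,): vocabulary ids of the true y_j
                       word_embs: np.ndarray) -> float: # (vocab, hidden): embeddings w_j
    """log p(y | z, x) = sum_j log softmax(word_embs @ BERT_MASK(j))[y_j]."""
    logits = mask_outputs @ word_embs.T                               # (J_x, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(log_probs[np.arange(len(target_ids)), target_ids].sum())
```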

For Open-QA fine-tuning, we wish to produce the answer string $y$ . Following previous reading comprehension work (Rajpurkar et al., 2016; Seo et al., 2016; Lee et al., 2016; Clark & Gardner, 2017), we will assume that the answer $y$ can be found as a contiguous sequence of tokens in some document $z$ . Let $S(z,y)$ be the set of spans matching $y$ in $z$ . Then we can define $p(y\mid z,x)$ as:

在开放问答(Open-QA)微调中,我们的目标是生成答案字符串$y$。借鉴先前阅读理解领域的研究工作 (Rajpurkar et al., 2016; Seo et al., 2016; Lee et al., 2016; Clark & Gardner, 2017),我们假设答案$y$可以作为连续token序列存在于某个文档$z$中。设$S(z,y)$为文档$z$中与$y$匹配的所有文本片段集合,则可将$p(y\mid z,x)$定义为:

$$
\begin{aligned}
p(y \mid z, x) &\propto \sum_{s \in S(z, y)} \exp\left(\mathrm{MLP}\left(\left[h_{\mathrm{START}(s)}; h_{\mathrm{END}(s)}\right]\right)\right) \\
h_{\mathrm{START}(s)} &= \mathrm{BERT}_{\mathrm{START}(s)}(\mathrm{join}_{\mathrm{BERT}}(x, z_{\mathrm{body}})), \\
h_{\mathrm{END}(s)} &= \mathrm{BERT}_{\mathrm{END}(s)}(\mathrm{join}_{\mathrm{BERT}}(x, z_{\mathrm{body}})),
\end{aligned}
$$


where $\mathrm{BERT}_{\mathrm{START}(s)}$ and $\mathrm{BERT}_{\mathrm{END}(s)}$ denote the Transformer output vectors corresponding to the start and end tokens of span $s$ , respectively, while MLP denotes a feed-forward neural network. We will let $\phi$ denote all parameters associated with the knowledge-augmented encoder.

其中 $\mathrm{BERT}_{\mathrm{START}(s)}$ 和 $\mathrm{BERT}_{\mathrm{END}(s)}$ 分别表示与跨度 $s$ 的起始和结束 Token 对应的 Transformer 输出向量,而 MLP 表示前馈神经网络。我们将用 $\phi$ 表示与知识增强编码器相关的所有参数。
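A sketch of the span-scoring head under the same caveats; `token_vecs` stands for the encoder outputs over join_BERT(x, z_body), and the single-hidden-layer MLP is an illustrative choice rather than the paper's exact architecture:

```python
import numpy as np

def span_logit(token_vecs: np.ndarray, start: int, end: int,
               w1: np.ndarray, w2: np.ndarray) -> float:
    """MLP([h_START(s); h_END(s)]) for one candidate span s = (start, end)."""
    h = np.concatenate([token_vecs[start], token_vecs[end]])
    return float(w2 @ np.tanh(w1 @ h))

def answer_score(token_vecs: np.ndarray, matching_spans: list,
                 w1: np.ndarray, w2: np.ndarray) -> float:
    """Unnormalized p(y | z, x): sum of exp(span logit) over the spans in S(z, y)."""
    return sum(np.exp(span_logit(token_vecs, s, e, w1, w2)) for s, e in matching_spans)
```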

3.3. Training

3.3. 训练

For both pre-training and fine-tuning, we train by maximizing the log-likelihood $\log p(y\mid x)$ of the correct output $y$ . Since both the knowledge retriever and knowledge-augmented encoder are differentiable neural networks, we can compute the gradient of $\log p(y\mid x)$ (defined in Equation 1) with respect to the model parameters $\theta$ and $\phi$ , and optimize using stochastic gradient descent.

在预训练和微调阶段,我们都通过最大化正确输出$y$的对数似然$\log p(y\mid x)$进行训练。由于知识检索器和知识增强编码器都是可微分神经网络,我们可以计算$\log p(y\mid x)$(公式1定义)对模型参数$\theta$和$\phi$的梯度,并使用随机梯度下降进行优化。

The key computational challenge is that the marginal probability ${p(y \mid x)=\sum_{z\in\mathcal{Z}}p(y \mid x,z)~p(z \mid x)}$ involves a summation over all documents $z$ in the knowledge corpus $\mathcal{Z}$ . We approximate this by instead summing over the top $k$ documents with highest probability under $p(z\mid x)$ —this is reasonable if most documents have near zero probability.

关键计算挑战在于边缘概率 ${p(y \mid x)=\sum_{z\in\mathcal{Z}}p(y \mid x,z)~p(z \mid x)}$ 需要对知识库 $\mathcal{Z}$ 中所有文档 $z$ 进行求和。我们通过仅对 $p(z\mid x)$ 概率最高的前 $k$ 个文档求和来近似计算——这在大多数文档概率接近零时是合理的。

Even with this approximation, we still need an efficient way to find the top $k$ documents. Note that the ordering of documents under $p(z\mid x)$ is the same as under the relevance score $f(x,z)=\mathrm{Embed}_{\mathrm{input}}(x)^{\top}\mathrm{Embed}_{\mathrm{doc}}(z)$ , which is an inner product. Thus, we can employ Maximum Inner Product Search (MIPS) algorithms to find the approximate top $k$ documents, using running time and storage space that scale sub-linearly with the number of documents (Ram & Gray, 2012; Shrivastava & Li, 2014; Shen et al., 2015).

即使采用这种近似方法,我们仍需一种高效的方式来找出前 $k$ 个文档。需要注意的是,文档在 $p(z\mid x)$ 下的排序与相关性评分 $f(x,z)=\mathrm{Embed}_{\mathrm{input}}(x)^{\top}\mathrm{Embed}_{\mathrm{doc}}(z)$ 下的排序相同,这是一个内积运算。因此,我们可以采用最大内积搜索 (MIPS) 算法来近似找出前 $k$ 个文档,其运行时间和存储空间随文档数量呈次线性增长 (Ram & Gray, 2012; Shrivastava & Li, 2014; Shen et al., 2015)。
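The sketch below uses exact (brute-force) inner-product search as a stand-in for a MIPS library, then approximates the marginal with the top-k documents, renormalizing p(z | x) over that set for illustration; `p_y_given_zx` is a hypothetical callable returning p(y | z, x):

```python
import numpy as np

def top_k_docs(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by inner product; a MIPS index approximates this sub-linearly."""
    scores = doc_embs @ query_emb
    return np.argpartition(-scores, k)[:k]

def approx_marginal_likelihood(query_emb, doc_embs, k, p_y_given_zx):
    """p(y | x) ≈ sum over the top-k documents of p(y | z, x) p(z | x)."""
    idx = top_k_docs(query_emb, doc_embs, k)
    scores = doc_embs[idx] @ query_emb
    p_z = np.exp(scores - scores.max())
    p_z = p_z / p_z.sum()                  # p(z | x) renormalized over the top k
    return float(sum(p_z[i] * p_y_given_zx(z) for i, z in enumerate(idx)))
```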

To employ MIPS, we must pre-compute $\mathrm{Embed}_{\mathrm{doc}}(z)$ for every $z\in{\mathcal{Z}}$ and construct an efficient search index over these embeddings. However, this data structure will no longer be consistent with $p(z\mid x)$ if the parameters $\theta$ of $\mathrm{Embed}_{\mathrm{doc}}$ are later updated. Hence, the search index goes “stale” after every gradient update on $\theta$ .

要使用MIPS,我们必须为每个$z\in{\mathcal{Z}}$预计算$\mathrm{Embed}_{\mathrm{doc}}(z)$,并在这些嵌入上构建高效的搜索索引。然而,如果后续更新了$\mathrm{Embed}_{\mathrm{doc}}$的参数$\theta$,该数据结构将不再与$p(z\mid x)$保持一致。因此,每次对$\theta$进行梯度更新后,搜索索引都会变得"过时"。

Our solution is to “refresh” the index by asynchronously re-embedding and re-indexing all documents every several hundred training steps. The MIPS index is slightly stale between refreshes, but note that it is only used to select the top $k$ documents. We recompute $p(z\mid x)$ and its gradient, using the fresh $\theta$ , for these top $k$ documents after retrieving them. In Section 4.5, we empirically demonstrate that this procedure results in stable optimization, provided that refreshes happen at a sufficiently frequent rate.

我们的解决方案是通过每几百个训练步骤异步重新嵌入和重新索引所有文档来"刷新"索引。在两次刷新之间,MIPS索引会略微过时,但请注意它仅用于选择前$k$个文档。在检索到这些前$k$个文档后,我们会使用最新的$\theta$重新计算$p(z\mid x)$及其梯度。在第4.5节中,我们通过实验证明,只要刷新频率足够高,这一过程就能实现稳定的优化。

Implementing asynchronous MIPS refreshes We asynchronously refresh the MIPS index by running two jobs in parallel: a primary trainer job, which performs gradient updates on the parameters, and a secondary index builder job, which embeds and indexes the documents. As shown below, the trainer sends the index builder a snapshot of its parameters, $\theta^{\prime}$ . The trainer then continues to train while the index builder uses $\theta^{\prime}$ to construct a new index in the background. As soon as the index builder is done, it sends the new index back to the trainer, and the process repeats.

实现异步MIPS刷新
我们通过并行运行两个任务来异步刷新MIPS索引:主训练任务(对参数执行梯度更新)和辅助索引构建任务(对文档进行嵌入和索引)。如下所示,训练器向索引构建器发送其参数快照$\theta^{\prime}$。随后训练器继续训练,而索引构建器在后台使用$\theta^{\prime}$构建新索引。索引构建完成后,立即将新索引发回训练器,该过程循环进行。


Figure 3. REALM pre-training with asynchronous MIPS refreshes.

图 3: 采用异步MIPS刷新的REALM预训练过程
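A minimal single-process sketch of this loop, assuming hypothetical `train_step` and `build_index` functions (the real system runs these as separate trainer and index-builder jobs):

```python
import copy
import queue
import threading

def realm_pretrain(params, train_step, build_index, refresh_every=500, total_steps=5000):
    """Trainer updates params every step; an index builder rebuilds the MIPS index from
    a parameter snapshot in the background, so the index is only slightly stale."""
    fresh_indices: queue.Queue = queue.Queue()
    index = build_index(copy.deepcopy(params))              # initial index
    for step in range(total_steps):
        if step % refresh_every == 0:                       # send a snapshot θ' to the builder
            snapshot = copy.deepcopy(params)
            threading.Thread(
                target=lambda s=snapshot: fresh_indices.put(build_index(s)),
                daemon=True,
            ).start()
        if not fresh_indices.empty():                        # swap in a freshly built index
            index = fresh_indices.get()
        params = train_step(params, index)                   # gradient update on θ, φ
    return params
```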

While asynchronous refreshes can be used for both pre-training and fine-tuning, in our experiments we only use it for pre-training. For fine-tuning, we just build the MIPS index once (using the pre-trained $\theta$ ) for simplicity and do not update $\mathrm{Embed}_{\mathrm{doc}}$. Note that we still fine-tune $\mathrm{Embed}_{\mathrm{input}}$, so the retrieval function is still updated from the query side.

虽然异步刷新可以同时用于预训练和微调阶段,但在我们的实验中仅将其用于预训练。为简化流程,微调时我们仅构建一次MIPS索引(使用预训练的$\theta$)且不更新$\mathrm{Embed}_{\mathrm{doc}}$。需要注意的是,我们仍会对$\mathrm{Embed}_{\mathrm{input}}$进行微调,因此检索功能仍会从查询端持续更新。

What does the retriever learn? Since the knowledge retrieval of REALM is latent, it is not obvious how the training objective encourages meaningful retrievals. Here, we show how it rewards retrievals that improve prediction accuracy.

检索器学到了什么?由于REALM的知识检索是隐式的,训练目标如何促进有意义的检索并不明显。在此,我们展示它如何奖励能提高预测准确率的检索。

For a given query $x$ and document $z$ , recall that $f(x,z)$ is the “relevance score” that the knowledge retriever assigns to document $z$ . We can see how a single step of gradient descent during REALM pre-training alters this score by analyzing the gradient with respect to the parameters of the knowledge retriever, $\theta$ :

对于给定查询$x$和文档$z$,回顾$f(x,z)$是知识检索器赋予文档$z$的"相关性分数"。通过分析知识检索器参数$\theta$的梯度,我们可以观察到REALM预训练期间单步梯度下降如何改变这一分数:

$$
\begin{aligned}
\nabla \log p(y \mid x) &= \sum_{z \in \mathcal{Z}} r(z)\, \nabla f(x, z) \\
r(z) &= \left[\frac{p(y \mid z, x)}{p(y \mid x)} - 1\right] p(z \mid x).
\end{aligned}
$$


For each document $z$ , the gradient encourages the retriever to change the score $f(x,z)$ by $r(z)$ — increasing if $r(z)$ is positive, and decreasing if negative. The multiplier $r(z)$ is positive if and only if $p(y\mid z,x)>p(y\mid x)$ . The term $p(y\mid z,x)$ is the probability of predicting the correct output $y$ when using document $z$ . The term $p(y\mid x)$ is the expected value of $p(\boldsymbol{y}\mid\boldsymbol{x},z)$ when randomly sampling a document from $p(z\mid x)$ . Hence, document $z$ receives a positive update whenever it performs better than expected.

对于每个文档 $z$,梯度促使检索器根据 $r(z)$ 调整分数 $f(x,z)$——当 $r(z)$ 为正时增加,为负时减少。乘数 $r(z)$ 为正当且仅当 $p(y\mid z,x)>p(y\mid x)$。其中 $p(y\mid z,x)$ 表示使用文档 $z$ 时预测正确输出 $y$ 的概率,而 $p(y\mid x)$ 是从 $p(z\mid x)$ 随机采样文档时 $p(\boldsymbol{y}\mid\boldsymbol{x},z)$ 的期望值。因此,当文档 $z$ 的表现优于预期时,它就会获得正向更新。
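For completeness, the identity can be checked in a few lines using the softmax form of $p(z\mid x)$; this short derivation is a sketch added here, not part of the original text:

$$
\begin{aligned}
\nabla \log p(y \mid x)
  &= \sum_{z \in \mathcal{Z}} \frac{p(y \mid z, x)}{p(y \mid x)}\, \nabla p(z \mid x),
  \qquad
  \nabla p(z \mid x) = p(z \mid x)\Big[\nabla f(x, z) - \sum_{z'} p(z' \mid x)\, \nabla f(x, z')\Big] \\
  &= \sum_{z} \frac{p(y \mid z, x)\, p(z \mid x)}{p(y \mid x)}\, \nabla f(x, z)
     \;-\; \sum_{z'} p(z' \mid x)\, \nabla f(x, z')
  \qquad \Big(\text{using } \textstyle\sum_{z} \tfrac{p(y \mid z, x)\, p(z \mid x)}{p(y \mid x)} = 1\Big) \\
  &= \sum_{z} \Big[\frac{p(y \mid z, x)}{p(y \mid x)} - 1\Big]\, p(z \mid x)\, \nabla f(x, z)
   \;=\; \sum_{z} r(z)\, \nabla f(x, z).
\end{aligned}
$$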

3.4. Injecting inductive biases into pre-training

3.4. 将归纳偏置注入预训练

In the process of developing REALM, we discovered several additional strategies that further guide the model towards meaningful retrievals, described below.

在开发REALM的过程中,我们发现了以下几种额外策略,可进一步引导模型实现有意义的检索:

Salient span masking During REALM pre-training, we want to focus on examples $x$ that require world knowledge to predict the masked tokens. As explained in Section 2, some MLM spans only require local context. To focus on problems that require world knowledge, we mask salient spans such as “United Kingdom” or “July 1969”. We use a BERT-based tagger trained on CoNLL-2003 data (Sang & De Meulder, 2003) to identify named entities, and a regular expression to identify dates. We select and mask one of these salient spans within a sentence for the masked language modeling task. We show that this significantly outperforms other masking strategies in Section 4.5.

显著跨度掩码
在REALM预训练过程中,我们希望专注于那些需要世界知识来预测被掩码token的样本$x$。如第2节所述,某些MLM(掩码语言模型)跨度仅需局部上下文即可预测。为了聚焦需要世界知识的问题,我们掩码诸如"United Kingdom"或"July 1969"之类的显著跨度。我们使用基于BERT的标注器(在CoNLL-2003数据上训练)(Sang & De Meulder, 2003)来识别命名实体,并通过正则表达式识别日期。在掩码语言建模任务中,我们选择并掩码句子中的一个显著跨度。第4.5节将证明,该方法显著优于其他掩码策略。
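As a hedged illustration, the date part could use a regular expression like the one below; the named-entity part would come from the separately trained BERT tagger, which is only stubbed out here:

```python
import re

# Illustrative date pattern: "July 1969"-style month-year mentions or bare 4-digit years.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,4}\b|\b\d{4}\b"
)

def find_salient_spans(sentence: str) -> list:
    """Character spans of dates (regex); named entities would be added by a NER tagger."""
    spans = [(m.start(), m.end()) for m in DATE_RE.finditer(sentence)]
    # spans += ner_tagger(sentence)   # hypothetical CoNLL-2003-trained BERT tagger
    return spans

def mask_one_salient_span(sentence: str) -> str:
    """Replace one salient span with [MASK], as in salient span masking."""
    spans = find_salient_spans(sentence)
    if not spans:
        return sentence
    start, end = spans[0]
    return sentence[:start] + "[MASK]" + sentence[end:]

print(mask_one_salient_span("Apollo 11 landed on the Moon in July 1969."))
# -> Apollo 11 landed on the Moon in [MASK].
```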

Null document Even with salient span masking, not all masked tokens require world knowledge to predict. We model this by adding an empty null document $\varnothing$ to the top $k$ retrieved documents, allowing appropriate credit to be assigned to a consistent sink when no retrieval is necessary.

空文档
即使采用显著跨度掩码,也并非所有被掩码的token都需要借助世界知识来预测。为此,我们在检索到的前$k$篇文档中额外加入一个空文档$\varnothing$,使得在无需检索时,相应的功劳可以稳定地归入这个空文档。

Prohibiting trivial retrievals If the pre-training corpus $\mathcal{X}$ and the knowledge corpus $\mathcal{Z}$ are the same, there exists a trivial retrieval candidate $z$ that is too informative: if the masked sentence $x$ comes from document $z$ , the knowledge augmented encoder can trivially predict $y$ by looking at the unmasked version of $x$ in $z$ . This results in a large positive gradient for $p(z\mid x)$ . If this occurs too often, the knowledge retriever ends up learning to look for exact string matches between $x$ and $z$ , which does not capture other forms of relevance. For this reason, we exclude this trivial candidate during pre-training.

禁止简单检索
如果预训练语料 $\mathcal{X}$ 和知识语料 $\mathcal{Z}$ 相同,会存在一个信息量过大的简单检索候选 $z$:当被掩码的句子 $x$ 来自文档 $z$ 时,知识增强编码器可以通过查看 $z$ 中未掩码版本的 $x$ 来轻易预测 $y$。这会导致 $p(z\mid x)$ 产生较大的正梯度。如果这种情况频繁发生,知识检索器最终会学习寻找 $x$ 和 $z$ 之间的精确字符串匹配,而无法捕捉其他形式的相关性。因此,我们在预训练期间排除了这种简单候选。
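In practice this exclusion amounts to a simple filter on the candidate list; a sketch with document ids standing in for the actual corpus entries:

```python
def exclude_trivial_candidate(candidate_doc_ids, source_doc_id):
    """Drop the document that the masked sentence x was sampled from (the X = Z case)."""
    return [doc_id for doc_id in candidate_doc_ids if doc_id != source_doc_id]

assert exclude_trivial_candidate([3, 7, 42], source_doc_id=7) == [3, 42]
```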

Initialization At the beginning of training, if the retriever does not have good embeddings for $\mathrm{Embed}_{\mathrm{input}}(x)$ and $\mathrm{Embed}_{\mathrm{doc}}(z)$, the retrieved documents $z$ will likely be unrelated to $x$ . This causes the knowledge-augmented encoder to learn to ignore the retrieved documents. Once this occurs, the knowledge retriever does not receive a meaningful gradient and cannot improve, creating a vicious cycle. To avoid this cold-start problem, we warm-start $\mathrm{Embed}_{\mathrm{input}}$ and $\mathrm{Embed}_{\mathrm{doc}}$ using a simple training objective known as the Inverse Cloze Task (ICT) where, given a sentence, the model is trained to retrieve the document where that sentence came from. We defer to Lee et al. (2019) for details. For the knowledge-augmented encoder, we warm-start it with BERT pre-training: specifically, the uncased BERT-base model (12 layers, 768 hidden units, 12 attention heads).

初始化
在训练开始时,如果检索器无法为$\mathrm{Embed}_{\mathrm{input}}(x)$和$\mathrm{Embed}_{\mathrm{doc}}(z)$生成优质嵌入向量,检索到的文档$z$很可能与$x$无关。这会导致知识增强编码器学会忽略检索到的文档。一旦发生这种情况,知识检索器就无法获得有意义的梯度信号而无法改进,从而形成恶性循环。为避免这种冷启动问题,我们采用逆完形填空任务(Inverse Cloze Task, ICT)这一简单训练目标对$\mathrm{Embed}_{\mathrm{input}}$和$\mathrm{Embed}_{\mathrm{doc}}$进行预热初始化——该任务要求模型根据给定句子检索出其原始出处文档。具体实现细节可参考Lee et al. (2019)。对于知识增强编码器,我们采用BERT预训练进行预热初始化,具体使用的是无大小写区分的BERT-base模型(12层网络结构,768维隐藏单元,12个注意力头)。

4. Experiments

4. 实验

We now evaluate our approach on the Open-QA task. In this section, we describe in detail the benchmarks used and the different approaches to which we compare empirically.

我们现在在开放问答(Open-QA)任务上评估我们的方法。本节将详细描述所使用的基准测试以及我们通过实证比较的不同方法。

4.1. Open-QA Benchmarks

4.1. 开放问答基准测试

A number of benchmarks have been proposed for Open-QA. In this work, we focus on datasets where the question writers did not already know the answer. This yields questions that reflect more realistic information-seeking needs, and also avoids artifacts that can arise if the question is formulated with a particular answer in mind. A deeper justification is given in Lee et al. (2019). In all cases, the predicted answer is evaluated via exact match with any reference answer, following previous Open-QA work (Chen et al., 2017).

针对开放领域问答(OpenQA)已提出多项基准测试。本研究重点关注问题提出者事先不知道答案的数据集。这种做法能产生更贴近真实信息需求的问题,同时避免了因预设特定答案而导致的问题表述偏差。Lee等人(2019)对此进行了更深入的论证。所有实验均遵循先前开放领域问答研究(Chen等人,2017)的做法,通过精确匹配参考答案来评估预测答案。

Natural Questions-Open The Natural Questions dataset (Kwiatkowski et al., 2019) consists of naturally occurring Google queries and their answers. Each answer also comes with an “answer type”: following Lee et al. (2019), we only keep questions that are categorized as “short answer type” with at most five tokens. The dataset also provides a suggested Wikipedia document to retrieve; like all models we compare against, we do not provide this to our model.

Natural Questions-Open
Natural Questions数据集 (Kwiatkowski等人, 2019) 包含自然产生的Google查询及其答案。每个答案还附带一个"答案类型":参照Lee等人 (2019) 的方法,我们仅保留被归类为"短答案类型"且最多包含五个token的问题。该数据集还提供了建议检索的维基百科文档;与我们对比的所有模型一样,我们不会向模型提供该文档。

Web Questions The Web Questions dataset (Berant et al., 2013) was collected from the Google Suggest API, using one seed question and expanding the set to related questions. We follow the setting defined by Chen et al. (2017).

Web Questions数据集(Berant等人,2013)通过Google Suggest API收集,使用一个种子问题并扩展至相关问题集。我们采用Chen等人(2017)定义的实验设置。

CuratedTrec The CuratedTrec dataset is a collection of question-answer pairs drawn from real user queries issued on sites such as MSNSearch and AskJeeves. To account for multiple correct answers or different spelling variations, the answers in this dataset are defined as regular expressions that match all correct answers. It is unclear how to train generation-based models with this type of supervision, so we do not evaluate them on this dataset.

CuratedTrec 数据集
CuratedTrec 数据集是从MSNSearch和AskJeeves等网站真实用户查询中提取的问答对集合。为应对多个正确答案或不同拼写变体,该数据集中的答案被定义为匹配所有正确答案的正则表达式。目前尚不清楚如何利用此类监督训练基于生成的模型,因此我们未在该数据集上评估它们。

4.2. Approaches compared

4.2. 对比方法

Retrieval-based Open-QA Most existing Open-QA systems answer the input question by first retrieving potentially relevant documents from a knowledge corpus, and then using a reading comprehension system to extract an answer from the documents. In this paradigm, the knowledge is stored explicitly in the corpus. We wish to compare different methods for implementing retrieval.

基于检索的开放问答
现有大多数开放问答系统通过两个步骤回答问题:先从知识库中检索可能相关的文档,再用阅读理解系统从文档中提取答案。这种范式将知识显式存储在知识库中。我们将对比不同的检索实现方法。

Many approaches use non-learned heuristic retrieval such as sparse bag-of-words matching (Robertson et al., 2009) or entity linking on the question to select a small set of relevant documents (e.g., 20). These documents are typically then re-ranked using a learned model, but coverage may be limited by the initial heuristic retrieval step. Approaches such as DrQA (Chen et al., 2017), HardEM (Min et al., 2019a), Graph Retriever (Min et al., 2019b), and PathRetriever (Asai et al., 2019) in Table 1 are in this category.

许多方法采用非学习的启发式检索技术,例如基于稀疏词袋匹配 (Robertson et al., 2009) 或问题实体链接来筛选少量相关文档 (例如20篇)。这些文档通常随后通过学习模型进行重排序,但覆盖范围可能受限于初始启发式检索步骤。表1中的DrQA (Chen et al., 2017)、HardEM (Min et al., 2019a)、Graph Retriever (Min et al., 2019b) 和PathRetriever (Asai et al., 2019) 等方法均属于此类。

Some recent approaches have proposed to implement learnable retrieval using a MIPS index. ORQA (Lee et al., 2019) formulates Open-QA using a similar latent variable model as REALM, and also trains by maximizing the marginal likelihood. However, REALM adds a novel language model pre-training step, and backpropagates into the MIPS index, rather than using a fixed index. In Table 1, we directly compare the two. It is also important to note that the retrievers for both REALM pre-training and ORQA are initialized using the Inverse Cloze Task, described in Section 3.4.

一些近期研究提出使用MIPS索引实现可学习的检索机制。ORQA (Lee et al., 2019) 采用与REALM类似的隐变量模型构建开放域问答系统,同样通过最大化边缘似然进行训练。但REALM引入了创新的语言模型预训练步骤,并通过反向传播优化MIPS索引而非使用固定索引。表1中我们直接对比了两者性能。值得注意的是,REALM预训练和ORQA的检索器均采用逆完形填空任务(详见3.4节)进行初始化。

Generation-based Open-QA An emerging alternative approach to Open-QA is to model it as a sequence prediction task: simply encode the question, and then decode the answer token-by-token based on the encoding. While it was initially unclear how large amounts of knowledge could be injected into the model, GPT-2 (Radford et al., 2019) hinted at the possibility of directly generating answers without using any given context via sequence-to-sequence learning. However, their performance was not competitive, possibly due to the lack of fine-tuning. Orthogonally, T5 (Raffel et al., 2019) showed that directly generating answers without explicit extraction from the given context is a viable approach, but they only experimented on the reading comprehension task, where a context document is provided.

基于生成的开放问答
开放问答的一种新兴替代方法是将其建模为序列预测任务:先对问题进行编码,然后基于编码逐个token解码生成答案。虽然最初尚不清楚如何将大量知识注入模型,但GPT-2 (Radford et al., 2019) 暗示了不借助给定上下文、直接通过序列到序列生成答案的可能性。不过由于缺乏微调,其性能表现欠佳。另一方面,T5 (Raffel et al., 2019) 证明了无需从给定上下文中显式提取、直接生成答案的可行性,但他们仅在提供上下文文档的阅读理解任务上进行了实验。

For the most competitive and comparable generation-based baseline, we compare to concurrent work which fine-tunes T5 for Open-QA (Roberts et al., 2020). We compare against the Base, Large, and even larger 11-billion parameter model to measure the effect of model size.

为了进行最具竞争力和可比性的基于生成的基线对比,我们与同期工作进行了比较,该工作针对开放问答(Open-QA)对T5进行了微调(Roberts等人,2020)。我们对比了Base、Large甚至更大的110亿参数模型,以衡量模型规模的影响。

4.3. Implementation Details

4.3. 实现细节

Fine-tuning We reuse all hyperparameters from Lee et al. (2019), to enable direct comparison. Our knowledge corpus is derived from the December 20, 2018 snapshot of English Wikipedia. Documents are greedily split into chunks of up to 288 BERT wordpieces, resulting in just over 13 million retrieval candidates. During fine-tuning inference, we consider the top-5 candidates, and the entire model can be run on a single machine with a 12GB GPU.

微调
我们复用Lee等人 (2019) 的所有超参数以实现直接对比。知识库来源于2018年12月20日的英文维基百科快照。文档通过贪心算法分割成最多288个BERT词片段 (wordpieces) 的块,最终生成略超1300万个检索候选项。在微调推理阶段,我们考虑前5个候选结果,整个模型可在配备12GB GPU的单台机器上运行。
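A simplified sketch of the chunking step (it slices at fixed wordpiece boundaries, whereas the actual pipeline packs text greedily); `wordpiece_tokenize` is a hypothetical stand-in for BERT's tokenizer:

```python
def greedy_chunks(wordpieces, max_len=288):
    """Split a wordpiece-tokenized document into chunks of at most max_len pieces."""
    return [wordpieces[i:i + max_len] for i in range(0, len(wordpieces), max_len)]

# tokens = wordpiece_tokenize(wikipedia_article)   # hypothetical tokenizer call
tokens = [f"tok{i}" for i in range(1000)]          # toy document of 1000 wordpieces
chunks = greedy_chunks(tokens)
print(len(chunks), len(chunks[0]))                 # 4 chunks, the first with 288 pieces
```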

Table 1. Test results on Open-QA benchmarks. The number of train/test examples are shown in parentheses below each benchmark. Predictions are evaluated with exact match against any reference answer. Sparse retrieval denotes methods that use sparse features such as TF-IDF and BM25. Our model, REALM, outperforms all existing systems.

| Name | Architectures | Pre-training | NQ (79k/4k) | WQ (3k/2k) | CT (1k/1k) | # params |
|---|---|---|---|---|---|---|
| BERT-Baseline (Lee et al., 2019) | Sparse Retr. + Transformer | BERT | 26.5 | 17.7 | 21.3 | 110m |
| T5 (base) (Roberts et al., 2020) | Transformer Seq2Seq | T5 (Multitask) | 27.0 | 29.1 | - | 223m |
| T5 (large) (Roberts et al., 2020) | Transformer Seq2Seq | T5 (Multitask) | 29.8 | 32.2 | - | 738m |
| T5 (11b) (Roberts et al., 2020) | Transformer Seq2Seq | T5 (Multitask) | 34.5 | 37.4 | - | 11318m |
| DrQA (Chen et al., 2017) | Sparse Retr. + DocReader | N/A | - | 20.7 | 25.7 | 34m |
| HardEM (Min et al., 2019a) | Sparse Retr. + Transformer | BERT | 28.1 | - | - | 110m |
| GraphRetriever (Min et al., 2019b) | GraphRetriever + Transformer | BERT | 31.8 | 31.6 | - | 110m |
| PathRetriever (Asai et al., 2019) | PathRetriever + Transformer | MLM | 32.6 | - | - | 110m |
| ORQA (Lee et al., 2019) | Dense Retr. + Transformer | ICT + BERT | 33.3 | 36.4 | 30.1 | 330m |
| Ours (X = Wikipedia, Z = Wikipedia) | Dense Retr. + Transformer | REALM | 39.2 | 40.2 | 46.8 | 330m |
| Ours (X = CC-News, Z = Wikipedia) | Dense Retr. + Transformer | | | | | |