REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu * 1 Kenton Lee * 1 Zora Tung 1 Panupong Pasupat 1 Ming-Wei Chang
Abstract
Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.

Figure 1. REALM augments language model pre-training with a neural knowledge retriever that retrieves knowledge from a textual knowledge corpus, $\mathcal{Z}$ (e.g., all of Wikipedia). Signal from the language modeling objective backpropagates all the way through the retriever, which must consider millions of documents in $\mathcal{Z}$, a significant computational challenge that we address.
1. Introduction
Recent advances in language model pre-training have shown that models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2019) store a surprising amount of world knowledge, acquired from the massive text corpora they are trained on (Petroni et al., 2019). For example, BERT is able to correctly predict the missing word in the following sentence: “The ___ is the currency of the United Kingdom” (answer: “pound”).
In these language models, the learned world knowledge is stored implicitly in the parameters of the underlying neural network. This makes it difficult to determine what knowledge is stored in the network and where. Furthermore, storage space is limited by the size of the network—to capture more world knowledge, one must train ever-larger networks, which can be prohibitively slow or expensive.
To capture knowledge in a more interpretable and modular way, we propose a novel framework, Retrieval-Augmented Language Model (REALM) pre-training, which augments language model pre-training algorithms with a learned textual knowledge retriever. In contrast to models that store knowledge in their parameters, this approach explicitly exposes the role of world knowledge by asking the model to decide what knowledge to retrieve and use during inference. Before making each prediction, the language model uses the retriever to retrieve documents from a large corpus such as Wikipedia, and then attends over those documents to help inform its prediction. Learning this model end-to-end requires backpropagating through a retrieval step that considers an entire corpus of textual knowledge, as shown in Figure 1.
The key intuition of REALM is to train the retriever using a performance-based signal from unsupervised text: a retrieval that improves the language model’s perplexity is helpful and should be rewarded, while an uninformative retrieval should be penalized. For example, in Figure 1, if the model needs to fill the blank in “the ___ at the top of the pyramid”, the retriever should be rewarded for selecting a document containing “The pyramidion on top allows for less material higher up the pyramid”. We achieve this behavior by modeling our retrieve-then-predict approach as a latent variable language model and optimizing the marginal likelihood.
Incorporating a large-scale neural retrieval module during pre-training constitutes a significant computational challenge, since the retriever must consider millions of candidate documents for each pre-training step, and we must backpropagate through its decisions. To address this, we structure the retriever such that the computation performed for each document can be cached and asynchronously updated, and selection of the best documents can be formulated as Maximum Inner Product Search (MIPS).
Numerous prior works have demonstrated the benefit of adding a discrete retrieval step to neural networks (Miller et al., 2016; Chen et al., 2017), but did not apply the framework to language model pre-training and employed non-learned retrievers to handle large-scale document collections. In the language modeling literature, the $k$-Nearest Neighbor Language Model ($k$NN-LM) (Khandelwal et al., 2019) retrieves similar LM examples to improve memorization. However, $k$NN-LM was not fine-tuned for downstream tasks, perhaps because it is unclear how to adapt the retrieval mechanism: a $k$NN can only use examples labeled for the target task. During fine-tuning, this precludes LM examples, which contain the desired world knowledge. In contrast, REALM’s retriever is designed to transfer to other tasks, and the retrieval is just text, not a labeled example.
We evaluate our approach by fine-tuning the models pre-trained with REALM on the task of Open-domain Question Answering (Open-QA), one of the most knowledge-intensive tasks in natural language processing. We evaluate on three popular Open-QA benchmarks (NaturalQuestions-Open, WebQuestions, and CuratedTrec) and compare to state-of-the-art Open-QA models, including both extremely large models that store knowledge implicitly (such as T5) as well as previous approaches that also use a knowledge retriever to access external knowledge, but implement retrieval in a more heuristic fashion (Lee et al., 2019; Min et al., 2019a; Asai et al., 2019). REALM achieves new state-of-the-art results on all three benchmarks, significantly outperforming all previous systems by 4-16% absolute accuracy. We also demonstrate qualitative benefits of REALM, including interpretability and modularity.
2. Background
Language model pre-training The goal of language model pre-training is to learn useful representations of language, usually from unlabeled text corpora. The resulting pre-trained model can then be further trained (fine-tuned) for a downstream task of primary interest (in our case, Open-QA), often leading to better generalization than training from scratch (Dai & Le, 2015; Radford et al., 2019).
We focus on the masked language model (MLM) variant of pre-training popularized by BERT (Devlin et al., 2018). In its basic form, an MLM is trained to predict the missing tokens in an input text passage. Given an unlabeled pre-training corpus $\mathcal{X}$ (e.g., Wikipedia text), a training example $(x,y)$ can be generated by randomly masking tokens in a sampled piece of text (e.g., $x=$ “The [MASK] is the currency [MASK] the UK”; $y=$ (“pound”, “of”)). The model uses its representation of the masked input $x$ to predict the token that should go in each mask. A good MLM must learn to encode syntactic and semantic information (e.g., to predict “of”) as well as some world knowledge (e.g., to predict “pound”).
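As an illustration of how a training example $(x,y)$ might be generated, the following minimal Python sketch masks each token independently at random. The function name `make_mlm_example`, whitespace tokenization, and the default masking rate of 0.15 are assumptions made for illustration; BERT's actual procedure uses wordpiece tokens and additional replacement rules that are omitted here.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, rng=random):
    """Return (x, y): x has some tokens replaced by [MASK], y lists the originals.

    Simplified sketch: every token is masked independently with
    probability mask_prob (an assumed rate, not taken from this section).
    """
    x, y = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            x.append("[MASK]")   # hide the token in the input
            y.append(tok)        # remember it as a prediction target
        else:
            x.append(tok)
    return x, y

# Usage: whitespace splitting stands in for wordpiece tokenization.
tokens = "The pound is the currency of the UK".split()
x, y = make_mlm_example(tokens, mask_prob=0.3, rng=random.Random(0))
print(" ".join(x), y)
```

The targets in `y` appear in the same left-to-right order as the `[MASK]` positions in `x`, matching the pairing used in the example above.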
Open-domain question answering (Open-QA) To measure a model’s ability to incorporate world knowledge, we need a downstream task where world knowledge is critical. Perhaps one of the most knowledge-intensive tasks in natural language processing is open-domain question answering (Open-QA): given a question $x$ such as “What is the currency of the UK?”, a model must output the correct answer string $y$, “pound”. The “open” part of Open-QA refers to the fact that the model does not receive a pre-identified document that is known to contain the answer, unlike traditional reading comprehension (RC) tasks such as SQuAD (Rajpurkar et al., 2016; 2018). While RC models comprehend a single document, Open-QA models must retain knowledge from millions of documents, since a question could be about any of them.
We focus on Open-QA systems that utilize a textual knowledge corpus $\mathcal{Z}$ as the knowledge source. Many of these systems employ a retrieval-based approach: given a question $x$, retrieve potentially relevant documents $z$ from the corpus $\mathcal{Z}$, and then extract an answer $y$ from the documents (Brill et al., 2002; Chen et al., 2017; Lee et al., 2019). Our approach, REALM, is inspired by this paradigm and extends it to language model pre-training. Alternatively, some recent work has proposed generation-based systems that apply a sequence-to-sequence model on $x$ to directly generate $y$ token-by-token (Lewis et al., 2019; Raffel et al., 2019). We will compare against state-of-the-art systems from both paradigms in our experiments.
3. Approach
We start by formalizing REALM’s pre-training and fine-tuning tasks as a retrieve-then-predict generative process in Section 3.1. Then in Section 3.2, we describe the model architectures for each component of that process. In Section 3.3, we show how to implement REALM pre-training and fine-tuning by maximizing the likelihood of REALM’s generative process. En route, we address important computational challenges, explain why training works, and also discuss strategies for injecting useful inductive biases. The overall framework is illustrated in Figure 2.
3.1. REALM’s generative process
For both pre-training and fine-tuning, REALM takes some input $x$ and learns a distribution $p(y\mid x)$ over possible outputs $y$ . For pre-training, the task is masked language modeling: $x$ is a sentence from a pre-training corpus $\mathcal{X}$ with some tokens masked out, and the model must predict the value of those missing tokens, $y$ . For fine-tuning, the task is Open-QA: $x$ is a question, and $y$ is the answer.
REALM decomposes $p(y\mid x)$ into two steps: retrieve, then predict. Given an input $x$, we first retrieve possibly helpful documents $z$ from a knowledge corpus $\mathcal{Z}$. We model this as a sample from the distribution $p(z\mid x)$. Then, we condition on both the retrieved $z$ and the original input $x$ to generate the output $y$, modeled as $p(y\mid z,x)$. To obtain the overall likelihood of generating $y$, we treat $z$ as a latent variable and marginalize over all possible documents $z$, yielding
$$
p(y\mid x)=\sum_{z\in{\mathcal{Z}}}p(y\mid z,x)p(z\mid x).
$$
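The marginalization above can be checked on a toy corpus. The sketch below is only an illustration: the three-document corpus and all probability values are made up, and `marginal_likelihood` is a hypothetical helper name.

```python
def marginal_likelihood(p_z_given_x, p_y_given_zx):
    """Compute p(y|x) = sum_z p(y|z,x) * p(z|x) over a small corpus."""
    if abs(sum(p_z_given_x) - 1.0) > 1e-9:
        raise ValueError("p(z|x) must sum to 1")
    return sum(py * pz for py, pz in zip(p_y_given_zx, p_z_given_x))

# Toy numbers (assumed for illustration only).
p_z = [0.7, 0.2, 0.1]      # retrieval distribution p(z | x)
p_y = [0.9, 0.1, 0.05]     # encoder likelihood p(y | z, x) for each document
p_y_given_x = marginal_likelihood(p_z, p_y)
print(p_y_given_x)
```

Here the first document is both likely to be retrieved and helpful for predicting $y$, so it dominates the sum.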
3.2. Model architecture
We now describe the two key components: the neural knowledge retriever, which models $p(z\mid x)$ , and the knowledge-augmented encoder, which models $p(y\mid z,x)$ .
Knowledge Retriever The retriever is defined using a dense inner product model:
$$
\begin{aligned}
p(z \mid x) &= \frac{\exp f(x,z)}{\sum_{z'} \exp f(x,z')}, \\
f(x,z) &= \mathtt{Embed}_{\mathtt{input}}(x)^{\top}\,\mathtt{Embed}_{\mathtt{doc}}(z),
\end{aligned}
$$
where $\mathtt{Embed}_{\mathtt{input}}$ and $\mathtt{Embed}_{\mathtt{doc}}$ are embedding functions that map $x$ and $z$ respectively to $d$-dimensional vectors. The relevance score $f(x,z)$ between $x$ and $z$ is defined as the inner product of the vector embeddings. The retrieval distribution is the softmax over all relevance scores.
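To make the retrieval distribution concrete, here is a minimal sketch that takes already-computed embedding vectors and returns the softmax over inner-product scores. The function names and the tiny 2-dimensional vectors are assumptions for illustration; in REALM the vectors would come from the BERT-style encoders described next.

```python
import math

def inner_product(u, v):
    """Relevance score f(x, z) for two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieval_distribution(query_vec, doc_vecs):
    """Softmax over inner-product scores, i.e. p(z | x) for each document."""
    scores = [inner_product(query_vec, d) for d in doc_vecs]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Usage with made-up embeddings: the first document aligns best with the query.
probs = retrieval_distribution([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(probs)
```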
We implement the embedding functions using BERT-style Transformers (Devlin et al., 2018). Following standard practices, we join spans of text by applying wordpiece tokenization, separating them with [SEP] tokens, prefixing a [CLS] token, and appending a final [SEP] token.
$$
\begin{aligned}
\mathtt{join}_{\mathtt{BERT}}(x) &= \mathtt{[CLS]}\; x\; \mathtt{[SEP]} \\
\mathtt{join}_{\mathtt{BERT}}(x_1, x_2) &= \mathtt{[CLS]}\; x_1\; \mathtt{[SEP]}\; x_2\; \mathtt{[SEP]}
\end{aligned}
$$
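The joining operation can be sketched as a small function over token lists. This sketch assumes each span has already been tokenized (whitespace splitting stands in for wordpiece tokenization), and `join_bert` is an illustrative name rather than part of any library.

```python
def join_bert(*spans):
    """Concatenate token spans as [CLS] span_1 [SEP] span_2 [SEP] ..."""
    tokens = ["[CLS]"]
    for span in spans:
        tokens.extend(span)      # tokens of this span
        tokens.append("[SEP]")   # separator after every span, including the last
    return tokens

# Usage: a single span, then a (title, body) pair.
print(join_bert("what is the currency of the uk".split()))
print(join_bert(["pound", "sterling"], "the pound is the currency".split()))
```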
As in Devlin et al. (2018), we pass this into a Transformer, which produces one vector for each token, including the vector corresponding to [CLS], which is used as a “pooled” representation of the sequence (denoted $\mathtt{BERT}_{\mathtt{CLS}}$). Finally, we perform a linear projection to reduce the dimensionality of the vector, with the projection matrix denoted $\mathbf{W}$:
$$
\begin{aligned}
\mathtt{Embed}_{\mathtt{input}}(x) &= \mathbf{W}_{\mathtt{input}}\,\mathtt{BERT}_{\mathtt{CLS}}(\mathtt{join}_{\mathtt{BERT}}(x)) \\
\mathtt{Embed}_{\mathtt{doc}}(z) &= \mathbf{W}_{\mathtt{doc}}\,\mathtt{BERT}_{\mathtt{CLS}}(\mathtt{join}_{\mathtt{BERT}}(z_{\mathrm{title}}, z_{\mathrm{body}}))
\end{aligned}
$$
where $z_{\mathrm{title}}$ is the document’s title and $z_{\mathrm{body}}$ is its body. We let $\theta$ denote all parameters associated with the retriever, which include the Transformer and projection matrices.
Knowledge-Augmented Encoder Given an input $x$ and a retrieved document $z$, the knowledge-augmented encoder defines $p(y\mid z,x)$. We join $x$ and $z$ into a single sequence that we feed into a Transformer (distinct from the one used in the retriever). This allows us to perform rich cross-attention between $x$ and $z$ before predicting $y$. See Figure 1 for a concrete example.
At this stage, the architectures for pre-training and fine-tuning differ slightly. For the masked language model pre-training task, we must predict the original value of each [MASK] token in $x$. To do so, we use the same masked language modeling (MLM) loss as in Devlin et al. (2018):

Figure 2. The overall framework of REALM. Left: Unsupervised pre-training. The knowledge retriever and knowledge-augmented encoder are jointly pre-trained on the unsupervised language modeling task. Right: Supervised fine-tuning. After the parameters of the retriever $(\theta)$ and encoder $(\phi)$ have been pre-trained, they are then fine-tuned on a task of primary interest, using supervised examples.
$$
\begin{aligned}
p(y \mid z, x) &= \prod_{j=1}^{J_x} p(y_j \mid z, x) \\
p(y_j \mid z, x) &\propto \exp\left(w_j^{\top}\,\mathtt{BERT}_{\mathtt{MASK}(j)}(\mathtt{join}_{\mathtt{BERT}}(x, z_{\mathrm{body}}))\right)
\end{aligned}
$$
where $\mathtt{BERT}_{\mathtt{MASK}(j)}$ denotes the Transformer output vector corresponding to the $j^{\mathrm{th}}$ masked token, $J_{x}$ is the total number of [MASK] tokens in $x$, and $w_{j}$ is a learned word embedding for token $y_{j}$.
For Open-QA fine-tuning, we wish to produce the answer string $y$ . Following previous reading comprehension work (Rajpurkar et al., 2016; Seo et al., 2016; Lee et al., 2016; Clark & Gardner, 2017), we will assume that the answer $y$ can be found as a contiguous sequence of tokens in some document $z$ . Let $S(z,y)$ be the set of spans matching $y$ in $z$ . Then we can define $p(y\mid z,x)$ as:
$$
\begin{aligned}
p(y \mid z, x) &\propto \sum_{s \in S(z,y)} \exp\left(\mathtt{MLP}\left(\left[h_{\mathtt{START}(s)}; h_{\mathtt{END}(s)}\right]\right)\right) \\
h_{\mathtt{START}(s)} &= \mathtt{BERT}_{\mathtt{START}(s)}(\mathtt{join}_{\mathtt{BERT}}(x, z_{\mathrm{body}})), \\
h_{\mathtt{END}(s)} &= \mathtt{BERT}_{\mathtt{END}(s)}(\mathtt{join}_{\mathtt{BERT}}(x, z_{\mathrm{body}})),
\end{aligned}
$$
where $\mathtt{BERT}_{\mathtt{START}(s)}$ and $\mathtt{BERT}_{\mathtt{END}(s)}$ denote the Transformer output vectors corresponding to the start and end tokens of span $s$, respectively, while MLP denotes a feed-forward neural network. We will let $\phi$ denote all parameters associated with the knowledge-augmented encoder.
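As a rough illustration of the span-based formulation, the sketch below sums exponentiated span scores over the matching set $S(z,y)$. Two assumptions are made that the text above does not specify: the MLP scores are supplied as a precomputed dictionary, and the proportionality is resolved by normalizing over all candidate spans of the single document $z$. The name `answer_probability` is hypothetical.

```python
import math

def answer_probability(span_scores, matching_spans):
    """p(y | z, x) as the normalized mass of spans that match the answer y.

    span_scores: dict mapping (start, end) -> MLP score for every candidate span.
    matching_spans: the spans in S(z, y), a subset of span_scores' keys.
    Normalizing over all candidate spans in z is an assumption of this sketch.
    """
    m = max(span_scores.values())            # stabilize the exponentials
    total = sum(math.exp(s - m) for s in span_scores.values())
    matched = sum(math.exp(span_scores[s] - m) for s in matching_spans)
    return matched / total

# Usage with made-up scores: the answer string occurs at two spans of z.
scores = {(0, 0): 0.0, (1, 2): 2.0, (3, 3): 0.0, (5, 6): 2.0}
p_answer = answer_probability(scores, [(1, 2), (5, 6)])
print(p_answer)
```

Summing over every matching span means the model is not forced to commit to one particular occurrence of the answer string.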
3.3. Training
For both pre-training and fine-tuning, we train by maximizing the log-likelihood $\log p(y\mid x)$ of the correct output $y$. Since both the knowledge retriever and knowledge-augmented encoder are differentiable neural networks, we can compute the gradient of $\log p(y\mid x)$ (defined in Equation 1) with respect to the model parameters $\theta$ and $\phi$, and optimize using stochastic gradient descent.
The key computational challenge is that the marginal probability $p(y \mid x)=\sum_{z\in\mathcal{Z}}p(y \mid z,x)\,p(z \mid x)$ involves a summation over all documents $z$ in the knowledge corpus $\mathcal{Z}$. We approximate this by instead summing over the top $k$ documents with highest probability under $p(z\mid x)$, which is reasonable if most documents have near-zero probability.
Even with this approximation, we still need an efficient way to find the top $k$ documents. Note that the ordering of documents under $p(z\mid x)$ is the same as under the relevance score $f(x,z)=\mathtt{Embed}_{\mathtt{input}}(x)^{\top}\mathtt{Embed}_{\mathtt{doc}}(z)$, which is an inner product. Thus, we can employ Maximum Inner Product Search (MIPS) algorithms to find the approximate top $k$ documents, using running time and storage space that scale sub-linearly with the number of documents (Ram & Gray, 2012; Shrivastava & Li, 2014; Shen et al., 2015).
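The top-$k$ approximation can be sketched as follows. Note the assumptions: an exhaustive scan with `heapq.nlargest` stands in for a real sub-linear MIPS index, the retrieval probabilities are renormalized over the retrieved set, and the function names and toy vectors are illustrative only.

```python
import heapq
import math

def top_k_by_inner_product(query_vec, doc_vecs, k):
    """Return the k highest (score, doc_index) pairs, best first.

    Brute-force stand-in for MIPS: it scans every document, so it is
    linear in corpus size, unlike the indexes cited above.
    """
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), i)
              for i, vec in enumerate(doc_vecs)]
    return heapq.nlargest(k, scored)

def approx_marginal(query_vec, doc_vecs, p_y_given_zx, k):
    """Approximate p(y|x) by summing over only the top-k documents."""
    top = top_k_by_inner_product(query_vec, doc_vecs, k)
    m = max(score for score, _ in top)
    weights = [math.exp(score - m) for score, _ in top]
    total = sum(weights)                     # renormalize over the retrieved set
    return sum((w / total) * p_y_given_zx[i] for w, (_, i) in zip(weights, top))

# Usage with made-up embeddings and likelihoods.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 1.0]
likelihoods = [0.5, 0.2, 0.9]
print(top_k_by_inner_product(query, docs, 2))
print(approx_marginal(query, docs, likelihoods, 2))
```

With `k` equal to the corpus size the approximation coincides with the exact marginal, which gives a simple sanity check.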
To employ MIPS, we must pre-compute $\mathtt{Embed}_{\mathtt{doc}}(z)$ for every $z\in\mathcal{Z}$ and construct an efficient search index over these embeddings. However, this data structure will no longer be consistent with $p(z\mid x)$ if the parameters $\theta$ of $\mathtt{Embed}_{\mathtt{doc}}$ are later updated. Hence, the search index goes “stale” after every gradient update on $\theta$.
Our solution is to “refresh” the index by asynchronously re-embedding and re-indexing all documents every several hundred training steps. The MIPS index is slightly stale between refreshes, but note that it is only used to select the top $k$ documents. We recompute $p(z\mid x)$ and its gradient, using the fresh $\theta$ , for these top $k$ documents after retrieving them. In Section 4.5, we empirically demonstrate that this procedure results in stable optimization, provided that refreshes happen at a sufficiently frequent rate.
Implementing asynchronous MIPS refreshes We asynchronously refresh the MIPS index by running two jobs in parallel: a primary trainer job, which performs gradient updates on the parameters, and a secondary index builder job, which embeds and indexes the documents. As shown below, the trainer sends the index builder a snapshot of its parameters, $\theta^{\prime}$ . The trainer then continues to train while the index builder uses $\theta^{\prime}$ to construct a new index in the background. As soon as the index builder is done, it sends the new index back to the trainer, and the process repeats.

Figure 3. REALM pre-training with asynchronous MIPS refreshes.
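The interaction between the trainer and the index builder can be illustrated with a single-threaded simulation that tracks how stale the live index is at each training step. This is only a sketch: the parameters `refresh_every` and `build_steps`, and the function name, are assumptions, and a real system would run the two jobs in parallel rather than in one loop.

```python
def simulate_refreshes(num_steps, refresh_every, build_steps):
    """Return, per training step, how many steps old the live index's snapshot is.

    refresh_every: the trainer offers a new snapshot every this many steps.
    build_steps: steps of training that elapse while the builder re-indexes.
    """
    index_snapshot_step = 0    # step at which the live index's parameters were taken
    pending = None             # (snapshot_step, ready_step) for a build in flight
    staleness = []
    for step in range(1, num_steps + 1):
        # The builder has finished: swap the new index in.
        if pending is not None and step >= pending[1]:
            index_snapshot_step = pending[0]
            pending = None
        # The builder is idle and a refresh is due: send it a parameter snapshot.
        if pending is None and step % refresh_every == 0:
            pending = (step, step + build_steps)
        staleness.append(step - index_snapshot_step)
        # A gradient update on theta would happen here, using the live index
        # only to select the top-k documents.
    return staleness

history = simulate_refreshes(num_steps=10, refresh_every=3, build_steps=2)
print(history)
```

In this simulation the staleness never exceeds `refresh_every + build_steps`, which is the sense in which the index is only "slightly stale" between refreshes.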
While asynchronous refreshes can be used for both pre-training and fine-tuning, in our experiments we only use them for pre-training. For fine-tuning, we just build the MIPS index once (using the pre-trained $\theta$) for simplicity and do not update $\mathtt{Embed}_{\mathtt{doc}}$. Note that we still fine-tune $\mathtt{Embed}_{\mathtt{input}}$, so the retrieval function is still updated from the query side.
What does the retriever learn? Since the knowledge retrieval of REALM is latent, it is not obvious how the training objective encourages meaningful retrievals. Here, we show how it rewards retrievals that improve prediction accuracy.
For a given query $x$ and document $z$ , recall that $f(x,z)$ is the “relevance score” that the knowledge retriever assigns to document $z$ . We can see how a single step of gradient descent during REALM pre-training alters this score by analyzing the gradient with respect to the parameters of the knowledge retriever, $\theta$ :
$$
\begin{aligned}
\nabla \log p(y \mid x) &= \sum_{z \in \mathcal{Z}} r(z)\, \nabla f(x,z) \\
r(z) &= \left[\frac{p(y \mid z, x)}{p(y \mid x)} - 1\right] p(z \mid x).
\end{aligned}
$$
For each document $z$, the gradient encourages the retriever to change the score $f(x,z)$ by $r(z)$: increasing if $r(z)$ is positive, and decreasing if negative. The multiplier $r(z)$ is positive if and only if $p(y\mid z,x)>p(y\mid x)$. The term $p(y\mid z,x)$ is the probability of predicting the correct output $y$ when using document $z$. The term $p(y\mid x)$ is the expected value of $p(y\mid z,x)$ when randomly sampling a document from $p(z\mid x)$. Hence, document $z$ receives a positive update whenever it performs better than expected.
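The expression for $r(z)$ can be verified numerically. The sketch below treats the relevance scores $f(x,z)$ as free parameters of a three-document corpus (the scores and likelihoods are made up) and compares $r(z)$ with a finite-difference estimate of $\partial \log p(y\mid x)/\partial f(x,z)$. All function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_marginal(scores, p_y_given_zx):
    """log p(y|x), with p(z|x) given by the softmax of the relevance scores."""
    p_z = softmax(scores)
    return math.log(sum(pz * py for pz, py in zip(p_z, p_y_given_zx)))

def r_values(scores, p_y_given_zx):
    """r(z) = [p(y|z,x) / p(y|x) - 1] * p(z|x) for every document."""
    p_z = softmax(scores)
    p_y = sum(pz * py for pz, py in zip(p_z, p_y_given_zx))
    return [(py / p_y - 1.0) * pz for py, pz in zip(p_y_given_zx, p_z)]

scores = [1.0, 0.5, -0.3]          # f(x, z) for three documents (toy values)
p_y_given_zx = [0.8, 0.1, 0.3]     # p(y | z, x) for each document (toy values)
r = r_values(scores, p_y_given_zx)

# Finite-difference gradient of log p(y|x) with respect to each score.
eps = 1e-6
numeric = []
for i in range(len(scores)):
    bumped = list(scores)
    bumped[i] += eps
    numeric.append((log_marginal(bumped, p_y_given_zx)
                    - log_marginal(scores, p_y_given_zx)) / eps)

print(r)
print(numeric)
```

The first document has a higher $p(y\mid z,x)$ than the average, so its multiplier is positive, while the second document's is negative; the multipliers sum to zero because raising one document's probability necessarily lowers the others'.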
3.4. Injecting inductive biases into pre-training
In the process of developing REALM, we discovered several additional strategies that further guide the model towards meaningful retrievals, described below.
Salient span masking During REALM pre-training, we want to focus on examples $x$ that require world knowledge to predict the masked tokens. As explained in Section 2, some MLM spans only require local context. To focus on problems that require world knowledge, we mask salient spans such as “United Kingdom” or “July 1969”. We use a BERT-based tagger trained on CoNLL-2003 data (Sang & De Meulder, 2003) to identify named entities, and a regular expression to identify dates. We select and mask one of these salient spans within a sentence for the masked language modeling task. We show that this significantly outperforms other masking strategies in Section 4.5.
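A simplified sketch of salient span masking is shown below. It covers only the date half of the procedure with an assumed regular expression (month name, optional day, four-digit year); named-entity spans would have to be supplied by an external tagger, represented here by the hypothetical `entity_spans` argument of character offsets. Replacing the whole span with a single `[MASK]` is also a simplification.

```python
import random
import re

# Assumed date pattern for illustration; the pattern used in the paper is not given.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)(?: \d{1,2},)? \d{4}\b"
)

def mask_one_salient_span(sentence, entity_spans=(), rng=random):
    """Mask one salient span; return (masked_sentence, masked_text).

    entity_spans: (start, end) character offsets from a hypothetical NER tagger.
    Returns (sentence, None) when no salient span is found.
    """
    spans = [m.span() for m in DATE_PATTERN.finditer(sentence)]
    spans.extend(entity_spans)
    if not spans:
        return sentence, None
    start, end = rng.choice(spans)           # select a single span to mask
    return sentence[:start] + "[MASK]" + sentence[end:], sentence[start:end]

masked, target = mask_one_salient_span("Apollo 11 landed on the Moon in July 1969.")
print(masked, "|", target)
```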
Null document Even with salient span masking, not all masked tokens require world knowledge to predict. We model this by adding an empty null document $\varnothing$ to the top $k$ retrieved documents, allowing appropriate credit to be assigned to a consistent sink when no retrieval is necessary.
即使采用显著跨度掩码,也并非所有被掩码的token都需要借助世界知识来预测。我们通过向检索到的前$k$篇文档中添加一个空文档$\varnothing$来建模这一现象,从而在无需检索时为一致的接收端分配适当的权重。
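The effect of the null document on the marginal likelihood can be sketched as below. This assumes the null document is simply appended to the candidate list and scored like any other candidate; how its score is produced is not specified here, so the scores are hypothetical.

```python
import math

NULL_DOC = ""  # an empty string stands in for the null document


def with_null_document(top_k_docs):
    """Append the null document to the retrieved candidates."""
    return list(top_k_docs) + [NULL_DOC]


def marginal_likelihood(scores, p_y_given_zx):
    """p(y|x) = sum_z softmax(scores)[z] * p(y|z,x) over the candidate set."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * p for w, p in zip(weights, p_y_given_zx))


docs = with_null_document(["doc A", "doc B"])
scores = [1.0, 0.5, 2.0]   # hypothetical f(x, z); the null document scores highest
p_y = [0.2, 0.2, 0.2]      # the prediction does not depend on the document used
print(len(docs), marginal_likelihood(scores, p_y))
```

When every candidate yields the same $p(y\mid z,x)$, the marginal equals that value regardless of the scores, so no real document is rewarded or penalized for an example that needs no retrieval.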
Prohibiting trivial retrievals If the pre-training corpus $\mathcal{X}$ and the knowledge corpus $\mathcal{Z}$ are the same, there exists a trivial retrieval candidate $z$ that is too informative: if the masked sentence $x$ comes from document $z$ , the knowledge augmented encoder can trivially predict $y$ by looking at the unmasked version of $x$ in $z$ . This results in a large positive gradient for $p(z\mid x)$ . If this occurs too often, the knowledge retriever ends up learning to look for exact string matches between $x$ and $z$ , which does not capture other forms of relevance. For this reason, we exclude this trivial candidate during pre-training.
禁止简单检索
如果预训练语料 $\mathcal{X}$ 和知识语料 $\mathcal{Z}$ 相同,会存在一个信息量过大的简单检索候选 $z$:当被掩码的句子 $x$ 来自文档 $z$ 时,知识增强编码器可以通过查看 $z$ 中未掩码版本的 $x$ 来轻易预测 $y$。这会导致 $p(z\mid x)$ 产生较大的正梯度。如果这种情况频繁发生,知识检索器最终会学习寻找 $x$ 和 $z$ 之间的精确字符串匹配,而无法捕捉其他形式的相关性。因此,我们在预训练期间排除了这种简单候选。
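One way to picture the exclusion is a filter applied to the retrieved candidates before marginalization. The sketch below assumes each candidate carries a `doc_id` and that the source document of $x$ is known; both the record layout and the function name are hypothetical.

```python
def drop_trivial_candidates(candidates, source_doc_id):
    """Remove the candidate that x itself was drawn from.

    candidates: list of dicts with (hypothetical) keys "doc_id" and "score".
    source_doc_id: id of the document the masked sentence x came from.
    """
    return [c for c in candidates if c["doc_id"] != source_doc_id]


retrieved = [
    {"doc_id": 17, "score": 9.1},   # contains the unmasked version of x
    {"doc_id": 42, "score": 4.3},
    {"doc_id": 7, "score": 3.8},
]
print(drop_trivial_candidates(retrieved, source_doc_id=17))
```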
Initialization At the beginning of training, if the retriever does not have good embeddings for $\mathtt{Embed}_{\mathtt{input}}(x)$ and $\mathtt{Embed}_{\mathtt{doc}}(z)$, the retrieved documents $z$ will likely be unrelated to $x$. This causes the knowledge-augmented encoder to learn to ignore the retrieved documents. Once this occurs, the knowledge retriever does not receive a meaningful gradient and cannot improve, creating a vicious cycle. To avoid this cold-start problem, we warm-start $\mathtt{Embed}_{\mathtt{input}}$ and $\mathtt{Embed}_{\mathtt{doc}}$ using a simple training objective known as the Inverse Cloze Task (ICT) where, given a sentence, the model is trained to retrieve the document where that sentence came from. We defer to Lee et al. (2019) for details. For the knowledge-augmented encoder, we warm-start it with BERT pre-training, specifically the uncased BERT-base model (12 layers, 768 hidden units, 12 attention heads).
初始化
在训练开始时,如果检索器无法为$\mathtt{Embed}_{\mathtt{input}}(x)$和$\mathtt{Embed}_{\mathtt{doc}}(z)$生成优质嵌入向量,检索到的文档$z$很可能与$x$无关。这会导致知识增强编码器学会忽略检索到的文档。一旦发生这种情况,知识检索器就无法获得有意义的梯度信号而无法改进,从而形成恶性循环。为避免这种冷启动问题,我们采用逆完形填空任务(Inverse Cloze Task, ICT)这一简单训练目标对$\mathtt{Embed}_{\mathtt{input}}$和$\mathtt{Embed}_{\mathtt{doc}}$进行预热初始化,该任务要求模型根据给定句子检索出其原始出处文档。具体实现细节可参考Lee et al. (2019)。对于知识增强编码器,我们采用BERT预训练进行预热初始化,具体使用的是无大小写区分的BERT-base模型(12层网络结构,768维隐藏单元,12个注意力头)。
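The ICT objective can be illustrated by how a single training pair might be built. This is a sketch under assumptions: the passage is given as a list of sentences, and `remove_prob` (the chance of deleting the query sentence from its context) is a hypothetical parameter whose value is not stated in this section; see Lee et al. (2019) for the actual recipe.

```python
import random


def make_ict_example(sentences, rng=random, remove_prob=0.9):
    """Build one (pseudo-query, context) pair for the Inverse Cloze Task.

    A random sentence becomes the query; the passage it came from is the
    retrieval target. With probability remove_prob (an assumed value) the
    query sentence is deleted from the context, so the retriever cannot
    rely only on exact string overlap.
    """
    idx = rng.randrange(len(sentences))
    query = sentences[idx]
    if rng.random() < remove_prob:
        context = sentences[:idx] + sentences[idx + 1:]
    else:
        context = list(sentences)
    return query, " ".join(context)


passage = [
    "Zebras have four gaits: walk, trot, canter and gallop.",
    "They are generally slower than horses.",
    "When chased, a zebra will zig-zag from side to side.",
]
print(make_ict_example(passage, rng=random.Random(0)))
```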
4. Experiments
4. 实验
We now evaluate our approach on the Open-QA task. In this section, we describe in detail the benchmarks used and the different approaches to which we compare empirically.
我们现在在开放问答(Open-QA)任务上评估我们的方法。本节将详细描述所使用的基准测试以及我们通过实证比较的不同方法。
4.1. Open-QA Benchmarks
4.1. 开放问答基准测试
A number of benchmarks have been proposed for Open-QA. In this work, we focus on datasets where the question writers did not already know the answer. This yields questions that reflect more realistic information-seeking needs, and also avoids artifacts that can arise if the question is formulated with a particular answer in mind. A deeper justification is given in Lee et al. (2019). In all cases, the predicted answer is evaluated via exact match with any reference answer, following previous Open-QA work (Chen et al., 2017).
针对开放领域问答(OpenQA)已提出多项基准测试。本研究重点关注问题提出者事先不知道答案的数据集。这种做法能产生更贴近真实信息需求的问题,同时避免了因预设特定答案而导致的问题表述偏差。Lee等人(2019)对此进行了更深入的论证。所有实验均遵循先前开放领域问答研究(Chen等人,2017)的做法,通过精确匹配参考答案来评估预测答案。
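Exact match against any reference answer can be sketched as follows. The normalization step (lowercasing, stripping punctuation and English articles) is an assumption in the style of common QA evaluation scripts; the text above only states that exact match is used.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the prediction equals any reference answer after normalization."""
    return any(normalize(prediction) == normalize(ref) for ref in references)


print(exact_match("The United Kingdom", ["United Kingdom", "UK"]))
print(exact_match("France", ["United Kingdom", "UK"]))
```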
Natural Questions-Open The Natural Questions dataset (Kwiatkowski et al., 2019) consists of naturally occurring Google queries and their answers. Each answer also comes with an “answer type”: following Lee et al. (2019), we only keep questions that are categorized as “short answer type” with at most five tokens. The dataset also provides a suggested Wikipedia document to retrieve; like all models we compare against, we do not provide this to our model.
Natural Questions-Open
Natural Questions数据集 (Kwiatkowski等人, 2019) 包含自然产生的Google查询及其答案。每个答案还附带一个"答案类型":参照Lee等人 (2019) 的方法,我们仅保留被归类为"短答案类型"且最多包含五个token的问题。该数据集还提供了建议检索的维基百科文档;与我们对比的所有模型一样,我们不会向模型提供该文档。
Web Questions The Web Questions dataset (Berant et al., 2013) was collected from the Google Suggest API, using one seed question and expanding the set to related questions. We follow the setting defined by Chen et al. (2017).
Web Questions数据集(Berant等人,2013)通过Google Suggest API收集,使用一个种子问题并扩展至相关问题集。我们采用Chen等人(2017)定义的实验设置。
CuratedTrec The CuratedTrec dataset is a collection of question-answer pairs drawn from real user queries issued on sites such as MSNSearch and AskJeeves. To account for multiple correct answers or different spelling variations, the answers in this dataset are defined as regular expressions that match all correct answers. It is unclear how to train generation-based models with this type of supervision, so we do not evaluate them on this dataset.
CuratedTrec
CuratedTrec数据集是从MSNSearch和AskJeeves等网站真实用户查询中提取的问答对集合。为应对多个正确答案或不同拼写变体,该数据集中的答案被定义为匹配所有正确答案的正则表达式。目前尚不清楚如何利用此类监督训练基于生成的模型,因此我们未在该数据集上评估它们。
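Scoring a prediction against a regular-expression answer might look like the sketch below. The pattern is a made-up example, and the choice of an unanchored, case-insensitive `re.search` is an assumption; the matching convention is not specified above.

```python
import re


def regex_match(prediction, pattern):
    """True if the reference pattern matches somewhere in the prediction."""
    return re.search(pattern, prediction, flags=re.IGNORECASE) is not None


# Hypothetical reference pattern covering two spellings of the same answer.
pattern = r"(Mt\.?|Mount) Everest"
print(regex_match("Mount Everest", pattern))
print(regex_match("Mt. Everest", pattern))
print(regex_match("K2", pattern))
```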
4.2. Approaches compared
4.2. 对比方法
Retrieval-based Open-QA Most existing Open-QA systems answer the input question by first retrieving potentially relevant documents from a knowledge corpus, and then using a reading comprehension system to extract an answer from the documents. In this paradigm, the knowledge is stored explicitly in the corpus. We wish to compare different methods for implementing retrieval.
基于检索的开放问答
现有大多数开放问答系统通过两个步骤回答问题:先从知识库中检索可能相关的文档,再用阅读理解系统从文档中提取答案。这种范式将知识显式存储在知识库中。我们将对比不同的检索实现方法。
Many approaches use non-learned heuristic retrieval such as sparse bag-of-words matching (Robertson et al., 2009) or entity linking on the question to select a small set of relevant documents (e.g., 20). These documents are typically then re-ranked using a learned model, but coverage may be limited by the initial heuristic retrieval step. Approaches such as DrQA (Chen et al., 2017), HardEM (Min et al., 2019a), Graph Retriever (Min et al., 2019b), and PathRetriever (Asai et al., 2019) in Table 1 are in this category.
许多方法采用非学习的启发式检索技术,例如基于稀疏词袋匹配 (Robertson et al., 2009) 或问题实体链接来筛选少量相关文档 (例如20篇)。这些文档通常随后通过学习模型进行重排序,但覆盖范围可能受限于初始启发式检索步骤。表1中的DrQA (Chen et al., 2017)、HardEM (Min et al., 2019a)、Graph Retriever (Min et al., 2019b) 和PathRetriever (Asai et al., 2019) 等方法均属于此类。
Some recent approaches have proposed to implement learnable retrieval using a MIPS index. ORQA (Lee et al., 2019) formulates Open-QA using a latent variable model similar to REALM's, and also trains by maximizing the marginal likelihood. However, REALM adds a novel language model pre-training step, and backpropagates into the MIPS index, rather than using a fixed index. In Table 1, we directly compare the two. It is also important to note that the retrievers for both REALM pre-training and ORQA are initialized using the Inverse Cloze Task, described in Section 3.4.
一些近期研究提出使用MIPS索引实现可学习的检索机制。ORQA (Lee et al., 2019) 采用与REALM类似的隐变量模型构建开放域问答系统,同样通过最大化边缘似然进行训练。但REALM引入了创新的语言模型预训练步骤,并通过反向传播优化MIPS索引而非使用固定索引。表1中我们直接对比了两者性能。值得注意的是,REALM预训练和ORQA的检索器均采用逆完形填空任务(详见3.4节)进行初始化。
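For readers unfamiliar with maximum inner product search (MIPS), the operation it accelerates is shown below in brute-force form. This is an exhaustive scan over toy two-dimensional embeddings, not the approximate sub-linear index a real system would use; the function name and data are illustrative.

```python
import heapq


def top_k_inner_product(query, doc_embeddings, k=5):
    """Return indices of the k documents with the largest inner product with query."""
    scored = [
        (sum(q * d for q, d in zip(query, emb)), idx)
        for idx, emb in enumerate(doc_embeddings)
    ]
    return [idx for _, idx in heapq.nlargest(k, scored)]


query = [1.0, 0.0]
docs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
print(top_k_inner_product(query, docs, k=2))
```

Because the document embeddings change whenever the retriever's parameters are updated, any index built from them reflects the parameters at build time, which is why backpropagating into the retriever requires rebuilding the index.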
Generation-based Open-QA An emerging alternative approach to Open-QA is to model it as a sequence prediction task: simply encode the question, and then decode the answer token-by-token based on the encoding. While it was initially unclear how large amounts of knowledge could be injected into the model, GPT-2 (Radford et al., 2019) hinted at the possibility of directly generating answers without using any given context via sequence-to-sequence modeling. However, its performance was not competitive, possibly due to the lack of fine-tuning. Orthogonally, T5 (Raffel et al., 2019) showed that directly generating answers without explicit extraction from the given context is a viable approach, but they only experimented on the reading comprehension task, where a context document is provided.
基于生成的开放问答
开放问答的一种新兴替代方法是将其建模为序列预测任务:先对问题进行编码,然后基于编码逐个token解码生成答案。虽然最初尚不清楚如何将大量知识注入模型,但GPT-2 (Radford et al., 2019) 暗示了不借助给定上下文、直接通过序列到序列生成答案的可能性。不过由于缺乏微调,其性能表现欠佳。另一方面,T5 (Raffel et al., 2019) 证明了无需从给定上下文中显式提取、直接生成答案的可行性,但他们仅在提供上下文文档的阅读理解任务上进行了实验。
For the most competitive and comparable generation-based baseline, we compare to concurrent work which fine-tunes T5 for Open-QA (Roberts et al., 2020). We compare against the Base, Large, and even larger 11-billion parameter model to measure the effect of model size.
为了进行最具竞争力和可比性的基于生成的基线对比,我们与同期工作进行了比较,该工作针对开放问答(Open-QA)对T5进行了微调(Roberts等人,2020)[4]。我们对比了Base、Large甚至更大的110亿参数模型,以衡量模型规模的影响。
4.3. Implementation Details
4.3. 实现细节
Fine-tuning We reuse all hyperparameters from Lee et al. (2019), to enable direct comparison. Our knowledge corpus is derived from the December 20, 2018 snapshot of English Wikipedia. Documents are greedily split into chunks of up to 288 BERT wordpieces, resulting in just over 13 million retrieval candidates. During fine-tuning inference, we consider the top-5 candidates, and the entire model can be run on a single machine with a 12GB GPU.
微调
我们复用Lee等人 (2019) 的所有超参数以实现直接对比。知识库来源于2018年12月20日的英文维基百科快照。文档通过贪心算法分割成最多288个BERT词片段 (wordpieces) 的块,最终生成略超1300万个检索候选项。在微调推理阶段,我们考虑前5个候选结果,整个模型可在配备12GB GPU的单台机器上运行。
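The greedy split can be sketched as packing consecutive sentences into a chunk until the next one would exceed the limit. Packing at sentence boundaries, and hard-splitting a sentence longer than the limit, are assumptions made for this sketch; the text only states that the split is greedy with at most 288 wordpieces per chunk.

```python
def greedy_chunks(sentences, max_len=288):
    """Greedily pack tokenized sentences into chunks of at most max_len tokens.

    sentences: list of token lists (assumed to be BERT wordpieces already).
    """
    chunks, current = [], []
    for sentence in sentences:
        remaining = list(sentence)
        # A sentence longer than max_len is cut into full-size pieces.
        while len(remaining) > max_len:
            if current:
                chunks.append(current)
                current = []
            chunks.append(remaining[:max_len])
            remaining = remaining[max_len:]
        # Start a new chunk if this sentence does not fit in the current one.
        if len(current) + len(remaining) > max_len:
            chunks.append(current)
            current = []
        current.extend(remaining)
    if current:
        chunks.append(current)
    return chunks


toy_document = [["tok"] * n for n in (100, 150, 100, 300)]
print([len(chunk) for chunk in greedy_chunks(toy_document)])
```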
Table 1. Test results on Open-QA benchmarks. The numbers of train/test examples are shown in parentheses after each benchmark name. Predictions are evaluated with exact match against any reference answer. Sparse retrieval denotes methods that use sparse features such as TF-IDF and BM25. Our model, REALM, outperforms all existing systems.
| Name | Architectures | Pre-training | NQ (79k/4k) | WQ (3k/2k) | CT (1k/1k) | #params |
|---|---|---|---|---|---|---|
| BERT-Baseline (Lee et al., 2019) | Sparse Retr.+Transformer | BERT | 26.5 | 17.7 | 21.3 | 110m |
| T5 (base) (Roberts et al., 2020) | Transformer Seq2Seq | T5 (Multitask) | 27.0 | 29.1 | - | 223m |
| T5 (large) (Roberts et al., 2020) | Transformer Seq2Seq | T5 (Multitask) | 29.8 | 32.2 | - | 738m |
| T5 (11b) (Roberts et al., 2020) | Transformer Seq2Seq | T5 (Multitask) | 34.5 | 37.4 | - | 11318m |
| DrQA (Chen et al., 2017) | Sparse Retr.+DocReader | N/A | - | 20.7 | 25.7 | 34m |
| HardEM (Min et al., 2019a) | Sparse Retr.+Transformer | BERT | 28.1 | - | - | 110m |
| GraphRetriever (Min et al., 2019b) | GraphRetriever+Transformer | BERT | 31.8 | 31.6 | - | 110m |
| PathRetriever (Asai et al., 2019) | PathRetriever+Transformer | MLM | 32.6 | - | - | 110m |
| ORQA (Lee et al., 2019) | Dense Retr.+Transformer | ICT+BERT | 33.3 | 36.4 | 30.1 | 330m |
| Ours ($\mathcal{X}$ = Wikipedia, $\mathcal{Z}$ = Wikipedia) | Dense Retr.+Transformer | REALM | 39.2 | 40.2 | 46.8 | 330m |
| Ours ($\mathcal{X}$ = CC-News, $\mathcal{Z}$ = Wikipedia) | Dense Retr.+Transformer | REALM | 40.4 | 40.7 | 42.9 | 330m |
表 1: Open-QA基准测试结果。每个基准下的训练/测试样本数量显示在括号中。预测结果通过与任一参考答案的精确匹配进行评估。稀疏检索表示使用TF-IDF和BM25等稀疏特征的方法。我们的模型REALM超越了所有现有系统。
| 名称 | 架构 | 预训练 | NQ (79k/4k) | WQ (3k/2k) | CT (1k/1k) | #参数 |
|---|---|---|---|---|---|---|
| BERT-Baseline (Lee et al.,2019) | 稀疏检索+Transformer | BERT | 26.5 | 17.7 | 21.3 | 110m |
| T5 (base) (Roberts et al., 2020) | Transformer序列到序列 | T5 (多任务) | 27.0 | 29.1 | - | 223m |
| T5 (large) (Roberts et al., 2020) | Transformer序列到序列 | T5 (多任务) | 29.8 | 32.2 | - | 738m |
| T5 (11b) (Roberts et al., 2020) | Transformer序列到序列 | T5 (多任务) | 34.5 | 37.4 | - | 11318m |
| DrQA (Chen et al., 2017) | 稀疏检索+文档阅读器 | 无 | - | 20.7 | 25.7 | 34m |
| HardEM (Min et al., 2019a) | 稀疏检索+Transformer | BERT | 28.1 | - | - | 110m |
| GraphRetriever (Min et al., 2019b) | 图检索器+Transformer | BERT | 31.8 | 31.6 | - | 110m |
| PathRetriever (Asai et al.,2019) | 路径检索器+Transformer | 掩码语言模型 | 32.6 | - | - | 110m |
| ORQA (Lee et al., 2019) | 稠密检索+Transformer | ICT+BERT | 33.3 | 36.4 | 30.1 | 330m |
| Ours (X = 维基百科, Z = 维基百科) | 稠密检索+Transformer | REALM | 39.2 | 40.2 | 46.8 | 330m |
| Ours (X = CC-新闻, Z = 维基百科) | 稠密检索+Transformer | REALM | 40.4 | 40.7 | 42.9 | 330m |
Table 2. Ablation experiments on NQ’s development set.
| Ablation | Exact Match | Zero-shot Retrieval Recall@5 |
|---|---|---|
| REALM | 38.2 | 38.5 |
| REALM retriever+Baseline encoder | 37.4 | 38.5 |
| Baseline retriever+REALM encoder | 35.3 | 13.9 |
| Baseline (ORQA) | 31.3 | 13.9 |
| REALM with random uniform masks | 32.3 | 24.2 |
| REALM with random span masks | 35.3 | 26.1 |
| 30× stale MIPS | 28.7 | 15.1 |
表 2: NQ开发集上的消融实验
| 消融项 | 精确匹配 | 零样本检索召回率@5 |
|---|---|---|
| REALM | 38.2 | 38.5 |
| REALM检索器+基线编码器 | 37.4 | 38.5 |
| 基线检索器+REALM编码器 | 35.3 | 13.9 |
| 基线(ORQA) | 31.3 | 13.9 |
| REALM随机均匀掩码 | 32.3 | 24.2 |
| REALM随机跨度掩码 | 35.3 | 26.1 |
| 30倍陈旧MIPS | 28.7 | 15.1 |
Pre-training We pre-train for 200k steps on 64 Google Cloud TPUs, with a batch size of 512 and a learning rate of 3e-5, using BERT’s default optimizer. The document embedding step for the MIPS index is parallelized over 16 TPUs. For each example, we retrieve and marginalize over 8 candidate documents, including the null document $\varnothing$.
预训练
我们在64个Google Cloud TPU上进行了200k步的预训练,批次大小为512,学习率为3e-5,使用BERT的默认优化器。MIPS索引的文档嵌入步骤在16个TPU上并行处理。对于每个样本,我们检索并边缘化8个候选文档,包括空文档$\varnothing$。
We experiment with two choices of the pre-training corpus $\mathcal{X}$ : (1) Wikipedia, which is identical to the knowledge corpus $\mathcal{Z}$ , and (2) CC-News, our reproduction of the corpus of English news proposed by Liu et al. (2019).
我们尝试了两种预训练语料 $\mathcal{X}$ 的选择:(1) 维基百科,与知识语料 $\mathcal{Z}$ 相同;(2) CC-News,这是我们复现的 Liu et al. (2019) 提出的英文新闻语料。
4.4. Main results
4.4. 主要结果
Table 1 shows the accuracy of different approaches on the three Open-QA datasets. REALM outperforms all previous approaches by a significant margin. Table 1 also shows the number of parameters for each model.
表1展示了不同方法在三个开放问答数据集上的准确率。REALM 以显著优势超越所有先前方法。表1还展示了每个模型的参数量。
As reported in the concurrent work of Roberts et al. (2020), the generative Open-QA systems based on T5 are surprisingly powerful, with the largest T5-11B model outperforming the previous best Open-QA system. Increasing the size of T5 yields consistent improvement, but comes at significant computational cost (from Base to 11B, the model is 50 times larger, and gains roughly 5 points in accuracy). In contrast, REALM outperforms the largest T5-11B model while being 30 times smaller. It is also important to note that T5 accesses additional reading comprehension data from SQuAD during its pre-training (100,000+ examples). Access to such data could also benefit REALM, but was not used in our experiments.
正如Roberts等人在同期研究(2020)中所述,基于T5的生成式开放问答系统(Open-QA)表现出惊人性能,其中最大的T5-11B模型超越了此前最优的开放问答系统。增大T5规模能带来稳定提升,但计算成本显著增加(从Base到11B,模型规模扩大50倍,准确率仅提升约5个百分点)。相比之下,REALM在模型体积小30倍的情况下仍优于最大的T5-11B模型。值得注意的是,T5在预训练阶段额外使用了SQuAD的阅读理解数据(超过10万条样本),这类数据理论上也能提升REALM性能,但未在我们的实验中使用。
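The size ratios quoted above can be reproduced from the #params column of Table 1. The snippet below is only arithmetic on those three reported values (in millions of parameters).

```python
# Parameter counts in millions, copied from the #params column of Table 1.
params_millions = {"T5 (base)": 223, "T5 (11b)": 11318, "REALM": 330}

t5_growth = params_millions["T5 (11b)"] / params_millions["T5 (base)"]
t5_vs_realm = params_millions["T5 (11b)"] / params_millions["REALM"]

print(f"T5-11B vs T5-Base: {t5_growth:.1f}x")
print(f"T5-11B vs REALM:   {t5_vs_realm:.1f}x")
```

The ratios come out at about 51 and 34, in line with the rounded "50 times" and "30 times" figures in the text.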
Among all systems, the most direct comparison with REALM is ORQA (Lee et al., 2019), where the fine-tuning setup, hyperparameters and training data are identical. The improvement of REALM over ORQA is purely due to better pre-training methods. The results also indicate that our method of pre-training can be applied in both (1) the single-corpus setting ($\mathcal{X}=$ Wikipedia, $\mathcal{Z}=$ Wikipedia) and (2) the separate-corpus setting ($\mathcal{X}=$ CC-News, $\mathcal{Z}=$ Wikipedia).
在所有系统中,与REALM最直接对比的是ORQA (Lee et al., 2019),两者的微调设置、超参数和训练数据完全相同。REALM相对于ORQA的提升完全源于更好的预训练方法。结果表明,我们的预训练方法既适用于(1) 单语料库设置( $\mathcal{X}=$ 维基百科, $\mathcal{Z}=$ 维基百科),也适用于(2) 分离语料库设置( $\mathcal{X}=\mathrm{CC}\mathrm{-News}$ , $\mathcal{Z}=$ 维基百科)。
Compared to other retrieval-based systems (Asai et al., 2019; Min et al., 2019a;b) which often retrieve from 20 to 80 documents, our system gets the overall best performance while only retrieving 5 documents.
与其他检索系统 [Asai et al., 2019; Min et al., 2019a;b] (通常需要检索20至80份文档) 相比,我们的系统仅需检索5份文档即可获得整体最佳性能。
4.5. Analysis
4.5. 分析
In Table 2 we present results for Natural Questions-Open after ablating critical components of REALM. In addition to the end-to-end results, we also report how often the gold answer appears in the top-5 retrievals before applying any fine-tuning. The latter metric more cleanly isolates the contribution of improving the retriever during pre-training.
在表2中,我们展示了REALM关键组件消融后在Natural Questions-Open数据集上的结果。除了端到端结果外,我们还报告了在应用任何微调前,黄金答案出现在前5检索结果中的频率。后一项指标更能显著体现预训练阶段改进检索器的贡献。
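The retrieval metric described above (how often the gold answer appears in the top-5 retrievals) can be sketched as follows. Case-insensitive substring containment is an assumed matching rule, and the toy data are invented for illustration.

```python
def recall_at_k(retrieved_per_question, answers_per_question, k=5):
    """Fraction of questions whose top-k documents contain any gold answer string."""
    hits = 0
    for docs, answers in zip(retrieved_per_question, answers_per_question):
        top_k = docs[:k]
        if any(ans.lower() in doc.lower() for doc in top_k for ans in answers):
            hits += 1
    return hits / len(retrieved_per_question)


retrieved = [
    ["Paris is the capital of France.", "Berlin is in Germany."],
    ["Cats are mammals."],
]
gold = [["Paris"], ["Canberra"]]
print(recall_at_k(retrieved, gold, k=5))
```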
Table 3. An example where REALM utilizes retrieved documents to better predict masked tokens. It assigns much higher probability (0.129) to the correct term, “Fermat”, compared to BERT. (Note that the blank corresponds to 3 BERT wordpieces.)
| $x$: | An equilateral triangle is easily constructed using a straightedge and compass, because 3 is a ___ prime. | |
|---|---|---|
| (a) BERT | $p(y=\text{“Fermat”}\mid x)=1.1\times10^{-14}$ | (No retrieval.) |
| (b) REALM | $p(y=\text{“Fermat”}\mid z,x)=1.0$ | (Conditional probability with document $z=$ “257 is ... a Fermat prime. Thus a regular polygon with 257 sides is constructible with compass ...”) |
| (c) REALM | $p(y=\text{“Fermat”}\mid x)=0.129$ | (Marginal probability, marginalizing over top 8 retrieved documents.) |
表 3: REALM 利用检索到的文档更好地预测掩码 token 的示例。与 BERT 相比,它为正确术语 "Fermat" 分配了更高的概率 (0.129)。(注意空白处对应 3 个 BERT 的 wordpiece token。)
| $x$: | 使用直尺和圆规可以轻松构造等边三角形,因为 3 是 ___ 质数。 | |
|---|---|---|
| (a) BERT | $p(y=\text{“Fermat”}\mid x)=1.1\times10^{-14}$ | (无检索) |
| (b) REALM | $p(y=\text{“Fermat”}\mid z,x)=1.0$ | (条件概率,文档 $z=$ “257 是...一个费马质数。因此可以用圆规构造具有 257 条边的正多边形 ...”) |
| (c) REALM | $p(y=\text{“Fermat”}\mid x)=0.129$ | (边缘概率,对前 8 个检索文档进行边缘化计算) |
Encoder or Retriever We first aim to determine whether REALM pre-training improves the retriever or the encoder, or both. To do so, we can reset the parameters of either the retriever or the encoder to their baseline state before REALM pre-training, and feed that into fine-tuning. Resetting both the retriever and encoder reduces the system to our main baseline, ORQA. We find that both the encoder and retriever benefit from REALM training separately, but the best result requires both components acting in unison.
编码器还是检索器
我们首先旨在确定REALM预训练是改进了检索器、编码器,还是两者都有所提升。为此,我们可以在REALM预训练之前将检索器或编码器的参数重置为基线状态,然后将其输入到微调过程中。同时重置检索器和编码器会将系统还原为我们的主要基线ORQA。我们发现编码器和检索器各自都能从REALM训练中受益,但最佳结果需要两者协同工作。
Masking scheme We compare our salient span masking scheme (Section 3.4) with (1) random token masking introduced in BERT (Devlin et al., 2018) and (2) random span masking proposed by SpanBERT (Joshi et al., 2019). While such salient span masking has not been shown to be impactful in previous work with standard BERT training (Joshi et al., 2019), it is crucial for REALM. Intuitively, the latent variable learning relies heavily on the utility of retrieval and is therefore more sensitive to a consistent learning signal.
掩码方案
我们将提出的显著片段掩码方案(第3.4节)与以下方法进行对比:(1) BERT (Devlin et al., 2018) 提出的随机token掩码;(2) SpanBERT (Joshi et al., 2019) 提出的随机片段掩码。虽然先前基于标准BERT训练的研究表明这种显著片段掩码效果有限 (Joshi et al., 2019),但它对REALM至关重要。直观来看,隐变量学习高度依赖检索效用,因此对一致的学习信号更为敏感。
MIPS index refresh rate During pre-training, we run a parallel process to re-embed corpus documents and rebuild the MIPS index. This results in one index refresh per approximately 500 training steps. To demonstrate the importance of frequent index refreshes, we compare against using a slower refresh rate. The results in Table 2 suggest that a stale index can hurt model training, and further reducing this staleness could offer better optimization.
MIPS索引刷新率
在预训练期间,我们运行并行进程来重新嵌入语料库文档并重建MIPS索引。这大约每500个训练步骤就会刷新一次索引。为了证明频繁刷新索引的重要性,我们与使用较慢刷新率的情况进行了比较。表2中的结果表明,过时的索引可能会损害模型训练,进一步减少这种过时性可能会提供更好的优化。
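To make "staleness" concrete, the sketch below counts how many parameter updates separate the current training step from the step at which the index in use was built. It is a synchronous simplification of the parallel refresh process described above, and reading the "30× stale MIPS" row of Table 2 as one refresh per 15,000 steps is an assumption.

```python
def max_index_staleness(num_steps, refresh_every=500):
    """Largest gap (in steps) between the current step and the index build step."""
    index_built_at = 0
    worst = 0
    for step in range(num_steps):
        if step > 0 and step % refresh_every == 0:
            index_built_at = step  # re-embed documents and rebuild the index here
        worst = max(worst, step - index_built_at)
    return worst


print(max_index_staleness(2000, refresh_every=500))
print(max_index_staleness(2000, refresh_every=15000))
```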
Examples of retrieved documents Table 3 shows an example of the REALM masked language model prediction. In this example, “Fermat” is the correct word, and REALM (row (c)) gives the word a much higher probability than the BERT model (row (a)). Since REALM manages to retrieve some documents with a related fact (row (b)), the marginalized probability of the correct answer dramatically increases. This shows that REALM is able to retrieve documents to fill in the masked word even though it is trained with unsupervised text only.
检索文档示例
表 3 展示了REALM掩码语言模型预测的示例。在该示例中,"Fermat"是正确答案,与BERT模型(行(a))相比,REALM(行(c))为该词分配了更高的概率。由于REALM成功检索到包含相关事实的文档(行(b)),正确答案的边缘化概率显著提升。这表明REALM能够通过检索文档来填补掩码词,尽管其仅通过无监督文本进行训练。
5. Discussion and Related Work
5. 讨论与相关工作
We previously discussed related methods for Open-QA. Here we present several alternate ways of viewing REALM that connect it to a broader set of ideas beyond Open-QA:
我们之前讨论了开放问答(Open-QA)的相关方法。这里我们提出几种理解REALM的替代视角,将其与开放问答之外更广泛的思想联系起来:
Language modeling with corpus as context Language representation models have been incorporating contexts of increasingly large scope when making predictions. Examples of this progression include models that condition on surrounding words (Mikolov et al., 2013a;b), sentences (Kiros et al., 2015; Peters et al., 2018), and paragraphs (Radford et al., 2018; Devlin et al., 2018). We can view REALM as a generalization of the above work to the next level of scope: the entire text corpus.
以语料库为上下文进行语言建模
语言表征模型在预测时已逐渐融入更大范围的上下文。这一发展历程的示例包括基于周围词语 (Mikolov et al., 2013a;b)、句子 (Kiros et al., 2015; Peters et al., 2018) 和段落 (Radford et al., 2018; Devlin et al., 2018) 的模型。我们可以将REALM视为上述工作向更大范围的延伸:整个文本语料库。
Retrieve-and-edit with learned retrieval In order to better explain the variance in the input text and enable controllable generation, Guu et al. (2018) proposed a language model with the retrieve-and-edit framework (Hashimoto et al., 2018) that conditions on text with high lexical overlap. REALM has a similar approach, except that the model learns for itself which texts are most useful for reducing perplexity. By jointly learning the retriever, REALM has the capacity to depend on information beyond lexical overlap.
基于学习的检索与编辑框架
为更好地解释输入文本的差异并实现可控生成,Guu等人(2018)提出采用检索-编辑框架(Hashimoto等人, 2018)的语言模型,该模型以词汇重叠度高的文本为条件。REALM采用类似方法,区别在于模型会自主学习哪些文本最能有效降低困惑度。通过联合学习检索器,REALM能够依赖超越词汇重叠的其他信息。
Scalable grounded neural memory The document index can be viewed as a memory where the keys are the document embeddings. From this view, our work shares motivations with works such as product key memory (Lample et al., 2019), which enables sub-linear memory access in a memory network (Weston et al., 2014; Graves et al., 2014; Sukhbaatar et al., 2015), allowing these scalable memory layers to be integrated into large language models. One main difference is that our memories are grounded: each memory is associated with a document rather than unnamed value vectors. This level of interpretability is crucial for applications like Open-QA, where users would require provenance for a predicted answer to be trustworthy.
可扩展的接地神经记忆
文档索引可视为一种记忆,其中键是文档嵌入向量。从这个角度看,我们的工作与产品键记忆 (product key memory) [Lample et al., 2019] 等研究有共同动机,后者在记忆网络 (memory network) [Weston et al., 2014; Graves et al., 2014; Sukhbaatar et al., 2015] 中实现了次线性记忆访问,使得这些可扩展的记忆层能够集成到大语言模型中。一个主要区别在于我们的记忆是接地的 (grounded) —— 每个记忆都与具体文档关联,而非未命名的值向量。这种可解释性水平对于开放问答 (Open-QA) 等应用至关重要,因为用户需要预测答案的来源证明才能建立信任。
Unsupervised Corpus Alignment In sequence-to-sequence models with attention (Bahdanau et al., 2014), text is generated with latent selection of relevant tokens. This results in a set of model-centric unsupervised alignments between target and source tokens. Analogously, REALM also generates text with latent selection of relevant documents. A by-product of our method is that we offer a set of model-centric unsupervised alignments between text in the pre-training corpus $\mathcal{X}$ and knowledge corpus $\mathcal{Z}$.
无监督语料对齐
在基于注意力机制的序列到序列模型 (Bahdanau et al., 2014) 中,文本生成过程隐式选择了相关 token,从而形成目标 token 与源 token 之间以模型为中心的无监督对齐关系。类似地,REALM 也通过隐式选择相关文档来生成文本。我们方法的副产品是:在预训练语料 $\mathcal{X}$ 和知识语料 $\mathcal{Z}$ 之间,提供了一组以模型为中心的无监督文本对齐关系。
6. Future Work
6. 未来工作
The work presented here is the minimal instantiation of a family of REALM-like approaches where a representation is pre-trained to perform reasoning over a large corpus of knowledge on-the-fly during inference. We are particularly optimistic about generalizations of this work to (1) structured knowledge, which would result in a generalization of Peters et al. (2019) where we would also learn the decision of which entities are informative, (2) the multi-lingual setting, e.g., retrieving knowledge in a high-resource language to better represent text in a low-resource language, and (3) the multi-modal setting, e.g., retrieving images or videos that can provide knowledge rarely observed in text.
本文工作是REALM类方法的最小化实例,其核心是在推理过程中动态预训练表征以实现大规模知识库的即时推理。我们尤其看好该工作在下述方向的拓展:(1) 结构化知识,这将扩展Peters等人(2019)的研究,使我们还能学习判断哪些实体具有信息价值;(2) 多语言场景,例如通过高资源语言检索知识来优化低资源语言的文本表征;(3) 多模态场景,例如检索能补充文本稀缺知识的图像或视频。
References
参考文献
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: 一种鲁棒优化的BERT预训练方法。arXiv preprint arXiv:1907.11692, 2019.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. 词向量空间的高效估计。arXiv预印本 arXiv:1301.3781, 2013a.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013b.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. 词与短语的分布式表示及其组合性。In Advances in neural information processing systems, pp. 3111–3119, 2013b.
Miller, A., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.
Miller, A., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. 基于键值记忆网络直接阅读文档. arXiv预印本 arXiv:1606.03126, 2016.
Min, S., Chen, D., Hajishirzi, H., and Zettlemoyer, L. A discrete hard EM approach for weakly supervised question answering. arXiv preprint arXiv:1909.04849, 2019a.
Min, S., Chen, D., Hajishirzi, H., 和 Zettlemoyer, L. 一种用于弱监督问答的离散硬EM方法。arXiv预印本 arXiv:1909.04849, 2019a.
Min, S., Chen, D., Zettlemoyer, L., and Hajishirzi, H. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868, 2019b.
Min, S., Chen, D., Zettlemoyer, L., 和 Hajishirzi, H. 知识引导的文本检索与阅读在开放域问答中的应用。arXiv预印本 arXiv:1911.03868, 2019b.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proc. of NAACL, 2018.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. 深度上下文词表征. In Proc. of NAACL, 2018.
Peters, M. E., Neumann, M., IV, R. L. L., Schwartz, R., Joshi, V., Singh, S., and Smith, N. A. Knowledge enhanced contextual word representations, 2019.
Peters, M. E., Neumann, M., IV, R. L. L., Schwartz, R., Joshi, V., Singh, S., and Smith, N. A. 知识增强的上下文词表征, 2019.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y. Miller, A. H., and Riedel, S. 语言模型能作为知识库吗?arXiv预印本 arXiv:1909.01066, 2019.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. 通过无监督学习提升语言理解能力. 技术报告, OpenAI, 2018.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. 语言模型是无监督多任务学习者。OpenAI 博客, 2019.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. 探索迁移学习的极限:基于统一文本到文本Transformer的研究。arXiv预印本 arXiv:1910.10683, 2019.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 10万+文本机器理解问题集. 载于《2016年自然语言处理实证方法会议论文集》, 第2383–2392页, 2016.
Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
Rajpurkar, P., Jia, R., and Liang, P. 了解你所不知道的:SQUAD中的不可回答问题。arXiv预印本 arXiv:1806.03822,2018。
A. Derivation of the gradient with respect to the knowledge retriever
A. 知识检索器梯度的推导
We compute the gradient of the REALM pre-training objective (a log-likelihood) with respect to the parameters of the knowledge retriever, $\theta$ :
我们计算REALM预训练目标(对数似然)关于知识检索器参数$\theta$的梯度:
$$
\begin{aligned}
\nabla \log p(\boldsymbol{y} \mid \boldsymbol{x}) &= p(\boldsymbol{y} \mid \boldsymbol{x})^{-1} \nabla p(\boldsymbol{y} \mid \boldsymbol{x}) \\
&= p(\boldsymbol{y} \mid \boldsymbol{x})^{-1} \sum_{\boldsymbol{z}} p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{x}) \nabla p(\boldsymbol{z} \mid \boldsymbol{x}) \\
&= p(\boldsymbol{y} \mid \boldsymbol{x})^{-1} \sum_{\boldsymbol{z}} p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{x}) p(\boldsymbol{z} \mid \boldsymbol{x}) \nabla \log p(\boldsymbol{z} \mid \boldsymbol{x}) \\
&= \sum_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{y}, \boldsymbol{x}) \nabla \log p(\boldsymbol{z} \mid \boldsymbol{x})
\end{aligned}
$$
where the last line follows from applying conditional Bayes’ rule. We can then expand $\nabla\log p\left(z\mid x\right)$ as:
最后一行是通过应用条件贝叶斯规则得出的。随后我们可以将 $\nabla\log p\left(z\mid x\right)$ 展开为:
$$
\begin{aligned}
\nabla \log p(z \mid x) &= \nabla \log \frac{\exp f(x,z)}{\sum_{z^{\prime}} \exp f(x,z^{\prime})} \\
&= \nabla \left[ f(x,z) - \log \sum_{z^{\prime}} \exp f(x,z^{\prime}) \right] \\
&= \nabla f(x,z) - \sum_{z^{\prime}} p(z^{\prime} \mid x) \nabla f(x,z^{\prime})
\end{aligned}
$$
Plugging this back into the first set of equations yields:
将其代回第一组方程可得:
$$
\begin{aligned}
\nabla\log p\left(y\mid x\right) &= \sum_{z}p\left(z\mid y,x\right)\left[\nabla f(x,z)-\sum_{z^{\prime}}p\left(z^{\prime}\mid x\right)\nabla f(x,z^{\prime})\right] \\
&= \sum_{z}p\left(z\mid y,x\right)\nabla f(x,z)-\sum_{z^{\prime}}p\left(z^{\prime}\mid x\right)\nabla f(x,z^{\prime}) \\
&= \sum_{z}\left[p\left(z\mid y,x\right)-p\left(z\mid x\right)\right]\nabla f(x,z) \\
&= \sum_{z}\left[\frac{p\left(y\mid z,x\right)p\left(z\mid x\right)}{p\left(y\mid x\right)}-p\left(z\mid x\right)\right]\nabla f(x,z) \\
&= \sum_{z}\left[\frac{p\left(y\mid z,x\right)}{p\left(y\mid x\right)}-1\right]p\left(z\mid x\right)\nabla f(x,z).
\end{aligned}
$$
In the second line, we used the fact that the overall expression is an expectation with respect to $p\left(z\mid y,x\right)$, and the term which depends on $z^{\prime}$ but not $z$ can be moved out of that expectation, because $\sum_{z}p\left(z\mid y,x\right)=1$.
在第二行中,我们利用了整体表达式是关于 $p\left(z\mid\boldsymbol{y},\boldsymbol{x}\right)$ 的期望这一事实,且依赖于 $z^{\prime}$ 但不依赖于 $z$ 的项可以从该期望中移出。
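The final identity can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the relevance scores $f(x,z)$ are themselves the free parameters (so $\nabla f(x,z)$ is a one-hot vector), holds $p(y\mid z,x)$ fixed, and compares the closed-form gradient against central finite differences. The score and likelihood values, and all function names, are hypothetical.

```python
import math

def softmax(scores):
    """p(z | x) as a softmax over relevance scores f(x, z)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_marginal(scores, lik):
    """log p(y | x) = log sum_z p(y | z, x) p(z | x)."""
    prior = softmax(scores)
    return math.log(sum(l * p for l, p in zip(lik, prior)))

def analytic_grad(scores, lik):
    """Closed form: d log p(y|x) / d f(x,z) = [p(y|z,x) / p(y|x) - 1] p(z|x)."""
    prior = softmax(scores)
    p_y = sum(l * p for l, p in zip(lik, prior))
    return [(l / p_y - 1.0) * p for l, p in zip(lik, prior)]

def numeric_grad(scores, lik, eps=1e-6):
    """Central finite differences of log p(y | x) with respect to each score."""
    grads = []
    for i in range(len(scores)):
        up = list(scores)
        up[i] += eps
        down = list(scores)
        down[i] -= eps
        grads.append((log_marginal(up, lik) - log_marginal(down, lik)) / (2 * eps))
    return grads

scores = [0.5, -1.0, 2.0, 0.0]  # hypothetical f(x, z) for four documents
lik = [0.9, 0.1, 0.05, 0.3]     # hypothetical p(y | z, x), held fixed

analytic = analytic_grad(scores, lik)
numeric = numeric_grad(scores, lik)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)
```

Documents whose likelihood $p(y\mid z,x)$ exceeds the marginal $p(y\mid x)$ receive a positive gradient on their score, and the rest receive a negative one. The components sum to zero because both $p(z\mid y,x)$ and $p(z\mid x)$ are normalized.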
B. Connection between REALM and supervised learning
B. REALM 与监督学习之间的联系
From the equations in Appendix A, we saw that
从附录A的方程中可以看出
$$
\nabla\log p\left(\boldsymbol{y}\mid\boldsymbol{x}\right)=\sum_{\boldsymbol{z}}\left[p\left(\boldsymbol{z}\mid\boldsymbol{y},\boldsymbol{x}\right)-p\left(\boldsymbol{z}\mid\boldsymbol{x}\right)\right]\nabla f(\boldsymbol{x},\boldsymbol{z}).
$$
Suppose that there exists one document $z^{*}$ which causes the model to achieve perfect prediction accuracy (i.e., $p\left(y\mid z^{*},x\right)=1$), while all other documents $z^{\prime}$ result in zero accuracy (i.e., $p\left(y\mid z^{\prime},x\right)=0$). Under this setting, $p\left(z^{*}\mid y,x\right)=1$ (provided that $p\left(z^{*}\mid x\right)$ is non-zero), which causes the gradient to become
假设存在一个文档 $z^{*}$ 能使模型达到完美预测准确率(即 $p\left(y\mid z^{*},x\right)=1$),而其他所有文档 $z^{\prime}$ 都会导致零准确率(即 $p\left(y\mid z^{\prime},x\right)=0$)。在此设定下,$p\left(z^{*}\mid y,x\right)=1$(前提是 $p\left(z^{*}\mid x\right)$ 非零),这会导致梯度变为
$$
\begin{aligned}
\nabla\log p\left(y\mid x\right) &= \nabla f\left(x,z^{*}\right)-\sum_{z}p\left(z\mid x\right)\nabla f(x,z) \\
&= \nabla\log p\left(z^{*}\mid x\right).
\end{aligned}
$$
From this, we see that gradient descent on the REALM objective is equivalent to gradient descent on $\log p\left(z^{ * }\mid x\right)$ . This is none other than the typical maximum likelihood training objective used in supervised learning, where $z^{ * }$ is the “gold” document.
由此可知,REALM目标函数的梯度下降等价于对$\log p\left(z^{ * }\mid x\right)$进行梯度下降。这正是监督学习中常用的最大似然训练目标,其中$z^{ * }$代表"黄金"文档。
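The claim that the posterior concentrates on $z^{*}$ can be spelled out with Bayes' rule. This is a worked step under the assumptions stated in this appendix ($p\left(y\mid z^{*},x\right)=1$, $p\left(y\mid z^{\prime},x\right)=0$ for every $z^{\prime}\neq z^{*}$, and $p\left(z^{*}\mid x\right)>0$):

$$
\begin{aligned}
p\left(z^{*}\mid y,x\right) &= \frac{p\left(y\mid z^{*},x\right)p\left(z^{*}\mid x\right)}{\sum_{z}p\left(y\mid z,x\right)p\left(z\mid x\right)} \\
&= \frac{1\cdot p\left(z^{*}\mid x\right)}{1\cdot p\left(z^{*}\mid x\right)+\sum_{z^{\prime}\neq z^{*}}0\cdot p\left(z^{\prime}\mid x\right)} \\
&= 1.
\end{aligned}
$$

By the same argument, $p\left(z^{\prime}\mid y,x\right)=0$ for every $z^{\prime}\neq z^{*}$, so the sum $\sum_{z}p\left(z\mid y,x\right)\nabla f(x,z)$ reduces to the single term $\nabla f\left(x,z^{*}\right)$.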
C. Adapting to new knowledge
C. 适应新知识
An explicit retrieval system allows us to adapt to new world knowledge simply by modifying the corpus documents. To demonstrate this ability, we replace the knowledge corpus with a more recent version of the Wikipedia corpus after pre-training is done. When the input query is about a fact where the two corpora disagree, REALM can change the prediction to reflect the updated information, as exemplified in Table 4. However, even with an explicit retrieval mechanism, the knowledge-augmented encoder will still end up remembering some world knowledge, so the predictions for some input sentences do not change with the new corpus. (For instance, the model predicts “Thatcher” for “[MASK] is the prime minister of United Kingdom.” on both corpora, perhaps due to the frequent mention of her name in Wikipedia articles.)
显式检索系统让我们只需修改语料文档就能适应新的世界知识。为验证这一能力,我们在预训练完成后将知识语料库替换为更新的维基百科语料版本。如表4所示,当输入查询涉及两个语料库存在分歧的事实时,REALM能够调整预测结果以反映更新后的信息。但值得注意的是,即便采用显式检索机制,经过知识增强的编码器仍会记忆部分世界知识,导致某些输入句子的预测结果无法随新语料更新(例如模型在两个语料库中都针对"[MASK] is the prime minister of United Kingdom."预测出"Thatcher",这可能是由于其名字在维基百科文章中高频出现所致)。
D. Retrieval Utility
D. 检索效用
The null document $\emptyset$ described in Section 3.4 provides a way to measure the importance of a retrieved document $z$: we define the retrieval utility (RU) of $z$ for the masked input $x$ as the difference between the log-likelihood of the knowledge-augmented encoder when conditioning on $z$ versus on $\emptyset$:
3.4节中描述的空文档$\emptyset$提供了一种衡量检索文档$z$重要性的方法:我们将$z$对掩码输入$x$的检索效用(RU)定义为知识增强编码器在条件为$z$与条件为$\emptyset$时的对数似然差:
$$
\operatorname{RU}(z\mid x)=\log p(y\mid z,x)-\log p(y\mid\emptyset,x).
$$
A negative RU shows that $z$ is less useful for predicting $y$ than the null document. This could mean that $z$ is irrelevant to $x$, but could also mean that the masked tokens in $x$ do not require world knowledge to predict, or that the world knowledge is sufficiently commonplace that it has been baked into the model’s parameters. In practice, we find that RU increases steadily over the course of pre-training, and is more predictive of good performance on the downstream task of Open-QA than even the overall log-likelihood. An example of how RU behaves over time and across different settings is shown in Figure 4.
负的RU值表明,$z$对预测$y$的作用比空文档还低。这可能意味着$z$与$x$无关,但也可能表示$x$中被遮蔽的token不需要世界知识来预测,或者相关世界知识过于常见,已被融入模型参数中。实践中我们发现,RU在预训练过程中稳步上升,甚至比整体对数似然更能预测模型在开放问答(Open-QA)下游任务中的表现。图4展示了RU在不同时期及不同设置下的变化示例。
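As a small illustration of the definition above, the sketch below computes RU from two likelihoods. The function name and the probability values are hypothetical and are not taken from the paper's experiments. In practice both terms would come from the knowledge-augmented encoder, evaluated once with the retrieved document $z$ and once with the null document $\emptyset$.

```python
import math

def retrieval_utility(p_y_given_z, p_y_given_null):
    """RU(z | x) = log p(y | z, x) - log p(y | null, x)."""
    return math.log(p_y_given_z) - math.log(p_y_given_null)

# Hypothetical likelihoods of the masked tokens under the encoder.
helpful = retrieval_utility(0.6, 0.2)    # z makes y three times more likely
unhelpful = retrieval_utility(0.1, 0.2)  # z is worse than retrieving nothing
neutral = retrieval_utility(0.2, 0.2)    # z changes nothing

print(helpful, unhelpful, neutral)
```

A positive value means the retrieved document raised the likelihood of the masked tokens relative to the null document, and a negative value means it lowered it. Equal likelihoods give an RU of zero.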
$x$: "Jennifer [MASK] formed the production company Excellent Cadaver."

| Model | Top predictions for the masked token (probability) |
| --- | --- |
| BERT | also (0.13), then (0.08), later (0.05), ... |
| REALM ($\mathcal{Z}$ = 20 Dec 2018 corpus) | smith (0.01), brown (0.01), jones (0.01) |
| REALM ($\mathcal{Z}$ = 20 Jan 2020 corpus) | lawrence (0.13), brown (0.01), smith (0.01), ... |
Table 4. An example where REALM adapts to the updated knowledge corpus. The Wikipedia page “Excellent Cadaver” was added in 2019, so the model was not able to recover the word when the knowledge corpus was outdated (2018). Interestingly, the same REALM model pre-trained on the 2018 corpus is able to retrieve the document in the updated corpus (2020) and generate the correct token, “Lawrence”.
表 4: REALM 适应更新后知识库的示例。由于"Excellent Cadaver"维基百科页面于2019年新增,当知识库未更新时(2018),模型无法恢复该词汇。值得注意的是,基于2018年知识库预训练的同一REALM模型,在更新后的知识库(2020)中能够检索到该文档并生成正确token"Lawrence"。

Figure 4. The Retrieval Utility (RU, described in Eq. 2) vs the number of pre-training steps. RU roughly estimates the “usefulness” of retrieval. RU is impacted by the choice of masking and the number of pre-training steps.
图 4: 检索效用 (Retrieval Utility, RU, 公式 2 所述) 与预训练步数的关系。RU 粗略估计了检索的"有用性"。RU 受掩码选择和预训练步数的影响。
