Learning Dense Representations of Phrases at Scale
Abstract
Open-domain question answering can be reformulated as a phrase retrieval problem, without the need for processing documents on-demand during inference (Seo et al., 2019). However, current phrase retrieval models heavily depend on sparse representations and still underperform retriever-reader approaches. In this work, we show for the first time that we can learn dense representations of phrases alone that achieve much stronger performance in open-domain QA. We present an effective method to learn phrase representations from the supervision of reading comprehension tasks, coupled with novel negative sampling methods. We also propose a query-side fine-tuning strategy, which can support transfer learning and reduce the discrepancy between training and inference. On five popular open-domain QA datasets, our model DensePhrases improves over previous phrase retrieval models by 15%-25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. Our model is easy to parallelize due to its purely dense representations and processes more than 10 questions per second on CPUs. Finally, we directly use our pre-indexed dense phrase representations for two slot filling tasks, showing the promise of utilizing DensePhrases as a dense knowledge base for downstream tasks.
1 Introduction
Open-domain question answering (QA) aims to provide answers to natural-language questions using a large text corpus (Voorhees et al., 1999; Ferrucci et al., 2010; Chen and Yih, 2020). While the dominant approach is the two-stage retriever-reader pipeline (Chen et al., 2017; Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020), we focus on a recent new paradigm based solely on phrase retrieval (Seo et al., 2019; Lee et al., 2020). Phrase retrieval highlights the use of phrase representations and finds answers purely based on similarity search in the vector space of phrases. Without relying on an expensive reader model for processing text passages, it has demonstrated great runtime efficiency at inference time.
Despite great promise, it remains a formidable challenge to build vector representations for every single phrase in a large corpus. Since phrase representations are decomposed from question representations, they are inherently less expressive than cross-attention models (Devlin et al., 2019). Moreover, the approach requires retrieving answers correctly out of billions of phrases (e.g., $6\times10^{10}$ phrases in English Wikipedia), which makes the scale of the learning problem difficult. Consequently, existing approaches heavily rely on sparse representations for locating relevant documents and paragraphs, while still falling behind retriever-reader models (Seo et al., 2019; Lee et al., 2020).
In this work, we investigate whether we can build fully dense phrase representations at scale for open-domain QA. First, we aim to learn strong phrase representations from the supervision of reading comprehension tasks. We propose to use data augmentation and knowledge distillation to learn better phrase representations within a single passage. We then adopt negative sampling strategies such as in-batch negatives (Henderson et al., 2017; Karpukhin et al., 2020) to better discriminate the phrases at a larger scale. Here, we present a novel method called pre-batch negatives, which leverages preceding mini-batches as negative examples to compensate for the need for large-batch training. Lastly, we present a query-side fine-tuning strategy that drastically improves phrase retrieval performance and allows for transfer learning to new domains, without re-building billions of phrase representations.
Table 1: Retriever-reader and phrase retrieval approaches for open-domain QA. The retriever-reader approach retrieves a small number of relevant documents or passages from which the answers are extracted. The phrase retrieval approach retrieves an answer out of billions of phrase representations pre-indexed from the entire corpus. Appendix B provides detailed benchmark specifications. The accuracy is measured on the test sets in the open-domain setting. NQ: Natural Questions.
| Category | Model | Sparse? | Storage (GB) | #Q/s (GPU, CPU) | NQ (Acc) | SQuAD (Acc) |
|---|---|---|---|---|---|---|
| Retriever-Reader | DrQA (Chen et al., 2017) | ✓ | 26 | 1.8, 0.6 | – | 29.8 |
| | BERTserini (Yang et al., 2019) | ✓ | 21 | 2.0, 0.4 | – | 38.6 |
| | ORQA (Lee et al., 2019) | ✗ | 18 | 8.6, 1.2 | 33.3 | 20.2 |
| | REALM_News (Guu et al., 2020) | ✗ | 18 | 8.4, 1.2 | 40.4 | – |
| | DPR-multi (Karpukhin et al., 2020) | ✗ | 76 | 0.9, 0.04 | 41.5 | 24.1 |
| Phrase Retrieval | DenSPI (Seo et al., 2019) | ✓ | 1,200 | 2.9, 2.4 | 8.1 | 36.2 |
| | DenSPI + Sparc (Lee et al., 2020) | ✓ | 1,547 | 2.1, 1.7 | 14.5 | 40.7 |
| | DensePhrases (Ours) | ✗ | 320 | 20.6, 13.6 | 40.9 | 38.0 |
As a result, all these improvements lead to a much stronger phrase retrieval model, without the use of any sparse representations (Table 1). We evaluate our model, DensePhrases, on five standard open-domain QA datasets and achieve much better accuracies than previous phrase retrieval models (Seo et al., 2019; Lee et al., 2020), with a 15%-25% absolute improvement on most datasets. Our model also matches the performance of state-of-the-art retriever-reader models (Guu et al., 2020; Karpukhin et al., 2020). Due to the removal of sparse representations and careful design choices, we further reduce the storage footprint for the full English Wikipedia from 1.5TB to 320GB, as well as drastically improve the throughput.
Finally, we envision that DensePhrases acts as a neural interface for retrieving phrase-level knowledge from a large text corpus. To showcase this possibility, we demonstrate that we can directly use DensePhrases for fact extraction, without rebuilding the phrase storage. By only fine-tuning the question encoder on a small number of subject-relation-object triples, we achieve state-of-the-art performance on two slot filling tasks (Petroni et al., 2021), using less than 5% of the training data.
2 Background
We first formulate the task of open-domain question answering for a set of $K$ documents $\mathcal{D} = \{d_{1},\ldots,d_{K}\}$. We follow recent work (Chen et al., 2017; Lee et al., 2019) and treat all of English Wikipedia as $\mathcal{D}$, hence $K\approx5\times10^{6}$. However, most approaches, including ours, are generic and could be applied to other collections of documents.
The task aims to provide an answer $\hat{a}$ for the input question $q$ based on $\mathcal{D}$. In this work, we focus on the extractive QA setting, where each answer is a segment of text, or a phrase, that can be found in $\mathcal{D}$. Denote the set of phrases in $\mathcal{D}$ as $S(\mathcal{D})$, where each phrase $s_{k}\in S(\mathcal{D})$ consists of contiguous words $w_{\mathrm{start}(k)},\ldots,w_{\mathrm{end}(k)}$ in its document $d_{\mathrm{doc}(k)}$. In practice, we consider all the phrases of up to $L=20$ words in $\mathcal{D}$, so $S(\mathcal{D})$ comprises roughly $6\times10^{10}$ phrases. An extractive QA system returns a phrase $\hat{s}=\mathrm{argmax}_{s\in S(\mathcal{D})}f(s\mid\mathcal{D},q)$, where $f$ is a scoring function. The system finally maps $\hat{s}$ to an answer string $\hat{a}$, i.e., $\mathrm{TEXT}(\hat{s})=\hat{a}$, and evaluation is typically done by comparing the predicted answer $\hat{a}$ with a gold answer $a^{*}$.
Although we focus on the extractive QA setting, recent works propose to use a generative model as the reader (Lewis et al., 2020; Izacard and Grave, 2021), or to learn a closed-book QA model (Roberts et al., 2020), which directly predicts answers without using an external knowledge source. The extractive setting provides two advantages: first, the model directly locates the source of the answer, which is more interpretable, and second, phrase-level knowledge retrieval can be uniquely adapted to other NLP tasks, as we show in §7.3.
Retriever-reader. A dominating paradigm in open-domain QA is the retriever-reader approach (Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020), which leverages a first-stage document retriever $f_{\mathrm{retr}}$ and only reads the top $K^{\prime}\ll K$ documents with a reader model $f_{\mathrm{read}}$. The scoring function $f(s\mid\mathcal{D},q)$ is decomposed as:
$$
f(s \mid \mathcal{D}, q) = f_{\mathrm{retr}}(\{d_{j_1}, \ldots, d_{j_{K'}}\} \mid \mathcal{D}, q) \times f_{\mathrm{read}}(s \mid \{d_{j_1}, \ldots, d_{j_{K'}}\}, q),
$$
where $\{j_{1},\dots,j_{K^{\prime}}\}\subset\{1,\dots,K\}$ and if $s\not\in S(\{d_{j_{1}},\ldots,d_{j_{K^{\prime}}}\})$, the score will be 0. It can easily adapt to passages and sentences (Yang et al., 2019; Wang et al., 2019). However, this approach suffers from error propagation when incorrect documents are retrieved and can be slow, as it usually requires running an expensive reader model on every retrieved document or passage at inference time.
Phrase retrieval. Seo et al. (2019) introduce the phrase retrieval approach that encodes phrase and question representations independently and performs similarity search over the phrase representations to find an answer. Their scoring function $f$ is computed as follows:
$$
f(s \mid \mathcal{D}, q) = E_s(s, \mathcal{D})^\top E_q(q),
$$
where $E_{s}$ and $E_{q}$ denote the phrase encoder and the question encoder respectively. As the $E_{s}(\cdot)$ and $E_{q}(\cdot)$ representations are decomposable, the approach can support maximum inner product search (MIPS) and improve the efficiency of open-domain QA models. Previous approaches (Seo et al., 2019; Lee et al., 2020) leverage both dense and sparse vectors for phrase and question representations by taking their concatenation: $E_{s}(s,\mathcal{D})=[E_{\mathrm{sparse}}(s,\mathcal{D}),E_{\mathrm{dense}}(s,\mathcal{D})]$. However, since the sparse vectors are difficult to parallelize with dense vectors, their method essentially conducts sparse and dense vector search separately. The goal of this work is to only use dense representations, i.e., $E_{s}(s,\mathcal{D})=E_{\mathrm{dense}}(s,\mathcal{D})$, which can model $f(s\mid\mathcal{D},q)$ solely with MIPS, as well as close the gap in performance.
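To make the decomposition concrete, the following minimal sketch (random vectors and made-up shapes, not the paper's code) shows how the decomposed score $E_{s}(s,\mathcal{D})^\top E_{q}(q)$ reduces answer selection to an inner-product search over pre-indexed phrase vectors:

```python
import numpy as np

# A toy MIPS setup: 1,000 pre-indexed phrase vectors and one question vector.
rng = np.random.default_rng(0)
num_phrases, dim = 1_000, 2 * 768                 # 2d: concatenated start/end vectors
phrase_vecs = rng.standard_normal((num_phrases, dim)).astype("float32")  # E_s(s, D)
question_vec = rng.standard_normal(dim).astype("float32")                # E_q(q)

scores = phrase_vecs @ question_vec               # f(s | D, q) for every phrase
best = int(np.argmax(scores))                     # exhaustive maximum inner product search
print(best, float(scores[best]))
```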
3 DensePhrases
3.1 Overview
We introduce DensePhrases, a phrase retrieval model that is built on fully dense representations. Our goal is to learn a phrase encoder as well as a question encoder, so we can pre-index all the possible phrases in $\mathcal{D}$ and efficiently retrieve phrases for any question through MIPS at test time. We outline our approach as follows: we first learn phrase representations from the supervision of reading comprehension tasks together with effective negative sampling (§4), we then index all phrase representations in $\mathcal{D}$ and search over them with MIPS (§5), and we finally fine-tune the question encoder against the full phrase index with query-side fine-tuning (§6).
Before we present the approach in detail, we first describe our base architecture below.
3.2 Base Architecture
Our base architecture consists of a phrase encoder $E_{s}$ and a question encoder $E_{q}$. Given a passage $p=w_{1},\ldots,w_{m}$, we denote all the phrases of up to $L$ tokens as $S(p)$. Each phrase $s_{k}$ has start and end indices $\mathrm{start}(k)$ and $\mathrm{end}(k)$, and the gold phrase is $s^{*}\in S(p)$. Following previous work on phrase or span representations (Lee et al., 2017; Seo et al., 2018), we first apply a pre-trained language model $\mathcal{M}_{p}$ to obtain contextualized word representations for each passage token: $\mathbf{h}_{1},\ldots,\mathbf{h}_{m}\in\mathbb{R}^{d}$. Then, we can represent each phrase $s_{k}\in S(p)$ as the concatenation of the corresponding start and end vectors:
$$
E_s(s_k, p) = [\mathbf{h}_{\mathrm{start}(k)}, \mathbf{h}_{\mathrm{end}(k)}] \in \mathbb{R}^{2d}.
$$
A great advantage of this representation is that we eventually only need to index and store all the word vectors (we use $\mathcal{W}(\mathcal{D})$ to denote all the words in $\mathcal{D}$), instead of all the phrases $S(\mathcal{D})$, which is at least one order of magnitude smaller.
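As a small illustration of why only token vectors need to be stored, here is a hedged sketch (PyTorch, with a random stand-in for the encoder outputs) that reconstructs any phrase vector from two rows of the token matrix:

```python
import torch

m, d = 128, 768
H = torch.randn(m, d)    # stand-in for the token representations h_1..h_m from M_p

def phrase_rep(start: int, end: int) -> torch.Tensor:
    """E_s(s_k, p) = [h_start(k); h_end(k)] in R^{2d}."""
    return torch.cat([H[start], H[end]], dim=-1)

# Only the m token vectors are stored; any of the O(m * L) candidate phrases
# can be reconstructed on the fly from two rows of H.
print(phrase_rep(10, 13).shape)    # torch.Size([1536])
```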
Similarly, we need to learn a question encoder $E_{q}(\cdot)$ that maps a question $q=\tilde{w}_{1},\ldots,\tilde{w}_{n}$ to a vector of the same dimension as $E_{s}(\cdot)$. Since the start and end representations of phrases are produced by the same language model, we use two different pre-trained encoders $\mathcal{M}_{q,\mathrm{start}}$ and $\mathcal{M}_{q,\mathrm{end}}$ to differentiate the start and end positions. We apply $\mathcal{M}_{q,\mathrm{start}}$ and $\mathcal{M}_{q,\mathrm{end}}$ to $q$ separately and obtain representations $\mathbf{q}^{\mathrm{start}}$ and $\mathbf{q}^{\mathrm{end}}$, taken from their [CLS] token representations respectively. Finally, $E_{q}(\cdot)$ simply takes their concatenation:
Figure 1: An overview of DensePhrases. (a) We learn dense phrase representations in a single passage (§4.1) along with in-batch and pre-batch negatives (§4.2, §4.3). (b) With the top $k$ retrieved phrase representations from the entire text corpus (§5), we further perform query-side fine-tuning to optimize the question encoder (§6). During inference, our model simply returns the top-1 prediction.
$$
E_q(q) = [\mathbf{q}^{\mathrm{start}}, \mathbf{q}^{\mathrm{end}}] \in \mathbb{R}^{2d}.
$$
Note that we use pre-trained language models to initialize $\mathcal{M}_{p}$, $\mathcal{M}_{q,\mathrm{start}}$, and $\mathcal{M}_{q,\mathrm{end}}$, and they are fine-tuned with the objectives that we define later. In our pilot experiments, we found that SpanBERT (Joshi et al., 2020) leads to superior performance compared to BERT (Devlin et al., 2019). SpanBERT is designed to predict the information in an entire span from its two endpoints, so it is well suited for our phrase representations. In our final model, we use SpanBERT-base-cased as the base LM for $E_{s}$ and $E_{q}$, and hence $d=768$. See Table 5 for an ablation study.
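A minimal sketch of the question encoder described above, using two separately fine-tuned BERT-style encoders and their [CLS] vectors. Here `bert-base-cased` and the Hugging Face `transformers` calls are only runnable placeholders for illustration; the paper initializes $\mathcal{M}_{q,\mathrm{start}}$ and $\mathcal{M}_{q,\mathrm{end}}$ from SpanBERT-base-cased and fine-tunes them.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-cased"                        # placeholder for SpanBERT-base-cased
tok = AutoTokenizer.from_pretrained(name)
enc_start = AutoModel.from_pretrained(name)     # M_{q,start}
enc_end = AutoModel.from_pretrained(name)       # M_{q,end}

@torch.no_grad()
def encode_question(q: str) -> torch.Tensor:
    inputs = tok(q, return_tensors="pt")
    q_start = enc_start(**inputs).last_hidden_state[:, 0]    # [CLS] vector
    q_end = enc_end(**inputs).last_hidden_state[:, 0]        # [CLS] vector
    return torch.cat([q_start, q_end], dim=-1)               # E_q(q) in R^{2d}

print(encode_question("Who founded Wikipedia?").shape)       # torch.Size([1, 1536])
```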
4 Learning Phrase Representations
In this section, we start by learning dense phrase representations from the supervision of reading comprehension tasks, i.e., a single passage $p$ contains an answer $a^{*}$ to a question $q$. Our goal is to learn strong dense representations of phrases $s\in S(p)$, which can be retrieved by a dense representation of the question and serve as a direct answer (§4.1). Then, we introduce two different negative sampling methods (§4.2, §4.3), which encourage the phrase representations to be better discriminated at the full Wikipedia scale. See Figure 1 for an overview of DensePhrases.
4.1 Single-passage Training
To learn phrase representations in a single passage along with question representations, we first maximize the log-likelihood of the start and end positions of the gold phrase $s^{*}$, where $\operatorname{TEXT}(s^{*})=a^{*}$. The training loss for predicting the start position of a phrase given a question is computed as:
$$
\begin{aligned}
z_1^{\mathrm{start}}, \ldots, z_m^{\mathrm{start}} &= [\mathbf{h}_1^\top \mathbf{q}^{\mathrm{start}}, \ldots, \mathbf{h}_m^\top \mathbf{q}^{\mathrm{start}}], \\
P^{\mathrm{start}} &= \mathrm{softmax}(z_1^{\mathrm{start}}, \ldots, z_m^{\mathrm{start}}), \\
\mathcal{L}_{\mathrm{start}} &= -\log P^{\mathrm{start}}_{\mathrm{start}(s^*)}.
\end{aligned}
$$
We can define $\mathcal{L}_{\mathrm{end}}$ in a similar way and the final loss for the single-passage training is
$$
\mathcal{L}_{\mathrm{single}} = \frac{\mathcal{L}_{\mathrm{start}} + \mathcal{L}_{\mathrm{end}}}{2}.
$$
This essentially learns reading comprehension without any cross-attention between the passage and the question tokens, which fully decomposes phrase and question representations.
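The following sketch computes $\mathcal{L}_{\mathrm{single}}$ for one (passage, question) pair with random stand-in tensors; the cross-entropy call is simply the negative log-softmax at the gold position, as in the equations above:

```python
import torch
import torch.nn.functional as F

m, d = 128, 768
H = torch.randn(m, d)                         # token representations h_1..h_m
q_start, q_end = torch.randn(d), torch.randn(d)
start_gold, end_gold = 10, 13                 # positions of the gold phrase s*

z_start = H @ q_start                         # one start score per token
z_end = H @ q_end                             # one end score per token
loss_start = F.cross_entropy(z_start.unsqueeze(0), torch.tensor([start_gold]))
loss_end = F.cross_entropy(z_end.unsqueeze(0), torch.tensor([end_gold]))
loss_single = (loss_start + loss_end) / 2     # L_single
```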
Data augmentation. Since the contextualized word representations $\mathbf{h}_{1},\ldots,\mathbf{h}_{m}$ are encoded in a query-agnostic way, they are always inferior to query-dependent representations in cross-attention models (Devlin et al., 2019), where passages are fed together with the question, joined by a special token such as [SEP]. We hypothesize that one key reason for the performance gap is that reading comprehension datasets only provide a few annotated questions for each passage, compared to the set of possible answer phrases. With this supervision alone, it is difficult to differentiate similar phrases in one passage (e.g., $s^{*}=$ Charles, Prince of Wales and another $s=$ Prince George for the question $q=$ Who is next in line to be the monarch of England?).
Following this intuition, we propose to use a simple model to generate additional questions for data augmentation, based on a T5-large model (Raffel et al., 2020). To train the question generation model, we feed a passage $p$ with the gold answer $s^{*}$ highlighted by inserting surrounding special tags. Then, the model is trained to maximize the log-likelihood of the question words of $q$. After training, we extract all the named entities in each training passage as candidate answers and feed the passage $p$ with each candidate answer to generate questions. We keep a question-answer pair only when a cross-attention reading comprehension model makes a correct prediction on the generated pair. The remaining generated QA pairs $\{(\bar{q}_{1},\bar{s}_{1}),(\bar{q}_{2},\bar{s}_{2}),\ldots,(\bar{q}_{r},\bar{s}_{r})\}$ are added directly to the original training set.
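A hedged sketch of the augmentation loop described above. The helper names (`extract_entities`, `qg_model`, `rc_model`) and the `<hl>` highlight tag are illustrative placeholders for the components the paper describes (a named-entity extractor, the fine-tuned T5-large question generator, and the cross-attention filter); the paper only says "surrounding special tags", so the exact tag is an assumption.

```python
def augment_passage(passage: str, extract_entities, qg_model, rc_model):
    """Generate and filter (question, answer) pairs for one training passage."""
    augmented = []
    for answer in extract_entities(passage):                         # candidate answers (named entities)
        tagged = passage.replace(answer, f"<hl> {answer} <hl>", 1)   # highlight the answer span
        question = qg_model(tagged)                                  # T5-large question generator
        if rc_model(question, passage) == answer:                    # keep only consistent pairs
            augmented.append((question, answer))
    return augmented
```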
Distillation. We also propose improving the phrase representations by distilling knowledge from a cross-attention model (Hinton et al., 2015). We minimize the Kullback–Leibler divergence between the probability distribution from our phrase encoder and that from a standard SpanBERT-base QA model. The loss is computed as follows:
$$
\mathcal{L}_{\mathrm{distill}} = \frac{\mathrm{KL}(P^{\mathrm{start}} \,\|\, P_c^{\mathrm{start}}) + \mathrm{KL}(P^{\mathrm{end}} \,\|\, P_c^{\mathrm{end}})}{2},
$$
where $P^{\mathrm{start}}$ (and $P^{\mathrm{end}}$ ) is defined in Eq. (5) and $P_{c}^{\mathrm{start}}$ and $P_{c}^{\mathrm{end}}$ denote the probability distributions used to predict the start and end positions of answers in the cross-attention model.
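A small sketch of this distillation term, assuming the student's start/end distributions and the teacher's log-probabilities are already aligned over the same $m$ passage tokens; note that `F.kl_div` takes the teacher's log-probabilities as `input` and the student probabilities as `target` to compute $\mathrm{KL}(P\,\|\,P_c)$ as in the equation above.

```python
import torch
import torch.nn.functional as F

def distill_loss(p_start, p_end, teacher_log_p_start, teacher_log_p_end):
    """KL(P || P_c) averaged over start and end positions.

    p_start/p_end: student probabilities over the m tokens (from softmax);
    teacher_log_p_*: the cross-attention teacher's log-probabilities.
    """
    kl_start = F.kl_div(teacher_log_p_start, p_start, reduction="sum")
    kl_end = F.kl_div(teacher_log_p_end, p_end, reduction="sum")
    return (kl_start + kl_end) / 2

m = 128
p = torch.softmax(torch.randn(m), dim=-1)
log_pc = torch.log_softmax(torch.randn(m), dim=-1)
print(distill_loss(p, p, log_pc, log_pc))
```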
4.2 In-batch Negatives
Eventually, we need to build phrase representations for billions of phrases. Therefore, a bigger challenge is to incorporate more phrases as negatives so the representations can be better discriminated at a larger scale. While Seo et al. (2019) simply sample two negative passages based on question similarity, we use in-batch negatives for our dense phrase representations, which has been shown to be effective in learning dense passage representations before (Karpukhin et al., 2020).
Figure 2: Two types of negative samples for the first batch item $(\mathbf{q}_{1}^{\mathrm{start}})$ in a mini-batch of size $B=4$ and $C=3$. Note that the negative samples for the end representations $(\mathbf{q}_{i}^{\mathrm{end}})$ are obtained in a similar manner. See §4.2 and §4.3 for more details.
As shown in Figure 2 (a), for the $i$-th example in a mini-batch of size $B$, we denote the hidden representations of the gold start and end positions, $\mathbf{h}_{\mathrm{start}(s^{*})}$ and $\mathbf{h}_{\mathrm{end}(s^{*})}$, as $\mathbf{g}_{i}^{\mathrm{start}}$ and $\mathbf{g}_{i}^{\mathrm{end}}$, and the question representation as $[\mathbf{q}_{i}^{\mathrm{start}},\mathbf{q}_{i}^{\mathrm{end}}]$. Let $\mathbf{G}^{\mathrm{start}},\mathbf{G}^{\mathrm{end}},\mathbf{Q}^{\mathrm{start}},\mathbf{Q}^{\mathrm{end}}$ be the $B\times d$ matrices whose rows correspond to $\mathbf{g}_{i}^{\mathrm{start}},\mathbf{g}_{i}^{\mathrm{end}},\mathbf{q}_{i}^{\mathrm{start}},\mathbf{q}_{i}^{\mathrm{end}}$ respectively. Basically, we can treat all the gold phrases from other passages in the same mini-batch as negative examples. We compute $\mathbf{S}^{\mathrm{start}}=\mathbf{Q}^{\mathrm{start}}\mathbf{G}^{\mathrm{start}\top}$ and $\mathbf{S}^{\mathrm{end}}=\mathbf{Q}^{\mathrm{end}}\mathbf{G}^{\mathrm{end}\top}$, and the $i$-th rows of $\mathbf{S}^{\mathrm{start}}$ and $\mathbf{S}^{\mathrm{end}}$ each contain $B$ scores, including one positive score and $B-1$ negative scores: $s_{1}^{\mathrm{start}},\ldots,s_{B}^{\mathrm{start}}$ and $s_{1}^{\mathrm{end}},\ldots,s_{B}^{\mathrm{end}}$. Similar to Eq. (5), we can compute the loss function for the $i$-th example as:
$$
\begin{aligned}
P_i^{\mathrm{start\_ib}} &= \mathrm{softmax}(s_1^{\mathrm{start}}, \ldots, s_B^{\mathrm{start}}), \\
P_i^{\mathrm{end\_ib}} &= \mathrm{softmax}(s_1^{\mathrm{end}}, \ldots, s_B^{\mathrm{end}}), \\
\mathcal{L}_{\mathrm{neg}} &= -\frac{\log P_i^{\mathrm{start\_ib}} + \log P_i^{\mathrm{end\_ib}}}{2}.
\end{aligned}
$$
We also attempted using non-gold phrases from other passages as negatives but did not find a meaningful improvement.
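A sketch of the in-batch negative loss for the start side, with stand-in tensors; since the positives sit on the diagonal of the $B\times B$ score matrix, a standard cross-entropy with targets $0,\dots,B-1$ implements the softmax above.

```python
import torch
import torch.nn.functional as F

B, d = 16, 768
G_start = torch.randn(B, d)            # g_i^start: gold-phrase start vectors in the batch
Q_start = torch.randn(B, d)            # q_i^start: question start vectors

S_start = Q_start @ G_start.T          # [B, B] score matrix S^start
targets = torch.arange(B)              # the diagonal entries are the positives
loss_start_ib = F.cross_entropy(S_start, targets)
# The end-side term is computed the same way from Q_end and G_end,
# and the two are averaged as in the equation above.
```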
4.3 Pre-batch Negatives
The in-batch negatives usually benefit from a large batch size (Karpukhin et al., 2020). However, it is challenging to further increase the batch size, as it is bounded by the size of GPU memory. Next, we propose a novel negative sampling method called pre-batch negatives, which can effectively utilize the representations from the preceding $C$ mini-batches (Figure 2 (b)). In each iteration, we maintain a FIFO queue of $C$ mini-batches to cache the phrase representations $\mathbf{G}^{\mathrm{start}}$ and $\mathbf{G}^{\mathrm{end}}$. The cached phrase representations are then used as negative samples for the next iteration, providing $B\times C$ additional negative samples in total.
These pre-batch negatives are used together with in-batch negatives and the training loss is the same as Eq. (8), except that the gradients are not backpropagated to the cached pre-batch negatives. After warming up the model with in-batch negatives, we simply shift from in-batch negatives ($B-1$ negatives) to in-batch and pre-batch negatives (hence a total of $B\times C+B-1$ negatives). For simplicity, we use $\mathcal{L}_{\mathrm{neg}}$ to denote the loss for both in-batch and pre-batch negatives. Since we do not retain the computational graph for pre-batch negatives, their memory consumption is much more manageable while allowing an increase in the number of negative samples.
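A minimal sketch of the pre-batch queue, assuming the in-batch setup above; the cached representations are detached so no gradient flows back into previous batches.

```python
import collections
import torch
import torch.nn.functional as F

C, B, d = 2, 16, 768
queue = collections.deque(maxlen=C)    # FIFO cache of the preceding C mini-batches

def in_and_pre_batch_loss(Q_start, G_start):
    candidates = torch.cat([G_start] + list(queue), dim=0)   # current gold phrases + cached ones
    scores = Q_start @ candidates.T                          # [B, B + B * len(queue)]
    targets = torch.arange(Q_start.size(0))                  # positives stay in the first block
    loss = F.cross_entropy(scores, targets)
    queue.append(G_start.detach())                           # no gradient through the cache
    return loss

for _ in range(3):                                           # a few toy training steps
    print(in_and_pre_batch_loss(torch.randn(B, d), torch.randn(B, d)).item())
```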
4.4 Training Objective
Finally, we optimize all three losses together, on both annotated reading comprehension examples and the generated questions from §4.1:
$$
\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{single}} + \lambda_2 \mathcal{L}_{\mathrm{distill}} + \lambda_3 \mathcal{L}_{\mathrm{neg}},
$$
where $\lambda_{1},\lambda_{2},\lambda_{3}$ determine the importance of each loss term. We found that $\lambda_{1}=1$, $\lambda_{2}=2$, and $\lambda_{3}=4$ work well in practice. See Table 5 and Table 6 for an ablation study of the different components.
5 Indexing and Search
Indexing. After training the phrase encoder $E_{s}$, we need to encode all the phrases $S(\mathcal{D})$ in the entire English Wikipedia $\mathcal{D}$ and store an index of the phrase dump. We segment each document $d_{i}\in\mathcal{D}$ into a set of natural paragraphs, from which we obtain token representations for each paragraph using $E_{s}(\cdot)$. Then, we build a phrase dump $\mathbf{H}=[\mathbf{h}_{1},\dots,\mathbf{h}_{|\mathcal{W}(\mathcal{D})|}]\in\mathbb{R}^{|\mathcal{W}(\mathcal{D})|\times d}$ by stacking the token representations from all the paragraphs in $\mathcal{D}$. Note that this process is computationally expensive, taking hundreds of GPU hours with a large disk footprint. To reduce the size of the phrase dump, we follow and modify several techniques introduced in Seo et al. (2019) (see Appendix E for details). After indexing, we can use two rows $i$ and $j$ of $\mathbf{H}$ to represent a dense phrase representation $[\mathbf{h}_{i},\mathbf{h}_{j}]$. We use faiss (Johnson et al., 2017) for building a MIPS index of $\mathbf{H}$.
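A hedged sketch of building a MIPS index with faiss over random stand-in token vectors; a flat inner-product index is used here only for simplicity, whereas the actual phrase dump relies on quantized and clustered indexes to fit Wikipedia-scale storage (Appendix E):

```python
import faiss
import numpy as np

d = 768
H = np.random.randn(100_000, d).astype("float32")   # stand-in for the token vectors in the dump

index = faiss.IndexFlatIP(d)                         # exact inner-product (MIPS) index
index.add(H)

q_start = np.random.randn(1, d).astype("float32")    # q^start for one question
scores, token_ids = index.search(q_start, 10)        # top-10 candidate start positions
```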
Search. For a given question $q$, we can find the answer $\hat{s}$ as follows:
$$
\begin{aligned}
\hat{s} &= \underset{s_{(i,j)}}{\operatorname{argmax}} \; E_s(s_{(i,j)}, \mathcal{D})^\top E_q(q) \\
&= \underset{s_{(i,j)}}{\operatorname{argmax}} \; (\mathbf{H}\mathbf{q}^{\mathrm{start}})_i + (\mathbf{H}\mathbf{q}^{\mathrm{end}})_j,
\end{aligned}
$$
where $s_{(i,j)}$ denotes a phrase with start and end indices $i$ and $j$ in the index $\mathbf{H}$. We can compute the argmax of $\mathbf{H}\mathbf{q}^{\mathrm{start}}$ and $\mathbf{H}\mathbf{q}^{\mathrm{end}}$ efficiently by performing MIPS over $\mathbf{H}$ with $\mathbf{q}^{\mathrm{start}}$ and $\mathbf{q}^{\mathrm{end}}$. In practice, we search for the top $k$ start and top $k$ end positions separately and perform a constrained search over their end and start positions respectively, such that $1\leq i\leq j<i+L\leq|\mathcal{W}(\mathcal{D})|$.
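A small sketch of this constrained search step, assuming the top-$k$ start and end candidates (token positions and their scores) have already been retrieved by MIPS as above:

```python
import numpy as np

def constrained_search(start_ids, start_scores, end_ids, end_scores, L=20):
    """Pick the best (i, j) with i <= j < i + L from the top-k start/end candidates."""
    best, best_score = None, -np.inf
    for i, s_i in zip(start_ids, start_scores):
        for j, s_j in zip(end_ids, end_scores):
            if i <= j < i + L and s_i + s_j > best_score:
                best, best_score = (int(i), int(j)), float(s_i + s_j)
    return best, best_score
```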
6 Query-side Fine-tuning
So far, we have created a phrase dump $\mathbf{H}$ that supports efficient MIPS search. In this section, we propose a novel method called query-side fine-tuning, which only updates the question encoder $E_{q}$ to correctly retrieve a desired answer $a^{*}$ for a question $q$ given $\mathbf{H}$. Formally speaking, we optimize the marginal log-likelihood of the gold answer $a^{*}$ for a question $q$, which resembles the weakly-supervised QA setting in previous work (Lee et al., 2019; Min et al., 2019). For every question $q$, we retrieve the top $k$ phrases and minimize the objective:
$$
\mathcal{L}_{\mathrm{query}} = -\log \frac{\sum_{s \in \tilde{S}(q),\, \mathrm{TEXT}(s) = a^{*}} \exp\big(f(s \mid \mathcal{D}, q)\big)}{\sum_{s \in \tilde{S}(q)} \exp\big(f(s \mid \mathcal{D}, q)\big)},
$$
where $f(s|\mathcal{D},q)$ is the score of the phrase $s$ (Eq. (2)) and $\tilde{\cal S}(q)$ denotes the top $k$ phrases for $q$ (Eq. (10)). In practice, we use $k=100$ for all the experiments.
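A minimal sketch of $\mathcal{L}_{\mathrm{query}}$ for a single question, assuming the $k$ retrieved phrase scores and a boolean mask marking which candidates match the gold answer string are given; the names are illustrative only.

```python
import torch

def query_loss(scores: torch.Tensor, is_gold: torch.Tensor) -> torch.Tensor:
    """-log of the total probability mass on candidates whose text equals a*.

    scores: f(s | D, q) for the top-k retrieved phrases, shape [k];
    is_gold: boolean mask of the same shape marking TEXT(s) == a*.
    """
    log_probs = torch.log_softmax(scores, dim=-1)          # normalize over the k candidates
    return -torch.logsumexp(log_probs[is_gold], dim=-1)    # marginal log-likelihood of a*

mask = torch.zeros(100, dtype=torch.bool)
mask[[3, 7]] = True                                        # two candidates match the gold answer
loss = query_loss(torch.randn(100), mask)
```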
There are several advantages to doing this: (1) we find that query-side fine-tuning can reduce the discrepancy between training and inference, and hence improve the final performance substantially (§8). Even with effective negative sampling, the model only sees a small portion of passages compared to the full scale of $\mathcal{D}$, and this training objective can effectively fill in the gap. (2) This training strategy allows for transfer learning to unseen domains, without rebuilding the entire phrase index. More specifically, the model is able to quickly adapt to new QA tasks (e.g., WebQuestions) when the phrase dump is built using SQuAD or Natural Questions. We also find that this can transfer to non-QA tasks when the query is written in a different format. In §7.3, we show the possibility of directly using DensePhrases for slot filling tasks by using a query such as (Michael Jackson, is a singer of, $x$). In this regard, we can view our model as a dense knowledge base that can be accessed by many different types of queries and is able to return phrase-level knowledge efficiently.
7 Experiments
7.1 Setup
Datasets. We use two reading comprehension datasets, SQuAD (Rajpurkar et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019), to learn phrase representations; in these datasets, a single gold passage is provided for each question. For the open-domain QA experiments, we evaluate our approach on five popular open-domain QA datasets: Natural Questions, WebQuestions (WQ) (Berant et al., 2013), CuratedTREC (TREC) (Baudis and Sedivy, 2015), TriviaQA (TQA) (Joshi et al., 2017), and SQuAD. Note that we only use SQuAD and/or NQ to build the phrase index and perform query-side fine-tuning (§6) for the other datasets.
We also evaluate our model on two slot filling tasks, to show how to adapt DensePhrases for other knowledge-intensive NLP tasks. We focus on using two slot filling datasets from the KILT benchmark (Petroni et al., 2021): T-REx (Elsahar et al., 2018) and zero-shot relation extraction (Levy et al., 2017). Each query is provided in the form of "{subject entity} [SEP] {relation}" and the answer is the object entity.