[论文翻译]基于记忆网络的大规模简单问答


原文地址:https://arxiv.org/pdf/1506.02075v1


Large-scale Simple Question Answering with Memory Networks

基于记忆网络的大规模简单问答

Antoine Bordes


Abstract

摘要

Training large-scale question answering systems is complicated because training sources usually cover a small portion of the range of possible questions. This paper studies the impact of multitask and transfer learning for simple question answering; a setting for which the reasoning required to answer is quite easy, as long as one can retrieve the correct evidence given a question, which can be difficult in large-scale conditions. To this end, we introduce a new dataset of $100\mathrm{k}$ questions that we use in conjunction with existing benchmarks. We conduct our study within the framework of Memory Networks (Weston et al., 2015) because this perspective allows us to eventually scale up to more complex reasoning, and show that Memory Networks can be successfully trained to achieve excellent performance.

训练大规模问答系统十分复杂,因为训练数据通常只覆盖了潜在问题范围的很小一部分。本文研究了多任务学习和迁移学习在简单问答任务中的影响——在这种设定中,只要能够根据问题检索到正确证据(这在大规模场景下可能很困难),所需的推理过程其实相当简单。为此,我们引入了一个包含 $100\mathrm{k}$ 问题的新数据集,并将其与现有基准结合使用。我们在记忆网络 (Memory Networks) 框架 (Weston et al., 2015) 下开展研究,因为该框架最终能扩展到更复杂的推理场景。实验表明,通过适当训练,记忆网络能够取得卓越性能。

1 Introduction

1 引言

Open-domain Question Answering (QA) systems aim at providing the exact answer(s) to questions formulated in natural language, without restriction of domain. While there is a long history of QA systems that search for textual documents or on the Web and extract answers from them (see e.g. (Voorhees and Tice, 2000; Dumais et al., 2002)), recent progress has been made with the release of large Knowledge Bases (KBs) such as Freebase, which contain consolidated knowledge stored as atomic facts, and extracted from different sources, such as free text, tables in webpages or collaborative input. Existing approaches for QA from KBs use learnable components to either transform the question into a structured KB query (Berant et al., 2013) or learn to embed questions and facts in a low dimensional vector space and retrieve the answer by computing similarities in this embedding space (Bordes et al., 2014a). However, while most recent efforts have focused on designing systems with higher reasoning capabilities, that could jointly retrieve and use multiple facts to answer, the simpler problem of answering questions that refer to a single fact of the KB, which we call Simple Question Answering in this paper, is still far from solved.

开放域问答 (Open-domain QA) 系统旨在为自然语言表述的问题提供精确答案,且不受领域限制。虽然基于文本检索或网络搜索的问答系统已有较长发展历史 (例如 Voorhees 和 Tice,2000;Dumais 等,2002),但随着 Freebase 等大型知识库 (Knowledge Base) 的发布,该领域取得了新进展——这些知识库以原子事实形式存储从自由文本、网页表格或众包输入等多源提取的整合知识。现有基于知识库的问答方法主要采用可学习组件:或将问题转换为结构化查询 (Berant 等,2013),或将问题与事实嵌入低维向量空间并通过相似度计算检索答案 (Bordes 等,2014a)。然而,尽管近期研究多聚焦于设计具备多事实联合推理能力的系统,但针对仅涉及知识库单一事实的简单问答 (本文称为 Simple Question Answering) 这一基础问题,其解决程度仍远未完善。

Hence, existing benchmarks are small; they mostly cover the head of the distributions of facts, and are restricted in their question types and their syntactic and lexical variations. As such, it is still unknown how much the existing systems perform outside the range of the specific question templates of a few, small benchmark datasets, and it is also unknown whether learning on a single dataset transfers well on other ones, and whether such systems can learn from different training sources, which we believe is necessary to capture the whole range of possible questions.

因此,现有基准测试规模较小,主要覆盖事实分布的头部,且在问题类型、句法和词汇变化方面受限。如此一来,现有系统在少数小型基准数据集特定问题模板范围外的表现仍属未知,也无法确定单一数据集的学习能否良好迁移到其他数据集,以及此类系统能否从不同训练源中学习(我们认为这对覆盖所有可能问题范围至关重要)。

Besides, the actual need for reasoning, i.e. constructing the answer from more than a single fact from the KB, depends on the actual structure of the KB. As we shall see, for instance, a simple preprocessing of Freebase tremendously increases the coverage of simple QA in terms of possible questions that can be answered with a single fact, including list questions that expect more than a single answer. In fact, the task of simple QA itself might already cover a wide range of practical usages, if the KB is properly organized.

此外,实际对推理的需求(即从知识库(KB)中多个事实构建答案)取决于知识库的实际结构。例如,我们将看到,对Freebase进行简单的预处理能极大提高简单问答的覆盖率,即仅需单个事实即可回答的问题数量,包括需要多个答案的列表问题。事实上,如果知识库组织得当,简单问答任务本身可能已涵盖广泛的实用场景。

This paper presents two contributions. First, as an effort to study the coverage of existing systems and the possibility to train jointly on different data sources via multitasking, we collected the first large-scale dataset of questions and answers based on a KB, called Simple Questions. This dataset, which is presented in Section 2, contains more than 100k questions written by human annotators and associated to Freebase facts, while the largest existing benchmark, WebQuestions, contains less than 6k questions created automatically using the Google suggest API.

本文提出了两项贡献。首先,为了研究现有系统的覆盖范围以及通过多任务处理在不同数据源上联合训练的可能性,我们收集了首个基于知识库 (KB) 的大规模问答数据集,称为 Simple Questions。该数据集在第2节中介绍,包含超过 100k 条由人工标注者撰写并与 Freebase 事实相关联的问题;而现有最大的基准测试 WebQuestions 仅包含不到 6k 个问题,且这些问题是通过 Google suggest API 自动生成的。

What American cartoonist is the creator of Andy Lippincott? (andy lippincott, character created by, garry trudeau)
Which forest is Fires Creek in? (fires creek, contained by, nantahala national forest)
What is an active ingredient in childrens earache relief? (childrens earache relief, active ingredients, capsicum)
What does Jimmy Neutron do? (jimmy neutron, fictional character occupation, inventor)
What dietary restriction is incompatible with kimchi? (kimchi, incompatible with dietary restrictions, veganism)

哪位美国漫画家创造了安迪·利平科特? (andy lippincott, character created by, garry trudeau)
Fires Creek位于哪片森林? (fires creek, contained by, nantahala national forest)
儿童耳痛缓解药的有效成分是什么? (childrens earache relief, active ingredients, capsicum)
吉米·中子从事什么职业? (jimmy neutron, fictional character occupation, inventor)
哪种饮食限制与韩国泡菜不兼容? (kimchi, incompatible with dietary restrictions, veganism)

Table 1: Examples of simple QA. Questions and corresponding facts have been extracted from the new dataset Simple Questions introduced in this paper. Actual answers are underlined (here, the last element of each fact).

表 1: 简单问答示例。问题及对应事实均提取自本文提出的新数据集 Simple Questions,正确答案以下划线标注(即每条事实的最后一个元素)。

Second, in sections 3 and 4, we present an embedding-based QA system developed under the framework of Memory Networks (MemNNs) (Weston et al., 2015; Sukhbaatar et al., 2015). Memory Networks are learning systems centered around a memory component that can be read and written to, with a particular focus on cases where the relationship between the input and response languages (here natural language) and the storage language (here, the facts from KBs) is performed by embedding all of them in the same vector space. The setting of simple QA corresponds to the elementary operation of performing a single lookup in the memory. While our model bears similarity with previous embedding models for QA (Bordes et al., 2014b; Bordes et al., 2014a), using the framework of MemNNs opens the perspective to more involved inference schemes in future work, since MemNNs were shown to perform well on complex reasoning toy QA tasks (Weston et al., 2015). We discuss related work in Section 5.

其次,在第3和第4节中,我们介绍了一个基于嵌入的问答系统,该系统是在记忆网络(MemNNs)框架下开发的(Weston等人,2015;Sukhbaatar等人,2015)。记忆网络是一种以记忆组件为核心的学习系统,该组件可读写,特别关注输入和响应语言(此处为自然语言)与存储语言(此处为知识库中的事实)之间的关系,通过将它们全部嵌入同一向量空间来实现。简单问答的设置对应于在记忆中进行单次查找的基本操作。尽管我们的模型与之前的问答嵌入模型(Bordes等人,2014b;Bordes等人,2014a)有相似之处,但使用记忆网络框架为未来工作中更复杂的推理方案开辟了前景,因为记忆网络已被证明在复杂的推理玩具问答任务中表现良好(Weston等人,2015)。我们将在第5节讨论相关工作。

We report experimental results in Section 6, where we show that our model achieves excellent results on the benchmark Web Questions. We also show that it can learn from two different QA datasets to improve its performance on both. We also present the first successful application of transfer learning for QA. Using the Reverb KB and QA datasets, we show that Reverb facts can be added to the memory and used to answer without retraining, and that MemNNs achieve better results than some systems designed on this dataset.

我们在第6节报告了实验结果,表明我们的模型在基准测试Web Questions上取得了优异表现。同时证明该模型能够通过两个不同问答数据集进行联合学习,从而提升双方性能。我们还首次成功实现了问答任务的迁移学习应用:基于Reverb知识库和问答数据集,证实无需重新训练即可将Reverb事实添加到记忆模块并用于回答,且记忆神经网络(MemNNs)在该数据集上的表现优于部分专用系统。

2 Simple Question Answering

2 简单问答

Knowledge Bases contain facts expressed as triples (subject, relationship, object), where subject and object are entities and relationship describes the type of (directed) link between these entities. The simple QA problem we address here consists in finding the answer to questions that can be rephrased as queries of the form (subject, relationship, ?), asking for all objects linked to subject by relationship. The question What do Jamaican people speak?, for instance, could be rephrased as the Freebase query (jamaica, language spoken, ?). In other words, fetching a single fact from a KB is sufficient to answer correctly.

知识库 (Knowledge Bases) 包含以三元组 (subject, relationship, object) 形式表示的事实,其中 subject 和 object 是实体,relationship 描述这些实体之间的 (有向) 链接类型。我们在此处理的简单问答 (QA) 问题,涉及寻找可以重新表述为 (subject, relationship, ?) 形式查询的问题答案,即询问通过 relationship 与 subject 关联的所有 objects。例如,问题 "What do Jamaican people speak?" 可以重新表述为 Freebase 查询 (jamaica, language spoken, ?)。换句话说,从知识库中获取一个事实就足以正确回答问题。
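To make the query form concrete, here is a toy sketch of simple QA as a single KB lookup. The mini-KB below is a made-up example, not an interface to actual Freebase data:

```python
# A toy KB mapping (subject, relationship) to the set of linked objects.
# Contents are illustrative examples, not real Freebase facts.
kb = {
    ("jamaica", "language_spoken"): {"jamaican english", "jamaican creole"},
    ("fires creek", "contained_by"): {"nantahala national forest"},
}

def answer(subject, relationship):
    """Answer a (subject, relationship, ?) query: a single lookup suffices."""
    return kb.get((subject, relationship), set())

print(answer("fires creek", "contained_by"))  # {'nantahala national forest'}
```

The lookup itself is trivial; the difficulty in practice lies in mapping a natural-language question to the right (subject, relationship) pair among millions of alternatives.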

The term simple QA refers to the simplicity of the reasoning process needed to answer questions, since it involves a single fact. However, this does not mean that the QA problem is easy per se, since retrieving this single supporting fact can be very challenging, as it involves searching over millions of alternatives given a query expressed in natural language. Table 1 shows that, with a KB with many types of relationships like Freebase, the range of questions that can be answered with a single fact is already very broad. Besides, as we shall see, modifying slightly the structure of the KB can make some QA problems simpler by adding direct connections between entities, and hence allows bypassing the need for more complex reasoning.

术语简单问答 (simple QA) 指的是回答问题所需的推理过程简单,因为它只涉及单一事实。但这并不意味着问答问题本身容易,因为在自然语言表达的查询中,从数百万种候选项中检索出这一单一支持事实可能极具挑战性。表1显示,在拥有多种关系类型(如Freebase)的知识库(KB)中,仅凭单一事实就能回答的问题范围已经非常广泛。此外,我们将看到,通过添加实体间的直接连接来略微修改知识库结构,可以使某些问答问题变得更简单,从而绕过更复杂推理的需求。

2.1 Knowledge Bases

2.1 知识库

We use the KB Freebase as the basis of our QA system, our source of facts and answers. All Freebase entities and relationships are typed and the lexicon for types and relationships is closed. Freebase data is collaboratively collected and curated, to ensure a high reliability of the facts. Each entity has an internal identifier and a set of strings that are usually used to refer to that entity in text, termed aliases. We consider two extracts of Freebase, whose statistics are given in Table 2. FB2M, which was used in (Bordes et al., 2014a), contains about 2M entities and 5k relationships. FB5M is much larger, with about 5M entities and more than 7.5k relationships.

我们以知识库 Freebase 作为问答系统的基础,作为事实和答案的来源。所有Freebase实体和关系都经过类型标注,且类型和关系的词汇表是封闭的。Freebase数据通过协作收集和整理,以确保事实的高可靠性。每个实体都有一个内部标识符和一组通常在文本中用于指代该实体的字符串,称为别名。我们考虑了两个Freebase的子集,其统计信息如表2所示。FB2M(曾在Bordes等人[2014a]中使用)包含约200万个实体和5000种关系。FB5M规模更大,包含约500万个实体和超过7500种关系。

We also use the KB Reverb as a secondary source of facts to study how well a model trained to answer questions using Freebase facts could be used to answer using Reverb's as well, without being trained on Reverb data. This is a pure setting of transfer learning. Reverb is interesting for this experiment because it differs a lot from Freebase. Its data was extracted automatically from text with minimal human intervention and is highly unstructured: entities are unique strings and the lexicon for relationships is open. This leads to many more relationships, but entities with multiple references are not deduplicated, ambiguous referents are not resolved, and the reliability of the stored facts is much lower than in Freebase. We used the full extraction from (Fader et al., 2011), which contains 2M entities and 600k relationships.

我们还使用 KB Reverb 作为次要事实来源,研究一个基于 Freebase 事实训练的问题回答模型在未经 Reverb 数据训练的情况下,能否同样有效地利用 Reverb 数据进行回答。这是一个纯粹的迁移学习场景。Reverb 之所以适合本实验,是因为它与 Freebase 存在显著差异:其数据通过自动化文本抽取获得(人工干预极少),且高度非结构化——实体以唯一字符串形式存在,关系词汇表完全开放。这导致关系数量大幅增加,但多重指代实体未去重、歧义指称未消解,且存储事实的可靠性远低于 Freebase。我们采用 (Fader et al., 2011) 的完整抽取结果,包含 200 万实体和 $600\mathrm{k}$ 种关系。

Table 2: Knowledge Bases used in this paper. FB2M and FB5M are two versions of Freebase.

表 2: 本文使用的知识库。FB2M 和 FB5M 是 Freebase 的两个版本。

                              FB2M         FB5M         Reverb
实体 (ENTITIES)               2,150,604    4,904,397    2,044,752
关系 (RELATIONSHIPS)          6,701        7,523        601,360
原子事实 (ATOMIC FACTS)       14,180,937   22,441,880   14,338,214
事实(分组) (FACTS (grouped))  10,843,106   12,010,500

2.2 The Simple Questions dataset

2.2 Simple Questions数据集

Existing resources for QA such as WebQuestions (Berant et al., 2013) are rather small (a few thousand questions) and hence do not provide a very thorough coverage of the variety of questions that could be answered using a KB like Freebase, even in the context of simple QA. Hence, in this paper, we introduce a new dataset of much larger scale for the task of simple QA called Simple Questions. This dataset consists of a total of 108,442 questions written in natural language by human English-speaking annotators, each paired with a corresponding fact from FB2M that provides the answer and explains it. We randomly shuffle these questions and use 70% of them (75,910) as training set, 10% (10,845) as validation set, and the remaining 20% as test set. Examples of questions and facts are given in Table 1.

现有问答资源如WebQuestions (Berant等人,2013) 规模较小(仅数千问题),即便在简单问答场景下,也无法全面覆盖像Freebase这类知识库能解答的各类问题。为此,本文推出了一个规模更大的简单问答任务新数据集Simple Questions。该数据集包含108,442个由英语母语标注者撰写的自然语言问题,每个问题都对应FB2M知识库中提供答案并解释的事实。我们将这些问题随机打乱,分配70%(75,910条)作为训练集,10%(10,845条)作为验证集,剩余20%作为测试集。具体问题与事实示例见表1。

We collected Simple Questions in two phases. The first phase consisted of shortlisting the set of facts from Freebase to be annotated with questions. We used FB2M as background KB and removed all facts with undefined relationship type, i.e. containing the word freebase. We also removed all facts for which the (subject, relationship) pair had more than a threshold number of objects. This filtering step is crucial to remove facts which would result in trivial, uninformative questions, such as Name a person who is an actor. The threshold was set to 10.

我们分两个阶段收集简单问题。第一阶段包括从Freebase中筛选出待标注问题的事实集合。我们使用FB2M作为背景知识库,并移除了所有关系类型未定义的事实(即包含freebase一词的条目)。同时,我们还移除了那些(主语,关系)对拥有超过阈值数量宾语的所有事实。这一过滤步骤对于剔除会导致无意义简单问题的事实至关重要,例如"说出一个演员的名字?"。该阈值设定为10。

In the second phase, these selected facts were sampled and delivered to human annotators to generate questions from them. For the sampling, each fact was associated with a probability defined as a function of its relationship frequency in the KB: to favor variability, facts whose relationship appears more frequently were given lower probabilities. For each sampled fact, annotators were shown the fact along with hyperlinks to freebase.com to provide some context while framing the question. Given this information, annotators were asked to phrase a question involving the subject and the relationship of the fact, with the answer being the object. The annotators were explicitly instructed to phrase their questions as differently as possible if they encountered multiple facts with similar relationships. They were also given the option of skipping facts if they wished to do so. This was very important to prevent the annotators from writing boilerplate questions when they had no background knowledge about some facts.

在第二阶段,这些筛选出的事实经过抽样后被交付给人工标注员用于生成问题。抽样过程中,每条事实根据其在知识库(KB)中的关系频率被赋予相应概率:为增加多样性,关系出现频率较高的事实会被分配较低抽样概率。对于每条被抽样的事实,标注员会看到该事实及其在freebase.com上的超链接以提供上下文参考。基于这些信息,标注员需构建涉及事实主语和关系的问题,并以事实宾语作为答案。标注员被明确要求:当遇到具有相似关系的多个事实时,需尽可能采用不同的提问方式。他们也可选择跳过某些事实,这对避免标注员在缺乏背景知识时编写模板化问题至关重要。

3 Memory Networks for Simple QA

3 用于简单问答的记忆网络

A Memory Network consists of a memory (an indexed array of objects) and a neural network that is trained to query it given some inputs (usually questions). It has four components: Input map $(I)$, Generalization $(G)$, Output map $(O)$ and Response $(R)$, which we detail below. But first, we describe the MemNN workflow used to set up a model for simple QA. This proceeds in three steps:

记忆网络 (Memory Network) 由记忆体 (一个带索引的对象数组) 和一个经过训练可根据输入 (通常是问题) 进行查询的神经网络组成。它包含四个组件:输入映射 $(I)$ 、泛化 $(G)$ 、输出映射 $(O)$ 和响应 $(R)$ ,我们将在下文详细说明。但首先,我们描述用于构建简单问答 (QA) 模型的记忆神经网络 (MemNNs) 工作流程,该流程分为三个步骤:

  1. Storing Freebase: this first phase parses Freebase (either FB2M or FB5M depending on the setting) and stores it in memory. It uses the Input module to preprocess the data.
     存储Freebase:第一阶段解析Freebase(根据设置使用FB2M或FB5M)并将其存储在记忆中。该阶段使用输入模块对数据进行预处理。
  2. Training: this second phase trains the MemNN to answer questions. This uses the Input, Output and Response modules; the training mainly concerns the parameters of the embedding model at the core of the Output module.
     训练:第二阶段训练MemNN以回答问题。这涉及输入 (Input)、输出 (Output) 和响应 (Response) 模块,训练主要关注输出模块核心嵌入模型的参数。
  3. Connecting Reverb: this third phase adds new facts coming from Reverb to the memory. This is done after training, to test the ability of MemNNs to handle new facts without having to be re-trained. It uses the Input module to preprocess Reverb facts and the Generalization module to connect them to the facts already stored.
     连接Reverb:第三阶段将来自Reverb的新事实添加到记忆中。这一步骤在训练后进行,用于测试记忆神经网络 (MemNNs) 处理新事实而无需重新训练的能力。它使用输入模块预处理Reverb事实,并通过泛化模块将这些事实与已存储的事实连接起来。

After these three stages, the MemNN is ready to answer any question by running the $I,O$ and $R$ modules in turn. We now detail the implementation of the four modules.

经过这三个阶段后,MemNN 就可以通过依次运行 $I$、$O$ 和 $R$ 模块来回答任何问题。我们现在详细介绍这四个模块的实现。

3.1 Input module

3.1 输入模块

This module preprocesses the three types of data that are input to the network: Freebase facts that are used to populate the memory, questions that the system needs to answer, and Reverb facts that we use, in a second phase, to extend the memory.

该模块对输入网络的三种数据类型进行预处理:用于填充记忆的Freebase事实、系统需要回答的问题,以及我们在第二阶段用于扩展记忆的Reverb事实。

Preprocessing Freebase The Freebase data is initially stored as atomic facts involving single entities as subject and object, plus a relationship between them. However, this storage needs to be adapted to the QA task in two aspects.

预处理Freebase
Freebase数据最初以原子事实的形式存储,包含作为主语和宾语的单个实体以及它们之间的关系。然而,这种存储方式需要在两个方面进行调整以适应问答任务。

First, in order to answer list questions, which expect more than one answer, we redefine a fact as being a triple containing a subject, a relationship, and the set of all objects linked to the subject by the relationship. This grouping process transforms atomic facts into grouped facts, which we simply refer to as facts in the following. Table 2 shows the impact of this grouping: on FB2M, this decreases the number of facts from 14M to 11M and, on FB5M, from 22M to 12M.

首先,为了回答需要多个答案的列表问题,我们将事实重新定义为包含主语、关系以及通过该关系与主语相关联的所有对象集合的三元组。这一分组过程将原子事实转化为分组事实,下文简称为事实。表 2 展示了这种分组的影响:在 FB2M 上,事实数量从 1400 万减少到 1100 万;在 FB5M 上,则从 2200 万降至 1200 万。
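The grouping step described above can be sketched as follows (the triples are made-up examples):

```python
from collections import defaultdict

def group_facts(atomic_facts):
    """Turn atomic (s, r, o) triples into grouped facts (s, r, {o1, ..., ok})."""
    grouped = defaultdict(set)
    for s, r, o in atomic_facts:
        grouped[(s, r)].add(o)
    return [(s, r, objs) for (s, r), objs in grouped.items()]

atomic = [
    ("jamaica", "language_spoken", "jamaican english"),
    ("jamaica", "language_spoken", "jamaican creole"),
    ("kimchi", "incompatible_with", "veganism"),
]
facts = group_facts(atomic)  # 3 atomic facts become 2 grouped facts
```

A grouped fact with several objects is exactly what list questions need: the whole object set is returned as the answer.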

Second, the underlying structure of Freebase is a hypergraph, in which more than two entities can be linked. For instance, dates can be linked together with two entities to specify the time period over which the link was valid. The underlying triple storage involves mediator nodes for each such fact, effectively making entities linked through paths of length 2, instead of 1. To obtain direct links between entities in such cases, we created a single fact for these facts by removing the intermediate node and using the second relationship as the relationship for the new condensed fact. This step reduces the need for searching for the answer outside the immediate neighborhood of the subject referred to in the question, widely increasing the scope of the simple QA task on Freebase. On WebQuestions, a benchmark not primarily designed for simple QA, removing mediator nodes raises the proportion of questions that can be answered with a single fact from around 65% to 86%.

其次,Freebase 的基础结构是一种超图 (hypergraph),其中可以链接两个以上的实体。例如,日期可以与两个实体链接,以指定链接有效的时间段。底层三元组存储为每个此类事实包含中介节点 (mediator nodes),实际上使得实体通过长度为 2 而非 1 的路径链接。为了在这些情况下获得实体之间的直接链接,我们通过移除中间节点并将第二个关系作为新压缩事实的关系,为这些事实创建了一个单一事实。这一步骤减少了在问题所指主语的直接邻域之外搜索答案的需求,大大扩展了 Freebase 上简单问答 (QA) 任务的范围。在 WebQuestions 这一并非主要为简单问答设计的基准测试中,移除中介节点使得可以用单一事实回答的问题比例从约 65% 跃升至 86%。
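A minimal sketch of this mediator-removal step. The node ids and the way mediators are identified are hypothetical; in real Freebase the mediator (CVT) nodes are identified via the schema:

```python
from collections import defaultdict

def remove_mediators(triples, mediator_ids):
    """Collapse s -r1-> m -r2-> o paths through a mediator node m into a
    direct fact (s, r2, o), keeping the second relationship."""
    outgoing = defaultdict(list)  # edges leaving each mediator node
    for s, r, o in triples:
        if s in mediator_ids:
            outgoing[s].append((r, o))
    condensed = []
    for s, r, o in triples:
        if o in mediator_ids:            # edge into a mediator: follow it
            for r2, o2 in outgoing[o]:
                condensed.append((s, r2, o2))
        elif s not in mediator_ids:      # ordinary direct edge: keep as-is
            condensed.append((s, r, o))
    return condensed

triples = [
    ("film_x", "performance", "m1"),   # "m1" plays the role of a mediator node
    ("m1", "actor", "actor_y"),
    ("kimchi", "incompatible_with", "veganism"),
]
direct = remove_mediators(triples, {"m1"})
# [('film_x', 'actor', 'actor_y'), ('kimchi', 'incompatible_with', 'veganism')]
```

After this pass, the answer entity sits one hop from the subject, so a single memory lookup can reach it.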

Preprocessing Freebase facts A fact with $k$ objects $y = (s, r, \{o_1, \ldots, o_k\})$ is represented by a bag-of-symbols vector $f(y)$ in $\mathbb{R}^{N_S}$, where $N_S$ is the number of entities and relationships. Each dimension of $f(y)$ corresponds to a relationship or an entity (independent of whether it appears as subject or object). The entries of the subject and of the relationship have value 1, and the entries of the objects are set to $1/k$. All other entries are 0.

预处理Freebase事实
一个包含 $k$ 个对象的事实 $y = (s, r, \{o_1, \ldots, o_k\})$ 由词袋向量 $f(y) \in \mathbb{R}^{N_S}$ 表示,其中 $N_S$ 是实体和关系的数量。$f(y)$ 的每个维度对应一个关系或实体(不论其作为主语还是宾语出现)。主语和关系的条目值为1,对象的条目值设为 $1/k$。其余条目均为0。
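Under this definition, $f(y)$ can be sketched as follows. A dense vector is used for readability (in practice one would use a sparse representation), and the symbol index is a made-up example:

```python
import numpy as np

def fact_vector(fact, symbol_index):
    """Bag-of-symbols f(y): entries 1 for subject and relationship,
    1/k for each of the k objects, 0 elsewhere."""
    s, r, objects = fact
    f = np.zeros(len(symbol_index))
    f[symbol_index[s]] = 1.0
    f[symbol_index[r]] = 1.0
    for o in objects:
        f[symbol_index[o]] = 1.0 / len(objects)
    return f

idx = {"jamaica": 0, "language_spoken": 1, "english": 2, "creole": 3}
f = fact_vector(("jamaica", "language_spoken", {"english", "creole"}), idx)
# f is [1.0, 1.0, 0.5, 0.5]
```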

Preprocessing questions A question $q$ is mapped to a bag-of-ngrams representation $g(q)$ in $\mathbb{R}^{N_V}$, where $N_V$ is the size of the vocabulary. The vocabulary contains all individual words that appear in the questions of our datasets, together with the aliases of Freebase entities, each alias being a single n-gram. The entries of $g(q)$ that correspond to words and n-grams of $q$ are equal to 1, all other ones are set to 0.

预处理问题
问题 $q$ 被映射到词袋表示 $g(q) \in \mathbb{R}^{N_V}$,其中 $N_V$ 是词汇表的大小。词汇表包含数据集中所有问题出现的独立单词,以及 Freebase 实体的别名(每个别名是一个单独的 n-gram)。$g(q)$ 中对应 $q$ 的单词和 n-gram 的条目值为 1,其余条目设为 0。
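Similarly, a toy sketch of $g(q)$. The vocabulary and alias list are illustrative, and the substring-based alias match is a simplification of real alias handling:

```python
import numpy as np

def question_vector(question, vocab_index, aliases):
    """Bag-of-ngrams g(q): 1 for each word of q and for each entity alias
    (treated as a single n-gram) found in q, 0 elsewhere."""
    q = question.lower().rstrip("?").strip()
    g = np.zeros(len(vocab_index))
    for w in q.split():
        if w in vocab_index:
            g[vocab_index[w]] = 1.0
    for alias in aliases:
        if alias in q:  # crude n-gram match, sufficient for this sketch
            g[vocab_index[alias]] = 1.0
    return g

vocab = {"where": 0, "is": 1, "fires": 2, "creek": 3, "fires creek": 4}
g = question_vector("where is fires creek?", vocab, ["fires creek"])
# g is [1.0, 1.0, 1.0, 1.0, 1.0]
```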

Preprocessing Reverb facts In our experiments with Reverb, each fact $y=(s,r,o)$ is represented as a vector $h(y)\in\mathbb{R}^{N_{S}+N_{V}}$. This vector is a bag-of-symbols for the subject $s$ and the object $o$, and a bag-of-words for the relationship $r$. The exact composition of $h$ is provided by the Generalization module, which we describe now.

预处理 Reverb 事实
在我们的 Reverb 实验中,每个事实 $y=(s,r,o)$ 被表示为向量 $h(y)\in\mathbb{R}^{N_{S}+N_{V}}$。该向量是主语 $s$ 和宾语 $o$ 的符号袋 (bag-of-symbols),以及关系 $r$ 的词袋 (bag-of-words)。$h$ 的具体组成由泛化模块 (Generalization module) 提供,我们将在下文详述。

3.2 Generalization module

3.2 泛化模块

This module is responsible for adding new elements to the memory. In our case, the memory has a multigraph structure where each node is a Freebase entity and labeled arcs in the multigraph are Freebase relationships: after their preprocessing, all Freebase facts are stored using this structure.

该模块负责向记忆中添加新元素。在我们的案例中,记忆采用多重图结构,其中每个节点都是Freebase实体,带标签的多重图边表示Freebase关系:经过预处理后,所有Freebase事实都使用这种结构存储。

We also consider the case where new facts, with a different structure (i.e. new kinds of relationships), are provided to the MemNN by using Reverb. In this case, the Generalization module is then used to connect Reverb facts to the Freebase-based memory structure, in order to make them usable and searchable by the MemNN.

我们还考虑了通过使用Reverb向记忆神经网络(MemNNs)提供具有不同结构(即新型关系)的新事实的情况。在这种情况下,泛化模块用于将Reverb事实与基于Freebase的记忆结构相连接,以便MemNN能够使用和搜索这些事实。

To link the subject and the object of a Reverb fact to Freebase entities, we use precomputed entity links (Lin et al., 2012). If such links do not give any result for an entity, we search for Freebase entities with at least one alias that matches the Reverb entity string. These two processes allowed us to match 17% of Reverb entities to Freebase ones. The remaining entities were encoded using a bag-of-words representation of their strings, since we had no other way of matching them to Freebase entities. All Reverb relationships were encoded using bags-of-words of their strings. Using this approximate process, we are able to store each Reverb fact as a bag of symbols (words or Freebase entities) all already seen by the MemNN during its training phase based on Freebase. We can then hope that what has been learned there can also be successfully used to query Reverb facts.

为了将Reverb事实的主语和宾语链接到Freebase实体,我们使用了预先计算的实体链接 (Lin et al., 2012)。如果这些链接无法为某个实体提供结果,我们会搜索具有至少一个别名与Reverb实体字符串匹配的Freebase实体。这两个过程使得17%的Reverb实体能够与Freebase实体匹配。剩余的实体则使用其字符串的词袋 (bag-of-words) 表示进行编码,因为我们没有其他方法将它们与Freebase实体匹配。所有Reverb关系均使用其字符串的词袋表示进行编码。通过这一近似过程,我们能够将每个Reverb事实存储为一个符号袋 (bag-of-symbols)(单词或Freebase实体),这些符号在MemNN基于Freebase的训练阶段均已见过。我们可以期待在那里学到的知识也能成功用于查询Reverb事实。

3.3 Output module

3.3 输出模块

The Output module performs memory lookups given the input, returning the supporting facts that will eventually provide the answer to a question. In our case of simple QA, this module only returns a single supporting fact. To avoid scoring all the stored facts, we first perform an approximate entity linking step to generate a small set of candidate facts. The supporting fact is the candidate fact that is most similar to the question according to an embedding model.

输出模块根据输入执行记忆查找,返回支持事实,最终为给定问题提供答案。在我们的简单问答场景中,该模块仅返回单个支持事实。为避免对所有存储事实进行评分,我们首先执行近似实体链接步骤以生成一小部分候选事实。支持事实是嵌入模型中与问题最相似的候选事实。

Candidate generation To generate candidate facts, we match $n$-grams of words of the question to aliases of Freebase entities and select a few matching entities. All facts having one of these entities as subject are scored in a second step.

候选生成
为了生成候选事实,我们将问题中的$n$元词组与Freebase实体的别名进行匹配,并筛选出若干匹配实体。在第二步中,所有以这些实体为主语的事实将被评分。

We first generate all possible $n$-grams from the question, removing those that contain an interrogative pronoun or 1-grams that belong to a list of stopwords. We only keep the $n$-grams which are an alias of an entity, and then discard all $n$-grams that are a subsequence of another $n$-gram, except if the longer $n$-gram only differs by in, of, for or the at the beginning. We finally keep the two entities with the most links in Freebase retrieved for each of the five longest matched $n$-grams.

我们首先从问题中生成所有可能的 $n$ -gram,去除包含疑问代词或属于停用词列表的1-gram。仅保留作为实体别名的 $n$ -gram,然后舍弃所有作为另一个 $n$ -gram 子序列的 $n$ -gram,除非较长的 $n$ -gram 仅在开头多出 in、of、for 或 the。最后,我们为五个最长匹配的 $n$ -gram 各保留在 Freebase 中检索到链接最多的两个实体。
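A simplified sketch of this candidate-generation heuristic. The alias table and stopwords are toy data, and the interrogative-pronoun filter and the "top two entities by link count" selection are omitted:

```python
def ngrams(words, max_n=3):
    """All word n-grams of length 1..max_n."""
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, min(i + max_n, len(words)) + 1)}

def candidate_entities(question, alias_to_entities, stopwords):
    words = question.lower().rstrip("?").split()
    grams = {g for g in ngrams(words)
             if g in alias_to_entities and g not in stopwords}
    # discard n-grams that are subsequences of a longer matched n-gram
    # (the substring test is an approximation of the paper's rule)
    kept = {g for g in grams if not any(g != h and g in h for h in grams)}
    entities = set()
    for g in kept:
        entities |= alias_to_entities[g]
    return entities

aliases = {"fires creek": {"m.fires_creek"}, "creek": {"m.creek"}}
cands = candidate_entities("which forest is fires creek in?", aliases, {"is", "in"})
# {'m.fires_creek'}: 'creek' is dropped as a subsequence of 'fires creek'
```

All facts whose subject is one of the returned entities then go on to the scoring step.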

Scoring Scoring is performed using an embedding model. Given two embedding matrices $\mathbf{W}_{V}\in\mathbb{R}^{d\times N_{V}}$ and $\mathbf{W}_{S}\in\mathbb{R}^{d\times N_{S}}$, which respectively contain, in columns, the $d$-dimensional embeddings of the words/n-grams of the vocabulary and the embeddings of the Freebase entities and relationships, the similarity between question $q$ and a Freebase candidate fact $y$ is computed as:

评分
评分使用嵌入 (embedding) 模型进行。给定两个嵌入矩阵 $\mathbf{W}_{V}\in\mathbb{R}^{d\times N_{V}}$ 和 $\mathbf{W}_{S}\in\mathbb{R}^{d\times N_{S}}$,它们分别按列包含词汇表单词/n-gram 的 $d$ 维嵌入以及 Freebase 实体和关系的嵌入,问题 $q$ 与 Freebase 候选事实 $y$ 之间的相似度计算如下:

$$
S_{QA}(q,y)=\cos(\mathbf{W}_{V}\,g(q),\,\mathbf{W}_{S}\,f(y)),
$$

$$
S_{QA}(q,y)=\cos(\mathbf{W}_{V}\,g(q),\,\mathbf{W}_{S}\,f(y)),
$$

with $\cos()$ the cosine similarity. When scoring a fact $y$ from Reverb, we use the same embeddings and build the matrix $\mathbf{W}_{VS}\in\mathbb{R}^{d\times(N_{V}+N_{S})}$, which contains the concatenation in columns of $\mathbf{W}_{V}$ and $\mathbf{W}_{S}$, and also compute the cosine similarity:

其中 $\cos()$ 表示余弦相似度。当对来自 Reverb 的事实 $y$ 进行评分时,我们使用相同的嵌入并构建矩阵 $\mathbf{W}_{VS}\in\mathbb{R}^{d\times(N_{V}+N_{S})}$,该矩阵包含 $\mathbf{W}_{V}$ 和 $\mathbf{W}_{S}$ 的列拼接,同样计算余弦相似度:

$$
S_{RVB}(q,y)=\cos(\mathbf{W}_{V}\,g(q),\,\mathbf{W}_{VS}\,h(y)).
$$

$$
S_{RVB}(q,y)=\cos(\mathbf{W}_{V}\,g(q),\,\mathbf{W}_{VS}\,h(y)).
$$

The dimension $d$ is a hyperparameter, and the embedding matrices $\mathbf{W}_{V}$ and $\mathbf{W}_{S}$ are the parameters learned with the training algorithm of Section 4.

维度 $d$ 是一个超参数,嵌入矩阵 $\mathbf{W}_{V}$ 和 $\mathbf{W}_{S}$ 是通过第4节的训练算法学习得到的参数。
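A NumPy sketch of $S_{QA}$; random matrices stand in for the learned embeddings $\mathbf{W}_V$ and $\mathbf{W}_S$, and the bag vectors are toy examples:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_vocab, n_symbols = 8, 50, 40
W_V = rng.normal(size=(d, n_vocab))    # columns: word/n-gram embeddings
W_S = rng.normal(size=(d, n_symbols))  # columns: entity/relationship embeddings

def s_qa(g_q, f_y):
    """S_QA(q, y) = cos(W_V g(q), W_S f(y))."""
    a, b = W_V @ g_q, W_S @ f_y
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g_q = np.zeros(n_vocab); g_q[[3, 7, 11]] = 1.0   # toy bag-of-ngrams g(q)
f_y = np.zeros(n_symbols); f_y[[2, 5]] = 1.0     # toy bag-of-symbols f(y)
score = s_qa(g_q, f_y)  # a value in [-1, 1]
```

The selected supporting fact is simply the candidate with the highest such score.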

3.4 Response module

3.4 响应模块

In Memory Networks, the Response module postprocesses the result of the Output module to compute the intended answer. In our case, it returns the set of objects of the selected supporting fact.

在记忆网络(Memory Networks)中,响应模块(Response module)会对输出模块(Output module)的结果进行后处理,以计算出预期的答案。在我们的案例中,它会返回所选支持事实的对象集合。

4 Training

4 训练

This section details how we trained the scoring function of the Output module using a multitask training process on four different sources of data.

本节详述了我们如何利用四种不同数据源的多任务训练过程来训练输出模块的评分函数。

First, in addition to the new Simple Questions dataset described in Section 2, we also used WebQuestions, a benchmark for QA introduced in (Berant et al., 2013): questions are labeled with answer strings from aliases of Freebase entities, and many questions expect multiple answers. Table 3 details the statistics of both datasets.

首先,除了第2节中描述的新Simple Questions数据集外,我们还使用了WebQuestions(由Berant等人于2013年提出的问答基准):问题标注了来自Freebase实体别名的答案字符串,许多问题需要多个答案。表3详细列出了这两个数据集的统计信息。

We also train on automatic questions generated from the KB, that is FB2M or FB5M depending on the setting, which are essential to learn embeddings for the entities not appearing in either Web Questions or Simple Questions. Statistics of FB2M or FB5M are given in Table 2; we generated one training question per fact following the same process as that used in (Bordes et al., 2014a).

我们还基于知识库(KB,即FB2M或FB5M,具体取决于设置)自动生成的问题进行训练,这对学习未出现在Web Questions或Simple Questions中的实体嵌入至关重要。FB2M和FB5M的统计信息如表2所示;我们按照(Bordes et al., 2014a)中使用的相同流程,为每个事实生成一个训练问题。

Following previous work such as (Fader et al., 2013), we also use the indirect supervision signal of pairs of question paraphrases. We used a subset of the large set of paraphrases extracted from WikiAnswers and introduced in (Fader et al., 2014). Our Paraphrases dataset is made of 15M clusters containing 2 or more paraphrases each.

遵循 (Fader et al., 2013) 等先前工作,我们也采用问题复述对的间接监督信号。我们使用了从 WikiAnswers 提取并在 (Fader et al., 2014) 中引入的大规模复述数据集的子集。我们的复述数据集包含 1500 万个簇,每个簇由 2 个或更多复述组成。

4.1 Multitask training

4.1 多任务训练

As in previous work on embedding models and Memory Networks (Bordes et al., 2014a; Bordes et al., 2014b; Weston et al., 2015), the embeddings are trained with a ranking criterion. For the QA datasets, the goal is that in the embedding space, a supporting fact is more similar to the question than any other non-supporting fact. For the paraphrase dataset, a question should be more similar to one of its paraphrases than to any other question.

与之前关于嵌入模型和记忆网络 (Memory Networks) 的研究 (Bordes et al., 2014a; Bordes et al., 2014b; Weston et al., 2015) 类似,嵌入训练采用了排序准则。对于问答数据集,目标是在嵌入空间中,支持事实与问题的相似度应高于任何非支持事实。对于复述数据集,问题应与其某个复述版本的相似度高于其他任何问题。

The multitask learning of the embedding matrices $\mathbf{W}_{V}$ and $\mathbf{W}_{S}$ is performed by alternating stochastic gradient descent (SGD) steps over the loss functions on the different datasets. For the QA datasets, given a question/supporting fact pair $(q,y)$ and a non-supporting fact $y^{\prime}$, we perform a step to minimize the loss function

嵌入矩阵 $\mathbf{W}_{V}$ 和 $\mathbf{W}_{S}$ 的多任务学习通过在不同数据集的损失函数上交替执行随机梯度下降 (SGD) 步骤来完成。对于问答数据集,给定一个问题/支持事实对 $(q,y)$ 和一个非支持事实 $y^{\prime}$,我们执行一步以最小化损失函数

$$
\ell_{QA}(q,y,y^{\prime})=\left[\gamma-S_{QA}(q,y)+S_{QA}(q,y^{\prime})\right]_{+},
$$

where $[\cdot]_{+}$ is the positive part and $\gamma$ is a margin hyperparameter. For the paraphrase dataset, the similarity score between two questions $q$ and $q^{\prime}$ is also the cosine between their embeddings, i.e. $S_{QQ}(q,q^{\prime})=\cos(\mathbf{W}_{V}g(q),\mathbf{W}_{V}g(q^{\prime}))$, and given a paraphrase pair $(q,q^{\prime})$ and another question $q^{\prime\prime}$, the loss is:

$$
\ell_{QA}(q,y,y^{\prime})=\left[\gamma-S_{QA}(q,y)+S_{QA}(q,y^{\prime})\right]_{+},
$$

其中 $[\cdot]_{+}$ 表示正值部分,$\gamma$ 是边界超参数。对于释义数据集,两个问题 $q$ 和 $q^{\prime}$ 之间的相似度得分也是它们嵌入向量的余弦值,即 $S_{QQ}(q,q^{\prime})=\cos(\mathbf{W}_{V}g(q),\mathbf{W}_{V}g(q^{\prime}))$。给定一个释义对 $(q,q^{\prime})$ 和另一个问题 $q^{\prime\prime}$,损失函数为:

$$
\ell_{Q Q}(q,q^{\prime},q^{\prime\prime})=\left[\gamma-S_{Q Q}(q,q^{\prime})+S_{Q Q}(q,q^{\prime\prime})\right]_{+}.
$$

$$
\ell_{Q Q}(q,q^{\prime},q^{\prime\prime})=\left[\gamma-S_{Q Q}(q,q^{\prime})+S_{Q Q}(q,q^{\prime\prime})\right]_{+}.
$$
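To make the ranking criterion concrete, here is a minimal NumPy sketch of the margin loss $\ell_{Q A}$ above with cosine scores. The embedding vectors and the margin value are toy stand-ins for $\mathbf{W}_{V}g(q)$ and $\mathbf{W}_{S}f(y)$; this is illustrative only, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def margin_ranking_loss(s_pos, s_neg, gamma=0.1):
    # [gamma - S(q, y) + S(q, y')]_+ : the loss is zero once the supporting
    # fact outscores the non-supporting one by at least the margin gamma.
    return max(0.0, gamma - s_pos + s_neg)

# Toy embeddings for one question, its supporting fact, and a negative.
q  = np.array([1.0, 0.0, 0.0])
y  = np.array([0.9, 0.1, 0.0])   # supporting fact, close to q
y2 = np.array([0.0, 1.0, 0.0])   # non-supporting fact, far from q

loss = margin_ranking_loss(cosine(q, y), cosine(q, y2), gamma=0.1)
```

A violated ranking, e.g. `margin_ranking_loss(0.2, 0.15, 0.1)`, yields a positive loss of 0.05 and thus a gradient step.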

The embeddings (i.e. the columns of $\mathbf{W}_{V}$ and $\mathbf{W}_{S}$) are projected onto the $L_{2}$ unit ball after each update. At each time step, a sample from the paraphrase dataset is drawn with probability 0.2 (this probability is arbitrary). Otherwise, a sample from one of the three QA datasets, chosen uniformly at random, is taken. We use the WARP loss (Weston et al., 2010) to speed up training, and Adagrad (Duchi et al., 2011) as the SGD algorithm, multi-threaded with HogWild! (Recht et al., 2011). Training takes 2-3 hours on 20 threads.

嵌入 (即 $\mathbf{W}_{V}$ 和 $\mathbf{W}_{S}$ 的列向量) 在每次更新后都会被投影到 $L_{2}$ 单位球面上。每个时间步以 0.2 的概率 (该概率为任意设定值) 从释义数据集中采样,否则从三个问答数据集中均匀随机选取一个进行采样。我们采用 WARP 损失函数 (Weston et al., 2010) 加速训练,并使用 Adagrad (Duchi et al., 2011) 作为 SGD 算法,通过 HogWild! (Recht et al., 2011) 实现多线程。20 线程训练耗时 2-3 小时。
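The unit-ball projection and the dataset-sampling schedule described above can be sketched as follows. This is illustrative only: `pick_dataset` is a hypothetical helper showing the 0.2/uniform schedule, and the actual per-example SGD updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_columns_to_unit_ball(W):
    # After each SGD update, rescale any embedding column whose L2 norm
    # exceeds 1 back onto the unit ball; columns already inside are kept.
    norms = np.linalg.norm(W, axis=0)
    W /= np.maximum(norms, 1.0)
    return W

def pick_dataset(n_qa=3, p_paraphrase=0.2):
    # Sampling schedule: paraphrase data with probability 0.2, otherwise
    # one of the n_qa QA datasets chosen uniformly at random.
    if rng.random() < p_paraphrase:
        return "paraphrase"
    return f"qa_{rng.integers(n_qa)}"

# Toy embedding matrix: first column has norm 5, second has norm 0.5.
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
project_columns_to_unit_ball(W)  # first column rescaled, second unchanged
```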

4.2 Distant supervision

4.2 远程监督

Unlike for Simple Questions or the synthetic QA data generated from Freebase, for WebQuestions only answer strings are provided for questions: the supporting facts are unknown.

与简单问题或从Freebase生成的合成QA数据不同,WebQuestions仅提供问题的答案字符串:支持事实未知。

In order to generate the supervision, we use the candidate fact generation algorithm of Section 3.3. For each candidate fact, the aliases of its objects are compared to the set of provided answer strings. The fact(s) which can generate the maximum number of answer strings from their objects’ aliases are then kept. If multiple facts are obtained for the same question, the ones with the minimal number of objects are considered as supervision facts. This last selection avoids favoring irrelevant relationships that would be kept only because they point to many objects but would not be specific enough. If no answer string could be found from the objects of the initial candidates, the question is discarded from the training set.

为了生成监督信号,我们采用第3.3节的候选事实生成算法。针对每个候选事实,将其对象的别名与提供的答案字符串集进行比对。保留那些能通过对象别名生成最多答案字符串的事实。若同一问题对应多个事实,则选择对象数量最少的事实作为监督事实。这一最终筛选步骤避免了偏向无关关系——这些关系可能仅因指向多个对象而被保留,但特异性不足。若初始候选事实的对象无法匹配任何答案字符串,则将该问题从训练集中剔除。
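The selection heuristic above can be sketched as follows. `ALIASES`, `alias` and `select_supervision_facts` are hypothetical names and toy data standing in for Freebase entity aliases; only the selection logic follows the description.

```python
# Toy alias table; a real system would use Freebase entity aliases.
ALIASES = {
    "m.nyc":    {"new york", "nyc"},
    "m.albany": {"albany"},
    "m.usa":    {"usa", "united states"},
}

def alias(obj):
    return ALIASES.get(obj, set())

def select_supervision_facts(candidate_facts, answer_strings):
    # Keep the fact(s) whose objects' aliases cover the most answer strings;
    # break ties by preferring facts with the fewest objects (most specific).
    answers = set(answer_strings)
    scored = [(len({a for o in f["objects"] for a in alias(o) if a in answers}), f)
              for f in candidate_facts]
    best = max(score for score, _ in scored)
    if best == 0:
        return []  # no object matches any answer string: discard the question
    top = [f for score, f in scored if score == best]
    fewest = min(len(f["objects"]) for f in top)
    return [f for f in top if len(f["objects"]) == fewest]

facts = [
    {"relationship": "location.containedby", "objects": ["m.usa"]},
    {"relationship": "location.capital",     "objects": ["m.albany"]},
    {"relationship": "location.cities",      "objects": ["m.nyc", "m.albany"]},
]
# "location.capital" wins: same answer coverage as "location.cities",
# but fewer objects, so it is the more specific supervision fact.
picked = select_supervision_facts(facts, ["albany"])
```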

Future work should investigate the weakly supervised training of MemNNs recently introduced in (Sukhbaatar et al., 2015), which allows training them without any supervision from supporting facts.

未来的工作应研究最近由 (Sukhbaatar et al., 2015) 提出的记忆神经网络 (MemNN) 弱监督训练过程,该方法无需任何来自支持事实的监督即可完成训练。

Table 3: Training and evaluation datasets. Questions automatically generated from the KB and paraphrases can also be used in training.

表 3: 训练和评估数据集。从知识库(KB)自动生成的问题和改写版本也可用于训练。

           WebQuestions   SimpleQuestions   Reverb
TRAIN      3,000          75,910            -
VALID.     778            10,845            -
TEST       2,032          21,687            691

4.3 Generating negative examples

4.3 生成负样本

As in (Bordes et al., 2014a; Bordes et al., 2014b), learning is performed with gradient descent, so that negative examples (non-supporting facts or non-paraphrases) are generated according to a randomized policy during training.

如 (Bordes et al., 2014a; Bordes et al., 2014b) 所述,学习过程采用梯度下降法,因此在训练期间会根据随机策略生成负样本 (非支持事实或非释义)。

For paraphrases, given a pair $(q,q^{\prime})$, a non-paraphrase pair is generated as $(q,q^{\prime\prime})$, where $q^{\prime\prime}$ is a random question of the dataset not belonging to the cluster of $q$. For question/supporting fact pairs, we use two policies. The default policy to obtain a non-supporting fact is to corrupt the answer fact by exchanging its subject, its relationship or its object(s) with that of another fact chosen uniformly at random from the KB. In this policy, the element of the fact to corrupt is chosen randomly, with a small probability (0.3) of corrupting more than one element of the answer fact. The second policy we propose, called candidates as negatives, is to take as non-supporting fact a randomly chosen fact from the set of candidate facts. While the first policy is standard in learning embeddings, the second is more original and, as we see in the experiments, gives slightly better performance.

对于释义对,给定一个配对 $(q,q^{\prime})$ ,非释义对则生成为 $(q,q^{\prime\prime})$ ,其中 $q^{\prime\prime}$ 是数据集中随机选取的一个问题,且不属于 $q$ 的聚类。对于问题/支持事实对,我们采用两种策略。默认策略是通过随机交换其主语、关系或宾语来破坏答案事实,从而获得非支持事实,这些元素是从知识库(KB)中均匀随机选取的另一个事实中选取的。在此策略中,破坏事实的元素是随机选择的,且有较小概率(0.3)会破坏答案事实的多个元素。我们提出的第二种策略称为候选负例,即从候选事实集合中随机选取一个事实作为非支持事实。虽然第一种策略在学习嵌入中是标准做法,但第二种策略更具原创性,并且实验结果表明其性能略优。
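A minimal sketch of the two negative-sampling policies, with hypothetical fact dictionaries standing in for Freebase triples; the function names are ours, not the paper's.

```python
import random

random.seed(7)

def corrupt_fact(fact, kb_facts, p_multi=0.3):
    # Default policy: swap the subject, relationship or object(s) of the
    # answer fact with those of facts drawn uniformly from the KB; with
    # probability p_multi, corrupt two elements instead of one.
    fields = ["subject", "relationship", "objects"]
    k = 2 if random.random() < p_multi else 1
    corrupted = dict(fact)
    for field in random.sample(fields, k):
        corrupted[field] = random.choice(kb_facts)[field]
    return corrupted

def candidate_negative(candidate_facts, supporting_fact):
    # "Candidates as negatives" policy: draw the non-supporting fact from
    # the question's own candidate set instead of the whole KB.
    negatives = [f for f in candidate_facts if f != supporting_fact]
    return random.choice(negatives)

kb = [{"subject": "a", "relationship": "r1", "objects": ["x"]},
      {"subject": "b", "relationship": "r2", "objects": ["y"]}]
neg = candidate_negative(kb, kb[0])
```

The second policy yields harder negatives, since candidate facts already share an entity alias with the question.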

5 Related Work

5 相关工作

The first approaches to open-domain QA were search engine-based systems, where keywords extracted from the question are sent to a search engine, and the answer is extracted from the top results (Yahya et al., 2012; Unger et al., 2012). This method has been adapted to KB-based QA (Yahya et al., 2012; Unger et al., 2012), and obtained competitive results with respect to semantic parsing and embedding-based approaches.

开放域问答的最初方法是基于搜索引擎的系统,即从问题中提取关键词发送给搜索引擎,并从返回的顶部结果中提取答案 (Yahya et al., 2012; Unger et al., 2012)。该方法后来被适配到基于知识库的问答场景 (Yahya et al., 2012; Unger et al., 2012),并相对于语义解析和基于嵌入的方法取得了具有竞争力的结果。

Semantic parsing approaches (Cai and Yates, 2013; Berant et al., 2013; Kwiatkowski et al., 2013; Berant and Liang, 2014; Fader et al., 2014) perform a functional parse of the sentence that can be interpreted as a KB query. Even though these approaches are difficult to train at scale because of the complexity of their inference, their advantage is to provide a deep interpretation of the question. Some of these approaches require little to no question-answer pairs (Fader et al., 2013; Reddy et al., 2014), relying on simple rules to transform the semantic interpretation into a KB query.

语义解析方法 (Cai and Yates, 2013; Berant et al., 2013; Kwiatkowski et al., 2013; Berant and Liang, 2014; Fader et al., 2014) 对句子进行功能性解析,可将其解释为知识库 (KB) 查询。尽管这些方法由于推理复杂性而难以大规模训练,但其优势在于能对问题提供深度解释。部分方法 (Fader et al., 2013; Reddy et al., 2014) 几乎不需要问答对,仅依靠简单规则将语义解释转换为知识库查询。

Like our work, embedding-based methods for QA can be seen as simple MemNNs. The algorithms of (Bordes et al., 2014b; Weston et al., 2015) use an approach similar to ours but are based on Reverb rather than Freebase, and rely purely on bag-of-words representations for both questions and facts. The approach of (Yang et al., 2014) uses a different representation of questions, in which recognized entities are replaced by an entity token, and different training data using entity mentions from WIKIPEDIA. Our model is closest to the one presented in (Bordes et al., 2014a), which is discussed in more detail in the experiments.

与我们的工作类似,基于嵌入的问答方法可视为简单的记忆神经网络(MemNNs)。 (Bordes et al., 2014b; Weston et al., 2015) 提出的算法采用了与我们相似的方法,但其基于 Reverb 而非 Freebase,且问题和事实均仅依赖词袋模型表示。 (Yang et al., 2014) 的方法采用了不同的问题表示形式(将识别出的实体替换为实体token)以及基于维基百科实体提及的训练数据。我们的模型与 (Bordes et al., 2014a) 提出的方案最为接近,实验部分将对此展开详细讨论。

6 Experiments

6 实验

This section provides an extensive evaluation of our MemNNs implementation against state-of-the-art QA methods, as well as an empirical study of the impact of using multiple training sources on prediction performance.

本节对我们的记忆神经网络 (MemNNs) 实现进行了全面评估,包括与最先进问答方法的对比,以及多训练数据源对预测性能影响的实证研究。

6.1 Evaluation and baselines

6.1 评估与基线

Table 3 details the dimensions of the test sets of Web Questions, Simple Questions and Reverb, which we used for evaluation. On Web Questions, we evaluate against previous results on this benchmark (Berant et al., 2013; Yao and Van Durme, 2014; Berant and Liang, 2014; Bordes et al., 2014a; Yang et al., 2014) in terms of the F1-score defined in (Berant and Liang, 2014), i.e. the average, over all test questions, of the F1-score of the sets of predicted answers. Since no previous results have been published on Simple Questions, we only compare different versions of MemNNs. Simple Questions examples are labeled with their entire Freebase fact, so we evaluate in terms of path-level accuracy, in which a prediction is correct if the subject and the relationship were correctly retrieved by the system.

表 3 详细列出了我们用于评估的 Web Questions、Simple Questions 和 Reverb 测试集的规模。在 Web Questions 上,我们根据 (Berant and Liang, 2014) 中定义的 F1 分数,与之前在该基准上的结果 (Berant et al., 2013; Yao and Van Durme, 2014; Berant and Liang, 2014; Bordes et al., 2014a; Yang et al., 2014) 进行了对比,该分数是所有测试问题中预测答案集的 F1 分数的平均值。由于 Simple Questions 上尚未发表过先前结果,我们仅比较了不同版本的 MemNNs。Simple Questions 的问题标注了完整的 Freebase 事实,因此我们通过路径级准确率进行评估,即当系统正确检索到主语和关系时,预测即为正确。
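The two metrics above can be sketched as follows. This is a simplified reading of the evaluation protocol, not the official scoring scripts; the fact dictionaries are hypothetical.

```python
def f1(pred, gold):
    # Set-level F1 between predicted and gold answer sets for one question.
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def average_f1(predictions, golds):
    # Web Questions score: mean per-question F1 over all test questions.
    return sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)

def path_accuracy(pred_facts, gold_facts):
    # Simple Questions metric: a prediction is correct iff both the subject
    # and the relationship of the supporting fact were retrieved.
    hits = sum(p["subject"] == g["subject"] and
               p["relationship"] == g["relationship"]
               for p, g in zip(pred_facts, gold_facts))
    return hits / len(gold_facts)
```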

The Reverb test set, based on the KB of the same name and introduced in (Fader et al., 2013), is used for evaluation only. It contains 691 questions. We consider the task of re-ranking a small set of candidate answers, which are Reverb facts labeled as correct or incorrect. We compare our approach to the original system (Fader et al., 2013), to (Bordes et al., 2014b) and to the original MemNNs (Weston et al., 2015), in terms of accuracy, i.e. the percentage of questions for which the top-ranked candidate fact is correct.

Reverb测试集基于同名知识库(KB),并在(Fader et al., 2013)中提出,仅用于评估。该测试集包含691个问题。我们考虑对少量候选答案进行重排序的任务,这些候选答案均为Reverb事实并标注了正确或错误。我们在准确率(即排名最高的候选事实正确的问题百分比)方面,将我们的方法与原始系统(Fader et al., 2013)、(Bordes et al., 2014b)以及原始记忆神经网络(MemNNs)(Weston et al., 2015)进行了比较。

6.2 Experimental setup

6.2 实验设置

All models were trained with at least the dataset made of synthetic questions created from the KB. The hyperparameters were chosen to maximize the F1-score on the Web Questions validation set, independently of the testing dataset. The embedding dimension and the learning rate were chosen among $\{64,128,256\}$ and $\{1.0, 0.1, \dots, 10^{-4}\}$ respectively, and the margin $\gamma$ was set to 0.1. For each configuration of hyperparameters, the F1-score on the validation set was computed regularly during learning to perform early stopping.

所有模型至少使用基于知识库(KB)生成的合成问题数据集进行训练。超参数选择以 Web Questions 验证集的 F1-score 最大化为目标,与测试数据集无关。嵌入维度和学习率分别从 $\{64,128,256\}$ 和 $\{1.0, 0.1, \dots, 10^{-4}\}$ 中选取,边界值 $\gamma$ 设为 0.1。针对每组超参数配置,在学习过程中定期计算验证集的 F1-score 以实现早停机制。
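The hyperparameter search described above can be sketched as a simple grid search over the two grids; `train_and_eval` is a hypothetical callable returning the validation F1-score for a configuration (early stopping would live inside it).

```python
from itertools import product

def grid_search(train_and_eval,
                dims=(64, 128, 256),
                lrs=(1.0, 0.1, 0.01, 1e-3, 1e-4)):
    # Return the (embedding dimension, learning rate) pair that maximizes
    # the validation F1-score reported by train_and_eval.
    return max(product(dims, lrs), key=lambda cfg: train_and_eval(*cfg))

# Dummy objective peaking at dim=128, lr=0.1, for illustration only.
best = grid_search(lambda d, lr: (d == 128) + (lr == 0.1))
```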

We tested additional configurations for our algorithm. First, in the Candidates as Negatives setting (negative facts are sampled from the candidate set, see Section 4), abbreviated CANDS AS NEGS, the experimental protocol is the same as in the default setting but the embeddings are initialized with the best configuration of the default setup. Second, our model shares some similarities with an approach studied in (Bordes et al., 2014a), in which the authors noticed important gains using a subgraph representation of answers. For completeness, we also added such a subgraph representation of objects. In that setting, called Subgraph, each object $o$ of a fact is itself represented as a bag-of-entities that encodes the immediate neighborhood of $o$ . This Subgraph model is trained similarly as our main approach and only the results of a post-hoc ensemble combination of the two models (where the scores are added) are presented. We also report the results obtained by an ensemble of the 5 best models on validation (subgraph excepted); this is denoted 5 models.

我们测试了算法的其他配置。首先,在候选集作为负样本 (Candidates as Negatives) 设置中 (负样本事实从候选集中采样,参见第4节) ,简称为 CANDS AS NEGS,实验协议与默认设置相同,但嵌入使用默认设置的最佳配置进行初始化。其次,我们的模型与 (Bordes et al., 2014a) 中研究的方法有一些相似之处,作者注意到使用答案的子图表示可以带来重要提升。为了完整性,我们还添加了对象的子图表示。在该设置中,称为 Subgraph,事实的每个对象 $o$ 本身表示为一个实体包,编码 $o$ 的直接邻域。该 Subgraph 模型的训练方式与我们的主要方法类似,仅展示两种模型的后验集成组合 (分数相加) 的结果。我们还报告了验证集上5个最佳模型 (除 Subgraph 外) 集成获得的结果,记为 5 models。

6.3 Results

6.3 结果

Comparative results The results of the comparative experiments are given in Table 4. On the main benchmark Web Questions, our best results use all data sources, the bigger extract from Freebase and the CANDS AS NEGS setting. The two ensembles achieve excellent results, with F1-scores of 41.9% and 42.2% respectively. The best published competing approach (Yang et al., 2014) has an F1-score of 41.3%, which is comparable to a single run of our model (41.2%). On the new Simple Questions dataset, the best models achieve 62-63% accuracy, while the supporting fact is in the candidate set for about 86% of Simple Questions questions. This shows that MemNNs are effective at re-ranking the candidates, but also that simple QA is still not solved.

对比实验结果
对比实验结果如表 4 所示。在主要基准测试 Web Questions 上,我们的最佳结果使用了所有数据源、从 Freebase 提取的更大规模数据以及 CANDS AS NEGS 设置。两个集成模型均取得了优异表现,F1 分数分别达到 41.9% 和 42.2%。目前公开的最佳竞争方法 (Yang et al., 2014) 的 F1 分数为 41.3%,与我们单次模型运行的结果 (41.2%) 相当。在新的 Simple Questions 数据集上,最佳模型的准确率达到 62-63%,而约 86% 的问题能在候选集中找到支持事实。这表明记忆神经网络 (MemNNs) 能有效对候选答案进行重排序,但也说明简单问答任务尚未完全解决。


Table 4: Experimental results for previous models of the literature and variants of Memory Networks. All results are on the test sets. WQ, SIQ and PRP stand for Web Questions, Simple Questions and Paraphrases respectively. More details in the text.

表 4: 文献中先前模型及记忆网络变体的实验结果。所有结果均基于测试集。WQ、SIQ和PRP分别代表Web Questions、Simple Questions和Paraphrases。更多细节见正文。

                                        WQ F1 (%)   SIQ 准确率 (%)   Reverb 准确率 (%)
基线模型
(Berant et al., 2013)                   -           n/a             n/a
(Fader et al., 2014)                    -           n/a             54
(Bordes et al., 2014b)                  -           n/a             73
(Bordes et al., 2014a) - 使用路径       -           n/a             n/a
(Bordes et al., 2014a) - 路径+子图      -           n/a             n/a
(Berant and Liang, 2014)                -           n/a             n/a
(Yang et al., 2014)                     -           n/a             n/a
(Weston et al., 2015) - 原始 MemNN      n/a         n/a             72
记忆网络(未在 Reverb 上训练,仅迁移;训练源为 WQ / SIQ / PRP 的不同组合,可选设置:候选作为负样本、集成)
FB2M                                    36.2        62.7            n/a
FB5M                                    18.7        44.5            52
FB5M                                    22.0        48.1            62
FB5M                                    22.7        61.6            52
FB5M                                    28.2        61.2            64
FB5M                                    40.1        46.6            58
FB5M                                    40.4        47.4            61
FB5M                                    41.0        61.7            52
FB5M                                    41.0        62.1            67
FB5M                                    41.2        62.2            65
FB5M - 5 模型集成                       41.9        63.9            68
FB5M - 子图集成                         42.2        62.9            62

Our approach bears similarity to (Bordes et al., 2014a) - using path. They use FB2M, and so their result (35.3% F1-score on Web Questions) should be compared to our 36.2%. The models are slightly different in that they replace the entity string with the subject entity in the question representation and that we use the cosine similarity instead of the dot product, which gave consistent improvements. Still, the major differences come from how we use Freebase. First, the removal of the mediator nodes allows us to restrict ourselves to single supporting facts, while they search in paths of length 2 with a heuristic to select the paths to follow (otherwise, inference is too costly), which makes our inference simpler and more efficient. Second, using grouped facts, we integrate multiple answers during learning (through the distant supervision), while they use a grouping heuristic at test time. Grouping facts also allows us to scale much better and to train on FB5M. On Web Questions, not specifically designed as a simple QA dataset, 86% of the questions can now be answered with a single supporting fact, and performance increases significantly (from 36.2% to 41.0% F1-score). Using the bigger FB5M as KB does not change performance on Simple Questions because it was based on FB2M, but the results show that our model is robust to the addition of more entities than necessary.

我们的方法与 (Bordes et al., 2014a) 存在相似性——均采用路径策略。他们使用 FB2M,因此其 Web Questions 数据集上 35.3% 的 F1 值应与我们的 36.2% 对比。模型差异主要体现在:他们将问题表示中的实体字符串替换为主题实体,而我们采用余弦相似度而非点积运算,这一改进带来了稳定提升。但核心差异源于 Freebase 的使用方式:首先,移除中介节点使我们仅需单条支持事实,而他们需通过启发式方法在长度为2的路径中搜索(否则推理成本过高),这使得我们的推理更简单高效;其次,通过分组事实(grouped facts),我们在学习阶段整合多重答案(通过远程监督),而他们仅在测试阶段使用分组启发式方法。分组事实还使我们能更好地扩展规模并在 FB5M 上训练。在非专为简单问答设计的 Web Questions 数据集上,86% 的问题现在仅需单条支持事实即可解答,性能显著提升(F1 值从 36.2% 增至 41.0%)。由于 Simple Questions 基于 FB2M,使用更大的 FB5M 作为知识库未改变其性能,但结果表明我们的模型对冗余实体添加具有鲁棒性。

Transfer learning on Reverb In this set of experiments, all Reverb facts are added to the memory, without any retraining, and we test our ability to re-rank answers on the companion QA set. Thus, Table 4 (last column) presents the results of our model, without training on Reverb, against methods specifically developed on that dataset. Our best results are 67% accuracy (and 68% for the ensemble of 5 models), which is better than the 54% of the original paper and close to the state-of-the-art 73% of (Bordes et al., 2014b). These results show that the Memory Network approach can integrate and use new entities and links.

Reverb 上的迁移学习
在这组实验中,所有Reverb事实都被添加到记忆模块中,无需任何重新训练,我们在配套的QA数据集上测试了答案重排序的能力。因此,表4(最后一列)展示了我们的模型在未针对Reverb进行训练的情况下,与专门针对该数据集开发的方法的对比结果。我们的最佳结果达到了67%的准确率(5个模型集成时达到68%),优于原论文的54%,并接近(Bordes et al., 2014b)中最先进的73%。这些结果表明,记忆网络方法能够整合并利用新的实体和关系。

Importance of data sources The bottom half of Table 4 presents the results on the three datasets when our model is trained with different data sources. We first notice that models trained on a single QA dataset perform poorly on the other datasets (e.g. 46.6% accuracy on Simple Questions for the model trained on Web Questions only), which shows that performance on Web Questions does not necessarily guarantee high coverage for simple QA. On the other hand, training on both datasets only improves performance; in particular, the model is able to capture all question patterns of the two datasets; there is no “negative interaction”.

数据源的重要性
表 4 的下半部分展示了我们的模型在不同数据源训练下在三个数据集上的结果。我们首先注意到,仅在单个问答数据集上训练的模型在其他数据集上表现不佳(例如,仅在 Web Questions 上训练的模型在 SimpleQuestions 上准确率为 $46.6%$),这表明在 WebQuestions 上的表现并不一定能保证对简单问答的高覆盖率。另一方面,同时在两个数据集上训练仅会提升性能;特别是,该模型能够捕捉两个数据集的所有问题模式,不存在“负交互”现象。

While paraphrases do not seem to help much on Web Questions and Simple Questions, except when training only with synthetic questions, they have a dramatic impact on performance on Reverb. This is because Web Questions and Simple Questions questions follow simple patterns and are well formed, while Reverb questions have more syntactic and lexical variability. Thus, paraphrases are important to avoid overfitting on the specific question patterns of the training sets.

虽然改写对Web Questions和Simple Questions数据集效果不大(仅在使用合成问题训练时例外),但对Reverb数据集的性能产生了显著影响。这是因为Web Questions和Simple Questions的问题遵循简单模式且结构规范,而Reverb问题具有更高的句法和词汇多样性。因此,改写对于避免过拟合训练集的特定问题模式至关重要。

7 Conclusion

7 结论

This paper presents an implementation of MemNNs for the task of large-scale simple QA. Our results demonstrate that, if properly trained, MemNNs are able to handle natural language and a very large memory (millions of entries), and hence can reach state-of-the-art on the popular benchmark Web Questions.

本文提出了一种用于大规模简单问答任务的记忆神经网络 (MemNNs) 实现。结果表明,经过适当训练后,记忆神经网络能够处理自然语言和超大规模记忆体 (数百万条条目) ,从而在主流基准测试 Web Questions 上达到最先进水平。

We want to emphasize that many of our findings, especially those regarding how to format the KB, do not only concern MemNNs but potentially any QA system. This paper also introduced the new dataset Simple Questions, which, with $100\mathrm{k}$ examples, is one order of magnitude bigger than Web Questions: we hope that it will foster interesting new research in QA, simple or not.

我们要强调的是,我们的许多发现(特别是关于如何格式化知识库(KB)的方法)不仅适用于记忆神经网络(MemNNs),还可能适用于任何问答系统。本文还引入了新的数据集Simple Questions,该数据集包含$100\mathrm{k}$个示例,规模比Web Questions大一个数量级:我们希望它能促进问答领域(无论简单与否)开展有趣的新研究。
