Pragmatically Informative Text Generation
Abstract
We improve the informativeness of models for conditional text generation using techniques from computational pragmatics. These techniques formulate language production as a game between speakers and listeners, in which a speaker should generate output text that a listener can use to correctly identify the original input that the text describes. While such approaches are widely used in cognitive science and grounded language learning, they have received less attention for more standard language generation tasks. We consider two pragmatic modeling methods for text generation: one where pragmatics is imposed by information preservation, and another where pragmatics is imposed by explicit modeling of distractors. We find that these methods improve the performance of strong existing systems for abstractive summarization and generation from structured meaning representations.
1 Introduction
Computational approaches to pragmatics cast language generation and interpretation as game-theoretic or Bayesian inference procedures (Golland et al., 2010; Frank and Goodman, 2012). While such approaches are capable of modeling a variety of pragmatic phenomena, their main application in natural language processing has been to improve the informativeness of generated text in grounded language learning problems (Monroe et al., 2018). In this paper, we show that pragmatic reasoning can be similarly used to improve performance in more traditional language generation tasks like generation from structured meaning representations (Figure 1) and summarization.
Our work builds on a line of learned Rational Speech Acts (RSA) models (Monroe and Potts, 2015; Andreas and Klein, 2016), in which generated strings are selected to optimize the behavior of an embedded listener model. The canonical presentation of the RSA framework (Frank and Goodman, 2012) is grounded in reference resolution: models of speakers attempt to describe referents in the presence of distractors, and models of listeners attempt to resolve descriptors to referents. Recent work has extended these models to more complex groundings, including images (Mao et al., 2015) and trajectories (Fried et al., 2018). The techniques used in these settings are similar, and the primary intuition of the RSA framework is preserved: from the speaker's perspective, a good description is one that picks out, as discriminatively as possible, the content the speaker intends for the listener to identify.
Figure 1: Example outputs of our systems on the E2E generation task. While a base sequence-to-sequence model ($S_0$, Sec. 2) fails to describe all attributes in the input meaning representation, both of our pragmatic systems ($S_1^R$, Sec. 3.1 and $S_1^D$, Sec. 3.2) and the human-written reference do.
| Input meaning representation (i): NAME[FITZBILLIES], EATTYPE[COFFEE SHOP], FOOD[ENGLISH], PRICERANGE[CHEAP], CUSTOMERRATING[5 OUT OF 5], AREA[RIVERSIDE], FAMILYFRIENDLY[YES] |
|---|
| **Human-written:** The cheap coffee shop by the riverside with a 5 out of 5 customer rating is Fitzbillies. Fitzbillies is family friendly and serves English food. |
| **Base sequence-to-sequence model ($S_0$):** Fitzbillies is a family friendly coffee shop located by the riverside. |
| **Distractor-based pragmatic system ($S_1^D$):** Fitzbillies is a family friendly coffee shop that serves English food. It is located in the riverside area and has a customer rating of 5 out of 5 and a cheap price range. |
| **Reconstructor-based pragmatic system ($S_1^R$):** Fitzbillies is a family friendly coffee shop serving cheap English food in the riverside area. It has a customer rating of 5 out of 5. |
Outside of grounding, cognitive modeling (Frank et al., 2009), and targeted analysis of linguistic phenomena (Orita et al., 2015), rational speech acts models have seen limited application in the natural language processing literature. In this work we show that they can be extended to a distinct class of language generation problems that use as referents structured descriptions of linguistic content, or other natural language texts. In accordance with the maxim of quantity (Grice, 1970) or the Q-principle (Horn, 1984), pragmatic approaches naturally correct underinformativeness problems observed in state-of-the-art language generation systems ($S_0$ in Figure 1).
We present experiments on two language generation tasks: generation from meaning representations (Novikova et al., 2017) and summarization. For each task, we evaluate two models of pragmatics: the reconstructor-based model of Fried et al. (2018) and the distractor-based model of Cohn-Gordon et al. (2018). Both models improve performance on both tasks, increasing ROUGE scores by 0.2–0.5 points on the CNN/Daily Mail abstractive summarization dataset and BLEU scores by 2 points on the End-to-End (E2E) generation dataset, obtaining new state-of-the-art results.
2 Tasks
We formulate a conditional generation task as taking an input $i$ from a space of possible inputs $\mathcal{T}$ (e.g., input sentences for abstractive summarization; meaning representations for structured generation) and producing an output $o$ as a sequence of tokens $\left(o_{1},\dots,o_{T}\right)$. We build our pragmatic approaches on top of learned base speaker models $S_{0}$, which produce a probability distribution $S_{0}(o\mid i)$ over output text for a given input. We focus on two conditional generation tasks where the information in the input context should largely be preserved in the output text, and apply the pragmatic procedures outlined in Sec. 3 to each task. For these $S_{0}$ models we use systems from past work that are strong, but may still be underinformative relative to human reference outputs (e.g., Figure 1).
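As a concrete toy illustration of this interface, the base speaker's sequence probability decomposes into per-token conditionals; the table-based scorer below is a hypothetical stand-in for a trained sequence-to-sequence model:

```python
import math

def s0_score(o_tokens, i, token_logprob):
    """Log-probability log S0(o | i) of an output token sequence under a base
    speaker, given token_logprob(prefix, i, token), which scores one token
    conditioned on the input and the previously generated tokens."""
    total = 0.0
    for t, tok in enumerate(o_tokens):
        total += token_logprob(o_tokens[:t], i, tok)
    return total

# Toy base speaker: a bigram-style table keyed only on the previous token,
# ignoring i (a stand-in for a trained neural model, not a real system).
TABLE = {None: {"a": math.log(0.6), "b": math.log(0.4)},
         "a": {"b": math.log(1.0)},
         "b": {"a": math.log(1.0)}}

def toy_logprob(prefix, i, tok):
    last = prefix[-1] if prefix else None
    return TABLE[last][tok]

print(s0_score(["a", "b"], None, toy_logprob))  # log(0.6 * 1.0)
```

Any autoregressive model exposing per-token conditionals fits this interface, which is what the pragmatic procedures in Sec. 3 build on.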
Meaning Representations Our first task is generation from structured meaning representations (MRs) containing attribute-value pairs (Novikova et al., 2017). An example is shown in Figure 1, where systems must generate a description of the restaurant with the specified attributes. We apply pragmatics to encourage output strings from which the input MR can be identified. For our $S_{0}$ model, we use a publicly-released neural generation system (Puzikov and Gurevych, 2018) that achieves comparable performance to the best published results in Dusek et al. (2018).
Abstractive Summarization Our second task is multi-sentence document summarization. There is a vast amount of past work on summarization (Nenkova and McKeown, 2011); recent neural models have used large datasets (e.g., Hermann et al. (2015)) to train models in both the extractive (Cheng and Lapata, 2016; Nallapati et al., 2017) and abstractive (Rush et al., 2015; See et al., 2017) settings. Among these works, we build on the recent abstractive neural summarization system of Chen and Bansal (2018). First, this system uses a sentence-level extractive model RNN-EXT to identify a sequence of salient sentences $i^{(1)},\dots,i^{(P)}$ in each source document. Second, the system uses an abstractive model ABS to rewrite each $i^{(p)}$ into an output $o^{(p)}$, which are then concatenated to produce the final summary. We rely on the fixed RNN-EXT model to extract sentences as inputs in our pragmatic procedure, using ABS as our $S_{0}$ model and applying pragmatics to the $i^{(p)} \to o^{(p)}$ abstractive step.
3 Pragmatic Models
To produce informative outputs, we consider pragmatic methods that extend the base speaker models, $S_{0}$, using listener models, $L$, which produce a distribution $L(i\mid o)$ over possible inputs given an output. Listener models are used to derive pragmatic speakers, $S_{1}(o\mid i)$, which produce output that has a high probability of making a listener model $L$ identify the correct input. There is a large space of possible choices for designing $L$ and deriving $S_{1}$; we follow two lines of past work which we categorize as reconstructor-based and distractor-based. We tailor each of these pragmatic methods to our two tasks by developing reconstructor models and methods of choosing distractors.
3.1 Reconstructor-Based Pragmatics
Pragmatic approaches in this category (Dusek and Jurcicek, 2016; Fried et al., 2018) rely on a reconstructor listener model $L^{R}$ defined independently of the speaker. This listener model produces a distribution $L^{R}(i\mid o)$ over all possible input contexts $i \in \mathcal{T}$, given an output description $o$. We use sequence-to-sequence or structured classification models for $L^{R}$ (described below), and train these models on the same data used to supervise the $S_{0}$ models.
The listener model and the base speaker model together define a pragmatic speaker, with output score given by:
$$
S_{1}^{R}(o\mid i)=L^{R}(i\mid o)^{\lambda}\cdot S_{0}(o\mid i)^{1-\lambda} \tag{1}
$$
where $\lambda$ is a rationality parameter that controls how much the model optimizes for discriminative outputs (see Monroe et al. (2017) and Fried et al. (2018) for a discussion). We select an output text sequence $o$ for a given input $i$ by choosing the highest scoring output under Eq. 1 from a set of candidates obtained by beam search in $S_{0}(\cdot\mid i)$.
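Concretely, selection under Eq. 1 amounts to reranking beam candidates in log-space. The sketch below uses hypothetical toy scorers in place of trained $S_0$ and $L^R$ models:

```python
import math

def rerank(candidates, i, log_s0, log_lr, lam=0.5):
    """Pick the candidate o maximizing
    lam * log L_R(i|o) + (1 - lam) * log S0(o|i),
    i.e., Eq. 1 taken in log-space over a fixed candidate set."""
    def score(o):
        return lam * log_lr(i, o) + (1 - lam) * log_s0(o, i)
    return max(candidates, key=score)

# Toy scorers (hypothetical numbers): S0 slightly prefers the short,
# underinformative candidate, while the reconstructor listener rewards
# the candidate from which the input can be identified.
cands = ["short", "full"]
log_s0 = lambda o, i: math.log(0.6 if o == "short" else 0.4)
log_lr = lambda i, o: math.log(0.1 if o == "short" else 0.9)

print(rerank(cands, "mr", log_s0, log_lr, lam=0.5))  # prints "full"
```

With `lam=0` the reranker reduces to the base speaker's preference, which is how $\lambda$ trades off fluency against discriminativeness.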
Meaning Representations We construct $L^{R}$ for the meaning representation generation task as a multi-task, multi-class classifier, defining a distribution over possible values for each attribute. Each MR attribute has its own prediction layer and attention-based aggregation layer, which conditions on a basic encoding of $o$ shared across all attributes. See Appendix A.1 for architecture details. We then define $L^{R}(i\mid o)$ as the joint probability of predicting all input MR attributes in $i$ from $o$ .
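Since the reconstructor factorizes over attributes, $\log L^{R}(i \mid o)$ is a sum of per-attribute classifier log-probabilities. A minimal sketch, with toy probability tables standing in for the trained classifier heads:

```python
import math

def log_lr(i_attrs, per_attr_dist):
    """log L_R(i | o) as the sum over attributes of the log-probability that
    the classifier head for each attribute assigns to the value in the input
    MR. per_attr_dist maps attribute -> {value: prob}, already conditioned
    on the output text o."""
    return sum(math.log(per_attr_dist[a][v]) for a, v in i_attrs.items())

# Hypothetical head outputs for some generated description o.
heads = {"EATTYPE": {"coffee shop": 0.9, "pub": 0.1},
         "FOOD": {"English": 0.7, "French": 0.3}}
i = {"EATTYPE": "coffee shop", "FOOD": "English"}
print(log_lr(i, heads))  # log(0.9) + log(0.7)
```

Outputs that fail to mention an attribute leave its head uncertain, lowering the joint score, which is why reranking with $L^R$ pushes toward full coverage.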
Summarization To construct $L^{R}$ for summarization, we train an ABS model (of the type we use for $S_{0}$, Chen and Bansal (2018)) but in reverse, i.e., taking as input a sentence in the summary and producing a sentence in the source document. We train $L^{R}$ on the same heuristically-extracted and aligned source document sentences used to train $S_{0}$ (Chen and Bansal, 2018).
3.2 Distractor-Based Pragmatics

Pragmatic approaches in this category (Frank and Goodman, 2012; Andreas and Klein, 2016; Vedantam et al., 2017; Cohn-Gordon et al., 2018) derive pragmatic behavior by producing outputs that distinguish the input $i$ from an alternate distractor input (or inputs). We construct a distractor $\widetilde{\imath}$ for a given input $i$ in a task-dependent way.

We follow the approach of Cohn-Gordon et al. (2018), outlined briefly here. The base speakers we build on produce outputs incrementally, where the probability of $o_{t}$, the word output at time $t$, is conditioned on the input and the previously generated words: $S_{0}(o_{t}\mid i,o_{<t})$. Since the output is generated incrementally and there is no separate listener model that needs to condition on entire output decisions, the distractor-based approach is able to make pragmatic decisions at each word rather than choosing between entire output candidates (as in the reconstructor approach).

The listener $L^{D}$ and pragmatic speaker $S_{1}^{D}$ are derived from the base speaker $S_{0}$ and a belief distribution $p_{t}(\cdot)$ maintained at each timestep $t$ over the possible inputs $\mathcal{T}^{D}$:

$$
L^{D}(i\mid o_{<t})\propto S_{0}(o_{<t}\mid i)\cdot p_{t-1}(i) \tag{2}
$$

$$
S_{1}^{D}(o_{t}\mid i,o_{<t})\propto L^{D}(i\mid o_{\leq t})^{\alpha}\cdot S_{0}(o_{t}\mid i,o_{<t}) \tag{3}
$$

$$
p_{t}(i)\propto S_{0}(o_{t}\mid i,o_{<t})\cdot L^{D}(i\mid o_{<t}) \tag{4}
$$

where $\alpha$ is again a rationality parameter, and the initial belief distribution $p_{0}(\cdot)$ is uniform, i.e., $p_{0}(i)=p_{0}(\widetilde{\imath})=0.5$. Eqs. 2 and 4 are normalized over the true input $i$ and distractor $\widetilde{\imath}$; Eq. 3 is normalized over the output vocabulary. We construct an output text sequence for the pragmatic speaker $S_{1}^{D}$ incrementally using beam search to approximately maximize Eq. 3.

Meaning Representations A distractor MR is automatically constructed for each input to be as distinctive as possible from the input. We construct this distractor by masking each present input attribute and replacing the value of each non-present attribute with the value that is most frequent for that attribute in the training data. For example, for the input MR in Figure 1, the distractor is NEAR[BURGER KING].

Summarization For each extracted input sentence $i^{(p)}$, we use the previous extracted sentence $i^{(p-1)}$ from the same document as the distractor input $\widetilde{\imath}$ (for the first sentence we do not use a distractor). This is intended to encourage outputs $o^{(p)}$ to contain distinctive information relative to the other summary sentences produced for the same document.
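A single timestep of the distractor-based update (Eqs. 2–4) can be sketched as follows. This is a simplified toy, not the authors' implementation: the belief $p_{t-1}$ is taken to already fold in the prefix likelihood (as the Eq. 4 update maintains), and the first key of the belief dictionary is treated as the true input:

```python
def s1d_step(p_prev, s0_next, alpha=1.0):
    """One timestep of distractor-based pragmatics over inputs {i, ~i}.

    p_prev:  dict input -> belief p_{t-1}(input)
    s0_next: dict input -> {token: S0(token | input, prefix)}
    Returns (S1^D distribution over next tokens for the true input,
    a function giving the updated belief p_t for a chosen token)."""
    inputs = list(p_prev)          # inputs[0] is treated as the true input i
    # Eq. 2: L_D(input | o_<t) ∝ S0(o_<t | input) * p_{t-1}(input); with the
    # prefix likelihood folded into p_prev, this is just a normalization.
    z = sum(p_prev.values())
    l_d = {inp: p_prev[inp] / z for inp in inputs}
    true_i = inputs[0]

    def listener_after(tok):
        # L_D(input | o_<=t): listener belief after also observing token tok.
        post = {inp: s0_next[inp].get(tok, 1e-9) * l_d[inp] for inp in inputs}
        zt = sum(post.values())
        return {inp: post[inp] / zt for inp in inputs}

    # Eq. 3: S1^D(tok | i, o_<t) ∝ L_D(i | o_<=t)^alpha * S0(tok | i, o_<t).
    scores = {tok: listener_after(tok)[true_i] ** alpha * p
              for tok, p in s0_next[true_i].items()}
    zs = sum(scores.values())
    s1 = {tok: s / zs for tok, s in scores.items()}
    # Eq. 4: p_t(input) ∝ S0(tok | input, o_<t) * L_D(input | o_<t).
    return s1, listener_after

# Toy speaker (hypothetical numbers): "riverside" is likely under both the
# true input and the distractor, while "cheap" is only likely under the
# true input, so the pragmatic speaker boosts the discriminative word.
s0 = {"i":  {"cheap": 0.5, "riverside": 0.5},
      "~i": {"cheap": 0.1, "riverside": 0.9}}
s1, _ = s1d_step({"i": 0.5, "~i": 0.5}, s0, alpha=1.0)
print(s1["cheap"] > s1["riverside"])  # prints True
```

Running beam search over these per-word $S_1^D$ scores, with the belief carried forward via Eq. 4, recovers the incremental decoding procedure described above.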
4 Experiments
For each of our two conditional generation tasks we evaluate on a standard benchmark dataset, following past work by using automatic evaluation against human-produced reference text. We choose hyperparameters for our models (beam size, and parameters $\alpha$ and $\lambda$) to maximize task metrics on each dataset's development set; see Appendix A.2 for the settings used.
| System | BLEU | NIST | METEOR | R-L | CIDEr |
|---|---|---|---|---|---|
| T-Gen | 65.93 | 8.61 | 44.83 | 68.50 | 2.23 |
| Best Prev. | 66.19† | 8.61† | **45.29** | **70.83**⋄ | 2.27⋄ |
| $S_0$ | 66.52 | 8.55 | 44.45 | 69.34 | 2.23 |
| $S_0 \times 2$ | 65.93 | 8.31 | 43.52 | 69.58 | 2.12 |
| $S_1^R$ | **68.60** | **8.73** | **45.25** | **70.82** | **2.37** |
| $S_1^D$ | 67.76 | 8.72 | 44.59 | 69.41 | 2.27 |
Table 1: Test results for the E2E generation task, in comparison to the T-Gen baseline (Dusek and Jurcicek, 2016) and the best results from the E2E challenge, reported by Dusek et al. (2018): †Juraska et al. (2018), ‡Puzikov and Gurevych (2018), ⋄Zhang et al. (2018), and •Gong (2018). We bold our highest performing model on each metric, as well as previous work if it outperforms all of our models.
4.1 Meaning Representations
We evaluate on the E2E task of generation from meaning representations containing restaurant attributes (Novikova et al., 2017). We report the task’s five automatic metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Lavie and Agarwal, 2007), ROUGE-L (Lin, 2004) and CIDEr (Vedantam et al., 2015).
Table 1 compares the performance of our base $S_{0}$ and pragmatic models to the baseline T-Gen system (Dusek and Jurcicek, 2016) and the best previous result from the 20 primary systems evaluated in the E2E challenge (Dusek et al., 2018). The systems obtaining these results encompass a range of approaches: a template system (Puzikov and Gurevych, 2018), a neural model (Zhang et al., 2018), models trained with reinforcement learning (Gong, 2018), and systems using ensembling and reranking (Juraska et al., 2018). To ensure that the benefit of the reconstructor-based pragmatic approach, which uses two models, is not due solely to a model combination effect, we also compare to an ensemble of two base models $(S_{0}\times2)$ . This ensemble uses a weighted combination of scores of two independently-trained $S_{0}$ models, following Eq. 1 (with weights tuned on the development data).
Both of our pragmatic systems improve over the strong baseline $S_{0}$ system on all five metrics, with the largest improvements (2.1 BLEU, 0.2 NIST, 0.8 METEOR, 1.5 ROUGE-L, and 0.1 CIDEr) from the $S_{1}^{R}$ model. This $S_{1}^{R}$ model outperforms the previous best results obtained by any system in the E2E challenge on BLEU, NIST, and CIDEr, with comparable performance on METEOR and ROUGE-L.
| System | R-1 | R-2 | R-L | METEOR |
|---|---|---|---|---|
| *Extractive* | | | | |
| Lead-3 | 40.34 | 17.70 | 36.57 | 22.21 |
| Inputs | 38.93 | 18.23 | 35.90 | 24.66 |
| *Abstractive* | | | | |
| Best Previous | **41.69**† | **19.47**† | **39.08**‡ | 21.00⋄ |
| $S_0$ | 40.88 | 17.80 | 38.54 | 20.38 |
| $S_0 \times 2$ | 40.76 | 17.88 | 38.46 | 19.88 |
| $S_1^R$ | 41.23 | 18.07 | 38.76 | 20.57 |
| $S_1^D$ | **41.39** | **18.30** | **38.78** | **21.70** |
Table 2: Test results for the non-anonymized CNN/Daily Mail summarization task. We compare to extractive baselines, and the best previous abstractive results of †Celikyilmaz et al. (2018), ‡Paulus et al. (2018) and ⋄Chen and Bansal (2018). We bold our highest performing model on each metric, as well as previous work if it outperforms all of our models.
4.2 Abstractive Summarization
We evaluate on the CNN/Daily Mail summarization dataset (Hermann et al., 2015; Nallapati et al., 2016), using See et al.’s (2017) non-anonymized preprocessing. As in previous work (Chen and Bansal, 2018), we evaluate using ROUGE and METEOR.
Table 2 compares our pragmatic systems to the base $S_{0}$ model (with scores taken from Chen and Bansal (2018); we obtained comparable performance in our reproduction), an ensemble of two of these base models, and the best previous abstractive summarization result for each metric on this dataset (Celikyilmaz et al., 2018; Paulus et al., 2018; Chen and Bansal, 2018). We also report two extractive baselines: Lead-3, which uses the first three sentences of the document as the summary (See et al., 2017), and Inputs, the concatenation of the extracted sentences used as inputs to our models (i.e., $i^{(1)},\dots,i^{(P)}$).
The pragmatic methods obtain improvements of 0.2–0.5 in ROUGE scores and 0.2–1.8 METEOR over the base $S_{0}$ model, with the distractor-based approach $S_{1}^{D}$ outperforming the reconstructor-based approach $S_{1}^{R}$. $S_{1}^{D}$ is strong across all metrics, obtaining results competitive to the best previous abstractive systems.


(a) Coverage ratios by attribute type for the base model $S_{0}$ and pragmatic models $S_{1}^{R}$ and $S_{1}^{D}$ . The pragmatic models typically improve coverage ratios across attribute types when compared to the base model.
| | FF | ET | Food | PR | Area | CR |
|---|---|---|---|---|---|---|
| $S_0$ | 0.50 | 0.98 | 0.88 | 0.91 | 0.96 | 0.90 |
| $S_1^D$-FF | 0.57 | 1.00 | 0.92 | 0.90 | 0.95 | 0.95 |
| $S_1^D$-ET | 0.47 | 1.00 | 0.96 | 0.92 | 0.96 | 0.95 |
| $S_1^D$-Food | 0.45 | 1.00 | 1.00 | 0.93 | 0.95 | 0.94 |
| $S_1^D$-PR | 0.51 | 1.00 | 0.90 | 0.98 | 0.93 | 0.92 |
| $S_1^D$-Area | 0.47 | 1.00 | 0.93 | 0.91 | 0.98 | 0.93 |
| $S_1^D$-CR | 0.45 | 1.00 | 0.91 | 0.90 | 0.91 | 0.95 |
(b) Coverage ratios by attribute type (columns) for the base model $S_{0}$ , and for the pragmatic system $S_{1}^{D}$ when constructing the distractor by masking the specified attribute (rows). Cell colors are the degree the coverage ratio increases (green) or decreases (red) relative to $S_{0}$ .
Figure 2: Coverage ratios for the E2E task by attribute type, estimating how frequently the values for each attribute from the input meaning representations are mentioned in the output text.
5 Analysis
The base speaker $S_{0}$ model is often underinformative, e.g., for the E2E task failing to mention certain attributes of an MR, even though almost all the training examples incorporate all of them. To better understand the performance improvements from the pragmatic models for E2E, we compute a coverage ratio as a proxy measure of how well content in the input is preserved in the generated outputs. The coverage ratio for each attribute is the fraction of times there is an exact match between the text in the generated output and the attribute's value in the source MR (for instances where the attribute is specified).
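A minimal sketch of this coverage computation (treating "exact match" as a verbatim, case-insensitive substring match, which is an assumption about the matching criterion):

```python
def coverage_ratios(examples):
    """examples: list of (mr, output_text) pairs, with mr a dict mapping
    attribute -> value. Returns attribute -> fraction of instances where the
    attribute's value appears verbatim in the output text, counted only over
    instances where the attribute is specified in the MR."""
    hits, totals = {}, {}
    for mr, text in examples:
        for attr, value in mr.items():
            totals[attr] = totals.get(attr, 0) + 1
            if value.lower() in text.lower():
                hits[attr] = hits.get(attr, 0) + 1
    return {a: hits.get(a, 0) / totals[a] for a in totals}

# Toy data illustrating the ratio (not from the E2E dataset).
data = [({"FOOD": "English", "AREA": "riverside"},
         "Fitzbillies serves English food."),
        ({"FOOD": "English"},
         "Fitzbillies is a coffee shop.")]
print(coverage_ratios(data))  # {'FOOD': 0.5, 'AREA': 0.0}
```

Exact string matching undercounts paraphrased mentions (e.g., "five stars" for CUSTOMERRATING[5 OUT OF 5]), which is why the ratio is only a proxy measure.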
Figure 2(a) shows coverage ratio by attribute category for all models. The $S_{1}^{R}$ model increases the coverage ratio when compared to $S_{0}$ across all attributes, showing that using the reconstruction model score to select outputs does lead to an increase in mentions for each attribute. Coverage ratios increase for $S_{1}^{D}$ as well in four out of six categories, but the increase is typically less than that produced by $S_{1}^{R}$ .
While $S_{1}^{D}$ optimizes less explicitly for attribute mentions than $S_{1}^{R}$, it still provides a potential method to control generated outputs by choosing alternate distractors. Figure 2(b) shows coverage ratios for $S_{1}^{D}$ when masking only a single attribute in the distractor. The highest coverage ratio for each attribute is usually obtained when masking that attribute in the distractor MR (entries on the main diagonal), in particular for FAMILYFRIENDLY (FF), FOOD, PRICERANGE (PR), and AREA. However, masking a single attribute sometimes results in decreasing the coverage ratio, and we also observe substantial increases from masking other attributes: e.g., masking either FAMILYFRIENDLY or CUSTOMERRATING (CR) produces an equal increase in coverage ratio for the CUSTOMERRATING attribute. This may reflect underlying correlations in the training data, as these two attributes have a small number of possible values (3 and 7, respectively).
6 Conclusion
Our results show that $S_{0}$ models from previous work, while strong, still imperfectly capture the behavior that people exhibit when generating text, and an explicit pragmatic modeling procedure can improve results. Both pragmatic methods evaluated in this paper encourage prediction of outputs that can be used to identify their inputs, either by reconstructing inputs in their entirety or distinguishing true inputs from distractors, so it is perhaps unsurprising that both methods produce similar improvements in performance. Future work might allow finer-grained modeling of the tradeoff between under- and over-informativity within the sequence generation pipeline (e.g., with a learned communication cost model) or explore applications of pragmatics for content selection earlier in the generation pipeline.
