[Paper Translation] Pragmatically Informative Text Generation


Original paper: https://arxiv.org/pdf/1904.01301v2


Pragmatically Informative Text Generation


Abstract


We improve the informativeness of models for conditional text generation using techniques from computational pragmatics. These techniques formulate language production as a game between speakers and listeners, in which a speaker should generate output text that a listener can use to correctly identify the original input that the text describes. While such approaches are widely used in cognitive science and grounded language learning, they have received less attention for more standard language generation tasks. We consider two pragmatic modeling methods for text generation: one where pragmatics is imposed by information preservation, and another where pragmatics is imposed by explicit modeling of distractors. We find that these methods improve the performance of strong existing systems for abstractive summarization and generation from structured meaning representations.


1 Introduction


Computational approaches to pragmatics cast language generation and interpretation as game-theoretic or Bayesian inference procedures (Golland et al., 2010; Frank and Goodman, 2012). While such approaches are capable of modeling a variety of pragmatic phenomena, their main application in natural language processing has been to improve the informativeness of generated text in grounded language learning problems (Monroe et al., 2018). In this paper, we show that pragmatic reasoning can be similarly used to improve performance in more traditional language generation tasks like generation from structured meaning representations (Figure 1) and summarization.


Our work builds on a line of learned Rational Speech Acts (RSA) models (Monroe and Potts, 2015; Andreas and Klein, 2016), in which generated strings are selected to optimize the behavior of an embedded listener model. The canonical presentation of the RSA framework (Frank and Goodman, 2012) is grounded in reference resolution: models of speakers attempt to describe referents in the presence of distractors, and models of listeners attempt to resolve descriptors to referents. Recent work has extended these models to more complex groundings, including images (Mao et al., 2015) and trajectories (Fried et al., 2018). The techniques used in these settings are similar, and the primary intuition of the RSA framework is preserved: from the speaker's perspective, a good description is one that picks out, as discriminatively as possible, the content the speaker intends for the listener to identify.


Figure 1: Example outputs of our systems on the E2E generation task. While a base sequence-to-sequence model ($S_{0}$, Sec. 2) fails to describe all attributes in the input meaning representation, both of our pragmatic systems ($S_{1}^{R}$, Sec. 3.1, and $S_{1}^{D}$, Sec. 3.2) and the human-written reference do.

Input meaning representation (i): NAME[FITZBILLIES], EATTYPE[COFFEE SHOP], FOOD[ENGLISH], PRICERANGE[CHEAP], CUSTOMERRATING[5 OUT OF 5], AREA[RIVERSIDE], FAMILYFRIENDLY[YES]
Human-written reference: A cheap coffee shop by the riverside with a customer rating of 5 out of 5 is Fitzbillies. Fitzbillies is family friendly and serves English food.
Base sequence-to-sequence model ($S_0$): Fitzbillies is a family friendly coffee shop located by the riverside.
Distractor-based pragmatic system ($S_1^D$): Fitzbillies is a family friendly coffee shop that serves English food. It is located in the riverside area, has a customer rating of 5 out of 5, and is cheaply priced.
Reconstructor-based pragmatic system ($S_1^R$): Fitzbillies is a family friendly coffee shop serving cheap English food in the riverside area. It has a customer rating of 5 out of 5.

Outside of grounding, cognitive modeling (Frank et al., 2009), and targeted analysis of linguistic phenomena (Orita et al., 2015), rational speech acts models have seen limited application in the natural language processing literature. In this work we show that they can be extended to a distinct class of language generation problems that use as referents structured descriptions of linguistic content, or other natural language texts. In accordance with the maxim of quantity (Grice, 1970) or the Q-principle (Horn, 1984), pragmatic approaches naturally correct underinformativeness problems observed in state-of-the-art language generation systems ($S_{0}$ in Figure 1).


We present experiments on two language generation tasks: generation from meaning representations (Novikova et al., 2017) and summarization. For each task, we evaluate two models of pragmatics: the reconstructor-based model of Fried et al. (2018) and the distractor-based model of Cohn-Gordon et al. (2018). Both models improve performance on both tasks, increasing ROUGE scores by 0.2–0.5 points on the CNN/Daily Mail abstractive summarization dataset and BLEU scores by 2 points on the End-to-End (E2E) generation dataset, obtaining new state-of-the-art results.


2 Tasks


We formulate a conditional generation task as taking an input $i$ from a space of possible inputs $\mathcal{T}$ (e.g., input sentences for abstractive summarization; meaning representations for structured generation) and producing an output $o$ as a sequence of tokens $\left(o_{1},\dots,o_{T}\right)$. We build our pragmatic approaches on top of learned base speaker models $S_{0}$, which produce a probability distribution $S_{0}(o\mid i)$ over output text for a given input. We focus on two conditional generation tasks where the information in the input context should largely be preserved in the output text, and apply the pragmatic procedures outlined in Sec. 3 to each task. For these $S_{0}$ models we use systems from past work that are strong, but may still be underinformative relative to human reference outputs (e.g., Figure 1).

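To make the later pragmatic procedures concrete, the sketch below (hypothetical Python, not code from the paper) shows the minimal interface such a base speaker $S_{0}$ would need to expose; the method names (`log_prob`, `next_token_logprobs`, `beam_search`) are illustrative assumptions, not part of any released system.

```python
from typing import Dict, Protocol, Sequence


class BaseSpeaker(Protocol):
    """Hypothetical interface for a trained base speaker S_0 (e.g., a
    sequence-to-sequence model). The pragmatic procedures in Sec. 3 only
    need these three operations."""

    def log_prob(self, output: Sequence[str], given: object) -> float:
        """Return log S_0(o | i) for a complete output sequence o."""
        ...

    def next_token_logprobs(self, given: object, prefix: Sequence[str]) -> Dict[str, float]:
        """Return log S_0(o_t | i, o_<t) for every candidate next token o_t."""
        ...

    def beam_search(self, given: object, beam_size: int) -> Sequence[Sequence[str]]:
        """Return candidate output sequences for input i, ranked by S_0(o | i)."""
        ...
```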

Meaning Representations Our first task is generation from structured meaning representations (MRs) containing attribute-value pairs (Novikova et al., 2017). An example is shown in Figure 1, where systems must generate a description of the restaurant with the specified attributes. We apply pragmatics to encourage output strings from which the input MR can be identified. For our $S_{0}$ model, we use a publicly-released neural generation system (Puzikov and Gurevych, 2018) that achieves comparable performance to the best published results in Dusek et al. (2018).


Abstractive Summarization Our second task is multi-sentence document summarization. There is a vast amount of past work on summarization (Nenkova and McKeown, 2011); recent neural models have used large datasets (e.g., Hermann et al. (2015)) to train models in both the extractive (Cheng and Lapata, 2016; Nallapati et al., 2017) and abstractive (Rush et al., 2015; See et al., 2017) settings. Among these works, we build on the recent abstractive neural summarization system of Chen and Bansal (2018). First, this system uses a sentence-level extractive model RNN-EXT to identify a sequence of salient sentences $i^{(1)},\dots,i^{(P)}$ in each source document. Second, the system uses an abstractive model ABS to rewrite each $i^{(p)}$ into an output $o^{(p)}$, which are then concatenated to produce the final summary. We rely on the fixed RNN-EXT model to extract sentences as inputs in our pragmatic procedure, using ABS as our $S_{0}$ model and applying pragmatics to the $i^{(p)} \to o^{(p)}$ abstractive step.

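As a rough sketch of this extract-then-rewrite pipeline (with `rnn_ext.select` and `abs_model.rewrite` as hypothetical stand-ins for the trained RNN-EXT and ABS components, not actual API calls from that work):

```python
def summarize(document_sentences, rnn_ext, abs_model):
    """Extract-then-rewrite summarization in the style of Chen and Bansal (2018)."""
    # 1) Extractive step: pick the salient source sentences i^(1), ..., i^(P).
    salient_sentences = rnn_ext.select(document_sentences)
    # 2) Abstractive step (the S_0 step that pragmatics later rescores):
    #    rewrite each extracted sentence i^(p) into an output sentence o^(p).
    rewritten = [abs_model.rewrite(sentence) for sentence in salient_sentences]
    # 3) Concatenate the rewritten sentences to form the final summary.
    return " ".join(rewritten)
```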

3 Pragmatic Models


To produce informative outputs, we consider pragmatic methods that extend the base speaker models, $S_{0}$, using listener models, $L$, which produce a distribution $L(i\mid o)$ over possible inputs given an output. Listener models are used to derive pragmatic speakers, $S_{1}(o\mid i)$, which produce output that has a high probability of making a listener model $L$ identify the correct input. There is a large space of possible choices for designing $L$ and deriving $S_{1}$; we follow two lines of past work, which we categorize as reconstructor-based and distractor-based. We tailor each of these pragmatic methods to both of our tasks by developing reconstructor models and methods of choosing distractors.


3.1 Reconstructor-Based Pragmatics


Pragmatic approaches in this category (Dusek and Jurcicek, 2016; Fried et al., 2018) rely on a reconstructor listener model $L^{R}$ defined independently of the speaker. This listener model produces a distribution $L^{R}(i\mid o)$ over all possible input contexts $i \in \mathcal{T}$, given an output description $o$. We use sequence-to-sequence or structured classification models for $L^{R}$ (described below), and train these models on the same data used to supervise the $S_{0}$ models.


The listener model and the base speaker model together define a pragmatic speaker, with output score given by:


$$
S_{1}^{R}(o\mid i)=L^{R}(i\mid o)^{\lambda}\cdot S_{0}(o\mid i)^{1-\lambda} \qquad (1)
$$

where $\lambda$ is a rationality parameter that controls how much the model optimizes for discriminative outputs (see Monroe et al. (2017) and Fried et al. (2018) for a discussion). We select an output text sequence $o$ for a given input $i$ by choosing the highest-scoring output under Eq. 1 from a set of candidates obtained by beam search in $S_{0}(\cdot\mid i)$.

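In log space, this candidate selection is a simple reranking of the $S_{0}$ beam; a minimal sketch, assuming the hypothetical `log_prob(output, given)` scoring interface sketched in Sec. 2:

```python
def rerank_with_reconstructor(i, candidates, s0, l_r, lam):
    """Choose the candidate o maximizing Eq. 1 in log space:
        lam * log L^R(i | o) + (1 - lam) * log S_0(o | i).

    `candidates` are output sequences from beam search under S_0(. | i);
    `s0` and `l_r` expose the hypothetical log_prob(output, given) interface.
    """
    def eq1_score(o):
        return lam * l_r.log_prob(i, given=o) + (1.0 - lam) * s0.log_prob(o, given=i)

    return max(candidates, key=eq1_score)
```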

Meaning Representations We construct $L^{R}$ for the meaning representation generation task as a multi-task, multi-class classifier, defining a distribution over possible values for each attribute. Each MR attribute has its own prediction layer and attention-based aggregation layer, which conditions on a basic encoding of $o$ shared across all attributes. See Appendix A.1 for architecture details. We then define $L^{R}(i\mid o)$ as the joint probability of predicting all input MR attributes in $i$ from $o$ .

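Concretely, the joint probability can be computed as a sum of per-attribute log probabilities; a minimal sketch, assuming the classifier's per-attribute logits have already been computed from an encoding of $o$ (the variable names here are illustrative, not from the paper's released code):

```python
import torch.nn.functional as F


def reconstructor_log_prob(input_mr, attribute_logits):
    """Compute log L^R(i | o) as the joint probability of every attribute value
    in the input MR, given per-attribute classifier outputs.

    `input_mr` maps attribute name -> index of its gold value; `attribute_logits`
    maps attribute name -> a logits tensor over that attribute's possible values.
    """
    total_log_prob = 0.0
    for attribute, gold_value_index in input_mr.items():
        log_probs = F.log_softmax(attribute_logits[attribute], dim=-1)
        total_log_prob = total_log_prob + log_probs[gold_value_index]
    return total_log_prob
```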

Summarization To construct $L^{R}$ for summarization, we train an ABS model (of the type we use for $S_{0}$, Chen and Bansal (2018)) but in reverse, i.e., taking as input a sentence in the summary and producing a sentence in the source document. We train $L^{R}$ on the same heuristically-extracted and aligned source document sentences used to train $S_{0}$ (Chen and Bansal, 2018).

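A sketch of this data reversal, with `aligned_pairs` as a hypothetical list of (source sentence, summary sentence) pairs from that heuristic alignment:

```python
def reversed_training_pairs(aligned_pairs):
    """Build training examples for the summarization reconstructor L^R by
    reversing the (source sentence, summary sentence) pairs used to train S_0:
    L^R reads a summary sentence and predicts the source sentence it came from."""
    return [(summary_sentence, source_sentence)
            for source_sentence, summary_sentence in aligned_pairs]
```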

3.2 Distractor-Based Pragmatics

Pragmatic approaches in this category (Frank and Goodman, 2012; Andreas and Klein, 2016; Vedantam et al., 2017; Cohn-Gordon et al., 2018) derive pragmatic behavior by producing outputs that distinguish the input $i$ from an alternate distractor input (or inputs). We construct a distractor $\widetilde{\imath}$ for a given input $i$ in a task-dependent way.1

We follow the approach of Cohn-Gordon et al. (2018), outlined briefly here. The base speakers we build on produce outputs incrementally, where the probability of $o_{t}$, the word output at time $t$, is conditioned on the input and the previously generated words: $S_{0}(o_{t}\mid i,o_{<t})$. Since the output is generated incrementally and there is no separate listener model that needs to condition on entire output decisions, the distractor-based approach is able to make pragmatic decisions at each word rather than choosing between entire output candidates (as in the reconstructor approaches).

The listener $L^{D}$ and pragmatic speaker $S_{1}^{D}$ are derived from the base speaker $S_{0}$ and a belief distribution $p_{t}(\cdot)$ maintained at each timestep $t$ over the possible inputs $\mathcal{T}^{D}=\{i,\widetilde{\imath}\}$:


$$
\begin{aligned}
L^{D}(i\mid o_{<t}) &\propto S_{0}(o_{<t}\mid i)\cdot p_{t-1}(i) &&\text{(2)}\\
S_{1}^{D}(o_{t}\mid i,o_{<t}) &\propto L^{D}(i\mid o_{<t})^{\alpha}\cdot S_{0}(o_{t}\mid i,o_{<t}) &&\text{(3)}\\
p_{t}(i) &\propto S_{0}(o_{t}\mid i,o_{<t})\cdot L^{D}(i\mid o_{<t}) &&\text{(4)}
\end{aligned}
$$

where $\alpha$ is again a rationality parameter, and the initial belief distribution $p_{0}(\cdot)$ is uniform, i.e., $p_{0}(i)=p_{0}(\widetilde{\imath})=0.5$. Eqs. 2 and 4 are normalized over the true input $i$ and distractor $\widetilde{\imath}$; Eq. 3 is normalized over the output vocabulary. We construct an output text sequence for the pragmatic speaker $S_{1}^{D}$ incrementally using beam search to approximately maximize Eq. 3.

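Below is a minimal decoding sketch following Eqs. 2–4, in which beam hypotheses are compared by their accumulated (unnormalized) Eq. 3 scores and the belief over $\{i, \widetilde{\imath}\}$ is updated per Eq. 4. The `next_token_logprobs` interface is the hypothetical one sketched in Sec. 2, and practical details such as length normalization are omitted; this is an illustrative reading of the equations, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    tokens: list = field(default_factory=list)
    lp_true: float = 0.0       # log S_0(o_<t | i)
    lp_dist: float = 0.0       # log S_0(o_<t | i~)
    belief_true: float = 0.5   # p_{t-1}(i); p_0 is uniform over {i, i~}
    score: float = 0.0         # accumulated (unnormalized) Eq. 3 score
    done: bool = False


def listener_true(h):
    """Eq. 2: L^D(i | o_<t) proportional to S_0(o_<t | i) * p_{t-1}(i),
    normalized over the true input i and the distractor i~."""
    eps = 1e-12
    log_true = h.lp_true + math.log(max(h.belief_true, eps))
    log_dist = h.lp_dist + math.log(max(1.0 - h.belief_true, eps))
    m = max(log_true, log_dist)
    log_norm = m + math.log(math.exp(log_true - m) + math.exp(log_dist - m))
    return math.exp(log_true - log_norm)


def distractor_beam_search(i, i_dist, s0, alpha, beam_size, max_len, eos):
    """Sketch of incremental decoding for the pragmatic speaker S_1^D."""
    beam = [Hypothesis()]
    for _ in range(max_len):
        candidates = []
        for h in beam:
            if h.done:
                candidates.append(h)
                continue
            ld = listener_true(h)                                 # L^D(i | o_<t)
            next_true = s0.next_token_logprobs(i, h.tokens)       # log S_0(o_t | i, o_<t)
            next_dist = s0.next_token_logprobs(i_dist, h.tokens)  # log S_0(o_t | i~, o_<t)
            for token, lp in next_true.items():
                lp_d = next_dist.get(token, -1e9)
                # Eq. 4: p_t(i) proportional to S_0(o_t | i, o_<t) * L^D(i | o_<t)
                p_true = math.exp(lp) * ld
                p_dist = math.exp(lp_d) * (1.0 - ld)
                candidates.append(Hypothesis(
                    tokens=h.tokens + [token],
                    lp_true=h.lp_true + lp,
                    lp_dist=h.lp_dist + lp_d,
                    belief_true=p_true / max(p_true + p_dist, 1e-30),
                    # Eq. 3 (unnormalized, log space): alpha * log L^D + log S_0
                    score=h.score + alpha * math.log(max(ld, 1e-12)) + lp,
                    done=(token == eos),
                ))
        beam = sorted(candidates, key=lambda c: c.score, reverse=True)[:beam_size]
        if all(c.done for c in beam):
            break
    return beam[0].tokens
```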

Meaning Representations A distractor MR is automatically constructed for each input to be as distinct as possible from the input. We construct this distractor by masking each present input attribute and replacing the value of each non-present attribute with the value that is most frequent for that attribute in the training data. For example, for the input MR in Figure 1, the distractor is NEAR[BURGER KING].

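A sketch of this construction, where `most_frequent_value` is an assumed precomputed table mapping each attribute to its most common training-set value:

```python
def build_distractor_mr(input_mr, all_attributes, most_frequent_value):
    """Build the distractor MR: mask (drop) every attribute present in the input,
    and fill every absent attribute with its most frequent value in the training
    data. `most_frequent_value` is a hypothetical attribute -> value lookup."""
    return {
        attribute: most_frequent_value[attribute]
        for attribute in all_attributes
        if attribute not in input_mr  # keep only attributes absent from the input
    }


# For the Figure 1 input, NEAR is the only absent attribute, so the distractor
# comes out as {"NEAR": "BURGER KING"} (its most frequent training value).
```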


Summarization For each extracted input sentence $i^{(p)}$, we use the previous extracted sentence $i^{(p-1)}$ from the same document as the distractor input $\widetilde{\imath}$ (for the first sentence we do not use a distractor). This is intended to encourage outputs $o^{(p)}$ to contain distinctive information against other summaries produced within the same document.

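A sketch of this pairing of extracted sentences with their distractors:

```python
def summarization_distractors(extracted_sentences):
    """Pair each extracted sentence i^(p) with the previous extracted sentence
    i^(p-1) from the same document as its distractor; the first extracted
    sentence gets no distractor (None)."""
    return [
        (sentence, extracted_sentences[p - 1] if p > 0 else None)
        for p, sentence in enumerate(extracted_sentences)
    ]
```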

4 Experiments


For each of our two conditional generation tasks we evaluate on a standard benchmark dataset, following past work by using automatic evaluation against human-produced reference text. We choose hyperparameters for our models (beam size, and parameters $\alpha$ and $\lambda$) to maximize task metrics on each dataset's development set; see Appendix A.2 for the settings used.2
