ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
Abstract
Current pre-training works in natural language generation pay little attention to the problem of exposure bias on downstream tasks. To address this issue, we propose an enhanced multi-flow sequence to sequence pre-training and fine-tuning framework named ERNIE-GEN, which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method. To make generation closer to human writing patterns, this framework introduces a span-by-span generation flow that trains the model to predict semantically-complete spans consecutively rather than predicting word by word. Unlike existing pre-training methods, ERNIE-GEN incorporates multi-granularity target sampling to construct pre-training data, which enhances the correlation between encoder and decoder. Experimental results demonstrate that ERNIE-GEN achieves state-of-the-art results with a much smaller amount of pre-training data and parameters on a range of language generation tasks, including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue response generation (Persona-Chat) and generative question answering (CoQA). The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
1 Introduction
Pre-trained on large-scale unlabeled text corpora and fine-tuned on downstream tasks, self-supervised representation models such as GPT [Radford et al., 2018], BERT [Devlin et al., 2019] and XLNet [Yang et al., 2019b] have achieved remarkable improvements in natural language understanding (NLU). Different from encoder-only pre-training like BERT or decoder-only pre-training like GPT, natural language generation (NLG) relies on the sequence to sequence generation framework (seq2seq), which consists of a bidirectional encoder and a unidirectional decoder. Current pre-training works in NLG such as MASS [Song et al., 2019] and UNILM [Dong et al., 2019] mainly focus on jointly pre-training the encoder and the decoder on different self-supervised tasks. However, these works pay little attention to the exposure bias issue [Ranzato et al., 2016], a major drawback of teacher-forcing training: ground truth words are used during training, while generated words, whether predicted correctly or not, are used for inference, where mistakes tend to accumulate. To alleviate this issue, we present ERNIE-GEN, an enhanced multi-flow seq2seq training framework characterized by a carefully-designed Multi-Flow Attention architecture based on Transformer [Vaswani et al., 2017], as illustrated in Figure 2. ERNIE-GEN incorporates a novel infilling generation mechanism and a noise-aware generation method into pre-training and fine-tuning, which are proved effective through the experiments in $\S4.3$.

Figure 1: Schematic of two generation mechanisms (left) and data strategies for pre-training (right). Blocks in green, orange and blue denote source texts, target texts and artificial symbols, respectively.
• Infilling generation. Instead of using the last ground truth word in training or the last generated word in inference, we adopt an inserted artificial symbol [ATTN], along with its position, to gather history contextual representations at each step in both training and inference. This diverts the model's attention away from the last word and coerces it into focusing on all former representations, thus alleviating the negative influence of previous mistakes on subsequent generation, as shown in Figure 1(b).
• Noise-aware generation. We corrupt the input target sequence by randomly replacing words with arbitrary words from the vocabulary. This setup, despite its simplicity, proves to be an effective way to make the model aware of mistakes in training, so that the model is able to detect mistakes and ignore them during inference.

Figure 2: Overview of the ERNIE-GEN framework. $S$, $T$ and $Y$ denote the source, target, and generated texts; $T^{\prime}$ is the noised version of $T$.
Moreover, in light of the fact that entities, phrases and sentences in human writing are organized in a coherent manner, we incorporate a span-by-span generation task into ERNIE-GEN as a new generation flow that trains the model to predict semantically-complete spans consecutively rather than predicting word by word as traditional models do. This task is implemented through the infilling generation mechanism and runs in parallel with an infilling-based word-by-word generation flow to facilitate convergence in training, as shown in Figure 1(b).
In addition, as shown in Figure 1(c-d), recent pre-training works for NLG like UNILM and MASS only sample a single continuous segment as the target sequence. However, this sampling method compromises the correlation between encoder and decoder when it comes to pre-training on long texts (typically 512 words), given that adjacent segments are often semantically relevant. ERNIE-GEN adopts a multi-granularity target fragment sampling strategy to force the decoder to rely more on the encoder representations than on the previously generated words, thus enhancing the correlation between encoder and decoder, as shown in Figure 1(e).
Empirically, ERNIE-GEN is particularly effective and achieves state-of-the-art results on a range of NLG tasks including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue response generation (Persona-Chat) and generative question answering (CoQA), utilizing a much smaller amount of pre-training data and parameters.
2 Related Work
Pre-training for NLP Tasks. Recently, pre-training methods have achieved state-of-the-art results in multiple NLU tasks. ELMo [Peters et al., 2018] pre-trains two unidirectional language models (LMs) in the forward and backward direction respectively to provide features for downstream tasks. GPT utilizes an adjusted Transformer [Vaswani et al., 2017] to learn a forward LM and then fine-tunes it on supervised datasets. BERT proposes a masked language modeling (MLM) task to learn deep bidirectional representations. Nevertheless, the above methods are usually implemented with just one encoder or decoder, which is less effective in encoder-decoder based generation tasks. Thus, several works have preliminarily explored pre-training towards NLG by incorporating BERT's MLM into the seq2seq framework, showing excellent performance on a range of generation tasks. MASS masks a consecutive fragment (50%) of the input sentence with [MASK] symbols and predicts it. UNILM masks several words in the input sequence, which is a pair of segments for encoder and decoder, and then predicts the masked words in accordance with BERT's MLM.
Exposure Bias Issue. NLG tasks suffer from exposure bias, which is caused by teacher-forcing training. To address this issue, RNN-based variational autoencoders (VAEs) are leveraged in [Yang et al., 2019a; Wang et al., 2019], though they require inference for both the posterior and prior distributions. Reinforcement learning has also been applied to text generation against the exposure bias issue [Ranzato et al., 2016; Wang et al., 2018], which is, however, inefficient during training because of the word-by-word sampling procedure. These methods are inefficient and less practical for pre-training that relies on large-scale unlabeled text corpora.
Span-level Pre-training. [Sun et al., 2019; Sun et al., 2020; Joshi et al., 2019] verify that predicting spans reaches substantially better performance on NLU tasks. Meanwhile, inspired by characteristics of human expression, we expect the model to have the foresight to generate a semantically-complete span at each step rather than a single word. Consequently, a span-by-span generation task is proposed to make the model capable of generating more human-like texts.
3 Proposed Framework
Built on the infilling generation mechanism, ERNIE-GEN adopts a Multi-Flow Attention architecture to train the model on word-by-word and span-by-span generation tasks in parallel. In this section, we describe ERNIE-GEN according to the training process shown in Figure 2.
3.1 Multi-Granularity Target Fragments
Given an input sequence $S=\{s_{1},...,s_{n}\}$, we first sample a length distribution $D_{i}$ from a distribution set $D=\{D_{1},...,D_{|D|}\}$ with probability $p_{i}$ for target fragments, and then select fragments according to $D_{i}$ in $S$ iteratively until the fragment budget has been spent (e.g. 25% of $S$). We denote $S_{j}^{i}$ as the $j$-th fragment sampled from length distribution $D_{i}$. The sampled fragments are then removed from $S$ and stitched together to form the target sequence $T=[T_{1},...,T_{k}]=[S_{1}^{i},...,S_{k}^{i}]$. We denote $S^{\prime}$ as the remaining input sequence after removing the sampled fragments. ERNIE-GEN performs pre-training by predicting the fragmented target sequence $T$ and minimizing the negative log-likelihood:
$$\mathcal{L}(\theta)=-\log P(T\,|\,S^{\prime};\theta)=-\sum_{j=1}^{k}\log P\big(T_{j}\,|\,T_{<j},S^{\prime};\theta\big)$$
where the target sequence $T$ is sorted by the positions of the sampled fragments. For each fragment $T=\{t_{1},...,t_{|T|}\}$ in $T$, we have $P(T)=\prod_{j=1}^{|T|}P(t_{j}|t_{<j})$.
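As a concrete illustration, the factorized fragment likelihood above can be turned into a loss value directly; the per-token probabilities passed in below are hypothetical stand-ins for the model's softmax outputs:

```python
import math

def fragment_nll(token_probs):
    """NLL of one fragment T = {t_1,...,t_|T|}:
    -log P(T) = -sum_j log P(t_j | t_<j)."""
    return -sum(math.log(p) for p in token_probs)

def target_nll(fragments):
    """Loss over the whole fragmented target: sum of fragment NLLs."""
    return sum(fragment_nll(f) for f in fragments)
```

For example, a fragment whose two tokens are each predicted with probability 0.5 contributes $2\log 2$ nats to the loss.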

Figure 3: Illustration of the Multi-Flow Attention module. (a): Overview of multi-flow attention; the encoder and the decoder share the parameters of a multi-layer Transformer. (b): Word-by-word generation flow with history contextual representations from the contextual flow. (c): Span-by-span generation flow with the shared contextual flow. (d): The attention mask matrices of the word-by-word generation flow $(M_{W})$, the contextual flow $(M_{C})$ and the span-by-span generation flow $(M_{S})$. The $i$-th generated token $y_{i}$ is calculated by $\operatorname{argmax}(\operatorname{softmax}(\operatorname{Ec}(a_{i}^{(L-1)})))$.
Following preliminary trials, we set a hyperparameter $\gamma=0.25$, which denotes the ratio of the total length of all fragments to that of the input sequence $S$. Besides, we introduce two uniform distributions $D=\{U(1,4),U(4,32)\}$, sampled with probability 0.4 and 0.6 respectively, to draw fragments, which aims to learn representations from different perspectives. On the one hand, short fragments benefit the learning of semantic relations between words; on the other hand, longer fragments help to memorize sentence-level expressions.
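The sampling procedure of §3.1 can be sketched as follows; the function name and the retry cap are illustrative conveniences, not part of the original implementation:

```python
import random

def sample_fragments(S, gamma=0.25, dists=((1, 4), (4, 32)), weights=(0.4, 0.6)):
    """Sketch of multi-granularity target fragment sampling.

    A length distribution U(lo, hi) is first drawn from `dists`, then
    non-overlapping fragments are carved out of S until the budget
    (gamma * len(S)) is spent. Returns the leftover source S' and the
    position-ordered fragment list T = [T_1, ..., T_k].
    """
    n = len(S)
    budget = int(gamma * n)
    lo, hi = random.choices(dists, weights=weights, k=1)[0]
    taken = [False] * n
    fragments, spent = [], 0
    for _ in range(1000):               # retry cap, to keep the sketch finite
        if spent >= budget:
            break
        length = min(random.randint(lo, hi), budget - spent)
        start = random.randrange(0, n - length + 1)
        if any(taken[start:start + length]):
            continue                    # overlaps an earlier fragment: resample
        taken[start:start + length] = [True] * length
        fragments.append((start, S[start:start + length]))
        spent += length
    fragments.sort(key=lambda f: f[0])  # sort fragments by position
    S_prime = [tok for i, tok in enumerate(S) if not taken[i]]
    return S_prime, [frag for _, frag in fragments]
```

Every token of $S$ ends up either in $S'$ or in exactly one fragment, and the total fragment length never exceeds $\gamma\,|S|$.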
3.2 Noise-Aware Generation
To train a generation model that can detect false predictions and mitigate their impact on subsequent generation, we corrupt the ground truth sequence $T$ by randomly replacing words, and the corrupted $T$ is represented as $T^{\prime}$. There are two hyperparameters, $\rho_{p}$ and $\rho_{f}$, denoting the noising rate in pre-training and fine-tuning respectively.
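A minimal sketch of this corruption step, assuming integer token ids and a flat replacement distribution over the vocabulary:

```python
import random

def noise_target(target_ids, vocab_size, rho):
    """Corrupt the ground-truth target T into T' by replacing each token,
    with probability rho, by an arbitrary token from the vocabulary.
    rho corresponds to rho_p in pre-training or rho_f in fine-tuning."""
    return [random.randrange(vocab_size) if random.random() < rho else tok
            for tok in target_ids]
```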
3.3 Architecture: Multi-Flow Attention
Formally, given a source sequence $S=\{s_{1},...,s_{n}\}$ and a noised target sequence $T^{\prime}=\{t_{1},...,t_{m}\}$, we denote the inference of the seq2seq network based on a shared Transformer as follows:
$$s_{i}^{(l+1)}=\mathrm{MultiHead}\big(Q=s_{i}^{(l)},\,K=V=S^{(l)}\big),\qquad t_{i}^{(l+1)}=\mathrm{MultiHead}\big(Q=t_{i}^{(l)},\,K=V=[S^{(l)},T_{\leq i}^{\prime(l)}]\big)$$
where $Q,K,V$ denote the query, key, and value in multi-head attention [Vaswani et al., 2017]. $s_{i}^{(l)}$ and $t_{i}^{(l)}$ indicate the $i$-th vector representations of the $l$-th layer of multi-head attention for the encoder and the decoder respectively, and $[\cdot]$ denotes the concatenation operation. In this work, we call the above procedure the Contextual Flow.
Word-by-word Generation Flow. Based on the infilling generation mechanism, this flow utilizes an inserted [ATTN] symbol to gather history representations word by word (see Figure 1(b)). To facilitate this process, we place all inserted [ATTN] symbols together to construct an artificial symbol sequence $A_{W}=\{[\mathrm{ATTN}]_{1},...,[\mathrm{ATTN}]_{m}\}$, which has the same length as $T^{\prime}$, as shown in Figure 3(b). Specifically, the word-by-word generation flow is updated as follows:
$$a_{i}^{(l+1)}=\mathrm{MultiHead}\big(Q=a_{i}^{(l)},\,K=V=[S^{(l)},T_{<i}^{\prime(l)},a_{i}^{(l)}]\big)$$
where $a_{i}^{(l)}$ indicates the $i$-th vector representation of the $l$-th layer for the artificial symbol sequence $A_{W}$.
Span-by-span Generation Flow. Different from the word-by-word generation flow, the span-by-span flow uses [ATTN] symbols to predict spans consecutively, as shown in Figure 3(c). Formally, given a list of span boundaries $B=\{b_{1},...,b_{|B|}\}$, we conduct the span-by-span generation flow as:
$$a_{j}^{(l+1)}=\mathrm{MultiHead}\big(Q=a_{j}^{(l)},\,K=V=[S^{(l)},T_{<b_{i}}^{\prime(l)},a_{j}^{(l)}]\big),\quad j\in[b_{i},b_{i+1})$$
where $j\in[b_{i},b_{i+1})$, and $a_{j}^{(l)}$ denotes the $(j-b_{i})$-th vector representation of the $i$-th span. Essentially, the model is trained to predict a whole span $\{t_{b_{i}},...,t_{b_{i+1}-1}\}$ with the same history context $[S,t_{<b_{i}}]$. Instead of sampling spans randomly, we prefer spans carrying semantic information and knowledge. Specifically, we take the following two steps to sample spans consecutively in $T^{\prime}$:
• Firstly, we perform a t-test to compute the t-statistic scores of all bigrams and trigrams, based on the null hypothesis $H_{0}$: a random span of $n$ arbitrary words $w=\{w_{1},...,w_{n}\}$ with probability $p^{\prime}(w)=\prod_{i=1}^{n}p(w_{i})$ cannot be a statistical $n$-gram. The t-statistic score is calculated as $\frac{p(w)-p^{\prime}(w)}{\sqrt{\sigma^{2}/N}}$, where $p(w)=\frac{\mathrm{Count}(w)}{N}$ and $\sigma^{2}=p(w)(1-p(w))$ indicate the statistical probability and the variance of $w$ respectively, and $N$ denotes the total number of $n$-grams appearing in the training data. According to the t-statistic scores, we select the top 200,000 bigrams, the top 50,000 trigrams and all unigrams to construct a specific vocabulary of spans, which is represented as $V_{span}$.
• Secondly, we search the trigram, bigram and unigram in order, starting from the current word, until a span ($n$-gram, $n\leq 3$) is retrieved in $V_{span}$.
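The two steps above can be sketched as follows; the toy corpus and the helper names are hypothetical, and a real vocabulary would keep only the top-scoring bigrams and trigrams:

```python
import math
from collections import Counter

def t_statistic_scores(tokens, n):
    """t-statistic of each n-gram w against H0 (tokens independent):
    t = (p(w) - p'(w)) / sqrt(sigma^2 / N), with p(w) = Count(w)/N,
    p'(w) = prod_i p(w_i) and sigma^2 = p(w)(1 - p(w))."""
    N = len(tokens) - n + 1
    total = len(tokens)
    unigram = Counter(tokens)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(N))
    scores = {}
    for w, count in ngrams.items():
        p = count / N
        p_indep = math.prod(unigram[t] / total for t in w)
        var = p * (1 - p)
        if var > 0:
            scores[w] = (p - p_indep) / math.sqrt(var / N)
    return scores

def next_span(tokens, start, span_vocab):
    """Greedy retrieval: try the trigram, then the bigram, then the
    unigram starting at `start`; fall back to the single word."""
    for n in (3, 2, 1):
        cand = tuple(tokens[start:start + n])
        if len(cand) == n and cand in span_vocab:
            return cand
    return (tokens[start],)
```

On a toy corpus, a collocation such as "new york" scores higher than an incidental pair such as "york is", so it would survive the top-k cut into $V_{span}$.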
Multi-Flow Attention. To integrate the word-by-word generation flow and the span-by-span generation flow, we apply them in parallel with a shared contextual flow by leveraging the multi-flow attention architecture, as Figure 3(a) describes. The multi-flow attention is computed as:
$$X^{(l+1)}=\mathrm{MultiHead}\big(Q=X^{(l)},\,K=V=X^{(l)},\,M_{C}\big)$$
$$A_{W}^{(l+1)}=\mathrm{MultiHead}\big(Q=A_{W}^{(l)},\,K=V=[X^{(l)},A_{W}^{(l)}],\,M_{W}\big)$$
$$A_{S}^{(l+1)}=\mathrm{MultiHead}\big(Q=A_{S}^{(l)},\,K=V=[X^{(l)},A_{S}^{(l)}],\,M_{S}\big)$$
where $X$ denotes the concatenation of $S$ and $T^{\prime}$, and $X^{(l)}$ is the vector sequence of the $l$-th layer for the contextual flow. $A_{W}^{(l)}$ and $A_{S}^{(l)}$ are the vector sequences of the $l$-th layer for the word-by-word and span-by-span generation flows respectively. As shown in Figure 3(d), the attention mask matrix $M$ determines whether query and key can attend to each other by modifying the attention weight $W=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d}}+M)$ [Vaswani et al., 2017]. Specifically, $M$ is assigned as:
$$M_{ij}=\begin{cases}0, & \text{if token $i$ can attend to token $j$}\\ -\infty, & \text{otherwise}\end{cases}$$
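A plain-Python sketch of two of these masks, assuming the key layout $[S, T^{\prime}]$ for the contextual flow and $[S, T^{\prime}, A_{W}]$ for the word-by-word flow ($M_{S}$ is analogous, with span boundaries in place of word positions):

```python
NEG = float("-inf")

def contextual_mask(n_src, n_tgt):
    """M_C over keys [S, T']: every query sees the source; a target
    query at position i additionally sees target positions <= i."""
    L = n_src + n_tgt
    M = [[NEG] * L for _ in range(L)]
    for q in range(L):
        for k in range(n_src):
            M[q][k] = 0.0                     # the source is always visible
    for i in range(n_tgt):
        for k in range(i + 1):
            M[n_src + i][n_src + k] = 0.0     # causal mask over the target
    return M

def word_flow_mask(n_src, n_tgt):
    """M_W: queries are the m [ATTN] symbols of A_W; keys are
    [S, T', A_W]. The i-th query sees S, target words t'_{<i}, and itself."""
    L = n_src + 2 * n_tgt
    M = [[NEG] * L for _ in range(n_tgt)]
    for i in range(n_tgt):
        for k in range(n_src + i):            # source plus history words
            M[i][k] = 0.0
        M[i][n_src + n_tgt + i] = 0.0         # its own [ATTN] position
    return M
```

Adding these matrices to the raw attention logits before the softmax drives the weights of forbidden positions to zero.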
During training, we add the losses of the word-by-word and span-by-span generation flows with a coefficient $\lambda$:
$$\mathcal{L}=\lambda\,\mathcal{L}_{\mathrm{word}}(T)+(1-\lambda)\,\mathcal{L}_{\mathrm{span}}(T)$$
where $T$ indicates the unnoised target sequence, and $\mathcal{L}(\cdot)$ denotes the cross-entropy loss function. In detail, we set $\lambda=0.5$ in pre-training and $\lambda=1.0$ in fine-tuning.
3.4 Inference: Infilling Decoding
During inference, the target sequence $T$ is unknown, so we insert the symbol [ATTN] step by step to gather the representation of the history context, instead of preparing an artificial symbol sequence $A$ in advance. Meanwhile, for efficiency, we drop the inserted [ATTN] after each step of inference, as detailed in Figure 4.
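The insert-and-drop loop can be sketched abstractly; `step_fn` is a hypothetical stand-in for the full model forward pass over the source and the generated prefix:

```python
def infilling_decode(step_fn, max_len, eos_id):
    """Greedy infilling decoding sketch: at every step an [ATTN] symbol is
    inserted to gather the history representation, the next token is read
    off its output, and the [ATTN] slot is dropped before the next step.
    `step_fn(prefix)` maps the generated prefix (plus the implicit source)
    to the next token id."""
    prefix = []
    for _ in range(max_len):
        next_id = step_fn(prefix)   # query via the inserted [ATTN]
        prefix.append(next_id)      # keep the word, drop the [ATTN]
        if next_id == eos_id:
            break
    return prefix
```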
4 Experiments
In this section, we compare ERNIE-GEN with previous works and conduct several ablation experiments to assess the performance of the methods proposed in $\S3$.
4.1 Pre-training and Implementation
Analogous to BERT and UNILM, ERNIE-GEN is trained on English Wikipedia and BookCorpus [Zhu et al., 2015], totaling 16GB. We also pre-train ERNIE-GEN on larger-scale text corpora, as described in appendix A. The input sequence is lowercased and truncated to a maximum length of 512. We train a base model ERNIE-GEN$_{BASE}$ ($L{=}12$, $H{=}768$, $A{=}12$, Total Parameters${=}110$M) and a large model ERNIE-GEN$_{LARGE}$ ($L{=}24$, $H{=}1024$, $A{=}16$, Total Parameters${=}340$M) with parameters initialized by BERT$_{BASE}$ and BERT$_{LARGE}$ respectively. Specifically, the Adam optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-9}$ is employed. The peak learning rate is 5e-5 with warmup over the first 4,000 steps and linear decay scheduling. The noising rate $\rho_{p}$ for pre-training is 0.05. Batches are organized by limiting the maximum number of tokens to 196,608. Pre-training experiments are carried out on the PaddlePaddle platform and Nvidia Tesla V100 GPUs. By virtue of float16 mixed precision training, it takes almost 4 days for 400,000 steps to train ERNIE-GEN$_{BASE}$, and almost 7 days for 450,000 steps to train ERNIE-GEN$_{LARGE}$.
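The stated schedule (peak 5e-5, warmup over 4,000 steps, linear decay) can be written as a simple function; the decay-to-zero endpoint at the final step is an assumption of this sketch:

```python
def learning_rate(step, peak=5e-5, warmup=4000, total=400000):
    """Pre-training schedule: linear warmup to `peak` over the first
    `warmup` steps, then linear decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```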

Figure 4: Schematic of infilling decoding: the particular procedures in infilling decoding, including dropping and inserting (left), and the attention mask matrices at each step (right).
4.2 Fine-tuning on Downstream Tasks
Abstractive Summarization aims at generating fluent and concise summaries without being constrained to extracting sub-sequences from the input articles. We conduct experiments on the Gigaword dataset [Rush et al., 2015] and the CNN/DailyMail dataset [Hermann et al., 2015]. The Gigaword dataset contains 3.8M articles extracted from the Gigaword corpus, while the CNN/DailyMail dataset consists of 93k articles and 220k articles from CNN and Daily Mail respectively.
| Model | Data | Params | RG-1/RG-2/RG-L |
|---|---|---|---|
| *10k training samples: Gigaword 10k* | | | |
| MASS [Song et al., 2019] | 18G | 160M | 25.03/9.48/23.48 |
| UNILM$_{LARGE}$ [Dong et al., 2019] | 16G | 340M | 32.96/14.68/30.56 |
| ERNIE-GEN$_{BASE}$ | 16G | 110M | 33.75/15.23/31.35 |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 35.05/16.10/32.50 |
| *Full 3.8M training samples* | | | |
| MASS [Song et al., 2019] | 18G | 160M | 37.66/18.53/34.89 |
| BERTSHARE [Rothe et al., 2019] | 16G | 110M | 38.13/19.81/35.62 |
| UNILM$_{LARGE}$ [Dong et al., 2019] | 16G | 340M | 38.45/19.45/35.75 |
| PEGASUS (C4) [Zhang et al., 2019] | 750G | 568M | 38.75/19.96/36.14 |
| PEGASUS (HugeNews) [Zhang et al., 2019] | 3.8T | 568M | 39.12/19.86/36.24 |
| ERNIE-GEN$_{BASE}$ | 16G | 110M | 38.83/20.04/36.20 |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 39.25/20.25/36.53 |
Table 2: Comparison on the Gigaword dataset with state-of-the-art results. Models in the upper block use 10k samples for fine-tuning. We also report the size of pre-training data and the number of parameters utilized for each listed model (columns 2-3). RG is short for ROUGE.
Table 1: Hyperparameters of fine-tuning for ERNIE-GEN$_{BASE}$ and ERNIE-GEN$_{LARGE}$.
| Task | Epochs (BASE/LARGE) | Learning Rate (BASE/LARGE) | Noising Rate $\rho_f$ (BASE/LARGE) | Dropout Rate (BASE/LARGE) | Batch Size | Label Smooth | Beam Size | Evaluation Metric |
|---|---|---|---|---|---|---|---|---|
| SQuAD QG | 10 / 10 | 2.5e-5 / 1.5e-5 | 0.7 / 0.7 | 0.1 / 0.2 | 32 | 0.1 | 1 | BLEU-4, METEOR (MTR), ROUGE-L (RG-L) |
| CNN/DailyMail | 30 / 20 | 5e-5 / 4e-5 | 0.7 / 0.7 | 0.1 / 0.1 | 64 | 0.1 | 5 | ROUGE-F1 scores: ROUGE-1 (RG-1), ROUGE-2 (RG-2), ROUGE-L (RG-L) |
| Gigaword | 10 / 5 | 3e-5 / 3e-5 | 0.5 / 0.6 | 0.1 / 0.2 | 128 | 0.1 | 5 | ROUGE-F1 scores: ROUGE-1 (RG-1), ROUGE-2 (RG-2), ROUGE-L (RG-L) |
| Persona-Chat | 30 | 1e-4 | 0.0 | 0.1 | 64 | 0.1 | 10 | BLEU-1, BLEU-2, Distinct-1, Distinct-2 |
| Generative CoQA | 10 | 1e-5 | 0.5 | 0.1 | 32 | 0.1 | 3 | F1-score |
Table 3: Evaluation results on CNN/DailyMail. C4 and HugeNews are two massive datasets of 750G and $3.8\mathrm{T}$ respectively.
| Model | Data | Params | RG-1/RG-2/RG-L |
|---|---|---|---|
| BERTSHARE [Rothe et al., 2019] | 16G | 110M | 39.25/18.09/36.45 |
| BERTSUMABS [Liu and Lapata, 2019] | 16G | 110M | 41.72/19.39/38.76 |
| MASS [Song et al., 2019] | 18G | 160M | 42.12/19.50/39.01 |
| UNILM$_{LARGE}$ [Dong et al., 2019] | 16G | 340M | 43.33/20.21/40.51 |
| T5$_{LARGE}$ [Raffel et al., 2019] | 750G | 340M | 42.50/20.68/39.75 |
| T5$_{XLARGE}$ [Raffel et al., 2019] | 750G | 11B | 43.52/21.55/40.69 |
| BART$_{LARGE}$ [Lewis et al., 2019] | 430G | 400M | 44.16/21.28/40.90 |
| PEGASUS (C4) [Zhang et al., 2019] | 750G | 568M | 43.90/21.20/40.76 |
| PEGASUS (HugeNews) [Zhang et al., 2019] | 3.8T | 568M | 44.17/21.47/41.11 |
| ERNIE-GEN$_{BASE}$ | 16G | 110M | 42.30/19.92/39.68 |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 44.02/21.17/41.26 |
The results on the Gigaword task at two scales (10k and 3.8M) are presented in Table 2, and the fine-tuning settings are shown in Table 1. On the low-resource task (Gigaword 10k), ERNIE-GEN$_{LARGE}$ yields a gain of +1.94 ROUGE-L compared with UNILM$_{LARGE}$. On the full training set, ERNIE-GEN$_{LARGE}$ sets new state-of-the-art results, outperforming various previous methods. Notably, ERNIE-GEN$_{BASE}$ outperforms PEGASUS (568M parameters, 750G data) while using only 110M parameters and 16G training data.
Table 3 shows the performance on CNN/DailyMail. With a similar amount of pre-training data and parameters, ERNIE-GEN$_{BASE}$ outperforms MASS by +0.67 ROUGE-L. Fairly compared with UNILM$_{LARGE}$, ERNIE-GEN$_{LARGE}$ obtains a substantial gain of +0.73 ROUGE-L. Meanwhile, in spite of smaller pre-training data and fewer parameters, our large model also achieves a state-of-the-art result on ROUGE-L and comparable performance on ROUGE-1/2.
Question Generation aims to generate a question according to a given input passage and a corresponding answer. We evaluate on the SQuAD 1.1 dataset [Rajpurkar et al., 2016] for the question generation task (called SQuAD QG). Following UNILM, we redistribute the original dataset into a new training set and testing set, with the original development set unchanged. We also conduct an experiment with the reversed dev $\leftrightarrow$ test split, as [Zhao et al., 2018] indicates. In Table 4, we present the results of ERNIE-GEN and several previous works. Again, ERNIE-GEN outperforms UNILM$_{LARGE}$ and achieves a new state-of-the-art result on question generation, with a gain of +1.82 BLEU-4.
Table 4: Question generation results on SQuAD. Models in the upper block and the lower block use different test $\leftrightarrow$ dev split methods.
| Model | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|
| SemQG [Zhang and Bansal, 2019] | 20.76 | 24.20 | 48.91 |
| UNILM$_{LARGE}$ [Dong et al., 2019] | 22.12 | 25.06 | 51.07 |
| ERNIE-GEN$_{BASE}$ (beam size = 1) | 22.28 | 25.13 | 50.58 |
| ERNIE-GEN$_{LARGE}$ (beam size = 1) | 24.03 | 26.31 | 52.36 |
| ERNIE-GEN$_{LARGE}$ (beam size = 5) | 25.40 | 26.92 | 52.84 |
| *Reversed test↔dev split* | | | |
| MP-GSN [Zhao et al., 2018] | 16.38 | 20.25 | 44.48 |
| SemQG [Zhang and Bansal, 2019] | 20.76 | 24.20 | 48.91 |
| UNILM$_{LARGE}$ [Dong et al., 2019] | 23.75 | 25.61 | 52.04 |
| ERNIE-GEN$_{BASE}$ (beam size = 1) | 23.52 | 25.61 | 51.45 |
| ERNIE-GEN$_{LARGE}$ (beam size = 1) | 25.57 | 26.89 | 53.31 |
| ERNIE-GEN$_{LARGE}$ (beam size = 5) | 26.95 | 27.57 | 53.77 |
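ROUGE-L, reported throughout these tables, is an LCS-based F-score between a candidate and a reference. A minimal sketch of the metric follows (official ROUGE adds stemming, sentence splitting, and other options, and the weighting factor $\beta$ varies between implementations; $\beta = 1.2$ here is just a commonly used value):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score between two whitespace-tokenized strings."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    # Weighted harmonic mean of LCS precision and recall.
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

An identical candidate and reference score 1.0; a candidate sharing no tokens with the reference scores 0.0.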
Table 5: Comparison with state-of-the-art results on Persona-Chat.
| Model | BLEU-1/2 | Distinct-1/2 |
|---|---|---|
| LIC [Bao et al., 2020] | 40.5/32.0 | 0.019/0.113 |
| PLATO w/ latent [Bao et al., 2020] | 45.8/35.7 | 0.012/0.064 |
| PLATO [Bao et al., 2020] | 40.6/31.5 | 0.021/0.121 |
| ERNIE-GEN$_{LARGE}$ | 46.8/36.4 | 0.023/0.168 |
Generative Question Answering / Dialogue Response in multi-turn conversations is challenging because of the complex background knowledge and diverse utterances involved. We conduct an experiment on the Persona-Chat dataset [Zhang et al., 2018], generating responses conditioned on a given multi-turn conversation and persona profile. Table 5 shows that ERNIE-GEN outperforms the current task-specific pre-training model on dialogue generation. Besides, we run an experiment on the CoQA dataset [Reddy et al., 2019] to generate free-form answers for input questions and conversations. As shown in Table 6, our generative question answering model works considerably better than earlier works, with a gain of $+2.0$ F1 score.
Table 6: Generative question answering results on the development set of CoQA.
| Model | F1 score |
|---|---|
| Seq2Seq [Reddy et al., 2019] | 27.5 |
| PGNet [Reddy et al., 2019] | 45.4 |
| UNILM$_{LARGE}$ [Dong et al., 2019] | 82.5 |
| ERNIE-GEN$_{LARGE}$ | 84.5 |
4.3 Ablation Studies
To better understand the importance of each proposed generation method, we conduct experiments concerning the following two aspects:
• The robustness of the infilling generation mechanism and the noise-aware generation method against exposure bias.
• The effectiveness of the span-by-span generation task and of the complete ERNIE-GEN model.
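The noise-aware generation method fine-tunes on deliberately corrupted target sequences so the model learns to tolerate its own erroneous predictions. As a rough illustration (not the authors' exact implementation; the function name and interface are assumptions), corrupting a fraction $\rho_f$ of the ground-truth target token ids with random vocabulary ids might look like:

```python
import random


def noise_target(target_ids, vocab_size, noising_rate, seed=None):
    """Replace each target token id with a random vocabulary id
    with probability `noising_rate`, mimicking erroneous predictions.
    The original (unnoised) ids remain the prediction labels."""
    rng = random.Random(seed)
    noised = list(target_ids)
    for i in range(len(noised)):
        if rng.random() < noising_rate:
            noised[i] = rng.randrange(vocab_size)
    return noised
```

During fine-tuning the model would read the noised sequence on the decoder side while the loss is still computed against the clean labels, which is what bridges the gap to inference, where previous predictions may be wrong.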
In Table 8, we compare two ERNIE-GEN$_{BASE}$ variants that are pre-trained with the typical generation mechanism and the infilling generation mechanism respectively, both generating word by word. Rows 1-3 show that without noising ground-truth texts, infilling generation outperforms typical generation

Figure 5: Results of ablation studies. (a): Comparison between typical generation and infilling generation on Gigaword 10k and SQuAD QG with different fine-tuning noising rates $\rho_{f}$. (b): Noising analysis: average attention weights of source words, unnoised target words and noised target words for diverse fine-tuning noising rates $\rho_{f}$. (c): Ablation study on Gigaword 10k; the x-axis shows fine-tuning epochs.
Table 7: Ablation study for ERNIE-GEN$_{BASE}$ and its variants. In particular, we set $\rho_{p}=0.05$ in pre-training (row 1); when removing the span-by-span generation task (row 3), we set $\rho_{p}=0.2$ because pre-training becomes easier.
Fine-tuning methods: (1) noising fine-tuning, i.e. fine-tuning with noise-aware generation (first three result columns); (2) masking fine-tuning, i.e. only the gradients of masked words are updated (last three result columns).

| # | Model | (1) Gigaword 10k RG-1/RG-2/RG-L | (1) CNN/DailyMail 10k RG-1/RG-2/RG-L | (1) SQuAD QG BLEU-4/MTR/RG-L | (2) Gigaword 10k RG-1/RG-2/RG-L | (2) CNN/DailyMail 10k RG-1/RG-2/RG-L | (2) SQuAD QG BLEU-4/MTR/RG-L |
|---|---|---|---|---|---|---|---|
| 1 | ERNIE-GEN | 33.75/15.23/31.35 | 39.92/17.46/37.40 | 23.52/25.61/51.45 | 33.30/15.04/31.22 | 39.54/17.61/37.00 | 22.99/25.14/51.31 |
| 2 | – noise-aware | 33.57/15.15/31.28 | 39.78/17.63/37.23 | 23.40/25.50/51.36 | 33.01/14.94/31.00 | 39.53/17.61/36.97 | |
| 3 | – span-by-span | 33.43/15.04/31.14 | 39.75/17.62/37.21 | 23.37/25.56/51.32 | 32.97/14.92/30.94 | 39.54/17.57/36.95 | |
| 4 | – 2 & 3 | 33.23/14.77/31.00 | 39.71/17.55/37.18 | 23.34/25.54/51.30 | 32.57/14.68/30.60 | 39.49/17.66/36.96 | |
| # | Task (Metrics) | Typical generation | Infilling generation |
|---|---|---|---|
| | *Fine-tuning without noise-aware generation* | | |
| 1 | Gigaword 10k (RG-1/RG-2/RG-L) | 32.98/14.67/30.51 | 32.93/14.46/30.53 |
| 2 | CNN/DM 10k (RG-1/RG-2/RG-L) | 39.25/16.70/36.65 | 39.56/16.93/36.94 |
| 3 | SQuAD QG (BLEU-4/MTR/RG-L) | 21.95/24.53/50.34 | 22.13/24.66/50.51 |
| | *Fine-tuning with noise-aware generation* | | |
| 4 | Gigaword 10k (RG-1/RG-2/RG-L) | 32.99/14.83/30.84 | 33.23/14.77/31.00 |
| 5 | CNN/DM 10k (RG-1/RG-2/RG-L) | 39.34/17.30/36.75 | 39.71/17.55/37.18 |
| 6 | SQuAD QG (BLEU-4/MTR/RG-L) | 23.23/25.47/51.25 | 23.34/25.54/51.30 |
Table 8: Results of models pre-trained with typical generation and infilling generation. Tasks in the upper block are fine-tuned without noising, while the others are fine-tuned with noise-aware generation.
across tasks. Furthermore, both variants achieve remarkable improvements when fine-tuned with the noise-aware generation method (rows 4-6). Specifically, Figure 5a shows the results with diverse choices of the noising rate $\rho_{f}$ on two tasks, indicating that appropriate noising substantially benefits training and alleviates the training-inference discrepancy. To further analyze the advantage of the infilling generation mechanism with noising, we compute the average attention weights of source tokens, unnoised target tokens and noised target tokens in the last self-attention layer, respectively, over 1,000 samples. Average attention weights for diverse noising rates $\rho_{f}$ are shown in Figure 5b: as the noising rate $\rho_{f}$ increases during fine-tuning, the model attends more carefully on the decoder side to identify noised positions and assigns them lower attention weights. Thereby, the model learns to detect and ignore false predictions, alleviating the accumulation of mistakes during inference.
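The group-averaged attention statistics behind Figure 5b can be sketched as follows, assuming the last layer's attention matrix and boolean masks over key positions are available (all names here are illustrative, not the authors' code):

```python
import numpy as np


def group_attention_means(attn, source_mask, noised_mask):
    """Average attention weight that decoder positions assign to
    (i) source tokens, (ii) unnoised target tokens, (iii) noised target tokens.

    attn:        [num_queries, num_keys] attention weights (rows sum to 1)
    source_mask: [num_keys] bool, True for source-side keys
    noised_mask: [num_keys] bool, True for noised target keys
    """
    target_mask = ~source_mask
    unnoised_mask = target_mask & ~noised_mask
    means = {}
    for name, mask in [("source", source_mask),
                       ("unnoised_target", unnoised_mask),
                       ("noised_target", noised_mask)]:
        # Mean over all (query, selected key) pairs; 0.0 if the group is empty.
        means[name] = float(attn[:, mask].mean()) if mask.any() else 0.0
    return means
```

Averaging such statistics over many samples and several noising rates $\rho_f$ yields curves of the kind plotted in Figure 5b.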
In column 1 of Table 7, we compare four base-size variants on three tasks. We see that the noise-aware generation method and the span-by-span generation task (rows 2-3 of Table 7) play an important role in ERNIE-GEN pre-training and significantly outperform the baseline model that is only pre-trained with the word-by-word infilling generation flow (row 4 of Table 7). After integrating the noise-aware generation method and the span-by-span generation task, ERNIE-GEN boosts the performance across all three tasks, as shown in row 1 of Table 7. In addition, UNILM is fine-tuned by masking words in the encoder and decoder to predict, which is also a form of noising for generation. To verify the idea that fine-tuning with masked language modeling, as in UNILM, is inefficient because masking (noising) is coupled with prediction so that only the masked (noised) positions are learned, we also list the fine-tuning results obtained by predicting masked words with a masking probability of 0.7, as shown in column 2 of Table 7. We observe that our noise-aware generation method, which predicts all words on the decoder side, significantly outperforms masked language modeling in seq2seq fine-tuning.
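The contrast between the two fine-tuning columns of Table 7 comes down to which decoder positions contribute to the loss. A schematic sketch (assumed shapes and names, not the actual training code): noise-aware generation uses an all-True loss mask, while masked-LM fine-tuning restricts the mask to the masked (noised) positions.

```python
import numpy as np


def masked_nll(log_probs, labels, loss_mask):
    """Mean negative log-likelihood over positions selected by loss_mask.

    log_probs: [seq_len, vocab] log-probabilities from the decoder
    labels:    [seq_len] gold token ids
    loss_mask: [seq_len] bool; noise-aware generation sets it all-True
               (every decoder position is predicted), whereas masked-LM
               fine-tuning sets True only at masked (noised) positions.
    """
    token_nll = -log_probs[np.arange(len(labels)), labels]
    return float(token_nll[loss_mask].sum() / max(loss_mask.sum(), 1))
```

With the all-True mask, every decoder position produces a gradient, which is the efficiency argument made above for noise-aware generation.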
5 Conclusions
We present ERNIE-GEN, an enhanced multi-flow seq2seq pre-training and fine-tuning framework for language generation, which incorporates an infilling generation mechanism and a noise-aware generation method to alleviate exposure bias. Besides, ERNIE-GEN integrates a new span-by-span generation task to train the model to generate texts as humans write, which further improves performance on downstream tasks. Through extensive experiments, ERNIE-GEN achieves state-of-the-art results on a range of NLG tasks. Future work includes incorporating reinforcement learning into pre-training to address exposure bias, and applying ERNIE-GEN to more NLG tasks such as machine translation.
Acknowledgments
This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).
[Zhang et al., 2019] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019.
[Zhao et al., 2018] Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In EMNLP, pages 3901–3910, 2018.
[Zhu et al., 2015] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, pages 19–27, 2015.
A Appendix
A.1 Pre-training on Large-scale Text Corpora
Recent pre-training works verify that larger-scale pre-training corpora can improve performance on downstream tasks. We pre-train the ERNIE-GEN$_{LARGE}$ model on the 430GB text corpora for 1 epoch and 1M training steps. Our 430GB text corpora are extracted from the corpora used by RoBERTa [Liu et al., 2019], T5 [Raffel et al., 2019] and ALBERT [Lan et al., 2020]. We fine-tune ERNIE-GEN$_{LARGE}$ on two abstractive summarization datasets, Gigaword and CNN/DailyMail; the evaluation results are reported in Table 9. Notice that the performance increases significantly as ERNIE-GEN$_{LARGE}$ is pre-trained on the larger-scale text corpora.
Table 9: Evaluation results on Gigaword and CNN/DailyMail for pre-trained models using large-scale text corpora. ERNIE-GEN with the $^\ddag$ mark is pre-trained with the 430GB text corpora.
| Model | Data | Params | RG-1/RG-2/RG-L |
|---|---|---|---|
| *Gigaword 10k* | | | |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 35.05/16.10/32.50 |
| ERNIE-GEN$_{LARGE}^\ddag$ | 430G | 340M | 35.51/16.79/33.23 |
| *Gigaword* | | | |
| PEGASUS (C4) [Zhang et al., 2019] | 750G | 568M | 38.75/19.96/36.14 |
| PEGASUS (HugeNews) [Zhang et al., 2019] | 3.8T | 568M | 39.12/19.86/36.24 |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 39.25/20.25/36.53 |
| ERNIE-GEN$_{LARGE}^\ddag$ | 430G | 340M | 39.46/20.34/36.74 |
| *CNN/DailyMail* | | | |
| T5$_{LARGE}$ [Raffel et al., 2019] | 750G | 340M | 42.50/20.68/39.75 |
| T5$_{XLARGE}$ [Raffel et al., 2019] | 750G | 11B | 43.52/21.55/40.69 |
| BART$_{LARGE}$ [Lewis et al., 2019] | 160G | 400M | 44.16/21.28/40.90 |
| PEGASUS (C4) [Zhang et al., 2019] | 750G | 568M | 43.90/21.20/40.76 |
| PEGASUS (HugeNews) [Zhang et al., 2019] | 3.8T | 568M | 44.17/21.47/41.11 |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 44.02/21.17/41.26 |
| ERNIE-GEN$_{LARGE}^\ddag$ | 430G | 340M | 44.31/21.35/41.60 |
We also fine-tune ERNIE-GEN on the SQuAD 1.1 dataset for the question generation task; the results are presented in Table 10. We observe that the larger-scale pre-training corpora slightly improve the ROUGE-L and BLEU-4 scores on the SQuAD dataset.
| Model | Data | Params | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 25.40 | 26.92 | 52.84 |
| ERNIE-GEN$_{LARGE}^\ddag$ | 430G | 340M | 25.41 | 25.91 | 52.84 |
| *Reversed test↔dev split* | | | | | |
| ERNIE-GEN$_{LARGE}$ | 16G | 340M | 26.95 | 27.57 | 53.77 |
| ERNIE-GEN$_{LARGE}^\ddag$ | 430G | 340M | 27.05 | 27.43 | 53.83 |
Table 10: Question generation results on SQuAD. ERNIE-GEN with the $^\ddag$ mark is pre-trained with the 430GB text corpora.
The fine-tuning hyperparameters of ERNIE-GEN$_{LARGE}^\ddag$ are presented in Table 11.
Table 11: Hyperparameters used for fine-tuning on CNN/DailyMail, Gigaword, and SQuAD QG.
| | CNN/DailyMail | Gigaword | SQuAD QG |
|---|---|---|---|
| Learning rate | 4e-5 | 3e-5 | 1.25e-5 |
| Batch size | 32 | 128 | 32 |
| Epochs | 17 | 5 | 10 |
| Dropout rate | 0.1 | 0.2 | 0.2 |
| Warmup ratio | 0.02 | 0.1 | 0.25 |
| Noising rate $\rho_f$ | 0.7 | 0.6 | 0.7 |
| Label smoothing | 0.1 | 0.1 | 0.1 |
| Beam size | 5 | 6 | 5 |
| Length penalty | 1.2 | 0.7 | 1.0 |
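The beam size and length penalty in Table 11 interact through the sequence-scoring rule used at decoding time. Assuming a GNMT-style length penalty, a common choice in seq2seq decoders (the paper does not spell out the exact formula, so this is an illustrative sketch), the score of a finished beam hypothesis would be:

```python
def length_penalized_score(logprob_sum, length, alpha):
    """GNMT-style length-normalized beam score; higher is better.

    logprob_sum: summed token log-probabilities of the hypothesis
    length:      number of generated tokens
    alpha:       the 'length penalty' hyperparameter (e.g. 1.2 for
                 CNN/DailyMail, 0.7 for Gigaword in Table 11)
    """
    lp = ((5.0 + length) / 6.0) ** alpha
    return logprob_sum / lp
```

Larger alpha divides summed log-probabilities (which are negative) by a bigger factor for long hypotheses, so it counteracts the bias of raw sums toward short outputs; alpha = 0 recovers unnormalized scoring.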