ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
Abstract
Current pre-training works in natural language generation pay little attention to the problem of exposure bias on downstream tasks. To address this issue, we propose an enhanced multi-flow sequence to sequence pre-training and fine-tuning framework named ERNIE-GEN, which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method. To make generation closer to human writing patterns, this framework introduces a span-by-span generation flow that trains the model to predict semantically-complete spans consecutively rather than predicting word by word. Unlike existing pre-training methods, ERNIE-GEN incorporates multi-granularity target sampling to construct pre-training data, which enhances the correlation between encoder and decoder. Experimental results demonstrate that ERNIE-GEN achieves state-of-the-art results with a much smaller amount of pre-training data and parameters on a range of language generation tasks, including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue response generation (Persona-Chat) and generative question answering (CoQA). The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
1 Introduction
Pre-trained on large-scale unlabeled text corpora and fine-tuned on downstream tasks, self-supervised representation models such as GPT [Radford et al., 2018], BERT [Devlin et al., 2019] and XLNet [Yang et al., 2019b] have achieved remarkable improvements in natural language understanding (NLU). Different from encoder-only pre-training like BERT or decoder-only pre-training like GPT, natural language generation (NLG) relies on the sequence to sequence generation framework (seq2seq), which consists of a bidirectional encoder and a unidirectional decoder. Current pre-training works in NLG such as MASS [Song et al., 2019] and UNILM [Dong et al., 2019] mainly focus on jointly pre-training the encoder and decoder on different self-supervised tasks. However, these works pay little attention to the exposure bias issue [Ranzato et al., 2016], a major drawback of teacher-forcing training. This issue arises because ground truth words are used during training, while generated words, whether predicted correctly or not, are used during inference, where mistakes tend to accumulate. To alleviate this issue, we present ERNIE-GEN, an enhanced multi-flow seq2seq training framework characterized by a carefully-designed Multi-Flow Attention architecture based on Transformer [Vaswani et al., 2017], as illustrated in Figure 2. ERNIE-GEN incorporates a novel infilling generation mechanism and a noise-aware generation method into pre-training and fine-tuning, which are proved to be effective through the experiments in §4.3.
Figure 1: Schematic of two generation mechanisms (left) and data strategies for pre-training (right). Blocks in green, orange and blue denote source texts, target texts and artificial symbols.
• Infilling generation. Instead of using the last ground truth word in training or the last generated word in inference, we adopt an inserted artificial symbol [ATTN], along with its position, to gather history contextual representations at each step in both training and inference. This diverts the model's attention away from the last word and coerces it into focusing on all former representations, thus alleviating the negative influence of previous mistakes on subsequent generation, as shown in Figure 1(b).
• Noise-aware generation. We corrupt the input target sequence by randomly replacing words with arbitrary words from the vocabulary. This setup, despite its simplicity, proves to be an effective way to make the model aware of mistakes during training, so that the model is able to detect mistakes and ignore them during inference.
Figure 2: Overview of the ERNIE-GEN framework. $S$, $T$ and $Y$ denote source, target and generated texts; $T'$ is the noised version of $T$.
Moreover, in light of the fact that entities, phrases and sentences in human writing are organized in a coherent manner, we incorporate a span-by-span generation task into ERNIE-GEN as a new generation flow to train the model to predict semantically-complete spans consecutively rather than predicting word by word as traditional models do. This task is implemented through the infilling generation mechanism, in parallel with an infilling-based word-by-word generation flow to facilitate convergence in training, as shown in Figure 1(b).
In addition, as shown in Figure 1(c-d), recent pre-training works for NLG like UNILM and MASS only sample a single continuous segment as the target sequence. However, this sampling method compromises the correlation between encoder and decoder when it comes to pre-training on long texts (typically 512 words), given that adjacent segments are often semantically relevant. ERNIE-GEN adopts a multi-granularity target fragment sampling strategy to force the decoder to rely more on the encoder representations rather than on the previously generated words, thus enhancing the correlation between encoder and decoder, as shown in Figure 1(e).
Empirically, ERNIE-GEN is particularly effective and achieves state-of-the-art results on a range of NLG tasks including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue response generation (Persona-Chat) and generative question answering (CoQA), utilizing a much smaller amount of pre-training data and parameters.
2 Related Work
Pre-training for NLP Tasks. Recently, pre-training methods have achieved state-of-the-art results in multiple NLU tasks. ELMo [Peters et al., 2018] pre-trains two unidirectional language models (LMs) in the forward and backward directions respectively to provide features for downstream tasks. GPT utilizes an adjusted Transformer [Vaswani et al., 2017] to learn a forward LM and then fine-tunes it on supervised datasets. BERT proposes a masked language modeling (MLM) task to learn deep bidirectional representations. Nevertheless, the above methods use only an encoder or only a decoder, which is less effective for encoder-decoder based generation tasks. Thus several works have preliminarily explored pre-training for NLG by incorporating BERT's MLM into the seq2seq framework, showing excellent performance on a range of generation tasks. MASS masks a consecutive fragment (50%) of the input sentence with [MASK] symbols and predicts it. UNILM masks several words in the input sequence, which is a pair of segments for the encoder and decoder, and then predicts the masked words in the same way as BERT's MLM.
Exposure Bias Issue. NLG tasks suffer from exposure bias, which is caused by teacher-forcing training. To address this issue, RNN-based variational autoencoders (VAEs) are leveraged in [Yang et al., 2019a; Wang et al., 2019], which, however, require inference over both the posterior and the prior distributions. Reinforcement learning has also been adopted for text generation to counter the exposure bias issue [Ranzato et al., 2016; Wang et al., 2018], but it is inefficient during training because of the word-by-word sampling procedure. These methods are inefficient and less practical for pre-training that relies on large-scale unlabeled text corpora.
Span-level Pre-training. [Sun et al., 2019; Sun et al., 2020; Joshi et al., 2019] verify that predicting spans reaches substantially better performance on NLU tasks. Meanwhile, inspired by the characteristics of human expression, we hope the model has the foresight to generate a semantically-complete span at each step rather than a single word. Consequently, a span-by-span generation task is proposed to make the model capable of generating texts in a more human-like manner.
3 Proposed Framework
Built on the infilling generation mechanism, ERNIE-GEN adopts a Multi-Flow Attention architecture to train the model on word-by-word and span-by-span generation tasks in parallel. In this section, we describe ERNIE-GEN according to the training process shown in Figure 2.
3.1 Multi-Granularity Target Fragments
Given an input sequence $S=\{s_{1},...,s_{n}\}$, we first sample a length distribution $D_{i}$ from a distribution set $\boldsymbol{D}=\{D_{1},...,D_{|D|}\}$ with probability $p_{i}$ for target fragments, and then select fragments according to $D_{i}$ from $S$ iteratively until the fragment budget has been spent (e.g., 25% of $S$). We denote $S_{j}^{i}$ as the $j$-th fragment sampled under length distribution $D_{i}$. Sampled fragments are then removed from $S$ and stitched together to form the target sequence $\boldsymbol{T}=[T_{1},...,T_{k}]=[S_{1}^{i},...,S_{k}^{i}]$. We denote $S'$ as the remaining input sequence after removing the sampled fragments. ERNIE-GEN performs pre-training by predicting the fragmented target sequence $\boldsymbol{T}$ and minimizing the negative log-likelihood:
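One form of this objective consistent with the definitions above and with the per-fragment factorization below is:

$$
\mathcal{L}(\theta; S) \;=\; -\log P\big(\boldsymbol{T} \mid S'; \theta\big) \;=\; -\sum_{j=1}^{k} \log P\big(T_{j} \mid T_{<j}, S'; \theta\big)
$$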
where the target sequence $\boldsymbol{T}$ is sorted by the positions of the sampled fragments. For each fragment $T=\{t_{1},...,t_{|T|}\}$ in $\boldsymbol{T}$, we have $P(T)=\prod_{j=1}^{|T|}P(t_{j}\mid t_{<j})$.
Figure 3: Illustration of the Multi-Flow Attention module. (a): Overview of multi-flow attention; the encoder and the decoder share the parameters of the multi-layer Transformer. (b): Word-by-word generation flow with history contextual representations from the contextual flow. (c): Span-by-span generation flow with the shared contextual flow. (d): The attention mask matrices of the word-by-word generation flow ($M_{W}$), the contextual flow ($M_{C}$) and the span-by-span generation flow ($M_{S}$). The $i$-th generated token $y_{i}$ is calculated by $\mathrm{argmax}(\mathrm{softmax}(E_{c}(\boldsymbol{a}_{i}^{(L-1)})))$.
Following preliminary trials, we set a hyperparameter $\gamma=0.25$, which denotes the ratio of the total length of all fragments to the length of the input sequence $S$. Besides, we introduce two uniform distributions $\boldsymbol{D}=\{U(1,4), U(4,32)\}$ with probabilities of 0.4 and 0.6 respectively to sample fragments, which aims to learn representations from different perspectives. On the one hand, short fragments benefit the learning of semantic relations between words; on the other hand, longer fragments help to memorize sentence-level expressions.
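The sampling procedure can be sketched as follows; this is a minimal illustration rather than the released implementation, and the helper name, the overlap handling and the uniform choice of fragment positions are assumptions.

```python
import random

def sample_target_fragments(tokens, gamma=0.25,
                            dists=((1, 4), (4, 32)), probs=(0.4, 0.6),
                            max_tries=100):
    """Sample multi-granularity target fragments from an input sequence.

    Returns (source_left, fragments): the tokens kept as encoder input S'
    and the sampled fragments in positional order, whose total length is
    about gamma * len(tokens).
    """
    budget = int(gamma * len(tokens))
    lo, hi = random.choices(dists, weights=probs, k=1)[0]  # pick U(lo, hi)

    taken, fragments, tries = set(), [], 0
    while budget > 0 and tries < max_tries:
        tries += 1
        length = min(random.randint(lo, hi), budget, len(tokens))
        start = random.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if any(i in taken for i in span):
            continue  # skip overlapping candidates
        taken.update(span)
        fragments.append((start, tokens[start:start + length]))
        budget -= length

    fragments.sort(key=lambda x: x[0])  # keep positional order of fragments
    source_left = [t for i, t in enumerate(tokens) if i not in taken]
    return source_left, [frag for _, frag in fragments]
```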
3.2 Noise-Aware Generation
To train a generation model that can detect false predictions and mitigate their impact on subsequent generation, we corrupt the ground truth sequence $\boldsymbol{T}$ by randomly replacing words, and represent the corrupted sequence as $\boldsymbol{T}'$. There are two hyperparameters, $\rho_{p}$ and $\rho_{f}$, denoting the noising rate in pre-training and fine-tuning respectively.
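A minimal sketch of this corruption step, assuming replacement ids are drawn uniformly from the vocabulary (the replacement distribution is not specified in the text); `noise_target` is a hypothetical helper.

```python
import random

def noise_target(target_ids, vocab_size, rho):
    """Randomly replace each target token id with an arbitrary id from the
    vocabulary with probability rho (rho_p in pre-training, rho_f in
    fine-tuning). The unnoised target is still used as the label."""
    return [random.randrange(vocab_size) if random.random() < rho else tok
            for tok in target_ids]
```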
3.3 Architecture: Multi-Flow Attention
Formally, given a source sequence $S=\{s_{1},...,s_{n}\}$ and a noised target sequence $\boldsymbol{T}'=\{t_{1},...,t_{m}\}$, we denote the inference of the seq2seq network based on the shared Transformer as follows:
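Writing $\mathrm{MHAttn}(\cdot)$ for multi-head attention, one reconstruction consistent with the notation defined below is:

$$
\boldsymbol{s}_{i}^{(l)} = \mathrm{MHAttn}\big(Q=\boldsymbol{s}_{i}^{(l-1)},\; K,V=\boldsymbol{S}^{(l-1)}\big), \qquad
\boldsymbol{t}_{i}^{(l)} = \mathrm{MHAttn}\big(Q=\boldsymbol{t}_{i}^{(l-1)},\; K,V=\big[\boldsymbol{S}^{(l-1)},\,\boldsymbol{t}_{\le i}^{(l-1)}\big]\big)
$$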
where $Q$, $K$, $V$ denote the query, key and value in multi-head attention [Vaswani et al., 2017]. $\boldsymbol{s}_{i}^{(l)}$ and $\boldsymbol{t}_{i}^{(l)}$ indicate the $i$-th vector representations of the $l$-th layer of multi-head attention for the encoder and the decoder respectively, and $[\cdot]$ denotes the concatenation operation. In this work, we call the above procedure the Contextual Flow.
Word-by-word Generation Flow. Based on the infilling generation mechanism, this flow utilizes an inserted [ATTN] symbol to gather history representations word by word (see Figure 1b). To facilitate this process, we place all inserted [ATTN] symbols together to construct an artificial symbol sequence $A_{W}=\{[\mathrm{ATTN}]_{1},...,[\mathrm{ATTN}]_{m}\}$, which has the same length as $\boldsymbol{T}'$, as shown in Figure 3b. To be specific, the word-by-word generation flow is updated as follows:
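In the same notation, one consistent form of this update, in which the $i$-th [ATTN] query attends to the source and to the first $i-1$ target words but never to the $i$-th target word itself, is:

$$
\boldsymbol{a}_{i}^{(l)} = \mathrm{MHAttn}\big(Q=\boldsymbol{a}_{i}^{(l-1)},\; K,V=\big[\boldsymbol{S}^{(l-1)},\,\boldsymbol{t}_{<i}^{(l-1)},\,\boldsymbol{a}_{i}^{(l-1)}\big]\big)
$$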
where $\boldsymbol{a}_{i}^{(l)}$ indicates the $i$-th vector representation of the $l$-th layer for the artificial symbol sequence $A_{W}$.
Span-by-span Generation Flow. Different from the word-by-word generation flow, the span-by-span flow uses [ATTN] symbols to predict spans consecutively, as shown in Figure 3c. Formally, given a list of span boundaries $B=\{b_{1},...,b_{|B|}\}$, we conduct the span-by-span generation flow as:
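In the same notation, one consistent form, in which every [ATTN] position inside the $i$-th span shares the history up to the boundary $b_{i}$, is:

$$
\boldsymbol{a}_{j}^{(l)} = \mathrm{MHAttn}\big(Q=\boldsymbol{a}_{j}^{(l-1)},\; K,V=\big[\boldsymbol{S}^{(l-1)},\,\boldsymbol{t}_{<b_{i}}^{(l-1)},\,\boldsymbol{a}_{j}^{(l-1)}\big]\big)
$$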
where $j\in[b_{i},b_{i+1})$, and $\boldsymbol{a}_{j}^{(l)}$ denotes the $(j-b_{i})$-th vector representation of the $i$-th span. Essentially, the model is trained to predict a whole span $\{t_{b_{i}},...,t_{b_{i+1}-1}\}$ with the same history context $[S,\boldsymbol{t}_{<b_{i}}]$. Instead of randomly sampling spans, we prefer sampling spans with semantic information and knowledge. Specifically, we consider the following two steps to sample spans consecutively in $\boldsymbol{T}'$:
• Firstly, we implement a t-test to compute the t-statistic scores of all bigrams and trigrams, based on the null hypothesis $H_{0}$: a random span of $n$ arbitrary words $\boldsymbol{w}=\{w_{1},...,w_{n}\}$ with probability $p'(\boldsymbol{w})=\prod_{i=1}^{n}p(w_{i})$ cannot be a statistical $n$-gram. The t-statistic score is calculated by $\frac{p(\boldsymbol{w})-p'(\boldsymbol{w})}{\sqrt{\sigma^{2}/N}}$, where $p(\boldsymbol{w})=\frac{\mathrm{Count}(\boldsymbol{w})}{N}$ and $\sigma^{2}=p(\boldsymbol{w})(1-p(\boldsymbol{w}))$ indicate the empirical probability and the variance of $\boldsymbol{w}$ respectively, and $N$ denotes the total number of $n$-grams appearing in the training data. According to the t-statistic scores, we select the top 200,000 bigrams, the top 50,000 trigrams and all unigrams to construct a specific vocabulary of spans, denoted as $V_{\mathrm{span}}$.
• Secondly, starting from the current word, we search for the trigram, the bigram and the unigram in that order, until a span (an $n$-gram with $n\le 3$) is retrieved from $V_{\mathrm{span}}$; a sketch of both steps is given after this list.
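A compact sketch of the two steps above; the tokenization, tie-breaking and helper names (`build_span_vocab`, `next_span`) are illustrative assumptions rather than the released code.

```python
from collections import Counter
from math import sqrt

def build_span_vocab(corpus_tokens, top_bi=200_000, top_tri=50_000):
    """Score bigrams/trigrams with the t-statistic and keep the top ones,
    plus all unigrams, as the span vocabulary V_span."""
    unigrams = Counter(corpus_tokens)
    total_uni = sum(unigrams.values())
    p_word = {w: c / total_uni for w, c in unigrams.items()}

    vocab = {(w,) for w in unigrams}                 # all unigrams
    for n, top_k in ((2, top_bi), (3, top_tri)):
        ngrams = Counter(zip(*[corpus_tokens[i:] for i in range(n)]))
        N = sum(ngrams.values())
        scores = {}
        for gram, count in ngrams.items():
            p = count / N                            # empirical probability
            p_indep = 1.0
            for w in gram:                           # independence hypothesis H0
                p_indep *= p_word[w]
            var = p * (1 - p)
            scores[gram] = (p - p_indep) / sqrt(var / N) if var > 0 else 0.0
        vocab.update(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return vocab

def next_span(target, start, span_vocab):
    """Greedy lookup: try the trigram, then bigram, then unigram at `start`."""
    for n in (3, 2, 1):
        gram = tuple(target[start:start + n])
        if len(gram) == n and gram in span_vocab:
            return n
    return 1
```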
Multi-Flow Attention. To integrate the word-by-word generation flow and span-by-span generation flow, we apply them in parallel with a shared contextual flow by leveraging the multi-flow attention architecture, as Figure 3a describes. The multi-flow attention is computed as:
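One reconstruction consistent with the per-flow updates above and the mask matrices $M_{C}$, $M_{W}$ and $M_{S}$ in Figure 3d is:

$$
\begin{aligned}
\boldsymbol{X}^{(l)} &= \mathrm{MHAttn}\big(Q=\boldsymbol{X}^{(l-1)},\; K,V=\boldsymbol{X}^{(l-1)},\; M=M_{C}\big),\\
\boldsymbol{A}_{W}^{(l)} &= \mathrm{MHAttn}\big(Q=\boldsymbol{A}_{W}^{(l-1)},\; K,V=\big[\boldsymbol{X}^{(l-1)},\,\boldsymbol{A}_{W}^{(l-1)}\big],\; M=M_{W}\big),\\
\boldsymbol{A}_{S}^{(l)} &= \mathrm{MHAttn}\big(Q=\boldsymbol{A}_{S}^{(l-1)},\; K,V=\big[\boldsymbol{X}^{(l-1)},\,\boldsymbol{A}_{S}^{(l-1)}\big],\; M=M_{S}\big)
\end{aligned}
$$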
where $\boldsymbol{X}$ denotes the concatenation of $S$ and $\boldsymbol{T}'$, and $\boldsymbol{X}^{(l)}$ is the vector sequence of the $l$-th layer for the contextual flow. $\boldsymbol{A}_{W}^{(l)}$ and $\boldsymbol{A}_{S}^{(l)}$ are the vector sequences of the $l$-th layer for the word-by-word and span-by-span generation flows respectively. As shown in Figure 3d, the attention mask matrix $M$ determines whether the query and key can attend to each other by modifying the attention weight $W=\mathrm{softmax}\big(\frac{QK^{\top}}{\sqrt{d}}+M\big)$ [Vaswani et al., 2017]. Specifically, $M$ is assigned as:
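The standard assignment implied by "whether the query and key can attend to each other" is:

$$
M_{ij} =
\begin{cases}
0, & \text{if position } i \text{ is allowed to attend to position } j,\\
-\infty, & \text{otherwise.}
\end{cases}
$$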
While training, we add the losses of the word-by-word and span-by-span generation flows with a coefficient $\lambda$:
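A form consistent with this description and with the values of $\lambda$ given below (where $\lambda=1.0$ in fine-tuning leaves only the word-by-word loss) is:

$$
\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{word}}(\boldsymbol{T}) + (1-\lambda)\,\mathcal{L}_{\mathrm{span}}(\boldsymbol{T})
$$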
where $\boldsymbol{T}$ indicates the unnoised target sequence, and $\mathcal{L}(\cdot)$ denotes the cross-entropy loss function. In detail, we set $\lambda=0.5$ and $\lambda=1.0$ in pre-training and fine-tuning respectively.
3.4 Inference: Infilling Decoding
During inference, the target sequence $\boldsymbol{T}$ is unknown, so we insert the symbol [ATTN] step by step to gather the representation of the history context, instead of preparing an artificial symbol sequence $\boldsymbol{A}$ in advance. Meanwhile, for the purpose of efficiency, we drop the inserted [ATTN] after inference at each step, as detailed in Figure 4.
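A minimal greedy-decoding sketch of this insert-then-drop loop, assuming a generic `model` callable that returns per-position vocabulary logits; all names and the interface are illustrative.

```python
def infilling_greedy_decode(model, src_ids, attn_id, eos_id, max_len=128):
    """Greedy infilling decoding: at each step an [ATTN] symbol is appended
    to gather the history context, the next token is read off its output
    position, and the [ATTN] is dropped before the next step."""
    generated = []
    for _ in range(max_len):
        # Insert [ATTN] at the current position to query the history context.
        inp = src_ids + generated + [attn_id]
        logits = model(inp)                   # assumed: per-position vocab logits
        next_id = int(logits[-1].argmax())    # prediction at the [ATTN] slot
        if next_id == eos_id:
            break
        # Drop [ATTN]; keep only the predicted word for the next step.
        generated.append(next_id)
    return generated
```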
4 Experiments
In this section, we compare ERNIE-GEN with previous works and conduct several ablation experiments to assess the performance of the methods proposed in §3.
4.1 Pre-training and Implementation
Analogous to BERT and UNILM, ERNIE-GEN is trained on English Wikipedia and BookCorpus [Zhu et al., 2015], totaling 16GB. We also pre-train ERNIE-GEN on larger-scale text corpora, which is described in Appendix A. The input sequence is lowercased and truncated to a maximum length of 512. We train a base model ERNIE-GEN BASE (L=12, H=768, A=12, total parameters = 110M) and a large model ERNIE-GEN LARGE (L=24, H=1024, A=16, total parameters = 340M), with parameters initialized by BERT BASE and BERT LARGE respectively. Specifically, the Adam optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-9}$ is employed. The peak learning rate is 5e-5, with warmup over the first 4,000 steps and linear decay scheduling. The noising rate $\rho_{p}$ for pre-training is 0.05. Batches are organized by limiting the maximum number of tokens to 196,608. Pre-training experiments are carried out on the PaddlePaddle platform and Nvidia Tesla V100 GPUs. By virtue of float16 mixed-precision training, it takes about 4 days (400,000 steps) to train ERNIE-GEN BASE and about 7 days (450,000 steps) to train ERNIE-GEN LARGE.
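The learning-rate schedule described above (peak 5e-5, 4,000 warmup steps, linear decay) can be sketched framework-agnostically as below; the total step count is taken from the base-model setting and the exact decay endpoint is an assumption.

```python
def lr_at_step(step, peak_lr=5e-5, warmup_steps=4_000, total_steps=400_000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```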
Figure 4: Schematic of infilling decoding: the particular procedures of infilling decoding, including dropping and inserting (left), and the attention mask matrices at each step (right).
4.2 Fine-tuning on Downstream Tasks
Abstractive Summarization aims at generating fluent and concise summaries without being constrained to extracting sub-sequences from the input articles. We conduct experiments on the Gigaword dataset [Rush et al., 2015] and the CNN/DailyMail dataset [Hermann et al., 2015]. The Gigaword dataset contains 3.8M articles extracted from the Gigaword corpus, while the CNN/DailyMail dataset consists of 93k articles and 220k articles from CNN and Daily Mail respectively.
| Model | Data | Params | RG-1/RG-2/RG-L |
|---|---|---|---|
| *10k training samples: Gigaword 10k* | | | |
| MASS [Song et al., 2019] | 18G | 160M | 25.03/9.48/23.48 |
| UNILM LARGE [Dong et al., 2019] | 16G | 340M | 32.96/14.68/30.56 |
| ERNIE-GEN BASE | 16G | 110M | 33.75/15.23/31.35 |
| ERNIE-GEN LARGE | 16G | 340M | 35.05/16.10/32.50 |
| *Full 3.8M training samples* | | | |
| MASS [Song et al., 2019] | 18G | 160M | 37.66/18.53/34.89 |
| BERT SHARE [Rothe et al., 2019] | 16G | 110M | 38.13/19.81/35.62 |
| UNILM LARGE [Dong et al., 2019] | 16G | 340M | 38.45/19.45/35.75 |
| PEGASUS (c4) [Zhang et al., 2019] | 750G | 568M | 38.75/19.96/36.14 |
| PEGASUS (HugeNews) [Zhang et al., 2019] | 3.8T | 568M | 39.12/19.86/36.24 |
| ERNIE-GEN BASE | 16G | 110M | 38.83/20.04/36.20 |
| ERNIE-GEN LARGE | 16G | 340M | 39.25/20.25/36.53 |
Table 2: Comparison on the Gigaword dataset with state-of-the-art results. Models in the upper block use 10k samples for fine-tuning. We also report the size of pre-training data and the number of parameters utilized by each listed model (columns 2-3). RG is short for ROUGE.
Table 1: Hyperparameters of fine-tuning for ERNIE-GEN BASE and ERNIE-GEN LARGE.
| Task | Epoch (BASE) | Epoch (LARGE) | Learning Rate (BASE) | Learning Rate (LARGE) | Noising Rate $\rho_f$ (BASE) | Noising Rate $\rho_f$ (LARGE) | Dropout Rate (BASE) | Dropout Rate (LARGE) | Batch Size | Label Smooth | Beam Size | Evaluation Metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|