A Survey on Contextual Embeddings
Qi Liu‡, Matt J. Kusner†∗, Phil Blunsom‡⋄, ‡University of Oxford ⋄DeepMind †University College London ∗The Alan Turing Institute ‡{firstname.lastname}@cs.ox.ac.uk †m.kusner@ucl.ac.uk
Abstract
Contextual embeddings, such as ELMo and BERT, move beyond global word representations like Word2Vec and achieve groundbreaking performance on a wide range of natural language processing tasks. Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages. In this survey, we review existing contextual embedding models, cross-lingual polyglot pretraining, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
1 Introduction
Distributional word representations (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) trained in an unsupervised manner on large-scale corpora are widely used in modern natural language processing systems. However, these approaches only obtain a single global representation for each word, ignoring its context. Different from traditional word representations, contextual embeddings move beyond word-level semantics in that each token is associated with a representation that is a function of the entire input sequence. These context-dependent representations can capture many syntactic and semantic properties of words under diverse linguistic contexts. Previous work (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019) has shown that contextual embeddings pretrained on large-scale unlabelled corpora achieve state-of-the-art performance on a wide range of natural language processing tasks, such as text classification, question answering and text summarization. Further analyses (Liu et al., 2019a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; Tenney et al., 2019a) demonstrate that contextual embeddings are capable of learning useful and transferable representations across languages.
The rest of the survey is organized as follows. In Section 2, we define the concept of contextual embeddings. In Section 3, we introduce existing methods for obtaining contextual embeddings. In Section 4, we present the pre-training methods of contextual embeddings on multi-lingual corpora. In Section 5, we describe methods for applying pre-trained contextual embeddings in downstream tasks. In Section 6, we detail model compression methods. In Section 7, we survey analyses that have aimed to identify the linguistic knowledge learned by contextual embeddings. We conclude the survey by highlighting some challenges for future research in Section 8.
2 Token Embeddings
Consider a text corpus that is represented as a sequence $s$ of tokens, $(t_{1},t_{2},...,t_{N})$. Distributed representations of words (Harris, 1954; Bengio et al., 2003) associate each token $t_{i}$ with a dense feature vector $\mathbf{h}_{t_{i}}$. Traditional word embedding techniques aim to learn a global word embedding matrix $\mathbf{E}\in\mathbb{R}^{V\times d}$, where $V$ is the vocabulary size and $d$ is the number of dimensions. Specifically, each row $\mathbf{e}_{i}$ of $\mathbf{E}$ corresponds to the global embedding of word type $i$ in the vocabulary. Well-known models for learning word embeddings include Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). On the other hand, methods that learn contextual embeddings associate each token $t_{i}$ with a representation that is a function of the entire input sequence $s$, i.e. $\mathbf{h}_{t_{i}}=f(\mathbf{e}_{t_{1}},\mathbf{e}_{t_{2}},...,\mathbf{e}_{t_{N}})$, where each input token $t_{j}$ is usually mapped to its non-contextualized representation $\mathbf{e}_{t_{j}}$ first, before applying an aggregation function $f$. These context-dependent representations are better suited to capture sequence-level semantics (e.g. polysemy) than non-contextual word embeddings. There are many model architectures for $f$, which we review here. We begin by describing pre-training methods for learning contextual embeddings that can be used in downstream tasks.
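The contrast between a global embedding matrix $\mathbf{E}$ and a contextual function $f$ can be illustrated with a minimal sketch. The vocabulary, the vectors, and the choice of $f$ (mixing each token with the sequence mean) are hypothetical stand-ins for a trained LSTM or Transformer, not any model from the literature.

```python
# Toy illustration: a global embedding table E versus a contextual function f.
# The vocabulary, vectors, and the averaging choice of f are all hypothetical.
E = {"river": [1.0, 0.0], "bank": [0.0, 1.0], "money": [-1.0, 0.0]}

def static_embed(tokens):
    # Global embeddings: one vector per word type, independent of context.
    return [E[t] for t in tokens]

def contextual_embed(tokens):
    # h_{t_i} = f(e_{t_1}, ..., e_{t_N}): here f mixes each token's own vector
    # with the mean of the whole sequence.
    e = static_embed(tokens)
    d = len(e[0])
    mean = [sum(v[k] for v in e) / len(e) for k in range(d)]
    return [[0.5 * v[k] + 0.5 * mean[k] for k in range(d)] for v in e]

s1, s2 = ["river", "bank"], ["money", "bank"]
print(static_embed(s1)[1] == static_embed(s2)[1])          # True: same vector for "bank"
print(contextual_embed(s1)[1] == contextual_embed(s2)[1])  # False: depends on context
```

The word "bank" receives the same global vector in both sequences, while its contextual vector changes with the neighbouring tokens.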
3 Pre-training Methods for Contextual Embeddings
In large part, pre-training contextual embeddings can be divided into either unsupervised methods (e.g. language modelling and its variants) or supervised methods (e.g. machine translation and natural language inference).
3.1 Unsupervised Pre-training via Language Modeling
The prototypical way to learn distributed token embeddings is via language modelling. A language model is a probability distribution over a sequence of tokens. Given a sequence of $N$ tokens, $(t_ {1},t_ {2},...,t_ {N})$ , a language model factorizes the probability of the sequence as:
$$
p(t_ {1},t_ {2},...,t_ {N})=\prod_ {i=1}^{N}p(t_ {i}|t_ {1},t_ {2},...,t_ {i-1}).
$$
Language modelling uses maximum likelihood estimation (MLE), often penalized with regularization terms, to estimate model parameters. A left-to-right language model takes the left context, $t_{1},t_{2},...,t_{i-1}$, of $t_{i}$ into account for estimating the conditional probability. Language models are usually trained using large-scale unlabelled corpora. The conditional probabilities are most commonly learned using neural networks (Bengio et al., 2003), and the learned representations have been shown to be transferable to downstream natural language understanding tasks (Dai and Le, 2015; Ramachandran et al., 2016).
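As a minimal sketch of the chain-rule factorization, the snippet below sums conditional log-probabilities over a sequence. The probability table is hypothetical, and conditioning only on the previous token is a simplifying assumption; a neural language model conditions on the full left context.

```python
import math

# Hypothetical conditional probabilities p(t_i | t_{i-1}); a bigram table is a
# simplifying assumption, since a neural LM conditions on the full left context.
cond = {
    ("[START]", "I"): 0.5,
    ("I", "am"): 0.4,
    ("am", "happy"): 0.1,
}

def sequence_log_prob(tokens):
    # log p(t_1..t_N) = sum_i log p(t_i | t_1..t_{i-1})  (chain rule)
    prev, total = "[START]", 0.0
    for t in tokens:
        total += math.log(cond[(prev, t)])
        prev = t
    return total

log_p = sequence_log_prob(["I", "am", "happy"])
print(math.exp(log_p))  # product of the three conditionals, about 0.02
```

Maximum likelihood estimation would adjust the model parameters to maximize this log-probability summed over the training corpus.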
Precursor Models. Dai and Le (2015) is the first work we are aware of that uses language modelling together with a sequence autoencoder to improve sequence learning with recurrent networks. Thus, it can be thought of as a precursor to modern contextual embedding methods. Pre-trained on the datasets IMDB, Rotten Tomatoes, 20 Newsgroups, and DBpedia, the model is then fine-tuned on sentiment analysis and text classification tasks, achieving strong performance compared to randomly-initialized models.
Ramachandran et al. (2016) extends Dai and Le (2015) by proposing a pre-training method to improve the accuracy of sequence-to-sequence (seq2seq) models. The encoder and decoder of the seq2seq model are initialized with the pre-trained weights of two language models. For machine translation, these language models are trained separately on the News Crawl English and German corpora; for abstractive summarization, both are initialized with a language model trained on the English Gigaword corpus. The pre-trained models are fine-tuned on the WMT English $\rightarrow$ German task and the CNN/Daily Mail corpus, respectively, achieving better results than baselines without pre-training.
The work in the following sections improves over Dai and Le (2015) and Ramachandran et al. (2016) with new architectures (e.g. Transformer), larger datasets, and new pre-training objectives. A summary of the models and the pre-training objectives is shown in Tables 1 and 2.
ELMo. The ELMo model (Peters et al., 2018) generalizes traditional word embeddings by extracting context-dependent representations from a bidirectional language model. A forward $L$-layer LSTM and a backward $L$-layer LSTM are applied to encode the left and right contexts, respectively. At each layer $j$, the contextualized representations are the concatenation of the left-to-right and right-to-left representations, obtaining $N$ hidden representations, $(\mathbf{h}_{1,j},\mathbf{h}_{2,j},...,\mathbf{h}_{N,j})$, for a sequence of length $N$.
To use ELMo in downstream tasks, the $(L+1)$-layer representations (including the global word embedding) for each token $k$ are aggregated as:
$$
\mathrm{ELMo}_{k}^{task}=\gamma^{task}\sum_{j=0}^{L}s_{j}^{task}\mathbf{h}_{k,j},
$$
where $s_{j}^{task}$ are layer-wise weights, normalized by a softmax, that linearly combine the $(L+1)$-layer representations of the token $k$, and $\gamma^{task}$ is a task-specific constant.
Given a pre-trained ELMo, it is straightforward to incorporate it into a task-specific architecture to improve performance. As most supervised models use global word representations $\mathbf{x}_{k}$ in their lowest layers, these representations can be concatenated with their corresponding context-dependent representations $\mathrm{ELMo}_{k}^{task}$, obtaining $[\mathbf{x}_{k};\mathrm{ELMo}_{k}^{task}]$, before feeding them to higher layers.
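The layer aggregation and the concatenation with $\mathbf{x}_{k}$ can be sketched in a few lines of plain Python. All numeric values below are hypothetical, and the function names are chosen only for this illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def elmo_combine(layers, raw_weights, gamma):
    # layers[j] is h_{k,j} for one token k, j = 0..L (j = 0 is the global embedding).
    s = softmax(raw_weights)  # s_j^{task}, non-negative and summing to 1
    d = len(layers[0])
    return [gamma * sum(s[j] * layers[j][i] for j in range(len(layers)))
            for i in range(d)]

# Hypothetical values: L = 2 LSTM layers plus the embedding layer, d = 2.
h_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
elmo_k = elmo_combine(h_k, raw_weights=[0.0, 0.0, 0.0], gamma=1.5)
x_k = [0.3, -0.2]            # global word representation used by the task model
task_input = x_k + elmo_k    # [x_k; ELMo_k^{task}] via list concatenation
print(task_input)
```

With equal raw weights the softmax assigns each layer a weight of one third, so the result is a scaled average of the layers.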
Table 1: A comparison of popular pre-trained models.
| Method | Architecture | Encoder | Decoder | Objective | Dataset |
|---|---|---|---|---|---|
| ELMo | LSTM | | | LM | 1B Word Benchmark |
| GPT | Transformer | × | √ | LM | BookCorpus |
| GPT2 | Transformer | × | √ | LM | Web pages starting from Reddit |
| BERT | Transformer | √ | × | MLM & NSP | BookCorpus & Wiki |
| RoBERTa | Transformer | √ | × | MLM | BookCorpus, Wiki, CC-News, OpenWebText, Stories |
| ALBERT | Transformer | √ | × | MLM & SOP | Same as RoBERTa and XLNet |
| UniLM | Transformer | | | LM, MLM, seq2seq LM | Same as BERT |
| ELECTRA | Transformer | √ | × | Discriminator (o/r) | Same as XLNet |
| XLNet | Transformer | × | √ | PLM | BookCorpus, Wiki, Giga5, ClueWeb, Common Crawl |
| XLM | Transformer | | | CLM, MLM, TLM | Wiki, parallel corpora (e.g. MultiUN) |
| MASS | Transformer | √ | √ | Span Mask | WMT News Crawl |
| T5 | Transformer | √ | √ | Text Infilling | Colossal Clean Crawled Corpus |
| BART | Transformer | √ | √ | Text Infilling & Sent Shuffling | Same as RoBERTa |
Table 2: Pre-training objectives and their input-output formats.
| Objective | Inputs | Targets |
|---|---|---|
| LM | [START] | I am happy to join with you today |
| MLM | I am [MASK] to join with you [MASK] | happy today |
| NSP | Sent1 [SEP] Next Sent or Sent1 [SEP] Random Sent | Next Sent / Random Sent |
| SOP | Sent1 [SEP] Sent2 or Sent2 [SEP] Sent1 | in order / reversed |
| Discriminator (o/r) | I am thrilled to study with you today | o o r o r o o o |
| PLM | happy join with | today am I to you |
| seq2seq LM | I am happy to | join with you today |
| Span Mask | I am [MASK] [MASK] [MASK] with you today | happy to join |
| Text Infilling | I am [MASK] with you today | happy to join |
| Sent Shuffling | today you am I join with happy to | I am happy to join with you today |
| TLM | How [MASK] you [SEP] [MASK] vas-tu | are Comment |
The effectiveness of ELMo is evaluated on six NLP problems, including question answering, textual entailment and sentiment analysis.
GPT, GPT2, and Grover. GPT (Radford et al., 2018) adopts a two-stage learning paradigm: (a) unsupervised pre-training using a language modelling objective and (b) supervised fine-tuning. The goal is to learn universal representations transferable to a wide range of downstream tasks. To this end, GPT uses the BookCorpus dataset (Zhu et al., 2015), which contains more than 7,000 books from various genres, for training the language model. The Transformer architecture (Vaswani et al., 2017) is used to implement the language model, which has been shown to better capture global dependencies from the inputs compared to its alternatives, e.g. recurrent networks, and perform strongly on a range of sequence learning tasks, such as machine translation (Vaswani et al., 2017) and document generation (Liu et al., 2018). To use GPT on inputs with multiple sequences during fine-tuning, GPT applies task-specific input adaptations motivated by traversal-style approaches (Rocktäschel et al., 2015). These approaches pre-process each text input as a single contiguous sequence of tokens through special tokens including [START] (the start of a sequence), [DELIM] (delimiting two sequences from the text input) and [EXTRACT] (the end of a sequence). GPT outperforms task-specific architectures in 9 out of 12 tasks studied with a pretrained Transformer.
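The input adaptation with [START], [DELIM] and [EXTRACT] can be sketched as follows. The helper name `gpt_input` and the whitespace tokenization are assumptions made for illustration; the actual model operates on subword tokens.

```python
def gpt_input(*segments):
    # Join text segments into one contiguous token sequence:
    # [START] seg_1 [DELIM] seg_2 ... [EXTRACT]
    tokens = ["[START]"]
    for i, seg in enumerate(segments):
        if i > 0:
            tokens.append("[DELIM]")
        tokens.extend(seg.split())  # whitespace tokenization is an assumption
    tokens.append("[EXTRACT]")
    return tokens

# Entailment-style input with a premise and a hypothesis (toy sentences).
print(gpt_input("a man is sleeping", "a person rests"))
```

A single-sequence task such as sentiment classification would pass one segment, yielding only the start and end markers around the text.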
GPT2 (Radford et al., 2019) mainly follows the architecture of GPT and trains a language model on a dataset as large and diverse as possible to learn from varied domains and contexts. To do so, Radford et al. (2019) create a new dataset of millions of web pages named WebText, by scraping outbound links from Reddit. The authors argue that a language model trained on large-scale unlabelled corpora begins to learn some common supervised NLP tasks, such as question answering, machine translation and summarization, without any explicit supervision signal. To validate this, GPT2 is tested on ten datasets (e.g. Children’s Book Test (Hill et al., 2015), LAMBADA (Paperno et al., 2016) and CoQA (Reddy et al., 2019)) in a zero-shot setting. GPT2 performs strongly on some tasks. For instance, when conditioned on a document and questions, GPT2 reaches an F1-score of 55 on the CoQA dataset without using any labelled training data. This matches or outperforms the performance of 3 out of 4 baseline systems. As GPT2 divides texts into bytes and uses BPE (Sennrich et al., 2016) to build up its vocabulary (instead of using characters or words, as in previous work), it is unclear if the improved performance comes from the model or the new input representation.
Grover (Zellers et al., 2019) creates a news dataset, RealNews, from Common Crawl and pre-trains a language model for generating realistic-looking fake news that is conditioned on metadata including domains, dates, authors and headlines. They further study discriminators that can be used to detect fake news. The best defense against Grover turns out to be Grover itself, which sheds light on the importance of releasing trained models for detecting fake news.
BERT. ELMo (Peters et al., 2018) concatenates representations from the forward and backward LSTMs without considering the interactions between the left and right contexts. GPT (Radford et al., 2018) and GPT2 (Radford et al., 2019) use a left-to-right decoder, where every token can only attend to its left context. These architectures are sub-optimal for sentence-level tasks, e.g. named entity recognition and sentiment analysis, as it is crucial to incorporate contexts from both directions.
BERT proposes a masked language modelling (MLM) objective, where some of the tokens of an input sequence are randomly masked, and the objective is to predict these masked positions taking the corrupted sequence as input. BERT applies a Transformer encoder to attend to bi-directional contexts during pre-training. In addition, BERT uses a next-sentence-prediction (NSP) objective. Given two input sentences, NSP predicts whether the second sentence is the actual next sentence of the first sentence. The NSP objective aims to improve tasks such as question answering and natural language inference, which require reasoning over sentence pairs.
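A minimal sketch of how MLM inputs and targets might be constructed is given below. The masking probability, the seed, and the choice to always substitute [MASK] are simplifying assumptions for illustration, not a description of the released implementation.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Returns (corrupted input, targets); targets[i] is the original token at a
    # masked position and None elsewhere. mask_prob and seed are assumptions.
    rng = random.Random(seed)
    corrupted, targets = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets.append(t)
        else:
            corrupted.append(t)
            targets.append(None)
    return corrupted, targets

tokens = "I am happy to join with you today".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
print(corrupted)
print(targets)
```

The loss is computed only at positions whose target is not `None`, while the encoder sees the whole corrupted sequence in both directions.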
Similar to GPT, BERT uses special tokens to obtain a single contiguous sequence for each input sequence. Specifically, the first token is always a special classification token [CLS], and sentence pairs are separated using a special token [SEP]. BERT adopts a pre-training followed by fine-tuning scheme. The final hidden state of [CLS] is used for sentence-level tasks and the final hidden state of each token is used for token-level tasks. BERT obtains new state-of-the-art results on eleven natural language processing tasks, e.g. improving the GLUE (Wang et al., 2018) score to $80.5\%$.
Similar to GPT2, it is unclear exactly why BERT improves over prior work as it uses different objectives, datasets (Wikipedia and BookCorpus) and architectures compared to previous methods. For partial insight on this, we refer the readers to (Raffel et al., 2019) for a controlled comparison between unidirectional and bidirectional models, traditional language modelling and masked language modelling using the same datasets.
BERT variants. Recent work further studies and improves the objective and architecture of BERT.
Instead of randomly masking tokens, ERNIE (Sun et al., 2019b) incorporates knowledge masking strategies, including entity-level masking and phrase-level masking. ERNIE 2.0 (Sun et al., 2019c) further incorporates more pre-training tasks, such as semantic closeness and discourse relations. SpanBERT (Joshi et al., 2019) generalizes ERNIE to mask random spans, without referring to external knowledge. StructBERT (Wang et al., 2019b) proposes a word structural objective that randomly permutes the order of 3-grams for reconstruction and a sentence structural objective that predicts the order of two consecutive segments.
RoBERTa (Liu et al., 2019c) makes a few changes to the released BERT model and achieves substantial improvements. The changes include: (1) Training the model longer with larger batches and more data; (2) Removing the NSP objective; (3) Training on longer sequences; (4) Dynamically changing the masked positions during pretraining.
ALBERT (Lan et al., 2019) proposes two parameter-reduction techniques (factorized embedding parameterization and cross-layer parameter sharing) to lower memory consumption and speed up training. Furthermore, ALBERT argues that the NSP objective lacks difficulty: because the negative examples are created by pairing segments from different documents, NSP mixes topic prediction and coherence prediction into a single task.
ALBERT instead uses a sentence-order prediction (SOP) objective. SOP obtains positive examples by taking out two consecutive segments and negative examples by reversing the order of two consecutive segments from the same document.
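The difference between SOP and NSP example construction can be sketched as below. The 50/50 sampling of positives and negatives, the function names, and the toy documents are assumptions for illustration.

```python
import random

def sop_example(doc, i, rng):
    # doc is a list of segments; (doc[i], doc[i+1]) are consecutive.
    a, b = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return (a, b), "in order"
    return (b, a), "reversed"     # negative: same segments, swapped order

def nsp_example(doc, other_doc, i, rng):
    # NSP negatives pair a segment with one drawn from a different document.
    if rng.random() < 0.5:
        return (doc[i], doc[i + 1]), "NextSent"
    return (doc[i], rng.choice(other_doc)), "RandomSent"

rng = random.Random(0)
doc = ["s1", "s2", "s3"]
other = ["u1", "u2"]
print(sop_example(doc, 0, rng))
print(nsp_example(doc, other, 1, rng))
```

Because both SOP segments come from the same document, topic cues alone cannot separate positives from negatives, which is the motivation stated above.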
XLNet. The XLNet model (Yang et al., 2019) identifies two weaknesses of BERT:
- BERT assumes conditional independence of corrupted tokens. For instance, to model the probability $p(t_{2}=\mathrm{cat},t_{6}=\mathrm{mat}\mid t_{1}=\mathrm{The},t_{2}=[\mathrm{MASK}],t_{3}=\mathrm{sat},t_{4}=\mathrm{on},t_{5}=\mathrm{the},t_{6}=[\mathrm{MASK}])$, BERT factorizes it as $p(t_{2}=\mathrm{cat}\mid\ldots)\,p(t_{6}=\mathrm{mat}\mid\ldots)$, where $t_{2}$ and $t_{6}$ are assumed to be conditionally independent.
- The symbols such as [MASK] are introduced by BERT during pre-training, yet they never occur in real data, resulting in a discrepancy between pre-training and fine-tuning.
XLNet proposes a new auto-regressive method based on permutation language modelling (PLM) (Uria et al., 2016) without introducing any new symbols. The MLE objective for it is calculated as:
$$
\operatorname*{max}_ {\theta}\mathbb{E}_ {\mathbf{z}\in Z_ {N}}\left[\sum_ {j=1}^{N}\log p_ {\theta}(t_ {z_ {j}}|t_ {z_ {1}},t_ {z_ {2}},...,t_ {z_ {j-1}})\right].
$$
For each sequence, XLNet samples a permutation order $\mathbf{z}=[z_{1},z_{2},...,z_{N}]$ from the set of all permutations $Z_{N}$, where $|Z_{N}|=N!$. The probability of the sequence is factorized according to $\mathbf{z}$, where the $z_{j}$-th token $t_{z_{j}}$ is conditioned on all the previous tokens $t_{z_{1}},t_{z_{2}},...,t_{z_{j-1}}$ according to the permutation order $\mathbf{z}$.
XLNet further adopts two-stream self-attention and Transformer-XL (Dai et al., 2019) to take into account the target positions $z_{j}$ and learn long-range dependencies, respectively.
As the cardinality of $Z_ {N}$ is factorial, naive optimization would be challenging. Thus, XLNet conditions on part of the input and generates the rest of the input to reduce the scale of the search space:
$$
\operatorname*{max}_ {\theta}\mathbb{E}_ {\mathbf{z}\in Z_ {N}}\left[\sum_ {j=c+1}^{N}\log p_ {\theta}(t_ {z_ {j}}|t_ {z_ {1}},t_ {z_ {2}},...,t_ {z_ {j-1}})\right],
$$
where $c$ is the cutting point of the sequence. However, it is tricky to compare XLNet directly with BERT due to the multiple changes in loss and architecture.$^{1}$
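The permutation factorization and the role of the cutting point $c$ can be sketched as follows. The function name `plm_terms` and the fixed seed are assumptions; the snippet only enumerates which (target, context) pairs enter the objective and does not model two-stream attention.

```python
import random

def plm_terms(tokens, c=0, seed=0):
    # Sample a factorization order z and list the (target, context) pairs
    # whose log-probabilities enter the objective, for j = c+1..N.
    rng = random.Random(seed)
    z = list(range(len(tokens)))
    rng.shuffle(z)
    terms = []
    for j in range(c, len(z)):                # 0-based index for 1-based j = c+1..N
        context = [tokens[i] for i in z[:j]]  # t_{z_1}, ..., t_{z_{j-1}}
        terms.append((tokens[z[j]], context))
    return z, terms

tokens = "I am happy to join with you today".split()
z, terms = plm_terms(tokens, c=5)
for target, context in terms:
    print(f"predict {target!r} given {context}")
```

With $c=0$ every position is predicted; a larger $c$ keeps the first $c$ tokens of the sampled order as pure context and predicts only the rest.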
UniLM. UniLM (Dong et al., 2019) adopts three objectives: (a) language modelling, (b) masked language modelling, and (c) sequence-to-sequence language modelling (seq2seq LM), for pre-training a Transformer network. To implement three objectives in a single network, UniLM utilizes specific self-attention masks to control what context the prediction conditions on. For example, MLM can attend to its bidirectional contexts, while seq2seq LM can attend to bidirectional contexts for source sequences and left contexts only for target sequences.
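The three attention patterns can be sketched as binary masks. The function name, the mode strings, and the convention that a position may attend to itself are assumptions made for this illustration.

```python
def attention_mask(n_src, n_tgt, mode):
    # mask[i][j] == 1 means position i may attend to position j.
    n = n_src + n_tgt
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "bidirectional":      # MLM: full context
                allowed = True
            elif mode == "left-to-right":    # LM: left context only
                allowed = j <= i
            elif mode == "seq2seq":
                # Source attends to all source tokens; target attends to the
                # whole source plus target tokens up to and including itself.
                allowed = j < n_src if i < n_src else j <= i
            else:
                raise ValueError(mode)
            mask[i][j] = int(allowed)
    return mask

for row in attention_mask(n_src=2, n_tgt=2, mode="seq2seq"):
    print(row)
```

Switching the `mode` argument changes only the mask, which is how a single set of Transformer weights can serve all three objectives.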
ELECTRA. Compared to BERT, ELECTRA (Clark et al., 2019) proposes a more effective pre-training method. Instead of corrupting some positions of inputs with [MASK], ELECTRA replaces some tokens of the inputs with their plausible alternatives sampled from a small generator network. ELECTRA trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can then be used in downstream tasks for fine-tuning, improving upon the pre-trained representation learned by the generator.
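The construction of discriminator inputs and original/replaced (o/r) labels can be sketched as below. The `toy_generator` lookup table stands in for the small generator network, and `replace_prob` is a hypothetical parameter; this sketch labels a sampled token that equals the original as "o".

```python
import random

def electra_example(tokens, generator, replace_prob=0.15, seed=0):
    # generator(position, tokens) stands in for the small generator network;
    # replace_prob and the generator below are hypothetical.
    rng = random.Random(seed)
    corrupted, labels = [], []
    for i, t in enumerate(tokens):
        new = generator(i, tokens) if rng.random() < replace_prob else t
        corrupted.append(new)
        # In this sketch a sampled token identical to the original counts as "o".
        labels.append("o" if new == t else "r")
    return corrupted, labels

def toy_generator(i, tokens):
    return {"happy": "thrilled", "join": "study"}.get(tokens[i], tokens[i])

tokens = "I am happy to join with you today".split()
corrupted, labels = electra_example(tokens, toy_generator, replace_prob=1.0)
print(" ".join(corrupted))  # I am thrilled to study with you today
print(" ".join(labels))     # o o r o r o o o
```

Unlike MLM, every position carries a training signal, since the discriminator emits a label for each token rather than only for the masked ones.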
MASS. Although BERT achieves state-of-the-art performance for many natural language understanding tasks, BERT cannot be easily used for natural language generation. MASS (Song et al., 2019) uses masked sequences to pre-train sequence-to-sequence models. More specifically, MASS adopts an encoder-decoder framework and extends the MLM objective. The encoder takes as input a sequence where consecutive tokens are masked and the decoder predicts these masked consecutive tokens autoregressively. MASS achieves significant improvements over baselines without pre-training or with other pre-training methods on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation.
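The span mask input format, together with the text infilling variant listed in Table 2, can be sketched as follows. The function names and the manually chosen span are assumptions; in practice the span position is sampled.

```python
def span_mask(tokens, start, length):
    # Span Mask: each token in the span becomes its own [MASK];
    # the decoder target is the masked span.
    span = tokens[start:start + length]
    enc_input = tokens[:start] + ["[MASK]"] * length + tokens[start + length:]
    return enc_input, span

def text_infilling(tokens, start, length):
    # Text Infilling: the whole span collapses to a single [MASK].
    span = tokens[start:start + length]
    return tokens[:start] + ["[MASK]"] + tokens[start + length:], span

tokens = "I am happy to join with you today".split()
enc_input, target = span_mask(tokens, start=2, length=3)
print(" ".join(enc_input))  # I am [MASK] [MASK] [MASK] with you today
print(" ".join(target))     # happy to join
infill_input, _ = text_infilling(tokens, start=2, length=3)
print(" ".join(infill_input))  # I am [MASK] with you today
```

With a single mask per span, the model must also infer how many tokens are missing, whereas the span mask format reveals the span length.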
T5. Raffel et al. (2019) propose T5 (Text-to-Text Transfer Transformer), unifying natural language understanding and generation by converting the data into a text-to-text format and applying an encoder-decoder framework.
T5 introduces a new pre-training dataset, Colossal Clean Crawled Corpus, by cleaning the web pages from Common Crawl. T5 also systematically compares previous methods in terms of pre-training objectives, architectures, pre-training datasets, and transfer approaches. T5 adopts a text infilling objective (where spans of text are replaced with a single mask token), longer training, multi-task pre-training on GLUE or SuperGLUE, fine-tuning on each individual GLUE and SuperGLUE task, and beam search. ERNIE-GEN (Xiao et al., 2020) is another work using text infilling, where tokens of each masked span are generated non-autoregressively.
For fine-tuning, to convert the input data into a text-to-text format, T5 uses the token vocabulary of the decoder as the prediction labels. For example, the tokens “entailment”, “contradiction”, and “neutral” are used as the labels for natural language inference tasks. For regression tasks (e.g. STS-B (Cer et al., 2017)), T5 simply rounds the scores to the nearest multiple of 0.2 and converts the results to literal string representations (e.g. 2.57 is converted to the string “2.6”). T5 also adds a task-specific prefix to each input sequence to specify its task. For instance, T5 adds the prefix “translate English to German” to each input sequence like “That is good.” for English-to-German translation datasets.
在微调阶段,为将输入数据转换为文本到文本框架,T5采用解码器的token词汇表作为预测标签。例如,在自然语言推理任务中使用"entailment"、"contradiction"和"neutral"等token作为标签。针对回归任务(如STS-B (Cer等人,2017)),T5简单地将分数四舍五入到最接近0.2的倍数,并将结果转换为字面字符串表示(例如2.57转换为字符串"2.6")。T5还会为每个输入序列添加任务特定的前缀,例如在英德翻译数据集中,会为"That is good."这样的输入序列添加"translate English to German"前缀。
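The text-to-text conversion described above can be illustrated with a short sketch. The helper names `format_sts_b_target` and `to_text_to_text` are hypothetical, and joining the prefix and the input with a colon is an assumption of this example rather than something stated in the text.

```python
def format_sts_b_target(score, step=0.2):
    """Round a similarity score to the nearest multiple of `step` and render it as a string."""
    return f"{round(score / step) * step:.1f}"

def to_text_to_text(prefix, text, label):
    """Cast an (input, label) pair as a pair of plain strings."""
    # Assumption: prefix and input are separated by ": ".
    return f"{prefix}: {text}", str(label)

translation_pair = to_text_to_text(
    "translate English to German", "That is good.", "Das ist gut."
)
regression_target = format_sts_b_target(2.57)
```

Because every task is reduced to string-to-string prediction, the same model, loss, and decoding procedure can be reused for classification, regression, and generation.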
3.2 Supervised Objectives
3.2 监督目标
Pre-training on the ImageNet dataset (which has supervision about the objects in images) before fine-tuning on downstream tasks has become the de facto standard in the computer vision community. Motivated by the success of supervised pre-training in computer vision, some work (Conneau et al., 2017; McCann et al., 2017; Subramanian et al., 2018) utilizes data-rich tasks in NLP to learn transferable representations.
在下游任务上进行微调之前,先在ImageNet数据集(包含图像中物体的监督信息)上进行预训练已成为计算机视觉领域的实际标准。受计算机视觉中监督预训练成功的启发,一些研究工作(Conneau等人,2017;McCann等人,2017;Subramania等人,2018)利用NLP中数据丰富的任务来学习可迁移的表征。
CoVe (McCann et al., 2017) shows that the representations learned from machine translation are transferable to downstream tasks. CoVe uses a deep LSTM encoder from a sequence-to-sequence model trained for machine translation to obtain contextual embeddings. Empirical results show that augmenting non-contextualized word representations (Mikolov et al., 2013; Pennington et al., 2014) with CoVe embeddings improves performance over a wide variety of common NLP tasks, such as sentiment analysis, question classification, entailment, and question answering. InferSent (Conneau et al., 2017) obtains contextualized representations from a natural language inference model pre-trained on SNLI. Subramanian et al. (2018) use multi-task learning to pre-train a sequence-to-sequence model for obtaining general representations, where the tasks include skip-thought (Kiros et al., 2015), machine translation, constituency parsing, and natural language inference.
CoVe (McCann等人,2017) 研究表明,从机器翻译中学习到的表征可迁移至下游任务。CoVe采用为机器翻译训练的序列到序列模型中的深度LSTM编码器来获取上下文嵌入。实证结果表明,将非上下文词向量 (Mikolov等人,2013; Pennington等人,2014) 与CoVe嵌入相结合,能在情感分析、问题分类、蕴涵和问答等多种常见NLP任务中提升性能。InferSent (Conneau等人,2017) 通过SNLI数据集上预训练的自然语言推理模型获取上下文表征。Subramania等人 (2018) 采用多任务学习预训练序列到序列模型以获得通用表征,其任务包括跳跃思维 (Kiros等人,2015)、机器翻译、选区解析和自然语言推理。
BART. The BART model (Lewis et al., 2019) introduces additional noising functions beyond MLM for pre-training sequence-to-sequence models. First, the input sequence is corrupted using an arbitrary noising function. Then, the corrupted input is reconstructed by a Transformer network trained using teacher forcing (Williams and Zipser, 1989). BART evaluates a wide variety of noising functions, including token masking, token deletion, text infilling, document rotation, and sentence shuffling (randomly permuting the order of the sentences in a document). The best performance is achieved by using both sentence shuffling and text infilling. BART matches the performance of RoBERTa on GLUE and SQuAD and achieves state-of-the-art performance on a variety of text generation tasks.
BART。BART模型 (Lewis等人, 2019) 在MLM基础上引入了额外的噪声函数来预训练序列到序列模型。首先,输入序列通过任意噪声函数进行破坏;随后,被破坏的输入由一个采用教师强制 (Williams和Zipser, 1989) 训练的Transformer网络进行重建。BART评估了多种噪声函数,包括token掩码、token删除、文本填充、文档旋转和句子乱序 (随机打乱句子中的单词顺序) 。最佳性能通过同时使用句子乱序和文本填充实现。BART在GLUE和SQuAD上达到与RoBERTa相当的性能,并在多种文本生成任务中取得最先进水平。
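Several of the noising functions listed above are easy to state on a token list. The sketch below is illustrative only: the function names and the `[MASK]` symbol are assumptions, and span positions are passed in explicitly rather than sampled as they would be during pre-training.

```python
MASK = "[MASK]"  # hypothetical mask symbol

def token_masking(tokens, positions):
    """Replace the tokens at the given positions with the mask symbol."""
    chosen = set(positions)
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]

def token_deletion(tokens, positions):
    """Delete the tokens at the given positions without leaving any marker."""
    chosen = set(positions)
    return [tok for i, tok in enumerate(tokens) if i not in chosen]

def text_infilling(tokens, start, length):
    """Replace a whole span of `length` tokens with a single mask symbol."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def document_rotation(tokens, pivot):
    """Rotate the document so that it starts at the token with index `pivot`."""
    return tokens[pivot:] + tokens[:pivot]

doc = "a b c d e".split()
```

Unlike token masking, text infilling hides the length of the missing span, so the model must also predict how many tokens to generate.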
4 Cross-lingual Polyglot Pre-training for Contextual Embeddings
4 面向上下文嵌入的跨语言多语种预训练
Cross-lingual polyglot pre-training aims to learn joint multi-lingual representations, enabling knowledge transfer from data-rich languages like English to data-scarce languages like Romanian. Based on whether joint training and a shared vocabulary are used, we divide previous work into three categories.
跨语言多语种预训练旨在学习联合的多语言表征,实现从英语等数据丰富的语言向罗马尼亚语等数据稀缺语言的知识迁移。根据是否采用联合训练和共享词汇表,我们将先前工作划分为三类。
Joint training & shared vocabulary. Artetxe and Schwenk (2019) use a BiLSTM encoder-decoder framework with a shared BPE vocabulary for 93 languages. The framework is pre-trained using parallel corpora, such as Europarl and Tanzil. The contextual embeddings from the encoder are used to train classifiers on English corpora for downstream tasks. As the embedding space and the encoder are shared, the resulting classifiers can be transferred to any of the 93 languages without further modification. Experiments show that these classifiers achieve competitive performance on cross-lingual natural language inference, cross-lingual document classification, and parallel corpus mining.
联合训练与共享词汇表。Artetxe和Schwenk (2019) 采用带有共享BPE词汇表的BiLSTM编码器-解码器框架,涵盖93种语言。该框架使用包括Europarl和Tanzil在内的平行语料库进行预训练。编码器生成的上下文嵌入被用于训练分类器,其中英语语料库用于下游任务。由于嵌入空间和编码器是共享的,所得分类器无需额外修改即可迁移至93种语言中的任意一种。实验表明,这些分类器在跨语言自然语言推理、跨语言文档分类和平行语料挖掘任务中均取得具有竞争力的性能。
Rosita (Mulcaire et al., 2019) pre-trains a language model using text from different languages, showing the benefits of polyglot learning on low-resource languages.
Rosita (Mulcaire等人,2019) 通过使用不同语言的文本预训练语言模型,展示了多语言学习在低资源语言上的优势。
Recently, the authors of BERT developed a multi-lingual BERT, which is pre-trained on Wikipedia dumps covering more than 100 languages.
最近,BERT的作者们开发了一个多语言BERT,该模型使用包含100多种语言的维基百科数据转储进行预训练。
XLM (Lample and Conneau, 2019) uses three pre-training methods for learning cross-lingual language models: (1) Causal language modelling, where the model is trained to predict $p(t_i \mid t_1, t_2, \ldots, t_{i-1})$, (2) Masked language modelling, and (3) Translation language modelling (TLM). For TLM, parallel corpora are used, and tokens in both the source and target sequences are masked so that the model learns cross-lingual associations. XLM performs strongly on cross-lingual classification, unsupervised machine translation, and supervised machine translation. XLM-R (Conneau et al., 2019) scales up XLM by training a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered Common Crawl data. XLM-R shows that large-scale multi-lingual pre-training leads to significant performance gains for a wide range of cross-lingual transfer tasks.
XLM (Lample and Conneau, 2019) 采用三种预训练方法学习跨语言语言模型:(1) 因果语言建模 (causal language modelling),训练模型预测 $p(t_ {i}|t_ {1},t_ {2},...,t_ {i-1})$;(2) 掩码语言建模 (masked language modelling);(3) 翻译语言建模 (translation language modelling, TLM)。该方法使用平行语料库,并对源序列和目标序列中的token进行掩码以学习跨语言关联。XLM在跨语言分类、无监督机器翻译和有监督机器翻译任务中表现优异。XLM-R (Conneau et al., 2019) 通过基于Transformer的掩码语言模型在100种语言上进行训练(使用超过2TB经过滤的Common Crawl数据),进一步扩展了XLM的规模。XLM-R证明,大规模多语言预训练能为广泛的跨语言迁移任务带来显著性能提升。
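To make the TLM input concrete, the following sketch concatenates a translation pair and masks positions on both sides. It is a simplified illustration: the function name `tlm_example`, the `[MASK]` and `[SEP]` symbols, and the explicit choice of masked positions are assumptions, and details such as language or position embeddings are omitted.

```python
MASK = "[MASK]"  # hypothetical mask symbol
SEP = "[SEP]"    # hypothetical separator between the two sentences

def tlm_example(source, target, masked_source, masked_target):
    """Concatenate a translation pair and mask the chosen positions on both sides."""
    src = [MASK if i in masked_source else tok for i, tok in enumerate(source)]
    tgt = [MASK if i in masked_target else tok for i, tok in enumerate(target)]
    tokens = src + [SEP] + tgt
    # Labels hold the original token at masked positions and None elsewhere.
    labels = (
        [tok if i in masked_source else None for i, tok in enumerate(source)]
        + [None]
        + [tok if i in masked_target else None for i, tok in enumerate(target)]
    )
    return tokens, labels

tokens, labels = tlm_example(
    "the cat sleeps".split(), "le chat dort".split(),
    masked_source={1}, masked_target={2},
)
```

To recover the masked English word `cat`, the model can attend to the unmasked French word `chat`, which is what encourages aligned representations across languages.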
Joint training & separate vocabularies. Wu et al. (2019) study the emergence of cross-lingual structures in pre-trained multi-lingual language models. It is found that cross-lingual transfer is possible even when there is no shared vocabulary across the monolingual corpora, and there are universal latent symmetries in the embedding spaces of different languages.
联合训练与独立词表。Wu等人(2019)研究了预训练多语言模型中跨语言结构的涌现现象。研究发现即使单语语料库之间没有共享词表,跨语言迁移仍然可能实现,并且不同语言的嵌入空间存在普遍的潜在对称性。
Separate training & separate vocabularies. Artetxe et al. (2019) use a four-step method for obtaining multi-lingual embeddings. Suppose we have monolingual corpora for two languages $L_1$ and $L_2$: (1) Pre-train BERT with the vocabulary of $L_1$ on $L_1$’s monolingual data. (2) Replace the vocabulary of $L_1$ with the vocabulary of $L_2$ and train new vocabulary embeddings on $L_2$’s monolingual data, while freezing the other parameters. (3) Fine-tune the BERT model for a downstream task using labeled data in $L_1$, while freezing $L_1$’s vocabulary embeddings. (4) Replace the vocabulary embeddings of the fine-tuned BERT with $L_2$’s vocabulary embeddings for zero-shot transfer.
独立训练 & 独立词汇表。Artetxe等人(2019)采用四步法获取多语言嵌入表示。假设我们有两种语言 $L_ {1}$ 和 $L_ {2}$ 的单语序列:(1) 使用 $L_ {1}$ 的单语数据,以其词汇表预训练BERT。(2) 将 $L_ {1}$ 的词汇表替换为 $L_ {2}$ 的词汇表,在冻结其他参数的情况下,利用 $L_ {2}$ 的单语数据训练新词汇嵌入。(3) 使用 $L_ {1}$ 的标注数据在下游任务上微调BERT模型,同时冻结 $L_ {1}$ 的词汇嵌入。(4) 将微调后的BERT模型替换为 $L_ {2}$ 的词汇嵌入,用于零样本迁移任务。
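The four steps differ mainly in which vocabulary embeddings are plugged in and which parameter groups are updated. The table below summarizes this as data; the names `SCHEDULE` and `trainable_groups` are hypothetical, and any task-specific output layer used in step 3 is left out for brevity.

```python
# Each entry: (step, data used, active vocabulary embeddings,
#              embeddings trainable?, Transformer body trainable?)
SCHEDULE = [
    (1, "L1 monolingual text", "L1", True, True),
    (2, "L2 monolingual text", "L2", True, False),
    (3, "L1 labeled task data", "L1", False, True),
    (4, "L2 test data (zero-shot)", "L2", False, False),
]

def trainable_groups(step):
    """Return the names of the parameter groups that are updated in a given step."""
    for number, _data, _vocab, train_embeddings, train_body in SCHEDULE:
        if number == step:
            groups = []
            if train_embeddings:
                groups.append("embeddings")
            if train_body:
                groups.append("body")
            return groups
    raise ValueError(f"unknown step: {step}")
```

The key point is that the $L_2$ embeddings (step 2) and the task-specific body (step 3) are never trained together, yet they are combined at test time in step 4.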
5 Downstream Learning
5 下游学习
Once learned, contextual embeddings have demonstrated impressive performance when used downstream on various learning problems. Here we describe the ways in which contextual embeddings are used downstream, the ways in which one can avoid forgetting information in the embeddings during downstream learning, and how they can be specialized to multiple learning tasks.
学习完成后,上下文嵌入 (contextual embeddings) 在下游各类学习任务中展现出卓越性能。本文阐述了上下文嵌入在下游任务中的应用方式、如何避免下游学习过程中遗忘嵌入信息,以及如何将其适配到多种学习任务中。
5.1 Ways to Use Contextual Embeddings Downstream
5.1 上下文嵌入的下游应用方式
There are three main ways to use pre-trained contextual embeddings in downstream tasks: (1) Feature-based methods, (2) Fine-tuning methods, and (3) Adapter methods.
在下游任务中使用预训练上下文嵌入主要有三种方式:(1) 基于特征的方法 (feature-based methods)、(2) 微调方法 (fine-tuning methods)、(3) 适配器方法 (adapter methods)。
Feature-based. One example of a feature-based method is the one used by ELMo (Peters et al., 2018). Specifically, as shown in equation 2, ELMo freezes the weights of the pre-trained contextual embedding model and forms a linear combination of its internal representations. The linearly combined representations are then used as features for task-specific architectures. The benefit of feature-based models is that they can use state-of-the-art handcrafted architectures for specific tasks.
基于特征的方法。一个基于特征的例子是ELMo (Peters et al., 2018)使用的方法。具体来说,如公式2所示,ELMo冻结预训练的上下文嵌入模型的权重,并形成其内部表示的线性组合。然后,这些线性组合的表示被用作特定任务架构的特征。基于特征模型的优势在于,它们可以为特定任务使用最先进的手工架构。
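The linear combination of frozen layer representations can be sketched in a few lines. This is a plain-Python illustration for a single token; the function names are hypothetical, and the softmax-normalized layer weights and the scalar `gamma` follow the usual ELMo formulation, with both learned on the downstream task.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_vectors, layer_scores, gamma=1.0):
    """Task-specific weighted sum of the frozen per-layer vectors of one token.

    layer_vectors: one vector (list of floats) per layer of the frozen model.
    layer_scores:  one unnormalized scalar per layer, learned downstream.
    gamma:         learned scalar that rescales the combined vector.
    """
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]

features = combine_layers([[1.0, 0.0], [0.0, 1.0]], layer_scores=[0.0, 0.0])
```

Only `layer_scores` and `gamma` are trained for the downstream task; the layer vectors themselves come from the frozen pre-trained model.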
Fine-tuning. Fine-tuning works as follows: starting with the weights of the pre-trained contextual embedding model, fine-tuning makes small adjustments to them in order to specialize them to a specific downstream task. One stream of work applies minimal changes to pre-trained models to take full advantage of their parameters. The most straightforward way is adding linear layers on top of the pre-trained models (Devlin et al., 2018; Lan et al., 2019). Another method (Radford et al., 2019; Raffel et al., 2019) uses universal data formats without introducing new parameters for downstream tasks.
微调 (Fine-tuning)。微调的工作原理如下:从预训练的上下文嵌入模型的权重开始,微调会对这些权重进行小幅调整,使其专门适用于特定的下游任务。一部分研究工作对预训练模型进行最小程度的改动,以充分利用其参数。最直接的方法是在预训练模型之上添加线性层 (Devlin et al., 2018; Lan et al., 2019)。另一种方法 (Radford et al., 2019; Raffel et al., 2019) 使用通用数据格式,无需为下游任务引入新参数。
To apply pre-trained models to structurally different tasks, where task-specific architectures are used, as much of the model as possible is initialized with pre-trained weights. For instance, XLM (Lample and Conneau, 2019) uses two pre-trained monolingual language models to initialize the encoder and the decoder for machine translation, respectively, leaving only the cross-attention weights randomly initialized.
为了将预训练模型应用于结构不同的任务(这些任务使用特定于任务的架构),尽可能多地用预训练权重初始化模型。例如,XLM (Lample and Conneau, 2019) 应用两个预训练的单语语言模型分别初始化机器翻译的编码器和解码器,仅随机初始化交叉注意力权重。
Adapters. Adapters (Rebuffi et al., 2017; Stickland and Murray, 2019) are small modules added between layers of pre-trained models to be trained in a multi-task learning setting. The parameters of the pre-trained model are fixed while tuning these adapter modules. Compared to previous work that fine-tunes a separate pre-trained model for each task, a model with shared adapters for all tasks often requires fewer parameters.
适配器 (Adapters)。适配器 (Rebuffi et al., 2017; Stickland and Murray, 2019) 是在预训练模型各层之间添加的小型模块,用于多任务学习场景下的训练。在调整这些适配器模块时,预训练模型的参数保持不变。与之前为每个任务单独微调预训练模型的方法相比,使用共享适配器的模型通常需要更少的参数。
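As an illustration of what such a module can look like, the sketch below implements a bottleneck adapter with a residual connection. The bottleneck design (down-projection, nonlinearity, up-projection) is a common choice assumed here for the example and is not specified in the text above; all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter: project down, apply ReLU, project up, add a skip connection."""
    bottleneck = relu(matvec(w_down, hidden))  # size r, with r much smaller than d
    update = matvec(w_up, bottleneck)          # back to size d
    return [h + u for h, u in zip(hidden, update)]

d, r = 4, 2
hidden = [0.5, -1.0, 2.0, 0.0]
w_down = [[0.1] * d for _ in range(r)]  # r x d, trainable
w_up = [[0.0] * r for _ in range(d)]    # d x r, trainable, zero-initialized
output = adapter(hidden, w_down, w_up)
```

With a zero-initialized up-projection the adapter starts as the identity function, so inserting it does not change the pre-trained model's behaviour before training. Each adapter adds only about 2dr parameters per layer.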
5.2 Countering Catastrophic Forgetting
5.2 对抗灾难性遗忘
Learning on downstream tasks is prone to overwriting the information from pre-trained models, which is widely known as catastrophic forgetting (McCloskey and Cohen, 1989; d’Autume et al., 2019). Previous work combats this by (1) Freezing layers, (2) Using adaptive learning rates, and (3) Regularization.
在下游任务上的学习容易覆盖预训练模型中的信息,这一现象被广泛称为灾难性遗忘 (McCloskey and Cohen, 1989; d'Autume et al., 2019)。先前的研究通过以下方法应对该问题:(1) 冻结网络层,(2) 使用自适应学习率,(3) 正则化。
Freezing layers. Motivated by layer-wise training of neural networks (Hinton et al., 2006), training certain layers while freezing others can potentially reduce forgetting during fine-tuning. Different layer-wise tuning schedules have been studied. Long et al. (2015) freeze all layers except the top layer. Felbo et al. (2017) use “chain-thaw”, which sequentially unfreezes and fine-tunes one layer at a time. Howard and Ruder (2018) gradually unfreeze all layers one by one from top to bottom. Chronopoulou et al. (2019) apply a three-stage fine-tuning schedule: (a) randomly-initialized parameters are updated for $n$ epochs, (b) the pre-trained parameters (except word embeddings) are then fine-tuned, (c) finally, all parameters are fine-tuned.
冻结层。受神经网络分层训练 (Hinton et al., 2006) 的启发,在微调时冻结部分层而训练其他层可有效减少遗忘现象。目前已研究出多种分层调优方案:Long et al. (2015) 仅解冻顶层进行训练;Felbo et al. (2017) 采用"链式解冻"策略,逐层解冻并微调;Howard 和 Ruder (2018) 采用自上而下逐层解冻的方式;Chronopoulou et al. (2019) 则设计了三阶段微调方案:(a) 随机初始化参数训练 $n$ 轮,(b) 微调预训练参数(词嵌入层除外),(c) 最终全参数微调。
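A gradual unfreezing schedule of the kind described above can be written as a function from the epoch number to the set of trainable layers. This is a sketch under the assumption that exactly one additional layer is unfrozen per epoch; the function name is hypothetical.

```python
def gradual_unfreezing(num_layers, epoch):
    """Return the indices of the trainable layers at a given epoch (0-based).

    Layer num_layers - 1 is the top layer. One more layer is unfrozen per
    epoch, from the top down, until every layer is trainable.
    """
    num_trainable = min(epoch + 1, num_layers)
    return list(range(num_layers - num_trainable, num_layers))
```

At epoch 0 only the top layer is trained, which coincides with freezing all layers except the top one; from epoch `num_layers - 1` onwards the whole model is fine-tuned.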
Adaptive learning rates. Another way to mitigate catastrophic forgetting is to use adaptive learning rates. As the lower layers of pre-trained models are believed to capture general language knowledge (Tenney et al., 2019a), Howard and Ruder (2018) use lower learning rates for lower layers when fine-tuning.
自适应学习率。另一种缓解灾难性遗忘的方法是使用自适应学习率。由于预训练模型的底层通常被认为能捕捉通用语言知识 (Tenney et al., 2019a) ,Howard 和 Ruder (2018) 在微调时对底层采用了较低的学习率。
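Layer-dependent learning rates can be sketched as a geometric decay from the top layer downwards. The function name is hypothetical; the default decay factor of 2.6 is the value reported by Howard and Ruder (2018) and should be treated as a tunable hyperparameter.

```python
def layerwise_learning_rates(num_layers, top_lr, decay=2.6):
    """Learning rate per layer; index 0 is the lowest layer.

    The top layer uses `top_lr`, and each layer below it uses the rate of
    the layer above divided by `decay`.
    """
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]

rates = layerwise_learning_rates(3, top_lr=0.01)
```

Lower layers therefore move more slowly away from their pre-trained values, while the upper layers adapt more quickly to the downstream task.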
Regularization. Regularization constrains the fine-tuned parameters to stay close to the pre-trained parameters. Wiese et al. (2017) minimize the Euclidean distance between the fine-tuned parameters and the pre-trained parameters. Kirkpatrick et al. (2017) use the Fisher information matrix to protect the weights that are identified as essential for pre-trained models.
正则化。正则化限制微调参数接近预训练参数。Wiese等人(2017)最小化微调参数与预训练参数之间的欧氏距离。Kirkpatrick等人(2017)使用Fisher信息矩阵来保护被识别为对预训练模型至关重要的权重。
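Both regularizers can be written as penalty terms that are added to the downstream loss. The sketch below works on flat parameter lists; the function names and the `strength` coefficients are assumptions, and the Fisher-weighted penalty uses only the diagonal of the Fisher information matrix.

```python
def l2_to_pretrained(theta, theta_pre, strength):
    """Penalty on the squared Euclidean distance to the pre-trained parameters."""
    return strength * sum((t - p) ** 2 for t, p in zip(theta, theta_pre))

def fisher_weighted_penalty(theta, theta_pre, fisher_diag, strength):
    """Same idea, but each parameter is weighted by its diagonal Fisher information."""
    return 0.5 * strength * sum(
        f * (t - p) ** 2 for t, p, f in zip(theta, theta_pre, fisher_diag)
    )
```

In the Fisher-weighted variant, parameters with a large Fisher value are expensive to move, whereas parameters with a value near zero can change freely.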
5.3 Multi-task Fine-tuning
5.3 多任务微调
Multi-task learning on downstream tasks (Liu et al., 2019b; Wang et al., 2019a; Jozefowicz et al., 2016) obtains general representations across tasks and achieves strong performance on each individual task.
在下游任务上进行多任务学习 (Liu et al., 2019b; Wang et al., 2019a; Jozefowicz et al., 2016) 可以获取跨任务的通用表征,并在每个独立任务上实现强劲性能。
MT-DNN (Liu et al., 2019b) fine-tunes BERT on all the GLUE tasks, improving the GLUE benchmark score to 82.7%. MT-DNN also demonstrates that the representations from multi-task learning obtain better performance on domain adaptation compared to BERT.
MT-DNN (Liu et al., 2019b) 在所有GLUE任务上对BERT进行微调,将GLUE基准提升至82.7%。MT-DNN还表明,与BERT相比,多任务学习获得的表征在领域适应方面表现更优。
Wang et al. (2019a) investigate further, non-GLUE tasks, such as skip-thought and Reddit response generation, for multi-task learning.
Wang等人 (2019a) 进一步研究了多任务学习中的非GLUE任务,例如跳跃思维 (skip-thought) 和Reddit回复生成。
T5 (Raffel et al., 2019) studies various settings of multi-task learning and finds that using multi-task learning before fine-tuning on each task performs the best.
T5 (Raffel等人,2019) 研究了多任务学习的多种设置,发现先在每个任务上使用多任务学习再进行微调效果最佳。
6 Model Compression
6 模型压缩
As many pre-trained language models have a prohibitive memory footprint and latency, it is challenging to deploy them in resource-constrained environments. To address this, model compression (Cheng et al., 2017), which has gained popularity in recent years for shrinking large neural networks, has been investigated for compressing contextual embedding models. Work on compressing language models utilizes (1) Low-rank approximation, (2) Knowledge distillation, and (3) Weight quantization, to make them usable in embedded systems and edge devices.
由于许多预训练语言模型存在内存占用过高和延迟过大的问题,在资源受限环境中部署它们成为一项具有挑战性的任务。为此,近年来流行的模型压缩技术(Cheng等人,2017)被用于压缩上下文嵌入模型。语言模型压缩主要采用三种方法:(1) 低秩近似,(2) 知识蒸馏,(3) 权重量化,以使其能在嵌入式系统和边缘设备上运行。
Low-rank approximation. Methods that learn low-rank approximations seek to compress the full-rank model weight matrices into low-rank matrices, thereby reducing the effective number of model parameters. As the embedding matrices usually account for a large portion of model parameters (e.g. 21% for $\mathrm{BERT_{Base}}$), ALBERT (Lan et al., 2019) approximates the embedding matrix $\mathbf{E}\in\mathbb{R}^{V\times d}$ as the product of two smaller matrices, $\mathbf{E}_1\in\mathbb{R}^{V\times d^{\prime}}$ and $\mathbf{E}_2\in\mathbb{R}^{d^{\prime}\times d}$, where $d^{\prime}\ll d$.
低秩近似。学习低秩近似的方法旨在将全秩模型权重矩阵压缩为低秩矩阵,从而减少模型参数的有效数量。由于嵌入矩阵通常占模型参数的很大一部分(例如,$\mathrm{BERT_ {Base}}$ 中占 $21%$),ALBERT (Lan et al., 2019) 将嵌入矩阵 $\mathbf{E}\in\mathbb{R}^{V\times d}$ 近似为两个较小矩阵 ${\bf E}_ {1}\in\mathbb{R}^{V\times d^{\prime}}$ 和 $\mathbf{E}_ {2}\in\mathbb{R}^{d^{\prime}\times d}$ 的乘积,其中 $d^{\prime}\ll d$。
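The parameter savings from this factorization follow directly from the matrix shapes. The sketch below counts embedding parameters with and without the factorization; the concrete sizes (V = 30,000, d = 768, d' = 128) are illustrative assumptions, and the function name is hypothetical.

```python
def embedding_parameters(vocab_size, hidden_size, bottleneck=None):
    """Count embedding parameters, with or without a low-rank factorization."""
    if bottleneck is None:
        return vocab_size * hidden_size  # E has shape V x d
    # E1 has shape V x d', E2 has shape d' x d
    return vocab_size * bottleneck + bottleneck * hidden_size

full = embedding_parameters(30_000, 768)
factored = embedding_parameters(30_000, 768, bottleneck=128)
```

With these sizes the embedding parameters drop from 23,040,000 to 3,938,304, a reduction of roughly a factor of six, because the large vocabulary dimension V is now multiplied by d' instead of d.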
Knowledge distillation. A method called ‘knowledge distillation’ was proposed by Hinton et al. (2015), where the ‘knowledge’ encoded in a teacher network is transferred to a student network. Hinton et al. (2015) use the soft target probabilities output by the teacher network to train the student network with the cross-entropy loss. The student network is smaller than the teacher network, resulting in a more lightweight model that approaches the accuracy of the heavyweight teacher network. Tang et al. (2019) distill the knowledge from BERT into a single-layer BiLSTM, obtaining performance comparable to ELMo with roughly 100 times fewer parameters. DistilBERT (Sanh et al., 2019) uses MLM, distillation loss (Hinton et al., 2015), and cosine similarity between the embedding matrices of the teacher and student networks to train a smaller BERT model. BERT-PKD (Sun et al., 2019a) uses a student BERT model with fewer layers than $\mathrm{BERT_{Base}}$ or $\mathrm{BERT_{Large}}$ and proposes two ways (learning from the last $k$ layers and learning from every $k$ layers) to map the layers of the student to the layers of $\mathrm{BERT_{Base}}$ or $\mathrm{BERT_{Large}}$. The hidden states of the student are kept close to the hidden states of the teacher from corresponding layers using a Euclidean distance regularizer. TinyBERT (Jiao et al., 2019) introduces a two-stage learning framework, where distillation is performed at both the pre-training and the fine-tuning stages.
知识蒸馏。Hinton等人(2015)提出了一种名为"知识蒸馏"的方法,将教师网络编码的"知识"迁移到学生网络中。Hinton等人(2015)使用教师网络输出的软目标概率,通过交叉熵损失训练学生网络。学生网络比教师网络更小,从而获得接近重量级教师网络精度的轻量级模型。Tang等人(2019)将BERT的知识蒸馏到单层BiLSTM中,用约100倍少的参数获得了与ELMo相当的性能。DistilBERT(Sanh等人,2019)使用MLM、蒸馏损失(Hinton等人,2015)以及师生网络嵌入矩阵间的余弦相似度来训练较小的BERT模型。BERT-PKD(Sun等人,2019a)使用比BERTBase或BERTLarge层数更少的学生BERT模型,并提出了两种方式(从最后$k$层学习和从每$k$层学习)将学生层映射到BERTBase或BERTLarge的对应层。通过欧几里得距离正则化器使学生网络的隐藏状态保持接近教师网络对应层的隐藏状态。TinyBERT(Jiao等人,2019)提出了两阶段学习框架,在预训练和微调阶段都进行蒸馏。
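The soft-target loss of Hinton et al. (2015) can be sketched as the cross-entropy between the temperature-softened output distributions of teacher and student. The function names and the default temperature of 2.0 are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))

teacher = [2.0, 0.0, -1.0]
matched = distillation_loss([2.0, 0.0, -1.0], teacher)
mismatched = distillation_loss([-1.0, 0.0, 2.0], teacher)
```

A higher temperature flattens both distributions, so the student also receives a training signal from the relative probabilities that the teacher assigns to incorrect classes. The loss is smallest when the student reproduces the teacher's distribution.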
Weight quantization. Quantization methods focus on mapping weight parameters to low-precision integers and floating-point numbers. Q-BERT (Shen et al., 2019) proposes a group-wise quantization scheme, where the parameters are divided into groups based on attention heads, and uses a Hessian-based, mixed-precision method to compress the model.
权重量化。量化方法专注于将权重参数映射到低精度整数和浮点数。QBERT (Shen et al., 2019) 提出了一种分组量化方案,其中参数根据注意力头被分成若干组,并采用基于Hessian矩阵的混合精度方法来压缩模型。
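The idea of group-wise quantization can be illustrated with plain symmetric uniform quantization, where every group of weights gets its own scale. This sketch is a generic scheme and not the Hessian-based mixed-precision method mentioned above; all function names are hypothetical.

```python
def quantize_group(weights, num_bits=8):
    """Symmetric uniform quantization of one group of weights to signed integers."""
    q_max = 2 ** (num_bits - 1) - 1
    largest = max(abs(w) for w in weights)
    scale = largest / q_max if largest > 0 else 1.0
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def quantize_groupwise(groups, num_bits=8):
    """Quantize each group (for example, one attention head) with its own scale."""
    return [quantize_group(group, num_bits) for group in groups]

weights = [-1.0, 0.0, 0.3, 1.0]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
```

Using a separate scale per group keeps the rounding error of each weight within half a quantization step of that group's own range, which matters when different groups have very different magnitudes.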
7 Analyzing Contextual Embeddings
7 分析上下文嵌入 (Contextual Embeddings)
While contextual embedding methods have impressive performance on a variety of natural language tasks, it is often unclear exactly why they work so well. To study this, work so far has used (1) Probe classifiers, and (2) Visualization.
虽然上下文嵌入方法在各种自然语言任务上表现出色,但其优异性能的具体原因往往不明确。针对这一问题,目前研究主要采用两种方法:(1) 探针分类器 (probe classifiers) ,(2) 可视化分析。
Probe classifiers. A large body of work studies contextual embeddings using probes. These are constrained classifiers designed to explore whether syntactic and semantic information is encoded in these representations or not.
探针分类器。大量研究使用探针分类器探索上下文嵌入表示。这些受限分类器旨在检验句法和语义信息是否被编码在这些表示中。
Liu et al. (2019a) design a series of token labelling, segmentation, and pairwise relation tasks for studying the effectiveness of contextual embeddings. Contextual embeddings achieve competitive results compared to the state-of-the-art models on most tasks, yet fail on some fine-grained linguistic tasks (e.g. conjunct identification).
Liu et al. (2019a) 设计了一系列token标注、分割和成对关系任务来研究上下文嵌入的有效性。在大多数任务上,上下文嵌入相比最先进模型取得了有竞争力的结果,但在某些细粒度语言任务(如连词识别)上表现不佳。
Hewitt and Manning (2019) propose a structural probe for finding syntax in contextual embeddings. The model attempts to learn a linear transformation under which the L2 distances between tokens encode the distances between these tokens in syntactic parsing trees like dependency trees.
Hewitt和Manning (2019) 提出了一种结构探针,用于在上下文嵌入中寻找句法。该模型尝试学习一种线性变换,使得Token之间的L2距离能够编码这些Token在依存树等句法分析树中的距离。
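The structural probe can be sketched as a distance computed after a linear map, compared against gold tree distances. The use of the squared L2 distance and of a mean absolute error over token pairs are assumptions of this illustration, as are the function names; in practice the projection matrix would be learned by gradient descent.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def probe_distance(h_i, h_j, projection):
    """Squared L2 distance between two token vectors after a linear map."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = matvec(projection, diff)
    return sum(x * x for x in projected)

def probe_loss(embeddings, tree_distances, projection):
    """Mean absolute gap between predicted and gold tree distances over token pairs."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(
        abs(tree_distances[i][j]
            - probe_distance(embeddings[i], embeddings[j], projection))
        for i, j in pairs
    )
    return total / len(pairs)
```

If a low-rank projection achieves a small loss, this indicates that tree distances are approximately recoverable from a linear subspace of the embedding space.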
Tenney et al. (2019a) find that BERT rediscovers the traditional NLP pipeline in an interpretable and localizable way. Specifically, it is capable of POS tagging, parsing, NER, semantic role labelling, and coreference resolution, and these are learned in order.
Tenney等人 (2019a) 发现BERT以一种可解释且可局部化的方式重新发现了传统自然语言处理(NLP)流程。具体而言,该模型能够进行词性标注(POS tagging)、句法分析(parsing)、命名实体识别(NER)、语义角色标注(semantic roles)和共指消解(coreference),并且这些能力是按顺序习得的。
Jawahar et al. (2019) use ten sentence-level probing tasks (e.g. SentLen, TreeDepth) and find that BERT captures phrase-level information in earlier layers and long-distance dependency information in deeper layers.
Jawahar等人(2019)使用十个句子级探测任务(如SentLen、TreeDepth)发现,BERT在较浅层捕获短语级信息,在较深层捕获长距离依赖信息。
Visualization. Another body of work uses visualization to analyze attention and fine-tuning procedures, among others.
可视化。另一部分工作利用可视化来分析注意力机制和微调过程等。
Hao et al. (2019) visualize loss landscapes and optimization trajectories when fine-tuning BERT. The visualizations show that BERT reaches a good initial point during pre-training for downstream tasks, which can lead to better optima compared to randomly-initialized models.
Hao等人(2019)通过可视化微调BERT时的损失曲面和优化轨迹发现,预训练使BERT为下游任务提供了良好的初始点,相比随机初始化模型能获得更优的极值点。
Kovaleva et al. (2019) visualize the attention heads of BERT, discovering a limited set of attention patterns across different heads. This suggests that the heads of BERT are highly redundant. After manually disabling certain attention heads, better performance is obtained compared to the fine-tuned BERT models that use the full set of attention heads.
Kovaleva等人 (2019) 对BERT的注意力头进行了可视化,发现不同注意力头之间存在有限的注意力模式。这表明BERT的注意力头具有高度冗余性。在手动禁用某些注意力头后,相比使用完整注意力头的微调BERT模型,反而获得了更好的性能表现。
Coenen et al. (2019) visualize and analyze the geometry of BERT embeddings, finding that BERT distinguishes word senses at a very fine-grained level. These word senses are also found to be encoded in a relatively low-dimensional subspace.
Coenen等人(2019)对BERT嵌入的几何结构进行了可视化分析,发现BERT能以非常细粒度区分词义。这些词义还被发现编码在相对低维的子空间中。
8 Current Challenges
8 当前挑战
There are many key challenges that, if solved, would improve future contextual embeddings.
未来语境嵌入技术的提升面临诸多关键挑战。
Better pre-training objectives. BERT designed MLM to take advantage of bi-directional information during pre-training. It remains unclear whether there are pre-training objectives that are simultaneously more efficient and more effective. Some recent work focuses on designing new training methods (Clark et al., 2019), noise combination techniques (Lewis et al., 2019), and multi-task learning approaches (Wang et al., 2019a).
更好的预训练目标。BERT设计了掩码语言模型(MLM)以利用预训练中的双向信息。目前尚不清楚是否存在同时更高效和更有效的预训练目标。一些最新研究专注于设计新的训练方法(Clark et al., 2019)、噪声组合技术(Lewis et al., 2019)以及多任务学习方法(Wang et al., 2019a)。
Understanding the knowledge encoded in pre-trained models. As described above, a range of methods (Tenney et al., 2019a,b; Hewitt and Manning, 2019; Liu et al., 2019a) have been proposed to explore the effectiveness of pre-trained models via probes. Yet, controlled experiments are still lacking to understand whether the representations actually encode linguistic knowledge or the probes happen to learn to perform well on these linguistic tasks because the data they use is so high-dimensional (Hewitt and Liang, 2019). Hewitt and Liang (2019) devise control tasks, where a good probe is one that performs well on linguistic tasks and badly on control tasks. They find that most existing probes fail to satisfy this condition. Indeed, most probes use shallow classifiers, which may not be able to extract the relevant information from contextual representations. New probes or better methods for understanding contextual representations are needed.
理解预训练模型中编码的知识。如前所述,已有多种方法 (Tenney et al., 2019a,b; Hewitt and Manning, 2019; Liu et al., 2019a) 通过探针(probe)来探索预训练模型的有效性。然而,目前仍缺乏对照实验来验证这些表征是否真正编码了语言学知识,或者探针只是因为使用了高维数据而恰好在这些语言学任务上表现良好 (Hewitt and Liang, 2019)。Hewitt和Liang (2019) 设计了控制任务,其中优秀的探针应在语言学任务上表现良好,而在控制任务上表现糟糕。他们发现大多数现有探针无法满足这一条件。事实上,多数探针使用浅层分类器(shallow classifier),可能无法从上下文表征中提取相关信息。我们需要开发新的探针或更好的方法来理解上下文表征。
Model robustness. Concerns about the vulnerability of models to attack are growing as NLP models are deployed into production. Wallace et al. (2019) show that universal adversarial triggers that cause significant performance deterioration of pre-trained models can be found. Additionally, concerns about the abuse of pre-trained models (e.g. generating fake news) have arisen. Better methods for increasing model robustness are needed.
模型鲁棒性。将NLP模型部署到生产环境时,人们越来越关注模型易受攻击的脆弱性。Wallace等人(2019)研究表明,可以找到导致预训练模型性能显著下降的通用对抗触发器。此外,滥用预训练模型(如生成虚假新闻)的担忧也随之出现[3]。我们亟需提升模型鲁棒性的更好方法。
Controlled generation of sequences. Pretrained language models (Radford et al., 2018, 2019) are able to generate realistic-looking text sequences. Yet it is hard to adapt these models to generate domain-specific sequences (Keskar et al., 2019) or to agree with common human knowledge (Zellers et al., 2019). As a result, we advocate research on more fine-grained control over sequence generation.
序列的受控生成。预训练语言模型 (Radford et al., 2018, 2019) 能够生成逼真的文本序列。然而,这些模型难以适应生成特定领域的序列 (Keskar et al., 2019) 或符合人类常识 (Zellers et al., 2019)。因此,我们主张对序列生成进行更细粒度控制的研究。
Acknowledgements
致谢
We thank Douwe Kiela, Jiatao Gu, Yi Tay, Xiaodong Liu, Ziyang Wang and Jake Zhao for their comments and discussions on this manuscript.
感谢 Douwe Kiela、Jiatao Gu、Yi Tay、Xiaodong Liu、Ziyang Wang 和 Jake Zhao 对本手稿提出的意见和讨论。
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
Andrew M Dai 和 Quoc V Le. 2015. 半监督序列学习. In Advances in neural information processing systems, pages 3079–3087.
Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Zihang Dai、Zhilin Yang、Yiming Yang、William W Cohen、Jaime Carbonell、Quoc V Le 和 Ruslan Salakhutdinov. 2019. Transformer-xl: 突破固定长度上下文的注意力语言模型. arXiv preprint arXiv:1901.02860.
Cyprien de Masson d’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. arXiv preprint arXiv:1906.01076.
Cyprien de Masson d’Autume、Sebastian Ruder、Ling-peng Kong 和 Dani Yogatama。2019。终身语言学习中的情景记忆。arXiv预印本 arXiv:1906.01076。
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Jacob Devlin、Ming-Wei Chang、Kenton Lee 和 Kristina Toutanova。2018。BERT:面向语言理解的深度双向Transformer预训练。arXiv预印本 arXiv:1810.04805。
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
李东, 南楠, 王文辉, 魏福如, 刘晓东, 王宇, 高剑峰, 周明, 和洪小文. 2019. 面向自然语言理解与生成的统一语言模型预训练. 载于《神经信息处理系统进展》, 第13042–13054页.
Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv preprint arXiv:1708.00524.
Bjarke Felbo、Alan Mislove、Anders Søgaard、Iyad Rahwan 和 Sune Lehmann。2017. 利用数百万表情符号数据学习跨领域表征以检测情感、情绪与讽刺。arXiv 预印本 arXiv:1708.00524。
Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and understanding the effectiveness of bert. arXiv preprint arXiv:1908.05620.
Yaru Hao、Li Dong、Furu Wei 和 Ke Xu. 2019. 可视化与理解 BERT 的有效性. arXiv 预印本 arXiv:1908.05620.
Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
Zellig S Harris. 1954. 分布结构. Word, 10(2-3):146–162.
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.
John Hewitt 和 Percy Liang. 2019. 设计与解释带有控制任务的探针。见《2019年自然语言处理经验方法会议暨第九届自然语言处理国际联合会议论文集》(EMNLP-IJCNLP), 第2733–2743页, 中国香港。计算语言学协会。
John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
John Hewitt 和 Christopher D Manning. 2019. 一种用于在词表示中寻找句法结构的结构探针. 见《2019年北美计算语言学协会人类语言技术会议论文集, 第1卷(长论文和短论文)》, 第4129–4138页.
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
Felix Hill、Antoine Bordes、Sumit Chopra 和 Jason Weston。2015. 金发姑娘原则:通过显式记忆表征阅读儿童读物。arXiv 预印本 arXiv:1511.02301。
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Geoffrey Hinton、Oriol Vinyals 和 Jeff Dean. 2015. 蒸馏神经网络中的知识. arXiv 预印本 arXiv:1503.02531.
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554.
Geoffrey E Hinton、Simon Osindero 和 Yee-Whye Teh。2006。深度信念网络的一种快速学习算法。神经计算,18(7):1527–1554。
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Jeremy Howard 和 Sebastian Ruder. 2018. 通用语言模型微调在文本分类中的应用. arXiv 预印本 arXiv:1801.06146.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does bert learn about the structure of language? In 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy.
Ganesh Jawahar、Benoît Sagot 和 Djamé Seddah. 2019. BERT对语言结构学到了什么? 见: 第57届计算语言学协会年会 (ACL), 意大利佛罗伦萨.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351.
肖琪娇、尹一淳、尚立峰、姜欣、陈晓、李琳琳、王芳和刘群。2019. TinyBERT:用于自然语言理解的BERT蒸馏方法。arXiv预印本 arXiv:1909.10351。
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2019. Spanbert: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
Mandar Joshi、Danqi Chen、Yinhan Liu、Daniel S Weld、Luke Zettlemoyer 和 Omer Levy。2019. SpanBERT: 通过表示和预测文本片段改进预训练。arXiv预印本 arXiv:1907.10529。
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of bert. arXiv preprint arXiv:1908.08593.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Prajit Ramachandran, Peter J Liu, and Quoc V Le. 2016. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516.
Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2019. Q-bert: Hessian based ultra low precision quantization of bert. arXiv preprint arXiv:1909.05840.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
Asa Cooper Stickland and Iain Murray. 2019. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671.
Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355.
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019c. Ernie 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. 2019b. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394. Association for Computational Linguistics.
Benigno Uria, Marc-Alexandre Coté, Karol Gregor, Iain Murray, and Hugo Larochelle. 2016. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for nlp. arXiv preprint arXiv:1908.07125.
Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4465–4476, Florence, Italy. Association for Computational Linguistics.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. 2019b. Structbert: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577.
Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. arXiv preprint arXiv:1706.03610.
Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. arXiv preprint arXiv:2001.11314.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
