A Survey on Contextual Embeddings
Qi Liu‡, Matt J. Kusner†∗, Phil Blunsom‡⋄, ‡University of Oxford ⋄DeepMind †University College London ∗The Alan Turing Institute ‡{firstname.lastname}@cs.ox.ac.uk †m.kusner@ucl.ac.uk
Abstract
Contextual embeddings, such as ELMo and BERT, move beyond global word representations like Word2Vec and achieve groundbreaking performance on a wide range of natural language processing tasks. Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages. In this survey, we review existing contextual embedding models, cross-lingual polyglot pretraining, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
1 Introduction
Distributional word representations (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) trained in an unsupervised manner on large-scale corpora are widely used in modern natural language processing systems. However, these approaches only obtain a single global representation for each word, ignoring their context. Different from traditional word representations, contextual embeddings move beyond word-level semantics in that each token is associated with a representation that is a function of the entire input sequence. These context-dependent representations can capture many syntactic and semantic properties of words under diverse linguistic contexts. Previous work (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019) has shown that contextual embeddings pretrained on large-scale unlabelled corpora achieve state-of-the-art performance on a wide range of natural language processing tasks, such as text classification, question answering and text summarization. Further analyses (Liu et al., 2019a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; Tenney et al., 2019a) demonstrate that contextual embeddings are capable of learning useful and transferable representations across languages.
The rest of the survey is organized as follows. In Section 2, we define the concept of contextual embeddings. In Section 3, we introduce existing methods for obtaining contextual embeddings. In Section 4, we present the pre-training methods of contextual embeddings on multi-lingual corpora. In Section 5, we describe methods for applying pre-trained contextual embeddings in downstream tasks. In Section 6, we detail model compression methods. In Section 7, we survey analyses that have aimed to identify the linguistic knowledge learned by contextual embeddings. We conclude the survey by highlighting some challenges for future research in Section 8.
2 Token Embeddings
Consider a text corpus that is represented as a sequence $s$ of tokens, $(t_{1},t_{2},...,t_{N})$. Distributed representations of words (Harris, 1954; Bengio et al., 2003) associate each token $t_{i}$ with a dense feature vector $\mathbf{h}_{t_{i}}$. Traditional word embedding techniques aim to learn a global word embedding matrix $\textbf{E}\in\mathbb{R}^{V\times d}$, where $V$ is the vocabulary size and $d$ is the number of dimensions. Specifically, each row $\mathbf{e}_{i}$ of $\mathbf{E}$ corresponds to the global embedding of word type $i$ in the vocabulary $V$. Well-known models for learning word embeddings include Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). On the other hand, methods that learn contextual embeddings associate each token $t_{i}$ with a representation that is a function of the entire input sequence $s$, i.e. $\mathbf{h}_{t_{i}} = f(\mathbf{e}_{t_{1}},\mathbf{e}_{t_{2}},...,\mathbf{e}_{t_{N}})$, where each input token $t_{j}$ is usually mapped to its non-contextualized representation $\mathbf{e}_{t_{j}}$ first, before applying an aggregation function $f$. These context-dependent representations are better suited to capture sequence-level semantics (e.g. polysemy) than non-contextual word embeddings. There are many model architectures for $f$, which we review here. We begin by describing pre-training methods for learning contextual embeddings that can be used in downstream tasks.
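To make this distinction concrete, the following sketch (in PyTorch) contrasts a global embedding lookup with a contextual encoder; the bidirectional LSTM standing in for $f$ is only one illustrative choice of aggregation function, and all variable names are ours.

```python
import torch
import torch.nn as nn

V, d = 10000, 128                            # vocabulary size and embedding dimension
token_ids = torch.tensor([[3, 17, 52, 9]])   # a toy sequence s = (t_1, ..., t_N)

# Global embeddings: each token type always maps to the same row of E.
E = nn.Embedding(V, d)
global_reprs = E(token_ids)                  # shape (1, N, d), independent of context

# Contextual embeddings: h_{t_i} = f(e_{t_1}, ..., e_{t_N}).
# Here f is a bidirectional LSTM, but any sequence encoder (e.g. a Transformer) could be used.
f = nn.LSTM(input_size=d, hidden_size=d, batch_first=True, bidirectional=True)
contextual_reprs, _ = f(global_reprs)        # shape (1, N, 2d), a function of the whole sequence
```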
3 Pre-training Methods for Contextual Embeddings
In large part, pre-training contextual embeddings can be divided into either unsupervised methods (e.g. language modelling and its variants) or supervised methods (e.g. machine translation and natural language inference).
3.1 Unsupervised Pre-training via Language Modeling
The prototypical way to learn distributed token embeddings is via language modelling. A language model is a probability distribution over a sequence of tokens. Given a sequence of $N$ tokens, $(t_ {1},t_ {2},...,t_ {N})$ , a language model factorizes the probability of the sequence as:
$$
p(t_ {1},t_ {2},...,t_ {N})=\prod_ {i=1}^{N}p(t_ {i}|t_ {1},t_ {2},...,t_ {i-1}).
$$
Language modelling uses maximum likelihood estimation (MLE), often penalized with regularization terms, to estimate model parameters. A left-to-right language model takes the left context, $t_{1},t_{2},...,t_{i-1}$, of $t_{i}$ into account for estimating the conditional probability. Language models are usually trained using large-scale unlabelled corpora. The conditional probabilities are most commonly learned using neural networks (Bengio et al., 2003), and the learned representations have been proven to be transferable to downstream natural language understanding tasks (Dai and Le, 2015; Ramachandran et al., 2016).
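As a minimal illustration of this factorization, the sketch below computes the negative log-likelihood of a toy sequence under a left-to-right neural language model; the small LSTM model and all names are placeholders rather than any particular system discussed in this survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d = 10000, 128
tokens = torch.tensor([[2, 7, 31, 5, 9]])    # (t_1, ..., t_N), batch size 1

embed = nn.Embedding(V, d)
encoder = nn.LSTM(d, d, batch_first=True)
to_logits = nn.Linear(d, V)

# Predict each t_i from its left context t_1, ..., t_{i-1}.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
hidden, _ = encoder(embed(inputs))
logits = to_logits(hidden)                   # (1, N-1, V)

# MLE minimizes the summed negative log p(t_i | t_1, ..., t_{i-1}).
nll = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="sum")
```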
Precursor Models. Dai and Le (2015) is the first work we are aware of that uses language modelling together with a sequence autoencoder to improve sequence learning with recurrent networks. Thus, it can be thought of as a precursor to modern contextual embedding methods. Pre-trained on the datasets IMDB, Rotten Tomatoes, 20 Newsgroups, and DBpedia, the model is then fine-tuned on sentiment analysis and text classification tasks, achieving strong performance compared to randomly-initialized models.
Ramachandran et al. (2016) extend Dai and Le (2015) by proposing a pre-training method to improve the accuracy of sequence-to-sequence (seq2seq) models. The encoder and decoder of the seq2seq model are initialized with the pre-trained weights of two language models. These language models are separately trained on either the News Crawl English or German corpora for machine translation, while both are initialized with the language model trained on the English Gigaword corpus for abstractive summarization. These pre-trained models are fine-tuned on the WMT English $\rightarrow$ German task and the CNN/Daily Mail corpus, respectively, achieving better results over baselines without pre-training.
The work in the following sections improves over Dai and Le (2015) and Ramachandran et al. (2016) with new architectures (e.g. the Transformer), larger datasets, and new pre-training objectives. A summary of the models and the pre-training objectives is shown in Tables 1 and 2.
ELMo. The ELMo model (Peters et al., 2018) generalizes traditional word embeddings by extracting context-dependent representations from a bidirectional language model. A forward $L$-layer LSTM and a backward $L$-layer LSTM are applied to encode the left and right contexts, respectively. At each layer $j$, the contextualized representations are the concatenation of the left-to-right and right-to-left representations, obtaining $N$ hidden representations, $(\mathbf{h}_{1,j},\mathbf{h}_{2,j},...,\mathbf{h}_{N,j})$, for a sequence of length $N$.
To use ELMo in downstream tasks, the $(L+1)$-layer representations (including the global word embedding) for each token $k$ are aggregated as:
$$
\mathrm{ELMo}_{k}^{task}=\gamma^{task}\sum_{j=0}^{L}s_{j}^{task}\mathbf{h}_{k,j},
$$
where $s_{j}^{task}$ are softmax-normalized layer-wise weights used to linearly combine the $(L+1)$-layer representations of token $k$, and $\gamma^{task}$ is a task-specific scaling constant.
Given a pre-trained ELMo, it is straightforward to incorporate it into a task-specific architecture for improving the performance. As most supervised models use global word representations $\mathbf{x}_{k}$ in their lowest layers, these representations can be concatenated with their corresponding context-dependent representations $\mathrm{ELMo}_{k}^{task}$, obtaining $[\mathbf{x}_{k};\mathrm{ELMo}_{k}^{task}]$, before feeding them to higher layers.
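A minimal sketch of this task-specific aggregation, assuming the $(L+1)$ layers of hidden states have already been computed by a frozen ELMo; the unnormalized scalars and $\gamma^{task}$ are learned jointly with the downstream model, and the class and variable names are ours.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Linearly combine the (L+1)-layer representations of each token."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))   # unnormalized s_j^{task}
        self.gamma = nn.Parameter(torch.ones(1))                # task-specific scale

    def forward(self, layer_reprs: torch.Tensor) -> torch.Tensor:
        # layer_reprs: (L+1, seq_len, dim) hidden states of one sequence
        s = torch.softmax(self.scalars, dim=0)                  # softmax-normalized weights
        mixed = (s.view(-1, 1, 1) * layer_reprs).sum(dim=0)     # weighted sum over layers
        return self.gamma * mixed                               # ELMo_k^{task} for every token k

# Concatenate with the task model's own global word embeddings x_k.
mix = ScalarMix(num_layers=3)
elmo_task = mix(torch.randn(3, 10, 256))                        # (seq_len, dim)
x = torch.randn(10, 256)
task_input = torch.cat([x, elmo_task], dim=-1)                  # [x_k; ELMo_k^{task}]
```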
Table 1: A comparison of popular pre-trained models.
| Method | Architecture | Encoder | Decoder | Objective | Dataset |
| --- | --- | --- | --- | --- | --- |
| ELMo | LSTM | √ | × | LM | 1B Word Benchmark |
| GPT | Transformer | × | √ | LM | BookCorpus |
| GPT2 | Transformer | × | √ | LM | Web pages starting from Reddit |
| BERT | Transformer | √ | × | MLM & NSP | BookCorpus & Wiki |
| RoBERTa | Transformer | √ | × | MLM | BookCorpus, Wiki, CC-News, OpenWebText, Stories |
| ALBERT | Transformer | √ | × | MLM & SOP | Same as RoBERTa and XLNet |
| UniLM | Transformer | √ | √ | LM, MLM, seq2seq LM | Same as BERT |
| ELECTRA | Transformer | √ | × | Discriminator (o/r) | Same as XLNet |
| XLNet | Transformer | × | √ | PLM | BookCorpus, Wiki, Giga5, ClueWeb, Common Crawl |
| XLM | Transformer | √ | × | CLM, MLM, TLM | Wiki, parallel corpora (e.g. MultiUN) |
| MASS | Transformer | √ | √ | Span Mask | WMT News Crawl |
| T5 | Transformer | √ | √ | Text Infilling | Colossal Clean Crawled Corpus |
| BART | Transformer | √ | √ | Text Infilling & Sent Shuffling | Same as RoBERTa |
Table 2: Pre-training objectives and their input-output formats.
| Objective | Inputs | Targets |
| --- | --- | --- |
| LM | [START] | I am happy to join with you today |
| MLM | I am [MASK] to join with you [MASK] | happy today |
| NSP | Sent1 [SEP] NextSent or Sent1 [SEP] RandomSent | NextSent / RandomSent |
| SOP | Sent1 [SEP] Sent2 or Sent2 [SEP] Sent1 | in order / reversed |
| Discriminator (o/r) | I am thrilled to study with you today | o o r o r o o o |
| PLM | happy join with | today am I to you |
| seq2seq LM | I am happy to | join with you today |
| Span Mask | I am [MASK] [MASK] [MASK] with you today | happy to join |
| Text Infilling | I am [MASK] with you today | happy to join |
| Sent Shuffling | today you am I join with happy to | I am happy to join with you today |
| TLM | How [MASK] you [SEP] [MASK] vas-tu | are Comment |
The effectiveness of ELMo is evaluated on six NLP problems, including question answering, textual entailment and sentiment analysis.
GPT, GPT2, and Grover. GPT (Radford et al., 2018) adopts a two-stage learning paradigm: (a) unsupervised pre-training using a language modelling objective and (b) supervised fine-tuning. The goal is to learn universal representations transferable to a wide range of downstream tasks. To this end, GPT uses the BookCorpus dataset (Zhu et al., 2015), which contains more than 7,000 books from various genres, for training the language model. The Transformer architecture (Vaswani et al., 2017) is used to implement the language model, which has been shown to better capture global dependencies from the inputs compared to its alternatives, e.g. recurrent networks, and perform strongly on a range of sequence learning tasks, such as machine translation (Vaswani et al., 2017) and document generation (Liu et al., 2018). To use GPT on inputs with multiple sequences during fine-tuning, GPT applies task-specific input adaptations motivated by traversal-style approaches (Rocktäschel et al., 2015). These approaches pre-process each text input as a single contiguous sequence of tokens through special tokens including [START] (the start of a sequence), [DELIM] (delimiting two sequences from the text input) and [EXTRACT] (the end of a sequence). GPT outperforms task-specific architectures in 9 out of 12 tasks studied with a pre-trained Transformer.
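The helper below is a rough sketch of this traversal-style input adaptation; the exact token strings and the function name are illustrative rather than GPT's actual preprocessing code.

```python
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"

def format_for_gpt(sequences):
    """Join one or more text sequences into a single contiguous token stream."""
    body = f" {DELIM} ".join(sequences)
    return f"{START} {body} {EXTRACT}"

# A single-sequence task (e.g. classification) and a sequence-pair task (e.g. entailment):
print(format_for_gpt(["The movie was great."]))
print(format_for_gpt(["A man is playing guitar.", "A person is making music."]))
```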
GPT2 (Radford et al., 2019) mainly follows the architecture of GPT and trains a language model on a dataset as large and diverse as possible to learn from varied domains and contexts. To do so, Radford et al. (2019) create a new dataset of millions of web pages named WebText, by scraping outbound links from Reddit. The authors argue that a language model trained on large-scale unlabelled corpora begins to learn some common supervised NLP tasks, such as question answering, machine translation and summarization, without any explicit supervision signal. To validate this, GPT2 is tested on ten datasets (e.g. Children's Book Test (Hill et al., 2015), LAMBADA (Paperno et al., 2016) and CoQA (Reddy et al., 2019)) in a zero-shot setting. GPT2 performs strongly on some tasks. For instance, when conditioned on a document and questions, GPT2 reaches an F1-score of 55 on the CoQA dataset without using any labelled training data. This matches or outperforms the performance of 3 out of 4 baseline systems. As GPT2 divides texts into bytes and uses BPE (Sennrich et al., 2016) to build up its vocabulary (instead of using characters or words, as in previous work), it is unclear if the improved performance comes from the model or the new input representation.
Grover (Zellers et al., 2019) creates a news dataset, RealNews, from Common Crawl and pre-trains a language model for generating realistic-looking fake news that is conditioned on metadata including domains, dates, authors and headlines. They further study discriminators that can be used to detect fake news. The best defense against Grover turns out to be Grover itself, which sheds light on the importance of releasing trained models for detecting fake news.
BERT. ELMo (Peters et al., 2018) concatenates representations from the forward and backward LSTMs without considering the interactions between the left and right contexts. GPT (Radford et al., 2018) and GPT2 (Radford et al., 2019) use a left-to-right decoder, where every token can only attend to its left context. These architectures are sub-optimal for sentence-level tasks, e.g. named entity recognition and sentiment analysis, as it is crucial to incorporate contexts from both directions.
BERT proposes a masked language modelling (MLM) objective, where some of the tokens of an input sequence are randomly masked, and the objective is to predict these masked positions taking the corrupted sequence as input. BERT applies a Transformer encoder to attend to bi-directional contexts during pre-training. In addition, BERT uses a next-sentence-prediction (NSP) objective. Given two input sentences, NSP predicts whether the second sentence is the actual next sentence of the first sentence. The NSP objective aims to improve tasks, such as question answering and natural language inference, which require reasoning over sentence pairs.
Similar to GPT, BERT uses special tokens to obtain a single contiguous sequence for each input sequence. Specifically, the first token is always a special classification token [CLS], and sentence pairs are separated using a special token [SEP]. BERT adopts a pre-training followed by fine-tuning scheme. The final hidden state of [CLS] is used for sentence-level tasks and the final hidden state of each token is used for token-level tasks. BERT obtains new state-of-the-art results on eleven natural language processing tasks, e.g. improving the GLUE (Wang et al., 2018) score to $80.5\%$.
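The sketch below shows, under simplifying assumptions (whitespace tokenization, uniform masking without BERT's 80/10/10 replacement rule), how an MLM training example with [CLS] and [SEP] markers could be constructed; it is not BERT's actual data pipeline.

```python
import random

def make_mlm_example(sent_a, sent_b, mask_prob=0.15, seed=0):
    """Build a [CLS] A [SEP] B [SEP] sequence and mask some tokens for prediction."""
    random.seed(seed)
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    targets = {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:
            targets[i] = tok            # the model must recover this token
            tokens[i] = "[MASK]"
    return tokens, targets

tokens, targets = make_mlm_example("I am happy to join with you today",
                                   "we can change the world")
# `targets` maps masked positions to their original tokens; an NSP label would
# additionally record whether sentence B really follows sentence A.
```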
Similar to GPT2, it is unclear exactly why BERT improves over prior work as it uses different objectives, datasets (Wikipedia and BookCorpus) and architectures compared to previous methods. For partial insight on this, we refer the readers to (Raffel et al., 2019) for a controlled comparison between unidirectional and bidirectional models, traditional language modelling and masked language modelling using the same datasets.
BERT variants. Recent work further studies and improves the objective and architecture of BERT.
Instead of randomly masking tokens, ERNIE (Sun et al., 2019b) incorporates knowledge masking strategies, including entity-level masking and phrase-level masking. ERNIE 2.0 (Sun et al., 2019c) further incorporates more pre-training tasks, such as semantic closeness and discourse relations. SpanBERT (Joshi et al., 2019) generalizes ERNIE to mask random spans, without referring to external knowledge. StructBERT (Wang et al., 2019b) proposes a word structural objective that randomly permutes the order of 3-grams for reconstruction and a sentence structural objective that predicts the order of two consecutive segments.
RoBERTa (Liu et al., 2019c) makes a few changes to the released BERT model and achieves substantial improvements. The changes include: (1) Training the model longer with larger batches and more data; (2) Removing the NSP objective; (3) Training on longer sequences; (4) Dynamically changing the masked positions during pre-training.
ALBERT (Lan et al., 2019) proposes two parameter-reduction techniques (factorized embedding parameterization and cross-layer parameter sharing) to lower memory consumption and speed up training. Furthermore, ALBERT argues that the NSP objective lacks difficulty: because the negative examples are created by pairing segments from different documents, it mixes topic prediction and coherence prediction into a single task.
ALBERT instead uses a sentence-order prediction (SOP) objective. SOP obtains positive examples by taking out two consecutive segments and negative examples by reversing the order of two consecutive segments from the same document.
XLNet. The XLNet model (Yang et al., 2019) identifies two weaknesses of BERT:
- BERT assumes conditional independence of corrupted tokens. For instance, to model the probability $p(t_{2}=\text{cat}, t_{6}=\text{mat} \mid t_{1}=\text{The}, t_{2}=[\text{MASK}], t_{3}=\text{sat}, t_{4}=\text{on}, t_{5}=\text{the}, t_{6}=[\text{MASK}])$, BERT factorizes it as $p(t_{2}=\text{cat}\mid\ldots)\,p(t_{6}=\text{mat}\mid\ldots)$, where $t_{2}$ and $t_{6}$ are assumed to be conditionally independent.
- The symbols such as [MASK] are introduced by BERT during pre-training, yet they never occur in real data, resulting in a discrepancy between pre-training and fine-tuning.
XLNet proposes a new auto-regressive method based on permutation language modelling (PLM) (Uria et al., 2016) without introducing any new symbols. Its MLE objective is calculated as:
$$
\operatorname*{max}_ {\theta}\mathbb{E}_ {\mathbf{z}\in Z_ {N}}\left[\sum_ {j=1}^{N}\log p_ {\theta}(t_ {z_ {j}}|t_ {z_ {1}},t_ {z_ {2}},...,t_ {z_ {j-1}})\right].
$$
For each sequence, XLNet samples a permutation order $\mathbf{z}=[z_{1},z_{2},...,z_{N}]$ from the set of all permutations $Z_{N}$, where $|Z_{N}|=N!$. The probability of the sequence is factorized according to $\mathbf{z}$, where the $z_{j}$-th token $t_{z_{j}}$ is conditioned on all the previous tokens $t_{z_{1}},t_{z_{2}},...,t_{z_{j-1}}$ according to the permutation order $\mathbf{z}$.
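To make the permutation objective concrete, the sketch below samples one factorization order $\mathbf{z}$ and lists the conditioning set each prediction uses; the model, scoring, and two-stream attention are omitted, and all names are ours.

```python
import random

def permutation_factorization(tokens, seed=0):
    """Sample a permutation order z and show the left context of each prediction."""
    random.seed(seed)
    z = list(range(len(tokens)))
    random.shuffle(z)                               # z = [z_1, ..., z_N]
    steps = []
    for j, pos in enumerate(z):
        context = [tokens[p] for p in z[:j]]        # t_{z_1}, ..., t_{z_{j-1}}
        steps.append((tokens[pos], context))
    return z, steps

z, steps = permutation_factorization("I am happy to join".split())
for target, context in steps:
    print(f"p({target!r} | {context})")
```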
XLNet further adopts two-stream self-attention and Transformer-XL (Dai et al., 2019) to take into account the target positions $z_{j}$ and learn long-range dependencies, respectively.
As the cardinality of $Z_ {N}$ is factorial, naive optimization would be challenging. Thus, XLNet conditions on part of the input and generates the rest of the input to reduce the scale of the search space:
$$
\operatorname*{max}_ {\theta}\mathbb{E}_ {\mathbf{z}\in Z_ {N}}\left[\sum_ {j=c+1}^{N}\log p_ {\theta}(t_ {z_ {j}}|t_ {z_ {1}},t_ {z_ {2}},...,t_ {z_ {j-1}})\right],
$$
where $c$ is the cutting point of the sequence. However, it is tricky to compare XLNet directly with BERT due to the multiple changes in loss and architecture.
UniLM. UniLM (Dong et al., 2019) adopts three objectives: (a) language modelling, (b) masked language modelling, and (c) sequence-to-sequence language modelling (seq2seq LM), for pre-training a Transformer network. To implement three objectives in a single network, UniLM utilizes specific self-attention masks to control what context the prediction conditions on. For example, MLM can attend to its bidirectional contexts, while seq2seq LM can attend to bidirectional contexts for source sequences and left contexts only for target sequences.
ELECTRA. Compared to BERT, ELECTRA (Clark et al., 2019) proposes a more effective pre-training method. Instead of corrupting some positions of inputs with [MASK], ELECTRA replaces some tokens of the inputs with their plausible alternatives sampled from a small generator network. ELECTRA trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can then be used in downstream tasks for fine-tuning, improving upon the pre-trained representation learned by the generator.
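A toy illustration of ELECTRA's training signal, assuming the generator's replacements are already given: the discriminator predicts an original/replaced label for every token of the corrupted input. The hand-picked replacements mirror the example in Table 2; the code itself is ours.

```python
original = "I am happy to join with you today".split()
corrupted = "I am thrilled to study with you today".split()   # sampled from the generator

# Discriminator targets: 'o' if the token is original, 'r' if it was replaced.
labels = ["o" if o == c else "r" for o, c in zip(original, corrupted)]
print(list(zip(corrupted, labels)))
# e.g. [('I', 'o'), ('am', 'o'), ('thrilled', 'r'), ('to', 'o'), ('study', 'r'), ...]
```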
MASS. Although BERT achieves state-of-the-art performance for many natural language understanding tasks, BERT cannot be easily used for natural language generation. MASS (Song et al., 2019) uses masked sequences to pre-train sequence-to-sequence models. More specifically, MASS adopts an encoder-decoder framework and extends the MLM objective. The encoder takes as input a sequence where consecutive tokens are masked and the decoder predicts these masked consecutive tokens autoregressively. MASS achieves significant improvements over baselines without pre-training or with other pre-training methods on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation.
T5. Raffel et al. (2019) propose T5 (Text-to-Text Transfer Transformer), unifying natural language understanding and generation by converting the data into a text-to-text format and applying an encoder-decoder framework.
T5 introduces a new pre-training dataset, the Colossal Clean Crawled Corpus, built by cleaning web pages from Common Crawl. T5 also systematically compares previous methods in terms of pre-training objectives, architectures, pre-training datasets, and transfer approaches. T5 adopts a text infilling objective (where spans of text are replaced with a single mask token), longer training, multi-task pre-training on GLUE or SuperGLUE, fine-tuning on each individual GLUE and SuperGLUE task, and beam search. ERNIE-GEN (Xiao et al., 2020) is another work using text infilling, where the tokens of each masked span are generated non-autoregressively.
For fine-tuning, to convert the input data into a text-to-text framework, T5 utilizes the token vocabulary of the decoder as the prediction labels. For example, the tokens “entailment”, “contradiction”, and “neutral” are used as the labels for natural language inference tasks. For regression tasks (e.g. STS-B (Cer et al., 2017)), T5 simply rounds the scores to the nearest multiple of 0.2 and converts the results to literal string representations (e.g. 2.57 is converted to the string “2.6”). T5 also adds a task-specific prefix to each input sequence to specify its task. For instance, T5 adds the prefix “translate English to German” to each input sequence like “That is good.” for English-to-German translation datasets.
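The sketch below illustrates the two text-to-text conversions just described: prepending a task prefix and turning an STS-B similarity score into a string target. The prefix string follows the paper's example; the helper names are ours.

```python
def add_task_prefix(task_prefix, text):
    """Prefix every input so a single text-to-text model can route between tasks."""
    return f"{task_prefix}: {text}"

def stsb_score_to_target(score):
    """Round a similarity score to the nearest multiple of 0.2 and stringify it."""
    return f"{round(score * 5) / 5:.1f}"

print(add_task_prefix("translate English to German", "That is good."))
print(stsb_score_to_target(2.57))   # -> "2.6"
```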
3.2 Supervised Objectives
Pre-training on the ImageNet dataset (which has supervision about the objects in images) before fine-tuning on downstream tasks has become the de facto standard in the computer vision community. Motivated by the success of supervised pre-training in computer vision, some work (Conneau et al., 2017; McCann et al., 2017; Subramanian et al., 2018) utilizes data-rich tasks in NLP to learn transferable representations.
CoVe (McCann et al., 2017) shows that the representations learned from machine translation are transferable to downstream tasks. CoVe uses a deep LSTM encoder from a sequence-to-sequence model trained for machine translation to obtain contextual embeddings. Empirical results show that augmenting non-contextualized word representations (Mikolov et al., 2013; Pennington et al., 2014) with CoVe embeddings improves performance over a wide variety of common NLP tasks, such as sentiment analysis, question classification, entailment, and question answering. InferSent (Conneau et al., 2017) obtains contextualized representations from a pre-trained natural language inference model on SNLI. Subramanian et al. (2018) use multi-task learning to pre-train a sequence-to-sequence model for obtaining general representations, where the tasks include skip-thought (Kiros et al., 2015), machine translation, constituency parsing, and natural language inference.
BART. The BART model (Lewis et al., 2019) introduces additional noising functions beyond MLM for pre-training sequence-to-sequence models. First, the input sequence is corrupted using an arbitrary noising function. Then, the corrupted input is reconstructed by a Transformer network trained using teacher forcing (Williams and Zipser, 1989). BART evaluates a wide variety of noising functions, including token masking, token deletion, text infilling, document rotation, and sentence shuffling (randomly shuffling the word order of a sentence). The best performance is achieved by using both sentence shuffling and text infilling. BART matches the performance of RoBERTa on GLUE and SQuAD and achieves state-of-the-art performance on a variety of text generation tasks.
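A rough sketch of the two best-performing noising functions, following the descriptions above: text infilling replaces a contiguous span with a single [MASK], and sentence shuffling permutes the word order that the decoder must restore. BART samples span lengths from a Poisson distribution, which this toy version replaces with a fixed length; all helper names are ours.

```python
import random

def text_infilling(tokens, span_len=3, seed=0):
    """Replace one contiguous span of tokens with a single [MASK] token."""
    random.seed(seed)
    start = random.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + ["[MASK]"] + tokens[start + span_len:]

def sentence_shuffling(tokens, seed=0):
    """Randomly permute the word order; the decoder reconstructs the original."""
    random.seed(seed)
    shuffled = tokens[:]
    random.shuffle(shuffled)
    return shuffled

clean = "I am happy to join with you today".split()
print(text_infilling(clean))     # a sampled span collapses into a single [MASK]
print(sentence_shuffling(clean))
```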
4 Cross-lingual Polyglot Pre-training for Contextual Embeddings
Cross-lingual polyglot pre-training aims to learn joint multi-lingual representations, enabling knowledge transfer from data-rich languages like English to data-scarce languages like Romanian. Based on whether joint training and a shared vocabulary are used, we divide previous work into three categories.
Joint training & shared vocabulary. Artetxe and Schwenk (2019) use a BiLSTM encoder-decoder framework with a shared BPE vocabulary for 93 languages. The framework is pre-trained using parallel corpora, including Europarl and Tanzil. The contextual embeddings from the encoder are used to train classifiers using English corpora for downstream tasks. As the embedding space and the encoder are shared, the resultant classifiers can be transferred to any of the 93 languages without further modification. Experiments show that these classifiers achieve competitive performance on cross-lingual natural language inference, cross-lingual document classification, and parallel corpus mining.
Rosita (Mulcaire et al., 2019) pre-trains a language model using text from different languages, showing the benefits of polyglot learning on low-resource languages.
Recently, the authors of BERT developed a multi-lingual BERT which is pre-trained on Wikipedia dumps covering more than 100 languages.
XLM (Lample and Conneau, 2019) uses three pre-training methods for learning cross-lingual language models: (1) Causal language modelling, where the model is trained to predict $p(t_{i}|t_{1},t_{2},...,t_{i-1})$, (2) Masked language modelling, and (3) Translation language modelling (TLM). Parallel corpora are used, and tokens in both source and target sequences are masked for learning cross-lingual association. XLM performs strongly on cross-lingual classification, unsupervised machine translation, and supervised machine translation. XLM-R (Conneau et al., 2019) scales up XLM by training a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered Common Crawl data. XLM-R shows that large-scale multi-lingual pre-training leads to significant performance gains for a wide range of cross-lingual transfer tasks.
Joint training & separate vocabularies. Wu et al. (2019) study the emergence of cross-lingual structures in pre-trained multi-lingual language models. It is found that cross-lingual transfer is possible even when there is no shared vocabulary across the monolingual corpora, and there are universal latent symmetries in the embedding spaces of different languages.
Separate training & separate vocabularies. Artetxe et al. (2019) use a four-step method for obtaining multi-lingual embeddings. Suppose we have the monolingual sequences of two languages $L_{1}$ and $L_{2}$: (1) Pre-training BERT with the vocabulary of $L_{1}$ using $L_{1}$'s monolingual data. (2) Replacing the vocabulary of $L_{1}$ with the vocabulary of $L_{2}$ and training new vocabulary embeddings, while freezing the other parameters, using $L_{2}$'s monolingual data. (3) Fine-tuning the BERT model for a downstream task using labeled data in $L_{1}$, while freezing $L_{1}$'s vocabulary embeddings. (4) Swapping $L_{2}$'s vocabulary embeddings into the fine-tuned BERT for zero-shot transfer to $L_{2}$.
5 Downstream Learning
Once learned, contextual embeddings have demonstrated impressive performance when used downstream on various learning problems. Here we describe the ways in which contextual embeddings are used downstream, the ways in which one can avoid forgetting information in the embeddings during downstream learning, and how they can be specialized to multiple learning tasks.
5.1 Ways to Use Contextual Embeddings Downstream
There are three main ways to use pre-trained contextual embeddings in downstream tasks: (1) Feature-based methods, (2) Fine-tuning methods, and (3) Adapter methods.
Feature-based. One example of a feature-based method is ELMo (Peters et al., 2018). Specifically, as shown in equation 2, ELMo freezes the weights of the pre-trained contextual embedding model and forms a linear combination of its internal representations. The linearly-combined representations are then used as features for task-specific architectures. The benefit of feature-based models is that they can use state-of-the-art handcrafted architectures for specific tasks.
Fine-tuning. Fine-tuning works as follows: starting with the weights of the pre-trained contextual embedding model, fine-tuning makes small adjustments to them in order to specialize them to a specific downstream task. One stream of work applies minimal changes to pre-trained models to take full advantage of their parameters. The most straightforward way is adding linear layers on top of the pre-trained models (Devlin et al., 2018; Lan et al., 2019). Another method (Radford et al., 2019; Raffel et al., 2019) uses universal data formats without introducing new parameters for downstream tasks.
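A minimal sketch of the most straightforward fine-tuning setup: a randomly initialized linear layer on top of a pre-trained encoder, with all parameters updated jointly on the downstream task. The `encoder` argument is a placeholder for any pre-trained model from Table 1 that returns per-token hidden states.

```python
import torch
import torch.nn as nn

class FineTuningClassifier(nn.Module):
    """Pre-trained encoder plus a new linear layer, fine-tuned end to end."""
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder                                 # loaded from pre-trained weights
        self.classifier = nn.Linear(hidden_dim, num_labels)    # randomly initialized

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(token_ids)                       # (batch, seq_len, hidden_dim)
        return self.classifier(hidden[:, 0])                   # classify from the first ([CLS]) position
```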
To apply pre-trained models to structurally different tasks, where task-specific architectures are used, as much of the model is initialized with pre-trained weights as possible. For instance, XLM (Lample and Conneau, 2019) applies two pre-trained monolingual language models to initialize the encoder and the decoder for machine translation, respectively, leaving only the cross-attention weights randomly initialized.
Adapters. Adapters (Rebuffi et al., 2017; Stickland and Murray, 2019) are small modules added between layers of pre-trained models to be trained in a multi-task learning setting. The parameters of the pre-trained model are fixed while tuning these adapter modules. Compared to previous work that fine-tunes a separate pre-trained model for each task, a model with shared adapters for all tasks often requires fewer parameters.
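A sketch of a bottleneck adapter in the spirit of the work cited above: a small down- and up-projection with a residual connection, inserted between the layers of a frozen pre-trained model. The dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted between frozen pre-trained layers."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter adds a small learned correction
        # to the output of the frozen layer below it.
        return hidden + self.up(torch.relu(self.down(hidden)))

# Only the adapters (and task heads) are updated; the pre-trained weights stay fixed, e.g.:
# for p in pretrained_model.parameters():
#     p.requires_grad = False
```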
5.2 Countering Catastrophic Forgetting
Learning on downstream tasks is prone to overwrite the information from pre-trained models, which is widely known as catastrophic forgetting (McCloskey and Cohen, 1989; d'Autume et al., 2019). Previous work combats this by (1) Freezing layers, (2) Using adaptive learning rates, and (3) Regularization.
Freezing layers. Motivated by layer-wise training of neural networks (Hinton et al., 2006), training certain layers while freezing others can potentially reduce forgetting during fine-tuning. Different layer-wise tuning schedules have been studied. Long et al. (2015) freeze all layers except the top layer. Felbo et al. (2017) use “chain-thaw”, which sequentially unfreezes and fine-tunes a layer at a time. Howard and Ruder (2018) gradually unfreeze all layers one by one from top to bottom. Chronopoulou et al. (2019) apply a three-stage fine-tuning schedule: (a) randomly-initialized parameters are updated for $n$ epochs, (b) the pre-trained parameters (except word embeddings) are then fine-tuned, (c) at last, all parameters are fine-tuned.
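The sketch below illustrates two of these schedules for a generic layered model: training only the top layer, and gradually unfreezing one layer per epoch from top to bottom. The `model.layers` attribute is an assumption for illustration, not a specific library API.

```python
def freeze_all_but_top(model):
    """Train only the top layer (cf. Long et al., 2015)."""
    for layer in model.layers[:-1]:
        for p in layer.parameters():
            p.requires_grad = False

def gradually_unfreeze(model, epoch):
    """Unfreeze one more layer, from top to bottom, at each epoch (cf. Howard and Ruder, 2018)."""
    num_layers = len(model.layers)
    for i, layer in enumerate(model.layers):
        trainable = i >= num_layers - 1 - epoch      # the top `epoch + 1` layers are trainable
        for p in layer.parameters():
            p.requires_grad = trainable
```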
Adaptive learning rates. Another method to mitigate catastrophic forgetting is to use adaptive learning rates. As it is believed that the lower layers of pre-trained models tend to capture general language knowledge (Tenney et al., 2019a), Howard and Ruder (2018) use lower learning rates for lower layers when fine-tuning.
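A sketch of per-layer learning rates implemented with optimizer parameter groups, assigning smaller rates to lower layers; the decay factor, base rate, and `model.layers` attribute are all assumptions for illustration.

```python
import torch

def layerwise_learning_rates(model, base_lr=2e-5, decay=0.9):
    """Lower layers get geometrically smaller learning rates than higher layers."""
    num_layers = len(model.layers)
    groups = []
    for i, layer in enumerate(model.layers):
        lr = base_lr * (decay ** (num_layers - 1 - i))   # lower layer => smaller lr
        groups.append({"params": list(layer.parameters()), "lr": lr})
    return torch.optim.AdamW(groups)
```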
Regularization. Regularization limits