BERTweet: A pre-trained language model for English Tweets
Abstract
We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as $\mathrm{BERT_{base}}$ (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at: https://github.com/VinAIResearch/BERTweet.
1 Introduction
The language model BERT (Devlin et al., 2019)—the Bidirectional Encoder Representations from Transformers (Vaswani et al., 2017)—and its variants have successfully helped produce new state-of-the-art performance results for various NLP tasks. Their success has largely covered the common English domains such as Wikipedia, news and books. For specific domains such as biomedical or scientific text, we could retrain a domain-specific model using the BERTology architecture (Beltagy et al., 2019; Lee et al., 2019; Gururangan et al., 2020).
Twitter has been one of the most popular microblogging platforms where users can share real-time information related to all kinds of topics and events. The enormous and plentiful Tweet data has been proven to be a widely-used and real-time source of information in various important analytic tasks (Ghani et al., 2019). Note that the characteristics of Tweets are generally different from those of traditional written text such as Wikipedia and news articles, due to the typically short length of Tweets and the frequent use of informal grammar as well as irregular vocabulary, e.g. abbreviations, typographical errors and hashtags (Eisenstein, 2013; Han et al., 2013). This poses a challenge in applying existing language models, pre-trained on large-scale conventional text corpora with formal grammar and regular vocabulary, to text analytic tasks on Tweet data. To the best of our knowledge, there is no existing language model pre-trained on a large-scale corpus of English Tweets.
To fill the gap, we train the first large-scale language model for English Tweets, using an 80GB corpus of 850M English Tweets. Our model uses the $\mathrm{BERT_{base}}$ model configuration and is trained following the RoBERTa pre-training procedure (Liu et al., 2019). We evaluate our model and compare it with strong competitors, i.e. $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ (Conneau et al., 2020), on three downstream Tweet NLP tasks: Part-of-speech (POS) tagging, Named-entity recognition (NER) and text classification. Experiments show that our model outperforms $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ as well as the previous state-of-the-art (SOTA) models on all these tasks. Our contributions are as follows:
• We present the first large-scale pre-trained language model for English Tweets.
• Our model outperforms its competitors $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ as well as the previous SOTA models on the three downstream Tweet NLP tasks of POS tagging, NER and text classification, thus confirming the effectiveness of a large-scale, domain-specific language model pre-trained for English Tweets.
• We also provide the first set of experiments investigating whether a commonly used approach of applying lexical normalization dictionaries to Tweets (Han et al., 2012) helps improve the performance of the pre-trained language models on the downstream tasks.
• We publicly release our model under the name BERTweet, which can be used with fairseq (Ott et al., 2019) and transformers (Wolf et al., 2019); a minimal loading sketch follows this list. We hope that BERTweet can serve as a strong baseline for future research and applications in Tweet analytic tasks.
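As a usage illustration, the released checkpoint can be loaded through the transformers library. The sketch below is indicative only: the Hugging Face model identifier vinai/bertweet-base and the example Tweet are assumptions, not details given in this paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name assumed to be "vinai/bertweet-base"; adjust it to whatever
# identifier the BERTweet repository actually publishes.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")

# Encode one (already soft-normalized) Tweet and extract contextual features.
tweet = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER"
inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, sequence_length, 768)
```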
2 BERTweet
In this section, we outline the architecture and describe the pre-training data and optimization setup that we use for BERTweet.
Architecture
Our BERTweet uses the same architecture as $\mathrm{BERT_{base}}$, which is trained with a masked language modeling objective (Devlin et al., 2019). The BERTweet pre-training procedure is based on RoBERTa (Liu et al., 2019), which optimizes the BERT pre-training approach for more robust performance. Given the widespread usage of BERT and RoBERTa, we do not detail the architecture here; see Devlin et al. (2019) and Liu et al. (2019) for more details.
Pre-training data
We use an 80GB pre-training dataset of uncompressed texts, containing 850M Tweets (16B word tokens). Here, each Tweet consists of at least 10 and at most 64 word tokens. In particular, this dataset is a concatenation of two corpora:
• We first download the general Twitter Stream grabbed by the Archive Team, containing 4TB of Tweet data streamed from 01/2012 to 08/2019 on Twitter. To identify English Tweets, we employ the language identification component of fastText (Joulin et al., 2017). We tokenize those English Tweets using “TweetTokenizer” from the NLTK toolkit (Bird et al., 2009) and use the emoji package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalize the Tweets by converting user mentions and web/url links into the special tokens @USER and HTTPURL, respectively (see the pre-processing sketch after this list). We filter out retweeted Tweets and the ones shorter than 10 or longer than 64 word tokens. This pre-processing results in the first corpus of 845M English Tweets.
• We also stream Tweets related to the COVID-19 pandemic, available from 01/2020 to 03/2020. We apply the same data pre-processing steps as described above, thus resulting in the second corpus of 5M English Tweets.
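The pre-processing described in the list above can be approximated with off-the-shelf components. The following is a minimal sketch, not the authors' exact pipeline: the fastText language-identification model file lid.176.bin, the retweet heuristic and the regular expressions are assumptions.

```python
import re
import fasttext                             # pip install fasttext
from nltk.tokenize import TweetTokenizer    # pip install nltk
from emoji import demojize                  # pip install emoji

# The language-ID model file is an assumption; lid.176.bin must be downloaded separately.
lang_id = fasttext.load_model("lid.176.bin")
tokenizer = TweetTokenizer()

def normalize_tweet(text: str):
    """Return the normalized token list, or None if the Tweet is filtered out."""
    # Keep English Tweets only (fastText predicts labels such as "__label__en").
    labels, _ = lang_id.predict(text.replace("\n", " "))
    if labels[0] != "__label__en":
        return None
    # The paper also drops retweets; a simple heuristic is to skip Tweets starting with "RT".
    if text.startswith("RT "):
        return None
    tokens = tokenizer.tokenize(text)
    normalized = []
    for tok in tokens:
        if tok.startswith("@") and len(tok) > 1:
            normalized.append("@USER")                 # user mentions
        elif re.match(r"(https?://|www\.)\S+", tok):
            normalized.append("HTTPURL")               # web/url links
        else:
            normalized.append(demojize(tok))           # emotion icons -> text strings
    # Keep Tweets with at least 10 and at most 64 word tokens.
    return normalized if 10 <= len(normalized) <= 64 else None
```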
We then apply fastBPE (Sennrich et al., 2016) to segment all 850M Tweets with subword units, using a vocabulary of 64K subword types. On average there are 25 subword tokens per Tweet.
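The paper segments Tweets with fastBPE; as an illustrative stand-in, a comparable 64K-subword BPE vocabulary can be learned with the Hugging Face tokenizers library. The training file name tweets.normalized.txt and the special-token inventory are assumptions.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Learn a BPE vocabulary of 64K subword types over the normalized Tweets
# (one Tweet per line in "tweets.normalized.txt" -- a hypothetical file name).
bpe_tokenizer = Tokenizer(BPE(unk_token="<unk>"))
bpe_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=64000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa-style specials (assumed)
)
bpe_tokenizer.train(files=["tweets.normalized.txt"], trainer=trainer)

# Segment one normalized Tweet into subword units.
print(bpe_tokenizer.encode("@USER this is sooo awesome :red_heart: HTTPURL").tokens)
```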
Optimization
We utilize the RoBERTa implementation in the fairseq library (Ott et al., 2019). We set a maximum sequence length of 128, thus generating $850\mathrm{M} \times 25 / 128 \approx 166\mathrm{M}$ sequence blocks. Following Liu et al. (2019), we optimize the model using Adam (Kingma and Ba, 2014), and use a batch size of 7K across 8 V100 GPUs (32GB each) and a peak learning rate of 0.0004. We pre-train BERTweet for 40 epochs in about 4 weeks (here, we use the first 2 epochs for warming up the learning rate), equivalent to $166\mathrm{M} \times 40 / 7\mathrm{K} \approx 950\mathrm{K}$ training steps.
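A quick arithmetic check of the pre-training scale reported above, with the values taken directly from the text:

```python
# Sanity check of the sequence-block and training-step estimates.
num_tweets = 850_000_000        # 850M Tweets
subwords_per_tweet = 25         # average subword tokens per Tweet
max_seq_len = 128               # maximum sequence length
batch_size = 7_000              # sequences per batch
epochs = 40

sequence_blocks = num_tweets * subwords_per_tweet / max_seq_len
training_steps = sequence_blocks * epochs / batch_size
print(f"~{sequence_blocks / 1e6:.0f}M sequence blocks")   # ~166M
print(f"~{training_steps / 1e3:.0f}K training steps")     # ~949K, i.e. roughly 950K
```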
3 Experimental setup
We evaluate and compare the performance of BERTweet with strong baselines on three downstream NLP tasks of POS tagging, NER and text classification, using benchmark Tweet datasets.
Downstream task datasets
For POS tagging, we use three datasets: Ritter11-T-POS (Ritter et al., 2011), ARK-Twitter (Gimpel et al., 2011; Owoputi et al., 2013) and TWEEBANK-V2 (Liu et al., 2018). For NER, we employ datasets from the WNUT16 NER shared task (Strauss et al., 2016) and the WNUT17 shared task on novel and emerging entity recognition (Derczynski et al., 2017). For text classification, we employ the 3-class sentiment analysis dataset from the SemEval2017 Task 4A (Rosenthal et al., 2017) and the 2-class irony detection dataset from the SemEval2018 Task 3A (Van Hee et al., 2018).
For Ritter11-T-POS, we employ a 70/15/15 training/validation/test pre-split available from Gui et al. (2017). ARK-Twitter contains two files, daily547.conll and oct27.conll, in which oct27.conll is further split into the files oct27.traindev and oct27.test. Following Owoputi et al. (2013) and Gui et al. (2017), we employ daily547.conll as the test set. In addition, we use oct27.traindev and oct27.test as training and validation sets, respectively. For the TWEEBANK-V2, WNUT16 and WNUT17 datasets, we use their existing training/validation/test split. The SemEval2017-Task4A and SemEval2018-Task3A datasets are provided with training and test sets only (i.e. there is no standard validation split), thus we sample 10% of the training set for validation and use the remaining 90% for training.
We apply a “soft” normalization strategy to all of the experimental datasets by translating word tokens of user mentions and web/url links into the special tokens @USER and HTTPURL, respectively, and converting emotion icon tokens into corresponding strings. We also apply a “hard” strategy by further applying lexical normalization dictionaries (Aramaki, 2010; Liu et al., 2012; Han et al., 2012) to normalize word tokens in Tweets.
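The “hard” strategy is essentially a dictionary lookup on top of the soft-normalized tokens. A minimal sketch, assuming the lexical normalization resources have been loaded into a plain Python mapping (the tiny inline dictionary below is only illustrative, not the actual Han et al. (2012) resource):

```python
def hard_normalize(tokens, lexnorm):
    """Map non-standard word tokens to canonical forms; leave everything else unchanged."""
    # Tokens are assumed to have already gone through the "soft" step
    # (@USER, HTTPURL, emotion icons converted to strings).
    return [lexnorm.get(tok.lower(), tok) for tok in tokens]

# Tiny illustrative mapping; a real run would load the full normalization dictionaries instead.
lexnorm = {"u": "you", "r": "are", "gr8": "great", "2morrow": "tomorrow"}
print(hard_normalize("u r gr8 @USER HTTPURL".split(), lexnorm))
# ['you', 'are', 'great', '@USER', 'HTTPURL']
```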
Fine-tuning
Following Devlin et al. (2019), for POS tagging and NER, we append a linear prediction layer on top of the last Transformer layer of BERTweet with regard to the first subword of each word token, while for text classification we append a linear prediction layer on top of the pooled output.
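A minimal sketch of these two prediction heads using the standard transformers head classes; the checkpoint name vinai/bertweet-base and the num_labels values are placeholders, and in common practice subwords other than the first one of each word are excluded from the token-level loss (e.g. by assigning them the ignore label -100):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

# POS tagging / NER: a linear prediction layer over the last Transformer layer;
# only the first subword of each word token carries the word's label.
tagger = AutoModelForTokenClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=17   # placeholder tag-set size
)

# Text classification: a linear prediction layer over the pooled output.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3    # e.g. 3-class sentiment analysis
)
```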
We employ the transformers library (Wolf et al., 2019) to independently fine-tune BERTweet for each task and each dataset for 30 training epochs. We use AdamW (Loshchilov and Hutter, 2019) with a fixed learning rate of 1e-5 and a batch size of 32 (Liu et al., 2019). We compute the task performance after each training epoch on the validation set (here, we apply early stopping when no improvement is observed after 5 continuous epochs), and select the best model checkpoint to compute the performance score on the test set.
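The fine-tuning procedure can be summarized in a skeleton loop. This is a simplified sketch under the hyper-parameters stated above, not the authors' script; train_loader, val_loader and evaluate are placeholders that the caller must supply.

```python
from torch.optim import AdamW

def fine_tune(model, train_loader, val_loader, evaluate, max_epochs=30, patience=5):
    """Fine-tune with a fixed learning rate of 1e-5 (batching handled by the loaders)
    and early stopping when the validation score does not improve for `patience` epochs."""
    optimizer = AdamW(model.parameters(), lr=1e-5)
    best_score, best_state, stale_epochs = float("-inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for batch in train_loader:            # batches of 32 examples
            optimizer.zero_grad()
            loss = model(**batch).loss        # transformers models return the loss when labels are given
            loss.backward()
            optimizer.step()
        score = evaluate(model, val_loader)   # task metric on the validation set
        if score > best_score:
            best_score = score
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs == patience:      # early stopping after 5 epochs without improvement
                break
    model.load_state_dict(best_state)         # the best checkpoint is then scored on the test set
    return model, best_score
```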
We repeat this fine-tuning process 5 times with different random seeds, i.e. 5 runs for each task and each dataset. We report each final test result as an average over the test scores from the 5 runs.
Baselines
Our main competitors are the pre-trained language models $\mathrm{RoBERTa_{base}}$ (Liu et al., 2019) and ${\mathrm{XLM-R}}_{\mathrm{base}}$ (Conneau et al., 2020), which
| Model | Ritter11 (soft) | Ritter11 (hard) | ARK (soft) | ARK (hard) | TB-v2 (soft) | TB-v2 (hard) |
|---|---|---|---|---|---|---|
| $\mathrm{RoBERTa_{large}}$ | 91.7 | 91.5 | 93.7 | 93.2 | 94.9 | 94.6 |
| ${\mathrm{XLM-R}}_{\mathrm{large}}$ | 92.6 | 92.1 | 94.2 | 93.8 | 95.5 | 95.1 |
| $\mathrm{RoBERTa_{base}}$ | 88.7 | 88.3 | 91.8 | 91.6 | 93.7 | 93.5 |
| ${\mathrm{XLM-R}}_{\mathrm{base}}$ | 90.4 | 90.3 | 92.8 | 92.6 | 94.7 | 94.3 |
| BERTweet | 90.1 | 89.5 | 94.1 | 93.4 | 95.2 | 94.7 |

Previous SOTA results (one score per dataset): DCNN (Gui et al.) 89.9 on Ritter11; DCNN (Gui et al.) 91.2 [+a] on Ritter11 and 92.4 [+a+b] on ARK; TPANN 90.9 [+a] on Ritter11 and 92.8 [+a+b] on ARK; ARKtagger 90.4 on Ritter11, 93.2 [+b] on ARK and 94.6 [+c] on TB-v2; BiLSTM-CNN-CRF 92.5 [+c] on TB-v2.
Table 1: POS tagging accuracy results on the Ritter11-T-POS (Ritter11), ARK-Twitter (ARK) and TWEEBANK-V2 (TB-v2) test sets. The result of ARKtagger (Owoputi et al., 2013) on Ritter11 is reported in the TPANN paper (Gui et al., 2017). Note that Ritter11 uses Twitter-specific POS tags for retweeted (RT), user-account, hashtag and url word tokens, which can be tagged perfectly using some simple regular expressions. Therefore, we follow Gui et al. (2017) and Gui et al. (2018) to tag those words appropriately for all models. Results of ARKtagger and BiLSTM-CNN-CRF (Ma and Hovy, 2016) on TB-v2 are