BERTweet: A pre-trained language model for English Tweets
Abstract
We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as $\mathrm{BERT_{base}}$ (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at: https://github.com/VinAIResearch/BERTweet.
1 Introduction
The language model BERT (Devlin et al., 2019)—the Bidirectional Encoder Representations from Transformers (Vaswani et al., 2017)—and its variants have successfully helped produce new state-of-the-art performance results for various NLP tasks. Their success has largely covered the common English domains such as Wikipedia, news and books. For specific domains such as biomedical or scientific text, we could retrain a domain-specific model using the BERTology architecture (Beltagy et al., 2019; Lee et al., 2019; Gururangan et al., 2020).
Twitter has been one of the most popular micro-blogging platforms where users can share real-time information related to all kinds of topics and events. The enormous and plentiful Tweet data has been proven to be a widely-used and real-time source of information in various important analytic tasks (Ghani et al., 2019). Note that the characteristics of Tweets are generally different from those of traditional written text such as Wikipedia and news articles, due to the typically short length of Tweets and frequent use of informal grammar as well as irregular vocabulary, e.g. abbreviations, typographical errors and hashtags (Eisenstein, 2013; Han et al., 2013). This might therefore pose a challenge when applying existing language models, pre-trained on large-scale conventional text corpora with formal grammar and regular vocabulary, to text analytic tasks on Tweet data. To the best of our knowledge, there is no existing language model pre-trained on a large-scale corpus of English Tweets.
To fill the gap, we train the first large-scale language model for English Tweets using an 80GB corpus of 850M English Tweets. Our model uses the $\mathrm{BERT_{base}}$ model configuration, trained based on the RoBERTa pre-training procedure (Liu et al., 2019). We evaluate our model and compare it with strong competitors, i.e. $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ (Conneau et al., 2020), on three downstream Tweet NLP tasks: Part-of-speech (POS) tagging, Named-entity recognition (NER) and text classification. Experiments show that our model outperforms $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ as well as the previous state-of-the-art (SOTA) models on all these tasks. Our contributions are as follows:
• We present the first large-scale pre-trained language model for English Tweets.
• Our model does better than its competitors $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ and outperforms previous SOTA models on three downstream Tweet NLP tasks of POS tagging, NER and text classification, thus confirming the effectiveness of the large-scale and domain-specific language model pre-trained for English Tweets.
• We also provide the first set of experiments investigating whether a commonly used approach of applying lexical normalization dictionaries on Tweets (Han et al., 2012) would help improve the performance of the pre-trained language models on the downstream tasks.
• We publicly release our model under the name BERTweet, which can be used with fairseq (Ott et al., 2019) and transformers (Wolf et al., 2019); a minimal usage sketch is given after this list. We hope that BERTweet can serve as a strong baseline for future research and applications of Tweet analytic tasks.
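The sketch below shows how BERTweet might be loaded through the transformers library. The checkpoint identifier "vinai/bertweet-base" and the example Tweet are illustrative assumptions; the paper itself only points to the GitHub repository above.

```python
# A minimal usage sketch, assuming the "vinai/bertweet-base" checkpoint name;
# refer to the BERTweet repository for the released model files.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

# An example Tweet that has already been "soft"-normalized (Section 3):
# user mentions -> @USER, web/url links -> HTTPURL, emotion icons -> text strings.
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)  # contextual representations of the Tweet
```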
2 BERTweet
In this section, we outline the architecture, and describe the pre-training data and optimization setup that we use for BERTweet.
Architecture
Our BERTweet uses the same architecture as $\mathrm{BERT_{base}}$, which is trained with a masked language modeling objective (Devlin et al., 2019). The BERTweet pre-training procedure is based on RoBERTa (Liu et al., 2019), which optimizes the BERT pre-training approach for more robust performance. Given the widespread usage of BERT and RoBERTa, we do not detail the architecture here. See Devlin et al. (2019) and Liu et al. (2019) for more details.
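For concreteness, the following sketch spells out the $\mathrm{BERT_{base}}$-sized configuration that BERTweet shares, using the RoBERTa classes from the transformers library. The vocabulary and position-embedding sizes shown are rough assumptions based on the 64K BPE vocabulary and the maximum sequence length of 128 described below, not the released model's exact settings.

```python
# A rough configuration sketch only; values marked "illustrative" are assumptions.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    num_hidden_layers=12,         # BERT_base depth
    hidden_size=768,              # BERT_base hidden size
    num_attention_heads=12,       # BERT_base attention heads
    intermediate_size=3072,
    vocab_size=64000,             # illustrative: ~64K BPE subword types (see "Pre-training data")
    max_position_embeddings=130,  # illustrative: 128 tokens plus RoBERTa's two offset positions
)
model = RobertaForMaskedLM(config)  # masked language modeling objective
```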
Pre-training data
We use an 80GB pre-training dataset of uncompressed texts, containing 850M Tweets (16B word tokens). Here, each Tweet consists of at least 10 and at most 64 word tokens. In particular, this dataset is a concatenation of two corpora:
• We first download the general Twitter Stream grabbed by the Archive Team, containing 4TB of Tweet data streamed from 01/2012 to 08/2019 on Twitter. To identify English Tweets, we employ the language identification component of fastText (Joulin et al., 2017). We tokenize those English Tweets using “TweetTokenizer” from the NLTK toolkit (Bird et al., 2009) and use the emoji package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalize the Tweets by converting user mentions and web/url links into special tokens @USER and HTTPURL, respectively. We filter out retweeted Tweets and the ones shorter than 10 or longer than 64 word tokens. This pre-processing results in the first corpus of 845M English Tweets. A sketch of this pipeline is given after this list.
• We also stream Tweets related to the COVID-19 pandemic, available from 01/2020 to 03/2020. We apply the same pre-processing steps as described above, thus resulting in the second corpus of 5M English Tweets.
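A minimal sketch of the pre-processing pipeline described above is shown below, assuming fastText's lid.176.bin language-identification model and the nltk and emoji packages; the retweet heuristic and file name are illustrative simplifications.

```python
# A simplified sketch of the Tweet pre-processing described above;
# the "lid.176.bin" model path and the retweet heuristic are assumptions.
import fasttext
from nltk.tokenize import TweetTokenizer
from emoji import demojize

lang_id = fasttext.load_model("lid.176.bin")  # fastText language identification
tweet_tokenizer = TweetTokenizer()

def normalize_tweet(text):
    tokens = tweet_tokenizer.tokenize(text)
    normalized = []
    for token in tokens:
        if token.startswith("@"):
            normalized.append("@USER")           # user mentions -> special token
        elif token.lower().startswith(("http", "www")):
            normalized.append("HTTPURL")         # web/url links -> special token
        else:
            normalized.append(demojize(token))   # emotion icons -> text strings
    return " ".join(normalized)

def keep_tweet(text):
    # keep English, non-retweeted Tweets of 10-64 word tokens
    if text.startswith("RT "):                   # simple retweet heuristic
        return False
    labels, _ = lang_id.predict(text.replace("\n", " "))
    n_tokens = len(tweet_tokenizer.tokenize(text))
    return labels[0] == "__label__en" and 10 <= n_tokens <= 64
```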
We then apply fastBPE (Sennrich et al., 2016) to segment all 850M Tweets with subword units, using a vocabulary of 64K subword types. On average there are 25 subword tokens per Tweet.
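The segmentation step can be applied through fastBPE's Python binding, as sketched below; the codes and vocabulary file names are placeholders, and the 64K BPE codes themselves would be learned beforehand with fastBPE's learnbpe command.

```python
# A hedged sketch of applying a learned 64K BPE model with fastBPE;
# "bertweet.codes" and "bertweet.vocab" are placeholder file names.
import fastBPE

bpe = fastBPE.fastBPE("bertweet.codes", "bertweet.vocab")
tweet = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
print(bpe.apply([tweet]))  # subword-segmented Tweet (about 25 subword tokens per Tweet on average)
```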
Optimization
We utilize the RoBERTa implementation in the fairseq library (Ott et al., 2019). We set a maximum sequence length of 128, thus generating $850\mathrm{M} \times 25 / 128 \approx 166\mathrm{M}$ sequence blocks. Following Liu et al. (2019), we optimize the model using Adam (Kingma and Ba, 2014), and use a batch size of 7K across 8 V100 GPUs (32GB each) and a peak learning rate of 0.0004. We pre-train BERTweet for 40 epochs in about 4 weeks (here, we use the first 2 epochs for warming up the learning rate), equivalent to $166\mathrm{M} \times 40 / 7\mathrm{K} \approx 950\mathrm{K}$ training steps.
3 Experimental setup
We evaluate and compare the performance of BERTweet with strong baselines on three downstream NLP tasks of POS tagging, NER and text classification, using benchmark Tweet datasets.
Downstream task datasets
For POS tagging, we use three datasets Ritter11-T-POS (Ritter et al., 2011), ARK-Twitter (Gimpel et al., 2011; Owoputi et al., 2013) and TWEEBANK-V2 (Liu et al., 2018). For NER, we employ datasets from the WNUT16 NER shared task (Strauss et al., 2016) and the WNUT17 shared task on novel and emerging entity recognition (Derczynski et al., 2017). For text classification, we employ the 3-class sentiment analysis dataset from the SemEval 2017 Task 4A (Rosenthal et al., 2017) and the 2-class irony detection dataset from the SemEval 2018 Task 3A (Van Hee et al., 2018).
For Ritter11-T-POS, we employ a 70/15/15 training/validation/test pre-split available from Gui et al. (2017). ARK-Twitter contains two files daily547.conll and oct27.conll, in which oct27.conll is further split into files oct27.traindev and oct27.test. Following Owoputi et al. (2013) and Gui et al. (2017), we employ daily547.conll as a test set. In addition, we use oct27.traindev and oct27.test as training and validation sets, respectively. For the TWEEBANK-V2, WNUT16 and WNUT17 datasets, we use their existing training/validation/test split. The SemEval 2017-Task4A and SemEval 2018-Task3A datasets are provided with training and test sets only (i.e. there is no standard split for validation), thus we sample 10% of the training set for validation and use the remaining 90% for training.
We apply a “soft” normalization strategy to all of the experimental datasets by translating word tokens of user mentions and web/url links into special tokens @USER and HTTPURL, respectively, and converting emotion icon tokens into corresponding strings. We also apply a “hard” strategy by further applying lexical normalization dictionaries (Aramaki, 2010; Liu et al., 2012; Han et al., 2012) to normalize word tokens in Tweets, as sketched below.
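A minimal sketch of the “hard” strategy follows, assuming a lexical normalization dictionary loaded as a mapping from non-standard word forms to canonical forms (e.g. the Han et al., 2012 dictionary); the file name and tab-separated format are assumptions.

```python
# A minimal "hard" normalization sketch; the dictionary file name and its
# tab-separated "variant<TAB>canonical" format are assumptions.
def load_norm_dict(path="lexnorm_dict.txt"):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            variant, canonical = line.rstrip("\n").split("\t")
            mapping[variant] = canonical
    return mapping

def hard_normalize(word_tokens, norm_dict):
    # replace each word token by its dictionary form when one is available
    return [norm_dict.get(token.lower(), token) for token in word_tokens]

# e.g. tokens like "u" or "gr8" would map to "you" / "great" if present in the dictionary
```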
Fine-tuning
Following Devlin et al. (2019), for POS tagging and NER, we append a linear prediction layer on top of the last Transformer layer of BERTweet with regard to the first subword of each word token, while for text classification we append a linear prediction layer on top of the pooled output.
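The sketch below illustrates these two prediction heads, again assuming the hypothetical "vinai/bertweet-base" checkpoint name; the label-set sizes and example word sequence are illustrative, and first-subword indices are tracked explicitly.

```python
# A hedged sketch of the two task heads; the checkpoint name, label-set sizes
# and example words are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
encoder = AutoModel.from_pretrained("vinai/bertweet-base")

num_tags, num_classes = 17, 3  # illustrative label-set sizes (POS/NER vs. classification)
token_head = torch.nn.Linear(encoder.config.hidden_size, num_tags)        # POS tagging / NER
sentence_head = torch.nn.Linear(encoder.config.hidden_size, num_classes)  # text classification

words = ["@USER", "all", "the", "flights", "got", "cancelled", ":crying_face:"]
input_ids, first_subword = [tokenizer.cls_token_id], []
for word in words:
    pieces = tokenizer.encode(word, add_special_tokens=False)
    first_subword.append(len(input_ids))   # position of this word's first subword
    input_ids.extend(pieces)
input_ids.append(tokenizer.sep_token_id)

outputs = encoder(torch.tensor([input_ids]))
tag_logits = token_head(outputs.last_hidden_state[0, first_subword])  # one prediction per word
class_logits = sentence_head(outputs.pooler_output)                   # one prediction per Tweet
```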
We employ the transformers library (Wolf et al., 2019) to independently fine-tune BERTweet for each task and each dataset in 30 training epochs. We use AdamW (Loshchilov and Hutter, 2019) with a fixed learning rate of 1e-5 and a batch size of 32 (Liu et al., 2019). We compute the task performance after each training epoch on the validation set (here, we apply early stopping when no improvement is observed after 5 continuous epochs), and select the best model checkpoint to compute the performance score on the test set.
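This recipe can be condensed as in the sketch below; the model, training DataLoader and validation-metric function are supplied by the caller and stand in for task-specific code rather than the released implementation.

```python
# A condensed fine-tuning sketch; `model`, `train_loader` and `evaluate` are
# placeholders for task-specific code, not the released implementation.
import copy
import torch

def fine_tune(model, train_loader, evaluate, max_epochs=30, patience=5, lr=1e-5):
    """Fine-tune with AdamW, evaluating after each epoch and early-stopping."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_score, best_state, epochs_without_gain = float("-inf"), None, 0

    for _ in range(max_epochs):
        model.train()
        for batch in train_loader:            # batches of size 32 in our setup
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        score = evaluate(model)               # validation metric after each epoch
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain == patience:  # early stopping
                break

    model.load_state_dict(best_state)         # best checkpoint is then scored on the test set
    return best_score
```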
We repeat this fine-tuning process 5 times with different random seeds, i.e. 5 runs for each task and each dataset. We report each final test result as an average over the test scores from the 5 runs.
Baselines
Our main competitors are the pre-trained language models $\mathrm{RoBERTa_{base}}$ (Liu et al., 2019) and ${\mathrm{XLM-R}}_{\mathrm{base}}$ (Conneau et al., 2020), which
| Model | Ritter11 (soft) | Ritter11 (hard) | ARK (soft) | ARK (hard) | TB-v2 (soft) | TB-v2 (hard) |
|---|---|---|---|---|---|---|
| $\mathrm{RoBERTa_{large}}$ | 91.7 | 91.5 | 93.7 | 93.2 | 94.9 | 94.6 |
| ${\mathrm{XLM-R}}_{\mathrm{large}}$ | 92.6 | 92.1 | 94.2 | 93.8 | 95.5 | 95.1 |
| $\mathrm{RoBERTa_{base}}$ | 88.7 | 88.3 | 91.8 | 91.6 | 93.7 | 93.5 |
| ${\mathrm{XLM-R}}_{\mathrm{base}}$ | 90.4 | 90.3 | 92.8 | 92.6 | 94.7 | 94.3 |
| BERTweet | 90.1 | 89.5 | 94.1 | 93.4 | 95.2 | 94.7 |
| DCNN (Gui et al., 2017) | 89.9 | | | | | |
| DCNN (Gui et al., 2018) | 91.2 [+a] | | 92.4 [+a+b] | | | |
| TPANN | 90.9 [+a] | | 92.8 [+a+b] | | | |
| ARKtagger | 90.4 | | 93.2 [+b] | | 94.6 [+c] | |
| BiLSTM-CNN-CRF | | | | | 92.5 [+c] | |
Table 1: POS tagging accuracy results on the Ritter11-T-POS (Ritter11), ARK-Twitter (ARK) and TWEEBANK-V2 (TB-v2) test sets. Result of ARKtagger (Owoputi et al., 2013) on Ritter11 is reported in the TPANN paper (Gui et al., 2017). Note that Ritter11 uses Twitter-specific POS tags for retweeted (RT), user-account, hashtag and url word tokens which can be tagged perfectly using some simple regular expressions. Therefore, we follow Gui et al. (2017) and Gui et al. (2018) to tag those words appropriately for all models. Results of ARKtagger and BiLSTM-CNN-CRF (Ma and Hovy, 2016) on TB-v2 are reported by Liu et al. (2018). Also note that “+a”, “+b” and “+c” denote the additional use of extra training data, i.e. models trained on bigger training data. “+a”: additional use of the POS annotated data from the English WSJ Penn treebank sections 00-24 (Marcus et al., 1993). “+b”: the use of both training and validation sets for learning models. “+c”: additional use of the POS annotated data from the UD English-EWT training set (Silveira et al., 2014).
have the same architecture configuration as our BERTweet. In addition, we also evaluate the pre-trained $\mathrm{RoBERTa_{large}}$ and ${\mathrm{XLM-R}}_{\mathrm{large}}$, although it is not a fair comparison due to their significantly larger model configurations.
The pre-trained RoBERTa is a strong language model for English, learned from 160GB of texts covering books, Wikipedia, Common Crawl news, Common Crawl stories, and web text contents. XLM-R is a cross-lingual variant of RoBERTa, trained on a 2.5TB multilingual corpus which contains 301GB of English Common Crawl texts.
We fine-tune RoBERTa and XLM-R using the same fine-tuning approach we use for BERTweet.
4 Experimental results
Main results
Tables 1, 2, 3 and 4 present our obtained scores for BERTweet and baselines regarding both “soft” and “hard” normalization strategies. We find that for each pre-trained language model the “soft” scores are generally higher than the corresponding “hard” scores, i.e. applying lexical normalization dictionaries to normalize word tokens in Tweets generally does not help improve the performance of the pre-trained language models on downstream tasks.
Table 2: F1 scores on the WNUT16 and WNUT17 test sets. The CambridgeLTL result is reported by Limsopatham and Collier (2016). “entity” and “surface” denote the scores computed for the standard entity level and the surface level (Derczynski et al., 2017), respectively.
| Model | WNUT16 (soft) | WNUT16 (hard) | WNUT17 (soft) | WNUT17 (hard) |
|---|---|---|---|---|
| $\mathrm{RoBERTa_{large}}$ | 55.4 | 54.8 | | |
| ${\mathrm{XLM-R}}_{\mathrm{large}}$ | 55.8 | 55.3 | | |
| $\mathrm{RoBERTa_{base}}$ | 49.7 | 49.2 | | |
| ${\mathrm{XLM-R}}_{\mathrm{base}}$ | 49.9 | 49.4 | | |
| BERTweet | 52.1 | 51.3 | | |
| CambridgeLTL | 52.4 [+b] | | | |
| DATNet (Zhou et al.) | 53.0 [+b] | | | |
| Aguilar et al. (2017) | | | | |
Table 3: Performance scores on the SemEval 2017-Task4A test set. See Rosenthal et al. (2017) for the definitions of the AvgRec and $\mathrm{F}_{1}^{\mathrm{NP}}$ metrics, in which AvgRec is the main ranking metric.
| Model | AvgRec (soft) | AvgRec (hard) | $\mathrm{F}_{1}^{\mathrm{NP}}$ (soft) | $\mathrm{F}_{1}^{\mathrm{NP}}$ (hard) | Accuracy (soft) | Accuracy (hard) |
|---|---|---|---|---|---|---|
| $\mathrm{RoBERTa_{large}}$ | 72.5 | 72.2 | 72.0 | 71.8 | 70.7 | 71.3 |
| ${\mathrm{XLM-R}}_{\mathrm{large}}$ | 71.7 | 71.7 | 71.1 | 70.9 | 70.7 | 70.6 |
| $\mathrm{RoBERTa_{base}}$ | 71.6 | 71.8 | 71.2 | 71.2 | 71.6 | 70.9 |
| ${\mathrm{XLM-R}}_{\mathrm{base}}$ | 70.3 | 70.3 | 69.4 | 69.6 | 69.3 | 69.7 |
| BERTweet | 73.2 | 72.8 | 72.8 | 72.5 | 71.7 | 72.0 |
| Cliche (2017) | 68.1 | 68.1 | 68.5 | 68.5 | 65.8 | 65.8 |
| Baziotis et al. (2017) | 68.1 | 68.1 | 67.7 | 67.7 | 65.1 | 65.1 |
Table 4: Performance scores on the SemEval 2018-Task3A test set. $\mathrm{F}_{1}^{\mathrm{pos}}$—the main ranking metric—denotes the $\mathrm{F}_{1}$ score computed for the positive label.
| Model | $\mathrm{F}_{1}^{\mathrm{pos}}$ (soft) | $\mathrm{F}_{1}^{\mathrm{pos}}$ (hard) | Accuracy (soft) | Accuracy (hard) |
|---|---|---|---|---|
| $\mathrm{RoBERTa_{large}}$ | 73.2 | 71.9 | 76.5 | 75.1 |
| ${\mathrm{XLM-R}}_{\mathrm{large}}$ | 70.8 | 69.7 | 74.2 | 73.2 |
| $\mathrm{RoBERTa_{base}}$ | 71.0 | 71.2 | 74.0 | 74.0 |
| ${\mathrm{XLM-R}}_{\mathrm{base}}$ | 66.6 | 66.2 | 70.8 | 70.8 |
| BERTweet | 74.6 | 74.3 | 78.2 | 78.2 |
| Wu et al. (2018) | 70.5 | | 73.5 | |
| Baziotis et al. (2018) | | | | |
Our BERTweet outperforms its main competitors $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ on all experimental datasets (with only one exception that ${\mathrm{XLM-R}}_{\mathrm{base}}$ does slightly better than BERTweet on Ritter11-T-POS). Compared to $\mathrm{RoBERTa_{large}}$ and ${\mathrm{XLM-R}}_{\mathrm{large}}$, which use significantly larger model configurations, we find that they obtain better POS tagging and NER scores than BERTweet. However, BERTweet performs better than those large models on the two text classification datasets.
Tables 1, 2, 3 and 4 also compare our obtained scores with the previous highest reported results on the same test sets. Clearly, the pre-trained language models help achieve new SOTA results on all experimental datasets. Specifically, BERTweet improves the previous SOTA in novel and emerging entity recognition by 14% absolute on the WNUT17 dataset, and in text classification by 5% and 4% absolute on the SemEval 2017-Task4A and SemEval 2018-Task3A test sets, respectively. Our results confirm the effectiveness of the large-scale BERTweet for Tweet NLP.
Discussion
Our results comparing the “soft” and “hard” normalization strategies with regard to the pre-trained language models confirm the previous view that lexical normalization on Tweets is a lossy translation task (Owoputi et al., 2013). We find that RoBERTa outperforms XLM-R on the text classification datasets. This finding is similar to what is found in the XLM-R paper (Conneau et al., 2020), where XLM-R obtains lower performance scores than RoBERTa for sequence classification tasks on traditional written English corpora.
We also recall that although RoBERTa and XLM-R use $160/80 = 2$ times and $301/80 \approx 3.75$ times more English data than our BERTweet, respectively, BERTweet does better than its competitors $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$. This confirms the effectiveness of a large-scale and domain-specific pre-trained language model for English Tweets. In future work, we will release a “large” version of BERTweet, which may perform better than $\mathrm{RoBERTa_{large}}$ and ${\mathrm{XLM-R}}_{\mathrm{large}}$ on all three evaluation tasks.
5 Conclusion
We have presented the first large-scale pre-trained language model for English Tweets, BERTweet. We demonstrate the usefulness of BERTweet by showing that it outperforms its baselines $\mathrm{RoBERTa_{base}}$ and ${\mathrm{XLM-R}}_{\mathrm{base}}$ and helps produce better performance than the previous SOTA models on three downstream Tweet NLP tasks of POS tagging, NER, and text classification (i.e. sentiment analysis & irony detection).
As of September 2020, we have collected a corpus of about 23M “cased” COVID-19 English Tweets, each consisting of at least 10 and at most 64 word tokens. In addition, we also create an “uncased” version of this corpus. We then continue pre-training from our pre-trained BERTweet on each of the “cased” and “uncased” corpora of 23M Tweets for 40 additional epochs, resulting in two pre-trained variants: the “cased” and “uncased” BERTweet-COVID19 models. By publicly releasing BERTweet and its two variants, we hope that they can foster future research and applications of Tweet analytic tasks, such as identifying informative COVID-19 Tweets (Nguyen et al., 2020) or extracting COVID-19 events from Tweets (Zong et al., 2020).
