MA-BERT: Learning Representation by Incorporating Multi-Attribute Knowledge in Transformers
You Zhang†, Jin Wang†,1, Liang-Chih Yu‡,2 and Xuejie Zhang†,3
†School of Information Science and Engineering, Yunnan University, Yunnan, P.R. China
‡Department of Information Management, Yuan Ze University, Taiwan
Contact: {wangjin1, xjzhang3}@ynu.edu.cn, lcyu@saturn.yzu.edu.tw
Abstract
Incorporating attribute information such as user and product features into deep neural networks has been shown to be useful in sentiment analysis. Previous works typically accomplished this in two ways: concatenating multiple attributes to word/text representation or treating them as a bias to adjust attention distribution. To leverage the advantages of both methods, this paper proposes a multi-attribute BERT (MA-BERT) to incorporate external attribute knowledge. The proposed method has two advantages. First, it applies multi-attribute transformer (MA-Transformer) encoders to incorporate multiple attributes into both input representation and attention distribution. Second, the MA-Transformer is implemented as a universal layer and stacked on a BERT-based model such that it can be initialized from a pre-trained checkpoint and fine-tuned for the downstream applications without extra pretraining costs. Experiments on three benchmark datasets show that the proposed method outperformed pre-trained BERT models and other methods incorporating external attribute knowledge.
1 Introduction
To learn a distributed text representation for sentiment classification (Pang and Lee, 2008; Liu, 2012), conventional deep neural networks, such as convolutional neural networks (CNN) (Kim, 2014) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), and common integration techniques, such as self-attention mechanisms (Vaswani et al., 2017; Chaudhari et al., 2019) and dynamic routing algorithms (Gong et al., 2018; Sabour et al., 2017), are usually applied to compose the vectors of constituent words. To further enhance performance, pre-trained models (PTMs), such as BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), and XLM-
Figure 1: Different strategies to incorporate external attribute knowledge into deep neural networks.
RoBERTa (Conneau et al., 2019), can be fine-tuned and transferred to sentiment analysis tasks. In practice, PTMs are first fed a large amount of unannotated data and trained with masked language modeling or next sentence prediction to learn how words are used and how the language is written in general. The models are then transferred to another task and fine-tuned on a smaller task-specific dataset.
The above-mentioned methods use only features from plain text. Incorporating attribute information such as user and product features can improve sentiment analysis performance. Previous works typically incorporated such external knowledge by concatenating these attributes into word and text representations (Tang et al., 2015), as shown in Figs. 1(a) and (b). Such methods usually attach the attribute information in shallow layers to modify the representation of either words or texts. However, they provide little interaction between the attributes and the text, since every word is aligned to the attribute features equally, leaving the model unable to emphasize important tokens. Several works have instead used attribute features as a bias term in self-attention mechanisms to model meaningful relations between words and attributes (Wu et al., 2018; Chen et al., 2016b; Dong et al., 2017; Dou, 2017), as shown in Fig. 1(c). Because the attention scores are normalized with the softmax function, the incorporated attribute features only affect how the attention weights are allocated. As a result, the representations of the input words themselves are not updated, and the attribute information is lost. For example, depending on individual preferences for chili, readers may focus on reviews that mention spicy, but only those who like chili would find such reviews useful. However, self-attention models that learn text representations by merely adjusting the weight of spicy still produce the same word representation of spicy for different people, making it difficult to distinguish those who like chili from those who do not.
To address the above problems, this study proposes a multi-attribute BERT (MA-BERT) model that applies multi-attribute transformer (MA-Transformer) encoders to incorporate external attribute knowledge. Instead of being incorporated into the attention mechanism as bias terms, multiple attributes are injected into both the attention maps and the input token representations through bilinear interaction, as shown in Fig. 1(d). In addition, the MA-Transformer is implemented as a universal layer and stacked on a BERT-based model such that it can be initialized from a pre-trained checkpoint and fine-tuned for downstream tasks without extra pre-training costs. Experiments were conducted on three benchmark datasets (IMDB, Yelp-2013, and Yelp-2014) for sentiment polarity classification. The results show that the proposed MA-BERT model outperformed pre-trained BERT models and other methods incorporating external attribute knowledge.
The remainder of this paper is organized as follows. Section 2 provides a detailed description of the proposed methods. The empirical experiments are reported with analysis in Section 3. Conclusions are finally drawn in Section 4.
2 Multi-Attribute BERT Model
Fig. 2 shows an overview of the MA-BERT model. It mainly consists of two parts: a BERT-based PTM and several MA-Transformer encoders stacked on top of BERT as extra layers. Both components are described in detail below.
Figure 2: Overall architecture of the MA-BERT model.
2.1 BERT Encoder
By applying a word piece tokenizer (Wu et al., 2016), the input text can be denoted as a sequence of tokens, i.e., $s = \{w_0, w_1, w_2, \dots, w_{L-1}\}$, where $L$ is the length of the text and $w_0 = \text{[CLS]}$ is a special classification token. Moreover, its corresponding attributes are denoted as $\mathbf{A} = \{a_1, a_2, \dots, a_M\}$, where $M$ is the number of attributes of the text. Thus, the $i$-th input sample can be denoted as a tuple, i.e., $(\mathbf{A}_i, s_i)$.
To learn the hidden representation, the pre-trained language model BERT (Devlin et al., 2019), which has achieved impressive performance on various natural language processing (NLP) tasks, was used. The token sequence is fed into the BERT model to obtain the representation, denoted as,

$$\mathbf{T} = \mathrm{BERT}(s; \theta_{\mathrm{BERT}})$$
where $\mathbf{T} \in \mathbb{R}^{L \times d_t}$ is the output representation of all tokens; $\theta_{\mathrm{BERT}}$ denotes the trainable parameters of BERT, which are initialized from a pre-trained checkpoint and then fine-tuned during model training; and $d_t = 768$ is the dimensionality of the output representation.
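As a concrete illustration, the following sketch shows how a review can be tokenized and encoded into the token representation $\mathbf{T}$ using the HuggingFace transformers library and the uncased base BERT checkpoint; the variable names and the example sentence are illustrative, not part of the original implementation.

```python
from transformers import BertTokenizer, BertModel

# Load the pre-trained uncased base BERT checkpoint (12 layers, d_t = 768).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

review = "the food was spicy but delicious"
inputs = tokenizer(review, return_tensors="pt")   # adds the [CLS] and [SEP] tokens

outputs = bert(**inputs)
T = outputs.last_hidden_state                     # (batch, L, d_t) token representations
```

During fine-tuning, the BERT parameters are updated together with the extra MA-Transformer layers stacked on top.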
Following Wu et al. (2018) and Wang et al. (2017), all attributes are mapped to attribute embeddings $\mathbf{E}_A = [E_{A,1}, E_{A,2}, \dots, E_{A,M}] \in \mathbb{R}^{M \times d_E}$, which are randomly initialized and updated during the subsequent training phase.
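As a minimal sketch of this step, the snippet below builds randomly initialized embedding tables for two example attributes (a user id and a product id) in PyTorch; the vocabulary sizes and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

d_E = 768  # attribute embedding size (equals d_t when a single head is used)

# One randomly initialized, trainable embedding table per attribute.
user_embedding = nn.Embedding(num_embeddings=10000, embedding_dim=d_E)
product_embedding = nn.Embedding(num_embeddings=5000, embedding_dim=d_E)

user_id, product_id = torch.tensor([42]), torch.tensor([7])
E_A = torch.stack([user_embedding(user_id), product_embedding(product_id)], dim=1)
# E_A: (batch, M=2, d_E); updated by back-propagation during fine-tuning.
```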
Multi-Attribute Attention. To incorporate multiple attributes into the MA-Transformer, we introduce multi-attribute attention (MAA), which is expressed as,

$$\mathrm{MAA}(\mathbf{T}, \mathbf{E}_A) = (U_1 \oplus U_2 \oplus \dots \oplus U_M)\,W_o, \quad U_m = \mathrm{softmax}\!\left(\frac{Q_m K_m^{\top}}{\sqrt{d}}\right) V_m$$
where $U_m$ is the attention from the $m$-th attribute; $W_o \in \mathbb{R}^{(M \cdot d) \times d_t}$ is the output linear projection, and $d$ denotes the dimensionality of $Q$, $K$ and $V$; $Q$, $K$ and $V$ are matrices that package the queries, keys and values, which are defined as,

$$Q_m = (\mathbf{T} \cdot W_{q,m}) \odot E_{A,m}, \quad K_m = (\mathbf{T} \cdot W_{k,m}) \odot E_{A,m}, \quad V_m = (\mathbf{T} \cdot W_{v,m}) \odot E_{A,m}$$
where $Q_m$, $K_m$ and $V_m \in \mathbb{R}^{L \times d_E}$ are bilinear transformations (Huang et al., 2019) applied on the input representation $\mathbf{T}$ and the attribute representation $E_{A,m}$; $W_{q,m}$, $W_{k,m}$ and $W_{v,m} \in \mathbb{R}^{d_t \times d_E}$ are the weight matrices for the query, key and value projections; and $\cdot$ and $\odot$ respectively denote the inner product and the Hadamard product.
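The following sketch puts these pieces together into a single-head multi-attribute attention module, assuming the bilinear interaction is realized as a linear projection of $\mathbf{T}$ followed by a Hadamard product with the attribute embedding; it is an illustrative reading of the equations above, not the authors' released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttributeAttention(nn.Module):
    """Single-head MAA sketch: one bilinear Q/K/V interaction per attribute."""

    def __init__(self, d_t: int, d_E: int, num_attributes: int):
        super().__init__()
        self.w_q = nn.ModuleList([nn.Linear(d_t, d_E, bias=False) for _ in range(num_attributes)])
        self.w_k = nn.ModuleList([nn.Linear(d_t, d_E, bias=False) for _ in range(num_attributes)])
        self.w_v = nn.ModuleList([nn.Linear(d_t, d_E, bias=False) for _ in range(num_attributes)])
        self.w_o = nn.Linear(num_attributes * d_E, d_t, bias=False)  # W_o: (M * d) -> d_t

    def forward(self, T, E_A):
        # T: (batch, L, d_t); E_A: (batch, M, d_E)
        outputs = []
        for m in range(E_A.size(1)):
            e_m = E_A[:, m].unsqueeze(1)                      # (batch, 1, d_E)
            q = self.w_q[m](T) * e_m                          # bilinear interaction via Hadamard product
            k = self.w_k[m](T) * e_m
            v = self.w_v[m](T) * e_m
            scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
            outputs.append(F.softmax(scores, dim=-1) @ v)     # U_m: (batch, L, d_E)
        return self.w_o(torch.cat(outputs, dim=-1))           # (batch, L, d_t)
```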
Similar to Vaswani et al. (2017), we also introduce a multi-head mechanism for the MA-Transformer, denoted as,
where $K$ is the number of heads for each attribute and $\oplus$ denotes the concatenation operator; $E^{k}_{A,m} \in \mathbb{R}^{d_E}$ is the $m$-th attribute representation in the $k$-th head, and its dimensionality must satisfy $d_E = d_t / K$. Since different heads can capture different relation types in the text representation, different parameters are used for each head.
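Continuing the sketch above, one hedged way to realize the multi-head variant is to instantiate $K$ MAA heads with separate parameters and per-head attribute embeddings of size $d_E = d_t / K$, concatenating their outputs; how the heads are combined here is an assumption for illustration rather than a detail stated in the paper, and MultiAttributeAttention refers to the earlier sketch.

```python
import torch
import torch.nn as nn

class MultiHeadMAA(nn.Module):
    """Multi-head sketch: K independent MAA heads, each with d_E = d_t / K."""

    def __init__(self, d_t: int, num_attributes: int, num_heads: int):
        super().__init__()
        assert d_t % num_heads == 0
        d_E = d_t // num_heads
        # Different parameters per head so each head can capture a different relation type.
        self.heads = nn.ModuleList(
            [MultiAttributeAttention(d_t, d_E, num_attributes) for _ in range(num_heads)]
        )
        self.w_o = nn.Linear(num_heads * d_t, d_t, bias=False)

    def forward(self, T, E_A_heads):
        # E_A_heads: list of K tensors, each (batch, M, d_t / K) -- per-head attribute embeddings.
        return self.w_o(torch.cat([head(T, e) for head, e in zip(self.heads, E_A_heads)], dim=-1))
```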
2.2 MA-Transformer
Taking the representations of both the text $\mathbf{T}$ and the attributes $\mathbf{A}$ as input, an MA-Transformer encoder processes them in the same way as a standard transformer encoder (Vaswani et al., 2017) to generate $\mathbf{Y} \in \mathbb{R}^{L \times d_t}$. $\mathbf{Y}$ is then combined with the input representation $\mathbf{T}$ through a residual connection and a normalization layer. The intermediate output is passed to a two-layer feed-forward network with a rectified linear unit (ReLU) activation function. Again, residual and normalization layers are applied to generate the final output, which serves as the input to the next encoder.
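A sketch of one such encoder layer is given below, reusing the MultiAttributeAttention module sketched earlier; the feed-forward width (3072, as in BERT-base) is an assumption.

```python
import torch.nn as nn

class MATransformerEncoder(nn.Module):
    """MA-Transformer encoder sketch: MAA, residual + norm, ReLU FFN, residual + norm."""

    def __init__(self, d_t: int, d_E: int, num_attributes: int, d_ff: int = 3072):
        super().__init__()
        self.maa = MultiAttributeAttention(d_t, d_E, num_attributes)
        self.norm1 = nn.LayerNorm(d_t)
        self.norm2 = nn.LayerNorm(d_t)
        self.ffn = nn.Sequential(nn.Linear(d_t, d_ff), nn.ReLU(), nn.Linear(d_ff, d_t))

    def forward(self, T, E_A):
        Y = self.maa(T, E_A)                 # multi-attribute attention output
        h = self.norm1(T + Y)                # residual connection from T, then layer normalization
        return self.norm2(h + self.ffn(h))   # two-layer ReLU feed-forward with residual and norm
```

The returned tensor has the same shape as $\mathbf{T}$ and is taken as the input of the next encoder.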
By stacking several MA-Transformer encoders on the BERT model, the MA-BERT model generates a review representation $h_{\text{[CLS]}}$ corresponding to the special token [CLS]. A classifier comprising a linear projection and a softmax activation (with the output dimension equal to the number of classes) is then used for classification.
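A minimal sketch of this classification head, where the number of classes and the dummy [CLS] vector are placeholders for illustration:

```python
import torch
import torch.nn as nn

d_t, num_classes = 768, 5          # the number of classes is dataset-dependent (placeholder value)

classifier = nn.Sequential(nn.Linear(d_t, num_classes), nn.Softmax(dim=-1))

h_cls = torch.randn(1, d_t)        # review representation h_[CLS] from the last MA-Transformer encoder
probs = classifier(h_cls)          # per-class probabilities for the review
```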
3 Comparative Experiments
Datasets. Following the experimental settings of Tang et al. (2015), the proposed MA-BERT model is evaluated on three benchmark datasets (IMDB, Yelp-2013, and Yelp-2014). The evaluation metrics are accuracy (Acc.) and root mean squared error (RMSE); higher Acc. and lower RMSE indicate better performance.
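For reference, a small sketch of the two metrics, assuming gold and predicted labels are integer rating classes:

```python
import numpy as np

def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(((y_true - y_pred) ** 2).mean()))

# Higher accuracy and lower RMSE indicate better performance.
print(accuracy([3, 4, 5], [3, 4, 4]), rmse([3, 4, 5], [3, 4, 4]))
```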
Implementation Details. The baseline methods can be divided into three groups. The first group includes methods without user and product information, such as CNN (Kim, 2014), BiLSTM (Hochreiter and Schmidhuber, 1997), neural sentiment classification (NSC) (Chen et al., 2016a), and its variant with a local attention mechanism (NSC+LA). For the BERT-based methods, the uncased base BERT model consisting of 12 layers of transformer encoders was implemented for comparison. ToBERT (Pappagari et al., 2019) was trained in a non-end-to-end, two-stage manner using a word-to-segment strategy.
The second group includes existing methods incorporating user and product information, such as NSC with user (U) and product (P) information incorporated into an attention (A) mechanism (NSC+UPA), the user product neural network (UPNN) (Tang et al., 2015), and the hierarchical model with separate user attention and product attention (HUAPA) (Wu et al