Abstract
Politicians often have underlying agendas when reacting to events. The arguments a given entity makes in the context of various events reflect a fairly consistent set of agendas. In spite of recent advances in Pretrained Language Models (PLMs), those text representations are not designed to capture such nuanced patterns. In this paper, we propose a Compositional Reader model, consisting of Encoder and Composer modules, that attempts to capture and leverage such information to generate more effective representations for entities, issues, and events. These representations are contextualized by tweets, press releases, issues, news articles, and participating entities. Our model can process several documents at once and generate composed representations for multiple entities over several issues or events. Via qualitative and quantitative empirical analysis, we show that these representations are meaningful and effective.
Introduction
Often in political discourse, the same argument trajectories are repeated across events by politicians and political caucuses. Knowing and understanding the trajectories that are regularly used is pivotal in contextualizing the comments politicians make when a new event occurs. Furthermore, it helps us understand their perspectives and predict their likely reactions to new events and participating entities. In political text, bias towards a political perspective is often subtle rather than explicitly stated. Choices of mentioning or omitting certain entities or certain attributes can reveal the author's agenda. For example, when a politician tweets in reaction to a new shooting event, it is likely that they oppose gun control and support gun rights, despite not mentioning their stance explicitly. Our main insight in this paper is that effectively detecting such bias from text requires modeling the broader context of the document. This can include understanding relevant facts related to the event addressed in the text, the ideological leanings and perspectives expressed by the author in the past, and the sentiment/attitude of the author towards the entities referenced in the text. We suggest that this holistic view can be obtained by combining information from multiple sources of varying types, such as news articles, social media posts, quotes from press releases and historical beliefs expressed by politicians. Despite recent advances in Pretrained Language Models (PLMs) in NLP, which have greatly improved word representations via contextualized embeddings and powerful transformer units, such representations alone are not enough to capture nuanced biases in political discourse. Two of the key reasons are: (i) they do not directly focus on entity/issue-centric data and (ii) they contextualize only based on surrounding text but not on relevant issue/event knowledge.
A computational setting for this approach requires two necessary attributes: (i) an input representation that combines all the different types of information meaningfully and (ii) an ability to process all the information together in one shot. We address the first challenge by introducing a graph structure that ties together first-person informal (tweets) and formal discourse (press releases and perspectives), and third-person current (news) and consolidated (Wikipedia) discourse. These documents are connected via their authors, the issues/events they discuss and the entities mentioned in them. As a clarifying example, consider the partial tweet by President Trump. This tweet is represented in our graph by connecting the text node to the author node (President Trump) and the referenced entity node (New York Gov. Cuomo). This setting is shown in Fig. test_graph. Then, we propose a novel neural architecture that can process all the information in the graph together in one shot. The architecture generates a distributed representation for each item in the graph that is contextualized by the representations of the others. In our example, this results in a modified representation for the tweet and the entities, thus helping us characterize the opinion of President Trump about Governor Cuomo in the context of the NRA or in general. Our architecture builds upon the text representations obtained from BERT. It consists of an Encoder, which combines all the documents related to a given node to generate an initial node representation, and a Composer, a Graph Attention Network (GAT) that composes over the graph structure to generate contextualized node embeddings.
We design two self-supervised learning tasks to train the model and capture structural dependencies over the rich discourse representation, namely predicting Authorship and Referenced Entity links over the graph structure. The intuition behind these tasks is that solving them requires the model to understand subtle language usage. Authorship prediction requires the model to differentiate between: (i) the language of one author and that of another and (ii) the language of the author in the context of one issue vs. another issue. Referenced Entity prediction requires the model to understand the language used by an author when discussing a particular entity, given the author's historical discourse. We evaluate the resulting discourse representation via several empirical tasks identifying political perspectives at both the article and author levels. Our evaluation is designed to demonstrate the importance of each component of our model and the usefulness of the learning tasks. The Grade Paraphrase task evaluates our model's ability to consolidate multiple documents of different types from a single author into a coherent perspective about an issue. This is evaluated by framing the problem as a paraphrasing task, comparing the model's composed representation of an author with a short text expressing the stance directly, i.e., only based on the model's pre-training process. The Grade Prediction and Bias Prediction tasks show that our representations capture meaningful information that makes them highly effective for political prediction tasks. Both tasks build classifiers on top of the model: Grade Prediction evaluates author and issue representations while Bias Prediction evaluates graph-contextualized document representations. We perform Grade Prediction for two domains, using politician grades from two different organizations: the National Rifle Association (NRA) and the League of Conservation Voters (LCV). We compare our model to three competitive baselines: BERT, an adaptation of BERT to our data, and our Encoder architecture. This helps us evaluate different aspects of our model as well as our learning tasks.
We also analyse the relative usefulness of various types of documents via an ablation study. The BERT adaptation baseline is designed to be trained on our learning tasks without using the Composer architecture. It helps demonstrate the effectiveness of our learning tasks and the importance of the Composer architecture. Our model outperforms the baselines on all three evaluation tasks. Finally, we perform qualitative analysis, visualizing entities' stances, demonstrating that our representations effectively capture nuanced political information. To summarise, our research contributions include:
Related Work
Due to recent advances in text representations catalysed by , and followed by , and , we are now able to create very rich textual representations that are effective in many nuanced NLP tasks. Although semantic contextual information is captured by these models, they are not explicitly designed to capture entity/event-centric information. Hence, to solve tasks that demand a better understanding of such information , there is a need to create more focused representations. Of late, several works have attempted to solve such tasks . However, the representations used are usually limited in scope to specific tasks and not rich enough to capture information that is useful across several tasks. Our Compositional Reader model builds upon these embeddings and consists of a transformer-based Graph Attention Network inspired from , and aims to address those limitations via a generic entity-issue-event-document graph, which is used to learn highly effective representations.
Data
Data Type | Count |
---|---|
News Events | 367 |
Authoring Entities | 455 |
Referenced Entities | 10,506 |
Wikipedia Articles | 455 |
Tweets | 86,409 |
Press Releases | 62,257 |
Perspectives | 30,446 |
News Articles | 8,244 |
Total # documents | 187,811 |
Average sents per doc | 14.18 |
We collected US political text data related to $ 8 $ broad topics: . Data used for this paper was focused on $ 455 $ US senators and congressmen. We collected political text data relevant to the above topics from $ 5 $ sources: press statements by political entities from the ProPublica Congress API https://projects.propublica.org/api-docs/congress-api/ , Wikipedia articles describing political entities, tweets by political entities, perspectives of the senators and congressmen regarding various political issues from , and news articles & background of those political issues from . A total of $ 187,811 $ documents were used to train our model. Summary statistics are shown in Tab.
Event Identification
Event-based categorization of documents is performed as follows: news articles related to each issue are ordered by their date of publication. We find the mean ( $ \mu $ ) and standard deviation ( $ \sigma $ ) of the number of articles published per day for each issue. If more than $ \mu+\sigma $ articles are published on a single day for a given issue, we flag that day as the beginning of an event. Then, we skip $ 7 $ days and look for a new event. Until a new event window begins, the current event window continues. We use the event windows thus obtained to mark events. In our setting, events within a given issue are non-overlapping. We divide events for each issue separately, hence events for different issues can overlap. These events last for $ 7-10 $ days on average and hence the non-overlapping assumption within an issue is a reasonable relaxation of reality. To illustrate our point: coronavirus and civil-rights are separate issues and hence can have overlapping events. An example event related to coronavirus could be First case of COVID-19 outside of China reported. Similarly, an event about civil-rights could be Officer who was part of George Floyd killing suspended. We inspected the events manually and found that the events are meaningful for a high percentage of inspected cases ( $ \geq 85\% $ of events). Examples of identified events are shown in the appendix.
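The event-window detection above can be sketched as follows. This is a minimal illustration (the exact window-continuation bookkeeping in our pipeline may differ); `article_dates` is a list of publication dates for one issue, with one entry per article:

```python
from datetime import date, timedelta
from collections import Counter

def identify_events(article_dates, skip_days=7):
    """Flag event beginnings for one issue: a day whose article count
    exceeds mean + std marks the start of an event; we then skip ahead
    `skip_days` days before looking for the next event."""
    counts = Counter(article_dates)
    start, end = min(article_dates), max(article_dates)
    n_days = (end - start).days + 1
    daily = [counts.get(start + timedelta(days=d), 0) for d in range(n_days)]
    mean = sum(daily) / n_days
    std = (sum((c - mean) ** 2 for c in daily) / n_days) ** 0.5
    threshold = mean + std

    events, d = [], 0
    while d < n_days:
        if daily[d] > threshold:
            events.append(start + timedelta(days=d))
            d += skip_days  # the current event window continues meanwhile
        else:
            d += 1
    return events
```

A day with a burst of articles well above the issue's daily average is flagged as an event start, and the 7-day skip prevents one sustained burst from spawning several events.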
Data Pre-processing
We use the Stanford CoreNLP tool , Wikifier and the BERT-base-uncased implementation by to preprocess data for our experiments. We tokenize the documents, apply coreference resolution and extract referenced entities from each document. The referenced entities are then wikified using the Wikifier tool . The documents are then categorized by issues and events. News articles from and perspectives from are already classified by issues. We use keyword-based querying to extract issue-wise press releases from the ProPublica API. We use hashtag-based classification for tweets: a set of gold hashtags for each issue was created and the tweets were classified accordingly. Data collection is detailed in the appendix . Sentence-wise BERT-base embeddings of all documents are computed.
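The hashtag-based tweet classification can be sketched as below. The gold hashtag sets shown are hypothetical examples, not the actual sets we created:

```python
# Hypothetical gold hashtag sets, for illustration only.
GOLD_HASHTAGS = {
    "guns": {"#guncontrol", "#2a", "#nra"},
    "environment": {"#climatechange", "#cleanenergy"},
}

def classify_tweet(tweet):
    """Return the issues whose gold hashtags appear in the tweet."""
    tags = {tok.lower().rstrip(".,!?") for tok in tweet.split()
            if tok.startswith("#")}
    return [issue for issue, gold in GOLD_HASHTAGS.items() if tags & gold]
```

A tweet can match several issues; tweets with no gold hashtag are left unclassified.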
Query Mechanism
We implemented a query mechanism to obtain relevant subsets of data from the corpus. Each query is a triplet of entities, issues, and events. Given a query triplet, we retrieve: news articles related to the events for each of the issues, Wikipedia articles for each of the entities, background descriptions of the issues, perspectives of each entity regarding each of the issues, and tweets & press releases by each of the entities related to the events in the query. Referenced entities for each of the sentences in the documents and sentence-wise BERT embeddings of the documents are also retrieved.
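A minimal sketch of the retrieval step, assuming a corpus indexed by document type; all field and key names here are illustrative, not our actual implementation:

```python
def run_query(corpus, entities, issues, events):
    """Retrieve the document subsets described above for a query triplet."""
    return {
        "news": [d for d in corpus["news"] if d["event"] in events],
        "wikipedia": [corpus["wikipedia"][e] for e in entities],
        "backgrounds": [corpus["backgrounds"][i] for i in issues],
        "perspectives": [d for d in corpus["perspectives"]
                         if d["author"] in entities and d["issue"] in issues],
        "tweets_and_prs": [d for d in corpus["tweets_and_prs"]
                           if d["author"] in entities and d["event"] in events],
    }
```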
Compositional Reader
In this section, we describe the architecture of the proposed Compositional Reader model in detail. It contains $ 3 $ key components: Graph Generator, Encoder and Composer. Given the output of the query mechanism from Sec. query_mech, the Graph Generator creates a directed graph with entities, issues, events and documents as nodes. The Encoder is used to generate initial node embeddings for each of the nodes. The Composer is a transformer-based Graph Attention Network (GAT) followed by a pooling layer. It generates the final node embeddings and a single summary embedding for the query graph. Each component is described below.
Graph Generator
Given the output of the query mechanism for a query, the Graph Generator creates a directed graph with $ 5 $ types of nodes: authoring entities, referenced entities, issues, events and documents. Directed edges are used by the Composer to update source nodes' representations using destination nodes. We design the topology with the main goal of capturing representations of events, issues and referenced entities that reflect the author's opinion about them. We add edges from issues/events to the author's documents but omit the other direction, as our main goal is to contextualize issues/events using the author's opinions. Edges are added from authoring entities to their Wikipedia articles and the documents they authored (tweets, press releases and perspectives). Reverse edges from the authored documents to the author are also added. Uni-directional edges from relevant event nodes to the tweet and press release document nodes are added. Edges from issue nodes to event nodes and vice-versa are added. Edges from the issue nodes to their background description documents are added. Edges from event nodes to news articles describing the events and vice-versa are added. Uni-directional edges from issue nodes to author perspective nodes are added. Finally, uni-directional edges from referenced entities to all the document nodes are added. An example graph is shown in Fig. test_graph.
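A subset of the edge rules above can be sketched as follows, representing the graph as a set of directed (source, destination) pairs. The function signature and names are illustrative, and some edge types (e.g., event-to-tweet and issue-to-perspective edges) are omitted for brevity:

```python
def build_edges(author, wiki_doc, authored_docs, issues, events_by_issue,
                news_by_event, background_by_issue, refs_by_doc):
    """Build a subset of the directed edges described above; an edge
    (src, dst) means src's representation is updated using dst."""
    edges = set()
    edges.add((author, wiki_doc))              # author -> Wikipedia article
    for doc in authored_docs:
        edges.add((author, doc))               # author -> authored document
        edges.add((doc, author))               # and the reverse edge
    for issue in issues:
        edges.add((issue, background_by_issue[issue]))
        for event in events_by_issue[issue]:
            edges.add((issue, event))          # issue <-> event
            edges.add((event, issue))
            for article in news_by_event[event]:
                edges.add((event, article))    # event <-> news article
                edges.add((article, event))
    for doc, refs in refs_by_doc.items():
        for ref in refs:
            edges.add((ref, doc))              # referenced entity -> document
    return edges
```

Note that referenced-entity edges are deliberately one-way: the entity's representation is contextualized by the documents mentioning it, not vice-versa.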
Encoder
Encoder is used to compute the initial node embeddings. It consists of BERT followed by a Bi-LSTM. For each node, it takes a sequence of documents as input. The documents are ordered temporally. The output of the Encoder is a single embedding of dimension $ d_m $ for each node. Given a node $ \mathcal{N} $ = { $ D_1 $ , $ D_2 $ , $ \ldots $ , $ D_d $ } consisting of $ d $ documents, for each document $ D_i $ , contextualized embeddings of all the tokens are computed using BERT. Token embeddings are computed sentence-wise to avoid truncating long documents. Then, the token embeddings of each document are mean-pooled to get the document embeddings $ \mathcal{\vec{N}}^{bert} $ = { $ \vec{D_1}^{bert} $ , $ \vec{D_2}^{bert} $ , $ \ldots $ , $ \vec{D_d}^{bert} $ }, where $ \vec{D_i}^{bert}\in\mathbb{R}^{1 \times d_m} $ and $ d_m $ is the dimension of a BERT token embedding. The sequence $ \vec{\mathcal{N}}^{bert} $ is passed through a Bi-LSTM to obtain an output sequence $ \vec{E} $ = { $ \vec{e_1} $ , $ \vec{e_2} $ , $ \ldots $ , $ \vec{e_d} $ }, $ \vec{e_i}\in\mathbb{R}^{1 \times h} $ , where $ h/2 $ is the hidden dimension of the Bi-LSTM; we set $ h = d_m $ in our model. Finally, the output of the Encoder is computed by mean-pooling the sequence $ \vec{E} $ . We use the BERT-base-uncased model in our experiments, where $ d_m=h=768 $ . Initial node embeddings of all the document nodes are set to the Encoder output of the documents themselves. For authoring entity nodes, their Wikipedia descriptions, tweets, press releases and perspective documents are passed through the Encoder. For issue nodes, the background description of the issue is used. For event nodes, the Encoder representation of all the news articles related to the event is used. For referenced entities, all documents referring to the entity are used.
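The pooling steps can be sketched as below, with sentence-wise BERT token embeddings assumed precomputed; for brevity, a plain mean over documents stands in for the Bi-LSTM + mean-pool step:

```python
import numpy as np

def encode_node(doc_token_embeddings):
    """doc_token_embeddings: list of (num_tokens_i, d_m) arrays, one per
    document, ordered temporally. Mean-pool tokens to get per-document
    embeddings; a plain mean over documents stands in for the
    Bi-LSTM + mean-pool step."""
    doc_embs = np.stack([tok.mean(axis=0) for tok in doc_token_embeddings])
    node_emb = doc_embs.mean(axis=0)  # node embedding of dimension d_m
    return doc_embs, node_emb
```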
Composer
Composer is a transformer-based graph attention network (GAT) followed by a pooling layer. We use the transformer encoder layer proposed by , with the position-wise feed-forward layer removed, as a graph attention layer. The position-wise feed-forward layer is removed because the transformer unit was originally proposed for sequence-to-sequence prediction tasks, whereas the nodes in a graph usually have no ordering relationship between them. The adjacency matrix of the graph is used as the attention mask. Self-loops are added for all nodes so that the updated representation of a node also depends on its previous representation. The Composer module uses $ l=2 $ graph attention layers in our experiments. It generates updated node embeddings $ \mathbb{U}\in\mathbb{R}^{n \times d_m} $ and a summary embedding $ \mathbb{S}\in\mathbb{R}^{1 \times d_m} $ as outputs. The output dimension of the node embeddings is $ 768 $ , the same as BERT-base.
$$
\begin{aligned}
\mathbb{E} &\in \mathbb{R}^{d_m \times n}, \qquad \mathcal{A} \in \{0, 1\}^{n \times n}\\
\mathbb{G} &= \mathrm{layer\text{-}norm}(\mathbb{E})\\
Q &= W_q^T \mathbb{G}, \qquad Q \in \mathbb{R}^{n_h \times d_k \times n}\\
K &= W_k^T \mathbb{G}, \qquad K \in \mathbb{R}^{n_h \times d_k \times n}\\
V &= W_v^T \mathbb{G}, \qquad V \in \mathbb{R}^{n_h \times d_v \times n}\\
M &= \frac{Q^{T} K}{\sqrt{d_k}}, \qquad M \in \mathbb{R}^{n_h \times n \times n}\\
M &= \mathrm{mask}(M, \mathcal{A})\\
\mathbb{O} &= M V^{T}, \qquad \mathbb{O} \in \mathbb{R}^{n_h d_v \times n}\\
\mathbb{U} &= W_o^T \mathbb{O} + \mathbb{E}\\
\mathbb{S} &= \mathrm{mean\text{-}pool}(\mathbb{U})
\end{aligned}
$$
where $ n $ is the number of nodes in the graph, $ d_m $ is the dimension of a BERT token embedding, $ d_k $ , $ d_v $ are projection dimensions, $ n_h $ is the number of attention heads used and $ W_q \in\mathbb{R}^{d_m \times n_h d_k} $ , $ W_k \in\mathbb{R}^{d_m \times n_h d_k} $ , $ W_v \in\mathbb{R}^{d_m \times n_h d_v} $ and $ W_o \in\mathbb{R}^{n_h d_v \times d_m} $ are weight parameters to be learnt. $ \mathbb{E}\in\mathbb{R}^{d_m \times n} $ is the output of the Encoder and $ \mathcal{A}\in\{0, 1\}^{n \times n} $ is the adjacency matrix. We set $ n_h=12 $ , $ d_k=d_v=64 $ in our experiments.
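A single masked graph-attention layer following these equations can be sketched in NumPy as below. We assume self-loops are already present in $ \mathcal{A} $ , a row-wise softmax after masking (left implicit in the equations), and a simplified layer-norm without learned scale/shift:

```python
import numpy as np

def graph_attention_layer(E, A, Wq, Wk, Wv, Wo, n_h, d_k, d_v):
    """One Composer layer (column-vector convention: E is d_m x n).
    A must contain self-loops so every attention row has a valid entry."""
    d_m, n = E.shape
    G = (E - E.mean(axis=0)) / (E.std(axis=0) + 1e-6)   # per-node layer-norm

    def heads(W, d):  # project and split into n_h heads of size d
        return (W.T @ G).reshape(n_h, d, n)

    Q, K, V = heads(Wq, d_k), heads(Wk, d_k), heads(Wv, d_v)
    M = np.einsum('hkn,hkm->hnm', Q, K) / np.sqrt(d_k)  # n_h x n x n scores
    M = np.where(A[None].astype(bool), M, -np.inf)      # adjacency mask
    M = M - M.max(axis=-1, keepdims=True)               # stable softmax
    M = np.exp(M)
    M = M / M.sum(axis=-1, keepdims=True)
    O = np.einsum('hnm,hvm->hvn', M, V).reshape(n_h * d_v, n)
    U = Wo.T @ O + E                                    # residual connection
    S = U.mean(axis=1)                                  # summary embedding
    return U, S
```

Each node thus attends only to its graph neighbors (and itself), and the residual term keeps the updated embedding anchored to the node's previous representation.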
Learning Tasks
We design $ 2 $ learning tasks to train the Compositional Reader model: Authorship Prediction and Referenced Entity Prediction. Both tasks are different flavors of link prediction over graphs. In Authorship Prediction, given a graph, an author node and a document node with no link between them, the task is to predict whether the document was authored by the author node. In the Referenced Entity Prediction task, given a graph, a document node and a referenced entity node, the task is to predict whether the entity was referenced in the document. For this task, all occurrences of one entity in the text are replaced with a generic <ent> token in the document text before the document embedding is computed. Both are detailed below.
Authorship Prediction
Authorship Prediction is designed as a binary classification task. In this task, given a graph $ \mathcal{G} $ generated by the Graph Generator, an author node $ n_a $ and a document node $ n_d $ with no edges between them, the task is to predict whether or not the author represented by node $ n_a $ authored the document represented by node $ n_d $ . The intuition behind this learning task is to enable our model to learn to differentiate the language of an author in the context of an issue from documents by other entities or documents related to other issues. The model sees documents by the same author for the same issue in the graph and learns to decide whether the input document has similar language or not. It is a fairly simple learning task and hence an ideal task with which to start pre-training our model. We concatenate the initial and final node embeddings of the author and document, as well as the summary embedding of the graph, to obtain inputs to the fine-tuning layers for the Authorship Prediction task. We add one hidden layer of dimension $ 384 $ before the classification layer. Data samples for the task were created as follows: for each of the $ 455 $ entities, for each of the $ 8 $ issues and for all events related to that issue, we fire a query to the query mechanism and use the Graph Generator module to obtain a data graph (Fig. test_graph). Hence, we fire $ 3,640 $ queries in total and obtain the respective data graphs. To create a positive data sample, we sample a document $ d_i $ authored by the entity $ a_i $ and remove the edges between the nodes that represent $ a_i $ and $ d_i $ . Negative samples were designed carefully in $ 3 $ batches to enable the model to learn different aspects of the language used by the author. In the first batch, we sample news article nodes from the same graph. In the second batch, we obtain tweets, press releases and perspectives of the same author but from a different issue.
In the third batch, we sample documents related to the same issue but from other authors. We generate $ 421,284 $ samples in total, with $ 252,575 $ positive samples and $ 168,709 $ negative samples. We randomly split the data into a training set of $ 272,159 $ samples, a validation set of $ 73,410 $ samples and a test set of $ 75,715 $ samples. We also perform out-of-sample experiments to evaluate generalization to unseen politicians' data: we train the model on training data from two-thirds of the politicians and test on the test sets of the others. Results are shown in Tab. auth_pred_res. We perform graph trimming to make the computation tractable on a single GPU: we randomly drop $ 80\% $ of the news articles, tweets and press releases that are not related to the event to which $ d_i $ belongs. We use graphs with $ 200 $ - $ 500 $ nodes and a batch size of $ 1 $ .
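The three negative-sample batches can be sketched as follows; the document fields (`type`, `author`, `issue`) are illustrative, with news articles assumed to have no authoring entity:

```python
import random

def sample_negatives(graph_docs, author, issue, k=1):
    """Draw the three negative batches described above from one data graph."""
    news = [d for d in graph_docs if d["type"] == "news"]
    same_author_other_issue = [d for d in graph_docs
                               if d["author"] == author and d["issue"] != issue]
    same_issue_other_author = [d for d in graph_docs
                               if d["issue"] == issue
                               and d["author"] not in (author, None)]
    return [random.sample(batch, min(k, len(batch)))
            for batch in (news, same_author_other_issue,
                          same_issue_other_author)]
```

The batches are progressively harder: news articles differ in genre, same-author documents differ only in issue, and same-issue documents differ only in author.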
Referenced Entity Prediction
This task is also designed as binary classification. Given a graph $ \mathcal{G} $ , a document node $ d_i $ and a referenced entity node $ r_i $ from $ \mathcal{G} $ , the task is to predict whether or not $ r_i $ is referenced in $ d_i $ . To create data samples for this task, we sample a document from the data graph and replace all occurrences of the most frequent referenced entity in the document with a generic <ent> token. We remove the link between $ r_i $ and $ d_i $ in $ \mathcal{G} $ . The triplet ( $ \mathcal{G} $ , $ d_i $ , $ r_i $ ) is used as a positive data sample. We sample another referenced entity $ r_j $ from the graph, one that is not referenced in $ d_i $ , to generate a negative sample. The intuition behind this learning task is to enable our model to learn the correlation between the author, the language in the document and the referenced entity. For example, in the context of Donald Trump's recent impeachment hearing, consider the sentence X needs to face the consequences of their actions'. Depending upon the author, X could either be ' or '. Learning to understand such correlations by looking at other documents from the same author is a useful training task for our model. This is also a harder learning problem than Authorship Prediction. We use a fine-tuning architecture similar to that of Authorship Prediction on top of the Compositional Reader for this task as well. We keep separate fine-tuning parameters for each task, as they are fundamentally different prediction problems; the Compositional Reader is shared. We generated $ 252,578 $ samples for this task, half of them positive. They were split into $ 180,578 $ training samples, and validation and test sets of $ 36,400 $ samples each. We apply graph trimming for this task as well. We also perform out-of-sample evaluation for this learning task.
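The entity-masking step can be sketched as below; the alias handling is an illustrative simplification of whatever surface forms coreference resolution and wikification produce for the target entity:

```python
import re

def mask_entity(text, entity_aliases, token="<ent>"):
    """Replace every occurrence of the target entity with a generic token
    before computing the document embedding. Longest aliases are replaced
    first so that e.g. 'President Trump' is masked before 'Trump'."""
    for alias in sorted(entity_aliases, key=len, reverse=True):
        text = re.sub(re.escape(alias), token, text)
    return text
```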
Evaluation
We evaluate our model and pre-training tasks in a systematic manner using several quantitative tasks and qualitative analysis. Quantitative evaluation includes the NRA Grade Paraphrase task, Grade Prediction on NRA and LCV grades data, and the Bias Prediction task on AllSides news articles. Qualitative evaluation includes entity-stance visualization for issues. We compare our model's performance to BERT representations, the BERT adaptation baseline and representations from the Encoder module. The baselines and evaluation tasks are detailed below.
Baselines
BERT:
We compute the results obtained by using pooled BERT representations of the relevant documents for each of the quantitative tasks. Details of the chosen documents and the pooling procedure are described in the relevant task subsections. We compare the performance of our model to the results obtained by using the initial node embeddings generated from the Encoder for each of the quantitative tasks. We design a BERT adaptation baseline for the learning tasks. BERT adaptation is equivalent to using only the Encoder's initial node embeddings of the Compositional Reader model. While BERT adaptation and the Encoder share exactly the same architecture, the Encoder parameters are trained via back-propagation through the Composer, while the BERT adaptation parameters are trained directly using our learning tasks. In BERT adaptation, once we generate the data graph, we pass the mean-pooled sentence-wise BERT embeddings of the node documents through a Bi-LSTM. We mean-pool the output of the Bi-LSTM to get node embeddings. We use fine-tuning layers on top of the node embeddings thus obtained for the Authorship Prediction and Referenced Entity Prediction tasks. BERT A