# Understanding Politics via Contextualized Discourse Processing

## Abstract

Politicians often have underlying agendas when reacting to events. Arguments in contexts of various events reflect a fairly consistent set of agendas for a given entity. In spite of recent advances in Pretrained Language Models (PLMs), those text representations are not designed to capture such nuanced patterns. In this paper, we propose a Compositional Reader model consisting of encoder and composer modules, that attempts to capture and leverage such information to generate more effective representations for entities, issues, and events. These representations are contextualized by tweets, press releases, issues, news articles, and participating entities. Our model can process several documents at once and generate composed representations for multiple entities over several issues or events. Via qualitative and quantitative empirical analysis, we show that these representations are meaningful and effective.

## Introduction

Often in political discourse, the same argument trajectories are repeated across events by politicians and political caucuses. Knowing and understanding the trajectories that are regularly used is pivotal in contextualizing the comments politicians make when a new event occurs. Furthermore, it helps us understand their perspectives and predict their likely reactions to new events and participating entities. In political text, bias towards a political perspective is often subtle rather than explicitly stated. Choices of mentioning or omitting certain entities or certain attributes can reveal the author's agenda. For example, when a politician tweets in reaction to a new shooting event, it is likely that they oppose gun control and support free gun rights, despite not mentioning their stance explicitly.

Our main insight in this paper is that effectively detecting such bias from text requires modeling the broader context of the document. This can include understanding relevant facts related to the event addressed in the text, the ideological leanings and perspectives expressed by the author in the past, and the sentiment/attitude of the author towards the entities referenced in the text. We suggest that this holistic view can be obtained by combining information from multiple sources of varying types, such as news articles, social media posts, quotes from press releases and historical beliefs expressed by politicians. Despite recent advances in Pretrained Language Models (PLMs) in NLP, which have greatly improved word representations via contextualized embeddings and powerful transformer units, such representations alone are not enough to capture nuanced biases in political discourse. Two of the key reasons are: (i) they do not directly focus on entity/issue-centric data and (ii) they contextualize only based on surrounding text, not on relevant issue/event knowledge.
A computational setting for this approach requires two necessary attributes: (i) an input representation that combines all the different types of information meaningfully and (ii) an ability to process all the information together in one shot. We address the first challenge by introducing a graph structure that ties together first-person informal (tweets) and formal (press releases and perspectives) discourse with third-person current (news) and consolidated (Wikipedia) discourse. These documents are connected via their authors, the issues/events they discuss and the entities that are mentioned in them. As a clarifying example, consider a partial tweet by President Trump referencing New York Gov. Cuomo. This tweet is represented in our graph by connecting the text node to the author node (President Trump) and the referenced entity node (New York Gov. Cuomo). These settings are shown in Fig. test_graph. We then propose a novel neural architecture that can process all the information in the graph together in one shot. The architecture generates a distributed representation for each item in the graph that is contextualized by the representations of the others. In our example, this results in modified representations for the tweet and the entities, helping us characterize the opinion of President Trump about Governor Cuomo in the context of the NRA or in general. Our architecture builds upon the text representations obtained from BERT. It consists of an Encoder, which combines all the documents related to a given node to generate an initial node representation, and a Composer, a Graph Attention Network (GAT) that composes over the graph structure to generate contextualized node embeddings.

## Related Work

Due to recent advances in text representations, catalysed by contextualized embedding models, we are now able to create very rich textual representations that are effective in many nuanced NLP tasks. Although these models capture semantic contextual information, they are not explicitly designed to capture entity/event-centric information. Hence, solving tasks that demand a better understanding of such information requires more focused representations. Several recent works have attempted to solve such tasks, but the representations used are usually limited in scope to specific tasks and not rich enough to capture information that is useful across several tasks. Our Compositional Reader model, which builds upon BERT embeddings and consists of a transformer-based Graph Attention Network, aims to address those limitations via a generic entity-issue-event-document graph, which is used to learn highly effective representations.

## Data

| Data Type | Count |
| --- | --- |
| News Events | 367 |
| Authoring Entities | 455 |
| Referenced Entities | 10,506 |
| Wikipedia Articles | 455 |
| Tweets | 86,409 |
| Press Releases | 62,257 |
| Perspectives | 30,446 |
| News Articles | 8,244 |
| Total # documents | 187,811 |
| Average sents per doc | 14.18 |

We collected US political text data related to $8$ broad topics. Data used for this paper was focused on $455$ US senators and congressmen. We collected political text data relevant to the above topics from $5$ sources: press statements by political entities from the ProPublica Congress API (https://projects.propublica.org/api-docs/congress-api/), Wikipedia articles describing political entities, tweets by political entities, perspectives of the senators and congressmen regarding various political issues, and news articles & background descriptions of those political issues. A total of $187,811$ documents were used to train our model. Summary statistics are shown in the table above.


### Event Identification

Event-based categorization of documents is performed as follows: news articles related to each issue are ordered by their date of publication. We find the mean ($\mu$) and standard deviation ($\sigma$) of the number of articles published per day for each issue. If more than $\mu+\sigma$ articles are published on a single day for a given issue, we flag that day as the beginning of an event. Then, we skip $7$ days before looking for a new event; until a new event window begins, the current event window continues. The event windows thus obtained are used to mark events. In our setting, events within a given issue are non-overlapping. Since we divide events for each issue separately, events for different issues may overlap. These events last for $7$-$10$ days on average, so the non-overlapping assumption within an issue is a reasonable relaxation of reality. To illustrate: coronavirus and civil-rights are separate issues and hence can have overlapping events. An example event related to coronavirus could be "First case of COVID-19 outside of China reported". Similarly, an event about civil-rights could be "Officer who was part of George Floyd killing suspended". We inspected the events manually and found them meaningful in a high percentage of inspected cases ($\geq 85\%$ of events). Examples of identified events are shown in the appendix.
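The thresholding procedure above can be sketched as follows (a minimal illustration; the function name and the exact handling of the 7-day skip are assumptions of this sketch):

```python
import numpy as np

def identify_events(daily_counts, skip_days=7):
    """Flag event-start days for one issue.

    daily_counts: chronological list of article counts per day.
    A day starts a new event if its count exceeds mean + std of the
    per-day counts and more than `skip_days` days have passed since
    the previous event began.
    """
    counts = np.asarray(daily_counts, dtype=float)
    threshold = counts.mean() + counts.std()
    starts, last_start = [], -(skip_days + 1)
    for day, count in enumerate(counts):
        if count > threshold and day - last_start > skip_days:
            starts.append(day)
            last_start = day
    return starts
```

For example, two spikes more than $7$ days apart yield two events, while a spike inside the skip window is absorbed into the current event.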

### Data Pre-processing

We use the Stanford CoreNLP tool, Wikifier and a BERT-base-uncased implementation to preprocess data for our experiments. We tokenize the documents, apply coreference resolution and extract referenced entities from each document. The referenced entities are then wikified using the Wikifier tool. The documents are then categorized by issues and events. News articles and perspectives are already classified by issues. We use keyword-based querying to extract issue-wise press releases from the ProPublica API. We use hashtag-based classification for tweets: a set of gold hashtags for each issue was created and the tweets were classified accordingly. Data collection is detailed in the appendix. Sentence-wise BERT-base embeddings of all documents are computed.

### Query Mechanism

We implemented a query mechanism to obtain relevant subsets of data from the corpus. Each query is a triplet of entities, issues and events. Given a query triplet, we retrieve: news articles related to the events for each of the issues, Wikipedia articles for each of the entities, background descriptions of the issues, perspectives of each entity regarding each of the issues, and tweets & press releases by each of the entities related to the events in the query. Referenced entities for each sentence in the documents and sentence-wise BERT embeddings of the documents are also retrieved.
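As an illustration, the retrieval logic can be sketched over a toy in-memory corpus (the field names `type`, `author`, `issue` and `event` are assumptions of this sketch, not the actual storage schema):

```python
def run_query(corpus, entities, issues, events):
    """Retrieve the document subsets described above for one query triplet."""
    entities, issues, events = set(entities), set(issues), set(events)
    return {
        "news": [d for d in corpus if d["type"] == "news"
                 and d["issue"] in issues and d["event"] in events],
        "wiki": [d for d in corpus if d["type"] == "wiki"
                 and d["author"] in entities],
        "background": [d for d in corpus if d["type"] == "background"
                       and d["issue"] in issues],
        "perspectives": [d for d in corpus if d["type"] == "perspective"
                         and d["author"] in entities and d["issue"] in issues],
        "first_person": [d for d in corpus if d["type"] in ("tweet", "press")
                         and d["author"] in entities and d["event"] in events],
    }

corpus = [
    {"type": "news", "author": None, "issue": "guns", "event": "e1"},
    {"type": "tweet", "author": "A", "issue": "guns", "event": "e1"},
    {"type": "tweet", "author": "B", "issue": "guns", "event": "e1"},
    {"type": "wiki", "author": "A", "issue": None, "event": None},
    {"type": "background", "author": None, "issue": "guns", "event": None},
    {"type": "perspective", "author": "A", "issue": "guns", "event": None},
]
result = run_query(corpus, ["A"], ["guns"], ["e1"])
```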

## Compositional Reader

In this section, we describe the architecture of the proposed Compositional Reader model in detail. It contains $3$ key components: Graph Generator, Encoder and Composer. Given the output of the query mechanism from the previous section, the Graph Generator creates a directed graph with entities, issues, events and documents as nodes. The Encoder generates initial node embeddings for each of the nodes. The Composer is a transformer-based Graph Attention Network (GAT) followed by a pooling layer; it generates the final node embeddings and a single summary embedding for the query graph. Each component is described below.


### Graph Generator

Given the output of the query mechanism for a query, the Graph Generator creates a directed graph with $5$ types of nodes: authoring entities, referenced entities, issues, events and documents. Directed edges are used by the Composer to update source nodes' representations using destination nodes. We design the topology with the main goal of capturing representations of events, issues and referenced entities that reflect the author's opinion about them. We add edges from issues/events to the author's documents but omit the reverse direction, as our main goal is to contextualize issues/events using the author's opinions. Edges are added from authoring entities to their Wikipedia articles and to the documents they authored (tweets, press releases and perspectives); reverse edges from the authored documents to the author are also added. Uni-directional edges are added from relevant event nodes to tweet and press release document nodes. Edges from issue nodes to event nodes and vice-versa are added, as are edges from issue nodes to their background description documents and edges from event nodes to news articles describing the events and vice-versa. Uni-directional edges from issue nodes to author perspective nodes are added. Finally, uni-directional edges from referenced entities to all the document nodes are added. An example graph is shown in Fig. test_graph.
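The edge rules above can be sketched as a small adjacency structure (node names are illustrative placeholders, not the actual node identifiers):

```python
from collections import defaultdict

# An edge s -> d means s's representation is updated using d
# (the Composer's convention described above).
adj = defaultdict(set)

def edge(src, dst):
    adj[src].add(dst)

def biedge(a, b):
    edge(a, b)
    edge(b, a)

# Authoring entity <-> authored documents; entity -> its Wikipedia article
biedge("author:Smith", "doc:tweet1")
edge("author:Smith", "doc:wiki_Smith")
# One-way edges from events/issues to first-person documents, so that
# issues/events are contextualized by the author's words, not vice versa
edge("event:e1", "doc:tweet1")
edge("issue:guns", "doc:perspective1")
# Issue <-> event, event <-> news article, issue -> background description
biedge("issue:guns", "event:e1")
biedge("event:e1", "doc:news1")
edge("issue:guns", "doc:background_guns")
# One-way edges from referenced entities to documents mentioning them
edge("ref:NRA", "doc:tweet1")
```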

### Encoder

The Encoder computes the initial node embeddings. It consists of BERT followed by a Bi-LSTM. For each node, it takes a sequence of documents as input, ordered temporally. The output of the Encoder is a single embedding of dimension $d_m$ for each node. Given a node $\mathcal{N} = \{D_1, D_2, \ldots, D_d\}$ consisting of $d$ documents, for each document $D_i$, contextualized embeddings of all the tokens are computed using BERT. Token embeddings are computed sentence-wise to avoid truncating long documents. Then, the token embeddings of each document are mean-pooled to get the document embeddings $\vec{\mathcal{N}}^{bert} = \{\vec{D_1}^{bert}, \vec{D_2}^{bert}, \ldots, \vec{D_d}^{bert}\}$, where $\vec{D_i}^{bert}\in\mathbb{R}^{1 \times d_m}$ and $d_m$ is the dimension of a BERT token embedding. The sequence $\vec{\mathcal{N}}^{bert}$ is passed through a Bi-LSTM to obtain an output sequence $\vec{E} = \{\vec{e_1}, \vec{e_2}, \ldots, \vec{e_d}\}$, $\vec{e_i}\in\mathbb{R}^{1 \times h}$, where $h/2$ is the hidden dimension of the Bi-LSTM; we set $h = d_m$ in our model. Finally, the output of the Encoder is computed by mean-pooling the sequence $\vec{E}$. We use the BERT-base-uncased model in our experiments, where $d_m=h=768$. Initial node embeddings of all the document nodes are set to the Encoder output of the documents themselves. For authoring entity nodes, their Wikipedia descriptions, tweets, press releases and perspective documents are passed through the Encoder. For issue nodes, the background description of the issue is used. For event nodes, the Encoder representation of all the news articles related to the event is used. For referenced entities, all documents referring to the entity are used.
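A toy numeric sketch of the Encoder pipeline, with a tiny hand-rolled Bi-LSTM standing in for the real one and random vectors standing in for BERT token embeddings (dimensions are scaled down from $768$ for readability):

```python
import numpy as np

rng = np.random.default_rng(0)
d_m = 8      # toy stand-in for the 768-dim BERT embedding size
h = d_m      # Bi-LSTM output size; each direction has h/2 hidden units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_run(xs, W, U, b, reverse=False):
    """Minimal LSTM forward pass over a list of input vectors."""
    hd = b.shape[0] // 4
    hs, ht, c = [], np.zeros(hd), np.zeros(hd)
    for x in (xs[::-1] if reverse else xs):
        i, f, g, o = np.split(W @ x + U @ ht + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        ht = sigmoid(o) * np.tanh(c)
        hs.append(ht)
    return np.array(hs[::-1] if reverse else hs)

def encode_node(docs_token_embs, params):
    # 1. Mean-pool token embeddings -> one vector per document
    doc_embs = [toks.mean(axis=0) for toks in docs_token_embs]
    # 2. Bi-LSTM over the temporally ordered document sequence
    fwd = lstm_run(doc_embs, *params["fwd"])
    bwd = lstm_run(doc_embs, *params["bwd"], reverse=True)
    outs = np.concatenate([fwd, bwd], axis=1)   # (d, h)
    # 3. Mean-pool the Bi-LSTM outputs -> single node embedding
    return outs.mean(axis=0)

def init(hd):
    return (rng.normal(size=(4 * hd, d_m)) * 0.1,
            rng.normal(size=(4 * hd, hd)) * 0.1,
            np.zeros(4 * hd))

params = {"fwd": init(h // 2), "bwd": init(h // 2)}
# Three "documents" with 5, 3 and 7 token embeddings each
docs = [rng.normal(size=(n_tok, d_m)) for n_tok in (5, 3, 7)]
node_emb = encode_node(docs, params)   # single (h,)-dim node embedding
```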

### Composer

The Composer is a transformer-based graph attention network (GAT) followed by a pooling layer. We use the transformer encoding layer as a graph attention layer after removing the position-wise feed-forward layer. The position-wise feed-forward layer is removed because the transformer unit was originally proposed for sequence-to-sequence prediction tasks, but the nodes in a graph usually have no ordering relationship between them. The adjacency matrix of the graph is used as the attention mask. Self-loops are added for all nodes so that the updated representation of a node also depends on its previous representation. The Composer uses $l=2$ graph attention layers in our experiments. It generates updated node embeddings $\mathbb{U}\in\mathbb{R}^{n \times d_m}$ and a summary embedding $\mathbb{S}\in\mathbb{R}^{1 \times d_m}$ as outputs. The output dimension of node embeddings is $768$, same as BERT-base.
$$\begin{aligned}
\mathbb{E} &\in \mathbb{R}^{d_m \times n}, \quad \mathcal{A} \in \{0, 1\}^{n \times n} \\
\mathbb{G} &= \text{layer-norm}(\mathbb{E}) \\
Q &= W_q^T \mathbb{G}, \quad Q \in \mathbb{R}^{n_h \times d_k \times n} \\
K &= W_k^T \mathbb{G}, \quad K \in \mathbb{R}^{n_h \times d_k \times n} \\
V &= W_v^T \mathbb{G}, \quad V \in \mathbb{R}^{n_h \times d_v \times n} \\
M &= \frac{Q^{T} K}{\sqrt{d_k}}, \quad M \in \mathbb{R}^{n_h \times n \times n} \\
M &= \text{mask}(M, \mathcal{A}) \\
\mathbb{O} &= M V^{T}, \quad \mathbb{O} \in \mathbb{R}^{n_h d_v \times n} \\
\mathbb{U} &= W_o^T \mathbb{O} + \mathbb{E} \\
\mathbb{S} &= \text{mean-pool}(\mathbb{U})
\end{aligned}$$
where $n$ is the number of nodes in the graph, $d_m$ is the dimension of a BERT token embedding, $d_k$, $d_v$ are projection dimensions, $n_h$ is the number of attention heads used, and $W_q \in\mathbb{R}^{d_m \times n_h d_k}$, $W_k \in\mathbb{R}^{d_m \times n_h d_k}$, $W_v \in\mathbb{R}^{d_m \times n_h d_v}$ and $W_o \in\mathbb{R}^{n_h d_v \times d_m}$ are weight parameters to be learnt. $\mathbb{E}\in\mathbb{R}^{d_m \times n}$ is the output of the Encoder and $\mathcal{A}\in\{0, 1\}^{n \times n}$ is the adjacency matrix. We set $n_h=12$, $d_k=d_v=64$ in our experiments.
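A minimal NumPy sketch of one such adjacency-masked attention layer, written in the row-major convention ($n \times d_m$ matrices) rather than the paper's column-vector notation; the softmax placement follows the standard transformer convention, and the sketch sets $d_v = d_k$ for simplicity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def graph_attention_layer(E, A, Wq, Wk, Wv, Wo, n_h, d_k):
    """One Composer layer: multi-head attention masked by adjacency.

    E: (n, d_m) node embeddings; A: (n, n) 0/1 adjacency with self-loops.
    """
    n, d_m = E.shape
    G = layer_norm(E)
    # Project and split into heads: (n_h, n, d_k)
    Q = (G @ Wq).reshape(n, n_h, d_k).transpose(1, 0, 2)
    K = (G @ Wk).reshape(n, n_h, d_k).transpose(1, 0, 2)
    V = (G @ Wv).reshape(n, n_h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (n_h, n, n)
    scores = np.where(A[None, :, :] > 0, scores, -1e9)  # adjacency as mask
    O = softmax(scores) @ V                             # (n_h, n, d_k)
    O = O.transpose(1, 0, 2).reshape(n, n_h * d_k)
    U = O @ Wo + E                                      # residual connection
    S = U.mean(axis=0)                                  # summary embedding
    return U, S

rng = np.random.default_rng(0)
n, d_m, n_h, d_k = 4, 12, 3, 4
E = rng.normal(size=(n, d_m))
A = np.eye(n)
A[0, 1] = A[1, 0] = A[1, 2] = 1   # a few directed edges plus self-loops
W = lambda r, c: rng.normal(size=(r, c)) * 0.1
U, S = graph_attention_layer(E, A, W(d_m, n_h * d_k), W(d_m, n_h * d_k),
                             W(d_m, n_h * d_k), W(n_h * d_k, d_m), n_h, d_k)
```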


## Learning Tasks

We design $2$ learning tasks to train the Compositional Reader model: Authorship Prediction and Referenced Entity Prediction. Both tasks are flavors of link prediction over graphs. In Authorship Prediction, given a graph, an author node and a document node with no link between them, the task is to predict whether the document was authored by the author. In Referenced Entity Prediction, given a graph, a document node and a referenced entity node, the task is to predict whether the entity was referenced in the document. For this task, all occurrences of one entity in the text are replaced with a generic $<$ent$>$ token before the document embedding is computed. Both tasks are detailed below.

### Authorship Prediction

Authorship Prediction is designed as a binary classification task. Given a graph $\mathcal{G}$ generated by the Graph Generator, an author node $n_a$ and a document node $n_d$ with no edges between them, the task is to predict whether or not the author represented by $n_a$ authored the document represented by $n_d$. The intuition behind this task is to enable our model to learn to differentiate between the language of an author in the context of an issue and documents by other entities or documents related to other issues. The model sees documents by the same author for the same issue in the graph and learns to decide whether the input document has similar language. It is a fairly simple learning task and hence an ideal task to start pre-training our model. We concatenate the initial and final node embeddings of the author and document, along with the summary embedding of the graph, to obtain the inputs to the fine-tuning layers for the Authorship Prediction task. We add one hidden layer of dimension $384$ before the classification layer. Data samples for the task were created as follows: for each of the $455$ entities, for each of the $8$ issues and for all events related to that issue, we issue a query to the query mechanism and use the Graph Generator to obtain a data graph (Fig. test_graph). Hence, we issue $3,640$ queries in total and obtain the respective data graphs. To create a positive data sample, we sample a document $d_i$ authored by the entity $a_i$ and remove the edges between the nodes representing $a_i$ and $d_i$. Negative samples were designed carefully in $3$ batches to enable the model to learn different aspects of the language used by the author. In the first batch, we sample news article nodes from the same graph. In the second batch, we use tweets, press releases and perspectives of the same author but from a different issue.

In the third batch, we sample documents related to the same issue but from other authors. We generate $421,284$ samples in total, with $252,575$ positive samples and $168,709$ negative samples. We randomly split the data into a training set of $272,159$ samples, a validation set of $73,410$ samples and a test set of $75,715$ samples. We also perform out-sample experiments to evaluate generalization to unseen politicians' data: we train the model on training data from two-thirds of the politicians and test on the test sets of the others. Results are shown in Tab. auth_pred_res. We perform graph trimming to make the computation tractable on a single GPU: we randomly drop $80\%$ of the news articles, tweets and press releases that are not related to the event to which $d_i$ belongs. We use graphs with $200$-$500$ nodes and a batch size of $1$.
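The positive sample and the three negative-sample batches can be sketched as follows (the document field names are assumptions of this sketch):

```python
def make_samples(graph_docs, author, issue):
    """Label a graph's documents as samples for one (author, issue) query."""
    samples = []
    for d in graph_docs:
        if d["author"] == author and d["issue"] == issue:
            samples.append((d, 1))   # positive: author's doc on this issue
        elif d["type"] == "news":
            samples.append((d, 0))   # batch 1: news articles from the graph
        elif d["author"] == author and d["issue"] != issue:
            samples.append((d, 0))   # batch 2: same author, different issue
        elif d["author"] != author and d["issue"] == issue:
            samples.append((d, 0))   # batch 3: same issue, other authors
    return samples

docs = [
    {"author": "A", "issue": "guns", "type": "tweet"},
    {"author": None, "issue": "guns", "type": "news"},
    {"author": "A", "issue": "environment", "type": "press"},
    {"author": "B", "issue": "guns", "type": "press"},
]
samples = make_samples(docs, "A", "guns")
```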


### Referenced Entity Prediction

This task is also designed as binary classification. Given a graph $\mathcal{G}$, a document node $d_i$ and a referenced entity node $r_i$ from $\mathcal{G}$, the task is to predict whether or not $r_i$ is referenced in $d_i$. To create data samples for this task, we sample a document from the data graph and replace all occurrences of the most frequent referenced entity in the document with a generic $<$ent$>$ token. We remove the link between $r_i$ and $d_i$ in $\mathcal{G}$. The triplet ($\mathcal{G}$, $d_i$, $r_i$) is used as a positive data sample. We sample another referenced entity $r_j$ from the graph that is not referenced in $d_i$ to generate a negative sample. The intuition behind this task is to enable our model to learn the correlation between the author, the language in the document and the referenced entity. For example, in the context of Donald Trump's recent impeachment hearing, consider the sentence "X needs to face the consequences of their actions". Depending upon the author, X could refer to different entities. Learning to understand such correlations by looking at other documents from the same author is a useful training task for our model. It is also a harder learning problem than Authorship Prediction. We use a fine-tuning architecture similar to Authorship Prediction on top of the Compositional Reader for this task as well. We keep separate fine-tuning parameters for each task, as they are fundamentally different prediction problems; the Compositional Reader is shared. We generated $252,578$ samples for this task, half of them positive. They were split into $180,578$ training samples, and validation and test sets of $36,400$ samples each. We apply graph trimming for this task as well, and we also perform out-sample evaluation for this learning task.
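The entity-masking step can be sketched as follows, counting surface mentions as a stand-in for the upstream coreference resolution:

```python
import re
from collections import Counter

def mask_most_frequent_entity(text, mentions):
    """Replace all occurrences of the most frequent referenced entity
    with a generic <ent> token; returns (entity, masked_text)."""
    entity = Counter(mentions).most_common(1)[0][0]
    return entity, re.sub(re.escape(entity), "<ent>", text)

text = "The NRA praised the bill. Critics of the NRA disagree with Smith."
entity, masked = mask_most_frequent_entity(text, ["NRA", "NRA", "Smith"])
```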

## Evaluation

We evaluate our model and pre-training tasks in a systematic manner using several quantitative tasks and qualitative analyses. Quantitative evaluation includes the NRA Grade Paraphrase task, Grade Prediction on NRA and LCV grades data, and a Bias Prediction task on AllSides news articles. Qualitative evaluation includes entity-stance visualization for issues. We compare our model's performance to BERT representations, the BERT adaptation baseline and representations from the Encoder module. The baselines and evaluation tasks are detailed below.


### NRA Grades Evaluation

The National Rifle Association (NRA) assigns letter grades (A+, A, ..., F) to politicians based on a candidate questionnaire and their gun-related voting. We evaluate our representations on their ability to predict these grades. Our intuition is that the language in the tweets, press releases and perspectives of a politician directly helps in predicting their NRA grade. We evaluate our model on $2$ tasks: the Paraphrase task and the Grade Prediction task. In the Paraphrase task, we evaluate the representations from our model directly, without training on NRA grades data. In the Grade Prediction task, we use the representations from our model and fine-tune on grades data. We collected the historical NRA grades of politicians; grade data is available for $349$ of the $455$ politicians in focus. For each politician $p_i$, we obtain data for the query ($p_i$, guns, all guns-related events). We input the data to the Compositional Reader and take the final node embeddings of the nodes representing the politician $\vec{n}_{auth}$, the issue $\vec{n}_{guns}$ and the referenced entity $\vec{n}_{NRA}$. For some politicians, $\vec{n}_{NRA}$ is not available, depending on whether or not they referred to the NRA in their discourse. These embeddings are used for both the prediction and paraphrase tasks. We repeat the Grade Prediction task with LCV grades for the environment issue. The tasks are detailed below. In the Paraphrase task, we evaluate our representations directly, without training on the NRA grade data. Grades are divided into two classes: grades of B+ and higher are in the positive class, and all grades from C+ to F are in the negative class. We formulate a representative sentence for each class:

• POSITIVE: I strongly support the NRA
• NEGATIVE: I vehemently oppose the NRA

We compute BERT embeddings for the representative sentences to obtain $\vec{pos}_{NRA}$ and $\vec{neg}_{NRA}$. We mean-pool the three embeddings $\vec{n}_{auth}$, $\vec{n}_{guns}$ and $\vec{n}_{NRA}$ to obtain $\vec{n}_{stance}$. We compute the cosine similarity of $\vec{n}_{stance}$ with $\vec{pos}_{NRA}$ and $\vec{neg}_{NRA}$, and the politician is assigned the class with higher similarity. We compare our model's results to BERT, BERT adaptation and Encoder embeddings. For the BERT baseline, we compute $\vec{n}_{stance}$ by mean-pooling the sentence-wise BERT embeddings of the tweets, press releases and perspectives of the author on all events related to the issue.
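The zero-shot paraphrase classification can be sketched with toy vectors (real inputs would be $768$-dimensional BERT-derived embeddings):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def paraphrase_class(n_auth, n_guns, n_nra, pos_ref, neg_ref):
    """Mean-pool the available node embeddings and assign the class of
    the closer representative-sentence embedding (n_nra may be None)."""
    vecs = [v for v in (n_auth, n_guns, n_nra) if v is not None]
    n_stance = np.mean(vecs, axis=0)
    if cosine(n_stance, pos_ref) >= cosine(n_stance, neg_ref):
        return "positive"
    return "negative"

pos_ref = np.array([1.0, 0.0])  # stands in for BERT("I strongly support the NRA")
neg_ref = np.array([0.0, 1.0])  # stands in for BERT("I vehemently oppose the NRA")
label = paraphrase_class(np.array([1.0, 0.1]), np.array([0.9, 0.0]), None,
                         pos_ref, neg_ref)
```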


Results are shown in Tab. quant_eval. Grade Prediction is a $5$-class classification task, one class for each letter grade: {A, B, C, D, F}. We train a simple feed-forward network with $1$ hidden layer of dimension $1000$. The network is given $2$ inputs, $\vec{n}_{auth}$ and $\vec{n}_{guns}$. When $\vec{n}_{NRA}$ is available for an entity, we set $\vec{n}_{guns}$ to the combination of ($\vec{n}_{NRA}$, $\vec{n}_{guns}$). The network's output is a classification prediction. We randomly divide the NRA grades data into $k=10$ folds; we train the model on $8$ folds, use $1$ fold for validation and check performance on the remaining test fold. We repeat this experiment with each fold as the test fold, and then repeat the entire process for $5$ random seeds. We perform this evaluation for BERT, BERT adaptation, Encoder and Compositional Reader. To compute $\vec{n}_{auth}$ for the BERT baseline, we mean-pool the sentence-wise embeddings of all the author's documents on the issue. For $\vec{n}_{guns}$, we use the background description document of the issue. Results on the test set are in Tab. quant_eval. Further, we also perform experiments training the model on a fraction of the data, monitoring validation and test performance as the training data percentage changes. We observe that, in general, the gap between the Compositional Reader model and the BERT baseline widens with more training data, hinting that our representation captures more relevant information for this task. Results are included in the Appendix.

### LCV Grades Evaluation

This is similar to the NRA Grade Prediction task. It is a $4$-way classification task. The League of Conservation Voters (LCV) assigns a score between $0$ and $100$ to each politician depending upon their environmental voting activity. We segregate politicians into $4$ classes ($0$-$25$, $25$-$50$, $50$-$75$, $75$-$100$). We obtain the input to the prediction model by concatenating $\vec{n}_{auth}$ and $\vec{n}_{environment}$. We use the same fine-tuning architecture as the NRA Grade Prediction task with a fresh set of parameters. Results are shown in Tab.
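The score-to-class mapping can be sketched as follows; since the stated ranges share boundary points, assigning a boundary score to the upper class is an assumption of this sketch:

```python
def lcv_class(score):
    """Map an LCV score in [0, 100] to one of the 4 classes (0..3)."""
    return min(int(score // 25), 3)
```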

## Ablation Analysis

Further, we investigate the importance of various components of our model by performing an ablation study over the document types on the NRA Grades Paraphrase task. The results are shown in Tab. ablation_study.

| Configuration | Accuracy |
| --- | --- |
| − Tweets | 63.32% |
| − Press Releases | 63.04% |
| − Perspectives | 59.31% |
| Only Tweets | 40.11% |
| Only Press Releases | 55.87% |
| Only Perspectives | 60.74% |

Results in Tab. ablation_study indicate that perspectives are the most useful documents for this task while tweets are the least useful. As perspectives are summarized ideological leanings of politicians, it is intuitive that they are more effective for this task. Tweets are informal discourse and tend to be very specific to a current event, hence they are not as useful here.

## Conclusion

We propose a Compositional Reader model that builds upon BERT representations and generates more effective representations. We design learning tasks and train our model on large amounts of political data. We evaluate our model on several qualitative and quantitative tasks, comprehensively outperforming the BERT-base model on both the learning tasks and the quantitative evaluation tasks. Results from our qualitative evaluation demonstrate that our representations effectively capture nuanced political information.