SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
Abstract
This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news – in contrast with human evaluators' judgement. This suggests that the challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.
1 Introduction and related work
The goal of the summarization task is to condense a piece of text into a shorter version that covers the main points succinctly. In the abstractive approach, important pieces of information are presented using words and phrases not necessarily appearing in the source text. This requires natural language generation techniques with a high level of semantic understanding (Chopra et al., 2016; Rush et al., 2015; Khandelwal et al., 2019; Zhang et al., 2019; See et al., 2017; Chen and Bansal, 2018; Gehrmann et al., 2018).
Major research efforts have so far focused on summarization of single-speaker documents like news (e.g., Nallapati et al. (2016)) or scientific publications (e.g., Nikolov et al. (2018)). One of the reasons is the availability of large, high-quality news datasets with annotated summaries, e.g., CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016). Such a comprehensive dataset for dialogues is lacking.
The challenges posed by the abstractive dialogue summarization task have been discussed in the literature with regard to the AMI meeting corpus (McCowan et al., 2005), e.g. Banerjee et al. (2015), Mehdad et al. (2014), Goo and Chen (2018). Since the corpus has a low number of summaries (for 141 dialogues), Goo and Chen (2018) proposed to use assigned topic descriptions as gold references. These are short, label-like goals of the meeting, e.g., costing evaluation of project process; components, materials and energy sources; chitchat. Such descriptions, however, are very general, lacking the messenger-like structure and any information about the speakers.
To benefit from large news corpora, Ganesh and Dingliwal (2019) built a dialogue summarization model that first converts a conversation into a structured text document and later applies an attention-based pointer network to create an abstractive summary. Their model, trained on structured text documents of the CNN/Daily Mail dataset, was evaluated on the Argumentative Dialogue Summary Corpus (Misra et al., 2015), which, however, contains only 45 dialogues.
In the present paper, we further investigate the problem of abstractive dialogue summarization. With the growing popularity of online conversations via applications like Messenger, WhatsApp and WeChat, summarization of chats between a few participants is an interesting new direction of summarization research. For this purpose we have created the SAMSum Corpus, which contains over 16k chat dialogues with manually annotated summaries. The dataset is freely available for the research community.
Table 1: Dataset sizes
| Dataset | Train | Validation | Test |
|---|---|---|---|
| CNN/DM | 287227 | 13368 | 11490 |
| SAMSum | 14732 | 818 | 819 |
The paper is structured as follows: in Section 2 we present details about the new corpus and describe how it was created, validated and cleaned. A brief description of the baselines used in the summarization task can be found in Section 3. In Section 4, we describe our experimental setup and the parameters of the models. Both evaluations of the summarization models, the automatic one with the ROUGE metric and the linguistic one, are reported in Section 5 and Section 6, respectively. Examples of models' outputs and some errors they make are described in Section 7. Finally, discussion, conclusions and ideas for further research are presented in Sections 8 and 9.
2 SAMSum Corpus
Initial approach. Since there was no available corpus of messenger conversations, we considered two approaches to build it: (1) using existing datasets of documents, which have a form similar to chat conversations, (2) creating such a dataset by linguists.
In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcription of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations that are typically written in messenger apps, e.g. they were too technical (IRC data), too long (comments data, transcription of meetings), lacked context (movie dialogues) or they were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol.
As a consequence, we decided to create a chat dialogue dataset by constructing such conversations that would epitomize the style of a messenger app.
Process of building the dataset. Our dialogue summarization dataset contains natural messenger-like conversations created and written down by linguists fluent in English. The style and register of the conversations are diversified – dialogues can be informal, semi-formal or formal, and they may contain slang phrases, emoticons and typos. We asked linguists to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. These include chit-chats, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Therefore, this dataset does not contain any sensitive data or fragments of other corpora.
Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that they should (1) be rather short, (2) extract important pieces of information, (3) include names of interlocutors, (4) be written in the third person. Each dialogue contains only one reference summary.
Validation. Since the SAMSum corpus contains dialogues created by linguists, the question arises whether such conversations are really similar to those typically written via messenger apps. To find the answer, we performed a validation task. We asked two linguists to doubly annotate 50 conversations in order to verify whether the dialogues could appear in a messenger app and could be summarized (i.e. a dialogue is not too general or unintelligible) or not (e.g. a dialogue between two people in a shop). The results revealed that 94% of the examined dialogues were classified by both annotators as good, i.e. they do look like conversations from a messenger app and could be condensed in a reasonable way. In a similar validation task, conducted for the existing dialogue-type datasets (described in the Initial approach section), the annotators agreed that only 28% of the dialogues resembled conversations from a messenger app.
Cleaning data. After preparing the dataset, we cleaned it in a semi-automatic way. Beforehand, we specified a format for written dialogues with summaries: a colon should separate the author of an utterance from its content, and each utterance is expected to be in a separate line. Therefore, we could easily find all deviations from the agreed structure – some of them could be fixed automatically (e.g. when, instead of a colon, someone used a semicolon right after the interlocutor's name at the beginning of an utterance), others were passed for verification to linguists. We also tried to correct typos in interlocutors' names (if one person has several utterances, it happens that, before one of them, there is a typo in his/her name) – we used the Levenshtein distance to find very similar names (possibly with typos, e.g. 'George' and 'Goerge') in a single conversation, and those cases with very similar names were passed to linguists for verification.
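The name-typo check described above can be sketched as follows (a minimal illustration, not the authors' actual tooling; the distance threshold of 2 and the helper names are our assumptions):

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def suspicious_name_pairs(dialogue_lines, max_dist=2):
    """Flag pairs of very similar speaker names in one conversation
    (likely typos) for manual verification by linguists."""
    names = {line.split(":", 1)[0].strip()
             for line in dialogue_lines if ":" in line}
    return [(x, y) for x, y in combinations(sorted(names), 2)
            if 0 < levenshtein(x, y) <= max_dist]


pairs = suspicious_name_pairs([
    "George: see you at 5",
    "Anna: ok!",
    "Goerge: great",
])
print(pairs)  # [('George', 'Goerge')]
```

Pairs returned this way would still go to a human, since two genuinely different speakers can also have similar names.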
Description. The created dataset consists of 16369 conversations distributed uniformly into 4 groups based on the number of utterances in a conversation: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations are dialogues between two interlocutors (about 75% of all conversations); the rest are between three or more people. Table 1 presents the size of the dataset split used in our experiments. An example of a dialogue from this corpus is shown in Table 2.
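The length grouping can be reproduced in a few lines (a sketch of ours; the one-utterance-per-line format is the one specified in the Cleaning data paragraph, and the bucket labels are illustrative):

```python
# The corpus's four length buckets, by number of utterances per conversation.
BUCKETS = [(3, 6), (7, 12), (13, 18), (19, 30)]


def length_bucket(dialogue_text: str) -> str:
    """Return the bucket label for a dialogue given as one utterance per line."""
    n = sum(1 for line in dialogue_text.splitlines() if line.strip())
    for lo, hi in BUCKETS:
        if lo <= n <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"{n} utterances is outside the corpus range")


print(length_bucket("Ann: hi\nTom: hello\nAnn: bye"))  # 3-6
```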
Table 2: Example of a dialogue from the collected corpus
| Dialogue |
|---|
| Blair: Remember we are meeting the wedding planner after work Chuck: Sure, where are we meeting her? Blair: At Nonna Rita's Chuck: Can I order their seafood tagliatelle, or is it just coffee with her? I've been thinking about it ever since we went there last month Blair: Haha sure you can |
| Chuck: We both remember the tomato pasta disaster from the last meeting with Diane Blair: Omg hahaha it spilled all over her white shirt Chuck: :D |
| Blair: :P Summary: Blair and Chuck are going to meet the wedding planner at Nonna Rita's after work. |
3 Dialogue baselines
The baseline commonly used in the news summarization task is Lead-3 (See et al., 2017), which takes the three leading sentences of the document as the summary. The underlying assumption is that the beginning of the article contains the most significant information. Inspired by the Lead-n model, we propose a few different simple models: MIDDLE-n, taking n utterances from the middle of the dialogue; LONGEST-n, treating the n longest utterances as the summary; LONGER-THAN-n, taking utterances longer than n characters; and MOST-ACTIVE-PERSON, treating all utterances of the most active person in the dialogue as the summary.
Table 3: Baselines for dialogue summarization
| Model | n | R-1 | R-2 | R-L |
|---|---|---|---|---|
| LEAD | 3 | 31.40 | 8.68 | 29.42 |
| LEAD | 4 | 31.87 | 8.93 | 29.91 |
| LEAD | 5 | 32.02 | 9.53 | 30.07 |
| MIDDLE | 3 | 28.04 | 6.57 | 26.13 |
| MIDDLE | 4 | 30.08 | 7.96 | 28.10 |
| MIDDLE | 5 | 29.91 | 8.12 | 27.97 |
| LONGEST | 3 | 32.46 | 10.27 | 29.92 |
| LONGEST | 4 | 32.19 | 10.35 | 29.91 |
| LONGEST | 5 | 31.61 | 10.21 | 29.55 |
| LONGER-THAN | 10 | 28.31 | 9.69 | 26.72 |
| LONGER-THAN | 20 | 29.36 | 10.23 | 27.59 |
| LONGER-THAN | 30 | 29.61 | 10.28 | 27.71 |
| MOST-ACTIVE-PERSON | n/a | 26.54 | 8.55 | 24.57 |
Results of the evaluation of the above models are reported in Table 3. There is no obvious baseline for the task of dialogue summarization. We expected rather low results for Lead-3, as the beginnings of conversations usually contain greetings, not the main part of the discourse. However, it seems that in our dataset greetings are frequently combined with question-asking or information passing (sometimes they are even omitted), and such a baseline works even better than the MIDDLE baseline (taking utterances from the middle of a dialogue). Nevertheless, the best dialogue baseline turns out to be the LONGEST-3 model.
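Our reading of these simple baselines can be sketched as follows (illustrative only; tie-breaking and the utterance order kept by LONGEST-n in the original experiments are assumptions):

```python
def lead_n(utterances, n=3):
    """LEAD-n: the first n utterances form the summary."""
    return " ".join(utterances[:n])


def middle_n(utterances, n=3):
    """MIDDLE-n: n utterances taken from the middle of the dialogue."""
    start = max(0, (len(utterances) - n) // 2)
    return " ".join(utterances[start:start + n])


def longest_n(utterances, n=3):
    """LONGEST-n: the n longest utterances form the summary."""
    return " ".join(sorted(utterances, key=len, reverse=True)[:n])


utts = ["Ann: hi", "Tom: hello", "Ann: lunch at noon at the new place?",
        "Tom: sure, see you there", "Ann: great"]
print(longest_n(utts, 1))  # Ann: lunch at noon at the new place?
```

On such toy input the length heuristic already picks the content-bearing turn over the greetings, which matches the behaviour observed in Table 3.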
4 Experimental setup
This section describes the settings used in the experiments carried out.
4.1 Data preparation
In order to build a dialogue summarization model, we adopt the following strategies: (1) each candidate architecture is trained and evaluated on the dialogue dataset; (2) each architecture is trained on the train set of CNN/Daily Mail joined together with the train set of the dialogue data, and evaluated on the dialogue test set.
In addition, we prepare a version of the dialogue data in which utterances are separated with a special token called the separator (an artificially added token).
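A sketch of this preprocessing step; the literal '<SEP>' string is a placeholder of ours, since the exact separator token is an implementation detail of the original experiments:

```python
SEP = "<SEP>"  # placeholder token; any string absent from the data would do


def flatten_dialogue(utterances, use_separator=True):
    """Join a dialogue's utterances into the flat input string fed to
    a summarization model, optionally marking utterance boundaries."""
    joiner = f" {SEP} " if use_separator else " "
    return joiner.join(u.strip() for u in utterances)


print(flatten_dialogue(["Ann: hi", "Tom: hello"]))
# Ann: hi <SEP> Tom: hello
```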
4.2 Models
We carry out experiments with the following summarization models (for all architectures we set the beam size for beam search decoding to 5):
• Pointer generator network (See et al., 2017). In the case of the Pointer Generator, we use the default configuration, changing only the minimum length of the generated summary from 35 (used for news) to 15 (used for dialogues).
• Transformer (Vaswani et al., 2017). The model is trained using the OpenNMT library. We use the same parameters for training both on news and on dialogues, changing only the minimum length of the generated summary – 35 for news and 15 for dialogues.
• Fast Abs RL (Chen and Bansal, 2018). It is trained using its default parameters. For dialogues, we change the convolutional word-level sentence encoder (used in the extractor part) to use only a kernel of size 3, instead of the 3-5 range. This is because some utterances are very short and the default setting is unable to handle them.
• Fast Abs RL Enhanced. An additional variant of the Fast Abs RL model with slightly changed utterances: at the end of each utterance, after an artificial separator, we add the names of all the other interlocutors. The reason is that Fast Abs RL requires the text to be split into sentences (as it selects sentences and then paraphrases each of them). For dialogues, we divide the text into utterances (a natural unit in conversations), so a single utterance may sometimes contain more than one sentence. Given how this model works, it may select an utterance of a single person (each utterance starts with the name of its author) and have no information about the other interlocutors (if their names do not appear in the selected utterances), so it may have no chance to use the right people's names in the generated summaries.
• LightConv and DynamicConv (Wu et al., 2019). The implementation is available in fairseq (Ott et al., 2019). We train the lightweight convolution models in two manners: (1) learning token representations from scratch; in this case we apply BPE tokenization with a vocabulary of 30K types, using the fastBPE implementation (Sennrich et al., 2015); (2) initializing token embeddings with pre-trained language model representations; as the language model we choose GPT-2 small (Radford et al., 2019).
4.3 Evaluation metrics
We evaluate models with the standard ROUGE metric (Lin, 2004), reporting the $F_{1}$ scores (with stemming) for ROUGE-1, ROUGE-2 and ROUGE-L, following previous works (Chen and Bansal, 2018; See et al., 2017). We obtain scores using the py-rouge package.
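For intuition, ROUGE-1 F1 reduces to unigram overlap between a candidate and a reference; below is a toy re-implementation of ours (the scores reported in this paper come from py-rouge, which additionally applies stemming):

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (no stemming or stopword handling)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("blair and chuck met the wedding planner",
                  "blair and chuck are going to meet the wedding planner")
print(round(score, 2))  # 0.71
```

Note that a paraphrase such as "met" vs. "meet" earns no credit here, which is exactly the weakness for abstractive summaries discussed in Section 6.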
5 Results
The results for the news summarization task are shown in Table 4 and for the dialogue summarization task in Table 5. In both domains, the best models' ROUGE-1 exceeds 39, ROUGE-2 – 17 and ROUGE-L – 36. Note that the strong baseline for news (Lead-3) is outperformed in all three metrics by only one model. In the case of dialogues, all tested models perform better than the baseline (LONGEST-3).
In general, the Transformer-based architectures benefit from training on the joint dataset: news+dialogues, even though the news and the dialogue documents have very different structures. Interestingly, this does not seem to be the case for the Pointer Generator or Fast Abs RL model.
The inclusion of a separation token between dialogue utterances is advantageous for most models – presumably because it improves the discourse structure. The improvement is most visible when training is performed on the joint dataset.
Having compared two variants of the Fast Abs RL model – with original utterances and with enhanced ones (see Section 4.2) – we conclude that enhancing utterances with information about the other interlocutors helps achieve higher ROUGE values.
The largest improvement in model performance is observed for the LightConv and DynamicConv models when they are complemented with pretrained embeddings from the GPT-2 language model, trained on enormous corpora.
It is also worth noting that some models (Pointer Generator, Fast Abs RL), trained only on the dialogue corpus (16k dialogues), reach a similar (or better) level in terms of ROUGE metrics than models trained on the CNN/DM news dataset (more than 300k articles). Adding pretrained embeddings and training on the joint dataset helps achieve significantly higher ROUGE values for dialogues than the best models achieve on the CNN/DM news dataset.
According to the ROUGE metrics, the best performing model is DynamicConv with GPT-2 embeddings, trained on the joint news and dialogue data with an utterance separation token.
6 Linguistic verification of summaries
ROUGE is a standard way of evaluating the quality of machine-generated summaries by comparing them with reference ones. The metric, based on n-gram overlap, may however not be very informative for abstractive summarization, where paraphrasing is a key point in producing high-quality sentences. To quantify this conjecture, we manually evaluated summaries generated by the models for 100 news articles and 150 dialogues. We asked two linguists to mark the quality of every summary on the scale of -1, 0, 1, where -1 means that a summary is poor, extracts irrelevant information or does not make sense at all, 1 means that it is understandable and gives a brief overview of the text, and 0 stands for a summary that extracts only a part of the relevant information or makes some mistakes.
Table 4: Model evaluation on the news corpus test set
| Model | R-1 | R-2 | R-L |
|---|---|---|---|
| Lead-3 baseline | 40.24 | 17.44 | 34.90 |
| Pointer Generator | 38.72 | 16.67 | 35.59 |
| Fast Abs RL | 40.99 | 17.72 | 38.30 |
| Transformer | 38.72 | 16.89 | 35.74 |
| LightConv | 39.44 | 17.20 | 36.20 |
| DynamicConv | 39.46 | 17.33 | 36.29 |
| LightConv + GPT-2 emb. | 39.52 | 17.31 | 36.15 |
| DynamicConv + GPT-2 emb. | 39.94 | 17.56 | 36.51 |
We noticed a few annotations (7 for news and 4 for dialogues) with opposite marks (i.e. one annotator's judgement was -1, whereas the other one's was 1) and decided to have them annotated once again by another annotator who had to resolve the conflicts. For the rest, we calculated the linearly weighted Cohen's kappa coefficient (McHugh, 2012) between the annotators' scores. For news examples, we obtained agreement at the level of 0.371, and for dialogues – 0.506. The annotators' agreement is higher on dialogues than on news, probably because of the structure of the data – articles are often long and it is difficult to decide what the key point of the text is; dialogues, on the contrary, are rather short and focused mainly on one topic.
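On this three-point scale, the linearly weighted kappa discounts disagreements by their distance; a from-scratch sketch under the standard definition (the example ratings below are invented for illustration):

```python
def linear_weighted_kappa(r1, r2, categories=(-1, 0, 1)):
    """Cohen's kappa with linear weights w_ij = |i - j| / (k - 1)."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    # Observed joint distribution of the two annotators' ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        observed[idx[a]][idx[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]              # annotator 1 marginals
    col = [sum(o[j] for o in observed) for j in range(k)]   # annotator 2 marginals
    w = lambda i, j: abs(i - j) / (k - 1)
    d_obs = sum(w(i, j) * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp


ann1 = [1, 0, -1, 1, 0, 1, -1, 0]   # invented ratings, annotator 1
ann2 = [1, 0, -1, 0, 0, 1, 0, 0]    # invented ratings, annotator 2
print(round(linear_weighted_kappa(ann1, ann2), 3))  # 0.667
```

With linear weights, a -1 vs. 1 disagreement counts twice as much as a -1 vs. 0 one, which matches how the opposite-mark cases above were treated as the most serious conflicts.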
For the manually evaluated samples, we calculated ROUGE metrics and the mean of the two human ratings; the statistics are presented in Table 6. As we can see, models generating dialogue summaries can obtain high ROUGE results, but their outputs are marked as poor by human annotators. Our conclusion is that the ROUGE metric corresponds with the quality of generated summaries much better for news than for dialogues, which is confirmed by Pearson's correlation between human evaluation and the ROUGE metric, shown in Table 7.
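The correlations in Table 7 are plain Pearson coefficients between per-example human scores and ROUGE values; a self-contained sketch (the paired scores below are invented):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


human = [1, 0.5, -1, 0, 1, -0.5]                # invented mean human ratings
rouge1 = [48.0, 40.0, 35.0, 38.0, 50.0, 36.0]   # invented ROUGE-1 scores
print(round(pearson(human, rouge1), 2))  # 0.92
```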
Table 5: Model evaluation on the dialogues corpus test set
| Model | Training data | Separator | R-1 | R-2 | R-L |
|---|---|---|---|---|---|
| LONGEST-3 baseline | n/a | n/a | 32.46 | 10.27 | 29.92 |
| Pointer Generator | dialogues | no | 38.55 | 14.14 | 34.85 |
| Pointer Generator | dialogues | yes | 40.08 | 15.28 | 36.63 |
| Fast Abs RL | dialogues | no | 40.96 | 17.18 | 39.05 |
| Fast Abs RL Enhanced | dialogues | no | 41.95 | 18.06 | 39.23 |
| Transformer | dialogues | no | 36.62 | 11.18 | 33.06 |
| Transformer | dialogues | yes | 37.27 | 10.76 | 32.73 |
| LightConv | dialogues | no | 33.19 | 11.14 | 30.34 |
| DynamicConv | dialogues | no | 33.79 | 11.19 | 30.41 |
| DynamicConv | dialogues | yes | 33.69 | 10.88 | 30.93 |
| LightConv + GPT-2 emb. | dialogues | no | 41.81 | 16.34 | 37.63 |
| DynamicConv + GPT-2 emb. | dialogues | no | 41.79 | 16.44 | 37.54 |
| DynamicConv + GPT-2 emb. | dialogues | yes | 41.54 | 16.29 | 37.07 |
| Pointer Generator | news + dialogues | no | 35.04 | 13.25 | 32.42 |
| Pointer Generator | news + dialogues | yes | 37.27 | 14.42 | 34.36 |
| Fast Abs RL | news + dialogues | no | 41.03 | 16.93 | 39.05 |
| Fast Abs RL Enhanced | news + dialogues | no | 41.87 | 17.47 | 39.53 |
| Transformer | news + dialogues | no | 41.91 | 18.25 | 38.77 |
| Transformer | news + dialogues | yes | 42.37 | 18.44 | 39.27 |
| LightConv | news + dialogues | no | 40.29 | 17.28 | 36.81 |
| DynamicConv | news + dialogues | no | 40.66 | 17.41 | 37.20 |
| DynamicConv | news + dialogues | yes | 41.07 | 17.11 | 37.27 |
| LightConv + GPT-2 emb. | news + dialogues | no | 44.47 | 19.75 | 40.07 |
| DynamicConv + GPT-2 emb. | news + dialogues | no | 44.69 | 20.28 | 40.76 |
| DynamicConv + GPT-2 emb. | news + dialogues | yes | 45.41 | 20.65 | 41.45 |
Table 6: Statistics of human evaluation of summaries’ quality and ROUGE evaluation of those summaries
| | | #examples | mean | median | R-1 | R-2 | R-L |
|---|---|---|---|---|---|---|---|
| NEWS | overall | 100 | 0.18 | 0.5 | 39.76 | 16.55 | 36.23 |
| | Fast Abs RL | 50 | 0.33 | 0.5 | 42.33 | 18.28 | 38.82 |
| | DynamicConv | 50 | 0.03 | 0.25 | 37.19 | 14.81 | 33.64 |
| DIALOGUES | overall | 150 | -0.503 | -0.5 | 43.53 | 19.94 | 40.66 |
| | Fast Abs RL | 50 | -0.55 | -0.75 | 42.16 | 19.28 | 40.37 |
| | Fast Abs RL Enhanced | 50 | -0.63 | -1.0 | 39.79 | 16.59 | 37.05 |
| | DynamicConv + GPT-2 emb. | 50 | -0.33 | -0.5 | 48.63 | | |
7 Difficulties in dialogue summarization
In a structured text, such as a news article, the information flow is very clear. However, in a dialogue – which contains discussions (e.g. when people try to agree on a date for a meeting), questions (one person asks about something and the answer may appear a few utterances later) and greetings – the most important pieces of information are scattered across the utterances of different speakers. What is more, articles are written from the third-person point of view, whereas in a chat everyone talks about themselves, using a variety of pronouns, which further complicates the structure. Additionally, people talking on messengers are often in a hurry, so they shorten words, use slang phrases (e.g. 'u r gr8' means 'you are great') and make typos. These phenomena increase the difficulty of dialogue summarization.
Tables 8 and 9 show a few selected dialogues, together with summaries produced by the best tested models:
Table 7: Pearson’s correlations between human judgement and ROUGE metric
| | ROUGE-1 corr | ROUGE-1 p-value | ROUGE-2 corr | ROUGE-2 p-value | ROUGE-L corr | ROUGE-L p-value |
|---|---|---|---|---|---|---|
| NEWS | 0.47 | 1e-6 | 0.44 | 6e-6 | 0.48 | 1e-6 |
| DIALOGUES | 0.32 | 7.7e-5 | 0.30 | 1.84e-4 | 0.32 | 8.1e-5 |
One can easily notice problematic issues. Firstly, the models frequently have difficulties in associating names with actions, often repeating the same name, e.g., for Dialogue 1 in Table 8, Fast Abs RL generates the following summary: 'lilly and lilly are going to eat salmon'. To help the model deal with names, the utterances are enhanced by adding information about the other interlocutors – the Fast Abs RL Enhanced variant described in Section 4.2. In this case, after enhancement, the model generates a summary containing both interlocutors' names: 'lily and gabriel are going to pasta...'. Sometimes the models correctly choose speakers' names when generating a summary, but make a mistake in deciding who performs the action (the subject) and who receives it (the object), e.g. for Dialogue 2 the DynamicConv + GPT-2 emb. w/o sep. model generates the summary 'randolph will buy some earplugs for maya', while the correct form is 'maya will buy some earplugs for randolph'.
A closely related problem is capturing the context and extracting information about the arrangements made in the discussion. For instance, for Dialogue 4, the Fast Abs RL model draws a wrong conclusion from the agreed arrangement. This issue is quite frequently visible in summaries generated by Fast Abs RL, and may be a consequence of the way the model is constructed: it first chooses important utterances and then summarizes each of them separately. This leads to a narrowing of the context and losing important pieces of information.
One more aspect of summary generation is deciding which information in the dialogue content is important. For instance, for Dialogue 3, DynamicConv + GPT-2 emb. with sep. generates a correct summary, but focuses on a piece of information different from the one included in the reference summary. In contrast, some other models – like Fast Abs RL Enhanced – select both of the pieces of information appearing in the discussion. On the other hand, when summarizing Dialogue 5, the models seem to focus too much on the phrase 'it's the best place', intuitively not the most important one to summarize.
8 Discussion
This paper is a step towards abstractive summarization of dialogues by (1) introducing a new dataset created for this task, and (2) comparing dialogue summarization with news summarization by means of automated (ROUGE) and human evaluation.
Most of the tools and metrics measuring the quality of text summarization have been developed for single-speaker documents, such as news; as such, they are not necessarily the best choice for conversations with several speakers.
We test a few general-purpose summarization models. In terms of human evaluation, the results of dialogue summarization are worse than those of news summarization. This is connected with the fact that the dialogue structure is more complex – information is spread across multiple utterances, discussions and questions, and more typos and slang words appear there, posing new challenges for summarization. On the other hand, dialogues are divided into utterances, and each utterance has an assigned author. We demonstrate in experiments that the models benefit from the introduction of separators, which mark the utterances of each person. This suggests that dedicated models with architectural changes that systematically take into account the assignment of a person to an utterance could improve the quality of dialogue summarization.
我们测试了几种通用摘要模型。在人工评估方面,对话摘要的结果比新闻摘要的结果更差。这与对话结构更复杂有关——信息分散在多个话语、讨论和问题中,且存在更多拼写错误和俚语,这给摘要带来了新的挑战。另一方面,对话被划分为话语,并为每个话语分配了作者。我们在实验中证明,引入分隔符(标记每个人的话语)对模型有益。这表明,通过系统性地考虑话语与说话者的对应关系,对模型架构进行针对性改进,可以提高对话摘要的质量。
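The separator idea above can be sketched as a small preprocessing step. This is a minimal illustration, not the paper's actual pipeline: the `<EOU>` token name and the `format_dialogue` helper are assumptions chosen for the example.

```python
# Sketch: flatten a dialogue of (speaker, text) pairs into one model input,
# optionally inserting a hypothetical <EOU> token between utterances so the
# model can recover the turn structure.

def format_dialogue(utterances, use_separator=True):
    """Join (speaker, text) pairs into a single input string."""
    sep = " <EOU> " if use_separator else " "
    return sep.join(f"{speaker}: {text}" for speaker, text in utterances)

dialogue = [("lilly", "sorry, I'm going to be late"),
            ("gabriel", "no problem, should I order for you?")]

print(format_dialogue(dialogue))
# lilly: sorry, I'm going to be late <EOU> gabriel: no problem, should I order for you?
```

In a real system the separator would be registered as a special token in the model's vocabulary rather than left to the subword tokenizer.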
Table 8: Examples of dialogues (Part 1). REF – reference summary, L3 – LONGEST-3 baseline, DS – DynamicConv + GPT-2 emb. with sep., D – DynamicConv + GPT-2 emb., F – Fast Abs RL, FE – Fast Abs RL Enhanced, T – Transformer. For L3, the three longest utterances are listed. Rounded ROUGE values [R-1/R-2/R-L] are given in square brackets.
| Dialogue 1 1. lilly: sorry, I'm going to be late 2. lilly: don't wait for me, go ahead and order 3. gabriel: no problem, should I order something for you? 4. gabriel: so your food is ready when you arrive 5. lilly: good idea 6. lilly: the salmon and basil pasta here is very good REF: lilly will be late. gabriel will order salmon pasta | Dialogue 2 1. randolph: honey 2. randolph: are you still at the pharmacy? 3. maya: yes 4. randolph: buy me some earplugs 5. maya: how many? 6. randolph: 4-5 packs 7. maya: I'll buy 5 packs 8. randolph: thanks, baby |
|---|---|
| L3: 6,3,4 [38/17/38] DS: lilly and gabriel are going to order salmon and basil pasta [62/42/62] D: lilly and gabriel are going to order salmon and basil pasta [62/42/62] F: lilly will be late. she will order by herself. lilly and gabriel are going to eat salmon and basil [55/39/55] FE: lilly will be late. lilly and gabriel are going to order salmon and basil pasta [63/47/63] | REF: maya will buy 5 packs of earplugs for randolph at the pharmacy. L3: 2,4,8 [36/8/36] DS: randolph and maya are going to buy earplugs for randolph [43/19/43] D: randolph is going to buy earplugs for maya [63/24/42] F: maya is at the pharmacy. maya will buy 5 packs [48/21/48] FE: randolph is at the pharmacy. randolph will buy earplugs. maya will buy 5 packs [64/38/64] |
We show that ROUGE, the most popular summarization metric, does not reflect the quality of a summary. Looking at the ROUGE scores alone, one would conclude that the dialogue summarization models perform better than the news summarization models. In fact, this hypothesis is not true – we performed an independent, manual analysis of summaries and demonstrated that the high ROUGE results obtained for automatically generated dialogue summaries correspond with lower marks given by human annotators. An interesting example of the misleading behavior of the ROUGE metrics is presented in Table 9 for Dialogue 4, where a wrong summary – 'paul and cindy don't like red roses.' – obtained higher values on all ROUGE metrics than a correct summary – 'paul asks cindy what color flowers should buy.'. Despite lower ROUGE values, news summaries were scored higher by human evaluators. We conclude that when measuring the quality of model-generated summaries, the ROUGE metrics are more indicative for news than for dialogues, and that a new metric should be designed to measure the quality of abstractive dialogue summaries.
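To make this failure mode concrete, the following sketch implements the unigram-overlap F1 at the heart of ROUGE-1 (Lin, 2004) and applies it to an illustrative case. The reference sentence here is hypothetical, written to mimic Dialogue 4, not the actual corpus text: a factually wrong summary that reuses the reference's words can outscore a correct one that paraphrases.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1, the core of ROUGE-1 (no stemming, simple split)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference that mentions the entities the wrong summary repeats.
reference = "cindy doesn't like red roses so paul will buy her tulips"
wrong = "paul and cindy don't like red roses"
correct = "paul asks cindy what color flowers he should buy"

print(round(rouge1_f1(reference, wrong), 3))    # 0.556 - factually wrong, high word overlap
print(round(rouge1_f1(reference, correct), 3))  # 0.3   - factually right, low word overlap
```

The metric rewards surface overlap, so a summary asserting the opposite of what happened can still dominate on ROUGE as long as it recycles the reference's vocabulary.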
9 Conclusions
In our paper, we have studied the challenges of abstractive dialogue summarization. We have addressed a major factor that prevents researchers from engaging with this problem: the lack of a proper dataset. To the best of our knowledge, this is the first attempt to create a comprehensive resource of this type that can be used in future research. The next step could be creating an even more challenging dataset with longer dialogues that not only cover one topic, but span numerous different ones.
Table 9: Examples of dialogues (Part 2). REF – reference summary, L3 – LONGEST-3 baseline, DS – DynamicConv + GPT-2 emb. with sep., D – DynamicConv + GPT-2 emb., F – Fast Abs RL, FE – Fast Abs RL Enhanced, T – Transformer. For L3, the three longest utterances are listed. Rounded ROUGE values [R-1/R-2/R-L] are given in square brackets.
As shown, summarization of dialogues is much more challenging than summarization of news. Performing well on it may require not only dedicated tools, but also new, non-standard measures that capture the quality of abstractive dialogue summaries in a relevant way. We hope to tackle these issues in future work.
Acknowledgments
We would like to express our sincere thanks to Tunia Blachno, Oliwia Ebebenge, Monika Jedras and Małgorzata Krawentek for their huge contribution to the corpus collection – without their ideas, management of the linguistic task and verification of examples we would not be able to create this paper. We are also grateful for the reviewers’ helpful comments and suggestions.
References
Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Abstractive meeting summarization using dependency graph fusion. In Proceedings of the 24th International Conference on World Wide Web, pages 5–6.
Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 675–686.
Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.
Prakhar Ganesh and Saket Dingliwal. 2019. Abstractive summarization of spoken and written conversation. arXiv:1902.01615.
Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.
Chih-Wen Goo and Yun-Nung Chen. 2018. Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts. 2018 IEEE Spoken Language Technology Workshop (SLT), pages 735–742.
Karl M. Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. CoRR, abs/1506.03340.
Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. CoRR, abs/1905.08836.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, Dennis Reidsma, and P. Wellner. 2005. The AMI meeting corpus. In Proceedings of Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research, pages 137–140.
Mary L. McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282.
Yashar Mehdad, Giuseppe Carenini, and Raymond T. Ng. 2014. Abstractive summarization of spoken and written conversations based on phrasal queries. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1220–1230.
Amita Misra, Pranav Anand, Jean Fox Tree, and Marilyn Walker. 2015. Using summarization to discover argument facets in online idealogical dialog. In The North American Chapter of the Association for Computational Linguistics (NAACL).
Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Computational Natural Language Learning.
Nikola Nikolov, Michael Pfeiffer, and Richard Hahnloser. 2018. Data-driven summarization of scientific articles. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1073–1083.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations.
Haoyu Zhang, Jianjun Xu, and Ji Wang. 2019. Pretraining-based natural language generation for text summarization. CoRR, abs/1902.09243.
