SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
Abstract
This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news – in contrast with human evaluators' judgement. This suggests that the challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.
1 Introduction and related work
The goal of the summarization task is condensing a piece of text into a shorter version that covers the main points succinctly. In the abstractive approach, important pieces of information are presented using words and phrases not necessarily appearing in the source text. This requires natural language generation techniques with a high level of semantic understanding (Chopra et al., 2016; Rush et al., 2015; Khandelwal et al., 2019; Zhang et al., 2019; See et al., 2017; Chen and Bansal, 2018; Gehrmann et al., 2018).
Major research efforts have so far focused on summarization of single-speaker documents like news (e.g., Nallapati et al. (2016)) or scientific publications (e.g., Nikolov et al. (2018)). One of the reasons is the availability of large, high-quality news datasets with annotated summaries, e.g., CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016). Such a comprehensive dataset for dialogues is lacking.
The challenges posed by the abstractive dialogue summarization task have been discussed in the literature with regard to the AMI meeting corpus (McCowan et al., 2005), e.g. by Banerjee et al. (2015), Mehdad et al. (2014), and Goo and Chen (2018). Since the corpus contains only a small number of summaries (for 141 dialogues), Goo and Chen (2018) proposed to use assigned topic descriptions as gold references. These are short, label-like goals of the meeting, e.g., costing evaluation of project process; components, materials and energy sources; chitchat. Such descriptions, however, are very general, lacking the messenger-like structure and any information about the speakers.
To benefit from large news corpora, Ganesh and Dingliwal (2019) built a dialogue summarization model that first converts a conversation into a structured text document and then applies an attention-based pointer network to create an abstractive summary. Their model, trained on structured text documents from the CNN/Daily Mail dataset, was evaluated on the Argumentative Dialogue Summary Corpus (Misra et al., 2015), which, however, contains only 45 dialogues.
In the present paper, we further investigate the problem of abstractive dialogue summarization. With the growing popularity of online conversations via applications like Messenger, WhatsApp and WeChat, summarization of chats between a few participants is an interesting new direction of summarization research. For this purpose we have created the SAMSum Corpus, which contains over 16k chat dialogues with manually annotated summaries. The dataset is freely available for the research community.
Table 1: Dataset sizes
| Dataset | Train | Validation | Test |
|---|---|---|---|
| CNN/DM | 287227 | 13368 | 11490 |
| SAMSum | 14732 | 818 | 819 |
The paper is structured as follows: in Section 2 we present details about the new corpus and describe how it was created, validated and cleaned. A brief description of the baselines used in the summarization task can be found in Section 3. In Section 4, we describe our experimental setup and model parameters. The evaluations of summarization models, the automatic one with the ROUGE metric and the linguistic one, are reported in Section 5 and Section 6, respectively. Examples of models' outputs and some errors they make are described in Section 7. Finally, discussion, conclusions and ideas for further research are presented in Sections 8 and 9.
2 SAMSum Corpus
Initial approach. Since no corpus of messenger conversations was available, we considered two approaches to build one: (1) using existing datasets of documents that have a form similar to chat conversations, (2) creating such a dataset with linguists.
In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcription of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations that are typically written in messenger apps, e.g. they were too technical (IRC data), too long (comments data, transcription of meetings), lacked context (movie dialogues) or they were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol.
As a consequence, we decided to create a chat dialogue dataset by constructing conversations that epitomize the style of a messenger app.
Process of building the dataset. Our dialogue summarization dataset contains natural messenger-like conversations created and written down by linguists fluent in English. The style and register of the conversations are diversified: dialogues may be informal, semi-formal or formal, and they may contain slang phrases, emoticons and typos. We asked the linguists to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. This includes chit-chat, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Therefore, the dataset does not contain any sensitive data or fragments of other corpora.
Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that the summaries should (1) be rather short, (2) extract important pieces of information, (3) include the names of the interlocutors, and (4) be written in the third person. Each dialogue contains only one reference summary.
Validation. Since the SAMSum corpus contains dialogues created by linguists, the question arises whether such conversations are really similar to those typically written via messenger apps. To find the answer, we performed a validation task. We asked two linguists to doubly annotate 50 conversations in order to verify whether the dialogues could appear in a messenger app and could be summarized (i.e. a dialogue is not too general or unintelligible) or not (e.g. a dialogue between two people in a shop). The results revealed that 94% of the examined dialogues were classified by both annotators as good, i.e. they do look like conversations from a messenger app and could be condensed in a reasonable way. In a similar validation task, conducted for the existing dialogue-type datasets (described in the Initial approach paragraph), the annotators agreed that only 28% of the dialogues resembled conversations from a messenger app.
Cleaning data. After preparing the dataset, we cleaned it in a semi-automatic way. Beforehand, we specified a format for written dialogues with summaries: a colon should separate the author of an utterance from its content, and each utterance is expected to be on a separate line. We could therefore easily find all deviations from the agreed structure; some of them could be fixed automatically (e.g. when a semicolon was used instead of a colon right after the interlocutor's name at the beginning of an utterance), while others were passed to linguists for verification. We also tried to correct typos in interlocutors' names (when one person has several utterances, a typo sometimes appears in his/her name before one of them): we used the Levenshtein distance to find very similar names within a single conversation (possibly with typos, e.g. 'George' and 'Goerge'), and the cases with very similar names were passed to linguists for verification.
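For illustration, a minimal Python sketch of this name check; the function names and the distance threshold are ours, chosen for the example, and a production pipeline would likely use a library implementation of the distance:

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suspicious_name_pairs(speaker_names, max_distance=2):
    """Flag very similar speaker names within one conversation
    (e.g. 'George' vs. 'Goerge') for manual verification."""
    unique = sorted(set(speaker_names))
    return [(a, b) for a, b in combinations(unique, 2)
            if levenshtein(a, b) <= max_distance]

print(suspicious_name_pairs(["George", "Goerge", "Blair"]))
# [('George', 'Goerge')]
```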
Description. The created dataset consists of 16369 conversations distributed uniformly into 4 groups based on the number of utterances per conversation: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations are dialogues between two interlocutors (about 75% of all conversations); the rest are between three or more people. Table 1 presents the size of the dataset split used in our experiments. An example of a dialogue from the corpus is shown in Table 2.
Table 2: Example of a dialogue from the collected corpus
Dialogue:
Blair: Remember we are seeing the wedding planner after work
Chuck: Sure, where are we meeting her?
Blair: At Nonna Rita's
Chuck: Can I order their seafood tagliatelle? Or is it only coffee with her? I've been dreaming about it since we went there last month
Blair: Haha sure why not
Chuck: We both remember the tomato pasta disaster from the last meeting with Diane
Blair: Omg hahaha it went all over her white shirt
Chuck: :D
Blair: :P

Summary:
Blair and Chuck are going to meet the wedding planner after work at Nonna Rita's.
3 Dialogue baselines
The baseline commonly used in the news summarization task is Lead-3 (See et al., 2017), which takes the three leading sentences of the document as the summary. The underlying assumption is that the beginning of an article contains its most significant information. Inspired by the Lead-n model, we propose a few different simple baselines:

• MIDDLE-n, which takes n utterances from the middle of the dialogue;
• LONGEST-n, which treats the n longest utterances of the dialogue as the summary;
• LONGER-THAN-n, which takes only the utterances longer than n characters;
• MOST-ACTIVE-PERSON, which treats all utterances of the most active person in the dialogue as the summary.
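For illustration, minimal Python sketches of two of these baselines, assuming each dialogue is given as a list of utterance strings (one string per turn); the function names are ours:

```python
def longest_n(utterances, n=3):
    """LONGEST-n: treat the n longest utterances as the summary."""
    return " ".join(sorted(utterances, key=len, reverse=True)[:n])

def middle_n(utterances, n=3):
    """MIDDLE-n: take n consecutive utterances from the middle of the dialogue."""
    start = max(0, (len(utterances) - n) // 2)
    return " ".join(utterances[start:start + n])
```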
Table 3: Baselines for dialogue summarization
| Model | n | R-1 | R-2 | R-L |
|---|---|---|---|---|
| LEAD | 3 | 31.40 | 8.68 | 29.42 |
| LEAD | 4 | 31.87 | 8.93 | 29.91 |
| LEAD | 5 | 32.02 | 9.53 | 30.07 |
| MIDDLE | 3 | 28.04 | 6.57 | 26.13 |
| MIDDLE | 4 | 30.08 | 7.96 | 28.10 |
| MIDDLE | 5 | 29.91 | 8.12 | 27.97 |
| LONGEST | 3 | 32.46 | 10.27 | 29.92 |
| LONGEST | 4 | 32.19 | 10.35 | 29.91 |
| LONGEST | 5 | 31.61 | 10.21 | 29.55 |
| LONGER-THAN | 10 | 28.31 | 9.69 | 26.72 |
| LONGER-THAN | 20 | 29.36 | 10.23 | 27.59 |
| LONGER-THAN | 30 | 29.61 | 10.28 | 27.71 |
| MOST-ACTIVE-PERSON | n/a | 26.54 | 8.55 | 24.57 |
Results of the evaluation of the above models are reported in Table 3. There is no obvious baseline for the task of dialogue summarization. We expected rather low results for Lead-3, as the beginnings of conversations usually contain greetings rather than the main part of the discourse. However, in our dataset greetings are frequently combined with question-asking or information passing (sometimes they are even omitted), and this baseline works even better than the MIDDLE baseline (which takes utterances from the middle of a dialogue). Nevertheless, the best dialogue baseline turns out to be the LONGEST-3 model.
4 Experimental setup
This section describes the settings used in the experiments we carried out.
4.1 Data preparation
In order to build a dialogue summarization model, we adopt the following strategies: (1) each candidate architecture is trained and evaluated on the dialogue dataset; (2) each architecture is trained on the CNN/Daily Mail train set combined with the dialogue train set, and evaluated on the dialogue test set.
In addition, we prepare a version of the dialogue data in which utterances are separated with a special token called the separator (an artificially added token marking the boundary between utterances).
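A minimal Python sketch of this preprocessing step; the `<EOU>` token string is an illustrative choice, not necessarily the exact token used in our experiments:

```python
SEP = "<EOU>"  # illustrative separator; any string absent from the data works

def join_with_separator(utterances):
    """Flatten a dialogue into one input sequence, marking utterance boundaries."""
    return f" {SEP} ".join(utterances)

dialogue = [
    "Blair: Remember we are seeing the wedding planner after work",
    "Chuck: Sure, where are we meeting her?",
]
print(join_with_separator(dialogue))
# Blair: Remember we are seeing the wedding planner after work <EOU> Chuck: Sure, where are we meeting her?
```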
4.2 Models
We carry out experiments with the following summarization models (for all architectures, we set the beam size for beam search decoding to 5):
• Pointer Generator network (See et al., 2017). In the case of the Pointer Generator, we use the default configuration, changing only the minimum length of the generated summary from 35 (used for news) to 15 (used for dialogues).
• Transformer (Vaswani et al., 2017). The model is trained using the OpenNMT library. We use the same training parameters for both news and dialogues, changing only the minimum length of the generated summary: 35 for news and 15 for dialogues.
• Fast Abs RL (Chen and Bansal, 2018). It is trained using its default parameters. For dialogues, we change the convolutional word-level sentence encoder (used in the extractor part) to use only kernels of size 3 instead of the 3-5 range, because some utterances are very short and the default setting cannot handle them.
• Fast Abs RL Enhanced. An additional variant of the Fast Abs RL model with slightly modified utterances: at the end of each utterance, after an artificial separator, we add the names of all other interlocutors (see the sketch after this list). The reason is that Fast Abs RL requires the text to be split into sentences (it selects sentences and then paraphrases each of them). For dialogues, we divide the text into utterances (the natural unit in conversations), so a single utterance may contain more than one sentence. Given how the model works, it may select utterances of a single person (each utterance starts with the name of its author) and thus have no information about the other interlocutors (if their names do not appear in the selected utterances), leaving it no chance to use the right people's names in the generated summary.
• LightConv and DynamicConv (Wu et al., 2019). The implementation is available in fairseq (Ott et al., 2019). We train the lightweight convolution models in two manners: (1) learning token representations from scratch; in this case we apply BPE tokenization with a vocabulary of 30K types, using the fastBPE implementation (Sennrich et al., 2015); (2) initializing token embeddings with pre-trained language model representations; as the language model we choose GPT-2 small (Radford et al., 2019).
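The utterance enhancement used by Fast Abs RL Enhanced (described in the list above) can be sketched as follows; the `<SEP>` separator string and the function name are illustrative:

```python
def enhance_utterances(dialogue):
    """Append the names of all other interlocutors after each utterance,
    so the sentence selector always sees every speaker's name.

    `dialogue` is a list of (speaker, text) pairs.
    """
    speakers = [s for s, _ in dialogue]
    enhanced = []
    for speaker, text in dialogue:
        others = sorted(set(speakers) - {speaker})
        enhanced.append(f"{speaker}: {text} <SEP> {' '.join(others)}")
    return enhanced

print(enhance_utterances([("Blair", "Remember the wedding planner?"),
                          ("Chuck", "Sure, where are we meeting her?")]))
# ['Blair: Remember the wedding planner? <SEP> Chuck',
#  'Chuck: Sure, where are we meeting her? <SEP> Blair']
```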
4.3 Evaluation metrics
We evaluate models with the standard ROUGE metric (Lin, 2004), reporting the $F_{1}$ scores (with stemming) for ROUGE-1, ROUGE-2 and ROUGE-L, following previous work (Chen and Bansal, 2018; See et al., 2017). We obtain scores using the py-rouge package.
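For illustration, ROUGE-1 $F_{1}$ reduces to clipped unigram overlap between candidate and reference; a minimal, self-contained sketch (the reported scores come from py-rouge, which additionally applies stemming and computes ROUGE-2 and ROUGE-L):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    with overlap counts clipped to the minimum frequency on each side."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("blair and chuck meet the planner",
                "blair and chuck are going to meet the wedding planner"))
# 0.75  (precision 1.0, recall 0.6)
```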
5 Results
The results for the news summarization task are shown in Table 4 and for the dialogue summarization task in Table 5. In both domains, the best models' ROUGE-1 exceeds 39, ROUGE-2 exceeds 17, and ROUGE-L exceeds 36. Note that the strong baseline for news (Lead-3) is outperformed in all three metrics by only one model. In the case of dialogues, all tested models perform better than the baseline (LONGEST-3).
In general, the Transformer-based architectures benefit from training on the joint dataset: news+dialogues, even though the news and the dialogue documents have very different structures. Interestingly, this does not seem to be the case for the Pointer Generator or Fast Abs RL model.
The inclusion of a separation token between dialogue utterances is advantageous for most models – presumably because it improves the discourse structure. The improvement is most visible when training is performed on the joint dataset.
Having compared the two variants of the Fast Abs RL model – with original utterances and with enhanced ones (see Section 4.2) – we conclude that enhancing utterances with information about the other interlocutors helps achieve higher ROUGE values.
The largest improvement in model performance is observed for the LightConv and DynamicConv models when they are complemented with pretrained embeddings from the GPT-2 language model, trained on enormous corpora.
It is also worth noting that some models (Pointer Generator, Fast Abs RL) trained only on the dialogue corpus (16k dialogues) reach a similar or better level in terms of ROUGE metrics than models trained on the CNN/DM news dataset (more than 300k articles). Adding pretrained embeddings and training on the joint dataset helps achieve significantly higher ROUGE values for dialogues than the best models achieve on the CNN/DM news dataset.
According to the ROUGE metrics, the best performing model is DynamicConv with GPT-2 embeddings, trained on the joint news and dialogue data with an utterance separation token.
6 Linguistic verification of summaries
ROUGE is a standard way of evaluating the quality of machine-generated summaries by comparing them with reference ones. This metric, based on n-gram overlap, may however not be very informative for abstractive summarization, where paraphrasing is a key point in producing high-quality sentences. To quantify this conjecture, we manually evaluated summaries generated by the models for 100 news articles and 150 dialogues (the counts per domain are consistent with Table 6). We asked two linguists to mark the quality of every summary on the scale of $-1$, 0, 1, where $-1$ means that a summary is poor, extracts irrelevant information or does not make sense at all; 1 means that it is understandable and gives a brief overview of the text; and 0 stands for a summary that extracts only part of the relevant information or makes some mistakes.
Table 4: Model evaluation on the news corpus test set
| Model | R-1 | R-2 | R-L |
|---|---|---|---|
| Lead-3 baseline | 40.24 | 17.44 | 34.90 |
| Pointer Generator | 38.72 | 16.67 | 35.59 |
| Fast Abs RL | 40.99 | 17.72 | 38.30 |
| Transformer | 38.72 | 16.89 | 35.74 |
| LightConv | 39.44 | 17.20 | 36.20 |
| DynamicConv | 39.46 | 17.33 | 36.29 |
| LightConv + GPT-2 emb. | 39.52 | 17.31 | 36.15 |
| DynamicConv + GPT-2 emb. | 39.94 | 17.56 | 36.51 |
We noticed a few annotations (7 for news and 4 for dialogues) with opposite marks (i.e. one annotator's judgement was $-1$, whereas the other's was 1) and had them annotated once again by another annotator, who resolved the conflicts. For the rest, we calculated the linear weighted Cohen's kappa coefficient (McHugh, 2012) between the annotators' scores. For news examples, we obtained agreement at the level of 0.371, and for dialogues, 0.506. The annotators' agreement is higher on dialogues than on news, probably because of the structure of the data: articles are often long and it is difficult to decide what the key point of the text is; dialogues, on the contrary, are rather short and focused mainly on one topic.
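A linearly weighted kappa penalizes a disagreement between opposite marks ($-1$ vs. 1) twice as much as one between adjacent marks; a minimal sketch with scikit-learn, on illustrative ratings:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative ratings on the -1/0/1 scale described above,
# one entry per summary, one list per annotator.
annotator_a = [1, 0, -1, 1, 0, 1, -1, 0]
annotator_b = [1, 0, -1, 0, 0, 1, 1, -1]

# weights="linear" makes the -1 vs. 1 disagreement count twice
# as heavily as a 0 vs. 1 disagreement.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="linear")
print(f"linear weighted kappa: {kappa:.3f}")
```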
For the manually evaluated samples, we calculated ROUGE metrics and the mean of the two human ratings; the resulting statistics are presented in Table 6. As we can see, models generating dialogue summaries can obtain high ROUGE results, yet their outputs are marked as poor by human annotators. Our conclusion is that the ROUGE metric corresponds with the quality of generated summaries much better for news than for dialogues, which is confirmed by Pearson's correlation between human evaluation and the ROUGE metric, shown in Table 7.
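Such a correlation can be computed per example between the mean human rating and a ROUGE score; a minimal sketch with SciPy, on illustrative values:

```python
from scipy.stats import pearsonr

# Illustrative per-example values: mean human rating and ROUGE-L F1.
human_scores = [0.5, -1.0, 1.0, 0.0, -0.5, 1.0]
rouge_l = [38.2, 41.5, 45.0, 33.1, 40.8, 47.9]

r, p_value = pearsonr(human_scores, rouge_l)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```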
Table 5: Model evaluation on the dialogues corpus test set
Model | Training data | Separator | R-1 | R-2 | R-L |
---|---|---|---|---|---|
LONGEST-3 baseline | n/a | n/a | 32.46 | 10.27 | 29.92 |
Pointer Generator | dialogues | no | 38.55 | 14.14 | 34.85 |
Pointer Generator | dialogues | yes | 40.08 | 15.28 | 36.63 |
Fast Abs RL | dialogues | no | 40.96 | 17.18 | 39.05 |
Fast Abs RL Enhanced | dialogues | no | 41.95 | 18.06 | 39.23 |
Transformer | dialogues | no | 36.62 | 11.18 | 33.06 |
Transformer | dialogues | yes | 37.27 | 10.76 | 32.73 |
LightConv | dialogues | no | 33.19 | 11.14 | 30.34 |
DynamicConv | dialogues | no | 33.79 | 11.19 | 30.41 |
DynamicConv | dialogues | yes | 33.69 | 10.88 | 30.93 |
LightConv + GPT-2 emb. | dialogues | no | 41.81 | 16.34 | 37.63 |
DynamicConv + GPT-2 emb. | dialogues | no | 41.79 | 16.44 | 37.54 |
DynamicConv + GPT-2 emb. | dialogues | yes | 41.54 | 16.29 | 37.07 |
Pointer Generator | news + dialogues | no | 35.04 | 13.25 | 32.42 |
Pointer Generator | news + dialogues | yes | 37.27 | 14.42 | 34.36 |
Fast Abs RL | news + dialogues | no | 41.03 | 16.93 | 39.05 |
Fast Abs RL Enhanced | news + dialogues | no | 41.87 | 17.47 | 39.53 |
Transformer | news + dialogues | no | 41.91 | 18.25 | 38.77 |
Transformer | news + dialogues | yes | 42.37 | 18.44 | 39.27 |
LightConv | news + dialogues | no | 40.29 | 17.28 | 36.81 |
DynamicConv | news + dialogues | no | 40.66 | 17.41 | 37.20 |
DynamicConv | news + dialogues | yes | 41.07 | 17.11 | 37.27 |
LightConv + GPT-2 emb. | news + dialogues | no | 44.47 | 19.75 | 40.07 |
DynamicConv + GPT-2 emb. | news + dialogues | no | 44.69 | 20.28 | 40.76 |
DynamicConv + GPT-2 emb. | news + dialogues | yes | 45.41 | 20.65 | 41.45 |
Table 6: Statistics of human evaluation of summaries’ quality and ROUGE evaluation of those summaries
| | Model | #examples | mean | median | R-1 | R-2 | R-L |
|---|---|---|---|---|---|---|---|
| NEWS | overall | 100 | 0.18 | 0.5 | 39.76 | 16.55 | 36.23 |
| | Fast Abs RL | 50 | 0.33 | 0.5 | 42.33 | 18.28 | 38.82 |
| | DynamicConv + GPT-2 emb. | 50 | 0.03 | 0.25 | 37.19 | 14.81 | 33.64 |
| DIALOGUES | overall | 150 | -0.503 | -0.5 | 43.53 | 19.94 | 40.66 |
| | Fast Abs RL | 50 | -0.55 | -0.75 | 42.16 | 19.28 | 40.37 |
| | Fast Abs RL Enhanced | 50 | -0.63 | -1.0 | 39.79 | 16.59 | 37.05 |
| | DynamicConv + GPT-2 emb. | 50 | -0.33 | -0.5 | 48.63 | 23.95 | 44.56 |
7 Difficulties in dialogue summarization
In a structured text, such as a news article, the information flow is very clear. However, in a dialogue, which contains discussions (e.g. when people try to agree on a date of a meeting), questions (one person asks about something and the answer may appear a few utterances later) and greetings, the most important pieces of information are scattered across the utterances of different speakers.