Posted by Julian Eisenschlos, AI Resident, Google Research, Zürich

The task of recognizing textual entailment, also known as natural language inference, consists of determining whether a piece of text (a premise) can be implied or contradicted (or neither) by another piece of text (the hypothesis). While this problem is often considered an important test for the reasoning skills of machine learning (ML) systems and has been studied in depth for plain text inputs, much less effort has been put into applying such models to structured data, such as websites, tables, databases, etc. Yet, recognizing textual entailment is especially relevant whenever the contents of a table need to be accurately summarized and presented to a user, and is essential for high fidelity question answering systems and virtual assistants.
In "Understanding tables with intermediate pre-training", published in Findings of EMNLP 2020, we introduce the first pre-training tasks customized for table parsing, enabling models to learn better, faster and from less data. We build upon our earlier TAPAS model, which was an extension of the BERT bi-directional Transformer model with special embeddings to find answers in tables. Applying our new pre-training objectives to TAPAS yields a new state of the art on multiple datasets involving tables. On TabFact, for example, it reduces the gap between model and human performance by ~50%. We also systematically benchmark methods of selecting relevant input for higher efficiency, achieving 4x gains in speed and memory, while retaining 92% of the results. All the models for different tasks and sizes are released on GitHub repo, where you can try them out yourself in a colab Notebook.
Textual Entailment
The task of textual entailment is more challenging when applied to tabular data than plain text. Consider, for example, a table from Wikipedia with some sentences derived from its associated table content. Assessing if the content of the table entails or contradicts the sentence may require looking over multiple columns and rows, and possibly performing simple numeric computations, like averaging, summing, differencing, etc.
A table together with some statements from TabFact. The content of the table can be used to support or contradict the statements.
Following the methods used by TAPAS, we encode the content of a statement and a table together, pass them through a Transformer model, and obtain a single number with the probability that the statement is entailed or refuted by the table.
The TAPAS model architecture uses a BERT model to encode the statement and the flattened table, read row by row. Special embeddings are used to encode the table structure. The vector output of the first token is used to predict the probability of entailment.
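To make the encoding concrete, here is a minimal PyTorch sketch of the idea (not the released TAPAS code): the statement and the flattened table share one token sequence, learned row and column index embeddings are added on top of the token embeddings to encode the table structure, and the output at the first position is mapped to an entailment probability. All sizes and names below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class TinyTableEntailer(nn.Module):
    """Toy stand-in for TAPAS: token + row + column embeddings are summed,
    run through a Transformer encoder, and the first position is classified."""

    def __init__(self, vocab_size=30522, hidden=128, max_rows=64, max_cols=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.row_emb = nn.Embedding(max_rows, hidden)  # row index 0 = statement tokens
        self.col_emb = nn.Embedding(max_cols, hidden)  # column index 0 = statement tokens
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, token_ids, row_ids, col_ids):
        # The table is flattened row by row and concatenated after the statement.
        x = self.tok_emb(token_ids) + self.row_emb(row_ids) + self.col_emb(col_ids)
        h = self.encoder(x)
        # Probability that the statement is entailed by the table.
        return torch.sigmoid(self.classifier(h[:, 0]).squeeze(-1))
```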
Because the only information in the training examples is a binary value (i.e., "correct" or "incorrect"), training a model to understand whether a statement is entailed or not is challenging and highlights the difficulty in achieving generalization in deep learning, especially when the provided training signal is scarce. Seeing isolated entailed or refuted examples, a model can easily pick up on spurious patterns in the data to make a prediction, for example, the presence of the word "tie" in "Greg Norman and Billy Mayfair tie in rank", instead of truly comparing their ranks, which is what is needed to successfully apply the model beyond the original training data.
Pre-training Tasks
Pre-training tasks can be used to “warm-up” models by providing them with large amounts of readily available unlabeled data. However, pre-training typically includes primarily plain text and not tabular data. In fact, TAPAS was originally pre-trained using a simple masked language modelling objective that was not designed for tabular data applications. In order to improve the model performance on tabular data, we introduce two novel pre-training binary-classification tasks called counterfactual and synthetic, which can be applied as a second stage of pre-training (often called intermediate pre-training).
In the counterfactual task, we source sentences from Wikipedia that mention an entity (person, place or thing) that also appears in a given table. Then, 50% of the time, we modify the statement by swapping the entity for another alternative. To make sure the statement is realistic, we choose a replacement among the entities in the same column in the table. The model is trained to recognize whether the statement was modified or not. This pre-training task includes millions of such examples, and although the reasoning about them is not complex, they typically will still sound natural.
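A rough sketch of how such counterfactual examples can be generated is below. The actual pipeline also links entities between Wikipedia text and tables; here the table is simply represented as a dict mapping column names to cell values, and all names are illustrative.

```python
import random

def counterfactual_example(statement, table, entity, column):
    """With probability 0.5, swap `entity` in `statement` for another cell value
    from the same column of `table`; the binary label records whether we swapped."""
    alternatives = [v for v in table[column] if v != entity]
    if alternatives and random.random() < 0.5:
        statement = statement.replace(entity, random.choice(alternatives))
        return statement, 0  # modified -> likely refuted by the table
    return statement, 1      # unmodified -> supported by the table

# e.g.
table = {"player": ["Greg Norman", "Billy Mayfair", "Tiger Woods"], "rank": ["1", "1", "3"]}
print(counterfactual_example("Greg Norman finished in rank 1", table, "Greg Norman", "player"))
```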
For the synthetic task, we follow a method similar to semantic parsing in which we generate statements using a simple set of grammar rules that require the model to understand basic mathematical operations, such as sums and averages (e.g., "the sum of earnings"), or to understand how to filter the elements in the table using some condition (e.g.,"the country is Australia"). Although these statements are artificial, they help improve the numerical and logical reasoning skills of the model.
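The statement generation can be pictured as a tiny grammar over aggregations and conditions. The sketch below is illustrative only and much simpler than the grammar used in the paper: it draws an aggregation over a numeric column and perturbs the true value half of the time to produce refuted statements; filters like "the country is Australia" would be added as further grammar rules.

```python
import random

def synthetic_example(table):
    """Generate a (statement, label) pair from a toy grammar: label 1 means the
    statement matches the table, label 0 means the value was perturbed."""
    numeric_cols = [c for c, vals in table.items()
                    if all(isinstance(v, (int, float)) for v in vals)]
    column = random.choice(numeric_cols)  # assumes at least one numeric column
    op_name, op = random.choice([("sum", sum),
                                 ("average", lambda v: sum(v) / len(v))])
    value = op(table[column])
    label = random.randint(0, 1)
    if label == 0:  # corrupt the value -> statement is refuted by the table
        value += random.choice([-1, 1]) * random.randint(1, 5)
    return f"the {op_name} of {column} is {value}", label
```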
Example instances for the two novel pre-training tasks. Counterfactual examples swap entities mentioned in a sentence that accompanies the input table for a plausible alternative. Synthetic statements use grammar rules to create new sentences that require combining the information of the table in complex ways.
Results
We evaluate the success of the counterfactual and synthetic pre-training objectives on the TabFact dataset by comparing to the baseline TAPAS model and to two prior models that have exhibited success in the textual entailment domain, LogicalFactChecker (LFC) and Structure Aware Transformer (SAT). The baseline TAPAS model exhibits improved performance relative to LFC and SAT, but the pre-trained model (TAPAS+CS) performs significantly better, achieving a new state of the art.
We also apply TAPAS+CS to question answering tasks on the SQA dataset, which requires that the model find answers from the content of tables in a dialog setting. The inclusion of CS objectives improves the previous best performance by more than 4 points, demonstrating that this approach also generalizes performance beyond just textual entailment.
Results on TabFact (left) and SQA (right). Using the synthetic and counterfactual datasets, we achieve new state-of-the-art results in both tasks by a large margin.
Data and Compute Efficiency
Another aspect of the counterfactual and synthetic pre-training tasks is that since the models are already tuned for binary classification, they can be applied without any fine-tuning to TabFact. We explore what happens to each of the models when trained only on a subset (or even none) of the data. Without looking at a single example, the TAPAS+CS model is competitive with a strong Table-BERT baseline, and when only 10% of the data are included, the results are comparable to the previous state-of-the-art.
Dev accuracy on TabFact relative to the fraction of the training data used.
A general concern when trying to use large models such as this to operate on tables is that their high computational requirements make it difficult for them to parse very large tables. To address this, we investigate whether one can heuristically select subsets of the input to pass through the model in order to optimize its computational efficiency.
We conducted a systematic study of different approaches to filter the input and discovered that simple methods that select for word overlap between a full column and the subject statement give the best results. By dynamically selecting which tokens of the input to include, we can use fewer resources or work on larger inputs at the same cost. The challenge is doing so without losing important information and hurting accuracy.
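A simplified version of this kind of heuristic is sketched below (not the exact method from the paper): columns are scored by how many words they share with the statement, and the highest-scoring columns are kept until a token budget is reached.

```python
def select_columns(statement, table, budget=128):
    """Keep the columns whose header and cells overlap most with the statement,
    up to a rough whitespace-token budget. `table` maps column names to cells."""
    statement_tokens = set(statement.lower().split())

    def overlap(column, cells):
        column_tokens = {t for cell in [column, *cells]
                         for t in str(cell).lower().split()}
        return len(column_tokens & statement_tokens)

    ranked = sorted(table.items(), key=lambda kv: overlap(*kv), reverse=True)
    selected, used = {}, 0
    for column, cells in ranked:
        cost = sum(len(str(cell).split()) for cell in [column, *cells])
        if selected and used + cost > budget:
            break
        selected[column] = cells
        used += cost
    return selected
```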
For instance, the models discussed above all use sequences of 512 tokens, which is around the normal limit for a transformer model (although recent efficiency methods like the Reformer or Performer are proving effective in scaling the input size). The column selection methods we propose here can allow for faster training while still achieving high accuracy on TabFact. For 256 input tokens we get a very small drop in accuracy, but the model can now be pre-trained, fine-tuned and make predictions up to two times faster. With 128 tokens the model still outperforms the previous state-of-the-art model, with an even more significant speed-up — 4x faster across the board.
Accuracy on TabFact using different sequence lengths, by shortening the input with our column selection method.
Using both the column selection method we proposed and the novel pre-training tasks, we can create table parsing models that need fewer data and less compute power to obtain better results.
We have made available the new models and pre-training techniques at our GitHub repo, where you can try it out yourself in colab. In order to make this approach more accessible, we also shared models of varying sizes all the way down to “tiny”. It is our hope that these results will help spur development of table reasoning among the broader research community.
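As one way to experiment, the sketch below uses the Hugging Face transformers port of TAPAS rather than the original repo; the checkpoint name and the ordering of the output labels are assumptions that should be checked against the released configs.

```python
import pandas as pd
from transformers import TapasTokenizer, TapasForSequenceClassification

# Assumed checkpoint name; see the GitHub repo / colab for the official releases.
name = "google/tapas-base-finetuned-tabfact"
tokenizer = TapasTokenizer.from_pretrained(name)
model = TapasForSequenceClassification.from_pretrained(name)

table = pd.DataFrame({"player": ["Greg Norman", "Billy Mayfair"],
                      "rank": ["1", "1"]})  # TAPAS expects string-valued cells
statements = ["Greg Norman and Billy Mayfair tie in rank"]

inputs = tokenizer(table=table, queries=statements,
                   padding="max_length", return_tensors="pt")
probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # check config.id2label for which index means "entailed"
```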
Acknowledgements
This work was carried out by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller from our Language Team in Zürich. We would like to thank Jordan Boyd-Graber, Yasemin Altun, Emily Pitler, Benjamin Boerschinger, Srini Narayanan, Slav Petrov, William Cohen and Jonathan Herzig for their useful comments and suggestions.
Friday, January 29, 2021