[Paper Translation] 12-in-1: Multi-Task Vision and Language Representation Learning


Original paper: https://arxiv.org/pdf/1912.02315v2


12-in-1: Multi-Task Vision and Language Representation Learning

Abstract

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model trained on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

1. Introduction

A compelling reason to study language and vision jointly is the promise of language as a universal and natural interface for visual reasoning problems – useful both in specifying a wide range of problems and in communicating AI responses. However, the current research landscape for visually-grounded language understanding is a patchwork of many specialized tasks like question answering or caption generation, each supported by a handful of datasets. As such, progress in this field has been measured by the independent improvement of bespoke models designed and trained for each of these specific tasks and datasets.


The recent rise of general architectures for vision-and-language [1, 23, 24, 27, 43, 45, 54] reduces the architectural differences across tasks. These models pretrain common architectures on self-supervised tasks to learn general visio-linguistic representations and then fine-tune for specific datasets; however, the result is still a menagerie of independent task-specific models rather than a single unified model. This is dissatisfying in practice – the model that understands questions cannot ground noun phrases, the grounding model cannot retrieve images based on a description, and so forth. Further, this approach does not scale well, as each new task requires storing a new model.


Figure 1: We introduce an approach for effective multi-task learning, training a single model on 12 popular vision-and-language datasets. This single model performs at par or even better than independent task-specific state-of-the-art approaches for many tasks.


Beyond being intellectually dissatisfying, this task-based fracturing leaves quite a lot on the table. While individual tasks present different challenges and diverse interfaces, the underlying associations between language and visual concepts are often common across tasks. For example, learning to ground the referring expression “small red vase” requires understanding the same concepts as answering the question “What color is the small vase?”. Training multiple tasks jointly can potentially pool these different sources of grounding supervision. Further, developing models that can perform well on a wide range of tasks simultaneously can help guard against the research community overfitting to specific datasets and metrics.

In this work, we develop a multi-task model for discriminative vision-and-language tasks based on the recently proposed ViLBERT [27] model. We consider four categories of tasks – training jointly on a total of 12 different datasets. Our results not only show that a single model can perform all these tasks, but also that joint training can improve the performance compared to single-task training with the same architecture. Before undertaking this effort, it was not obvious to us that this would be the case – multi-task training is notoriously challenging and vision-and-language datasets vary greatly in size, interface, and difficulty. Our model attains improvements of 0.25 to 4.19 absolute points from multi-task training – improving over corresponding single-task models for 11 out of 12 tasks. Further, we demonstrate that multi-task training is an effective pretraining step for single-task models – leading to further gains and setting a new state-of-the-art for 7 out of 12 tasks.

Large-scale multi-task learning is challenging as datasets can vary in size and difficulty. To address these issues, we introduce a dynamic stop-and-go training scheduler, task-dependent input tokens, and simple hyper-parameter heuristics. Using our proposed pipeline, we were able to train many multi-task models with varying datasets – assessing the relationships between different vision-and-language tasks in terms of their performance when trained together.

To summarize, we make the following contributions:


  • We systematically analyze the joint training relationships between different vision-and-language datasets and tasks and present a Clean V&L Multi-Task setup, which ensures no train-test leaks across tasks.
  • We develop a single multi-task model trained on 12 popular V&L datasets. Compared to a set of independent models, this represents a reduction from ~3 billion parameters to ~270 million while simultaneously improving average performance by 2.05 points.
  • We demonstrate that multi-task training is useful even in cases where single-task performance is paramount. Fine-tuning from our multi-task model for single tasks resulted in an average improvement of 2.98 points over baseline single-task trained models.

2. Vision-and-Language Tasks

2.1. Task-Groups and Datasets

We consider 12 popular vision and language datasets. These datasets cover a wide range of tasks and require diverse grounding granularity and reasoning skills. We group related datasets into four groups to facilitate our analysis:


Vocab-based VQA. Given an image and a natural-language question, select an answer from a fixed vocabulary. We consider three popular datasets for this group – VQAv2 [15], GQA [17], and Visual Genome (VG) QA [21].


Image Retrieval. Given a caption and a pool of images, retrieve the target image that is best-described by the caption. We consider COCO [7] and Flickr30K [35] captioning datasets for this task-group.


Referring Expressions. Given a natural language expression and an image, identify the target region that is referred to by the expression. The expression can vary greatly across datasets, from simple noun phrases to multi-round dialogs.

Table 1: Percentage of row-task test images that are present in column-task train/val images.

| Row task (test images) | [A] | [B] | [C] | [D] | [E] | [F] | [G] | [H] | [I] | [J] | [K] | [L] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [A] VQA2.0 [15] | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| [B] VG QA [21] | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| [C] GQA [17] | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| [D] COCO [7] | 100% | 43% | 33% | 0% | 0% | 0% | 0% | 0% | 7% | 46% | 0% | 0% |
| [E] Flickr30k [35] | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 98% | 0% |
| [F] RefCOCO [19] | 100% | 36% | 27% | 100% | 0% | 0% | 0% | 99% | 8% | 62% | 0% | 0% |
| [G] RefCOCO+ [19] | 100% | 38% | 27% | 100% | 0% | 0% | 0% | 99% | 8% | 62% | 0% | 0% |
| [H] RefCOCOg [30] | 100% | 41% | 31% | 100% | 0% | 53% | 53% | 0% | 8% | 63% | 0% | 0% |
| [I] Visual7W [55] | 50% | 100% | 79% | 48% | 0% | 8% | 8% | 10% | 0% | 24% | 0% | 0% |
| [J] GuessWhat [13] | 100% | 40% | 31% | 96% | 0% | 20% | 20% | 26% | 7% | 0% | 0% | 0% |
| [K] SNLI-VE [49] | 0% | 0% | 0% | 0% | 94% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| [L] NLVR2 [44] | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |

We consider phrase grounding in RefCOCO(+/g) [19, 30], pointing questions in Visual7W [55], and dialog sequences in GuessWhat [13]. We note that these language inputs vary significantly in terms of detail and structure.

Multi-modal Verification. Given one or more images and a natural language statement, judge the correctness or predict their semantic relationship. We consider $\mathrm{NLVR^{2}}$ [44] and SNLI-VE [49]. In $\mathrm{NLVR^{2}}$, two images are given and the statement must hold for both images to be judged true. In SNLI-VE, image-statement pairs are classified as representing an entailment, contradiction, or neutral relationship; that is, whether the content of the image confirms, refutes, or is insufficient to comment on the truth of the corresponding statement.

2.2. A Clean V&L Multi-Task Setup

Many V&L tasks are built on top of each other and share significant overlap in terms of individual images.


However, as each task is often examined in isolation, there does not exist an in-depth analysis of this overlap across different V&L tasks. Table 1 shows the percentage of test images for the target tasks which are present in other tasks’ train/val sets. As we can see, there exists significant overlap across tasks. Even though different tasks require different inputs and outputs, other task annotations will provide clues about the visual grounding – for example, a referring expression for a “blue striped ball” at training could unfairly improve a VQA model’s ability to answer “What color is the striped ball?” for the same image at test time. To avoid information leakage from the annotations of other tasks, we propose a cleaned multi-task split for V&L tasks where test images are removed from train/val for all the tasks. We stress that the test sets are not modified in any way, so our results are comparable to prior work. Cleaning results in about an 11% reduction in training data on average across datasets. Full details of this process and statistics regarding cleaned dataset size are available in the supplement.
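The cleaning step and the overlap statistic of Table 1 can be sketched with plain set operations. This is a minimal illustration, not the authors' pipeline: the function names `clean_splits` and `overlap_pct` are hypothetical, and it assumes every image carries an identifier that is consistent across datasets.

```python
from typing import Dict, Set


def clean_splits(train_val: Dict[str, Set[str]],
                 test: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Remove from every task's train/val pool any image found in ANY task's test set."""
    all_test: Set[str] = set().union(*test.values())
    return {task: images - all_test for task, images in train_val.items()}


def overlap_pct(row_test: Set[str], col_train_val: Set[str]) -> float:
    """Percentage of row-task test images present in a column-task train/val set."""
    if not row_test:
        return 0.0
    return 100.0 * len(row_test & col_train_val) / len(row_test)


# Toy data (hypothetical image ids): image "c" is a RefCOCO test image
# that also sits in the VQA train/val pool.
train_val = {"vqa": {"a", "b", "c"}, "refcoco": {"b", "d"}}
test = {"vqa": {"d"}, "refcoco": {"c"}}
cleaned = clean_splits(train_val, test)
```

Note that the test sets are only read, never changed, which matches the statement that results stay comparable to prior work.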

3. Approach

3.1. Base Architecture

There has been a flurry of recent work developing general vision-and-language model architectures that are amenable to large-scale self-supervised pretraining [1, 23, 24, 27, 43, 45, 54]. By pretraining general representations and then finetuning on single downstream tasks, these models set the state-of-the-art in many tasks. For the base architecture in our experiments, we take the ViLBERT model proposed by Lu et al. [27]. We describe it here briefly.

At the interface level, ViLBERT takes as input an image $I$ and text segment $Q$ represented as the sequence $\{\mathrm{IMG}, v_{1},\ldots,v_{T}, \mathrm{CLS}, w_{1},\ldots,w_{T}, \mathrm{SEP}\}$ where $\{v_{i}\}_{i=1}^{T}$ are image region features [2], $\{w_{j}\}_{j=1}^{T}$ are word tokens, and the IMG, CLS, and SEP tokens are special markers. The model then outputs embeddings for each input: $\{h_{v_{i}}\}_{i=1}^{T}$, $\{h_{w_{j}}\}_{j=1}^{T}$, $h_{\mathrm{IMG}}$, $h_{\mathrm{CLS}}$, and $h_{\mathrm{SEP}}$. As in [27], we take $h_{\mathrm{IMG}}$ and $h_{\mathrm{CLS}}$ as holistic image and text representations.

Internally, ViLBERT consists of two parallel BERT-style [14] models operating over image regions and text segments. Each stream is a series of transformer blocks (TRM) [48] connected by co-attentional transformer layers (CoTRM) which enable information exchange between modalities. We use the default parameter setting, which has 6 / 12 layers of TRM for visual / linguistic streams respectively.

Like many of the models of this class, ViLBERT is pretrained on the Conceptual Caption dataset [39] with two ‘proxy’ tasks: masked multi-modal modelling and multi-modal alignment prediction. The first randomly masks approximately 15% of both words and image tokens and reconstructs them given the remaining inputs. The latter tasks the model with predicting whether an image and caption correspond or not. After pretraining, the model can be finetuned for strong performance on various downstream tasks.

We make two important modifications to this pretraining process. First, when masking visual regions we also mask other regions with significant overlap (> 0.4 IoU) to avoid leaking visual information. This forces the model to rely more heavily on language to predict image content. Second, we do not enforce the masked multi-modal modelling loss when sampling a negative (unmatching) caption for multi-modal alignment prediction. This effectively removes the noise introduced by negative samples. While orthogonal to our primary contribution of multi-task learning, we found these modifications to make the baseline model more effective. For further discussion, see the supplemental material. All models we present are first pretrained in this manner.
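The first modification can be illustrated with a small sketch: given the regions already selected for masking, every other region whose IoU with one of them exceeds 0.4 is masked as well. The helper names `iou` and `expand_mask` and the `(x1, y1, x2, y2)` box format are assumptions made for this example; the source does not specify them.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def expand_mask(boxes: List[Box], masked: List[bool], threshold: float = 0.4) -> List[bool]:
    """Also mask regions that overlap an originally masked region by more than `threshold` IoU."""
    seeds = [i for i, is_masked in enumerate(masked) if is_masked]
    return [
        is_masked or any(iou(boxes[i], boxes[s]) > threshold for s in seeds)
        for i, is_masked in enumerate(masked)
    ]


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
new_mask = expand_mask(boxes, [True, False, False])
```

Here the second box overlaps the first with IoU 0.81, so it is masked too, while the distant third box is left visible. The expansion is a single pass from the originally sampled regions; whether the original implementation propagates further is not stated in the text.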

3.2. Multi-Task Learning

We consider a simple multi-task model where each task has a task-specific ‘head’ network that branches off a common, shared ‘trunk’ ViLBERT model. As such, we learn shared trunk parameters $\theta_{s}$ and a set of task-specific layers $\{\theta_{t}\}_{t=1}^{\mathcal{T}}$ for $\mathcal{T}$ tasks. Our goal is to learn parameters $\theta_{s}\cup\{\theta_{t}\}_{t=1}^{\mathcal{T}}$ that minimize loss across all tasks. Details on heads and other modifications follow.

Task Token. While relying on the same groundings, different tasks may still require the model to process inputs differently – e.g. referring expressions just require grounding while VQA must follow grounding with additional reasoning. To enable this, we augment the query with a task token $\mathrm{TASK}_{t}$ such that the new input format is $\{\mathrm{IMG}, v_{1},\ldots,v_{n}, \mathrm{CLS}, \mathrm{TASK}_{t}, w_{1},\ldots,w_{m}, \mathrm{SEP}\}$. The architecture can then leverage this task information in a bottom-up manner. In what follows, we describe the task-specific heads by task groups.
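As a small illustration of the input format, the sequence can be assembled as follows. The function name `build_input` and the string spellings of the special tokens are hypothetical; only the ordering of the tokens is taken from the text.

```python
from typing import List


def build_input(regions: List[str], words: List[str], task_id: int) -> List[str]:
    """Assemble IMG, v_1..v_n, CLS, TASK_t, w_1..w_m, SEP in that order."""
    return ["IMG", *regions, "CLS", f"TASK_{task_id}", *words, "SEP"]


sequence = build_input(["v1", "v2"], ["small", "red", "vase"], task_id=3)
```

The task token sits directly after CLS, so it is part of the text stream from the first layer on.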

Vocab-Based VQA Output: We compute an overall image-query representation as an element-wise product between the holistic $h_{\mathrm{IMG}}$ and $h_{\mathrm{CLS}}$ representations. As in [2, 17], we treat vocab-based VQA as a multi-label classification task – assigning a soft target score to each answer based on its relevancy to the ground truth answer. We compute scores for a set of pre-defined answers $A$ by using a two-layer MLP on top of the overall representation:

$$
P_{v}(A|I,Q)=\sigma(\mathrm{MLP}(h_{\mathrm{IMG}}\odot h_{\mathrm{CLS}}))
$$

where $\sigma$ is the sigmoid function. Due to the answer vocabulary differences, VQA and VG QA share the MLP and answer vocabulary while GQA learns a separate one.

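The VQA head above can be sketched in plain Python to make the data flow explicit: element-wise product, two linear layers, then an independent sigmoid per answer. The function names and the ReLU between the two layers are assumptions for illustration; the text only specifies a two-layer MLP followed by a sigmoid.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def vqa_head(h_img: Vector, h_cls: Vector,
             w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """sigma(MLP(h_IMG * h_CLS)): one independent score in (0, 1) per candidate answer."""
    joint = [a * b for a, b in zip(h_img, h_cls)]                      # element-wise product
    hidden = [max(0.0, z + b) for z, b in zip(matvec(w1, joint), b1)]  # ReLU is an assumption
    logits = [z + b for z, b in zip(matvec(w2, hidden), b2)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Tiny example with a 2-d representation and 2 candidate answers.
scores = vqa_head(
    h_img=[1.0, 2.0], h_cls=[3.0, 0.5],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, -3.0], [0.0, 0.0]], b2=[0.0, 0.0],
)
```

Because each answer gets its own sigmoid rather than sharing a softmax, several answers can receive high scores at once, which is what the soft multi-label targets require.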

Image Retrieval Output: Using the same overall representation, we compute an alignment score between image-caption pairs as:

$$
\mathrm{Rel}(I,Q)=W_{i}(h_{\mathrm{IMG}}\odot h_{\mathrm{CLS}})
$$

where $W_{i}\in\mathbb{R}^{d\times1}$ is shared across COCO and Flickr30k image retrieval tasks. As in [27], we train a 4-way multiple-choice against hard-negatives selected off-line and then fixed. Recent work has used online hard-negative mining [8, 23] but this is costly to compute.

Referring Expressions Output: We rerank a set of region proposals [50] given the referring expression. We pass the final representation $h_{v_{i}}$ for each image region $i$ into a learned projection $W_{r}\in\mathbb{R}^{d\times1}$ to predict a matching score.


$$
\mathrm{Rel}(v_{i},Q)=W_{r}h_{v_{i}}
$$

Note that $Q$ may be a phrase, a question, or a dialog depending on the task (RefCOCO/+/g, Visual7W, GuessWhat). $W_{r}$ is shared across all the referring expression tasks.

Multi-modal Verification Output: Taking $\mathrm{NLVR^{2}}$ as an example, the input is a concatenation of two images ($I_{0}$ and $I_{1}$) and a statement $Q$, and the model must judge the validity of the statement given the images. We consider this a classification problem given an embedding that encodes the two image-statement pairs $(I_{0},Q)$ and $(I_{1},Q)$. The output probability is predicted by a 2-layer MLP with softmax:

$$
P_{v}(C|I_{0},I_{1},Q)=\mathrm{softmax}\left(\mathrm{MLP}\left(\left[\begin{array}{c}h_{\mathrm{IMG}}^{0}\odot h_{\mathrm{CLS}}^{0}\\ h_{\mathrm{IMG}}^{1}\odot h_{\mathrm{CLS}}^{1}\end{array}\right]\right)\right)
$$

where [ ] is concatenation.


For SNLI-VE, the input is a single image and statement. We thus learn a separate classifier of the same form that predicts the semantic relationship (entailment, neutral, contradiction) from the inputs.

3.3. Large-Scale Multitask Training

With 6 task heads, 12 datasets, and over 4.4 million individual training instances, training our multi-task ViLBERT model is a daunting proposition. Multi-task learning (especially at this scale) poses significant challenges as learning objectives have complex and unknown dynamics and may compete [41]. Further, vision-and-language datasets vary significantly in size and difficulty. For instance, a single epoch of VG (our largest dataset) corresponds to 19.8 epochs of RefCOCOg (our smallest). Likewise, when trained in isolation RefCOCOg converges in 5K iterations whereas VQAv2 takes 84K iterations (over 16 times more). Below, we describe the details of our multi-task training approach and techniques to overcome these challenges.

Pretraining. All our models are pretrained on the Conceptual Caption dataset [39], including our self-supervised task modifications as described in Sec. 3.1.

Round-Robin Batch-Level Sampling. We consider a round-robin batch-level sampling regime that cycles through each task from the beginning of multi-task training. As such, one multi-task iteration consists of each task forwarding a batch and updating parameters in sequence.


Dynamic Stop-and-Go. As noted earlier, different tasks have different difficulties and dataset sizes. Consequently, simply cycling through all tasks may drastically overtrain smaller tasks, leading to overfitting. Typically, early stopping provides a strong defense against this phenomenon; however, stopping a task in multi-task training introduces problems with catastrophic forgetting as the base network drifts over time due to other tasks. We introduce an intuitive but effective dynamic stop-and-go (DSG) mechanism to avoid these problems. We monitor the validation loss $s_{t}$ of each task $t$, computing it once per task epoch. If performance improvement is less than 0.1% over 2 epochs, we consider the task Converged and shift it into stop mode. In DSG stop mode, a task only updates every iter-gap ($\Delta$) iterations. If validation performance degrades by 0.5% from the task’s best measured performance while in stop mode, the task is considered Diverged and is returned to DSG go mode. This procedure is shown in Algorithm 1.
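The stop-and-go rule can be sketched as a small per-task state machine combined with round-robin selection. This is an illustrative reading of the paragraph above, not the authors' Algorithm 1: the class name `DSGTask`, the helper `tasks_to_step`, the use of a positive higher-is-better validation score, and measuring both thresholds as relative changes are all assumptions.

```python
from typing import Dict, List


class DSGTask:
    """Go/stop state of one task under a DSG-style schedule (illustrative sketch)."""

    def __init__(self, iter_gap: int, converge_tol: float = 0.001,
                 diverge_tol: float = 0.005) -> None:
        self.iter_gap = iter_gap          # Delta: update period while in stop mode
        self.converge_tol = converge_tol  # 0.1% improvement threshold
        self.diverge_tol = diverge_tol    # 0.5% degradation threshold
        self.mode = "go"
        self.scores: List[float] = []     # one validation score per task epoch
        self.best = float("-inf")

    def end_of_epoch(self, val_score: float) -> None:
        """Record a validation score (assumed positive, higher is better) and update the mode."""
        self.scores.append(val_score)
        self.best = max(self.best, val_score)
        if self.mode == "go" and len(self.scores) >= 3:
            reference = self.scores[-3]   # score from two epochs ago
            if val_score - reference < self.converge_tol * abs(reference):
                self.mode = "stop"        # Converged
        elif self.mode == "stop":
            if val_score < self.best * (1.0 - self.diverge_tol):
                self.mode = "go"          # Diverged

    def should_update(self, iteration: int) -> bool:
        """In go mode update every iteration; in stop mode only every iter_gap iterations."""
        return self.mode == "go" or iteration % self.iter_gap == 0


def tasks_to_step(tasks: Dict[str, DSGTask], iteration: int) -> List[str]:
    """Round-robin order over tasks, skipping stopped tasks between their sparse updates."""
    return [name for name, state in tasks.items() if state.should_update(iteration)]


task = DSGTask(iter_gap=4)  # the value 4 is an arbitrary choice for this example
for score in (70.0, 70.01, 70.02):
    task.end_of_epoch(score)
mode_after_plateau = task.mode
```

A stopped task still receives an occasional update, which is the point of the design: it limits overtraining on small datasets while keeping the task head in step with the drifting trunk.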

Curriculum Learning. Inspired by prior multi-task literature [4, 31], we experimented with both curriculum and anti-curriculum strategies based on task difficulty. Specifically, for anti-curriculum we first train on the slowest-converging task-group G1 (Vocab-Based VQA) before starting full round-robin multi-task training. Inversely, for the curriculum setting we first train on our fastest-converging task-group G3 (Referring Expressions). Unlike previous observations [31, 33], we found that using no curriculum leads to superior performance when combined with the other strategies proposed in this section.

Algorithm 1: DSG for Multi-Task Learning

Setting Multi-Task Hyper-parameters. We follow a simple design philosophy – identify simple heuristics based on hyper-parameters tuned for each task in single-task training. This significantly reduces the burden of searching for joint-training hyper-parameters. See the supplement for a full list of per-task learning rates, batch sizes, and other settings. Our code has been made available.

Batch Size: For multi-task, we keep the batch size tuned for single-task training for each task.


Warm-up Duration: We found it important to set warm-up duration relative to the largest dataset. Specifically, we run linear warm-up over $\eta \cdot N$ iterations where $N$ is the maximum number of iterations taken to train any dataset in the single-task setting. We observed significant performance degradation for harder tasks when warm-up was shorter. We set $\eta$ to 0.1 for our experiments.
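A linear warm-up over $\eta \cdot N$ iterations can be written as a multiplier on the learning rate. The function name `warmup_factor` is hypothetical, and the value N = 84,000 used below is only an illustration borrowed from the VQAv2 single-task figure quoted earlier; the text does not state which dataset defines N.

```python
def warmup_factor(step: int, max_single_task_iters: int, eta: float = 0.1) -> float:
    """Learning-rate multiplier: rises linearly from 0 to 1 over eta * N steps, then stays at 1."""
    warmup_steps = max(1, round(eta * max_single_task_iters))
    return min(1.0, step / warmup_steps)


N = 84_000  # illustrative value only
halfway = warmup_factor(4_200, N)
after_warmup = warmup_factor(10_000, N)
```

With these numbers the warm-up lasts 8,400 iterations, so step 4,200 runs at half the target learning rate and any step past 8,400 runs at the full rate.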

Loss Scaling: Our model has shared and task-specific parameters and we found it important to maintain separate learning rates. For the shared base model, we set the base learning rate to the minimum over all single-task dataset parameters. To accommodate variable learning rates for each dataset, we scale the task loss for each dataset by the ratio of task target learning rate over base learning rate.
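The loss-scaling heuristic reduces to one division per task. In the sketch below the function name `loss_scales` and the learning-rate values are made up for illustration; the actual per-task learning rates are listed in the paper's supplement.

```python
from typing import Dict, Tuple


def loss_scales(task_lrs: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return the shared base learning rate (the minimum) and each task's loss multiplier."""
    base_lr = min(task_lrs.values())
    return base_lr, {task: lr / base_lr for task, lr in task_lrs.items()}


# Hypothetical single-task learning rates.
base_lr, scales = loss_scales({"vqa": 4e-5, "refcocog": 2e-5})
```

Multiplying a task's loss by `scales[task]` multiplies its gradient by the same factor, so under plain gradient descent a step taken at `base_lr` matches a step at the task's own learning rate. With adaptive optimizers the equivalence is only approximate.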

4. Experiments and Results

4.1. Single-Task Performance

To establish baseline performance for the ViLBERT architecture that forms the backbone of our multi-task experiments, we first train single-task models on top of the base ViLBERT architecture (Section 3) for each of our 12 datasets. Rows 1 and 2 in Table 2 show the performance of these models trained on the full and cleaned datasets, respectively. As expected, reducing the training set size through cleaning results in lower performance in most cases. Our improvements over the pretraining objective (Sec. 3.1) result in better downstream task performance (71.82 vs. 70.55 on VQA and 61.46 vs. 58.20 on Flickr30k Recall@1). See the supplementary for full comparison. Overall, our base architecture is competitive with prior work and a good starting point for multi-task learning.

| # | Model | G1: VQAv2 (test-dev) | G1: GQA (test-dev) | G1: VG QA (val) | G2: COCO (test, R1) | G2: Flickr30k (test, R1) | G3: RefCOCO (test) | G3: RefCOCO+ (test) | G3: RefCOCOg (test) | G3: V7W (test) | G3: GW (test) | G4: NLVR2 (testP) | G4: SNLI-VE (test) | Total params (# models) | Avg. score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Single-Task (ST), full data | 71.82 | 58.19 | 34.38 | 65.28 | 61.14 | 78.63 | 71.11 | 72.24 | 80.51 | 62.81 | 74.25 | 76.72 | 3B (12) | 67.25 |
| 2 | Single-Task (ST), cleaned data | 71.24 | 59.09 | 34.10 | 64.80 | 61.46 | 78.17 | 69.47 | 72.21 | 80.51 | 62.53 | 74.25 | 76.53 | 3B (12) | 67.03 |
| 3 | Group-Tasks (GT) | 72.03 | 59.60 | 36.18 | 65.06 | 66.00 | 80.23 | 72.79 | 75.30 | 81.54 | 64.78 | 74.62 | 76.52 | 1B (4) | 68.72 |
| 4 | All-Tasks (AT) | 72.57 | 60.12 | 36.36 | 63.70 | 63.52 | 80.58 | 73.25 | 75.96 | 82.75 | 65.04 | 78.44 | 76.78 | 270M (1) | 69.08 |
| 5 | All-Tasks (without G4) | 72.68 | 62.09 | 36.74 | 64.88 | 64.62 | 80.76 | 73.60 | 75.80 | 83.03 | 65.41 | n/a | n/a | 266M (1) | n/a |
| 6 | GT finetune → ST | 72.61 | 59.96 | 35.81 | 66.26 | 66.98 | 79.94 | 72.12 | 75.18 | 81.57 | 64.56 | 74.47 | 76.34 | 3B (12) | 68.81 |
| 7 | AT finetune → ST | 72.92 | 60.48 | 36.56 | 65.46 | 65.14 | 80.86 | 73.45 | 76.00 | 83.01 | 65.15 | 78.87 | 76.73 | 3B (12) | 69.55 |
| 8 | AT finetune → ST | 73.15 | 60.65 | 36.64 | 68.00 | 67.90 | 81.20 | 74.22 | 76.35 | 83.35 | 65.69 | 78.87 | 76.95 | 3B (12) | 70.24 |

Table 2: Comparison of our multi-task models to single-task performance. We find multi-task training (rows 3-5) provides significant gains over single-task training (rows 1-2) while reducing the parameter count from over 3 billion to 270 million. Further, following multi-task training by task-specific fine-tuning (rows 6-9) further gains can be made at the cost of increased parameters.

| Row task | G1 | G2 | G3 | G4 | Avg. (pairs) | G1&G2 | G1&G3 | G1&G4 | G2&G3 | G2&G4 | G3&G4 | Avg. (triples) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1 (VQAv2) | n/a | 0.38% | 0.38% | -0.20% | 0.19% | n/a | n/a | n/a | 0.63% | -0.08% | 0.18% | 0.24% |
| G2 (Flickr30k) | 0.46% | n/a | 0.23% | -4.13% | -1.15% | n/a | 1.24% | 0.49% | n/a | n/a | -4.36% | -0.88% |
| G3 (Visual7W) | 0.39% | 0.78% | n/a | 0.24% | 0.47% | 0.86% | n/a | 0.19% | n/a | 0.29% | n/a | 0.44% |
| G4 (NLVR2) | 2.29% | 1.47% | 0.67% | n/a | 1.48% | 3.69% | 3.22% | n/a | 2.73% | n/a | n/a | 3.21% |
| Avg. | 1.04% | 0.88% | 0.43% | -1.36% | n/a | 2.27% | 2.23% | 0.34% | 1.68% | 0.10% | -2.09% | n/a |

Table 3: Pair-wise (left) and triple-wise (right) inter-group representative task analysis. Each entry is the relative performance change from single-task training for the row-task when jointly trained with the column-task(s).

4.2. Intra-Group Multi-task Performance

We begin with the most intuitive multi-task setting – jointly training tasks within the same groups. As grouped tasks are typically highly related, this is akin to some existing data augmentation practices (e.g. adding Visual Genome (VG) QA data when training VQA). Note this corresponds to four separate multi-task models – one for each group.


Table 2 row 3 shows the result of intra-group multi-task training. Comparing with single-task models trained on the same data (row 2), we see meaningful improvements of between 0.37 ($\mathrm{NLVR^{2}}$) and 4.54 (Flickr30k retrieval) points for 11 out of 12 tasks (only SNLI-VE did not improve). Comparing to row 1, we see that intra-group multi-task training overcomes the data loss from cleaning with an average score of 68.72, outperforming the single-task models trained on the full datasets, which have an average score of 67.25. Further, the total number of parameters drops by a factor of 3, going from 12 full models to only 4.

4.3. Inter-Group Multi-task Performance

4.3. 组间多任务性能

Representative Task Analysis. We next consider the interplay between different task-groups. For efficiency, we consider multi-task training with representative tasks from each group – specifically VQA (G1), Retrieval Flickr30k (G2), Visual7W (G3), and $\mathrm{NLVR^{2}}$ (G4). These were selected to maximize diversity in underlying image sources. We examine their relationships by jointly training all pairs and triplets of tasks under our multi-task training approach.

代表性任务分析。接下来我们考虑不同任务组之间的相互作用。为了提高效率,我们选取了每组具有代表性的任务进行多任务训练——具体包括VQA (G1)、Retrieval Flickr30k (G2)、Visual7W (G3)和$\mathrm{NLVR^{2}}$ (G4)。这些任务的选取旨在最大化底层图像来源的多样性。我们通过多任务训练方法联合训练所有任务对和三元组,来研究它们之间的关系。

Table 3 (left) shows the results of training each representative task pair. Each entry is the percent change from single-task performance for the row-task when jointly trained with the column-task. As such, the Avg. row (bottom) shows the mean impact each column-task has on other tasks, and likewise the Avg. column (right) shows the mean impact other tasks have on each row-task. For instance, we find that adding VQA (G1) benefits other tasks with an average improvement of +1.04%. Interestingly, adding $\mathrm{NLVR^{2}}$ (G4) degrades other tasks on average (-1.36%) while making significant gains itself (+1.48%). This is primarily due to a -4.13% interaction with G2. Table 3 (right) shows all task triplets. Gains in the paired experiments are not simply additive. In the pair-wise analysis, G3 gained +0.39% and +0.78% from G1 and G2 respectively. As before, G4 has some strong negative effects on other groups (-4.36% for G2 with G3 & G4), but these effects can be regulated by other tasks (+0.49% for G2 with G1 & G4).

表 3 (左) 展示了每个代表性任务对的训练结果。每个单元格表示行任务与列任务联合训练时,相对于单任务性能的百分比变化。因此,Avg. 行(底部)显示每项列任务对其他任务的平均影响,同理 Avg. 列(右侧)显示其他任务对每项行任务的平均影响。例如,我们发现添加 VQA (G1) 能使其他任务平均提升 +1.04%。有趣的是,添加 $\mathrm{NLVR^{2}}$ (G4) 会平均降低其他任务性能 (-1.36%),但其自身却获得显著提升 (+1.48%),这主要源于与 G2 的 -4.13% 交互效应。表 3 (右) 展示了所有任务三元组情况,配对实验中的增益并非简单叠加。在成对分析中,G3 从 G1 和 G2 分别获得 +0.39% 和 +0.78% 提升。如前所述,G4 对其他组别存在较强负面影响 (G2 与 G3 & G4 组合为 -4.36%),但这些影响可通过其他任务调节 (G2 与 G1 & G4 组合为 +0.49%)。

Table 4: Comparison to recent SOTA. For image retrieval (IR) COCO and Flickr we report R1 scores on the 1K test set.

表 4: 与近期SOTA方法的对比。对于图像检索(IR)任务中的COCO和Flickr数据集,我们报告了在1K测试集上的R1分数。

| Task | Split | SOTA | UNITER [8] $\mathrm{BERT_B}$ | UNITER [8] $\mathrm{BERT_L}$ | $\mathrm{Ours_{AT}}$ $\mathrm{BERT_B}$ | $\mathrm{Ours_{AT->ST}}$ $\mathrm{BERT_B}$ |
| --- | --- | --- | --- | --- | --- | --- |
| VQA | test-dev | - | 72.27 | 73.24 | 72.57 | 73.15 |
| VG QA | val | - | - | - | 36.36 | 36.64 |
| GQA | test-dev | 60.00 [45] | - | - | 60.12 | 60.65 |
| IR COCO | test (R1) | 68.50 [23] | - | - | 63.70 | 68.00 |
| IR Flickr30k | test (R1) | - | 71.50 | 73.66 | 63.52 | 67.90 |
| RefCOCO | test | - | 80.21 | 80.88 | 80.58 | 81.20 |
| RefCOCO+ | test | - | 72.90 | 73.73 | 73.25 | 74.22 |
| RefCOCOg | test | - | 74.41 | 75.77 | 75.96 | 76.35 |
| Visual 7W | test | 72.53 [16] | - | - | 82.75 | 83.35 |
| GuessWhat | test | 61.30 [13] | - | - | 65.04 | 65.69 |
| NLVR2 | testP | - | 77.87 | 79.50 | 78.44 | 78.87 |
| SNLI-VE | test | - | 78.02 | 78.98 | 76.78 | 76.95 |
| # params (# models) | | - | 602M (7 x 86M) | 2.1B (7 x 303M) | 270M (1 x 270M) | 3B (12 x 250M) |

Full Multi-task Results. We move to our main result – a single model trained on all 12 datasets. The results of this All-Tasks (AT) model are shown in Table 2 row 4. This model outperforms independent single-task models trained on the same data (row 2) for 11 out of 12 tasks and improves the average score by 2.05 points (69.08 vs. 67.03). We reiterate for emphasis: average performance improves by 2.05 points while the number of parameters drops from over 3 billion to 270 million (a $12\times$ reduction). The same holds when comparing with single-task models trained on the full datasets (row 1), with a similar margin of 1.83 points.

完整多任务结果。我们转向主要结果——一个在全部12个数据集上训练的单一模型。这个全任务(AT)模型的结果如表2第4行所示。该模型在12个任务中有11个优于相同数据上训练的独立单任务模型(第2行),并将平均得分提高了2.05分(69.08对67.03)。我们再次强调,平均性能提升2.05分的同时,参数量从超过30亿减少到2.7亿(缩减12倍)。与完整数据集训练的单任务模型(第1行)相比,同样保持了1.83分的类似优势。

Our AT model also outperforms the Group-Task (GT) models (row 3) despite having $4\times$ fewer parameters (avg. 69.08 vs. 68.72). This implies that despite their diversity, tasks across different groups can benefit from joint training.

我们的AT模型在参数量减少4倍的情况下(平均69.08对68.72),表现仍优于分组任务(GT)模型(第3行)。这表明尽管任务存在多样性,跨组别的联合训练仍能带来收益。

We observed from the representative task analysis that G4 tends to negatively affect other groups during joint training. To validate this observation on all tasks, we train an All-Task model without G4 (row 5). This model achieves a higher average score of 67.96 on G1+G2+G3 compared to the full AT model’s 67.39. $\mathrm{NLVR^{2}}$ (G4) presents two images per description, and often one matches while the other does not. Despite the alignment with one image, the instance as a whole is negative. We speculate that this supervision may interfere with the standard caption-image alignment objective in Flickr30k.

我们从代表性任务分析中观察到,G4在联合训练时往往会对其他组产生负面影响。为了在所有任务上验证这一观察结果,我们训练了一个不含G4的全任务模型(第5行)。该模型在$\mathrm{G}1{+}\mathrm{G}2{+}\mathrm{G}3$上取得了67.96的平均分,高于完整AT模型的67.39分。$\mathrm{NLVR^{2}}$(G4)每个描述对应两张图像,通常一张匹配而另一张不匹配。尽管其中一张图像对齐,但实例整体仍为负样本。我们推测这种监督信号可能会干扰Flickr30k中标准的图文对齐目标。

4.4. Multi-Task Learning as Pretraining

4.4. 多任务学习作为预训练

For some applications, single-task performance may be paramount and justify storing a task-specific model. Even then, fine-tuning from a multi-task trained model may allow the model to take advantage of the additional, diverse supervision captured during multi-task training. Following [26], we finetune our trained multi-task models (GT and AT) on each downstream task and show results in Table 2. Rows 6 and 7 show that finetuning from the all-task model (AT) outperforms finetuning from the group-task models (GT), with an average score of 69.51 vs. 68.81. For comparison with our multi-task models, these are finetuned on the cleaned datasets, which are 11% smaller on average. To compare to prior work, we also finetune on the full dataset for individual tasks (row 8) and observe further improvements. Recall that our multi-task model was trained on cleaned data, so there is no possibility of test leakage here. These models outperform single-task models without multi-task pretraining (row 1) by a large margin (70.24 vs. 67.25 avg. score).

对于某些应用场景,单一任务的性能可能至关重要,因此值得存储任务专用模型。即便如此,基于多任务训练模型进行微调,仍能让模型利用多任务训练期间捕获的额外多样化监督信号。参照[26]的做法,我们在每个下游任务上对已训练的多任务模型(GT和AT)进行微调,结果如表2所示。第6-7行数据显示,基于全任务模型(AT)微调的表现优于基于分组任务模型(GT)微调,平均得分分别为69.51和68.81。为与多任务模型对比,这些模型是在清洗后的数据集上微调的,该数据集平均规模缩小了$11%$。为与已有研究对比,我们还在完整数据集上对单项任务进行微调(第8行),观察到进一步提升。需注意我们的多任务模型基于清洗数据训练,因此不存在测试数据泄露。这些模型大幅领先未经多任务预训练的单任务模型(第1行)(平均得分70.24 vs 67.25)。

Table 5: Comparison with other multi-task models. VQA score is on test-dev and the retrieval tasks on their respective 1K test split. For Flickr Grounding (FG) we report R1 on Flickr30K test.

表 5: 与其他多任务模型的对比。VQA分数基于test-dev数据集,检索任务基于各自的1K测试集划分。对于Flickr Grounding (FG),我们报告Flickr30K测试集上的R1分数。

| Model | VQA | COCO Retrieval R1 | COCO Retrieval R5 | COCO Retrieval R10 | Flickr Retrieval R1 | Flickr Retrieval R5 | Flickr Retrieval R10 | FG R1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OmniNet [36] | 55.76 | - | - | - | - | - | - | - |
| HDC [33] | 69.28 | 57.40 | 88.40 | 95.60 | 56.10 | 82.90 | 89.40 | 57.39 |
| Ours | 72.70 | 65.16 | 91.00 | 96.20 | 65.06 | 88.66 | 93.52 | 64.61 |

4.5. Comparison with Existing Work

4.5. 与现有工作的对比

In Table 4 we compare with the existing state of the art. We draw special comparison with the recent UNITER [8] architecture as it is similar to our base ViLBERT model. Like ViLBERT, UNITER is a general BERT-based vision-and-language architecture pretrained through self-supervised tasks and then finetuned for each downstream task. We show two UNITER columns corresponding to their underlying BERT model – either Base ($\mathrm{BERT_B}$) or Large ($\mathrm{BERT_L}$). Our ViLBERT model uses the smaller $\mathrm{BERT_B}$. Our single all-task model ($\mathrm{Ours_{AT}}$) achieves performance competitive with state-of-the-art task-specific models. Our single-task finetuned models ($\mathrm{Ours_{AT->ST}}$) surpass the state of the art on 7 out of 12 tasks.

在表4中,我们与现有最先进技术进行了比较。我们特别与近期提出的UNITER [8]架构进行了对比,因为其与我们基础的ViLBERT模型相似。与ViLBERT类似,UNITER是一种基于BERT的通用视觉-语言架构,通过自监督任务进行预训练,然后针对每个下游任务进行微调。我们展示了两列UNITER结果,分别对应其底层BERT模型——Base B或Large L。我们的ViLBERT模型使用了较小的$\mathrm{BERT}{\mathrm{B}}$。我们的单一全任务模型$(\mathrm{Ours}{\mathrm{AT}})$达到了与最先进专用任务模型相当的性能。我们的单任务微调模型$(\mathrm{Ours}_{\mathrm{AT->ST}})$在12项任务中有7项超越了当前最优水平。

Table 5 compares our method with other recently proposed multi-modal, multi-task learning approaches – OmniNet [36] and Hierarchical Dense Co-Attention (HDC) [33]. OmniNet is trained on part-of-speech tagging, image captioning, visual question answering, and video activity recognition, while HDC is trained on image caption retrieval, visual question answering, and visual grounding. We train a multi-task model on the same tasks and cleaned datasets used in HDC [33]. Flickr Grounding is a new task that we include for this comparison. Our multi-task model outperforms these approaches by a large margin.

表 5: 将我们的方法与近期提出的多模态多任务学习方法——OmniNet [36] 和分层密集共注意力 (HDC) [33] 进行对比。OmniNet 在词性标注、图像描述生成、视觉问答和视频活动识别任务上训练,而 HDC 则在图像描述检索、视觉问答和视觉定位任务上训练。我们在 HDC [33] 使用的相同任务和清洗数据集上训练了一个多任务模型。Flickr Grounding 是我们为本次对比新增的任务。我们的多任务模型以显著优势超越了这些方法。

5. Analysis and Ablation Study

5. 分析与消融研究

Ablations on task token and training strategies. To verify our design choices, we perform ablations for different task token granularities and multi-task training strategies. The results are shown in Table 6. We report the average performance per group and the overall average. A detailed breakdown for each task can be found in the supplement.

任务token与训练策略的消融实验。为验证设计选择,我们针对不同任务token粒度和多任务训练策略进行了消融实验。结果如表6所示,汇报了平均组别性能和整体平均性能。各任务详细数据可查阅补充材料。

Table 6: Ablations on our design choices and comparison to curriculum and anti-curriculum learning multi-task approaches.

表 6: 设计选择的消融实验及与课程学习和反课程学习多任务方法的对比

| Row | Setting | G1 | G2 | G3 | G4 | All Tasks Average |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | AT (ours), token per dataset | 56.35 | 63.61 | 75.52 | 77.61 | 69.08 |
| 2 | token per head | 55.95 | 61.48 | 75.35 | 77.37 | 68.52 |
| 3 | w/o task token | 55.67 | 62.55 | 75.38 | 76.73 | 68.53 |
| 4 | w/o DSG | 55.50 | 62.92 | 75.24 | 76.31 | 68.52 |
| 5 | w/ curriculum | 54.68 | 61.21 | 75.19 | 76.70 | 67.24 |
| 6 | w/ anti-curriculum | 55.82 | 59.58 | 73.69 | 75.94 | 67.98 |
| 7 | vanilla multitask | 54.09 | 61.45 | 75.28 | 76.71 | 67.92 |

For task tokens, our default setting is with a different task token per dataset (12 total, Row 1). We compare this with two ablations: one task token per output head (4 total, Row 2) and no task tokens (Row 3). We observe that task-specific tokens lead to better performance compared to head-based tokens (avg. 69.08 vs. 68.52) and no task tokens (avg. 69.08 vs. 68.53). This shows that task-aware feature embedding is useful even within the same output space; e.g. per-task tokens may help differentiate noun phrases and pointing questions in Referring Expression.

对于任务token,我们的默认设置为每个数据集使用不同的任务token(共12个,第1行)。我们将其与两种消融实验进行比较:每个输出头使用一个任务token(共4个,第2行)和不使用任务token(第3行)。我们观察到,相比基于头的token(平均69.08 vs. 68.52)和不使用任务token(平均69.08 vs. 68.53),特定任务的token能带来更好的性能。这表明即使在相同的输出空间中,任务感知的特征嵌入也是有用的;例如,每个任务的token可能有助于区分Referring Expression中的名词短语和指向性问题。
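The three task-token granularities compared above can be illustrated with a minimal sketch. The group membership below follows the four task groups used in this paper; treating each group as one output head, the token strings, and prepending the token to the query are all illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of task-token granularity (not the authors' code).
# Assumption: each of the four task groups maps to one output head.
GROUPS = {
    "G1": ["VQA", "VG QA", "GQA"],
    "G2": ["COCO Retrieval", "Flickr30k Retrieval"],
    "G3": ["RefCOCO", "RefCOCO+", "RefCOCOg", "Visual7W", "GuessWhat"],
    "G4": ["NLVR2", "SNLI-VE"],
}


def build_task_tokens(granularity):
    """Map every dataset to a task token string (or None).

    granularity: "dataset" (12 tokens), "head" (4 tokens) or "none".
    """
    tokens = {}
    for group, datasets in GROUPS.items():
        for dataset in datasets:
            if granularity == "dataset":
                tokens[dataset] = f"[TASK_{dataset}]"
            elif granularity == "head":
                tokens[dataset] = f"[TASK_{group}]"
            elif granularity == "none":
                tokens[dataset] = None
            else:
                raise ValueError(f"unknown granularity: {granularity}")
    return tokens


def add_task_token(query_tokens, dataset, tokens):
    """Prepend the dataset's task token to a tokenized query, if it has one."""
    token = tokens[dataset]
    if token is None:
        return list(query_tokens)
    return [token] + list(query_tokens)


per_dataset = build_task_tokens("dataset")
per_head = build_task_tokens("head")
example = add_task_token(["largest", "dog"], "RefCOCO", per_dataset)
```

With per-dataset tokens, RefCOCO and Visual7W receive different tokens even though they fall in the same group, which is the distinction the ablation in Rows 1 and 2 measures.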

For the multi-task training schedule, we compare our dynamic stop-and-go (DSG) (Row 3) with the Curriculum (Row 5) and Anti-Curriculum (Row 6) approaches discussed in Sec. 3. We consider convergence rate as a measure of task difficulty. For Curriculum, we first train tasks in G4 and then train all tasks together (easier $\rightarrow$ harder). For Anti-Curriculum, we train G1 tasks first and then train on all tasks together (harder $\rightarrow$ easier). Table 6 shows our dynamic stop-and-go training schedule outperforms anti-curriculum (avg. 68.52 vs. 67.98) and curriculum (avg. 68.53 vs. 67.24). Row 7 shows results of a ‘vanilla’, round-robin training scheme with no task tokens or training scheduling. The average score of vanilla multitask is close to anti-curriculum (67.92 vs. 67.98). Consistent with prior work [31], performance on harder tasks (G1) is worse compared to anti-curriculum. Our full training regime outperforms this significantly (avg. 69.08 vs. 67.92).

在多任务训练计划方面,我们将动态启停 (DSG) (第3行) 与第3节讨论的课程学习 (第5行) 和反课程学习 (第6行) 方法进行对比。我们将收敛速度作为任务难度的衡量标准。对于课程学习,我们首先训练G4组任务,然后同时训练所有任务 (由易到难) 。对于反课程学习,我们先训练G1组任务,再同时训练所有任务 (由难到易) 。表6显示我们的动态启停训练计划优于反课程学习 (平均68.52 vs. 67.98) 和课程学习 (平均68.53 vs. 67.24) 。第7行展示了没有使用任务token或训练调度的基础轮询训练方案结果,其多任务平均得分接近反课程学习 $(67.92\nu s.67.98)$ 。与先前研究 [31] 一致,在较难任务 (G1) 上的表现逊于反课程学习。我们完整的训练方案显著优于该基准 (平均69.08 vs. 67.92) 。

Behavior of Dynamic Stop-and-Go training. To characterize our dynamic stop-and-go training scheme, we visualize the dynamic training schedule in Fig. 2 (left) – bold lines indicate normal go training and thin lines are stop states when datasets receive sparser updates at a fixed iteration gap (every 4th iteration here). We see that smaller datasets quickly converge and enter stop state training early. As the base model drifts over time, they periodically return to full go state training to adjust. Interestingly, after some cycles of this, they enter the stop state and continue with only sparse updates for the rest of training.

动态停走训练的行为特征。为了描述我们的动态停走训练方案,我们在图2(左)中可视化了动态训练计划——粗线表示正常的go训练状态,细线则是数据集以固定迭代间隔(此处为每4次迭代)接收稀疏更新时的stop状态。可以看到较小数据集会快速收敛并提前进入stop状态训练。随着基础模型随时间漂移,它们会周期性返回完整go状态进行调整。有趣的是,经过若干次循环后,这些数据集会进入stop状态并在剩余训练中仅保持稀疏更新。

Another aspect of dynamic stop-and-go training is the sparsity of updates in the stop state. Fig. 2 (right) shows the mean normalized accuracy for each group for multi-task models trained with different iteration gaps $(\Delta)$ . We observe that raising $\Delta$ (i.e. updating more sparsely) improves performance initially but degrades for larger values. Absolute and per-task scores are provided in the supplement.

动态启停训练的另一个方面是停止状态下更新的稀疏性。图 2 (右) 展示了在不同迭代间隔 $(\Delta)$ 下训练的多任务模型每组平均归一化准确率。我们观察到增大 $\Delta$ (即更稀疏地更新) 初期能提升性能,但随着值增大会导致性能下降。补充材料中提供了绝对值和逐任务得分。
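The stop-and-go behaviour described above can be sketched as a small per-dataset state machine. This is a toy illustration only: the class name, the use of absolute validation-score differences, and the thresholds `stop_tol` and `go_drop` are assumptions made for the example, while `iter_gap` plays the role of $\Delta$.

```python
class StopAndGoScheduler:
    """Toy per-dataset stop-and-go state machine (thresholds are assumptions)."""

    def __init__(self, iter_gap=4, stop_tol=0.001, go_drop=0.005):
        self.iter_gap = iter_gap    # Delta: update every iter_gap-th iteration in stop mode
        self.stop_tol = stop_tol    # improvement below this counts as converged
        self.go_drop = go_drop      # drop from the best score that reactivates go mode
        self.state = "go"
        self.best = None
        self.prev = None

    def should_update(self, iteration):
        """Return True if this dataset gets a gradient update at this iteration."""
        if self.state == "go":
            return True
        return iteration % self.iter_gap == 0

    def observe(self, val_score):
        """Feed a new validation score and return the resulting state."""
        if self.best is None:
            self.best = self.prev = val_score
            return self.state
        if self.state == "go" and val_score - self.prev < self.stop_tol:
            self.state = "stop"     # converged: switch to sparse updates
        elif self.state == "stop" and self.best - val_score > self.go_drop:
            self.state = "go"       # drifted away from the best score: train fully again
        self.best = max(self.best, val_score)
        self.prev = val_score
        return self.state


sched = StopAndGoScheduler(iter_gap=4)
states = [sched.observe(s) for s in (0.50, 0.60, 0.6005)]
sparse_updates = [i for i in range(8) if sched.should_update(i)]
after_drift = sched.observe(0.59)
```

In this sketch a larger `iter_gap` means sparser updates in the stop state, which is the quantity varied in Fig. 2 (right).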


Figure 2: Left: Visualization of dynamic stop-and-go during multi-task training. Bold lines indicate go mode while thin lines indicate stop mode. Right: Mean accuracy (normalized group-wise for easier comparison) for each group with different iteration gaps $\Delta$ for dynamic stop-and-go.

图 2: 左: 多任务训练中动态启停 (Dynamic stop-and-go) 的可视化。实线表示运行模式,虚线表示停止模式。右: 采用不同迭代间隔 $\Delta$ 的动态启停策略时,各组的平均准确率 (为便于比较已进行组间归一化处理)。

Multi-Task visual grounding consistency. Given the common shared base model, one question is whether multi-task models exhibit more consistent visual groundings than independent task-specific models. For example, does a model that correctly answers “What color is the largest dog?” also correctly ground the referring expression “largest dog”? To assess this, we consider 1500 images from the RefCOCO/+ test sets that also have VQA annotations, such that for each image $I_{i}$ there are associated questions $\{q^{(i)}\}$ and referring expressions $\{r^{(i)}\}$. To measure the overlap in visual concepts between a question $q_{j}^{(i)}$ and reference $r_{k}^{(i)}$, we count overlapping nouns and adjectives (identified using a part-of-speech tagger [47]) and denote this $d(q_{j}^{(i)},r_{k}^{(i)})$. Armed with this notion of similarity, we consider each question-reference pair for each image (111,275 combinations in total) and compute a weighted accuracy. A pair is considered correct if the question was answered correctly and the referent was localized. Each pair is weighted by its overlap $d(q_{j}^{(i)},r_{k}^{(i)})$. Note that if $q_{j}^{(i)}$ and $r_{k}^{(i)}$ do not have any common visual concept ($d(q_{j}^{(i)},r_{k}^{(i)}) = 0$), the correctness of this pair does not affect the overall metric.

多任务视觉定位一致性。给定共享的基础模型,一个重要问题是多任务模型是否比独立的任务专用模型展现出更一致的视觉定位能力。例如,一个能正确回答"最大的狗是什么颜色?"的模型,是否也能准确定位指代表达式"最大的狗"?为评估这一点,我们从 $\mathrm{RefCOCO/+}$ 测试集中选取1500张图像,这些图像同时具有VQA标注,使得每张图像 $I_{i}$ 都关联着问题集 ${\boldsymbol{q}^{(i)}}$ 和指代表达式集 ${r^{(i)}}$。为衡量问题 $q_{j}^{\left(i\right)}$ 与指代 $r_{k}^{(i)}$ 之间视觉概念的重叠程度,我们统计重叠的名词和形容词(使用词性标注工具[47]识别),记为 $d(q_{j}^{(i)},r_{k}^{(i)})$。基于这种相似性度量,我们对每张图像的问题-指代对(共111,275组组合)计算加权准确率:当问题回答正确且指代定位准确时,该组合被视为正确,其权重由重叠度 $d(q_{j}^{(i)},r_{k}^{(i)})$ 决定。需注意的是,若 $q_{j}^{\left(i\right)}$ 与 $r_{k}^{(i)}$ 不存在共同视觉概念(即 $d(q_{j}^{(i)}, r_{k}^{(i)}) = 0$),则该组合的正确性不影响总体指标。
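The weighted accuracy described above can be sketched as follows. The function names are hypothetical, the concept sets are assumed to be already extracted (the part-of-speech tagging step is not reproduced here), and normalizing by the sum of the weights is an assumption about how the weighted accuracy is computed.

```python
def concept_overlap(question_concepts, reference_concepts):
    """d(q, r): number of nouns/adjectives shared by a question and a reference.

    Both arguments are collections of already-extracted concept words.
    """
    return len(set(question_concepts) & set(reference_concepts))


def weighted_consistency(pairs):
    """Weighted accuracy over question-reference pairs.

    pairs: iterable of (overlap, question_correct, referent_localized).
    A pair counts as correct only if both predictions are correct; pairs
    with overlap 0 contribute nothing to numerator or denominator.
    """
    total_weight = 0
    correct_weight = 0
    for overlap, question_correct, referent_localized in pairs:
        total_weight += overlap
        if question_correct and referent_localized:
            correct_weight += overlap
    return correct_weight / total_weight if total_weight else 0.0


d = concept_overlap({"color", "largest", "dog"}, {"largest", "dog"})
score = weighted_consistency([
    (2, True, True),    # both correct, overlap 2
    (1, True, False),   # answered correctly but referent not localized
    (0, False, False),  # no shared concept: ignored by the metric
    (1, True, True),
])
```

The zero-overlap pair in the example leaves the score unchanged, matching the note that such pairs do not affect the overall metric.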

We evaluate our Single-Task (ST), All-Task (AT), and finetuned from All-Task (AT->ST) models on the proposed metric. AT consistently outperforms ST (58.30% vs. 55.40%) and AT->ST achieves the best performance (64.64%). This shows that our model trained on multiple tasks achieves better visual grounding consistency across different tasks. Further analysis can be found in the supplement.

我们评估了单任务 (ST)、全任务 (AT) 以及从全任务微调 ($\scriptstyle({\mathtt{A T}}->{\mathtt{S T}})$) 的模型在提出的指标上的表现。AT 始终优于 ST (55.40% vs. 58.30%),而 AT->ST 取得了最佳性能 (64.64%)。这表明我们在多任务上训练的模型在不同任务间实现了更好的视觉定位一致性。更多分析详见补充材料。

Regularizing effects of multi-task learning. We find multi-task training to have a regularizing effect on tasks which overfit when trained separately. In Fig. 4 we plot the training and validation curves for two tasks (SNLI-VE and Flickr Grounding) where single-task training overfits quickly. On the other hand, when trained in a multi-task setup with all other tasks, the validation score improves and there is no overfitting.

多任务学习的正则化效应。我们发现多任务训练对单独训练时容易过拟合的任务具有正则化效果。图4展示了两个任务(SNLI-VE和Flickr Grounding)的训练与验证曲线,这些任务在单任务训练时会快速过拟合。相比之下,当与所有其他任务进行多任务联合训练时,验证分数得到提升且没有过拟合。

Figure 3: Our single model ($\mathrm{Ours_{AT}}$) can perform a multitude of V&L tasks: caption and image retrieval, question answering, grounding phrases, guessing image regions based on a dialog, verifying facts about a pair of images, natural language inferences from an image, etc. Here we show outputs of our model for a variety of inputs (that mimic tasks from the 12 datasets it has been trained on).

图 3: 我们的单一模型 ($\mathrm{Ours_{AT}}$) 能够执行多种视觉与语言(V&L)任务:图像描述与检索、问答、短语定位、基于对话猜测图像区域、验证图像对的事实、从图像中推断自然语言等。此处展示了模型针对各类输入(模拟其训练过的12个数据集任务)的输出示例。

Figure 4: Multi-task training acts as a regularizer.

图 4: 多任务训练起到正则化器 (regularizer) 作用。

Qualitative examples. Figure 3 shows example outputs of our models. Due to space limitations, we provide extensive visualizations in the supplement.

定性分析示例。图 3: 展示了我们模型的输出示例。由于篇幅限制,更多可视化结果详见补充材料。

6. Related Work

6. 相关工作

Multi-task learning. There has been substantial interest in multi-task learning [6,38], i.e. training a single model for multiple tasks at once. Advances in multi-task learning have been developed in the context of vision [5,20,32,42,52,53], language [10, 25, 26, 31, 37], and robotics [18, 34, 46]. Among them, Standley et al. [41] study how different vision tasks are related to each other. Strezoski et al. [42] study layer-wise task routing for different vision tasks. McCann et al. [31] pose ten natural language processing (NLP) tasks as question answering tasks. MT-DNN [26] combines multi-task learning with pretraining [14] to improve the learning of text representations. Despite this progress, it is still challenging to train a single model on many tasks that can outperform or even match their single-task counterparts. To enhance the training scheme, BAM [9] applies knowledge distillation where single-task models teach the multi-task model. Raffel et al. [37] explore different sampling strategies for NLP tasks. We focus on multi-task learning for V&L tasks.

多任务学习。多任务学习 [6,38](即训练单个模型同时处理多个任务)引起了广泛关注。该领域在视觉 [5,20,32,42,52,53]、语言 [10,25,26,31,37] 和机器人 [18,34,46] 方向均取得进展。其中 Standley 等人 [41] 研究了不同视觉任务间的关联性,Strezoski 等人 [42] 探索了面向不同视觉任务的层级任务路由机制,McCann 等人 [31] 将十项自然语言处理 (NLP) 任务重构为问答任务。MT-DNN [26] 通过结合多任务学习与预训练 [14] 改进了文本表征学习。尽管取得进展,训练单一模型在多项任务上达到或超越单任务模型性能仍具挑战性。为优化训练方案,BAM [9] 采用单任务模型指导多任务模型的知识蒸馏方法,Raffel 等人 [37] 则探索了 NLP 任务的不同采样策略。本文聚焦视觉与语言 (V&L) 任务的多任务学习。

Vision and language. While we address 12 V&L tasks in Sec. 2.1, we do miss some families of tasks including image and video captioning [7], visual dialog [12], embodied question answering [11] and instruction following [3]. Different from earlier work [16,22,28,29,50,51,55], which designs bespoke architectures for different tasks, recently proposed models for V&L [1, 8, 23, 24, 27, 43, 45, 54] provide a common architecture that can be pretrained using self-supervised losses and adapted to many vision and language tasks. However, these models still require task-specific fine-tuning, which may easily overfit on small datasets. Our single model jointly learns from multiple V&L tasks and achieves competitive performance. Further, multi-task training provides a better visiolinguistic representation for task-specific finetuning than self-supervised objectives.

视觉与语言。虽然我们在第2.1节中涉及了12项视觉与语言(V&L)任务,但仍遗漏了部分任务类型,包括图像视频描述生成 [7]、视觉对话 [12]、具身问答 [11] 和指令跟随 [3]。与早期研究 [16,22,28,29,50,51,55] 为不同任务设计定制架构不同,近期提出的视觉与语言模型 [1,8,23,24,27,43,45,54] 提供了可通过自监督损失预训练并适配多种视觉语言任务的通用架构。然而这些模型仍需任务特定的微调,在小数据集上容易过拟合。我们的单一模型通过多任务联合学习取得了具有竞争力的性能。此外,相比自监督目标,多任务训练能为任务特定微调提供更优的视觉语言表征。

Multi-task V&L learning. Recent work [33, 36, 40] also explores multi-task learning in V&L. HDC [33] trains a multi-task network on multiple datasets and uses a hyperparameter search method to determine which layer output should be taken for each task. Our method does not need any hyperparameter search to choose outputs for different tasks and outperforms both [36] and [33]. [40] is a concurrent work that does multi-task training on 12 dialogue datasets (only two with images). Our work differs in that we focus on a variety of vision and language tasks.

多任务视觉与语言学习。近期研究[33, 36, 40]也探索了视觉与语言领域的多任务学习。HDC[33]在多个数据集上训练多任务网络,并通过超参数搜索方法确定每项任务应采用的层级输出。我们的方法无需任何超参数搜索即可为不同任务选择输出,且性能优于[36]和[33]。[40]是一项同期工作,在12个对话数据集(仅两个含图像)上进行多任务训练。我们的研究差异在于聚焦多样化的视觉与语言任务。

7. Conclusion

7. 结论

In this work, we develop a training regime and experimental setting for large-scale, multi-modal, multi-task learning. As one part of this, we introduce a novel task scheduling approach to help avoid over- or under-training tasks with differing sizes or difficulties. Using this framework, we explore the relationships between 12 vision-and-language datasets – our single multi-task model outperforms 12 single-task models. We find multi-task training can lead to significant gains over independent task training. Further, we show that multi-task learning is an effective pretraining task for training state-of-the-art single-task models.

在本工作中,我们开发了一套适用于大规模、多模态、多任务学习的训练机制与实验设置。作为其中一环,我们提出了一种新颖的任务调度方法,用于避免对不同规模或难度的任务产生过训练或欠训练。通过该框架,我们探索了12个视觉-语言数据集之间的关联性——我们的单一多任务模型性能超越了12个单任务模型。研究发现,多任务训练相比独立任务训练能带来显著性能提升。此外,我们证明多任务学习可作为训练顶尖单任务模型的有效预训练手段。

Acknowledgement. The GaTech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

致谢。GaTech的研究工作部分得到了美国国家科学基金会(NSF)、空军研究实验室(AFRL)、国防高级研究计划局(DARPA)、海军研究办公室青年研究员计划(ONR YIPs)、陆军研究办公室青年研究员总统奖(ARO PECASE)以及亚马逊公司的资助。本文所包含的观点和结论仅代表作者个人立场,不应被解释为美国政府或任何资助机构明示或暗示的官方政策或背书。

References

参考文献


8. Supplementary

8. 补充材料

In this section, we first show the full details of the cleaned dataset in Sec. 8.1. We further discuss the modifications in pretraining, show our multi-task model architecture and describe the implementation details in Sec. 8.2, Sec. 8.3 and Sec. 8.4 respectively. The rest of the section provides extensive experimental results to fully analyze our proposed model.

在本节中,我们首先在第8.1节展示清洗后数据集的完整细节。随后分别在第8.2节、第8.3节和第8.4节讨论预训练阶段的改进方案,展示多任务模型架构并描述实现细节。本节其余部分通过大量实验结果对所提模型进行全面分析。

8.1. Datasets

8.1. 数据集

Table 7 shows the number of images in the train+val and test sets before and after cleaning. Our cleaning process removes 13.02% of the total number of images on average. It is important to note that here we show the number of images per dataset and not the number of actual training samples. Different tasks have different numbers of training samples for each image. For details on training samples please refer to Table 8. We collect the union of all dataset test sets and remove any occurrence of these images from all training and validation sets; in this way we arrive at the Clean training and validation sets. With this strategy, the test sets of the original datasets are not modified in any way.

表7展示了清理前后训练集$^{+}$验证集和测试集中的图像数量。我们的清理流程平均移除了总图像数的$13.02%$。需特别说明的是,此处呈现的是每个数据集的图像数量,而非实际训练样本量——不同任务中每张图像对应的训练样本数各不相同。具体训练样本信息请参阅表8。我们汇总所有数据集测试集的并集,并从全部训练集和验证集中剔除这些图像,由此得到清理后的训练集和验证集。该策略确保原始数据集的测试集不做任何改动。
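The cleaning procedure described above amounts to a set difference against the union of all test sets. A minimal sketch, assuming every dataset identifies images in a shared id space (the function names and the toy ids are hypothetical):

```python
def clean_splits(train_val, test):
    """Remove every image that appears in ANY test set from all train+val sets.

    train_val, test: dicts mapping dataset name -> set of image ids.
    The test sets themselves are left untouched.
    """
    all_test_images = set().union(*test.values())
    return {name: ids - all_test_images for name, ids in train_val.items()}


def percent_removed(before, after):
    """Percentage of images dropped by cleaning."""
    return 100.0 * (before - after) / before


train_val = {"A": {"img1", "img2", "img3"}, "B": {"img2", "img4"}}
test = {"A": {"img9"}, "B": {"img2"}}
cleaned = clean_splits(train_val, test)

# VQA2.0 row of Table 7: 123,287 images before cleaning, 98,861 after.
vqa_removed = round(percent_removed(123287, 98861), 2)
```

Note that `img2` is removed from dataset A even though it only appears in the test set of dataset B, which is the cross-dataset leakage the cleaning is meant to prevent.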

Table 7: Number of images in the train+val and test sets before and after cleaning. We use the training part of the cleaned dataset in the multi-task experiments. Note that this is not the number of training samples but the number of images in the dataset.

表 7: 清洗前后训练+验证集和测试集中的图像数量。我们在多任务实验中使用清洗后数据集的训练部分。请注意,这不是训练样本的数量,而是数据集中的图像数量。

| Dataset | Train+Val | Test | Cleaned Train+Val | % Removed |
| --- | --- | --- | --- | --- |
| [A] VQA2.0 [15] | 123,287 | 81,434 | 98,861 | 19.81 |
| [B] VG QA [21] | 108,249 | - | 92,147 | 14.87 |
| [C] GQA [17] | 82,374 | 2,987 | 69,868 | 15.18 |
| [D] COCO Retrieval [7] | 118,287 | 5,000 | 99,435 | 15.93 |
| [E] Flickr30k Retrieval [35] | 30,014 | 1,000 | 29,077 | 3.12 |
| [F] RefCOCO [19] | 18,494 | 1,500 | 14,481 | 21.69 |
| [G] RefCOCO+ [19] | 18,492 | 1,500 | 14,479 | 21.70 |
| [H] RefCOCOG [30] | 23,199 | 2,600 | 17,903 | 22.82 |
| [I] Visual 7W [55] | 17,953 | 7,780 | 16,415 | 8.56 |
| [J] GuessWhat [13] | 56,638 | 9,899 | 51,291 | 9.44 |
| [K] SNLI-VE [49] | 30,783 | 1,000 | 29,808 | 3.16 |
| [L] NLVR2 [44] | 95,522 | 8,056 | 95,522 | 0 |
| Average | | | | 13.02 |

8.2. Improvements over ViLBERT Pretraining

8.2. ViLBERT预训练的改进

In this section, we discuss in detail the modifications we made to the base ViLBERT pretraining approach.

在本节中,我们将详细讨论对基础ViLBERT预训练方法所做的修改。

Masked prediction with misaligned pairs. In the original ViLBERT pretraining procedure, the model observes an image and caption as inputs. The caption is either the paired caption (with $p=0.5$) or a randomly sampled misaligned caption from the dataset. The multi-modal alignment prediction task, which predicts whether the image and caption are aligned, is crucial for image retrieval tasks [23, 27, 45]. Recent work [43] has questioned the necessity of the multi-modal alignment prediction task and observed better performance on non-image retrieval tasks without this pretraining objective. Similar observations are also found in natural language understanding tasks [?, ?, ?, ?]. Digging further into this, we find that the alignment and masked prediction tasks are typically performed together. For misaligned image-caption pairs, this amounts to forcing the model to predict missing image or text regions based on incorrectly paired data! We find the model learns worse context representations in this setup. Instead of removing the multi-modal alignment prediction task, we only perform the masked multi-modal modelling task on aligned image-caption pairs. This effectively removes the noise introduced by negative samples.

掩码预测与错配对样本。在原始ViLBERT预训练过程中,模型接收图像和标题作为输入,其中标题有50%概率($p=0.5$)采用配对标题,或从数据集中随机采样错配标题。多模态对齐预测任务(判断图像与标题是否匹配)对图像检索任务至关重要[23, 27, 45]。近期研究[43]质疑该任务的必要性,发现在非图像检索任务中去除该预训练目标反而表现更优,自然语言理解任务中也存在类似现象[?, ?, ?, ?]。深入分析发现,对齐判断与掩码预测通常同步执行——对于错配的图文对,这相当于强制模型基于错误配对数据预测缺失的图像或文本区域!我们发现这种设置会导致模型学习到更差的上下文表征。我们的改进方案是:保留多模态对齐预测任务,但仅对配对的图文对执行掩码多模态建模任务,从而有效消除负样本引入的噪声。
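The change described above can be illustrated with a small loss-combination sketch: the alignment loss is kept for every pair, while the masked-prediction loss is only counted for aligned pairs. The function name, the per-example list inputs, and the equal weighting of the two terms are assumptions for illustration, not the authors' implementation.

```python
def pretraining_loss(alignment_losses, masked_losses, is_aligned):
    """Combine per-example losses for one batch.

    alignment_losses: per-example alignment-prediction losses (all pairs).
    masked_losses:    per-example masked multi-modal modelling losses.
    is_aligned:       per-example flags; False marks a misaligned (negative) pair.
    The masked loss is averaged over aligned pairs only.
    """
    alignment_term = sum(alignment_losses) / len(alignment_losses)
    aligned_masked = [m for m, a in zip(masked_losses, is_aligned) if a]
    if aligned_masked:
        masked_term = sum(aligned_masked) / len(aligned_masked)
    else:
        masked_term = 0.0  # batch contains only misaligned pairs
    return alignment_term + masked_term


# Second example is a misaligned pair: its (large) masked loss is ignored.
loss = pretraining_loss(
    alignment_losses=[0.5, 0.5],
    masked_losses=[1.0, 9.0],
    is_aligned=[True, False],
)
```

In the example, the misaligned pair still contributes to the alignment term but its masked-prediction loss of 9.0 does not reach the total.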

Masking overlapping regions. Unlike the word embeddings in the caption, visual feature embeddings (extracted from a pretrained Faster-RCNN [?]) contain many repetitions due to overlapping image regions. To avoid visual clue leakage from the visual embeddings of other elements, VL-BERT [43] sets the pixels lying in the masked RoI to zero before applying Faster R-CNN. However, overlapping image patches with boundary information may still leak visual clues for the masked RoI. We mask the overlapping image regions in a more aggressive manner – any visual embedding that overlaps a masked region by 40% IoU or more is also masked. We observe significant improvements over the ViLBERT model, as shown in Table 9 when comparing column ViLBERT with $\mathrm{Ours_{ST}}$.

掩码重叠区域。与标题中的词嵌入(word embedding)不同,视觉特征嵌入(通过预训练的Faster-RCNN [?]提取)由于图像区域重叠存在大量重复。为避免其他元素的视觉嵌入泄露线索,VL-BERT [43]在应用Faster R-CNN前将被掩码RoI覆盖的像素置零。但含有边界信息的重叠图像块仍可能泄露被掩码RoI的视觉线索。我们采用更激进的掩码策略——任何与被掩码区域交并比(IOU)达到40%及以上的视觉嵌入都会被掩码。如表9所示,对比ViLBERT列与${\mathrm{Ours}}_{\mathrm{ST}}$列时,我们观察到模型性能较ViLBERT有显著提升。
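A minimal sketch of the overlap rule described above, assuming boxes are given as `(x1, y1, x2, y2)` tuples. The helper names are hypothetical, and the sketch compares each region only against the originally selected masked regions (it does not cascade).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def expand_mask(boxes, masked_indices, threshold=0.4):
    """Also mask every region whose IoU with a masked region is >= threshold."""
    masked = set(masked_indices)
    for i, box in enumerate(boxes):
        if i in masked:
            continue
        if any(iou(box, boxes[m]) >= threshold for m in masked_indices):
            masked.add(i)
    return sorted(masked)


boxes = [
    (0, 0, 10, 10),    # region selected for masking
    (0, 0, 10, 8),     # IoU 0.8 with region 0: also masked
    (20, 20, 30, 30),  # disjoint: kept
    (5, 5, 15, 15),    # IoU of about 0.14 with region 0: kept
]
masked = expand_mask(boxes, masked_indices=[0])
```

The threshold of 0.4 is the value stated in the text; lowering it masks more neighbouring regions at the cost of removing more visual context.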

8.3. Model Architecture

8.3. 模型架构

Fig. 5 shows the architecture of our model for V&L multi-task learning, which is described in Sec. 3.2. We use ViLBERT as our base model shared across different tasks. For the task-specific heads, our model is jointly trained with four task groups: Vocab-Based VQA, Image Retrieval, Referring Expressions and Multi-modal Verification.

图 5: 展示了我们用于视觉与语言(V&L)多任务学习的模型架构(详见第3.2节)。我们采用ViLBERT作为跨任务共享的基础模型。针对特定任务的头部结构,我们的模型联合训练了四个不同任务组:基于词汇的视觉问答(Vocab-Based VQA)、图像检索(Image Retrieval)、指代表达(Refer Expression)以及多模态验证(Multimodal Verification)。

8.4. Implementation Details

8.4. 实现细节

Image features are extracted from a ResNeXT-152 Faster-RCNN model trained on Visual Genome (VG) with an attribute loss. Our model is first initialized from pretrained BERT weights [14]. Our models are trained using the AdamW optimizer [?] with a linear warmup and linear decay learning rate scheduler. We train our multi-task model for 40K total iterations (the same as the number of iterations for the VG QA single task) on 8 NVIDIA V100 GPUs for 5 days. Hyperparameters such as the learning rate and batch size used for each task are listed in Table 8. We also report the number of training samples used in the various settings of our experiments.

图像特征提取自一个基于Visual Genome(VG)数据集训练并带有属性损失的ResNeXT-152 Faster-RCNN模型。我们的模型首先从预训练的BERT权重[14]初始化,采用AdamW优化器[?]配合线性预热与线性衰减学习率调度器进行训练。在多任务模型训练中,我们使用8块NVIDIA V100 GPU进行了总计40K次迭代(与VG QA单任务迭代次数相同),耗时5天。优化器选用AdamW并采用预热线性调度策略。各任务使用的学习率、批量大小等超参数详见表8,同时列出了实验中不同设置下的训练样本数量。
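The learning rate schedule mentioned above (linear warmup followed by linear decay) can be written as a small function. The 40K total iterations and the peak rate of 4e-5 come from the text and Table 8; the function name and the 4,000 warmup steps are assumptions chosen only for the example.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps


TOTAL_STEPS = 40_000   # total multi-task iterations stated in the text
WARMUP_STEPS = 4_000   # assumption for illustration only
PEAK_LR = 4e-5         # learning rate listed for the VQA-style tasks in Table 8

schedule = [
    linear_warmup_decay(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    for s in (0, 2_000, 4_000, 22_000, 40_000)
]
```

The rate rises from zero to the peak during warmup, passes through half the peak midway through the decay phase, and reaches zero at the final iteration.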


Figure 5: Architecture of our model for V&L multi-task learning. We augment the input query with a task token to learn the task-aware feature embedding.

图 5: 我们的视觉与语言多任务学习模型架构。通过添加任务token来增强输入查询,以学习任务感知的特征嵌入。

Table 8: Training details, including sample sizes, test metrics and hyperparameters, for single-task and multi-task training.

| Task | Full Train (samples) | Cleaned Train (samples) | Test (samples) | Metric | Batch size (BS) | Learning rate (LR) |
|---|---|---|---|---|---|---|
| [A] VQA2.0 [15] | 655,111 | 542,104 | 447,793 | VQA Accuracy | 128 | 4e-5 |
| [B] VG QA [21] | 1,437,931 | 1,294,255 | 5,000 | VQA Accuracy | 128 | 4e-5 |
| [C] GQA [17] | 1,072,062 | 962,928 | 12,578 | VQA Accuracy | 128 | 4e-5 |
| [D] IR COCO [7] | 566,747 | 487,600 | 1,000 | Recall@1,5,10 | 128 | 2e-5 |
| [E] IR Flickr30k [35] | 145,000 | 140,485 | 1,000 | Recall@1,5,10 | 128 | 2e-5 |
| [F] RefCOCO [19] | 120,624 | 96,221 | 10,752 | Accuracy | 256 | 2e-5 |
| [G] RefCOCO+ [19] | 120,191 | 95,852 | 10,615 | Accuracy | 256 | 2e-5 |
| [H] RefCOCOG [30] | 80,512 | 65,514 | 9,602 | Accuracy | 256 | 2e-5 |
| [I] Visual 7W [55] | 93,813 | 93,813 | 57,265 | Accuracy | 256 | 2e-5 |
| [J] GuessWhat [13] | 113,221 | 100,398 | 23,785 | Accuracy | 64 | 2e-5 |
| [K] $\mathrm{NLVR^{2}}$ [44] | 86,373 | 86,373 | 6,967 | Accuracy | 64 | 2e-5 |
| [L] SNLI-VE [49] | 529,527 | 512,396 | 17,901 | Accuracy | 256 | 2e-5 |
| Total | 5,021,112 | 4,477,939 | 604,258 | | | |

表 8: 单任务与多任务训练细节,包括样本量、测试指标及超参数。

| 任务 | 完整训练集 (样本量) | 清洗后训练集 (样本量) | 测试集 (样本量) | 评估指标 | 批次大小 (BS) | 学习率 (LR) |
|---|---|---|---|---|---|---|
| [A] VQA2.0 [15] | 655,111 | 542,104 | 447,793 | VQA准确率 | 128 | 4e-5 |
| [B] VG QA [21] | 1,437,931 | 1,294,255 | 5,000 | VQA准确率 | 128 | 4e-5 |
| [C] GQA [17] | 1,072,062 | 962,928 | 12,578 | VQA准确率 | 128 | 4e-5 |
| [D] IR COCO [7] | 566,747 | 487,600 | 1,000 | 召回率@1,5,10 | 128 | 2e-5 |
| [E] IR Flickr30k [35] | 145,000 | 140,485 | 1,000 | 召回率@1,5,10 | 128 | 2e-5 |
| [F] RefCOCO [19] | 120,624 | 96,221 | 10,752 | 准确率 | 256 | 2e-5 |
| [G] RefCOCO+ [19] | 120,191 | 95,852 | 10,615 | 准确率 | 256 | 2e-5 |
| [H] RefCOCOG [30] | 80,512 | 65,514 | 9,602 | 准确率 | 256 | 2e-5 |
| [I] Visual 7W [55] | 93,813 | 93,813 | 57,265 | 准确率 | 256 | 2e-5 |
| [J] GuessWhat [13] | 113,221 | 100,398 | 23,785 | 准确率 | 64 | 2e-5 |
| [K] $\mathrm{NLVR^{2}}$ [44] | 86,373 | 86,373 | 6,967 | 准确率 | 64 | 2e-5 |
| [L] SNLI-VE [49] | 529,527 | 512,396 | 17,901 | 准确率 | 256 | 2e-5 |
| 总计 | 5,021,112 | 4,477,939 | 604,258 | | | |

8.5. Multi-Task Training

8.5. 多任务训练

To further illustrate the multi-task training process, Fig. 7 shows the training curves for single-task vs. multi-task training on all 12 tasks in our setup. Green lines show single-task training and blue lines show multi-task training. Since multi-task training runs for the maximum number of iterations across the datasets, for some smaller datasets (e.g., RefCOCO and Visual7W) the number of single-task iterations is much smaller than in the multi-task setting. Comparing the single-task and multi-task curves, we can see that most tasks have similar training curves. However, the tasks in the vocab-based VQA group benefit from multi-task training, converging faster within the first 10,000 iterations.

为进一步说明多任务训练过程,图7展示了我们设置的12项任务中单任务与多任务的训练曲线。绿色线条表示单任务训练,蓝色线条表示多任务训练。由于在多任务训练中我们根据不同数据集的最大迭代次数训练模型,对于某些较小的数据集(如RefCOCO、Visual7W等),单任务的迭代次数远少于多任务设置。通过比较单任务和多任务的训练曲线可以看出,大多数任务的训练曲线相似。然而,基于词汇的VQA组中的任务受益于多任务训练,在前10000次迭代内收敛速度更快。

Concept drift of smaller datasets. In Fig. 6 (left), we plot the validation accuracy of our AT model on RefCOCO+ to show the concept drift of smaller datasets during MT training. Even with sparse updates (stop mode), we observe sharp drops on RefCOCO+ (the dips before go mode is reactivated).

小数据集的概念漂移。在图6 (左)中,我们绘制了AT模型在RefCOCO+上的验证准确率,以展示小数据集在MT训练期间的概念漂移。即使在稀疏更新(停止模式)下,我们仍观察到RefCOCO+上出现急剧下降(在go模式重新激活前的骤降)。

Relationship between dataset size and go mode duration. The dataset size gap can be significant, up to 16:1 for VG QA vs. RefCOCOg. To illustrate how dataset size affects our dynamic stop-and-go training regime, we plot dataset size vs. active training iterations in Fig. 6 (right). Among datasets of similar size, we see significant differences in active training time. This shows that dynamic stop-and-go addresses dataset difficulty rather than just size. However, there is a general trend that larger datasets tend to stay in the active go mode longer.

数据集规模与运行模式时长的关系。数据集规模差异可能非常显著,VG QA与RefCOCOg之间可达16:1。为说明数据集规模如何影响我们的动态启停训练机制,我们在图6(右)中绘制了数据集规模与主动训练迭代次数的关系图。在规模相近的数据集中,我们观察到主动训练时间存在显著差异。这表明动态启停机制解决的是数据集难度问题,而不仅仅是规模问题。但整体趋势显示,较大规模数据集确实倾向于保持更长的主动运行模式时长。
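The stop and go modes discussed above can be pictured as a small per-task state machine. The sketch below is not the authors' implementation: the class name, the two thresholds, the `patience` window and the rule for resetting the history are all hypothetical, chosen only to show how a task in stop mode updates once every $\Delta$ iterations and returns to go mode after a validation dip.

```python
class StopAndGoTask:
    """Tracks whether one task is in 'go' (update every step) or 'stop' mode."""

    def __init__(self, iter_gap, converge_eps=0.1, diverge_drop=0.5, patience=2):
        self.iter_gap = iter_gap          # Delta: update interval while in stop mode
        self.converge_eps = converge_eps  # hypothetical: minimum gain to stay in go mode
        self.diverge_drop = diverge_drop  # hypothetical: drop from best that reactivates go
        self.patience = patience          # hypothetical: evaluations spanned by the gain check
        self.mode = "go"
        self.history = []
        self.best = float("-inf")

    def should_update(self, iteration):
        """Go mode trains on every iteration; stop mode only every iter_gap-th one."""
        if self.mode == "go":
            return True
        return iteration % self.iter_gap == 0

    def report_validation(self, score):
        """Record a validation score and switch mode if the task converged or dipped."""
        self.history.append(score)
        self.best = max(self.best, score)
        if self.mode == "go":
            if len(self.history) > self.patience:
                gain = score - self.history[-1 - self.patience]
                if gain < self.converge_eps:
                    self.mode = "stop"
        elif self.best - score >= self.diverge_drop:
            self.mode = "go"
            self.history = [score]  # restart the convergence check from this point
        return self.mode


task = StopAndGoTask(iter_gap=4)
for val_acc in (70.00, 70.02, 70.05, 69.40):
    print(val_acc, task.report_validation(val_acc))
```

With these made-up scores the task stalls, enters stop mode, and is switched back to go mode once its validation accuracy dips, which is the kind of behaviour plotted for RefCOCO+ in Fig. 6 (left).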


Figure 6: Left: Validation accuracy of our AT model on RefCOCO+. Right: Dataset size vs. number of updates during stop-and-go training.

图 6: 左: 我们的AT模型在RefCOCO+上的验证准确率。右: 启停式训练过程中数据集规模与更新次数的关系。

Full per-task accuracy for the AT without G4 model. In Table 2, we observed from the representative task analysis that G4 tends to negatively affect the other groups during joint training. In Table 12, we further show the full per-task accuracy for the AT without G4 model and its ablations. We can see that $\mathrm{AT}_{\mathrm{w/o\,G4}}$ outperforms AT by 0.48% on the MT score, which verifies that G4 tends to have a negative effect even on the finetuned model. How to remove negative interactions between different tasks is left to future study.

无G4模型的各任务完整准确率。在表2中,我们通过代表性任务分析观察到G4在联合训练时会对其他组产生负面影响。表12进一步展示了无G4模型及不同消融实验下的各任务完整准确率。可以看出$\mathrm{AT}_{\mathrm{w/o\,G4}}$在MT分数上比AT高出0.48%,这验证了G4即使对微调模型也存在负面影响。如何消除不同任务间的负面交互将留待未来研究。

8.6. Comparison with other SOTA

8.6. 与其他SOTA方法的对比

Table 9 shows a detailed comparison of $\mathrm{Ours}_{\mathrm{ST}}$ (also shown in Table 2, row 1) and $\mathrm{Ours}_{\mathrm{AT->ST}}$ (also shown in Table 2, row 8) with recent SOTA approaches, including ViLBERT [27], Unicoder-VL [23], VisualBERT [24], LXMERT [45] and UNITER [8]. Most recently proposed methods follow the pretrain-then-finetune scheme, usually pretraining on out-of-domain or in-domain data. The out-of-domain data comprises the Conceptual Caption Dataset (CC) [39] and the SBU dataset [?], while the in-domain data comprises COCO [7] and Visual Genome [?]. Pretraining on in-domain datasets usually leads to better downstream performance, since there is less domain shift between pretraining and finetuning. Similar to ViLBERT, we pretrain our model on CC, which differs from VLBERT (CC + Wiki Corpus), VisualBERT (CC + COCO), LXMERT (COCO + VG) and UNITER (CC + SBU + COCO + VG). We achieve comparable performance with less pretraining data.

表 9 详细比较了 $\mathrm{Ours}_{\mathrm{ST}}$ (见表 2 第 1 行) 和 $\mathrm{Ours}_{\mathrm{AT->ST}}$ (见表 2 第 8 行) 与近期 SOTA 方法的性能, 包括 ViLBERT [27], Unicoder-VL [23], VisualBERT [24], LXMERT [45] 和 UNITER [8]。最新方法大多采用预训练-微调 (pretrain-then-finetune) 范式, 通常在域外数据或域内数据上进行预训练。域外数据包含 Conceptual Caption Dataset (CC) [39] 和 SBU 数据集 [?], 域内数据则包含 COCO [7] 和 Visual Genome [?]。由于预训练与微调之间的领域迁移更小, 使用域内数据集预训练通常能获得更好的下游性能。与 ViLBERT 类似, 我们在 CC 上预训练模型, 这与 VLBERT (CC + Wiki 语料库), VisualBERT (CC + COCO), LXMERT (COCO + VG) 和 UNITER (CC + SBU + COCO + VG) 不同。我们使用更少的预训练数据取得了可比性能。

8.7. Full Breakdown of Ablation Study

8.7. 消融实验完整分析

Table 10 shows the full per-task breakdown of Table 6 and Fig. 2 in the main paper. RC refers to Retrieval COCO and RF refers to Retrieval Flickr30k. VQA and GQA are evaluated on the test-dev splits. Retrieval COCO and Flickr30k are evaluated on their respective 1K test splits. $\mathrm{NLVR^{2}}$ is evaluated on the testP split. All other datasets are evaluated on their respective test splits. Table 11 shows the full scores for each task for different DSG iteration gaps $(\Delta)$. Table 12 shows the detailed per-task scores for the $\mathrm{AT}_{\mathrm{w/o\,G4}}$ model and its ablations. We compare with the full AT model as well.

表 10 展示了主论文中表 6 和图 2 按任务的完整细分。RC 指检索 COCO (Retrieval COCO),RF 指检索 Flickr30k (Retrieval Flickr30k)。VQA 和 GQA 在 test-dev 集上评估。检索 COCO 和 Flickr30k 在各自的 1K 测试集上进行评估。$\mathrm{NLVR^{2}}$ 在 testP 集上评估。所有其他数据集均在各自的测试集上评估。表 11 展示了不同 DSG 迭代间隔 $(\Delta)$ 下各任务的完整分数。表 12 展示了 $\mathrm{AT}_{\mathrm{w/o\,G4}}$ 模型及其各消融实验的详细任务分数。我们同时与完整 AT 模型进行了对比。


Figure 7: Training curves on the train set for the $\mathrm{Ours}_{\mathrm{ST}}$ (Table 2, Row 2) vs. $\mathrm{Ours}_{\mathrm{AT}}$ (Table 2, Row 4) models for all 12 tasks in our experiments. Green lines show single-task training $(\mathrm{Ours}_{\mathrm{ST}})$ and blue lines show multi-task training $(\mathrm{Ours}_{\mathrm{AT}})$. Note that all of these training runs use the Clean V&L setup. For some tasks, the $\mathrm{Ours}_{\mathrm{ST}}$ curves are shorter because those tasks run for fewer iterations when trained alone. Please refer to Sec. 8.5 for more details.

图 7: 实验所有12个任务中 $\mathrm{Ours}_{\mathrm{ST}}$ (表2第2行)与 $\mathrm{Ours}_{\mathrm{AT}}$ (表2第4行)模型在训练集上的训练曲线。绿线表示单任务训练 $(\mathrm{Ours}_{\mathrm{ST}})$,蓝线表示多任务训练 $(\mathrm{Ours}_{\mathrm{AT}})$。注意所有训练均采用Clean V&L设置。可观察到部分任务的 $\mathrm{Ours}_{\mathrm{ST}}$ 训练曲线较短,因其单独训练时迭代次数较少。详见第8.5节。

8.8. Multi-task visual grounding consistency

8.8. 多任务视觉定位一致性

In Sec. 5, we propose the multi-task visual grounding consistency. Here, we explain the proposed metric in more detail. Given $N$ images with RefCOCO/+ referring expressions and VQA questions, we want to test whether multi-task models exhibit more consistent visual groundings than independent task-specific models. For each image $I_{i}$, there are associated VQA questions $\{q^{(i)}\}$ and referring expressions $\{r^{(i)}\}$. To measure the overlap in visual concepts between a question $q_{j}^{(i)}$ and a reference $r_{k}^{(i)}$, we count the number of overlapping nouns and adjectives as $d(q_{j}^{(i)},r_{k}^{(i)})$. The multi-task visual grounding consistency can then be calculated as:

在第5节中,我们提出了多任务视觉定位一致性。这里详细解释所提出的指标:给定$N$张带有$\mathrm{RefCOCO/+}$指代表达和VQA问题的图像,我们想测试多任务模型是否比独立任务专用模型表现出更一致的视觉定位。对于每张图像$I_{i}$,关联着VQA问题${\boldsymbol{q}^{(i)}}$和指代表达${r^{(i)}}$。为了衡量问题$q_{j}^{\left(i\right)}$与参考$r_{k}^{(i)}$之间视觉概念的重叠程度,我们统计重叠名词/形容词的数量$d(q_{j}^{(i)},r_{k}^{(i)})$,多任务视觉定位一致性可计算为:

$$
\text{MT-VGC}=\frac{\sum_{i=0}^{N}\left|\sum_{j}\sum_{k}d(q_{j}^{(i)},r_{k}^{(i)})\,\mathbb{1}\left\{y(q_{j}^{(i)})=1\;\&\;y(r_{k}^{(i)})=1\right\}\right|}{\sum_{i=0}^{N}\left|\sum_{j}\sum_{k}d(q_{j}^{(i)},r_{k}^{(i)})\right|}
$$

where $y(q_{j}^{(i)})=1$ means the model correctly answers the question $q_{j}^{(i)}$ according to the VQA accuracy metric, and $y(r_{k}^{(i)})=1$ means the model correctly locates the image region (IoU $>0.5$) given the reference $r_{k}^{(i)}$.

其中 $y(q_{j}^{(i)})=1$ 表示模型基于VQA准确率指标正确回答了问题 $q_{j}^{(i)}$,而 $y(r_{k}^{(i)})=1$ 表示模型在给定参考 $r_{k}^{(i)}$ 的情况下准确定位了图像区域 (IoU $>0.5$)。
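The metric defined in this section can be computed with a few nested loops. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `mt_vgc` and the input layout are hypothetical, the nouns and adjectives of each question and referring expression are assumed to be already extracted into sets (the section does not say which tagger is used), and the toy data is made up.

```python
def mt_vgc(images):
    """Multi-task visual grounding consistency over a list of images.

    Each image is a dict with two lists, "questions" and "references".
    Every entry is a pair (concept_words, correct): a set of nouns/adjectives
    and a bool saying whether the model got that question or reference right.
    """
    numerator = 0
    denominator = 0
    for image in images:
        for q_words, q_correct in image["questions"]:
            for r_words, r_correct in image["references"]:
                overlap = len(q_words & r_words)  # d(q_j, r_k): shared nouns/adjectives
                denominator += overlap
                if q_correct and r_correct:       # indicator: both predictions correct
                    numerator += overlap
    if denominator == 0:
        raise ValueError("no question/reference pair shares a visual concept")
    return numerator / denominator


# Toy data (made up for illustration only).
toy_images = [
    {
        "questions": [({"dog", "brown"}, True), ({"man", "hat"}, False)],
        "references": [({"brown", "dog"}, True), ({"man"}, True)],
    },
    {
        "questions": [({"red", "car"}, True)],
        "references": [({"car"}, False)],
    },
]
print(mt_vgc(toy_images))
```

In the toy data, four concept overlaps exist in total and two of them belong to pairs where both the answer and the grounding are correct, so the score is 0.5.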

8.9. Qualitative Results

8.9. 定性结果

Fig. 8 shows more qualitative examples of our single model $\mathrm{Ours}_{\mathrm{AT}}$ on different vision-and-language tasks, and Fig. 9 shows some failure cases. The examples in Fig. 8 show that the AT model works consistently well across this wide range of tasks. It performs well on both short and long reasoning questions, image retrieval, pointing tasks, referring expressions and multi-modal verification. Failure cases mostly occur when the model encounters counting questions or difficult referring expressions and phrases that require fine-grained recognition.

图 8 展示了我们的单一模型 $\mathrm{Ours}_{\mathrm{AT}}$ 在不同视觉与语言任务上的更多定性示例,图 9 则展示了一些失败案例。图 8 中的示例表明,AT 模型能稳定胜任这些广泛的任务,包括短/长推理问题、图像检索、指向任务、指代表达式以及多模态验证。失败案例主要出现在模型遇到计数问题,或需要细粒度识别的复杂指代表达和短语时。

Table 9: Comparison of the $\mathrm{Ours}_{\mathrm{ST}}$ (Table 2, Row 1) and $\mathrm{Ours}_{\mathrm{AT->ST}}$ (Table 2, Row 8) models on the full dataset with other SOTA methods. Results for RefCOCO and RefCOCO+ are reported on the full test split (testA + testB). Refer to Sec. 8.6 for more details.

表 9: $\mathrm{Ours}_{\mathrm{ST}}$ (表 2 第 1 行) 和 $\mathrm{Ours}_{\mathrm{AT->ST}}$ (表 2 第 8 行) 模型在完整数据集上与其他 SOTA 方法的对比。RefCOCO 和 RefCOCO+ 的结果报告在完整测试集 (testA + testB) 上。更多细节参见第 8.6 节。

Task SOTA ViLBERT VLBERT Unicoder-VL VisualBERT LXMERT UNITER (BASE) UNITER (LARGE) $\mathrm{Ours}_{\mathrm{ST}}$ $\mathrm{Ours}_{\mathrm{AT->ST}}$
Pretraining data CC CC+Wiki Corpus CC CC+COCO COCO+VG CC+SBU+COCO+VG CC CC
VQA test-dev 70.63 70.55 70.50 70.80 72.42 72.27 73.24 71.82 73.15
VG QA val 34.38 36.64
GQA test-dev 60.00 58.19 60.65
IR COCO R1 61.60 68.50 65.28 68.00
IR COCO R5 89.6 92.70 - 91.02 92.38
IR COCO R10 95.2 96.90 - 96.18 96.52
IR Flickr R1 48.60 58.20 68.30 71.50 73.66 61.14 67.90
IR Flickr R5 77.70 84.90 90.30 91.16 93.06 87.16 89.60
IR Flickr R10 85.20 91.52 94.60 95.20 95.98 92.30 94.18
Visual 7W test 72.53 80.51 83.35
Ref-COCO test 77.12 80.48 80.88 78.63 81.20
Ref-COCO+ test 67.17 70.93 69.47 73.26 73.73 71.11
Ref-COCOg test 69.46 74.51 75.77 72.24 76.35
Guess What test 61.30 62.81 65.69
NLVR2 test-P 53.50 — 67.00 74.50 77.87 79.50 74.25 78.87
SNLI-VE test 71.16 78.02 78.98 76.72 76.95

Table 10: Full per-task accuracy for the different ablation studies (summarized in Table 6). RC is Retrieval COCO and RF is Retrieval Flickr30k. The mean of G2 is taken over the Recall@1 scores. We can see that using a task token per dataset together with DSG achieves the best performance.

表 10: 不同消融研究的完整任务准确率 (汇总于表 6)。RC 指 Retrieval COCO,RF 指 Retrieval Flickr30k。G2 的平均值取自 Recall@1 分数。可以看出,使用每个数据集的任务 token 和 DSG 取得了最佳性能。

8.10. Attention Visualizations

8.10. 注意力可视化

In this section we examine the visual groundings learned by the techniques we presented in Sec. 8.2. We verify this by visualizing the attentions of our pretrained model, which is trained on the Conceptual Caption dataset. Given a test image and the corresponding caption “The boy and his mom pet the black and white sheep”, we feed the image-caption pair as input and take the image-to-sentence co-attention for visualization. For each image patch, we use the most attended word to represent its semantic meaning, and show the patches corresponding to the visual words (‘boy’, ‘mom’, ‘pet’, ‘white’, ‘sheep’). Fig. 10 shows the correspondence between attended regions and underlined words. We can see that the pretrained model learns meaningful visual grounding for the concepts ‘boy’, ‘sheep’, ‘white’ and ‘pet’.

本节我们检验了8.2节所述技术学习到的视觉基础。通过可视化预训练模型的注意力机制进行验证,该模型基于Conceptual Caption数据集训练。给定测试图像及对应标题“The boy and his mom pet the black and white sheep”(男孩和他妈妈抚摸黑白相间的羊),我们将图文对作为输入,并提取图像到句子的协同注意力进行可视化。每个图像块用最受关注的单词表示其语义,并展示与视觉词汇(‘boy’、‘mom’、‘pet’、‘white’、‘sheep’)对应的图像块。图10显示了关注区域与下划线单词的对应关系。可见预训练模型对‘boy’、‘sheep’、‘white’和‘pet’等概念学习了有意义的视觉基础。
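The labeling step described above, where each image patch is represented by its most attended word, amounts to an argmax over the image-to-sentence attention weights. The sketch below is a toy illustration only: the function names are hypothetical, the attention matrix is made up rather than taken from the model, and it uses a single matrix, whereas the section does not say how layers and heads are combined.

```python
def most_attended_words(attention, words):
    """Label each image patch with the word it attends to most.

    attention[p][w] is the weight from image patch p to word w.
    """
    labels = []
    for row in attention:
        if len(row) != len(words):
            raise ValueError("each attention row needs one weight per word")
        best = max(range(len(words)), key=row.__getitem__)  # argmax over words
        labels.append(words[best])
    return labels


def patches_for(labels, visual_words):
    """Group patch indices by the visual word they were labeled with."""
    return {w: [p for p, label in enumerate(labels) if label == w]
            for w in visual_words}


# Toy attention weights for 4 patches over 4 caption words (made up).
words = ["the", "boy", "pet", "sheep"]
attention = [
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.10, 0.10, 0.60],
    [0.50, 0.20, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
labels = most_attended_words(attention, words)
print(labels)
print(patches_for(labels, ["boy", "pet", "sheep"]))
```

Patches labeled with a non-visual word such as "the" are simply left out of the grouping, which mirrors showing only the patches that correspond to the selected visual words.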

To visualize the attention of our multi-task trained model $(\mathrm{Ours}_{\mathrm{AT}})$, we use BertVis to visualize the attention distributions of the sentence-to-sentence self-attention $S{\rightarrow}S$, sentence-to-image co-attention $S{\rightarrow}I$, image-to-sentence co-attention $I{\rightarrow}S$ and image-to-image self-attention $I{\rightarrow}I$. Fig. 11 shows an example of the sentence-to-sentence attention for all layers and all heads (middle) and for a specific layer and head (right). We can see that our model learns the previous-words attention pattern, the bag-of-words attention pattern (Layer 1, Head 1) and the next-words attention pattern (Layer 2, Head 0). This shows that the model is able to generate position-aware queries and keys to calculate the attentions. To give a sense of how the attention distribution differs across tasks, Fig. 12 and Fig. 13 show the attention distributions on the examples of Fig. 3. We can see that the model learns to use significantly different sentence-to-sentence self-attention patterns for different tasks.

为了可视化我们多任务训练模型 $(\mathrm{Ours}_{\mathrm{AT}})$ 的注意力机制,我们使用BertVis工具对以下注意力分布进行可视化:句子到句子的自注意力 $S{\rightarrow}S$、句子到图像的协同注意力 $S{\rightarrow}I$、图像到句子的协同注意力 $I{\rightarrow}S$ 以及图像到图像的自注意力 $I{\rightarrow}I$。图11展示了所有层和所有头(中间)以及特定层和头(右侧)的句子到句子注意力示例。可以看出,我们的模型学习了前词注意力模式、词袋注意力模式(第1层第1头)以及后词注意力模式(第2层第0头)。这表明该模型能够生成位置感知的查询(query)和键(key)来计算注意力权重。为了观察不同任务间注意力分布的差异,图12和图13展示了图3示例中的注意力分布情况。可见针对不同任务,模型学会了使用显著不同的句子到句子自注意力模式。

Table 11: Full per task accuracy for Fig. 2 showing different Dynamic Stop-and-Go Iteration Gaps $(\Delta)$ . Mean of G2 is taken over the Recall $@1$ scores.

表 11: 图 2 中不同动态启停迭代间隔 $(\Delta)$ 的各任务完整准确率。G2 的平均值取自召回率 $@1$ 分数。

Table 12: Full per-task accuracy for the $\mathrm{AT}_{\mathrm{w/o\,G4}}$ model and different ablations of $\mathrm{AT}_{\mathrm{w/o\,G4}}$. The mean of G2 is taken over the Recall@1 scores.

表 12: $\mathrm{AT}_{\mathrm{w/o\,G4}}$ 模型及其不同消融实验的完整任务准确率。G2 均值取 Recall@1 分数。


Figure 8: Our single multi-task model can solve multiple tasks consistently and correctly. Additional qualitative examples of our single model $\mathrm{Ours}_{\mathrm{AT}}$ on a multitude of V&L tasks: caption and image retrieval, question answering, grounding phrases, guessing image regions based on a dialog, verifying facts about a pair of images, natural language inference from an image, etc. Here we show outputs of our model for a variety of inputs (that mimic tasks from the 12 datasets it has been trained on).

图 8: 我们的单一多任务模型能够一致且正确地解决多项任务。我们展示了单模型 $\mathrm{Ours}_{\mathrm{AT}}$ 在多种视觉与语言(V&L)任务上的更多定性示例:描述与图像检索、问答、短语定位、基于对话猜测图像区域、验证图像对的事实、基于图像的自然语言推理等。此处展示了模型针对各类输入(模拟其训练所用的12个数据集任务)的输出结果。


Figure 9: Failure cases of our single AT model on a multitude of V&L tasks. Failure cases mostly occur when the model encounters counting questions or difficult referring expressions and phrases that require fine-grained recognition.

图 9: 我们的单一 AT 模型在多种视觉与语言 (V&L) 任务上的失败案例。失败案例主要出现在模型遇到计数问题、复杂指代表达或需要细粒度识别的短语时。


Figure 10: Visualizations of image-to-sentence attention for the model pretrained on the Conceptual Caption dataset. Given the image-to-sentence co-attention, we use the most attended word to represent the semantic meaning of each patch, and show the patches corresponding to the visual words (‘boy’, ‘mom’, ‘pet’, ‘white’, ‘sheep’). Different colors show the correspondence between attended regions and underlined words. We can see that the model learns meaningful concepts through pretraining.

图 10: 在Conceptual Caption数据集上预训练的模型的图像到句子注意力可视化。基于图像到句子的协同注意力,我们使用最受关注的单词表示每个图像块的语义,并展示与视觉词汇(‘boy’、‘mom’、‘pet’、‘white’、‘sheep’)对应的图像区域。不同颜色标示了注意力区域与下划线单词的对应关系。可见该模型通过预训练习得了有意义的概念。


Figure 11: Visualizations of the attentions of the model pretrained on the Conceptual Caption dataset, using the BertVis toolbox. From left to right: image and associated caption, sentence-to-sentence self-attention for all layers and all heads, and sentence-to-sentence self-attention for Layer 1 Head 1 and Layer 2 Head 0. Our model learns the previous-words attention pattern, the bag-of-words attention pattern and the next-words attention pattern.

图 11: 使用BertVis工具箱对在Conceptual Caption数据集上预训练的模型注意力机制的可视化结果。从左至右分别为:图像及对应描述文本、所有层所有头的句子到句子自注意力、第1层第1头和第2层第0头的句子到句子自注意力。我们的模型学习了前词注意力模式、词袋注意力模式以及后词注意力模式。


Figure 12: Visualizations of the attentions of the $\mathrm{Ours}_{\mathrm{AT}}$ model on each task, using the BertVis toolbox. From left to right: image and associated sentence, sentence-to-sentence self-attention, sentence-to-image co-attention, image-to-sentence co-attention, and image-to-image self-attention. Dashed orange bounding boxes in the image are the referring expression outputs regardless of task. The model learns to use significantly different sentence-to-sentence self-attention patterns for different tasks.

图 12: 使用BertVis工具箱对$\mathrm{Ours}_{\mathrm{AT}}$模型在各任务中注意力机制的可视化。从左至右分别为图像及关联语句、语句间自注意力、语句到图像的协同注意力、图像到语句的协同注意力以及图像间自注意力。图中橙色虚线框为指代表达输出(不论任务类型)。该模型学会了针对不同任务采用显著差异化的语句间自注意力模式。


Figure 13: Visualizations of the attentions of the $\mathrm{Ours}_{\mathrm{AT}}$ model on each task, using the BertVis toolbox. From left to right: image and associated sentence, sentence-to-sentence self-attention, sentence-to-image co-attention, image-to-sentence co-attention, and image-to-image self-attention. Dashed orange bounding boxes in the image are the referring expression outputs regardless of task. The model learns to use significantly different sentence-to-sentence self-attention patterns for different tasks.

图 13: 使用 BertVis 工具箱对 $\mathrm{Ours}_{\mathrm{AT}}$ 模型在各任务中注意力机制的可视化。从左至右分别为图像及关联语句、语句间自注意力、语句到图像的协同注意力、图像到语句的协同注意力以及图像间自注意力。图中橙色虚线框为指代表达输出(不论任务类型)。该模型学会针对不同任务采用显著差异的语句间自注意力模式。
