12-in-1: Multi-Task Vision and Language Representation Learning

Abstract

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model trained on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

1. Introduction

A compelling reason to study language and vision jointly is the promise of language as a universal and natural interface for visual reasoning problems – useful both in specifying a wide range of problems and in communicating AI responses. However, the current research landscape for visually-grounded language understanding is a patchwork of many specialized tasks like question answering or caption generation, each supported by a handful of datasets. As such, progress in this field has been measured by the independent improvement of bespoke models designed and trained for each of these specific tasks and datasets.

The recent rise of general architectures for vision-and-language [1, 23, 24, 27, 43, 45, 54] reduces the architectural differences across tasks. These models pretrain common architectures on self-supervised tasks to learn general visio-linguistic representations, then fine-tune for specific datasets; however, the result is still a menagerie of independent task-specific models rather than a single unified model. This is dissatisfying in practice – the model that understands questions cannot ground noun phrases, the grounding model cannot retrieve images based on a description, and so forth. Further, this approach does not scale well, as each new task requires storing a new model.

Figure 1: We introduce an approach for effective multi-task learning, training a single model on 12 popular vision-and-language datasets. This single model performs on par with or even better than independent task-specific state-of-the-art approaches for many tasks.

Beyond being intellectually dissatisfying, this task-based fracturing leaves quite a lot on the table. While individual tasks present different challenges and diverse interfaces, the underlying associations between language and visual concepts are often common across tasks. For example, learning to ground the referring expression “small red vase” requires understanding the same concepts as answering the question “What color is the small vase?”. Training multiple tasks jointly can potentially pool these different sources of grounding supervision. Further, developing models that can perform well on a wide range of tasks simultaneously can help guard against the research community overfitting to specific datasets and metrics.

In this work, we develop a multi-task model for discriminative vision-and-language tasks based on the recently proposed ViLBERT [27] model. We consider four categories of tasks – training jointly on a total of 12 different datasets. Our results not only show that a single model can perform all these tasks, but also that joint training can improve performance compared to single-task training with the same architecture. Before undertaking this effort, it was not obvious to us that this would be the case – multi-task training is notoriously challenging and vision-and-language datasets vary greatly in size, interface, and difficulty. Our model attains improvements of 0.25 to 4.19 absolute points from multi-task training – improving over corresponding single-task models for 11 out of 12 tasks. Further, we demonstrate that multi-task training is an effective pretraining step for single-task models – leading to further gains and setting a new state-of-the-art for 7 out of 12 tasks.

Large-scale multi-task learning is challenging as datasets can vary in size and difficulty. To address these issues, we introduce a dynamic stop-and-go training scheduler, task-dependent input tokens, and simple hyper-parameter heuristics. Using our proposed pipeline, we were able to train many multi-task models with varying datasets – assessing the relationships between different vision-and-language tasks in terms of their performance when trained together.

To summarize, we make the following contributions:

– We systematically analyze the joint training relationships between different vision-and-language datasets and tasks and present a Clean V&L Multi-Task setup, which ensures no train-test leaks across tasks.
– We develop a single multi-task model trained on 12 popular V&L datasets. Compared to a set of independent models, this represents a reduction from ${\sim}3$ billion parameters to ${\sim}270$ million while simultaneously improving average performance by 2.05 points.
– We demonstrate that multi-task training is useful even in cases where single-task performance is paramount. Fine-tuning from our multi-task model for single tasks resulted in an average improvement of 2.98 points over baseline single-task trained models.

2. Vision-and-Language Tasks

2.1. Task-Groups and Datasets

We consider 12 popular vision-and-language datasets. These datasets cover a wide range of tasks and require diverse grounding granularity and reasoning skills. We group related datasets into four groups to facilitate our analysis:

Vocab-based VQA. Given an image and a natural-language question, select an answer from a fixed vocabulary. We consider three popular datasets for this group – VQAv2 [15], GQA [17], and Visual Genome (VG) QA [21].

Image Retrieval. Given a caption and a pool of images, retrieve the target image that is best described by the caption. We consider the COCO [7] and Flickr30K [35] captioning datasets for this task-group.

Referring Expressions. Given a natural language expression and an image, identify the target region that is referred to by the expression. The expression can vary greatly across datasets, from simple noun phrases to multi-round dialogs.

Table 1: Percentage of row-task test images that are present in column-tasks' train/val images.

% Row-Task Test Images in Column-Task Train/Val Set
Task	[A]	[B]	[C]	[D]	[E]	[F]	[G]	[H]	[I]	[J]	[K]	[L]
[A] VQA2.0 [15]	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
[B] VG QA [21]	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
[C] GQA [17]	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
[D] COCO [7]	100%	43%	33%	0%	0%	0%	0%	0%	7%	46%	0%	0%
[E] Flickr30k [35]	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	98%	0%
[F] RefCOCO [19]	100%	36%	27%	100%	0%	0%	0%	99%	8%	62%	0%	0%
[G] RefCOCO+ [19]	100%	38%	27%	100%	0%	0%	0%	99%	8%	62%	0%	0%
[H] RefCOCOg [30]	100%	41%	31%	100%	0%	53%	53%	0%	8%	63%	0%	0%
[I] Visual7W [55]	50%	100%	79%	48%	0%	8%	8%	10%	0%	24%	0%	0%
[J] GuessWhat [13]	100%	40%	31%	96%	0%	20%	20%	26%	7%	0%	0%	0%
[K] SNLI-VE [49]	0%	0%	0%	0%	94%	0%	0%	0%	0%	0%	0%	0%
[L] NLVR2 [44]	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%

We consider phrase grounding in RefCOCO(+/g) [19, 30], pointing questions in Visual7W [55], and dialog sequences in GuessWhat [13]. We note that these language inputs vary significantly in terms of detail and structure.

Multi-modal Verification. Given one or more images and a natural language statement, judge the correctness or predict their semantic relationship. We consider $\mathrm{NLVR^{2}}$ [44] and SNLI-VE [49]. In $\mathrm{NLVR^{2}}$, two images are given and the statement must be true for both to be true. In SNLI-VE, image-statement pairs are classified as representing an entailment, contradiction, or neutral; that is, whether the content of the image confirms, refutes, or is insufficient to comment on the truth of the corresponding statement.

2.2. A Clean V&L Multi-Task Setup

Many V&L tasks are built on top of each other and share significant overlap in terms of individual images.

However, as each task is often examined in isolation, there does not exist an in-depth analysis of this overlap across different V&L tasks. Table 1 shows the percentage of test images for the target tasks which are present in other tasks’ train/val sets. As we can see, there exists significant overlap across tasks. Even though different tasks require different inputs and outputs, other task annotations will provide clues about the visual grounding – for example, a referring expression for a “blue striped ball” at training could unfairly improve a VQA model’s ability to answer “What color is the striped ball?” for the same image at test time. To avoid information leakage from the annotations of other tasks, we propose a cleaned multi-task split for V&L tasks where test images are removed from train/val for all the tasks. We stress that the test sets are not modified in any way, so our results are comparable to prior work. Cleaning results in about an 11% reduction in training data on average across datasets. Full details of this process and statistics regarding cleaned dataset size are available in the supplement.
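The overlap statistic of Table 1 and the cleaning step can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the dictionary layout, and the toy image IDs are all assumptions made for illustration, and real datasets would need their image IDs mapped to a shared identifier space first.

```python
def overlap_percent(test_images, trainval_images):
    """Percentage of test image IDs that also occur in a train/val set."""
    test_images = set(test_images)
    if not test_images:
        return 0.0
    shared = test_images & set(trainval_images)
    return 100.0 * len(shared) / len(test_images)


def clean_splits(datasets):
    """Drop from every task's train/val set any image in any task's test set.

    `datasets` (hypothetical layout) maps a task name to
    {"trainval": [image ids], "test": [image ids]}.
    Test sets are returned unchanged.
    """
    all_test = set()
    for splits in datasets.values():
        all_test.update(splits["test"])
    cleaned = {}
    for name, splits in datasets.items():
        cleaned[name] = {
            "trainval": [i for i in splits["trainval"] if i not in all_test],
            "test": list(splits["test"]),
        }
    return cleaned


# Toy data: every refexp test image also appears in the vqa train/val set.
toy = {
    "vqa": {"trainval": ["a", "b", "c", "d"], "test": ["e"]},
    "refexp": {"trainval": ["e", "f"], "test": ["a", "b"]},
}
print(overlap_percent(toy["refexp"]["test"], toy["vqa"]["trainval"]))  # 100.0
cleaned = clean_splits(toy)
print(cleaned["vqa"]["trainval"])  # ['c', 'd']
```

After cleaning, the overlap between any task's test set and any task's train/val set is zero by construction, which is the property the cleaned split is meant to guarantee.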

3. Approach

3.1. Base Architecture

There has been a flurry of recent work developing general vision-and-language model architectures that are amenable to large-scale self-supervised pretraining [1, 23, 24, 27, 43, 45, 54]. By pretraining general representations and then finetuning on single downstream tasks, these models set the state-of-the-art in many tasks. For the base architecture in our experiments, we take the ViLBERT model proposed by Lu et al. [27]. We describe it here briefly.

At the interface level, ViLBERT takes as input an image $I$ and text segment $Q$ represented as the sequence $\{\mathrm{IMG},v_{1},\ldots,v_{T},\mathrm{CLS},w_{1},\dots,w_{T},\mathrm{SEP}\}$ where $\{v_{i}\}_{i=1}^{T}$ are image region features [2], $\{w_{j}\}_{j=1}^{T}$ are word tokens, and the IMG, CLS, and SEP tokens are special markers. The model then outputs embeddings for each input: $\{h_{v_{i}}\}_{i=1}^{T}$, $\{h_{w_{j}}\}_{j=1}^{T}$, $h_{\mathrm{IMG}}$, $h_{\mathrm{CLS}}$, and $h_{\mathrm{SEP}}$. As in [27], we take $h_{\mathrm{IMG}}$ and $h_{\mathrm{CLS}}$ as holistic image and text representations.

Internally, ViLBERT consists of two parallel BERT-style [14] models operating over image regions and text segments. Each stream is a series of transformer blocks (TRM) [48] connected by co-attentional transformer layers (CoTRM) which enable information exchange between modalities. We use the default parameter setting, which has 6 / 12 layers of TRM for the visual / linguistic streams respectively.

Like many of the models of this class, ViLBERT is pretrained on the Conceptual Captions dataset [39] with two ‘proxy’ tasks: masked multi-modal modelling and multi-modal alignment prediction. The first randomly masks approximately 15% of both words and image tokens and reconstructs them given the remaining inputs. The latter tasks the model with predicting whether an image and caption correspond or not. After pretraining, the model can be finetuned for strong performance on various downstream tasks.

We make two important modifications to this pretraining process. First, when masking visual regions we also mask other regions with significant overlap ($>0.4$ IoU) to avoid leaking visual information. This forces the model to rely more heavily on language to predict image content. Second, we do not enforce the masked multi-modal modelling loss when sampling a negative (non-matching) caption for multi-modal alignment prediction. This effectively removes the noise introduced by negative samples. While orthogonal to our primary contribution of multi-task learning, we found these modifications to make the baseline model more effective. For further discussion, see the supplemental material. All models we present are first pretrained in this manner.
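The first modification can be sketched as follows, assuming regions are axis-aligned boxes in `(x1, y1, x2, y2)` form. The helper names `iou` and `expand_mask` are hypothetical and only illustrate the idea of extending a sampled mask to every region that overlaps a masked region by more than 0.4 IoU; they are not taken from the released implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def expand_mask(boxes, seed_indices, threshold=0.4):
    """Return the seed indices plus every region overlapping a seed above threshold."""
    masked = set(seed_indices)
    for s in seed_indices:
        for j, box in enumerate(boxes):
            if iou(boxes[s], box) > threshold:
                masked.add(j)
    return sorted(masked)


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
# Box 1 overlaps box 0 with IoU 81/119 (about 0.68), so it is masked as well.
print(expand_mask(boxes, [0]))  # [0, 1]
```

Without this expansion, a heavily overlapping unmasked region would expose nearly the same pixels as the masked one, so the model could reconstruct the target from vision alone.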

3.2. Multi-Task Learning

We consider a simple multi-task model where each task has a task-specific ‘head’ network that branches off a common, shared ‘trunk’ ViLBERT model. As such, we learn shared trunk parameters $\theta_{s}$ and a set of task-specific layers $\{\theta_{t}\}_{t=1}^{\mathcal{T}}$ for $\mathcal{T}$ tasks. Our goal is to learn parameters $\theta_{s}\cup\{\theta_{t}\}_{t=1}^{\mathcal{T}}$ that minimize loss across all tasks. Details on heads and other modifications follow.

Task Token. While relying on the same groundings, different tasks may still require the model to process inputs differently – e.g. referring expressions just require grounding while VQA must follow grounding with additional reasoning. To enable this, we augment the query with a task token $\mathrm{TASK}_{t}$ such that the new input format is $\{\mathrm{IMG},v_{1},\ldots,v_{n},\mathrm{CLS},\mathrm{TASK}_{t},w_{1},\dots,w_{m},\mathrm{SEP}\}$. The architecture can then leverage this task information in a bottom-up manner. In what follows, we describe the task-specific heads by task groups.
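As a small illustration of the augmented input format, the sketch below assembles the token sequence for one example. The bracketed token strings and the `build_input` helper are placeholders chosen for this example; the actual vocabulary entries used for the special tokens are not specified in the text.

```python
def build_input(region_ids, word_tokens, task_id):
    """Assemble {IMG, v_1..v_n, CLS, TASK_t, w_1..w_m, SEP} as a flat list."""
    task_token = f"[TASK_{task_id}]"  # one learned token per task (naming is hypothetical)
    return (
        ["[IMG]"]
        + list(region_ids)
        + ["[CLS]", task_token]
        + list(word_tokens)
        + ["[SEP]"]
    )


sequence = build_input(["v1", "v2"], ["what", "color"], task_id=3)
print(sequence)
# ['[IMG]', 'v1', 'v2', '[CLS]', '[TASK_3]', 'what', 'color', '[SEP]']
```

Because the task token sits at the start of the text stream, every later word token can attend to it from the first layer onward.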

Vocab-Based VQA Output: We compute an overall image-query representation as an element-wise product between the holistic $h_{\mathrm{IMG}}$ and $h_{\mathrm{CLS}}$ representations. As in [2, 17], we treat vocab-based VQA as a multi-label classification task – assigning a soft target score to each answer based on its relevancy to the ground truth answer. We compute scores for a set of the pre-defined answers $A$ by using a two-layer MLP on top of the overall representation:

$$
P_{v}(A|I,Q)=\sigma(\mathrm{MLP}(h_{\mathrm{IMG}}\odot h_{\mathrm{CLS}}))
$$

where $\sigma$ is the sigmoid function. Because their answer vocabularies differ, VQA and VG QA share one MLP and answer vocabulary while GQA learns a separate one.
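The head above can be written out in a few lines. The sketch below uses plain Python lists so that it runs without dependencies; the ReLU hidden activation, the toy weights, and the function names are assumptions, since the text only specifies a two-layer MLP followed by a sigmoid.

```python
import math


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def vqa_scores(h_img, h_cls, w1, b1, w2, b2):
    """sigma(MLP(h_IMG * h_CLS)) with one hidden layer (ReLU is an assumption)."""
    joint = [a * b for a, b in zip(h_img, h_cls)]  # element-wise product
    hidden = [max(0.0, z + b) for z, b in zip(matvec(w1, joint), b1)]
    logits = [z + b for z, b in zip(matvec(w2, hidden), b2)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # independent per-answer scores


# Toy dimensions: d = 2, hidden size 2, answer vocabulary of size 3.
scores = vqa_scores(
    h_img=[1.0, 2.0],
    h_cls=[0.5, -1.0],
    w1=[[1.0, 0.0], [0.0, 1.0]],
    b1=[0.0, 0.0],
    w2=[[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]],
    b2=[0.0, 0.0, 0.0],
)
print(scores)
```

The sigmoid scores each answer independently, which matches the multi-label formulation with soft targets; a softmax would instead force the answers to compete for probability mass.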

Image Retrieval Output: Using the same overall representation, we compute an alignment score between image-caption pairs as:

$$
\mathrm{Rel}(I,Q)=W_{i}(h_{\mathrm{IMG}}\odot h_{\mathrm{CLS}})
$$

where $W_{i}\in\mathbb{R}^{d\times1}$ is shared across the COCO and Flickr30k image retrieval tasks. As in [27], we train a 4-way multiple-choice objective against hard negatives selected offline and then fixed. Recent work has used online hard-negative mining [8, 23] but this is costly to compute.

Referring Expressions Output: We rerank a set of region proposals [50] given the referring expression. We pass the final representation $h_{v_{i}}$ for each image region $i$ into a learned projection $W_{r}\in\mathbb{R}^{d\times1}$ to predict a matching score.

$$
\mathrm{Rel}(v_{i},Q)=W_{r}h_{v_{i}}
$$

Note that $Q$ may be a phrase, question, or dialog depending on the task (RefCOCO(+/g), Visual7W, GuessWhat). $W_{r}$ is shared across all the referring expression tasks.
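A minimal sketch of the reranking step: each region embedding is projected to a scalar by the shared vector $W_{r}$ and the proposals are sorted by that score. The function name and the toy embeddings are illustrative assumptions only.

```python
def rerank_regions(region_embeddings, w_r):
    """Score each region with a shared linear projection and sort best-first."""
    scores = [sum(w * h for w, h in zip(w_r, h_v)) for h_v in region_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Three toy region embeddings h_{v_i} with d = 2.
order, scores = rerank_regions([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_r=[0.5, 2.0])
print(order)   # [2, 1, 0]
print(scores)  # [0.5, 2.0, 2.5]
```

Since $W_{r}$ is shared, phrases, pointing questions, and dialogs all reuse the same scoring function; only the text stream (and its task token) differs between tasks.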

Multi-modal Verification Output: Taking $\mathrm{NLVR^{2}}$ as an example, the input is a concatenation of two images ($I_{0}$ and $I_{1}$) and a statement $Q$, and the model must judge the validity of the statement given the images. We consider this a classification problem given an embedding that encodes the two image-statement pairs $(I_{0},Q)$ and $(I_{1},Q)$. The output probability is predicted by a 2-layer MLP with softmax:

$$
P_{v}(C|I_{0},I_{1},Q)=\mathrm{softmax}\left(\mathrm{MLP}\left(\left[h_{\mathrm{IMG}}^{0}\odot h_{\mathrm{CLS}}^{0}\,\|\,h_{\mathrm{IMG}}^{1}\odot h_{\mathrm{CLS}}^{1}\right]\right)\right)
$$

where $[\,\cdot\,\|\,\cdot\,]$ is concatenation.

For SNLI-VE, the input is a single image and statement. We thus learn a separate classifier of the same form that predicts the relationship (entailment, neutral, contradiction) from the inputs.

3.3. Large-Scale Multitask Training

With 6 task heads, 12 datasets, and over 4.4 million individual training instances – training our multi-task ViLBERT model is a daunting proposition. Multi-task learning (especially at this scale) poses significant challenges as learning objectives have complex and unknown dynamics and may compete [41]. Further, vision-and-language datasets vary significantly in size and difficulty. For instance, a single epoch of VG (our largest dataset) corresponds to 19.8 epochs of $\operatorname{RefCOCOg}$ (our smallest). Likewise, when trained in isolation RefCOCOg converges in 5K iterations whereas VQAv2 takes 84K iterations (over 16 times more). Below, we describe the details of our multi-task training approach and techniques to overcome these challenges.

Pretraining. All our models are pretrained on the Conceptual Captions dataset [39], including our self-supervised task modifications as described in Sec. 3.1.

Round-Robin Batch-Level Sampling. We consider a round-robin batch-level sampling regime that cycles through each task from the beginning of multi-task training. As such, one multi-task iteration consists of each task forwarding a batch and updating parameters in sequence.
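The sampling regime can be sketched as below. The batches are stand-in strings, `update_fn` is a hypothetical callback representing one forward/backward/parameter-update step, and wrapping each task's batches in `itertools.cycle` is an assumption about how a small dataset restarts once it is exhausted.

```python
from itertools import cycle


def round_robin_iteration(task_batches, update_fn):
    """One multi-task iteration: each task forwards one batch and updates in turn."""
    for name, batches in task_batches.items():
        update_fn(name, next(batches))


log = []
task_batches = {
    "vqa": cycle(["q0", "q1", "q2"]),  # larger dataset: three batches per epoch
    "refcocog": cycle(["r0"]),         # smaller dataset: one batch per epoch
}
for _ in range(2):
    round_robin_iteration(task_batches, lambda name, batch: log.append((name, batch)))
print(log)
# [('vqa', 'q0'), ('refcocog', 'r0'), ('vqa', 'q1'), ('refcocog', 'r0')]
```

The toy log already shows the imbalance discussed in the text: the small task completes an epoch every iteration while the large task is still partway through its first.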

Dynamic Stop-and-Go. As noted earlier, different tasks have different difficulties and dataset sizes. Consequently, simply cycling through all tasks may drastically overtrain smaller tasks, leading to overfitting. Typically early stopping provides a strong defense against this phenomenon; however, stopping a task in multi-task training introduces problems with catastrophic forgetting as the base network drifts over time due to other tasks. We introduce an intuitive but effective dynamic stop-and-go (DSG) mechanism to avoid these problems. We monitor the validation loss $s_{t}$ of each task $t$, computing it once per task epoch. If performance improvement is less than 0.1% over 2 epochs, we consider it Converged and shift it into stop mode. In DSG stop mode, a task only updates every iter-gap ($\Delta$) iterations. If validation performance degrades by 0.5% from the task’s best measured performance while in stop mode, the task is considered Diverged and is returned to DSG go. This procedure is shown in Algorithm 1.
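The stop-and-go logic described above can be sketched as a small per-task state machine. This is one possible reading of the prose rather than a transcription of Algorithm 1: the class name, the default `iter_gap`, the use of a higher-is-better validation score, and the interpretation of both thresholds as relative changes are all assumptions.

```python
class DSGTask:
    """Per-task dynamic stop-and-go state (illustrative sketch)."""

    def __init__(self, iter_gap=4, converge_tol=0.001, diverge_tol=0.005):
        self.iter_gap = iter_gap          # Delta: update period while in stop mode
        self.converge_tol = converge_tol  # 0.1% improvement over 2 epochs
        self.diverge_tol = diverge_tol    # 0.5% drop from the best score
        self.mode = "go"
        self.scores = []                  # one validation score per task epoch
        self.best = None

    def should_update(self, iteration):
        """In go mode update every iteration; in stop mode only every iter_gap."""
        return self.mode == "go" or iteration % self.iter_gap == 0

    def record_validation(self, score):
        """Call once per task epoch with a higher-is-better validation score."""
        self.scores.append(score)
        self.best = score if self.best is None else max(self.best, score)
        if self.mode == "go" and len(self.scores) >= 3:
            previous = self.scores[-3]  # score from two epochs earlier
            gain = (score - previous) / max(abs(previous), 1e-12)
            if gain < self.converge_tol:
                self.mode = "stop"      # Converged
        elif self.mode == "stop":
            drop = (self.best - score) / max(abs(self.best), 1e-12)
            if drop >= self.diverge_tol:
                self.mode = "go"        # Diverged


task = DSGTask()
modes = []
for s in [60.0, 65.0, 65.02, 65.03, 64.5]:
    task.record_validation(s)
    modes.append(task.mode)
print(modes)  # ['go', 'go', 'go', 'stop', 'go']
```

In a training loop, `should_update(i)` would gate whether the task contributes a batch at multi-task iteration `i`. Keeping occasional updates in stop mode, rather than freezing the task entirely, is what lets the shared trunk stay anchored to a converged task.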

Algorithm 1: DSG for Multi-Task Learning

Curriculum Learning. Inspired by prior multi-task literature [4, 31], we experimented with both curriculum and anti-curriculum strategies based on task difficulty. Specifically, for anti-curriculum we first train on the slowest-converging task-group G1 (Vocab-Based VQA) before starting full round-robin multi-task training. Inversely, for the curriculum setting we first train on our fastest-converging task-group G3 (Referring Expressions). Unlike previous observations [31, 33], we found that using no curriculum leads to superior performance when combined with the other strategies proposed in this section.

Setting Multi-Task Hyperparameters. We follow a simple design philosophy – identify simple heuristics based on hyper-parameters tuned for each task in single-task training. This significantly reduces the burden of searching for joint-training hyper-parameters. See the supplement for a full list of per-task learning rates, batch sizes, and other settings. Our code has been made available.

Batch Size: For multi-task, we keep the batch size tuned for single-task training for each task.

Warm-up Duration: We found it important to set the warm-up duration relative to the largest dataset. Specifically, we run linear warm-up over $\eta\cdot N$ iterations where $N$ is the maximum number of iterations taken to train any dataset in the single-task setting. We observed significant performance degradation for harder tasks when warm-up was shorter. We set $\eta$ to 0.1 for our experiments.

Loss Scaling: Our model has shared and task-specific parameters and we found it important to maintain separate learning rates. For the shared base model, we set the base learning rate to the minimum over all single-task dataset parameters. To accommodate variable learning rates for each dataset, we scale the task loss for each dataset by the ratio of the task target learning rate over the base learning rate.
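The two heuristics reduce to a few lines of arithmetic, sketched below. The learning rates in `task_lrs` are made-up placeholders (the tuned values are in the supplement), and using 84K as $N$ simply reuses the VQAv2 single-task figure quoted earlier as an example; whether that is the true maximum over all 12 datasets is not stated here.

```python
def warmup_factor(iteration, max_single_task_iters, eta=0.1):
    """Linear warm-up multiplier over eta * N iterations, then constant 1.0."""
    warmup_iters = eta * max_single_task_iters
    return min(1.0, iteration / warmup_iters)


def loss_scales(task_lrs):
    """Base LR is the minimum single-task LR; each task loss is scaled by lr_t / base."""
    base_lr = min(task_lrs.values())
    return base_lr, {task: lr / base_lr for task, lr in task_lrs.items()}


# Hypothetical single-task learning rates, for illustration only.
task_lrs = {"vqa": 4e-5, "refcocog": 2e-5, "nlvr2": 2e-5}
base_lr, scales = loss_scales(task_lrs)
print(base_lr, scales)

# With N = 84,000 and eta = 0.1, warm-up spans 8,400 iterations.
print(warmup_factor(4200, 84000))   # about 0.5, halfway through warm-up
print(warmup_factor(20000, 84000))  # 1.0, warm-up finished
```

Scaling a task's loss by `lr_t / base_lr` multiplies its gradients by the same factor, so under plain SGD a step at the base rate equals a step on the unscaled loss at the task's own rate; with adaptive optimizers such as Adam the correspondence is only approximate.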

4. Experiments and Results

4.1. Single-Task Performance

To establish baseline performance for the ViLBERT architecture that forms the backbone of our multi-task experiments, we first train single-task models on top of the base ViLBERT architecture (Section 3) for each of our 12 datasets. Rows 1 and 2 in Table 2 show the performance of these models trained on the full and cleaned datasets, respectively. As expected, reducing the training set size through cleaning results in lower performance in most cases. Our improvements to the pretraining objective (Sec. 3.1) result in better downstream task performance (71.82 vs. 70.55 on VQA and 61.46 vs. 58.20 on Flickr30k Recall@1). See the supplementary for a full comparison. Overall, our base architecture is competitive with prior work and a good starting point for multi-task learning.

Table 2: Comparison of our multi-task models to single-task performance. We find multi-task training (rows 3-5) provides significant gains over single-task training (rows 1-2) while reducing the parameter count from over 3 billion to 270 million. Further, following multi-task training by task-specific fine-tuning (rows 6-9) further gains can be made at the cost of increased parameters. Column groups: G1 = Vocab-based VQA, G2 = Image Retrieval, G3 = Referring Expressions, G4 = Verification.

Row	Model	Clean	G1: VQAv2 (test-dev)	G1: GQA (test-dev)	G1: VG QA (val)	G2: COCO (test, R1)	G2: Flickr30k (test, R1)	G3: RefCOCO (test)	G3: RefCOCO+ (test)	G3: RefCOCOg (test)	G3: V7W (test)	G3: GW (test)	G4: NLVR2 (testP)	G4: SNLI-VE (test)	# params (# models)	All-task avg.
1	Single-Task (ST)	no	71.82	58.19	34.38	65.28	61.14	78.63	71.11	72.24	80.51	62.81	74.25	76.72	3B (12)	67.25
2	Single-Task (ST)	yes	71.24	59.09	34.10	64.80	61.46	78.17	69.47	72.21	80.51	62.53	74.25	76.53	3B (12)	67.03
3	Group-Tasks (GT)	yes	72.03	59.60	36.18	65.06	66.00	80.23	72.79	75.30	81.54	64.78	74.62	76.52	1B (4)	68.72
4	All-Tasks (AT)		72.57	60.12	36.36	63.70	63.52	80.58	73.25	75.96	82.75	65.04	78.44	76.78	270M (1)	69.08
5	All-Tasks w/o G4		72.68	62.09	36.74	64.88	64.62	80.76	73.60	75.80	83.03	65.41	n/a	n/a	266M (1)	n/a
6	GT finetune -> ST		72.61	59.96	35.81	66.26	66.98	79.94	72.12	75.18	81.57	64.56	74.47	76.34	3B (12)	68.81
7	AT finetune -> ST		72.92	60.48	36.56	65.46	65.14	80.86	73.45	76.00	83.01	65.15	78.87	76.73	3B (12)	69.55
8	AT finetune -> ST		73.15	60.65	36.64	68.00	67.90	81.20	74.22	76.35	83.35	65.69	78.87	76.95	3B (12)	70.24

Table 3: Pair-wise (left) and triple-wise (right) inter-group representative task analysis. Each entry is the relative performance change from single-task training for the row-task when jointly trained with the column-task(s).

Pair-wise (row-task trained with one other group):
Row-task	G1	G2	G3	G4	Avg.
G1 (VQAv2)	n/a	0.38%	0.38%	-0.20%	0.19%
G2 (Flickr30k)	0.46%	n/a	0.23%	-4.13%	-1.15%
G3 (Visual7W)	0.39%	0.78%	n/a	0.24%	0.47%
G4 (NLVR2)	2.29%	1.47%	0.67%	n/a	1.48%
Avg.	1.04%	0.88%	0.43%	-1.36%	n/a

Triple-wise (row-task trained with two other groups):
Row-task	G1&G2	G1&G3	G1&G4	G2&G3	G2&G4	G3&G4	Avg.
G1 (VQAv2)	n/a	n/a	n/a	0.63%	-0.08%	0.18%	0.24%
G2 (Flickr30k)	n/a	1.24%	0.49%	n/a	n/a	-4.36%	-0.88%
G3 (Visual7W)	0.86%	n/a	0.19%	n/a	0.29%	n/a	0.44%
G4 (NLVR2)	3.69%	3.22%	n/a	2.73%	n/a	n/a	3.21%
Avg.	2.27%	2.23%	0.34%	1.68%	0.10%	-2.09%	n/a

4.2. Intra-Group Multi-task Performance

We begin with the most intuitive multi-task setting – jointly training tasks within the same groups. As grouped tasks are typically highly related, this is akin to some existing data augmentation practices (e.g. adding Visual Genome (VG) QA data when training VQA). Note this corresponds to four separate multi-task models – one for each group.

Table 2 row 3 shows the result of intra-group multi-task training. Comparing with single-task models trained on the same data (row 2), we see meaningful improvements of between 0.37 ($\mathrm{NLVR^{2}}$) and 4.54 (Flickr30k retrieval) points for 11 out of 12 tasks (only SNLI-VE did not improve). Comparing to row 1, we see that intra-group multi-task training overcomes the data loss from cleaning with an average score of 68.72, outperforming the single-task models trained on the full datasets, which have an average score of 67.25. Further, the total number of parameters drops by a factor of 3 – going from 12 full models to only 4.

4.3. Inter-Group Multi-task Performance

4.3. 组间多任务性能

Representative Task Analysis. We next consider the interplay between different task-groups. For efficiency, we consider multi-task training with representative tasks from each group – specifically VQA (G1), Retrieval Flickr30k (G2), Visual7W (G3), and $\mathrm{NLVR^{2}}$ (G4). These were selected to maximize diversity in underlying image sources. We examine their relationships by jointly training all pairs and triplets of tasks under our multi-task training approach.

代表性任务分析。接下来我们考虑不同任务组之间的相互作用。为了提高效率，我们选取了每组具有代表性的任务进行多任务训练——具体包括VQA (G1)、Retrieval Flickr30k (G2)、Visual7W (G3)和$\mathrm{NLVR^{2}}$ (G4)。这些任务的选取旨在最大化底层图像来源的多样性。我们通过多任务训练方法联合训练所有任务对和三元组，来研究它们之间的关系。

Table 3 (left) shows the results of training each representative task pair. Each entry is the percent change from single-task performance for the row-task when jointly trained with the column-task. As such, the Avg. row (bottom) shows the mean impact each column-task has on other tasks, and likewise the Avg. column (right) shows the mean impact other tasks have on each row-task. For instance, we find that adding VQA (G1) benefits other tasks with an average improvement of $+1.04\%$. Interestingly, adding $\mathrm{NLVR^{2}}$ (G4) degrades other tasks on average ($-1.36\%$) while making significant gains itself ($+1.48\%$). This is primarily due to a $-4.13\%$ interaction with G2. Table 3 (right) shows all task triplets. Gains in the paired experiments are not simply additive. In the pair-wise analysis, G3 gained $+0.39\%$ and $+0.78\%$ from G1 and G2 respectively. As before, G4 has some strong negative effects on other groups ($-4.36\%$ G2 with G3 & G4) but these effects can be regulated by other

表 3 (左) 展示了每个代表性任务对的训练结果。每个单元格表示行任务与列任务联合训练时，相对于单任务性能的百分比变化。因此，Avg. 行(底部)显示每项列任务对其他任务的平均影响，同理 Avg. 列(右侧)显示其他任务对每项行任务的平均影响。例如，我们发现添加 VQA (G1) 能使其他任务平均提升 $+1.04%$ 。有趣的是，添加 $\mathrm{NLVR^{2}}$ (G4) 会平均降低其他任务性能 $(-1.36%)$ ，但其自身却获得显著提升 $(+1.48%)$ ，这主要源于与 G2 的 $-4.13%$ 交互效应。表 3 (右) 展示了所有任务三元组情况，配对实验中的增益并非简单叠加。在成对分析中，G3 从 G1 和 G2 分别获得 $+0.39%$ 和 $+0.78%$ 提升。如前所述，G4 对其他组别存在较强负面影响 (G2 与 G3 & G4 组合为 $-4.36%$ )，但这些影响可通过其他...

Table 4: Comparison to recent SOTA. For image retrieval (IR) COCO and Flickr we report R1 scores on the 1K test set.

表 4: 与近期SOTA方法的对比。对于图像检索(IR)任务中的COCO和Flickr数据集，我们报告了在1K测试集上的R1分数。

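Task	Split	SOTA	UNITER [8] $\mathrm{BERT}_{\mathrm{B}}$	UNITER [8] $\mathrm{BERT}_{\mathrm{L}}$	$\mathrm{Ours}_{\mathrm{AT}}$ $\mathrm{BERT}_{\mathrm{B}}$	$\mathrm{Ours}_{\mathrm{AT->ST}}$ $\mathrm{BERT}_{\mathrm{B}}$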
VQA	test-dev		72.27	73.24	72.57	73.15
VG QA	val				36.36	36.64
GQA	test-dev	60.00 [45]			60.12	60.65
IR COCO	test (R1)	68.50 [23]			63.70	68.00
IR Flickr30k	test (R1)		71.50	73.66	63.52	67.90
RefCOCO	test		80.21	80.88	80.58	81.20
RefCOCO+	test		72.90	73.73	73.25	74.22
RefCOCOg	test		74.41	75.77	75.96	76.35
Visual 7W	test	72.53 [16]			82.75	83.35
GuessWhat	test	61.30 [13]			65.04	65.69
NLVR2	testP		77.87	79.50	78.44	78.87
SNLI-VE	test		78.02	78.98	76.78	76.95
# params (# models)			602M (7 x 86M)	2.1B (7 x 303M)	270M (1 x 270M)	3B (12 x 250M)

tasks ($+0.49\%$ G2 with G1 & G4).

任务 $(+0.49%$ G2与G1和G4的组合)。
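The way the entries and the two Avg. margins of the pairwise table are derived can be illustrated with a minimal sketch. It assumes that "percent change" means relative change over the single-task score; the function names and the dictionary layout are hypothetical and are not taken from the paper's code.

```python
from statistics import mean


def percent_change(single, joint):
    """Relative change (in %) of the jointly trained score over the single-task score."""
    return 100.0 * (joint - single) / single


def pair_table(single_scores, joint_scores):
    """Build row-task x column-task percent changes plus both averages.

    single_scores: {task: single-task score}
    joint_scores:  {(row_task, col_task): score of row_task when trained with col_task}
    """
    tasks = list(single_scores)
    table = {
        (r, c): percent_change(single_scores[r], joint_scores[(r, c)])
        for r in tasks
        for c in tasks
        if r != c
    }
    # Avg. column: mean impact the other tasks have on each row-task.
    row_avg = {r: mean(table[(r, c)] for c in tasks if c != r) for r in tasks}
    # Avg. row: mean impact each column-task has on the other tasks.
    col_avg = {c: mean(table[(r, c)] for r in tasks if r != c) for c in tasks}
    return table, row_avg, col_avg
```

Under this reading, a task such as G4 can show a negative column average (it hurts others) together with a positive row average (it benefits from others), which is the pattern described above.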

Full Multi-task Results. We move to our main result – a single model trained on all 12 datasets. The results of this All-Tasks (AT) model are shown in Table 2 row 4. This model outperforms independent single-task models trained on the same data (row 2) for 11 out of 12 tasks and improves the average score by 2.05 points (69.08 vs. 67.03). We reiterate for emphasis: average performance improves by 2.05 points while the number of parameters is reduced from over 3 billion to 270 million (a $12\times$ reduction). The same holds when comparing with single-task models trained on the full datasets (row 1), with a similar margin of 1.83 points.

完整多任务结果。我们转向主要结果——一个在全部12个数据集上训练的单一模型。这个全任务(AT)模型的结果如表2第4行所示。该模型在12个任务中有11个优于相同数据上训练的独立单任务模型(第2行)，并将平均得分提高了2.05分(69.08对67.03)。我们再次强调，平均性能提升2.05分的同时，参数量从超过30亿减少到2.7亿(缩减12倍)。与完整数据集训练的单任务模型(第1行)相比，同样保持了1.83分的类似优势。

Our AT model also outperforms the Group-Task (GT) models (row 3) despite having $4\times$ fewer parameters (avg. 69.08 vs. 68.72). This implies that despite their diversity, tasks across different groups can benefit from joint training.

我们的AT模型在参数量减少4倍的情况下（平均69.08对68.72），表现仍优于分组任务（GT）模型（第3行）。这表明尽管任务存在多样性，跨组别的联合训练仍能带来收益。

We observed from the representative task analysis that G4 tends to negatively affect other groups during joint training. To validate this observation on all tasks, we train an All-Task model without G4 (row 5). This model achieves a higher average score of 67.96 for $\mathrm{G}1{+}\mathrm{G}2{+}\mathrm{G}3$ compared to the full AT model’s 67.39. $\mathrm{NLVR^{2}}$ (G4) presents two images per description, and often one matches while the other does not. Despite the alignment with one image, the instance as a whole is negative. We speculate that this supervision may interfere with the standard caption-image alignment objective in Flickr30k.

我们从代表性任务分析中观察到，G4在联合训练时往往会对其他组产生负面影响。为了在所有任务上验证这一观察结果，我们训练了一个不含G4的全任务模型（第5行）。该模型在$\mathrm{G}1{+}\mathrm{G}2{+}\mathrm{G}3$上取得了67.96的平均分，高于完整AT模型的67.39分。$\mathrm{NLVR^{2}}$（G4）每个描述对应两张图像，通常一张匹配而另一张不匹配。尽管其中一张图像对齐，但实例整体仍为负样本。我们推测这种监督信号可能会干扰Flickr30k中标准的图文对齐目标。

4.4. Multi-Task Learning as Pretraining

4.4. 多任务学习作为预训练

For some applications, single-task performance may be paramount and justify storing a task-specific model. Even then, finetuning from a multi-task trained model may allow the model to take advantage of the additional, diverse supervision captured during multi-task training. Following [26], we finetune our trained multi-task models (GT and AT) on each downstream task and show results in Table 2. Rows 6 and 7 show that finetuning from the all-task model (AT) outperforms finetuning from the group-task models (GT) with an average score of 69.51 vs. 68.81. For comparison with our multi-task models, these are finetuned on the cleaned datasets, which are $11\%$ smaller on average. To compare to prior work, we also finetune on the full dataset for individual tasks (row 8) and observe further improvements. Recall that our multi-task model was trained on cleaned data, so there is no possibility of test leak here. These models outperform single-task models without multi-task pretraining (row 1) by a large margin (70.24 vs. 67.25 avg. score).

对于某些应用场景，单一任务的性能可能至关重要，因此值得存储任务专用模型。即便如此，基于多任务训练模型进行微调，仍能让模型利用多任务训练期间捕获的额外多样化监督信号。参照[26]的做法，我们在每个下游任务上对已训练的多任务模型（GT和AT）进行微调，结果如表2所示。第6-7行数据显示，基于全任务模型（AT）微调的表现优于基于分组任务模型（GT）微调，平均得分分别为69.51和68.81。为与多任务模型对比，这些模型是在清洗后的数据集上微调的，该数据集平均规模缩小了$11%$。为与已有研究对比，我们还在完整数据集上对单项任务进行微调（第8行），观察到进一步提升。需注意我们的多任务模型基于清洗数据训练，因此不存在测试数据泄露。这些模型大幅领先未经多任务预训练的单任务模型（第1行）（平均得分70.24 vs 67.25）。

Table 5: Comparison with other multi-task models. VQA score is on test-dev and the retrieval tasks on their respective 1K test split. For Flickr Grounding (FG) we report R1 on Flickr30K test.

	VQA	COCO Retrieval			Flickr Retrieval			FG
		R1	R5	R10	R1	R5	R10	R1
OmniNet [36]	55.76	-	-	-	-	-	-	-
HDC [33]	69.28	57.40	88.40	95.60	56.10	82.90	89.40	57.39
Ours	72.70	65.16	91.00	96.20	65.06	88.66	93.52	64.61

表 5: 与其他多任务模型的对比。VQA分数基于test-dev数据集，检索任务基于各自的1K测试集划分。对于Flickr Grounding (FG)，我们报告Flickr30K测试集上的R1分数。

	VQA	COCO Retrieval			Flickr Retrieval			FG
		R1	R5	R10	R1	R5	R10	R1
OmniNet [36]	55.76	-	-	-	-	-	-	-
HDC [33]	69.28	57.40	88.40	95.60	56.10	82.90	89.40	57.39
Ours	72.70	65.16	91.00	96.20	65.06	88.66	93.52	64.61

4.5. Comparison with Existing Work

4.5. 与现有工作的对比

In Table 4 we compare with the existing state-of-the-art. We draw special comparison with the recent UNITER [8] architecture as it is similar to our base ViLBERT model. Like ViLBERT, UNITER is a general BERT-based vision-and-language architecture pretrained through self-supervised tasks and then finetuned for each downstream task. We show two UNITER columns corresponding to their underlying BERT model – either Base ($\mathrm{BERT}_{\mathrm{B}}$) or Large ($\mathrm{BERT}_{\mathrm{L}}$). Our ViLBERT model uses the smaller $\mathrm{BERT}_{\mathrm{B}}$. Our single all-task model ($\mathrm{Ours}_{\mathrm{AT}}$) achieves competitive performance to state-of-the-art task-specific models. Our single-task finetuned models ($\mathrm{Ours}_{\mathrm{AT->ST}}$) surpass state-of-the-art on 7 out of 12 tasks.

在表4中，我们与现有最先进技术进行了比较。我们特别与近期提出的UNITER [8]架构进行了对比，因为其与我们基础的ViLBERT模型相似。与ViLBERT类似，UNITER是一种基于BERT的通用视觉-语言架构，通过自监督任务进行预训练，然后针对每个下游任务进行微调。我们展示了两列UNITER结果，分别对应其底层BERT模型——Base B或Large L。我们的ViLBERT模型使用了较小的$\mathrm{BERT}{\mathrm{B}}$。我们的单一全任务模型$(\mathrm{Ours}{\mathrm{AT}})$达到了与最先进专用任务模型相当的性能。我们的单任务微调模型$(\mathrm{Ours}_{\mathrm{AT->ST}})$在12项任务中有7项超越了当前最优水平。

Table 5 compares our method with other recently proposed multi-modal, multi-task learning approaches – OmniNet [36] and Hierarchical Dense Co-Attention (HDC) [33]. OmniNet is trained on part-of-speech tagging, image captioning, visual question answering, and video activity recognition, while HDC is trained on image caption retrieval, visual question answering, and visual grounding. We train a multi-task model on the same tasks and cleaned datasets used in HDC [33]. Flickr Grounding is a new task that we include for this comparison. Our multi-task model outperforms these approaches by a large margin.

表 5: 将我们的方法与近期提出的多模态多任务学习方法——OmniNet [36] 和分层密集共注意力 (HDC) [33] 进行对比。OmniNet 在词性标注、图像描述生成、视觉问答和视频活动识别任务上训练，而 HDC 则在图像描述检索、视觉问答和视觉定位任务上训练。我们在 HDC [33] 使用的相同任务和清洗数据集上训练了一个多任务模型。Flickr Grounding 是我们为本次对比新增的任务。我们的多任务模型以显著优势超越了这些方法。

5. Analysis and Ablation Study

5. 分析与消融研究

Ablations on task token and training strategies. To verify our design choices, we perform ablations for different task token granularities and multi-task training strategies. The results are shown in Table 6. We report the average performance per group and the overall average. A detailed breakdown for each task can be found in the supplement.

任务token与训练策略的消融实验。为验证设计选择，我们针对不同任务token粒度和多任务训练策略进行了消融实验。结果如表6所示，汇报了平均组别性能和整体平均性能。各任务详细数据可查阅补充材料。

Table 6: Ablations on our design choices and comparison to curriculum and anti-curriculum learning multi-task approaches.

		Task Token	Dynamic Stop-and-Go	G1	G2	G3	G4	All Tasks Average
AT (our)
1	token per dataset			56.35	63.61	75.52	77.61	69.08
2	token per head			55.95	61.48	75.35	77.37	68.52
3	w/o task token		√	55.67	62.55	75.38	76.73	68.53
4	w/o DSG			55.50	62.92	75.24	76.31	68.52
5	w/ curriculum			54.68	61.21	75.19	76.70	67.24
6	w/ anti-curriculum			55.82	59.58	73.69	75.94	67.98
7	vanilla multitask			54.09	61.45	75.28	76.71	67.92

表 6: 设计选择的消融实验及与课程学习和反课程学习多任务方法的对比

		Task Token	Dynamic Stop-and-Go	G1	G2	G3	G4	All Tasks Average
AT (our)
1	token per dataset			56.35	63.61	75.52	77.61	69.08
2	token per head			55.95	61.48	75.35	77.37	68.52
3	w/o task token		√	55.67	62.55	75.38	76.73	68.53
4	w/o DSG			55.50	62.92	75.24	76.31	68.52
5	w/ curriculum			54.68	61.21	75.19	76.70	67.24
6	w/ anti-curriculum			55.82	59.58	73.69	75.94	67.98
7	vanilla multitask			54.09	61.45	75.28	76.71	67.92

For task tokens, our default setting is with a different task token per dataset (12 total, Row 1). We compare this with two ablations: one task token per output head (4 total, Row 2) and no task tokens (Row 3). We observe that task-specific tokens lead to better performance compared to head-based tokens (avg. 69.08 vs. 68.52) and no task tokens (avg. 69.08 vs. 68.53). This shows that task-aware feature embedding is useful even within the same output space; e.g. per-task tokens may help differentiate noun phrases and pointing questions in Referring Expression.

对于任务token，我们的默认设置为每个数据集使用不同的任务token（共12个，第1行）。我们将其与两种消融实验进行比较：每个输出头使用一个任务token（共4个，第2行）和不使用任务token（第3行）。我们观察到，相比基于头的token（平均69.08 vs. 68.52）和不使用任务token（平均69.08 vs. 68.53），特定任务的token能带来更好的性能。这表明即使在相同的输出空间中，任务感知的特征嵌入也是有用的；例如，每个任务的token可能有助于区分Referring Expression中的名词短语和指向性问题。

For the multi-task training schedule, we compare our dynamic stop-and-go (DSG) (Row 3) with the Curriculum (Row 5) and Anti-Curriculum (Row 6) approaches discussed in Sec. 3. We consider convergence rate as a measure of task difficulty. For Curriculum, we first train tasks in G4 and then train all tasks together (easier $\longrightarrow$ harder). For Anti-Curriculum, we train G1 tasks first and then train on all tasks together (harder $\longrightarrow$ easier). Table 6 shows our dynamic stop-and-go training schedule outperforms anti-curriculum (avg. 68.52 vs. 67.98) and curriculum (avg. 68.53 vs. 67.24). Row 7 shows results of a ‘vanilla’, round-robin training scheme with no task tokens or training scheduling. The average score of vanilla multitask is close to anti-curriculum (67.92 vs. 67.98). Consistent with prior work [31], performance on harder tasks (G1) is worse compared to anti-curriculum. Our full training regime outperforms this significantly (avg. 69.08 vs. 67.92).

在多任务训练计划方面，我们将动态启停 (DSG) (第3行) 与第3节讨论的课程学习 (第5行) 和反课程学习 (第6行) 方法进行对比。我们将收敛速度作为任务难度的衡量标准。对于课程学习，我们首先训练G4组任务，然后同时训练所有任务 (由易到难) 。对于反课程学习，我们先训练G1组任务，再同时训练所有任务 (由难到易) 。表6显示我们的动态启停训练计划优于反课程学习 (平均68.52 vs. 67.98) 和课程学习 (平均68.53 vs. 67.24) 。第7行展示了没有使用任务token或训练调度的基础轮询训练方案结果，其多任务平均得分接近反课程学习 $(67.92\nu s.67.98)$ 。与先前研究 [31] 一致，在较难任务 (G1) 上的表现逊于反课程学习。我们完整的训练方案显著优于该基准 (平均69.08 vs. 67.92) 。

Behavior of Dynamic Stop-and-Go training. To characterize our dynamic stop-and-go training scheme, we visualize the dynamic training schedule in Fig. 2 (left) – bold lines indicate normal go training and thin lines are stop states when datasets receive sparser updates at a fixed iteration gap (every 4th iteration here). We see that smaller datasets quickly converge and enter stop state training early. As the base model drifts over time, they periodically return to full go state training to adjust. Interestingly, after some cycles of this, they enter the stop state and continue with only sparse updates for the rest of training.

动态停走训练的行为特征。为了描述我们的动态停走训练方案，我们在图2（左）中可视化了动态训练计划——粗线表示正常的go训练状态，细线则是数据集以固定迭代间隔（此处为每4次迭代）接收稀疏更新时的stop状态。可以看到较小数据集会快速收敛并提前进入stop状态训练。随着基础模型随时间漂移，它们会周期性返回完整go状态进行调整。有趣的是，经过若干次循环后，这些数据集会进入stop状态并在剩余训练中仅保持稀疏更新。
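The go/stop behavior described above can be sketched as a small per-dataset state machine. This is only an illustration under stated assumptions: the class name, the convergence tolerance, the drop tolerance, and the patience window are hypothetical parameters chosen for the example, and only the fixed iteration gap (every 4th iteration) comes from the text. Validation scores are assumed to be positive, higher-is-better values.

```python
class StopAndGoTask:
    """Tracks the go/stop state of one dataset (illustrative, not the authors' code)."""

    def __init__(self, iter_gap=4, converge_tol=0.001, drop_tol=0.005, patience=2):
        self.iter_gap = iter_gap          # sparse update interval used in the stop state
        self.converge_tol = converge_tol  # hypothetical: relative gain below this means converged
        self.drop_tol = drop_tol          # hypothetical: relative drop from best that re-enters go
        self.patience = patience          # hypothetical: how many evaluations back to compare
        self.history = []
        self.best = float("-inf")
        self.state = "go"

    def report_validation(self, score):
        """Update the state after a validation pass."""
        self.history.append(score)
        self.best = max(self.best, score)
        if self.state == "go" and len(self.history) > self.patience:
            past = self.history[-1 - self.patience]
            # Little improvement over the window: treat the task as converged.
            if score - past < self.converge_tol * abs(past):
                self.state = "stop"
        elif self.state == "stop" and score < self.best * (1 - self.drop_tol):
            # The shared base model drifted and this task degraded: resume full training.
            self.state = "go"

    def should_update(self, iteration):
        """Return True if this dataset should receive a gradient update at this iteration."""
        if self.state == "go":
            return True
        return iteration % self.iter_gap == 0
```

A training loop would call `should_update(i)` for each dataset at iteration `i` and `report_validation(score)` after each evaluation, so that a converged dataset keeps receiving occasional updates rather than being dropped entirely.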

Another aspect of dynamic stop-and-go training is the sparsity of updates in the stop state. Fig. 2 (right) shows the mean normalized accuracy for each group for multi-task models trained with different iteration gaps $(\Delta)$ . We observe that raising $\Delta$ (i.e. updating more sparsely) improves performance initially but degrades for larger values. Absolute and per-task scores are provided in the supplement.

动态启停训练的另一个方面是停止状态下更新的稀疏性。图 2 (右) 展示了在不同迭代间隔 $(\Delta)$ 下训练的多任务模型每组平均归一化准确率。我们观察到增大 $\Delta$ (即更稀疏地更新) 初期能提升性能，但随着值增大会导致性能下降。补充材料中提供了绝对值和逐任务得分。

Figure 2: Left: Visualization of dynamic stop-and-go during multi-task training. Solid lines indicate go mode while thin lines indicate stop mode. Right: Mean accuracy (normalized group-wise for easier comparison) for each group with different iteration gaps $\Delta$ for dynamic stop-and-go.

图 2: 左: 多任务训练中动态启停 (Dynamic stop-and-go) 的可视化。实线表示运行模式，细线表示停止模式。右: 采用不同迭代间隔 $\Delta$ 的动态启停策略时，各组的平均准确率 (为便于比较已进行组间归一化处理)。

Multi-Task visual grounding consistency. Given the common shared base model, one question is whether multi-task models exhibit more consistent visual groundings than independent task-specific models. For example, does a model that correctly answers “What color is the largest dog?” also correctly ground the referring expression “largest dog”? To assess this, we consider 1500 images from the $\mathrm{RefCOCO/+}$ test sets that also have VQA annotations such that for each image $I_{i}$ there are associated questions $\{q^{(i)}\}$ and referring expressions $\{r^{(i)}\}$. To measure the overlap in visual concepts between a question $q_{j}^{(i)}$ and reference $r_{k}^{(i)}$, we count overlapping nouns and adjectives (identified using a part-of-speech tagger [47]) and denote this $d(q_{j}^{(i)},r_{k}^{(i)})$. Armed with this notion of similarity, we consider each question-reference pair for each image (total 111,275 combinations) and compute a weighted accuracy. A pair is considered correct if the question was answered correctly and the referent was localized. Each pair is weighted by its overlap $d(q_{j}^{(i)},r_{k}^{(i)})$. Note that if $q_{j}^{(i)}$ and $r_{k}^{(i)}$ do not have any common visual concept ($d(q_{j}^{(i)},r_{k}^{(i)}) = 0$), the correctness of this pair does not affect the overall metric.

多任务视觉定位一致性。给定共享的基础模型，一个重要问题是多任务模型是否比独立的任务专用模型展现出更一致的视觉定位能力。例如，一个能正确回答"最大的狗是什么颜色？"的模型，是否也能准确定位指代表达式"最大的狗"？为评估这一点，我们从 $\mathrm{RefCOCO/+}$ 测试集中选取1500张图像，这些图像同时具有VQA标注，使得每张图像 $I_{i}$ 都关联着问题集 ${\boldsymbol{q}^{(i)}}$ 和指代表达式集 ${r^{(i)}}$。为衡量问题 $q_{j}^{\left(i\right)}$ 与指代 $r_{k}^{(i)}$ 之间视觉概念的重叠程度，我们统计重叠的名词和形容词（使用词性标注工具[47]识别），记为 $d(q_{j}^{(i)},r_{k}^{(i)})$。基于这种相似性度量，我们对每张图像的问题-指代对（共111,275组组合）计算加权准确率：当问题回答正确且指代定位准确时，该组合被视为正确，其权重由重叠度 $d(q_{j}^{(i)},r_{k}^{(i)})$ 决定。需注意的是，若 $q_{j}^{\left(i\right)}$ 与 $r_{k}^{(i)}$ 不存在共同视觉概念（即 $d(q_{j}^{(i)}, r_{k}^{(i)}) = 0$），则该组合的正确性不影响总体指标。
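A minimal sketch of this consistency metric follows. Several details are assumptions of the sketch rather than statements from the paper: words are compared case-insensitively as distinct types, part-of-speech tags are supplied by the caller using Penn Treebank-style prefixes ("NN" for nouns, "JJ" for adjectives), and the weighted accuracy is normalized by the sum of the overlaps. The function names are hypothetical.

```python
def concept_overlap(question_tags, reference_tags):
    """d(q, r): number of distinct nouns/adjectives shared by a question and a reference.

    Each argument is a list of (word, pos) pairs produced by any part-of-speech tagger.
    """
    def concepts(tags):
        return {word.lower() for word, pos in tags if pos.startswith(("NN", "JJ"))}

    return len(concepts(question_tags) & concepts(reference_tags))


def weighted_consistency(pairs):
    """Weighted accuracy over (overlap, question_correct, referent_localized) triples.

    A pair counts as correct only if both the answer and the localization are correct.
    Pairs with zero overlap contribute nothing to the numerator or the denominator.
    """
    total = 0
    hit = 0
    for overlap, question_correct, referent_localized in pairs:
        total += overlap
        if question_correct and referent_localized:
            hit += overlap
    return hit / total if total else 0.0
```

For the example in the text, the question “What color is the largest dog?” and the expression “largest dog” share the concepts "largest" and "dog", so that pair would carry a weight of 2 under these assumptions.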

We evaluate our Single-Task (ST), All-Task (AT), and finetuned from All-Task ($\mathrm{AT->ST}$) models on the proposed metric. AT outperforms ST ($58.30\%$ vs. $55.40\%$) and $\mathrm{AT->ST}$ achieves the best performance ($64.64\%$). This shows that our model trained on multiple tasks achieves better visual grounding consistency across different tasks. Further analysis can be found in the supplement.

我们评估了单任务 (ST)、全任务 (AT) 以及从全任务微调 ($\scriptstyle({\mathtt{A T}}->{\mathtt{S T}})$) 的模型在提出的指标上的表现。AT 始终优于 ST (55.40% vs. 58.30%)，而 AT->ST 取得了最佳性能 (64.64%)。这表明我们在多任务上训练的模型在不同任务间实现了更好的视觉定位一致性。更多分析详见补充材料。

Regularizing effects of multi-task learning. We find multi-task training to have a regularizing effect on tasks which overfit when trained separately. In Fig. 4 we plot the training and validation curves for two tasks (SNLI-VE and Flickr Grounding) where single task training overfits quickly. On the other hand when trained in a multi-task setup with all other tasks, the validation score improves and

多任务学习的正则化效应。我们发现多任务训练对


原文地址：https://arxiv.org/pdf/1912.02315v2