OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
Abstract
We present a novel multimodal multitask network and associated training algorithm. The method is capable of ingesting data from approximately 12 different modalities, namely image, video, audio, text, depth, point cloud, time series, tabular, graph, X-ray, infrared, IMU, and hyperspectral. The proposed approach utilizes modality-specialized tokenizers, a shared transformer architecture, and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in the respective modalities. We propose a novel pre-training strategy with iterative modality switching to initialize the network, and a training algorithm that trades off fully joint training over all modalities against training on pairs of modalities at a time. We provide comprehensive evaluation across 25 datasets from 12 modalities and show state-of-the-art performance, demonstrating the effectiveness of the proposed architecture, pre-training strategy and adapted multitask training.
1. Introduction
Extracting meaningful representations from data is a central task in machine learning. The majority of proposed approaches are usually specialized for specific modalities and tasks. The development of methods capable of handling multiple modalities in a holistic way has been an active topic of research recently [21, 34, 35, 45, 63, 105]. Multi-task learning has a large body of literature [10], but has traditionally been limited to tasks from a single modality. Learning a unified network that trains shared parameters across diverse tasks in different modalities, like image, video, depth maps, and audio, has been shown to be more robust, generalize better, and reduce overfitting to a single task or modality compared to unimodal networks [1, 24]. Such joint learning also enables more efficient use of available labeled data across various modalities, potentially reducing the need for extensive labeling in specific modalities for particular tasks.

In the present work, we extend this line of research and propose a multimodal multitask method which learns embeddings in a shared space across different modalities and then employs task-specific sub-networks for solving specific tasks in specific modalities. The method utilizes a common transformer-based bottleneck block to map the input to embeddings in a shared space, thus incorporating knowledge from multiple tasks associated with different respective modalities. This structure leads to learning of very robust representations, informed and regularized by all tasks and modalities together. The embeddings are then used by the task heads to make the required predictions.
Previous research in generalized multimodal learning falls into three main categories. First, there are methods that process multiple heterogeneous modalities such as images, 3D, and audio directly, without using separate encoders for each modality, learning representations directly from these inputs [34, 35]. Second, some approaches use modality-specific encoders and then learn generalized embeddings, for data from each modality, based on a unified objective in the latent space [5]. Third, there are methods focused on knowledge sharing across different modalities, employing either a single common encoder [21] or distinct encoders for each modality [1]. Our work aligns more closely with the third type of approach, while incorporating elements from the first. We employ modality-specific tokenizers and encoders, and a shared transformer bottleneck backbone. Tokenization is tailored to each modality, drawing inspiration from the Uni-Perceiver model but with key modifications detailed in Sec. 3. After tokenization, a transformer-based network is used to obtain initial representations for the modalities, which are passed through fully connected layers and then fused together with a cross-attention module. The fused representation then passes through the transformer backbone. The features from the transformer are then individually fused with the original modality features using cross attention and are in turn fed to the modality-specific task heads.
The training procedure involves a dual-stage masked pre-training and a full task-based loss optimization. The first stage of masked pre-training is standard unsupervised masked pre-training with one modality at a time. The second stage of masked pre-training involves masked pre-training with pairs of modalities at a time, employing a two-stream setup as shown in Fig. 1. In this stage two modalities are used together, tokens are randomly masked, and the full network is used to predict the masked tokens from the unmasked tokens of both modalities together. This allows for knowledge sharing across all modalities as the training proceeds, by randomly sampling training batches from two of the available modalities at a time. The final training step is then training for multiple tasks for different modalities. This is done similarly to the second stage of masked pre-training, i.e. pairs of modalities are sampled, and a pair of tasks is sampled, one from each modality. Training batches are then constructed, half each from the two modality-task pairs. These are then used to optimize standard losses corresponding to the tasks, e.g. cross entropy for classification and $\ell_2$ loss for pixelwise prediction. The pre-training and final task training using pairs of modalities is the key component of the training strategy that enables cross-modal knowledge sharing across all modalities together, which we discuss further in the following.
In summary, the contributions of the work are as follows. (i) We propose a multimodal multitask network based on transformer architectures with modality-specific tokenizers, a shared backbone, and task-specific heads. (ii) We provide comprehensive empirical results on 25 benchmark datasets over 12 distinct modalities, i.e. text, image, point cloud, audio and video, along with applications to X-ray, infrared, hyperspectral, IMU, graph, tabular, and time-series data. The method achieves better than or close to state-of-the-art performance on these datasets. (iii) We propose a novel multimodal pre-training approach that alternates between pairs of modalities to enable cross-modal knowledge sharing. (iv) We propose a multimodal and multitask supervised training approach to leverage knowledge sharing between modalities for robust learning, simplifying the complex processes proposed in previous works on modality integration, e.g. [45, 94].
2. Related Works
In this section, we discuss works and paradigms similar to ours.
Multi-modal methods. Contemporary multi-modal methods predominantly employ modality-specific feature encoders [2, 36, 37, 63, 85], focusing on fusion techniques within their architectural designs. These networks usually vary across modalities, necessitating architectural modifications for combined usage. They must address challenges related to feature fusion timing, fine-tuning, and pre-training [87]. Such complexities restrict the adaptability of universal frameworks like transformers to diverse domains, including point clouds, audio, and images.
Common network for multiple modalities. A growing body of research aims to learn from multiple modalities without modality-specific encoders [5, 7, 21, 35]. Notably, architectures like the Perceiver [7, 34, 35] employ cross-attention among latent queries to process multiple modalities together. The Hierarchical Perceiver [7] expands on this by structuring the input while maintaining locality. Other approaches, such as data2vec [5], use modality-specific encoders. Omnivore [21], with a common encoder, is limited to visual modalities only. Contrarily, VATT [1] employs a unified transformer backbone but processes each modality independently. These multi-modal methods have demonstrated enhanced robustness [1, 24].
Multi-task learning. As explored in the preceding section, there has been a surge in methods that process multiple modalities. PerceiverIO [34] extends the capabilities of Perceiver [35] to facilitate learning multiple tasks with a single network architecture. Although PerceiverIO is capable of multitasking, often separate networks are employed [98]. Various techniques [5, 11, 21, 32, 59] learn from raw representations of multiple modalities and are applicable to numerous tasks.
Multi-modal masked pre-training. Approaches such as [50, 79, 88] implement masked pre-training. This technique has proven beneficial for improving the performance of deep networks across different modalities and tasks [1, 4, 5, 20, 28, 95].
Comparison to similar works. We draw motivation from Uni-Perceiver [105], MetaFormer [16] and OmniVec [68]. Unlike the Uni-Perceiver line of methods, we do not use a unified task head definition; similar to it, however, we use task-specific task heads. This allows our method to learn more robust representations and leverage fine details from each task depending upon its complexity, which is important as each modality has a distinct notion of complexity. For example, in vision, classification is a relatively simpler task than segmentation, as segmentation forces networks to learn pixel-level attention and better neighbourhood relationships [27, 69]. Further, MetaFormer uses unified tokenizers; we instead utilize modality-specific tokenizers. Our experiments indicate that modality-specific tokenizers perform better than MetaFormer's unified tokenizer when training on multiple modalities. Further, OmniVec uses separate encoders for each modality, which makes the network heavy and computationally expensive. In contrast, we use modality-specific tokenizers with a shared backbone. Additionally, unlike other works, we train on multiple modalities in a multitask manner, allowing the network to learn from multiple modalities with varying task complexities simultaneously.
3. Approach
Overview. We are interested in multimodal multitask learning. Say we have modalities indexed by $m \in [1, M]$, and each modality has $T$ tasks indexed by $t \in [1, T]$. Note that here we assume the same number of tasks for all modalities for notational convenience; in practice different modalities have different numbers of tasks. Examples of a modality and its task could be classification into categories for the point cloud modality, and dense pixel-wise segmentation for the image modality. We are interested in jointly learning classifiers $\phi_{mt}(\cdot \mid \theta_{mt})$ which take inputs $x_m$ from modality $m$ and make predictions for task $t$, with $\theta_{mt}$ being the respective parameters. We assume that the learning is to happen by loss minimization, where $\ell_{mt}(\cdot)$ denotes the loss for task $t$ on modality $m$. Examples of such losses are the cross entropy loss for classification tasks, and the $\ell_2$ loss for dense image prediction tasks such as image segmentation. We would like to solve the following optimization,

$$\min_{\Theta} \; \sum_{m=1}^{M} \sum_{t=1}^{T} \sum_{(x,\, y) \in \tau_{mt}} \ell_{mt}\big(\phi_{mt}(x \mid \theta_{mt}),\, y\big) \qquad (1)$$
where $\Theta = \{\theta_{mt}\}_{m,t}$ are the parameters of all the predictors, and $\tau_{mt}$ is the training set provided for task $t$ of modality $m$. This is the extension of multi-task learning to multiple modalities as well.
We present a network and an associated unsupervised pre-training and supervised training algorithm for the above task of multimodal multitask learning. The network consists of $M$ modality-specific tokenizers, followed by common feature transformation and feature fusion networks built with transformers, with cross-attention modules in between, denoted by $f(\cdot), g(\cdot)$ in Fig. 1. The final part of the network is the $M \times T$ task-specific prediction heads, denoted by $h_{mt}(\cdot)$ for task $t$ on modality $m$, which provide the final outputs for the tasks. At inference the prediction function is the composition of the three functions, i.e. $\phi(x) = h_{mt} \circ g \circ f(x)$ where $x$ is the tokenized form of the input. While training, we sample a pair of modalities from all the available modalities, and then sample one task each for the sampled modalities. We then construct a training batch, half from each sampled task. Once tokenization is done, the features $x_i, x_j$ are passed into the first feature transformation subnetwork to obtain $f(x_i), f(x_j)$. These are then passed through the cross-attention module to fuse them together. The fused features are then input to the second part of the network, i.e. $g(\cdot)$. The output $\hat{x}_{ij} = g \circ A(f(x_i), f(x_j))$, where $A(\cdot)$ is the cross-attention function, is then again fused with the respective input features $x_i, x_j$. These features, i.e. $A(\hat{x}_{ij}, x_i), A(\hat{x}_{ij}, x_j)$, are then fed to the task predictors $h_{iq}$ and $h_{jr}$ to obtain the final predictions for tasks $q, r$ on modalities $i, j$ respectively. The sum of losses $\ell_{iq} + \ell_{jr}$ is then minimized for the current batch by back propagation. Thus the learning proceeds by optimizing pairs of losses at a time, to stochastically minimize the sum over all the losses.
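To make the data flow concrete, the following is a minimal PyTorch-style sketch of one such pairwise training step. The module names (`CrossAttention`, `f`, `g`, the task heads and losses), the residual form of the fusion, and the assumption of a common embedding dimension are illustrative; this is not the actual implementation.

```python
# Minimal sketch of one pairwise supervised step; module names and the
# residual cross-attention fusion are assumptions for illustration.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """A(q, kv): fuse a query token sequence with a key/value token sequence."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(q, kv, kv)      # queries attend to the other stream
        return self.norm(q + fused)          # residual fusion (assumed)

def pairwise_step(x_i, x_j, y_i, y_j, f, g, A, h_iq, h_jr, loss_iq, loss_jr):
    """x_i, x_j: tokenized batches from modalities i and j (same embedding dim)."""
    z_i, z_j = f(x_i), f(x_j)                # shared feature transformation f(.)
    x_hat = g(A(z_i, z_j))                   # fuse the two streams, then backbone g(.)
    pred_i = h_iq(A(x_hat, x_i))             # re-fuse with modality i's own tokens
    pred_j = h_jr(A(x_hat, x_j))             # re-fuse with modality j's own tokens
    return loss_iq(pred_i, y_i) + loss_jr(pred_j, y_j)   # l_iq + l_jr, backpropagated
```

In the full method the batch is built half from each modality-task pair; the sketch shows only the forward computation and the summed loss.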
Along with the supervised multimodal joint training explained above, the learning also consists of two stages of unsupervised masked pre-training, with the first stage being unimodal and the second stage being multimodal pre-training, to achieve knowledge sharing between tasks and modalities, leading to regularized and robust predictors. We now present each of the components and the full training algorithm in detail.
3.1. Network components
We now go through the network components sequentially from input to output.
Tokenizers. Each modality is tokenized using a modality-specific tokenizer. The tokenizers are similar to those used in Uni-Perceiver [45]; however, instead of attaching an embedding to the tokens, we provide the transformer with one modality at a time. Further, Uni-Perceiver passes a combination of tokens from multiple modalities to a single transformer. This limits Uni-Perceiver to a small set of modalities, i.e. text, image and video. Our method does not suffer from any such limitation. The details of the specific tokenizers for the different modalities are provided in the Supplementary.
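As an illustration, two such tokenizers might look as follows; the patch size, vocabulary size, and embedding dimension are assumptions of the sketch, and the actual tokenizers are specified in the Supplementary.

```python
# Illustrative modality-specific tokenizers; hyperparameters are assumptions.
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:    # (B, C, H, W)
        return self.proj(img).flatten(2).transpose(1, 2)     # (B, num_patches, dim)

class TextTokenizer(nn.Module):
    """Map sub-word token ids to embeddings in the same space."""
    def __init__(self, vocab_size: int = 30522, dim: int = 768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:    # (B, L) long tensor
        return self.emb(ids)                                  # (B, L, dim)
```

Since each tokenizer emits token sequences with a common embedding dimension, the shared transformer can be fed one modality at a time without modality-type embeddings.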
Feature transformation network. Once the inputs are tokenized, they are passed through a transformer network. While the method can utilize any transformer backbone, in the current implementation we use a transformer based on BERT [13]. Here, the multi-head attention involves standard self-attention [76], with GeLU [30] activation prior to the MLP layer. The output of the transformer network is passed to a fully connected neural network with three fully connected layers and ReLU activations. This transformer network, along with the fully connected layers, is denoted as $f(\cdot)$ in Fig. 1. The network could be used without the fully connected layers; we added them to reduce the dimensionality of the features so that the computational complexity of the remaining part of the network is reduced.
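A minimal sketch of such a feature transformation network is given below, assuming a BERT-base-like encoder; the depth, widths, and the reduced dimension are illustrative assumptions, not reported values.

```python
# Sketch of f(.): BERT-style encoder + three fully connected layers with ReLU.
import torch.nn as nn

class FeatureTransform(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 12, heads: int = 12,
                 reduced: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True)       # self-attention + GeLU MLP
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.reduce = nn.Sequential(                    # three FC layers with ReLU
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, 384), nn.ReLU(),
            nn.Linear(384, reduced), nn.ReLU())

    def forward(self, tokens):                          # (B, N, dim)
        return self.reduce(self.encoder(tokens))        # (B, N, reduced)
```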
Mixing features with cross attention. When training, we fuse the features from the two transformer streams, corresponding to the two modalities, with a cross-attention module. The fused output features are then passed to another transformer network, denoted as $g(\cdot)$ in Fig. 1. The architecture of this transformer is the same as that of the transformers used in the feature transformation network.
Figure 1. Overview of the proposed method. The proposed method consists of three parts: the feature transformation network $f(\cdot)$, which consists of a transformer followed by fully connected layers to reduce feature dimensions; another transformer $g(\cdot)$; and finally the task prediction heads $h_{mt}(\cdot)$ for task $t$ on modality $m$. The input data is tokenized with the corresponding modality-specific tokenizer. While training, pairs of modalities are used and the features are fused between the two modalities using cross-attention layers, in a two-stream configuration as shown here. While making predictions, the network is a single stream with the cross-attention layers removed, and the output is $h_{mt} \circ g \circ f(x)$ where $x$ is the output of the corresponding modality-specific tokenizer.
Modality and task specific heads. The final part of the network is the modality- and task-specific heads, denoted as $h_{mt}(\cdot)$ in Fig. 1. These task heads take as input the features from the respective modality streams fused, via a cross-attention module, with the features from the network above. Each task head consists of a vanilla ViT-Tiny network [82].
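A classification head in this spirit could be sketched as below; treating the ViT-Tiny trunk as a few small transformer encoder layers, and the pooling and classifier choices, are assumptions of the sketch.

```python
# Sketch of a task head h_mt(.): a small ViT-Tiny-like transformer trunk
# followed by a classifier; widths, depth, and pooling are assumptions.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, in_dim: int = 256, dim: int = 192,
                 depth: int = 4, heads: int = 3, num_classes: int = 1000):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)              # match the fused feature dim
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # (B, N, in_dim)
        feats = self.trunk(self.proj(tokens))
        return self.classifier(feats.mean(dim=1))               # mean-pool then classify
```

A dense-prediction head would replace the pooled classifier with a per-token decoder, following the same pattern.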
3.2. Training
The training is done in three steps: (i) masked pre-training iterating over modalities but doing masked prediction with one modality at a time, (ii) multimodal masked pre-training where two modalities are simultaneously used to do masked prediction for each, and (iii) finally, supervised task-based training.
Stage 1 masked pre-training. The first step in training is self-supervised pre-training of the transformer in the feature transformation network. We follow earlier works [1, 20, 68] and add a decoder for predicting masked tokens. Specifically, for an input modality with $P$ patches, we randomly mask $P_m$ patches, and feed the non-masked patches and their positions to an encoder network attached in addition to the feature transformer. Further, we iterate between modalities while keeping the transformer network common, so that it learns to work with all modalities. Once this stage is complete, we discard the added decoder and keep only the encoder transformer.
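A simplified sketch of one Stage 1 step for a patch-based modality follows; the decoder interface, the handling of positions and mask tokens, and the loss normalization follow common MAE-style practice and are assumptions rather than the exact recipe.

```python
# Stage 1 sketch: mask patches, encode visible ones with the shared encoder,
# decode with a temporary decoder, and regress the masked patches with l2 loss.
import torch
import torch.nn as nn

def stage1_step(tokens, encoder, decoder, mask_ratio=0.95):
    """tokens: (B, P, D) patch embeddings of one modality."""
    B, P, D = tokens.shape
    num_masked = int(mask_ratio * P)
    perm = torch.rand(B, P).argsort(dim=1)                 # random mask per sample
    masked_idx, visible_idx = perm[:, :num_masked], perm[:, num_masked:]

    gather = lambda t, idx: t.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
    visible = gather(tokens, visible_idx)                  # only unmasked patches
    latent = encoder(visible)                              # shared feature transformer

    # The (hypothetical) decoder receives the encoded visible patches and the
    # patch positions, and predicts the masked patches.
    pred = decoder(latent, visible_idx, masked_idx)        # (B, num_masked, D)
    target = gather(tokens, masked_idx)
    return nn.functional.mse_loss(pred, target)            # l2 reconstruction loss
```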
Stage 2 masked pre-training. We engage the full network, except the task-specific prediction heads. We take two inputs from two different modalities and pass them through the network up to just before the task prediction heads. Instead of the task prediction heads, we add decoders to predict the masked tokens for the respective input modalities. This process involves decoding the modalities in parallel, utilizing the outputs from the cross-attention modules and the modality-specific feature vectors. This alternation over modality pairs is key to achieving effective multimodal masked pre-training. Here also, we randomly mask tokens for both modalities. Task balancing is not employed in this pre-training stage. Such a multi-task, multi-modality approach allows us to utilize unpaired data across modalities. As in Stage 1 pre-training, once this stage of training is finished, we discard the added decoders and keep the trained networks $f, g$.
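A sketch of one Stage 2 step is given below; the masking helper, the cross-attention function, and the per-modality decoders are hypothetical stand-ins, and only the overall two-stream, two-decoder structure follows the description above.

```python
# Stage 2 sketch: two masked modalities pass through f, are fused with cross
# attention and g, and two temporary decoders reconstruct each modality's
# masked tokens. `random_mask`, `A`, and the decoders are illustrative.
import torch.nn.functional as F

def stage2_step(tok_i, tok_j, f, g, A, dec_i, dec_j, random_mask):
    """tok_i, tok_j: token sequences from two different (possibly unpaired) modalities."""
    vis_i, tgt_i = random_mask(tok_i)        # visible tokens and masked targets
    vis_j, tgt_j = random_mask(tok_j)

    z_i, z_j = f(vis_i), f(vis_j)            # shared feature transformation
    x_hat = g(A(z_i, z_j))                   # cross-attention fusion + backbone

    # Each decoder sees the fused representation together with its own
    # modality-specific features, and predicts that modality's masked tokens.
    rec_i = dec_i(A(x_hat, z_i))
    rec_j = dec_j(A(x_hat, z_j))
    return F.mse_loss(rec_i, tgt_i) + F.mse_loss(rec_j, tgt_j)
```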
3.2.1 Multimodal multitask supervised training
In the final part of the training, we train for two tasks at a time, from two different modalities. This lets us stochastically minimize the loss function in Eq. 1, minimizing the sum of two losses at a time instead of the sum over all of them. When we use two modalities, we use the network as shown in Fig. 1 in a two-stream configuration, with the two modality features being fused together in the middle, passed through the transformer $g(\cdot)$ and then fused back with themselves, before finally being input to the task prediction heads. Such fusion of the features from the two modalities leads to knowledge sharing between the tasks of different modalities and makes the learning robust and regularized.
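Putting this together, the supervised stage can be sketched as the following sampling loop; the loader bookkeeping, the model's calling convention, and the half-and-half batch construction are assumptions of the sketch rather than the released training code.

```python
# Sketch of the supervised multimodal multitask loop: sample a pair of
# modalities, one task per modality, draw half a batch from each, and
# optimize the sum of the two task losses.
import random

def supervised_training(modalities, tasks, loaders, losses, model, optimizer, steps):
    """tasks[m]: list of tasks for modality m; loaders[(m, t)]: batch iterator."""
    for _ in range(steps):
        i, j = random.sample(modalities, 2)          # pair of distinct modalities
        q, r = random.choice(tasks[i]), random.choice(tasks[j])

        x_i, y_i = next(loaders[(i, q)])             # half batch from (i, q)
        x_j, y_j = next(loaders[(j, r)])             # half batch from (j, r)

        pred_i, pred_j = model(x_i, x_j, i, j, q, r) # two-stream forward (Fig. 1)
        loss = losses[(i, q)](pred_i, y_i) + losses[(j, r)](pred_j, y_j)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```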
Given the varying complexities of these task pairs, as underscored in previous research [17], we found it essential to balance the complexity of tasks in a multitask learning setting. Hence, we train while employing standard task balancing techniques: we adjust the loss magnitude for each task based on its convergence rate. As our ablation studies will demonstrate, this approach allows for random pairing of modalities, in contrast to the need for selecting specific pairs as suggested in prior works [45, 68, 94, 105]. We give details of the task balancing in the Supplementary material.
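Since the exact balancing scheme is deferred to the supplementary, the snippet below only illustrates one common convergence-rate-based weighting, scaling each task's loss weight by how much its loss has recently improved; it is an assumption, not the paper's formula.

```python
# Hypothetical convergence-rate based task weighting: tasks that are still
# improving quickly are down-weighted relative to tasks that have stalled.
from collections import deque

class TaskBalancer:
    def __init__(self, window: int = 50, eps: float = 1e-8):
        self.history = {}                    # task id -> recent loss values
        self.window, self.eps = window, eps

    def weight(self, task_id, loss_value: float) -> float:
        hist = self.history.setdefault(task_id, deque(maxlen=self.window))
        hist.append(loss_value)
        if len(hist) < 2:
            return 1.0
        # Relative improvement over the window; slower convergence -> larger weight.
        improvement = max(hist[0] - hist[-1], 0.0) / (hist[0] + self.eps)
        return 1.0 / (improvement + 1.0)
```

The weighted losses `w_iq * l_iq + w_jr * l_jr` would then replace the plain sum in the supervised step above.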
3.2.2 Masked pre-training for different modalities
We follow best practices from existing works when pre-training with the different modalities. We use the image, video, text, audio and 3D point cloud modalities for masked pre-training. We employ a consistent masking approach across the visual and auditory modalities. We follow [65] for textual data, utilizing random sentence permutation [90]. We designate a fraction $f$ of tokens for prediction, following the 8:1:1 token masking ratio of BERT [13]. Our primary goal is to reduce the discrepancy between the input and the outputs of the decoder. For inputs such as images, videos, point clouds, and audio spectrograms, we aim to minimize the $\ell_2$ distance between the predicted and actual target patches. Normalization to zero mean and unit variance is applied to visual inputs. For textual data, we utilize the permuted language modeling objective of XLNet [90].
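For the text branch, BERT's 8:1:1 corruption of a designated fraction $f$ of tokens can be sketched as follows; the mask-token id and vocabulary size are placeholders for the sketch.

```python
# Sketch of BERT-style 8:1:1 corruption of a fraction f of text tokens:
# 80% become [MASK], 10% a random token, 10% are kept unchanged.
import torch

def bert_mask(ids: torch.Tensor, f: float, mask_id: int, vocab_size: int):
    """ids: (B, L) token ids. Returns corrupted ids and a boolean prediction mask."""
    ids = ids.clone()
    predict = torch.rand_like(ids, dtype=torch.float) < f      # fraction f of tokens
    roll = torch.rand_like(ids, dtype=torch.float)

    replace_mask = predict & (roll < 0.8)                      # 80% -> [MASK]
    replace_rand = predict & (roll >= 0.8) & (roll < 0.9)      # 10% -> random token

    ids[replace_mask] = mask_id
    ids[replace_rand] = torch.randint(
        vocab_size, (int(replace_rand.sum()),), device=ids.device)
    return ids, predict                                        # predict marks targets
```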
3.2.3 Inference
When doing prediction, the network is used as a single stream, without the cross-attention layers in Fig. 1. The input data is tokenized with the tokenizer for its modality, passed through the feature transformation network $f(\cdot)$ followed by the second transformer $g(\cdot)$, and finally input to the task prediction head $h_{mt}(\cdot)$, i.e. the full forward pass is $h_{mt} \circ g \circ f(x)$ where $x$ is the output of the tokenizer.
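Inference therefore reduces to a single-stream composition; a minimal sketch with illustrative module names:

```python
# Single-stream inference sketch: tokenize, then h_mt(g(f(x))); no cross attention.
import torch

@torch.no_grad()
def predict(raw_input, tokenizer, f, g, h_mt):
    x = tokenizer(raw_input)        # modality-specific tokenizer output
    return h_mt(g(f(x)))            # full forward pass h_mt o g o f
```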
4. Experimental results
Masked pre-training. We use AudioSet (audio) [19], Something-Something v2 (SSv2) (video) [25], English Wikipedia (text), ImageNet1K (image) [12], SUN RGB-D (depth maps) [66], and ModelNet40 (3D point cloud) [84] for pre-training the network. For Stage 1 of masked pre-training (Sec. 3.2), we use the samples from the training sets of the respective datasets. For Stage 2 of masked pre-training, we randomly select two modalities and sample data from them to pre-train the full network. Further, we randomly mask patches. For image, video and audio, we mask 95% of the patches. For point cloud and text, we mask 90% and 95% of the patches respectively. We perform pre-training for 3000 epochs. We use the fraction $f$ as 5%.
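For reference, the per-modality masking configuration described above can be summarized as a simple mapping; this is a sketch of how one might encode these settings, as interpreted from the numbers above, not released code.

```python
# Pre-training masking configuration, as interpreted from the text above.
PRETRAIN_CONFIG = {
    "mask_ratio": {
        "image": 0.95, "video": 0.95, "audio": 0.95,
        "point_cloud": 0.90, "text": 0.95,
    },
    "prediction_fraction_f": 0.05,   # fraction of tokens designated for prediction
    "epochs": 3000,
}
```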
Downstream tasks. We train the model on downstream tasks and report results. The datasets used for the single-modality methods are iNaturalist-2018 [75] (Image Recognition), Places-365 [100] (Scene Recognition), Kinetics-400 [38] (Video Action Recognition), Moments in Time [53] (Video Action Recognition), ESC50 [57] (Audio Event Classification), S3DIS [3] (3D Point Cloud Segmentation), and DialogSum [9] (Text Summarization).
Adaptation on unseen datasets. To assess our method's adaptability to datasets not seen during training, we report comparisons on image classification on Oxford-IIIT Pets [56], action recognition in videos using UCF-101 [67] and HMDB51 [41], 3D point cloud classification on ScanObjectNN [74], point cloud segmentation on NYU v2 seg [64], and text summarization using the SamSum dataset [22]. As the number of classes and labels differs in each dataset compared to the datasets used during pre-training, we randomly sample 10% of the data from each training set. Further, we extract the embeddings using the pre-trained network, and train two fully connected layers with task-specific loss functions. This allows us to demonstrate the ability of the proposed method to generate embeddings which generalize across datasets.
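This adaptation protocol can be sketched as a lightweight probe on top of the frozen pre-trained network; the probe width, pooling, optimizer, and the classification loss shown are assumptions for illustration, and dense tasks would use their own task loss.

```python
# Sketch of adaptation to an unseen dataset: freeze the pretrained network,
# extract embeddings, and train two fully connected layers with the task loss.
import torch
import torch.nn as nn

def adapt(pretrained, tokenizer, loader, num_classes, emb_dim=256, epochs=10):
    probe = nn.Sequential(nn.Linear(emb_dim, 512), nn.ReLU(),
                          nn.Linear(512, num_classes))
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()                     # task-specific loss

    pretrained.eval()                                     # backbone stays frozen
    for _ in range(epochs):
        for raw, labels in loader:                        # 10% subsample of train set
            with torch.no_grad():
                emb = pretrained(tokenizer(raw)).mean(dim=1)   # pooled embedding
            loss = criterion(probe(emb), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```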
Cross domain generalization. We follow prior work [1] and evaluate video-text retrieval on two benchmark datasets, i.e. YouCook2 [104] and MSR-VTT [86], for multiple modalities.
Adaptation on unseen modalities. We also evaluate our method on unseen modalities. Specifically, we evaluate our method on the following: (i) X-ray scan and hyperspectral data recognition, where we utilize the RegDB [54], Chest X-Ray [62], and Indian Pines datasets. (ii) Time-series forecasting, where our experiments are based on the ETTh1 [103], Traffic, Weather, and Exchange datasets [42]. (iii) Graph understanding through the PCQM4M-LSC dataset [33], which comprises 4.4 million organic molecules with quantum-mechanical properties, focusing on predicting molecular properties with applications in drug discovery and material science. (iv) Tabular analysis, where we engage with the Adult and Bank Marketing datasets from the UCI repository. (v) IMU recognition, where we conduct experiments on IMU sensor classification using the Ego4D dataset [26], assessing the capability to understand inertial motion systems. We follow [16] for the train/test splits and evaluation metrics on these datasets. Further, we use modality-specific tokenizers and follow similar network settings as for generalization to unseen datasets.
We provide more details on the tokenizers used for each modality, a description of the task heads, and the formulations of the loss functions in the supplementary material.
Table 1. iNaturalist-2018 and Places365 top-1 accuracy.

| Method | iN2018 | P365 |
|---|---|---|
| Omni-MAE [20] | 78.1 | 59.4 |
| Omnivore [21] | 84.1 | 59.9 |
| EfficientNetB8 [71] | 81.3 | 58.6 |
| MAE [29] | 86.8 | |
| MetaFormer [94] | 87.5 | 60.7 |
| InternImage [77] | 92.6 | 61.2 |
| OmniVec [68] | 93.8 | 63.5 |
| Ours | 94.6 | 65.1 |
Table 2. Kinetics-400 top-1 accuracy.

| Method | K400 |
|---|---|
| Omnivore [21] | 84.1 |
| VATT [1] | |
| UniFormerV2 [46] | 90.0 |
| InternVideo [78] | 91.1 |
| TubeViT [58] | 90.9 |
| OmniVec [68] | 91.1 |
| Ours | 93.6 |
Table 3. Moments in Time top-1 accuracy.

| Method | MIT |
|---|---|