MARLIN: Masked Autoencoder for facial video Representation LearnINg
Abstract
This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available non-annotated web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, performing consistently well across a variety of downstream tasks including FAR ($1.13%$ gain over supervised benchmark), FER ($2.64%$ gain over unsupervised benchmark), DFD ($1.86%$ gain over unsupervised benchmark), and LS ($29.36%$ gain in Fréchet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
1. Introduction
Facial analysis tasks [34, 43, 70, 85] provide essential cues for human non-verbal behavior analysis and help unfold meaningful insights regarding social interaction [36], communication [40], and cognition [68], with potential applications in the Human-Computer Interaction (HCI) and Affective Computing domains. Recently, we have witnessed significant progress in deep neural network models for facial analysis tasks such as Facial Attribute Recognition (FAR) [34, 85], Facial Expression Recognition (FER) [48], DeepFake Detection (DFD) [70], and Lip Synchronization (LS) [43]. While these deep models can achieve remarkable performance, they often require large-scale annotated datasets, which is not only a resource-expensive and time-consuming process but also infeasible for some applications requiring domain expertise for annotation (e.g. FER).

Figure 1. Overview of the proposed Masked Autoencoder for facial video Representation LearnINg, aka MARLIN. MARLIN aims to learn a universal facial representation from abundantly available non-annotated facial video data.
To this end, self-supervised pre-training [26, 37, 71] has lately emerged as an effective strategy to address the limitations of fully supervised methods, as it enables generic representation learning from non-annotated data that can then be transferred across tasks having limited labels. For images of natural scenes and objects, self-supervised learning approaches using self-distillation [14], contrastive learning [18, 19], solving pretext tasks such as jigsaw puzzles [53], and, more recently, autoencoding [37, 71] have even outperformed supervised learning approaches.
Despite the promise offered by these self-supervised methods in learning scalable and generic representations for natural scene images and videos, they have not yet been investigated for learning representations from facial video data. Facial representation learning requires tracking fine-grained face-specific details, which might not be perfectly captured by linear tube masking [71]. Until now, most existing approaches to facial analysis tasks are highly specialized and develop task-specific models trained in a fully supervised manner [46, 54, 63], with very few recent efforts towards learning generic image-based facial encodings [10, 84]. These closely related works [10, 84] either focus on exploring training dataset properties in terms of size and quality [10] or perform pre-training in a visual-linguistic way [84]. They are hard to scale since they use static image-level facial information, and the image-caption pairs are highly associated with context information rather than the face.
In this paper, our goal is to learn universal and task-agnostic representations in a self-supervised manner for face-related downstream tasks (see Fig. 1). For this purpose, we employ a masked autoencoder [37, 71] with a facial-guided masking strategy that learns to reconstruct spatio-temporal details of a face from densely masked facial regions using non-annotated videos. Unlike existing approaches for natural scene videos [71], where the tube masking is initialized on a static part of the video without any semantic information, our approach dynamically tracks the face and then develops a facial part-guided tube masking strategy using an off-the-shelf face parser, i.e. FaceX-Zoo [75]. Thus, we pose a more challenging task that encourages the model to learn spatio-temporal representations covering local as well as global information. Inspired by prior works [27, 60] showing high-quality reconstruction results along with rich and generic latent features, we incorporate an adversarial loss on top of the masked encoding to enhance reconstruction quality. Our experimental results show that our proposed framework, MARLIN, learns highly generic facial encodings that scale and transfer well across diverse facial analysis tasks such as FER, DFD, FAR, and LS, and achieves favorable performance gains w.r.t. state-of-the-art benchmarks. In summary, our main contributions are:
• We propose MARLIN, a universal and task-agnostic facial encoder that learns robust and transferable facial representations from abundantly available non-annotated web-crawled facial videos in a self-supervised fashion.
• As a challenging auxiliary task, we propose to reconstruct the spatio-temporal details of the face from densely masked facial regions. The proposed facial region-guided tube masking (aka Fasking) strategy aims to learn local and global aspects from facial videos, which in turn help encode generic and transferable features.
• Through extensive quantitative and qualitative analysis, we show that MARLIN learns rich, generic, transferable, and robust facial representations that perform consistently well across a variety of downstream tasks including FAR ($1.13%$ gain over supervised benchmark), FER ($2.64%$ gain over unsupervised benchmark), DFD ($1.86%$ gain over unsupervised benchmark), and LS ($29.36%$ gain in Fréchet Inception Distance), even in few-shot settings.
Table 1. Facial Analysis Tasks. Overview of different face-related tasks and the relevant datasets curated over the years.
| Datasets | #Samples | Env. | Fmt. | Task | Year |
|---|---|---|---|---|---|
| LFW [39] | 13,233 | Wild | Img. | Identification | 2008 |
| VGG-FACE [54] | 2.6M | Wild | Img. | Identification | 2015 |
| CelebA [50] | 202,599 | Wild | Img. | Attributes | 2015 |
| YouTubeFace [78] | 3,425 | Wild | Vid. | Identification | 2011 |
| LRS2 [22] | 144,482 | Wild | Vid. | Lip Sync. | 2017 |
| CelebV [79] | 5 | Wild | Vid. | Reenact | 2018 |
| CMU-MOSEI [83] | 23,453 | Wild | Vid. | Emo, Senti | 2018 |
| FaceForensics++ [62] | 1,004 | Wild | Vid. | DeepFake | 2019 |
| VoxCeleb2 [23] | 150,480 | Wild | Vid. | Speaker | 2018 |
| CelebV-HQ [85] | 55,666 | Wild | Vid. | Attribute | 2022 |
2. Related Work
Masked Auto Encoder. Masked autoencoders learn robust and transferable representations by reconstructing masked regions of the input. Masked autoencoding is motivated by context encoders [56] and denoising autoencoders [73]. After the success of BERT-style masking [26], the vision community has also explored different design choices for masked autoencoding, such as pixel-level masking [17, 37, 80], token-level masking [29], and deep-feature-based masking [6, 77], using vision Transformers [44, 52]. Similarly, for modeling spatio-temporal patterns of the input data, masked motion modeling [69] and tube masking [71] strategies have been incorporated recently. Along this line, MARLIN masks and reconstructs domain-specific facial parts to learn a universal facial representation.
Facial Representation Learning. To date, most existing facial analysis approaches are task-specific and fully supervised [46, 54, 63], relying on manually annotated data to enhance performance. Any state-of-the-art model's performance on benchmark datasets is impacted by the quality and quantity of annotated data used during training. Tab. 1 shows an overview of the task-specific large-scale facial image and video datasets that have been curated over the past decade [1] to facilitate research in Face Verification (LFW [39], MS-Celeb-1M [34], VGG-FACE [54], VGGFace2 [13]), Facial Attribute Recognition (CelebA [50], CelebV-HQ [85]), Facial Emotion Recognition (CMU-MOSEI [83]), DeepFake Detection (FF++ [62]), and Lip Synchronization (LRS2 [22]). However, data curation encounters several challenges, such as the need for specialized hardware (e.g. for FER and action unit data), discrepancies in data distribution that prevent merging of multiple datasets [10], and, most importantly, the time-consuming and resource-expensive annotation process. To eliminate these drawbacks, some existing approaches [20, 81, 82] adopt a data augmentation strategy via image or video synthesis, as the surge in face generation technology fueled by Generative Adversarial Networks (GANs) [20, 67, 81, 82] and other generation techniques [16, 35] enables realistic face generation, even with control over facial attributes. These generation techniques quantitatively add variation to the training set, but in some cases the results still lag qualitatively due to domain-specific inconsistencies and, more importantly, high network complexity.
To this end, very few recent works aim to learn image-based, task-specific facial encodings with limited supervision [3, 9, 10, 65, 84, 86]. The most closely related existing works [10, 84] either focus on exploring training dataset properties in terms of size and quality [10] or perform pre-training in a visual-linguistic way [84]. These works [10, 84] are hard to scale since they use static image-level facial information, and the image-caption pairs are highly correlated with context information rather than the face. In this work, we aim to develop a generic, universal, and task-agnostic facial encoder that learns from web-crawled non-annotated data. Our experimental analysis shows that MARLIN can align the latent space manifold to any desired downstream task-specific label space. Thus, MARLIN has the capability to act as a strong facial encoder or feature extractor in many low-resource real-world applications.
3. MARLIN
Our objective is to learn robust and transferable universal facial representations from abundantly available non-annotated facial video data [78]. Thinking holistically, face-specific tasks involve two different aspects: a) facial-appearance-related attributes, such as parts of the face (nose, eyes, lips, hair, etc.) and facial shape and texture, which mainly need spatial investigation; and b) facial actions, such as emotion, Facial Action Coding System (FACS) activity, and lip synchronization, which require temporal information. Thus, spatio-temporal modeling is highly desirable in order to learn strong, robust, and transferable representations. To this end, our proposed framework, MARLIN, adopts a facial-region-guided masking strategy which poses a challenging auxiliary reconstruction task for self-supervised representation learning (see Fig. 2). To facilitate learning with the masked autoencoder, we mainly use the YouTube Faces [78] dataset, which consists of facial videos crawled from YouTube with variation across different real-life conditions.
3.1. Facial Representation Learning
Preliminaries. MARLIN consists of an encoder $(\mathcal{F}_{\phi_{\mathcal{E}}})$, decoder $(\mathcal{F}_{\phi_{\mathcal{D}}})$ and discriminator $(\mathcal{F}_{\phi_{\Gamma}})$ with embedding parameters $\phi_{\mathcal{E}}$, $\phi_{\mathcal{D}}$ and $\phi_{\Gamma}$, respectively. We are given a training dataset $\mathbb{D}=\{V_{i}\}_{i=1}^{N}$, where $N$ is the number of videos in the dataset and $V\in\mathbb{R}^{C\times T_{0}\times H_{0}\times W_{0}}$ ($C, T_{0}, H_{0}, W_{0}$ are the channel, temporal depth, height and width of the raw video, respectively). From the raw input video $V$, we track and crop the facial regions [75], followed by random temporal sampling, yielding $v\in\mathbb{R}^{C\times T\times H\times W}$ ($T, H, W$ are the modified temporal depth, height, and width of the derived video clip, respectively). The derived video clip
$v$ is further mapped to $(\mathbf{k}-\mathbf{n})$ visible tokens $\tilde{X}_{v}$ and $\mathbf{n}$ masked tokens $\tilde{X}_{m}$ by the facial-region guided masking strategy $(\mathcal{F}_{\phi_{f}})$ with a pre-defined masking ratio $\mathbf{r}=\frac{\mathbf{n}}{\mathbf{k}}$. Here, $\mathbf{e}$ is the embedding dimension and $\mathbf{k}$ is the total number of tokens derived from $v$, i.e. $\mathbf{k}=\frac{T}{\mathbf{t}}\times\frac{H}{\mathbf{h}}\times\frac{W}{\mathbf{w}}$, given that the 3D cube tokens each have dimension $\mathbf{t}\times\mathbf{h}\times\mathbf{w}$. Thus, MARLIN injects facial-region-specific domain knowledge into the aforementioned token space to guide representation learning via masking.
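As a concrete check of the token arithmetic above, a short sketch with the default clip and cube sizes reported later in Sec. 4.1 (the helper function is illustrative, not part of the released code):

```python
# Token-space arithmetic for MARLIN's cube embedding.
# Defaults follow the paper's configuration: 16x224x224 clips,
# 2x16x16 cube tokens, and a 90% masking ratio.
def token_counts(T=16, H=224, W=224, t=2, h=16, w=16, r=0.9):
    k = (T // t) * (H // h) * (W // w)  # k = T/t * H/h * W/w total 3D tokens
    n = int(r * k)                      # n masked tokens at ratio r = n/k
    return k, n, k - n                  # total, masked, visible

print(token_counts())  # (1568, 1411, 157): only 157 tokens stay visible
```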
The visible tokens $\tilde{X}_{v}$ are mapped to the latent space $\mathbf{z}$ by the mapping function $\mathcal{F}_{\phi_{\mathcal{E}}}:\tilde{X}_{v}\rightarrow\mathbf{z}$. The latent feature $\mathbf{z}$ is further fed to the decoder $\mathcal{F}_{\phi_{\mathcal{D}}}$, which reconstructs from $\mathbf{z}$ the $\mathbf{n}$ masked tokens $X_{m}^{'}$ via the mapping $\mathcal{F}_{\phi_{\mathcal{D}}}:\mathbf{z}\rightarrow X_{m}^{'}$. In the decoder, the corresponding visible and masked 3D cubes contain the flattened raw pixels, i.e. $\mathbf{e}=C\mathbf{thw}$. In brief, given the visible tokens $\tilde{X}_{v}$, we reconstruct the masked tokens by the following function:
$$
X_{m}^{'}=\mathcal{F}_{\phi_{\mathcal{D}}}\circ\mathcal{F}_{\phi_{\mathcal{E}}}(\tilde{X}_{v})
$$
Since reconstructing spatio-temporal facial patterns from raw pixels is quite challenging, we deploy a discriminator $\mathcal{F}_{\phi_{\Gamma}}$ with adversarial training for better synthesis.
3.2. Self-Supervised Representation Learning.
The self-supervised pre-training strategy of MARLIN consists of three main components, described below:
a) Facial-region Guided Tube Masking (Fasking). In order to capture spatio-temporal correspondence, we deploy a facial-region-specific tube masking strategy following [71]. We dynamically track and mask facial components across the temporal axis for each spatio-temporal cube. Our facial-region-based tube-masking strategy ensures that the same facial region is masked throughout the temporal cube, thus posing a challenging reconstruction task and promoting the learning of local and global facial details (see Alg. 1). As the masked spatio-temporal cubes look like deformable bending tubes, we term this Facial region-guided tube masking, aka Fasking.
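A minimal sketch of the idea behind Fasking, assuming a per-token facial-region assignment is already available from a face parser such as FaceX-Zoo (the region layout and the greedy selection loop are simplified placeholders, not the released implementation):

```python
import numpy as np

def fasking_mask(region_ids, n_regions, r=0.9, rng=None):
    """Facial-region-guided tube mask (sketch).

    region_ids: (T', H', W') int array assigning each 3D token to a facial
                region (eyes, nose, mouth, skin, ...), tracked per frame.
    Returns a boolean (T', H', W') array where True marks a masked token.
    A region is masked in every frame it appears in, so the masked tokens
    form deformable "tubes" that follow the moving facial parts.
    """
    rng = rng or np.random.default_rng(0)
    k = region_ids.size
    mask = np.zeros_like(region_ids, dtype=bool)
    # Greedily mask whole facial regions until the target ratio r is reached.
    for reg in rng.permutation(n_regions):
        mask |= (region_ids == reg)  # same region masked across the time axis
        if mask.sum() >= r * k:
            break
    return mask
```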

Figure 2. Architectural overview of MARLIN (best viewed in color). MARLIN mainly consists of (a) a Representation Learning Module, (b) Facial Region guided Tube Masking, and (c) Downstream Adaptation. (a) Representation Learning Module: MARLIN learns the facial representation from unlabeled, web-crawled video data in a self-supervised fashion (highlighted in Blue). (b) Facial Region guided Tube Masking: With the aid of facial region guided tube masking (highlighted in Yellow), MARLIN gets joint spatio-temporal attention, which in turn facilitates downstream performance. The face-guided tube masking strategy injects domain knowledge into the pipeline. (c) Downstream Adaptation: For facial task specific downstream adaptation, MARLIN utilizes Linear Probing (LP) and Fine-Tuning (FT) to show the robustness, generalizability, and transferability of the learned features (highlighted in Green).
c) Adversarial Adaptation Strategy. To enhance generation quality for rich representation learning, we incorporate adversarial adaptation on top of the masked autoencoder backbone. According to the prior literature [27, 60], adversarial training enhances generation quality, which in turn results in a rich latent feature $\mathbf{z}$. The discriminator $\mathcal{F}_{\phi_{\Gamma}}$, as shown in Fig. 2, is an MLP-based network which imposes an adversarial loss $\mathcal{L}_{\mathrm{adv}}$ between the masked tokens $X_{m}$ and their reconstructed counterparts $X_{m}^{'}$.
3.3. Overall MARLIN Loss
Alg. 2 summarizes the training process for the MARLIN framework. MARLIN mainly imposes (a) a Reconstruction Loss and (b) an Adversarial Loss to facilitate training.
(a) Reconstruction Loss. Given the input masked tokens $\tilde{X}_{m}$, the masked autoencoder module reconstructs them back as $X_{m}^{'}$. To this end, we minimize a mean squared error loss in the 3D token space to update the weights of the $(\mathcal{F}_{\phi_{\mathcal{D}}}\circ\mathcal{F}_{\phi_{\mathcal{E}}}\circ\mathcal{F}_{\phi_{f}})$ branch. The loss is defined as
$$
\mathcal{L}_{\mathrm{recon}}=\frac{1}{N}\sum_{i=1}^{N}||X_{m}^{(i)}-X_{m}^{'(i)}||_{2}
$$
where $N$ is the total number of videos in $\mathbb{D}$, and $X_{m}^{(i)}$ and $X_{m}^{'(i)}$ are the masked tokens and their reconstruction for the $i$-th sample in $\mathbb{D}$.
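A minimal NumPy sketch of this reconstruction objective, assuming the masked tokens are stacked into `(N, n, e)` arrays (shapes and names are illustrative):

```python
import numpy as np

def recon_loss(x_masked, x_recon):
    """L_recon = (1/N) * sum_i || X_m^(i) - X'_m^(i) ||_2
    over N samples of masked tokens and their reconstructions."""
    diffs = (x_masked - x_recon).reshape(len(x_masked), -1)
    return np.linalg.norm(diffs, axis=1).mean()

# A perfect reconstruction gives zero loss:
x = np.random.default_rng(0).normal(size=(2, 3, 4))
print(recon_loss(x, x))  # 0.0
```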
Algorithm 2 Training procedure for MARLIN
(b) Adversarial Loss. The adversarial adaptation uses the Wasserstein GAN loss [5] for better reconstruction of spatio-temporal facial patterns, which in turn helps in learning a rich representation. The loss is defined as follows:
$$
\begin{aligned}
\mathcal{L}_{\mathrm{adv}}^{(d)} &= \frac{1}{N\mathbf{n}}\sum_{i=1}^{N}\Big(\sum_{x_{m}^{'}\in X_{m}^{'(i)}}\mathcal{F}_{\phi_{\Gamma}}(x_{m}^{'})-\sum_{x_{m}\in X_{m}^{(i)}}\mathcal{F}_{\phi_{\Gamma}}(x_{m})\Big)\\
\mathcal{L}_{\mathrm{adv}}^{(g)} &= -\frac{1}{N\mathbf{n}}\sum_{i=1}^{N}\sum_{x_{m}^{'}\in X_{m}^{'(i)}}\mathcal{F}_{\phi_{\Gamma}}(x_{m}^{'})
\end{aligned}
$$
Thus, the overall learning objective $\mathcal{L}$ is formulated as follows, where $\lambda_{W}$ is the weighting parameter:
$$
\begin{aligned}
\mathcal{L}^{(g)} &= \mathcal{L}_{\mathrm{recon}}+\lambda_{W}\mathcal{L}_{\mathrm{adv}}^{(g)}\\
\mathcal{L}^{(d)} &= \mathcal{L}_{\mathrm{adv}}^{(d)}
\end{aligned}
$$
During MARLIN’s pre-training phase, $\mathcal{L}^{(d)}$ updates the discriminator parameters $\phi_{\Gamma}$ and $\mathcal{L}^{(g)}$ updates the encoder and decoder parameters $\phi_{\mathcal{E}}, \phi_{\mathcal{D}}$.
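Putting the pieces together, the alternating optimization summarized in Alg. 2 can be sketched as a single forward computation of both objectives; `encoder`, `decoder`, and `discriminator` are stand-in callables for $\mathcal{F}_{\phi_{\mathcal{E}}}$, $\mathcal{F}_{\phi_{\mathcal{D}}}$, $\mathcal{F}_{\phi_{\Gamma}}$, not the released implementation:

```python
import numpy as np

def marlin_losses(x_visible, x_masked, encoder, decoder, discriminator, lam_w=1.0):
    """Compute L^(d) and L^(g) for one batch (sketch).

    In training, L^(d) updates only the discriminator parameters, while
    L^(g) = L_recon + lam_w * L_adv^(g) updates the encoder and decoder.
    """
    x_recon = decoder(encoder(x_visible))
    # Wasserstein critic loss: mean score of reconstructions minus real cubes.
    loss_d = discriminator(x_recon).mean() - discriminator(x_masked).mean()
    # Reconstruction term plus the (negated) adversarial generator term.
    recon = np.linalg.norm((x_masked - x_recon).reshape(len(x_masked), -1),
                           axis=1).mean()
    loss_g = recon - lam_w * discriminator(x_recon).mean()
    return loss_d, loss_g
```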
3.4. Downstream Adaptation
Our proposed MARLIN framework learns robust and transferable facial representations from facial videos in a self-supervised way. Following standard evaluation protocols, we adopt Linear Probing (LP) and Fine-Tuning (FT) for downstream adaptation to different face-relevant tasks (see the inference module in Fig. 2). Given any task-specific downstream dataset $\mathbb{D}_{\mathrm{down}}=\{v_{j},\mathbf{y}_{j}\}_{j=1}^{\mathcal{N}}$, we deploy linear fully-connected (FC) layers with embedding parameters $\theta$ on top of the encoder module $\mathcal{F}_{\phi_{\mathcal{E}}}$ to align the latent space to the downstream task-specific label space. For linear probing, we freeze the backbone network $\mathcal{F}_{\phi_{\mathcal{E}}}$ and only update $\mathcal{F}_{\theta}$. On the other hand, for FT, we fine-tune the whole module, i.e. $(\mathcal{F}_{\phi_{\mathcal{E}}}\circ\mathcal{F}_{\theta})$. When MARLIN is used as a feature extractor for LP, it uses a sliding temporal window to extract features $Z$ from the input face-cropped video $V$, as shown in Fig. 2 (c). The details of the different downstream facial tasks are described below:
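The two adaptation modes can be sketched in PyTorch as follows; `encoder` stands in for the pre-trained $\mathcal{F}_{\phi_{\mathcal{E}}}$ and a single linear head for $\mathcal{F}_{\theta}$ (the builder function and its names are illustrative, not MARLIN's released API):

```python
import torch.nn as nn

def build_downstream(encoder: nn.Module, feat_dim: int, n_classes: int,
                     mode: str = "LP") -> nn.Module:
    """Attach a linear head F_theta on top of the pre-trained encoder F_phi_E.

    mode="LP": freeze the backbone and train only the head (linear probing).
    mode="FT": leave everything trainable (fine-tuning).
    """
    if mode == "LP":
        for p in encoder.parameters():
            p.requires_grad = False  # backbone stays fixed during LP
    return nn.Sequential(encoder, nn.Linear(feat_dim, n_classes))
```

For a multi-label task such as FAR, the head would typically be followed by per-attribute sigmoids, while single-label tasks like FER or DFD use a softmax over classes.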
Facial Attribute Recognition (FAR) predicts the presence of appearance and action attributes, such as gender, race, hair color, and emotion, in a given face video. Predicting facial attributes can be posed as a multi-label learning problem highly dependent on rich spatial encoding. For downstream adaptation, we use 28,532 train, 3,567 val, and 3,567 test videos from the CelebV-HQ [85] dataset. Following the prior works [33, 50, 84], we report average accuracy (↑) and Area Under the Curve (AUC ↑) over all attributes.
The Facial Expression Recognition (FER) task encodes spatio-temporal facial muscle movement patterns to predict the emotion (6-class) and sentiment (7-class and 2-class) of the subject in a given facial video. We evaluate the performance of MARLIN on the CMU-MOSEI dataset [7], a conversational corpus with 16,726 train, 1,871 val, and 4,662 test samples. Following the prior works [7, 25], we use overall accuracy (↑) as the metric.
The Deepfake Detection (DFD) task predicts spatio-temporal facial forgery given a facial video from the FF++ (LQ) dataset [62]. For downstream adaptation, we use 3,600 train, 700 val, and 700 test videos from the FF++ (LQ) dataset [62]. Following prior literature [12, 58, 76], we use accuracy (↑) and AUC (↑) as the evaluation metrics.
Lip Synchronization (LS) is another line of research that requires facial-region-specific spatio-temporal synchronization. This downstream adaptation further demonstrates the adaptation capability of MARLIN for face generation tasks. For adaptation, we replace the facial encoder module in Wav2Lip [57] with MARLIN and adjust the temporal window accordingly, i.e. from 5 frames to $\mathbf{T}$ frames. For evaluation, we use the LRS2 [22] dataset, which has 45,838 train, 1,082 val, and 1,243 test videos. Following the prior literature [57, 74], we use Lip-Sync Error-Distance (LSE-D ↓), Lip-Sync Error-Confidence (LSE-C ↑) and Fréchet Inception Distance (FID ↓) [38] as evaluation metrics.
4. Experiments and Results
We have comprehensively compared our method on different downstream adaptation tasks from quantitative (see Sec. 4.2) and qualitative (see Sec. 4.3) perspectives.
我们在不同的下游适应任务上从定量(参见第4.2节)和定性(参见第4.3节)角度全面比较了我们的方法。
Table 2. Facial Attribute Recognition. Our proposed framework, MARLIN, trained on the YTF [78] dataset and Linear Probed/Fine-Tuned on the CelebV-HQ [85] benchmark dataset, in terms of accuracy↑ and area under the curve↑. * denotes supervised methods trained on the CelebV-HQ [85] dataset.
表 2: 面部属性识别。我们提出的框架 MARLIN 在 YTF [78] 数据集上训练,并在 CelebV-HQ [85] 基准数据集上进行线性探测/微调,以准确率↑和曲线下面积↑为指标。* 表示在 CelebV-HQ [85] 数据集上训练的监督方法。
| 方法 | 外观 Acc.↑ | 外观 AUC↑ | 动作 Acc.↑ | 动作 AUC↑ | 整体 Acc.↑ |
|---|---|---|---|---|---|
| R3D [72]* | 92.34 | 0.9424 | 94.57 | 0.9173 | 93.45 |
| MViTv1[30]* | 92.90 | 0.9452 | 95.13 | 0.9233 | 94.01 |
| MViTv2[49]* | 92.77 | 0.954 | 95.15 | 0.9239 | 93.96 |
| VideoMAE(FT)[71] | 92.91 | 0.9529 | 95.37 | 0.9284 | 94.14 |
| MARLIN (LP) | 91.90 | 0.9373 | 95.25 | 0.9278 | 93.57 |
| MARLIN (FT) | 93.90 | 0.9561 | 95.48 | 0.9406 | 94.69 |
Additionally, we have performed extensive ablation studies to justify our design choices.
此外,我们还进行了大量的消融实验,为我们的设计选择提供了依据。
4.1. Experimental Protocols
4.1. 实验协议
Datasets. We evaluate the MARLIN framework on the different facial analysis tasks described in Sec. 3.4. In brief, we use CelebV-HQ [85] for facial attribute and action prediction, the CMU-MOSEI dataset [7] for conversational emotion and sentiment prediction, the FF++ (LQ) dataset [62] for deepfake detection, and LRS2 [22] for lip synchronization.
数据集。我们在第3.4节描述的不同面部分析任务上评估MARLIN框架。具体而言,使用CelebV-HQ [85]进行面部属性和动作预测,CMU-MOSEI数据集[7]进行对话情绪和情感预测,FF++ (LQ) 数据集[62]进行深度伪造检测,LRS2 [22]进行唇部同步。
Settings. For fair comparisons, we follow the dataset-specific experimental protocols mentioned in the task-specific prior literature [7, 22, 33, 50, 62, 84]. Beyond traditional evaluation, we also perform a few-shot adaptation strategy to show the robustness and transferability of MARLIN.
设置。为确保公平比较,我们遵循各任务相关先前文献[7, 22, 33, 50, 62, 84]中提到的数据集特定实验方案。除传统评估外,我们还采用少样本适应策略以验证MARLIN的鲁棒性与迁移能力。
Implementation Details. We implemented the method in PyTorch [55] on an Nvidia RTX A6000 GPU. Since consecutive frames of any temporal chunk of a facial video are highly redundant, we only consider semantically meaningful frames with significant inter-frame motion by adopting a minimum temporal stride of 2. Given an input video of dimension $3\times16\times224\times224$, the cube embedding layer generates $8\times14\times14$ 3D tokens of dimension $2\times16\times16$ to preserve spatio-temporal patterns. Using the Fasking strategy (See Algo. 1), MARLIN densely masks these tokens with a pre-defined masking ratio. Our empirical analysis suggests that MARLIN works favorably with a high masking ratio (90%). MARLIN's objective is to reconstruct the masked part from the sparse visible tokens. After Fasking, each token is mapped to a latent embedding of dimension 768, from which the masked part is reconstructed in the 3D token space and can further be mapped back to the original video. For fair comparison, we use ViT-B as the backbone encoder; the impact of other ViT variants is shown in the ablation study. The pre-training hyper-parameters are as follows: the base learning rate is linearly scaled with the overall batch size, lr = base learning rate × batch size / 256. For self-supervised pre-training, we use the AdamW optimizer with base learning rate 1.5e-4, momentum $\beta_{1}=0.9$, $\beta_{2}=0.95$, and a cosine-decay learning rate scheduler [51]. For linear probing, we use the Adam optimizer with $\beta_{1}=0.5$, $\beta_{2}=0.9$, base learning rate 1e-4, and weight decay 0. For fine-tuning, we use the Adam optimizer with $\beta_{1}=0.5$, $\beta_{2}=0.9$, and base learning rate 1e-4 without any weight decay.
实现细节。我们在PyTorch [55] 和Nvidia RTX A6000 GPU上实现了该方法。首先,给定面部视频的任何时间片段,连续帧之间存在高度冗余性。因此,为筛选具有显著帧间运动的语义关键帧,我们采用最小时间步长值为2。对于输入视频(维度为 $3\times16\times224\times224$),立方体嵌入层会生成 $8\times14\times14$ 个维度为 $2\times16\times16$ 的3D Token以保留时空模式。通过Fasking策略(参见算法1),MARLIN以预定义的掩码比例对这些Token进行密集掩码。实验分析表明MARLIN在较高掩码比例 (90%) 下表现优异。MARLIN的目标是从稀疏可见Token中重建被掩码部分。Fasking处理后,每个Token被映射到768维的潜在空间嵌入。基于该潜在嵌入,被掩码部分将在3D Token空间中重建,并可进一步映射回原始视频。为公平比较,我们采用ViT-B作为主干编码器,其他ViT变体的影响详见消融实验。预训练超参数设置如下:基础学习率随总批次大小线性缩放,即 lr = 基础学习率 × 批次大小 / 256。自监督预训练使用AdamW优化器,基础学习率 1.5e-4,动量 $\beta_{1}=0.9,\beta_{2}=0.95$ 并配备余弦衰减学习率调度器 [51]。线性探测阶段采用Adam优化器,$\beta_{1}=0.5,\beta_{2}=0.9$,基础学习率 1e-4,权重衰减为0。微调阶段使用Adam优化器,$\beta_{1}=0.5,\beta_{2}=0.9$,基础学习率 1e-4 且不设权重衰减。
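The token and learning-rate bookkeeping above can be sketched as follows. This is a rough illustration assuming the dimensions quoted in the text (a 3×16×224×224 input, 2×16×16 cubes, 90% masking); the helper names are ours, not from the official MARLIN implementation, and a uniform random split stands in for the face-guided Fasking choice.

```python
# Sketch of the token/mask arithmetic described above. Helper names are
# illustrative, not from the official MARLIN codebase.
import random

def cube_token_grid(frames=16, height=224, width=224,
                    cube_t=2, cube_h=16, cube_w=16):
    """Token counts per axis after cube embedding: here (8, 14, 14)."""
    return frames // cube_t, height // cube_h, width // cube_w

def split_visible_masked(num_tokens, mask_ratio=0.9, seed=0):
    """Randomly partition token indices into (visible, masked) sets.
    MARLIN's Fasking biases this choice toward facial regions; a uniform
    split is shown here for simplicity."""
    rng = random.Random(seed)
    idx = list(range(num_tokens))
    rng.shuffle(idx)
    n_masked = int(round(num_tokens * mask_ratio))
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])

def scaled_lr(base_lr=1.5e-4, batch_size=256):
    """Linear learning-rate scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

t, h, w = cube_token_grid()
num_tokens = t * h * w              # 8 * 14 * 14 = 1568 tokens per clip
visible, masked = split_visible_masked(num_tokens)
print(num_tokens, len(visible), len(masked))   # 1568 157 1411
```

At a 90% masking ratio only 157 of the 1,568 tokens remain visible to the encoder, which is what makes the reconstruction auxiliary task challenging.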
Table 3. Facial Expression and Sentiment Recognition. Downstream adaptation results on the MOSEI dataset [7] for emotion, sentiment (7-class), and sentiment (2-class). Our proposed method, MARLIN, outperforms visual-modality-based emotion prediction methods. Please note that the SOTA methods UMONS [25] and GMF [4] utilize three modalities and are thus not directly comparable. Here, YTF: YouTube Faces [78]; L, A, and V denote the linguistic, audio, and visual modalities, respectively. * denotes supervised methods.
表 3: 面部表情与情感识别。在MOSEI数据集[7]上针对情感、情绪(7分类)和情绪(2分类)的下游适配结果。我们提出的MARLIN方法优于基于视觉模态的情感预测方法。请注意,UMON[25]和GMF[4]的SOTA结果使用了三种模态,因此不具备直接可比性。其中YTF: YouTube Face[78],LAV分别表示语言、音频和视觉模态。*表示监督式方法。
| 任务 | 预训练 | 方法 | 模态 | 准确率↑ |
|---|---|---|---|---|
| 情感 | | MViTv1[49]* | V | 80.45 |
| 情感 | | UMONS[25]* | LAV | 80.68 |
| 情感 | | GMF[4]* | LAV | 81.14 |
| 情感 | YTF[78] | VideoMAE[71] | V | 80.39 |
| 情感 | YTF[78] | MARLIN | V | 80.60 |
| 情绪(7分类) | | MViTv1[49]* | V | 33.35 |
| 情绪(7分类) | YTF[78] | VideoMAE[71] | V | 33.78 |
| 情绪(7分类) | YTF[78] | MARLIN | V | 34.63 |
| 情绪(2分类) | MOSEI[7]和IEMOCAP[11] | CAE-LR[45] | V | 71.06 |
| 情绪(2分类) | YTF[78] | VideoMAE[71] | V | 72.96 |
| 情绪(2分类) | YTF[78] | MARLIN | V | 73.70 |
4.2. Quantitative Analysis
4.2. 定量分析
4.2.1. Comparison with SOTA Facial Analysis Tasks.
4.2.1. 与SOTA面部分析任务的对比
We compare the performance of MARLIN with different downstream facial analysis tasks following standard task specific evaluation protocols [7, 22, 33, 50, 62, 84].
我们按照特定任务的标准评估协议 [7, 22, 33, 50, 62, 84],比较了 MARLIN 在不同下游面部分析任务中的性能。
Facial Attributes. In Tab. 2, we compare the LP and FT adaptation performance of MARLIN with popular transformers (i.e. MViT-v1 [30] and MViT-v2 [49]) and CNNs (i.e. R3D [72]) on the CelebV-HQ [85] dataset.
面部属性。在表 2 中,我们比较了 MARLIN 与流行 Transformer 的 LP (Linear Probing) 和 FT (Fine-Tuning) 适应性能。
Table 4. Deepfake Detection. We compare the Fine-Tuning (FT) results of MARLIN on the FaceForensics++ [62] dataset. * denotes supervised methods.
表 4: Deepfake检测。我们比较了MARLIN在FaceForensics++ [62]数据集上的微调(FT)结果。*表示监督方法。
| 预训练 (Pre-train) | 方法 (Method) | 准确率 (Acc.)(%)↑ | AUC↑ |
|---|---|---|---|
| | Steg. Features [32]* | 55.98 | |
| | LD-CNN [24]* | 58.69 | |
| | Constrained Conv. [8]* | 66.84 | |
| | CustomPooling CNN [61]* | 61.18 | |
| | MesoNet [2]* | 70.47 | |
| | Face X-ray [47]* | | 0.6160 |
| | Xception [21]* | 86.86 | 0.8930 |
| | F3-Net [58]* | 93.02 | 0.9580 |
| | P3D [59]* | | 0.6705 |
| | R3D [72]* | | 0.8772 |
| | I3D [15]* | | 0.9318 |
| | M2TR [76]* | | 0.9395 |
| | ST-M2TR [76]* | | 0.9531 |
| YTF [78] | VideoMAE[71] | 87.57 | 0.9082 |
| YTF [78] | MARLIN | 89.43 | 0.9305 |
Table 5. Lip Synchronization. We compare Linear Probing (LP) and Fine-Tuning (FT) results on the LRS2 [22] dataset.
表 5: 唇音同步。我们在 LRS2 [22] 数据集上比较了线性探测 (Linear Probing, LP) 和微调 (Fine-Tuning, FT) 的结果。
| 方法 | LSE-D↓ | LSE-C↑ | FID↓ |
|---|---|---|---|
| Speech2Vid[41] | 14.230 | 1.587 | 12.320 |
| LipGAN [42] | 10.330 | 3.199 | 4.861 |
| Wav2Lip [57] | 7.521 | 6.406 | 4.887 |
| AttnWav2Lip[74] | 7.339 | 6.530 | — |
| Wav2Lip + ViT [28] | 8.996 | 2.807 | 13.352 |
| Wav2Lip+ViT+VideoMAE[71] | 7.316 | 5.096 | 4.097 |
| Wav2Lip+ViT+MARLIN | 7.127 | 5.528 | 3.452 |
From Tab. 2, it is observed that MARLIN's FT version outperforms the supervised MViT-v2 [49] transformer architecture by 1.13% (92.77% → 93.90%) on appearance attributes and by 0.33% (95.15% → 95.48%) on action attributes. Similar patterns have also been observed with the R3D CNN module. We attribute MARLIN's performance gain to its pre-training strategy, which encodes generic, robust, and transferable features from any input facial video.
在CelebV-HQ [85]数据集上对比了MViT系列(即MViT-v1 [30]和MViT-v2 [49])与CNN模型(即R3D [72])。表中数据显示,MARLIN的微调版本在外观属性分类上比监督式MViT-v2 [49] Transformer架构提升1.13% (92.77%→93.90%),在动作属性分类上提升0.33% (95.15%→95.48%)。R3D CNN模块也呈现相似趋势。我们认为MARLIN的性能提升源于其预训练策略,该策略能从任意输入的面部视频中编码出通用、鲁棒且可迁移的特征。
Emotion and Sentiment. In Tab. 3, we similarly compare the LP and FT adaptation performance for conversational emotion and sentiment in terms of accuracy (↑) on the CMU-MOSEI [83] dataset. Please note that MARLIN is a visual-modality-only encoder. The results suggest that MARLIN performs competitively with SOTA methods [25, 45, 49]; in particular, it outperforms the unsupervised SOTA CAE-LR [45] by 2.64% (71.06% → 73.70%) on the 2-class sentiment task. On emotion and 7-class sentiment, it also marginally outperforms supervised benchmarks [49]. These results further indicate that MARLIN learns highly generic, robust, and transferable feature representations from pre-training.
情感与情绪分析。在表3中,我们同样比较了对话情感与情绪分析的LP和FT适应性能,指标为CMU-MOSEI数据集[83]上的准确率 (↑)。需注意MARLIN是仅基于视觉模态的编码器。结果表明,MARLIN与当前最优方法[25,45,49]表现相当,尤其在二分类情感任务上以2.64% (71.06% → 73.70%) 的优势超越无监督SOTA方法CAE-LR[45]。对于情绪和七分类情感任务,它也略微优于有监督基准[49]。这些结果还表明,MARLIN通过预训练学习到了高度通用、鲁棒且可迁移的特征表示。
DeepFake Detection. In Tab. 4, we compare video-manipulation detection performance on the FaceForensics++ [62] dataset and report results in terms of video-level accuracy (↑) and AUC (↑). The results indicate that MARLIN performs favorably against the supervised SOTA methods [2, 8, 15, 21, 24, 32, 47, 59, 61, 72]. This is the first SSL work that uses only spatio-temporal visual anomalies to detect video manipulation, unlike F3-Net [58], which uses frequency-aware patterns over the temporal dimension to detect forgeries in a supervised fashion. MARLIN, irrespective of frequency patterns, learns facial representations and can detect anomalies from the spatio-temporal signal.
DeepFake检测。在表4中,我们比较了FaceForensics++ [62]数据集上的视频篡改检测性能,并以视频级准确率 (↑) 和AUC (↑) 报告结果。结果表明,MARLIN在性能上优于有监督的SOTA方法[2, 8, 15, 21, 24, 32, 47, 59, 61, 72]。这是首个仅使用时空视觉信息异常进行视频篡改检测的自监督学习(SSL)工作。不同于F3-Net [58]使用时域维度上的频率感知模式以有监督方式检测伪造,MARLIN无需依赖频率模式即可学习面部表征,并能从时空信号中检测异常。
Lip Synchronization. For a fair comparison, we adopt the following experimental setups: 1) Wav2Lip + ViT: to compare the contribution of the ViT architecture [28] w.r.t. SOTA CNNs and MARLIN, where the ViT weights are trained from scratch on the LRS2 [22] dataset. 2) Wav2Lip + ViT + VideoMAE: to compare the contribution of vanilla VideoMAE with a ViT backbone pre-trained on
唇形同步 (Lip Synchronization)。为公平比较,我们采用以下实验设置:1) Wav2Lip + ViT:比较ViT架构[28]相对于SOTA CNN和MARLIN的贡献,其中ViT权重在LRS2[22]数据集上从头训练。2) Wav2Lip + ViT + VideoMAE:比较基于ViT骨干网络预训练的原始VideoMAE的贡献
Table 6. Few shot adaptation on different facial tasks. Comparison of different methods for few shot adaptation.
表 6: 不同面部任务的少样本适应。不同方法在少样本适应上的比较。
| Anno.% | MOSEI [7] Emo. Acc.↑ | MOSEI [7] 7-Sen. Acc.↑ | MOSEI [7] 2-Sen. Acc.↑ | FF++ [62] DeepFake AUC↑ | CelebV-HQ [85] Appr. AUC↑ | CelebV-HQ [85] Act. AUC↑ |
|---|---|---|---|---|---|---|
| 100% | 80.60 | 34.63 | 73.70 | 0.9305 | 0.9373 | 0.9278 |
| 50% | 80.59 | 33.73 | 73.33 | 0.8681 | 0.9273 | 0.9270 |
| 10% | 79.89 | 33.56 | 72.26 | 0.7459 | 0.8996 | 0.9201 |
| | 78.61 | 30.09 | 71.89 | 0.6252 | 0.8423 | 0.9063 |
the YTF [78] dataset. 3) Wav2Lip + ViT + MARLIN: to compare the contribution of MARLIN pre-trained on YTF [78] with SOTA [57, 66, 74] and different design aspects. The experimental results are reported in Tab. 5 with LSE-D ↓, LSE-C ↑, and FID ↓ as evaluation metrics, following the standard protocol [38, 57, 66, 74]. The improvement in lip-sync scores (LSE-D ↓: 7.521 → 7.127; FID ↓: 4.887 → 3.452) indicates that MARLIN learns rich spatio-temporal patterns that are transferable and robust. It is also interesting to observe that MARLIN is adaptive to very fine-grained, face-specific features.
YTF [78] 数据集。3) Wav2Lip + ViT + MARLIN:比较 MARLIN 在 YTF [78] 上预训练的贡献与 SOTA [57, 66, 74] 及不同设计方面的差异。实验结果如表 5 所示,遵循标准协议 [38, 57, 66, 74],采用 LSE-D ↓、LSE-C ↑ 和 FID ↓ 作为评估指标。唇同步分数的提升 (LSE-D ↓: 7.521 → 7.127; FID ↓: 4.887 → 3.452) 表明 MARLIN 学习了可迁移且鲁棒的丰富时空模式。另一个有趣的现象是,MARLIN 对脸部细粒度特征也表现出适应性。
4.2.2. Few-Shot Adaptation.
4.2.2. 少样本 (Few-Shot) 适应
Few-shot adaptation has recently gained attention due to its capability to adapt in very low data regimes [9, 65, 84, 86]. Following the standard evaluation protocol [9, 65, 84, 86], we investigate the adaptation capability of MARLIN. Given any downstream dataset, we use a limited set of train labels to align the output manifold while keeping the test set fixed, via the LP (MOSEI, CelebV-HQ) and FT (FF++) strategies. From Tab. 6, only a slight drop in performance is observed across different tasks, which further demonstrates that MARLIN learns generic, transferable, and adaptive information.
少样本 (few-shot) 适应因其极低数据需求下的适应能力而受到关注 [9, 65, 84, 86]。遵循标准评估协议 [9, 65, 84, 86],我们同样研究了 MARLIN 的适应能力。给定任意下游数据集,我们使用有限的训练集标签对齐输出流形,同时通过 LP (MOSEI, CelebV-HQ) 和 FT (FF++) 策略保持测试集固定。从表 6 可见,不同任务中性能仅略有下降,这进一步证明 MARLIN 学习了通用、可迁移且自适应的信息。
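The few-shot protocol above can be emulated by keeping the test split fixed and retaining only a fraction of the training labels. A minimal sketch follows; per-class stratification and the seed are our assumptions, not specified in the text.

```python
# Minimal sketch of the few-shot protocol: subsample training labels,
# keep the test set untouched. Stratified per label so rare classes
# survive small fractions (our assumption, not from the paper).
import random
from collections import defaultdict

def few_shot_subset(samples, fraction, seed=42):
    """Return a label-stratified subset of (sample, label) pairs."""
    by_label = defaultdict(list)
    for sample, label in samples:
        by_label[label].append((sample, label))
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        keep = max(1, int(round(len(group) * fraction)))  # >= 1 per class
        subset.extend(group[:keep])
    return subset

# Toy stand-in for a downstream train set with binary labels.
train = [(f"vid_{i:04d}", i % 2) for i in range(1000)]
for frac in (1.0, 0.5, 0.1):
    print(frac, len(few_shot_subset(train, frac)))  # 1000, 500, 100 samples
```

The downstream head (LP or FT) is then trained only on the returned subset while evaluation always uses the full, fixed test split.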
4.2.3. Ablation Studies.
4.2.3. 消融研究
We have performed extensive ablation studies to show the effectiveness of each component.
我们进行了大量消融实验以验证各组件的有效性。
Table 7. Contribution of different modules, encoder architectures, and masking strategies to the overall MARLIN framework. Fasking: Facial-Guided Masking; AT: Adversarial Training.
表 7. 不同模块、编码器架构和掩码策略对MARLIN框架整体性能的贡献。Fasking: 面部引导掩码, AT: 对抗训练
| 方法 | MOSEI [7] 情感准确率 (%↑) | MOSEI [7] 7类情感准确率 (%↑) | MOSEI [7] 2类情感准确率 (%↑) | FF++ [62] 准确率 (%↑) | FF++ [62] AUC↑ |
|---|---|---|---|---|---|
| 模块 | | | | | |
| VideoMAE | 80.39 | 33.78 | 72.96 | 87.57 | 0.9082 |
| + Fasking | 80.55 | 34.58 | 73.54 | 87.29 | 0.9154 |
| + AT | 80.58 | 34.05 | 73.17 | 88.00 | 0.9096 |
| + Both (MARLIN) | 80.60 | 34.63 | 73.70 | 89.43 | 0.9305 |
| 编码器架构 | | | | | |
| ViT-S | 80.38 | 33.40 | 72.69 | 87.43 | 0.8863 |
| ViT-B | 80.60 | 34.63 | 73.70 | 89.43 | 0.9305 |
| ViT-L | 80.63 | 35.28 | 74.83 | 90.71 | 0.9377 |
| 掩码策略 | | | | | |
| Random | 80.40 | 34.10 | 72.96 | 87.29 | 0.8797 |
| Frame | 79.33 | 33.99 | 72.90 | 86.57 | 0.8835 |
| Tube | 80.58 | 34.05 | 73.17 | 88.00 | 0.9096 |
| Fasking | 80.60 | 34.63 | 73.70 | 89.43 | 0.9305 |

Figure 3. Impact of Masking Ratio. Comparison of different masking ratios for emotion and sentiment prediction on the CMU-MOSEI dataset [7]. Empirically, 90% masking works best for MARLIN.
图 3: 掩码比例的影响。在 CMU-MOSEI 数据集 [7] 中针对情感和情绪预测任务的不同掩码比例对比实验。实证研究表明,90% 的掩码比例对 MARLIN 模型效果最佳。
- Masking ratio. We use different masking ratios in the range [0.05, 0.95] and repeat the pre-training followed by LP on the CMU-MOSEI [83] dataset. From Fig. 3, we see that a ~90% masking ratio is optimal for MARLIN. With a lower masking ratio (i.e. ≤ 0.5), more information is available for the reconstruction task, which degrades the feature quality. Similarly, beyond ~90%, the reconstruction task becomes too challenging, leading to a performance drop. Given this empirical evidence, we set the masking ratio to ~90% throughout all of our experiments.
- Masking strategies. We further compare the proposed Fasking strategy with existing masking strategies [31, 71], i.e. Frame, Random, and Tube-Masking. The empirical results in Tab. 7 demonstrate that Fasking is better.
- Different modules. We progressively integrate each module and observe its impact on downstream performance on CMU-MOSEI [83] and FF++ [62] while keeping the other components fixed. From Tab. 7, we see that the addition of Fasking and Adversarial Training (AT) improves performance, reflecting the importance of each component.
- Encoder architectures. To investigate the impact of the backbone encoder architecture, we compare ViT-S, ViT-B, and ViT-L (See Tab. 7). We observe that a larger model enhances performance. For fair comparison, we use a ViT-B encoder.
- 掩码比例。我们在 $[0.05, 0.95]$ 范围内使用不同掩码比例,并在 CMU-MOSEI [83] 数据集上重复预训练及线性探测。从图 3 可见,约 90% 的掩码比例对 MARLIN 最为理想。较低掩码比例(即 ≤ 0.5)会使重建任务获得过多信息,导致特征质量下降;而超过约 90% 时,重建任务难度骤增,性能随之降低。根据实证结果,我们在所有实验中均采用约 90% 的掩码比例。
- 掩码策略。我们将提出的 Fasking 策略与现有掩码策略 [31,71](帧掩码、随机掩码、管状掩码)进行对比。表 7 的实证结果表明 Fasking 更具优势。
- 模块差异。我们在固定其他组件的情况下,逐步集成各模块并观察其对 CMU-MOSEI [83] 和 $\mathrm{FF}{+}{+}$ [62] 下游任务的影响。表 7 显示,引入 Fasking 和对抗训练 (AT) 能提升性能,印证了各组件的必要性。
- 编码器架构。为探究主干编码器架构的影响,我们比较了 ViT-S、ViT-B 和 ViT-L(见表 7)。结果表明模型规模扩大能提升性能。为公平对比,我们选用 ViT-B 编码器。
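To make the masking-strategy comparison in Tab. 7 concrete, the three baselines differ only in which token indices they hide over the 8×14×14 token grid. The sketch below is a simplification (Fasking itself additionally relies on facial-region priors, which are omitted here; function names are ours).

```python
# Simplified index generators for the baseline masking strategies of Tab. 7,
# over a (T, H, W) = (8, 14, 14) token grid. Fasking would additionally
# weight the choice by facial-region parse maps (not shown).
import random

T, H, W = 8, 14, 14

def random_masking(ratio=0.9, seed=0):
    """Mask tokens uniformly at random across space and time."""
    rng = random.Random(seed)
    idx = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    rng.shuffle(idx)
    return set(idx[:int(round(len(idx) * ratio))])

def frame_masking(ratio=0.9, seed=0):
    """Mask whole temporal slices: all tokens of the selected frames."""
    rng = random.Random(seed)
    frames = list(range(T))
    rng.shuffle(frames)
    chosen = set(frames[:int(round(T * ratio))])
    return {(t, h, w) for t in chosen for h in range(H) for w in range(W)}

def tube_masking(ratio=0.9, seed=0):
    """Mask the same spatial positions in every frame ('tubes')."""
    rng = random.Random(seed)
    cells = [(h, w) for h in range(H) for w in range(W)]
    rng.shuffle(cells)
    chosen = set(cells[:int(round(H * W * ratio))])
    return {(t, h, w) for t in range(T) for (h, w) in chosen}
```

Tube masking prevents the decoder from copying a patch from a neighboring frame, which is why it outperforms random and frame masking in Tab. 7; Fasking adds face-region guidance on top of this idea.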
4.3. Qualitative Aspects
4.3. 定性方面
To understand the effectiveness of the learned features, we further conducted the following qualitative analyses.
为了理解所学特征的有效性,我们进一步进行了以下定性分析。
- Facial Attributes. We visualize the important regions that MARLIN focuses on using Gradient-weighted Class Activation Mapping (Grad-CAM) [64]. In Fig. 4 (top), the heat-map results are based on LP on top of MARLIN's features on the CelebV-HQ [85] dataset (appearance task), and indicate that MARLIN focuses on facial attributes such as hair, spectacles, hat, etc.
- Lip Synchronization. In Fig. 4 (bottom), we present generation results for the lower part of faces, which is a challenging task. The top, middle, and bottom rows show the ground truth, vanilla Wav2Lip [57]'s output, and MARLIN's output, respectively, along with close-up views. Here, Wav2Lip's CNN encoder failed to locate the lip region (highlighted in red in the Wav2Lip row of Fig. 4), whereas MARLIN, despite being pre-trained with the Fasking strategy, is adaptive enough to generate a more accurate spatio-temporal pattern for LS.
- 面部属性。我们使用梯度加权类激活映射 (Grad-CAM) [64] 可视化 MARLIN 关注的重要区域。在图 4 顶部,热力图结果基于 CelebV-HQ [85] 数据集(外观任务)上 MARLIN 特征的 LP (Linear Probing),表明 MARLIN 专注于头发、眼镜、帽子等面部属性。
- 唇部同步。在图 4 底部,我们展示了面部下半部分的生成结果,这是一项具有挑战性的任务。顶部、中间和底部分别显示真实情况、原始 Wav2Lip [57] 的输出和 MARLIN 的输出以及特写视图。此处,Wav2Lip 的 CNN 编码器未能定位唇部区域(如图 4 Wav2Lip 行红色高亮所示),而 MARLIN 尽管在掩码策略上进行了预训练,但仍能自适应地生成更准确的结果。

Figure 4. Qualitative Analysis. Top: Qualitative results of MARLIN for the facial attribute recognition task. Bottom: Qualitative results of MARLIN for the facial lip synchronization task.

图 4: 定性分析。上: MARLIN 在面部属性识别任务中的定性结果。下: MARLIN 在唇语同步任务中的定性结果。
5. Conclusion
5. 结论
In this paper, we aim to learn a universal and generic facial encoder, MARLIN, which is adaptive, robust, and transferable across different facial analysis tasks. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions to capture local and global aspects, which in turn helps in encoding generic and transferable features.
Broader Impact. We believe that MARLIN can act as a good feature extractor for different downstream facial analysis tasks. Owing to its rich facial features, MARLIN would be easy to deploy on low-resource devices (e.g. mobile devices, Jetson Nano platforms) for real-world applications.
Limitations. As the model is trained on the YouTube Faces dataset [78], there could be potential bias in terms of the race and cultural background of the identities. Potential bias can also be introduced by the existing face detection library [75] we use. We will address these limitations in future versions.
本文旨在学习一种通用且多用途的面部编码器MARLIN,该编码器具有适应性、鲁棒性,并可迁移至不同面部分析任务。作为一项具有挑战性的辅助任务,MARLIN通过从密集掩蔽的面部区域重建时空细节,来捕捉局部和全局特征,从而帮助编码通用且可迁移的特征。
更广泛的影响。我们相信MARLIN可以作为不同下游面部分析任务的良好特征提取器。得益于丰富的面部特征,MARLIN可以轻松部署在低资源(如移动设备、Jetson Nano平台)设备中,用于实际应用。
局限性。由于模型在YouTube Faces数据集[78]上训练,可能在人物身份的种族和文化背景方面存在潜在偏差。此外,由于使用了现有的面部检测库[75],模型也可能引入潜在偏差。我们将在后续版本中消除这些局限性。
Acknowledgements.
致谢。
This material is based on research sponsored by DARPA under agreement number HR001122C0029. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. M. Hayat is supported by the ARC DECRA Fellowship DE200101100.
本材料基于DARPA根据协议号HR001122C0029资助的研究。美国政府有权为政府目的复制和分发重印本,无论其上是否有版权标记。M. Hayat由ARC DECRA Fellowship DE200101100支持。
