MARLIN: Masked Autoencoder for facial video Representation LearnINg
Abstract
This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available non-annotated web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, performing consistently well across a variety of downstream tasks including FAR ($1.13%$ gain over supervised benchmark), FER ($2.64%$ gain over unsupervised benchmark), DFD ($1.86%$ gain over unsupervised benchmark), and LS ($29.36%$ gain in Fréchet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
1. Introduction
Facial analysis tasks [34, 43, 70, 85] provide essential cues for human non-verbal behavior analysis and help unfold meaningful insights regarding social interaction [36], communication [40], and cognition [68], with potential applications in the Human-Computer Interaction (HCI) and Affective Computing domains. Recently, we have witnessed significant progress in deep neural network models for facial analysis tasks such as Facial Attribute Recognition (FAR) [34, 85], Facial Expression Recognition (FER) [48], DeepFake Detection (DFD) [70], and Lip Synchronization (LS) [43]. While these deep models can achieve remarkable performance, they often require large-scale annotated datasets, which is not only a resource-expensive and time-consuming process but also infeasible for some applications requiring domain expertise for annotation (e.g. FER).

Figure 1. Overview of the proposed Masked Autoencoder for facial video Representation LearnINg, aka MARLIN. MARLIN aims to learn a universal facial representation from abundantly available non-annotated facial video data.
To this end, self-supervised pre-training [26, 37, 71] has lately emerged as an effective strategy to address the limitations of fully supervised methods, as it enables generic representation learning from non-annotated data that can then be transferred across tasks having limited labels. For images of natural scenes and objects, self-supervised learning approaches using self-distillation [14], contrastive learning [18, 19], solving pretext tasks such as jigsaw puzzles [53], and, more recently, autoencoding [37, 71] have even outperformed supervised learning approaches.
Despite the promise offered by these self-supervised methods in learning scalable and generic representations for natural scene images and videos, they have not yet been investigated for learning representations from facial video data. Facial representation learning requires tracking fine-grained face-specific details, which might not be perfectly captured by linear tube masking [71]. Until now, most existing approaches to facial analysis tasks are highly specialized and develop task-specific models trained in a fully supervised manner [46, 54, 63], with very few recent efforts towards learning generic image-based facial encodings [10, 84]. These closely related works [10, 84] either focus on exploring training dataset properties in terms of size and quality [10] or perform pre-training in a visual-linguistic way [84]. They are hard to scale since they use static image-level facial information, and the image-caption pairs are highly associated with context information rather than the face.
In this paper, our goal is to learn universal and task-agnostic representations in a self-supervised manner for face-related downstream tasks (see Fig. 1). For this purpose, we employ a masked autoencoder [37, 71] with a facial-guided masking strategy that learns to reconstruct spatio-temporal details of a face from densely masked facial regions using non-annotated videos. Unlike existing approaches for natural scene videos [71], where the tube masking is initialized on a static part of the video without any semantic information, our approach dynamically tracks the face and then develops a facial part-guided tube masking strategy using an off-the-shelf face parser, i.e. FaceX-Zoo [75]. Thus, we pose a more challenging task that encourages the model to learn spatio-temporal representations covering local as well as global information. Inspired by prior works [27, 60] showing high-quality reconstruction results along with rich and generic latent features, we incorporate an adversarial loss on top of the masked encoding to enhance reconstruction quality. Our experimental results show that our proposed framework, MARLIN, learns highly generic facial encodings that scale and transfer well across diverse facial analysis tasks such as FER, DFD, FAR, and LS, and achieves favorable performance gains w.r.t. state-of-the-art benchmarks. In summary, our main contributions are:
• We propose MARLIN, a universal and task-agnostic facial encoder that learns robust and transferable facial representations from abundantly available non-annotated web-crawled facial videos in a self-supervised fashion.
• As a challenging auxiliary task, we propose to reconstruct the spatio-temporal details of the face from densely masked facial regions. The proposed facial region-guided tube masking (aka Fasking) strategy aims to learn local and global aspects from facial videos, which in turn help encode generic and transferable features.
• Through extensive quantitative and qualitative analysis, we show that MARLIN learns rich, generic, transferable, and robust facial representations that perform consistently well across a variety of downstream tasks including FAR ($1.13%$ gain over supervised benchmark), FER ($2.64%$ gain over unsupervised benchmark), DFD ($1.86%$ gain over unsupervised benchmark), and LS ($29.36%$ gain in Fréchet Inception Distance), even in few-shot settings.
Table 1. Facial Analysis Tasks. Overview of different face-related tasks and the relevant datasets curated over the years.
| Datasets | #Samples | Env. | Fmt. | Task | Year |
|---|---|---|---|---|---|
| LFW [39] | 13,233 | Wild | Img. | Identification | 2008 |
| VGG-FACE [54] | 2.6M | Wild | Img. | Identification | 2015 |
| CelebA [50] | 202,599 | Wild | Img. | Attributes | 2015 |
| YouTubeFace [78] | 3,425 | Wild | Vid. | Identification | 2011 |
| LRS2 [22] | 144,482 | Wild | Vid. | Lip Sync. | 2017 |
| CelebV [79] | 5 | Wild | Vid. | Reenact | 2018 |
| CMU-MOSEI [83] | 23,453 | Wild | Vid. | Emo, Senti | 2018 |
| FaceForensics++ [62] | 1,004 | Wild | Vid. | DeepFake | 2019 |
| VoxCeleb2 [23] | 150,480 | Wild | Vid. | Speaker | 2018 |
| CelebV-HQ [85] | 55,666 | Wild | Vid. | Attribute | 2022 |
2. Related Work
Masked Auto Encoder. Masked autoencoders learn robust and transferable representations by reconstructing masked regions of the input. Masked autoencoding is motivated by context encoders [56] and denoising autoencoders [73]. After the success of BERT-style masking [26], the vision community has also explored different design choices for masked autoencoding, such as pixel-level masking [17, 37, 80], token-level masking [29], and deep-feature-based masking [6, 77], using vision Transformers [44, 52]. Similarly, for modeling spatio-temporal patterns of the input data, masked motion modeling [69] and tube masking [71] strategies have been incorporated recently. Along this line, MARLIN masks and reconstructs domain-specific facial parts to learn a universal facial representation.
Facial Representation Learning. To date, most existing facial analysis approaches are task-specific and fully supervised [46, 54, 63], relying on manually annotated data to enhance performance. Any state-of-the-art model's performance on benchmark datasets is impacted by the quality and quantity of annotated data used during training. Tab. 1 shows an overview of the task-specific large-scale facial image and video datasets that have been curated over the past decade [1] to facilitate research in Face Verification (LFW [39], MS-Celeb-1M [34], VGG-FACE [54], VGGFace2 [13]), Facial Attribute Recognition (CelebA [50], CelebV-HQ [85]), Facial Emotion Recognition (CMU-MOSEI [83]), DeepFake Detection (FF++ [62]), and Lip Synchronization (LRS2 [22]). However, data curation encounters several challenges, such as the need for specialized hardware (e.g. for FER and action unit data), discrepancies in data distribution that prevent merging of multiple datasets [10], and, most importantly, the time-consuming and resource-expensive annotation process. To eliminate these drawbacks, some existing approaches [20, 81, 82] adopt a data augmentation strategy via image or video synthesis, as the surge in face generation technology fueled by Generative Adversarial Networks (GANs) [20, 67, 81, 82] and other generation techniques [16, 35] enables realistic face generation, even with control over facial attributes. These generation techniques quantitatively add variation to the training set, but in some cases the results still lag qualitatively due to domain-specific inconsistencies and, more importantly, high network complexity.
To this end, very few recent works aim to learn image-based, task-specific facial encodings with limited supervision [3, 9, 10, 65, 84, 86]. The most closely related existing works [10, 84] either focus on exploring training dataset properties in terms of size and quality [10] or perform pre-training in a visual-linguistic way [84]. These works [10, 84] are hard to scale since they use static image-level facial information, and the image-caption pairs are highly correlated with context information rather than the face. In this work, we aim to develop a generic, universal, and task-agnostic facial encoder that learns from web-crawled non-annotated data. Our experimental analysis shows that MARLIN can align the latent space manifold to any desired downstream task-specific label space. Thus, MARLIN has the capability to act as a strong facial encoder or feature extractor in many low-resource real-world applications.
3. MARLIN
Our objective is to learn robust and transferable universal facial representations from abundantly available non-annotated facial video data [78]. Thinking holistically, face-specific tasks involve two different aspects: a) facial-appearance-related attributes, such as parts of the face (nose, eyes, lips, hair, etc.) and facial shape and texture, which mainly need spatial investigation; and b) facial actions, such as emotion, Facial Action Coding System (FACS) activity, and lip synchronization, which require temporal information. Thus, spatio-temporal modeling is highly desirable in order to learn strong, robust, and transferable representations. To this end, our proposed framework, MARLIN, adopts a facial-region-guided masking strategy which poses a challenging auxiliary reconstruction task for self-supervised representation learning (see Fig. 2). To facilitate learning with the masked autoencoder, we mainly use the YouTube Faces [78] dataset, which consists of facial videos crawled from YouTube with variation across different real-life conditions.
3.1. Facial Representation Learning
Preliminaries. MARLIN consists of an encoder $(\mathcal{F}_{\phi_{\mathcal{E}}})$, decoder $(\mathcal{F}_{\phi_{\mathcal{D}}})$ and discriminator $(\mathcal{F}_{\phi_{\Gamma}})$ with embedding parameters $\phi_{\mathcal{E}}$, $\phi_{\mathcal{D}}$ and $\phi_{\Gamma}$, respectively. We are given a training dataset $\mathbb{D}=\{V_{i}\}_{i=1}^{N}$, where $N$ is the number of videos in the dataset and $V\in\mathbb{R}^{C\times T_{0}\times H_{0}\times W_{0}}$ ($C, T_{0}, H_{0}, W_{0}$ are the channel, temporal depth, height and width of the raw video, respectively). From the raw input video $V$, we track and crop the facial regions [75], followed by random temporal sampling, yielding $v\in\mathbb{R}^{C\times T\times H\times W}$ ($T, H, W$ are the modified temporal depth, height, and width of the derived video clip, respectively). The derived video clip
$v$ is further mapped to $(\mathbf{k}-\mathbf{n})$ visible tokens $\tilde{X}_{v}$ and $\mathbf{n}$ masked tokens $\tilde{X}_{m}$ by the facial-region guided masking strategy $(\mathcal{F}_{\phi_{f}})$ with a pre-defined masking ratio $\mathbf{r}=\frac{\mathbf{n}}{\mathbf{k}}$. Here, $\mathbf{e}$ is the embedding dimension and $\mathbf{k}$ is the total number of tokens derived from $v$, i.e. $\mathbf{k}=\frac{T}{\mathbf{t}}\times\frac{H}{\mathbf{h}}\times\frac{W}{\mathbf{w}}$, given that the 3D cube tokens each have dimension $\mathbf{t}\times\mathbf{h}\times\mathbf{w}$. Thus, MARLIN injects facial-region-specific domain knowledge into the aforementioned token space to guide representation learning via masking.
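As a concrete check of the token arithmetic above, a short sketch with the default clip and cube sizes reported later in Sec. 4.1 (the helper function is illustrative, not part of the released code):

```python
# Token-space arithmetic for MARLIN's cube embedding.
# Defaults follow the paper's configuration: 16x224x224 clips,
# 2x16x16 cube tokens, and a 90% masking ratio.
def token_counts(T=16, H=224, W=224, t=2, h=16, w=16, r=0.9):
    k = (T // t) * (H // h) * (W // w)  # k = T/t * H/h * W/w total 3D tokens
    n = int(r * k)                      # n masked tokens at ratio r = n/k
    return k, n, k - n                  # total, masked, visible

print(token_counts())  # (1568, 1411, 157): only 157 tokens stay visible
```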
The visible tokens $\tilde{X}_{v}$ are mapped to the latent space $\mathbf{z}$ by the mapping function $\mathcal{F}_{\phi_{\mathcal{E}}}:\tilde{X}_{v}\rightarrow\mathbf{z}$. The latent feature $\mathbf{z}$ is further fed to the decoder $\mathcal{F}_{\phi_{\mathcal{D}}}$, which reconstructs from $\mathbf{z}$ the $\mathbf{n}$ masked tokens $X_{m}^{'}$ via the mapping $\mathcal{F}_{\phi_{\mathcal{D}}}:\mathbf{z}\rightarrow X_{m}^{'}$. In the decoder, the corresponding visible and masked 3D cubes contain the flattened raw pixels, i.e. $\mathbf{e}=C\mathbf{thw}$. In brief, given the visible tokens $\tilde{X}_{v}$, we reconstruct the masked tokens by the following function:
$$
X_{m}^{'}=\mathcal{F}_{\phi_{\mathcal{D}}}\circ\mathcal{F}_{\phi_{\mathcal{E}}}(\tilde{X}_{v})
$$
Since reconstructing spatio-temporal facial patterns from raw pixels is quite challenging, we deploy a discriminator $\mathcal{F}_{\phi_{\Gamma}}$ with adversarial training for better synthesis.
3.2. Self-Supervised Representation Learning.
The self-supervised pre-training strategy of MARLIN consists of three main components, described below:
a) Facial-region Guided Tube Masking (Fasking). In order to capture spatio-temporal correspondence, we deploy a facial-region-specific tube masking strategy following [71]. We dynamically track and mask facial components across the temporal axis for each spatio-temporal cube. Our facial-region-based tube-masking strategy ensures that the same facial region is masked throughout the temporal cube, thus posing a challenging reconstruction task and promoting the learning of local and global facial details (see Alg. 1). As the masked spatio-temporal cubes look like deformable bending tubes, we term this Facial region-guided tube masking, aka Fasking.
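A minimal sketch of the idea behind Fasking, assuming a per-token facial-region assignment is already available from a face parser such as FaceX-Zoo (the region layout and the greedy selection loop are simplified placeholders, not the released implementation):

```python
import numpy as np

def fasking_mask(region_ids, n_regions, r=0.9, rng=None):
    """Facial-region-guided tube mask (sketch).

    region_ids: (T', H', W') int array assigning each 3D token to a facial
                region (eyes, nose, mouth, skin, ...), tracked per frame.
    Returns a boolean (T', H', W') array where True marks a masked token.
    A region is masked in every frame it appears in, so the masked tokens
    form deformable "tubes" that follow the moving facial parts.
    """
    rng = rng or np.random.default_rng(0)
    k = region_ids.size
    mask = np.zeros_like(region_ids, dtype=bool)
    # Greedily mask whole facial regions until the target ratio r is reached.
    for reg in rng.permutation(n_regions):
        mask |= (region_ids == reg)  # same region masked across the time axis
        if mask.sum() >= r * k:
            break
    return mask
```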

Figure 2. Architectural overview of MARLIN (best viewed in color). MARLIN mainly consists of (a) a Representation Learning Module, (b) Facial Region guided Tube Masking, and (c) Downstream Adaptation. (a) Representation Learning Module: MARLIN learns the facial representation from unlabeled, web-crawled video data in a self-supervised fashion (highlighted in Blue). (b) Facial Region guided Tube Masking: With the aid of facial region guided tube masking (highlighted in Yellow), MARLIN gets joint spatio-temporal attention, which in turn facilitates downstream performance. The face-guided tube masking strategy injects domain knowledge into the pipeline. (c) Downstream Adaptation: For facial task specific downstream adaptation, MARLIN utilizes Linear Probing (LP) and Fine-Tuning (FT) to show the robustness, generalizability, and transferability of the learned features (highlighted in Green).
c) Adversarial Adaptation Strategy. To enhance generation quality for rich representation learning, we incorporate adversarial adaptation on top of the masked autoencoder backbone. According to the prior literature [27, 60], adversarial training enhances generation quality, which in turn results in a rich latent feature $\mathbf{z}$. The discriminator $\mathcal{F}_{\phi_{\Gamma}}$, as shown in Fig. 2, is an MLP-based network which imposes an adversarial loss $\mathcal{L}_{\mathrm{adv}}$ between the masked tokens $X_{m}$ and their reconstructed counterparts $X_{m}^{'}$.
3.3. Overall MARLIN Loss
Alg. 2 summarizes the training process for the MARLIN framework. MARLIN mainly imposes (a) a Reconstruction Loss and (b) an Adversarial Loss to facilitate training.
(a) Reconstruction Loss. Given the input masked tokens $\tilde{X}_{m}$, the masked autoencoder module reconstructs them back as $X_{m}^{'}$. To this end, we minimize a mean squared error loss in the 3D token space to update the weights of the $(\mathcal{F}_{\phi_{\mathcal{D}}}\circ\mathcal{F}_{\phi_{\mathcal{E}}}\circ\mathcal{F}_{\phi_{f}})$ branch. The loss is defined as
$$
\mathcal{L}_{\mathrm{recon}}=\frac{1}{N}\sum_{i=1}^{N}||X_{m}^{(i)}-X_{m}^{'(i)}||_{2}
$$
where $N$ is the total number of videos in $\mathbb{D}$, and $X_{m}^{(i)}$ and $X_{m}^{'(i)}$ are the masked tokens and their reconstruction for the $i$-th sample in $\mathbb{D}$.
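A minimal NumPy sketch of this reconstruction objective, assuming the masked tokens are stacked into `(N, n, e)` arrays (shapes and names are illustrative):

```python
import numpy as np

def recon_loss(x_masked, x_recon):
    """L_recon = (1/N) * sum_i || X_m^(i) - X'_m^(i) ||_2
    over N samples of masked tokens and their reconstructions."""
    diffs = (x_masked - x_recon).reshape(len(x_masked), -1)
    return np.linalg.norm(diffs, axis=1).mean()

# A perfect reconstruction gives zero loss:
x = np.random.default_rng(0).normal(size=(2, 3, 4))
print(recon_loss(x, x))  # 0.0
```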
Algorithm 2 Training procedure for MARLIN
(b) Adversarial Loss. The adversarial adaptation uses the Wasserstein GAN loss [5] for better reconstruction of spatio-temporal facial patterns, which in turn helps in learning a rich representation. The loss is defined as follows:
$$
\begin{aligned}
\mathcal{L}_{\mathrm{adv}}^{(d)} &= \frac{1}{N\mathbf{n}}\sum_{i=1}^{N}\Big(\sum_{x_{m}^{'}\in X_{m}^{'(i)}}\mathcal{F}_{\phi_{\Gamma}}(x_{m}^{'})-\sum_{x_{m}\in X_{m}^{(i)}}\mathcal{F}_{\phi_{\Gamma}}(x_{m})\Big)\\
\mathcal{L}_{\mathrm{adv}}^{(g)} &= -\frac{1}{N\mathbf{n}}\sum_{i=1}^{N}\sum_{x_{m}^{'}\in X_{m}^{'(i)}}\mathcal{F}_{\phi_{\Gamma}}(x_{m}^{'})
\end{aligned}
$$
Thus, the overall learning objective $\mathcal{L}$ is formulated as follows, where $\lambda_{W}$ is the weighting parameter:
$$
\begin{aligned}
\mathcal{L}^{(g)} &= \mathcal{L}_{\mathrm{recon}}+\lambda_{W}\mathcal{L}_{\mathrm{adv}}^{(g)}\\
\mathcal{L}^{(d)} &= \mathcal{L}_{\mathrm{adv}}^{(d)}
\end{aligned}
$$
During MARLIN’s pre-training phase, $\mathcal{L}^{(d)}$ updates the discriminator parameters $\phi_{\Gamma}$ and $\mathcal{L}^{(g)}$ updates the encoder and decoder parameters $\phi_{\mathcal{E}}, \phi_{\mathcal{D}}$.
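Putting the pieces together, the alternating optimization summarized in Alg. 2 can be sketched as a single forward computation of both objectives; `encoder`, `decoder`, and `discriminator` are stand-in callables for $\mathcal{F}_{\phi_{\mathcal{E}}}$, $\mathcal{F}_{\phi_{\mathcal{D}}}$, $\mathcal{F}_{\phi_{\Gamma}}$, not the released implementation:

```python
import numpy as np

def marlin_losses(x_visible, x_masked, encoder, decoder, discriminator, lam_w=1.0):
    """Compute L^(d) and L^(g) for one batch (sketch).

    In training, L^(d) updates only the discriminator parameters, while
    L^(g) = L_recon + lam_w * L_adv^(g) updates the encoder and decoder.
    """
    x_recon = decoder(encoder(x_visible))
    # Wasserstein critic loss: mean score of reconstructions minus real cubes.
    loss_d = discriminator(x_recon).mean() - discriminator(x_masked).mean()
    # Reconstruction term plus the (negated) adversarial generator term.
    recon = np.linalg.norm((x_masked - x_recon).reshape(len(x_masked), -1),
                           axis=1).mean()
    loss_g = recon - lam_w * discriminator(x_recon).mean()
    return loss_d, loss_g
```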
3.4. Downstream Adaptation
Our proposed MARLIN framework learns robust and transferable facial representations from facial videos in a self-supervised way. Following standard evaluation protocols, we adopt Linear Probing (LP) and Fine-Tuning (FT) for downstream adaptation to different face-relevant tasks (see the inference module in Fig. 2). Given any task-specific downstream dataset $\mathbb{D}_{\mathrm{down}}=\{v_{j},\mathbf{y}_{j}\}_{j=1}^{\mathcal{N}}$, we deploy linear fully-connected (FC) layers with embedding parameters $\theta$ on top of the encoder module $\mathcal{F}_{\phi_{\mathcal{E}}}$ to align the latent space to the downstream task-specific label space. For linear probing, we freeze the backbone network $\mathcal{F}_{\phi_{\mathcal{E}}}$ and only update $\mathcal{F}_{\theta}$. On the other hand, for FT, we fine-tune the whole module, i.e. $(\mathcal{F}_{\phi_{\mathcal{E}}}\circ\mathcal{F}_{\theta})$. When MARLIN is used as a feature extractor for LP, it uses a sliding temporal window to extract features $Z$ from the input face-cropped video $V$, as shown in Fig. 2 (c). The details of the different downstream facial tasks are described below:
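The two adaptation modes can be sketched in PyTorch as follows; `encoder` stands in for the pre-trained $\mathcal{F}_{\phi_{\mathcal{E}}}$ and a single linear head for $\mathcal{F}_{\theta}$ (the builder function and its names are illustrative, not MARLIN's released API):

```python
import torch.nn as nn

def build_downstream(encoder: nn.Module, feat_dim: int, n_classes: int,
                     mode: str = "LP") -> nn.Module:
    """Attach a linear head F_theta on top of the pre-trained encoder F_phi_E.

    mode="LP": freeze the backbone and train only the head (linear probing).
    mode="FT": leave everything trainable (fine-tuning).
    """
    if mode == "LP":
        for p in encoder.parameters():
            p.requires_grad = False  # backbone stays fixed during LP
    return nn.Sequential(encoder, nn.Linear(feat_dim, n_classes))
```

For a multi-label task such as FAR, the head would typically be followed by per-attribute sigmoids, while single-label tasks like FER or DFD use a softmax over classes.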
Facial Attribute Recognition (FAR) predicts the presence of appearance and action attributes, such as gender, race, hair color, and emotion, in a given face video. Predicting facial attributes can be posed as a multi-label learning problem highly dependent on rich spatial encoding. For downstream adaptation, we use 28,532 train, 3,567 val, and 3,567 test videos from the CelebV-HQ [85] dataset. Following the prior works [33, 50, 84], we report average accuracy (↑) and Area Under the Curve (AUC ↑) over all attributes.
The Facial Expression Recognition (FER) task encodes spatio-temporal facial muscle movement patterns to predict the emotion (6-class) and sentiment (7-class and 2-class) of the subject in a given facial video. We evaluate the performance of MARLIN on the CMU-MOSEI dataset [7], a conversational corpus with 16,726 train, 1,871 val, and 4,662 test samples. Following the prior works [7, 25], we use overall accuracy (↑) as the metric.
The Deepfake Detection (DFD) task predicts spatio-temporal facial forgery given a facial video from the FF++ (LQ) dataset [62]. For downstream adaptation, we use 3,600 train, 700 val, and 700 test videos from the FF++ (LQ) dataset [62]. Following prior literature [12, 58, 76], we use accuracy (↑) and AUC (↑) as the evaluation metrics.
Lip Synchronization (LS) is another line of research that requires facial-region-specific spatio-temporal synchronization. This downstream adaptation further demonstrates the adaptation capability of MARLIN for face generation tasks. For adaptation, we replace the facial encoder module in Wav2Lip [57] with MARLIN and adjust the temporal window accordingly, i.e. from 5 frames to $\mathbf{T}$ frames. For evaluation, we use the LRS2 [22] dataset, which has 45,838 train, 1,082 val, and 1,243 test videos. Following the prior literature [57, 74], we use Lip-Sync Error-Distance (LSE-D ↓), Lip-Sync Error-Confidence (LSE-C ↑) and Fréchet Inception Distance (FID ↓) [38] as evaluation metrics.
4. Experiments and Results
We have comprehensively compared our method on different downstream adaptation tasks from quantitative (see Sec. 4.2) and qualitative (see Sec. 4.3) perspectives.
我们在不同的下游适应任务上从定量(参见第4.2节)和定性(参见第4.3节)角度全面比较了我们的方法。
Table 2. Facial Attribute Recognition. Our proposed framework, MARLIN, trained on the YTF [78] dataset and Linear Probed/Fine-Tuned on the CelebV-HQ [85] benchmark dataset, in terms of accuracy↑ and area under the curve↑. * denotes supervised methods trained on the CelebV-HQ [85] dataset.
表 2: 面部属性识别。我们提出的框架 MARLIN 在 YTF [78] 数据集上训练,并在 CelebV-HQ [85] 基准数据集上进行线性探测/微调,以准确率↑和曲线下面积↑为指标。* 表示在 CelebV-HQ [85] 数据集上训练的监督方法。
| 方法 | 外观 Acc.↑ | 外观 AUC↑ | 动作 Acc.↑ | 动作 AUC↑ | 整体 Acc.↑ |
|---|---|---|---|---|---|
| R3D [72]* | 92.34 | 0.9424 | 94.57 | 0.9173 | 93.45 |
| MViTv1[30]* | 92.90 | 0.9452 | 95.13 | 0.9233 | 94.01 |
| MViTv2[49]* | 92.77 | 0.954 | 95.15 | 0.9239 | 93.96 |
| VideoMAE(FT)[71] | 92.91 | 0.9529 | 95.37 | 0.9284 | 94.14 |
| MARLIN (LP) | 91.90 | 0.9373 | 95.25 | 0.9278 | 93.57 |
| MARLIN (FT) | 93.90 | 0.9561 | 95.48 | 0.9406 | 94.69 |
Additionally, we have performed extensive ablation studies to justify our design choices.
此外,我们还进行了大量的消融实验,为我们的设计选择提供了依据。
4.1. Experimental Protocols
4.1. 实验协议
Datasets. We evaluate the MARLIN framework on the different facial analysis tasks described in Sec. 3.4. In brief, we use CelebV-HQ [85] for facial attribute and action prediction, the CMU-MOSEI dataset [7] for conversational emotion and sentiment prediction, the FF++ (LQ) dataset [62] for deepfake detection, and LRS2 [22] for lip synchronization.
数据集。我们在第3.4节描述的不同面部分析任务上评估MARLIN框架。具体而言,使用CelebV-HQ [85]进行面部属性和动作预测,CMU-MOSEI数据集[7]进行对话情绪和情感预测,FF++ (LQ) 数据集[62]进行深度伪造检测,LRS2 [22]进行唇部同步。
Settings. For fair comparisons, we follow the dataset-specific experimental protocols mentioned in the task-specific prior literature [7, 22, 33, 50, 62, 84]. Beyond traditional evaluation, we also perform a few-shot adaptation strategy to show the robustness and transferability of MARLIN.
设置。为确保公平比较,我们遵循各任务相关先前文献[7, 22, 33, 50, 62, 84]中提到的数据集特定实验方案。除传统评估外,我们还采用少样本适应策略以验证MARLIN的鲁棒性与迁移能力。
Implementation Details. We implemented the method in PyTorch [55] on an Nvidia RTX A6000 GPU. Since consecutive frames of any temporal chunk of a facial video are highly redundant, we only consider semantically meaningful frames with significant inter-frame motion by adopting a minimum temporal stride of 2. Given an input video of dimension $3\times16\times224\times224$, the cube embedding layer generates $8\times14\times14$ 3D tokens of dimension $2\times16\times16$ to preserve spatio-temporal patterns. Using the Fasking strategy (See Algo. 1), MARLIN densely masks these tokens with a pre-defined masking ratio. Our empirical analysis suggests that MARLIN works favorably with a high masking ratio (90%). MARLIN's objective is to reconstruct the masked part from the sparse visible tokens. After Fasking, each token is mapped to a latent embedding of dimension 768, from which the masked part is reconstructed in the 3D token space and can further be mapped back to the original video. For fair comparison, we use ViT-B as the backbone encoder; the impact of other ViT variants is shown in the ablation study. The pre-training hyper-parameters are as follows: the base learning rate is linearly scaled with the overall batch size, lr = base learning rate × batch size / 256. For self-supervised pre-training, we use the AdamW optimizer with base learning rate 1.5e-4, momentum $\beta_{1}=0.9$, $\beta_{2}=0.95$, and a cosine-decay learning rate scheduler [51]. For linear probing, we use the Adam optimizer with $\beta_{1}=0.5$, $\beta_{2}=0.9$, base learning rate 1e-4, and weight decay 0. For fine-tuning, we use the Adam optimizer with $\beta_{1}=0.5$, $\beta_{2}=0.9$, and base learning rate 1e-4 without any weight decay.
实现细节。我们在PyTorch [55] 和Nvidia RTX A6000 GPU上实现了该方法。首先,给定面部视频的任何时间片段,连续帧之间存在高度冗余性。因此,为筛选具有显著帧间运动的语义关键帧,我们采用最小时间步长值为2。对于输入视频(维度为 $3\times16\times224\times224$),立方体嵌入层会生成 $8\times14\times14$ 个维度为 $2\times16\times16$ 的3D Token以保留时空模式。通过Fasking策略(参见算法1),MARLIN以预定义的掩码比例对这些Token进行密集掩码。实验分析表明MARLIN在较高掩码比例 (90%) 下表现优异。MARLIN的目标是从稀疏可见Token中重建被掩码部分。Fasking处理后,每个Token被映射到768维的潜在空间嵌入。基于该潜在嵌入,被掩码部分将在3D Token空间中重建,并可进一步映射回原始视频。为公平比较,我们采用ViT-B作为主干编码器,其他ViT变体的影响详见消融实验。预训练超参数设置如下:基础学习率随总批次大小线性缩放,即 lr = 基础学习率 × 批次大小 / 256。自监督预训练使用AdamW优化器,基础学习率 1.5e-4,动量 $\beta_{1}=0.9,\beta_{2}=0.95$ 并配备余弦衰减学习率调度器 [51]。线性探测阶段采用Adam优化器,$\beta_{1}=0.5,\beta_{2}=0.9$,基础学习率 1e-4,权重衰减为0。微调阶段使用Adam优化器,$\beta_{1}=0.5,\beta_{2}=0.9$,基础学习率 1e-4 且不设权重衰减。
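The token and learning-rate bookkeeping above can be sketched as follows. This is a rough illustration assuming the dimensions quoted in the text (a 3×16×224×224 input, 2×16×16 cubes, 90% masking); the helper names are ours, not from the official MARLIN implementation, and a uniform random split stands in for the face-guided Fasking choice.

```python
# Sketch of the token/mask arithmetic described above. Helper names are
# illustrative, not from the official MARLIN codebase.
import random

def cube_token_grid(frames=16, height=224, width=224,
                    cube_t=2, cube_h=16, cube_w=16):
    """Token counts per axis after cube embedding: here (8, 14, 14)."""
    return frames // cube_t, height // cube_h, width // cube_w

def split_visible_masked(num_tokens, mask_ratio=0.9, seed=0):
    """Randomly partition token indices into (visible, masked) sets.
    MARLIN's Fasking biases this choice toward facial regions; a uniform
    split is shown here for simplicity."""
    rng = random.Random(seed)
    idx = list(range(num_tokens))
    rng.shuffle(idx)
    n_masked = int(round(num_tokens * mask_ratio))
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])

def scaled_lr(base_lr=1.5e-4, batch_size=256):
    """Linear learning-rate scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

t, h, w = cube_token_grid()
num_tokens = t * h * w              # 8 * 14 * 14 = 1568 tokens per clip
visible, masked = split_visible_masked(num_tokens)
print(num_tokens, len(visible), len(masked))   # 1568 157 1411
```

At a 90% masking ratio only 157 of the 1,568 tokens remain visible to the encoder, which is what makes the reconstruction auxiliary task challenging.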
Table 3. Facial Expression and Sentiment Recognition. Downstream adaptation results on the MOSEI dataset [7] for emotion, sentiment (7-class), and sentiment (2-class). Our proposed method, MARLIN, outperforms visual-modality-based emotion prediction methods. Please note that the SOTA methods UMONS [25] and GMF [4] utilize three modalities and are thus not directly comparable. Here, YTF: YouTube Faces [78]; L, A, and V denote the linguistic, audio, and visual modalities, respectively. * denotes supervised methods.
表 3: 面部表情与情感识别。在MOSEI数据集[7]上针对情感、情绪(7分类)和情绪(2分类)的下游适配结果。我们提出的MARLIN方法优于基于视觉模态的情感预测方法。请注意,UMON[25]和GMF[4]的SOTA结果使用了三种模态,因此不具备直接可比性。其中YTF: YouTube Face[78],LAV分别表示语言、音频和视觉模态。*表示监督式方法。
| 任务 | 预训练 | 方法 | 模态 | 准确率↑ |
|---|---|---|---|---|
| 情感 | | MViTv1[49]* | V | 80.45 |
| 情感 | | UMONS[25]* | LAV | 80.68 |
| 情感 | | GMF[4]* | LAV | 81.14 |
| 情感 | YTF[78] | VideoMAE[71] | V | 80.39 |
| 情感 | YTF[78] | MARLIN | V | 80.60 |
| 情绪(7分类) | | MViTv1[49]* | V | 33.35 |
| 情绪(7分类) | YTF[78] | VideoMAE[71] | V | 33.78 |
| 情绪(7分类) | YTF[78] | MARLIN | V | 34.63 |
| 情绪(2分类) | MOSEI[7]和IEMOCAP[11] | CAE-LR[45] | V | 71.06 |
| 情绪(2分类) | YTF[78] | VideoMAE[71] | V | 72.96 |
| 情绪(2分类) | YTF[78] | MARLIN | V | 73.70 |
4.2. Quantitative Analysis
4.2. 定量分析
4.2.1. Comparison with SOTA Facial Analysis Tasks.
4.2.1. 与SOTA面部分析任务的对比
We compare the performance of MARLIN with different downstream facial analysis tasks following standard task specific evaluation protocols [7, 22, 33, 50, 62, 84].
我们按照特定任务的标准评估协议 [7, 22, 33, 50, 62, 84],比较了 MARLIN 在不同下游面部分析任务中的性能。
Facial Attributes. In Tab. 2, we compare the LP and FT adaptation performance of MARLIN with popular transformers (i.e. MViT-v1 [30] and MViT-v2 [49]) and CNNs (i.e. R3D [72]) on the CelebV-HQ [85] dataset.
面部属性。在表 2 中,我们比较了 MARLIN 与流行 Transformer 的 LP (Linear Probing) 和 FT (Fine-Tuning) 适应性能。
Table 4. Deepfake Detection. We compare the Fine-Tuning (FT) results of MARLIN on the FaceForensics++ [62] dataset. * denotes supervised methods.
表 4: Deepfake检测。我们比较了MARLIN在FaceForensics++ [62]数据集上的微调(FT)结果。*表示监督方法。
| 预训练 (Pre-train) | 方法 (Method) | 准确率 (Acc.)(%)↑ | AUC↑ |
|---|---|---|---|
| | Steg. Features [32]* | 55.98 | |
| | LD-CNN [24]* | 58.69 | |
| | Constrained Conv. [8]* | 66.84 | |
| | CustomPooling CNN [61]* | 61.18 | |
| | MesoNet [2]* | 70.47 | |
| | Face X-ray [47]* | | 0.6160 |
| | Xception [21]* | 86.86 | 0.8930 |
| | F3-Net [58]* | 93.02 | 0.9580 |
| | P3D [59]* | | 0.6705 |
| | R3D [72]* | | 0.8772 |
| | I3D [15]* | | 0.9318 |
| | M2TR [76]* | | 0.9395 |
| | ST-M2TR [76]* | | 0.9531 |
| YTF [78] | VideoMAE[71] | 87.57 | 0.9082 |
| YTF [78] | MARLIN | 89.43 | 0.9305 |
Table 5. Lip Synchronization. We compare Linear Probing (LP) and Fine-Tuning (FT) results on the LRS2 [22] dataset.
表 5: 唇音同步。我们在 LRS2 [22] 数据集上比较了线性探测 (Linear Probing, LP) 和微调 (Fine-Tuning, FT) 的结果。
| 方法 | LSE-D↓ | LSE-C↑ | FID↓ |
|---|---|---|---|
| Speech2Vid[41] | 14.230 | 1.587 | 12.320 |
| LipGAN [42] | 10.330 | 3.199 | 4.861 |
| Wav2Lip [57] | 7.521 | 6.406 | 4.887 |
| AttnWav2Lip[74] | 7.339 | 6.530 | — |
| Wav2Lip + ViT [28] | 8.996 | 2.807 | 13.352 |
| Wav2Lip+ViT+VideoMAE[71] | 7.316 | 5.096 | 4.097 |
| Wav2Lip+ViT+MARLIN | 7.127 | 5.528 | 3.452 |
From Tab. 2, it is observed that MARLIN's FT version outperforms the supervised MViT-v2 [49] transformer architecture by 1.13% (92.77% → 93.90%) on appearance attributes and by 0.33% (95.15% → 95.48%) on action attributes. Similar patterns have also been observed with the R3D CNN module. We attribute MARLIN's performance gain to its pre-training strategy, which encodes generic, robust, and transferable features from any input facial video.
在CelebV-HQ [85]数据集上对比了MViT系列(即MViT-v1 [30]和MViT-v2 [49])与CNN模型(即R3D [72])。表中数据显示,MARLIN的微调版本在外观属性分类上比监督式MViT-v2 [49] Transformer架构提升1.13% (92.77%→93.90%),在动作属性分类上提升0.33% (95.15%→95.48%)。R3D CNN模块也呈现相似趋势。我们认为MARLIN的性能提升源于其预训练策略,该策略能从任意输入的面部视频中编码出通用、鲁棒且可迁移的特征。
Emotion and Sentiment. In Tab. 3, we similarly compare the LP and FT adaptation performance for conversational emotion and sentiment in terms of accuracy (↑) on the CMU-MOSEI [83] dataset. Please note that MARLIN is a visual-modality-only encoder. The results suggest that MARLIN performs competitively with SOTA methods [25, 45, 49]; in particular, it outperforms the unsupervised SOTA CAE-LR [45] by 2.64% (71.06% → 73.70%) on the 2-class sentiment task. On emotion and 7-class sentiment, it also marginally outperforms supervised benchmarks [49]. These results further indicate that MARLIN learns highly generic, robust, and transferable feature representations from pre-training.
情感与情绪分析。在表3中,我们同样比较了对话情感与情绪分析的LP和FT适应性能,指标为CMU-MOSEI数据集[83]上的准确率 (↑)。需注意MARLIN是仅基于视觉模态的编码器。结果表明,MARLIN与当前最优方法[25,45,49]表现相当,尤其在二分类情感任务上以2.64% (71.06% → 73.70%) 的优势超越无监督SOTA方法CAE-LR[45]。对于情绪和七分类情感任务,它也略微优于有监督基准[49]。这些结果还表明,MARLIN通过预训练学习到了高度通用、鲁棒且可迁移的特征表示。
DeepFake Detection. In Tab. 4, we compare video-manipulation detection performance on the FaceForensics++ [62] dataset and report results in terms of video-level accuracy (↑) and AUC (↑). The results indicate that MARLIN performs favorably against the supervised SOTA methods [2, 8, 15, 21, 24, 32, 47, 59, 61, 72]. This is the first SSL work that uses only spatio-temporal visual anomalies to detect video manipulation, unlike F3-Net [58], which uses frequency-aware patterns over the temporal dimension to detect forgeries in a supervised fashion. MARLIN, irrespective of frequency patterns, learns facial representations and can detect anomalies from the spatio-temporal signal.
DeepFake检测。在表4中,我们比较了FaceForensics++ [62]数据集上的视频篡改检测性能,并以视频级准确率 (↑) 和AUC (↑) 报告结果。结果表明,MARLIN在性能上优于有监督的SOTA方法[2, 8, 15, 21, 24, 32, 47, 59, 61, 72]。这是首个仅使用时空视觉信息异常进行视频篡改检测的自监督学习(SSL)工作。不同于F3-Net [58]使用时域维度上的频率感知模式以有监督方式检测伪造,MARLIN无需依赖频率模式即可学习面部表征,并能从时空信号中检测异常。
Lip Synchronization. For a fair comparison, we adopt the following experimental setups: 1) Wav2Lip + ViT: to compare the contribution of the ViT architecture [28] w.r.t. SOTA CNNs and MARLIN, where the ViT weights are trained from scratch on the LRS2 [22] dataset. 2) Wav2Lip + ViT + VideoMAE: to compare the contribution of vanilla VideoMAE with a ViT backbone pre-trained on
唇形同步 (Lip Synchronization)。为公平比较,我们采用以下实验设置:1) Wav2Lip + ViT:比较ViT架构[28]相对于SOTA CNN和MARLIN的贡献,其中ViT权重在LRS2[22]数据集上从头训练。2) Wav2Lip + ViT + VideoMAE:比较基于ViT骨干网络预训练的原始VideoMAE的贡献
Table 6. Few shot adaptation on different facial tasks. Comparison of different methods for few shot adaptation.
表 6: 不同面部任务的少样本适应。不同方法在少样本适应上的比较。
| Anno.% | MOSEI [7] Emo. Acc.↑ | MOSEI [7] 7-Sen. Acc.↑ | MOSEI [7] 2-Sen. Acc.↑ | FF++ [62] DeepFake AUC↑ | CelebV-HQ [85] Appr. AUC↑ | CelebV-HQ [85] Act. AUC↑ |
|---|---|---|---|---|---|---|
| 100% | 80.60 | 34.63 | 73.70 | 0.9305 | 0.9373 | 0.9278 |
| 50% | 80.59 | 33.73 | 73.33 | 0.8681 | 0.9273 | 0.9270 |
| 10% | 79.89 | 33.56 | 72.26 | 0.7459 | 0.8996 | 0.9201 |
| | 78.61 | 30.09 | 71.89 | 0.6252 | 0.8423 | 0.9063 |
the YTF [78] dataset. 3) Wav2Lip + ViT + MARLIN: to compare the contribution of MARLIN pre-trained on YTF [78] with SOTA [57, 66, 74] and different design aspects. The experimental results are reported in Tab. 5 with LSE-D ↓, LSE-C ↑, and FID ↓ as evaluation metrics, following the standard protocol [38, 57, 66, 74]. The improvement in lip-sync scores (LSE-D ↓: 7.521 → 7.127; FID ↓: 4.887 → 3.452) indicates that MARLIN learns rich spatio-temporal patterns that are transferable and robust. It is also interesting to observe that MARLIN is adaptive to very fine-grained, face-specific features.
YTF [78] 数据集。3) Wav2Lip + ViT + MARLIN:比较 MARLIN 在 YTF [78] 上预训练的贡献与 SOTA [57, 66, 74] 及不同设计方面的差异。实验结果如表 5 所示,遵循标准协议 [38, 57, 66, 74],采用 LSE-D ↓、LSE-C ↑ 和 FID ↓ 作为评估指标。唇同步分数的提升 (LSE-D ↓: 7.521 → 7.127; FID ↓: 4.887 → 3.452) 表明 MARLIN 学习了可迁移且鲁棒的丰富时空模式。另一个有趣的现象是,MARLIN 对脸部细粒度特征也表现出适应性。
4.2.2. Few-Shot Adaptation.
4.2.2. 少样本 (Few-Shot) 适应
Few-shot adaptation has recently gained attention due to its capability to adapt in very low data regimes [9, 65, 84, 86]. Following the standard evaluation protocol [9, 65, 84, 86], we investigate the adaptation capability of MARLIN. Given any downstream dataset, we use a limited set of train labels to align the output manifold while keeping the test set fixed, via the LP (MOSEI, CelebV-HQ) and FT (FF++) strategies. From Tab. 6, only a slight drop in performance is observed across different tasks, which further demonstrates that MARLIN learns generic, transferable, and adaptive information.
少样本 (few-shot) 适应因其极低数据需求下的适应能力而受到关注 [9, 65, 84, 86]。遵循标准评估协议 [9, 65, 84, 86],我们同样研究了 MARLIN 的适应能力。给定任意下游数据集,我们使用有限的训练集标签对齐输出流形,同时通过 LP (MOSEI, CelebV-HQ) 和 FT (FF++) 策略保持测试集固定。从表 6 可见,不同任务中性能仅略有下降,这进一步证明 MARLIN 学习了通用、可迁移且自适应的信息。
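The few-shot protocol above can be emulated by keeping the test split fixed and retaining only a fraction of the training labels. A minimal sketch follows; per-class stratification and the seed are our assumptions, not specified in the text.

```python
# Minimal sketch of the few-shot protocol: subsample training labels,
# keep the test set untouched. Stratified per label so rare classes
# survive small fractions (our assumption, not from the paper).
import random
from collections import defaultdict

def few_shot_subset(samples, fraction, seed=42):
    """Return a label-stratified subset of (sample, label) pairs."""
    by_label = defaultdict(list)
    for sample, label in samples:
        by_label[label].append((sample, label))
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        keep = max(1, int(round(len(group) * fraction)))  # >= 1 per class
        subset.extend(group[:keep])
    return subset

# Toy stand-in for a downstream train set with binary labels.
train = [(f"vid_{i:04d}", i % 2) for i in range(1000)]
for frac in (1.0, 0.5, 0.1):
    print(frac, len(few_shot_subset(train, frac)))  # 1000, 500, 100 samples
```

The downstream head (LP or FT) is then trained only on the returned subset while evaluation always uses the full, fixed test split.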
4.2.3. Ablation Studies.
4.2.3. 消融研究
We have performed extensive ablation studies to show the effectiveness of each component.
我们进行了大量消融实验以验证各组件的有效性。
Table 7. Contribution of different modules, encoder architectures, and masking strategies to the overall MARLIN framework. Fasking: Facial-Guided Masking; AT: Adversarial Training.
表 7. 不同模块、编码器架构和掩码策略对MARLIN框架整体性能的贡献。Fasking: 面部引导掩码, AT: 对抗训练
| 方法 | MOSEI [7] 情感准确率 (%↑) | MOSEI [7] 7类情感准确率 (%↑) | MOSEI [7] 2类情感准确率 (%↑) | FF++ [62] 准确率 (%↑) | FF++ [62] AUC↑ |
|---|---|---|---|---|---|
| 模块 | | | | | |
| VideoMAE | 80.39 | 33.78 | 72.96 | 87.57 | 0.9082 |
| + Fasking | 80.55 | 34.58 | 73.54 | 87.29 | 0.9154 |
| + AT | 80.58 | 34.05 | 73.17 | 88.00 | 0.9096 |
| + Both (MARLIN) | 80.60 | 34.63 | 73.70 | 89.43 | 0.9305 |
| 编码器架构 | | | | | |
| ViT-S | 80.38 | 33.40 | 72.69 | 87.43 | 0.8863 |
| ViT-B | 80.60 | 34.63 | 73.70 | 89.43 | 0.9305 |
| ViT-L | 80.63 | 35.28 | 74.83 | 90.71 | 0.9377 |
| 掩码策略 | | | | | |
| Random | 80.40 | 34.10 | 72.96 | 87.29 | 0.8797 |
| Frame | 79.33 | 33.99 | 72.90 | 86.57 | 0.8835 |
| Tube | 80.58 | 34.05 | 73.17 | 88.00 | 0.9096 |
| Fasking | 80.60 | 34.63 | 73.70 | 89.43 | 0.9305 |

Figure 3. Impact of Masking Ratio. Comparison of different masking ratios for emotion and sentiment prediction on the CMU-MOSEI dataset [7]. Empirically, 90% masking works best for MARLIN.
图 3: 掩码比例的影响。在 CMU-MOSEI 数据集 [7] 中针对情感和情绪预测任务的不同掩码比例对比实验。实证研究表明,90% 的掩码比例对 MARLIN 模型效果最佳。
- Masking ratio. We use different masking ratios in the range [0.05, 0.95] and repeat the pre-training followed by LP on the CMU-MOSEI [83] dataset. From Fig. 3, we see that a ~90% masking ratio is optimal for MARLIN. With a lower masking ratio (i.e. ≤ 0.5), more information is available for the reconstruction task, which degrades the feature quality. Similarly, beyond ~90%, the reconstruction task becomes too challenging, leading to a performance drop. Given this empirical evidence, we set the masking ratio to ~90% throughout all of our experiments.
- Masking strategies. We further compare the proposed Fasking strategy with existing masking strategies [31, 71], i.e. Frame, Random, and Tube-Masking. The empirical results in Tab. 7 demonstrate that Fasking is better.
- Different modules. We progressively integrate each module and observe its impact on downstream performance on CMU-MOSEI [83] and FF++ [62] while keeping the other components fixed. From Tab. 7, we see that the addition of Fasking and Adversarial Training (AT) improves performance, reflecting the importance of each component.
- Encoder architectures. To investigate the impact of the backbone encoder architecture, we compare ViT-S, ViT-B, and ViT-L (See Tab. 7). We observe that a larger model enhances performance. For fair comparison, we use a ViT-B encoder.
- 掩码比例。我们在 $[0.05, 0.95]$ 范围内使用不同掩码比例,并在 CMU-MOSEI [83] 数据集上重复预训练及线性探测。从图 3 可见,约 90% 的掩码比例对 MARLIN 最为理想。较低掩码比例(即 ≤ 0.5)会使重建任务获得过多信息,导致特征质量下降;而超过约 90% 时,重建任务难度骤增,性能随之降低。根据实证结果,我们在所有实验中均采用约 90% 的掩码比例。
- 掩码策略。我们将提出的 Fasking 策略与现有掩码策略 [31,71](帧掩码、随机掩码、管状掩码)进行对比。表 7 的实证结果表明 Fasking 更具优势。
- 模块差异。我们在固定其他组件的情况下,逐步集成各模块并观察其对 CMU-MOSEI [83] 和 $\mathrm{FF}{+}{+}$ [62] 下游任务的影响。表 7 显示,引入 Fasking 和对抗训练 (AT) 能提升性能,印证了各组件的必要性。
- 编码器架构。为探究主干编码器架构的影响,我们比较了 ViT-S、ViT-B 和 ViT-L(见表 7)。结果表明模型规模扩大能提升性能。为公平对比,我们选用 ViT-B 编码器。
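To make the masking-strategy comparison in Tab. 7 concrete, the three baselines differ only in which token indices they hide over the 8×14×14 token grid. The sketch below is a simplification (Fasking itself additionally relies on facial-region priors, which are omitted here; function names are ours).

```python
# Simplified index generators for the baseline masking strategies of Tab. 7,
# over a (T, H, W) = (8, 14, 14) token grid. Fasking would additionally
# weight the choice by facial-region parse maps (not shown).
import random

T, H, W = 8, 14, 14

def random_masking(ratio=0.9, seed=0):
    """Mask tokens uniformly at random across space and time."""
    rng = random.Random(seed)
    idx = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    rng.shuffle(idx)
    return set(idx[:int(round(len(idx) * ratio))])

def frame_masking(ratio=0.9, seed=0):
    """Mask whole temporal slices: all tokens of the selected frames."""
    rng = random.Random(seed)
    frames = list(range(T))
    rng.shuffle(frames)
    chosen = set(frames[:int(round(T * ratio))])
    return {(t, h, w) for t in chosen for h in range(H) for w in range(W)}

def tube_masking(ratio=0.9, seed=0):
    """Mask the same spatial positions in every frame ('tubes')."""
    rng = random.Random(seed)
    cells = [(h, w) for h in range(H) for w in range(W)]
    rng.shuffle(cells)
    chosen = set(cells[:int(round(H * W * ratio))])
    return {(t, h, w) for t in range(T) for (h, w) in chosen}
```

Tube masking prevents the decoder from copying a patch from a neighboring frame, which is why it outperforms random and frame masking in Tab. 7; Fasking adds face-region guidance on top of this idea.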
4.3. Qualitative Aspects
4.3. 定性方面
To understand the effectiveness of the learned features, we further conducted the following qualitative analyses.
为了理解所学特征的有效性,我们进一步进行了以下定性分析。
- Facial Attributes. We visualize the important regions that MARLIN focuses on using Gradient-weighted Class Activation Mapping (Grad-CAM) [64]. In Fig. 4 (top), the heat-map results are based on LP on top of MARLIN's features on the CelebV-HQ [85] dataset (appearance task), and indicate that MARLIN focuses on facial attributes such as hair, spectacles, hat, etc.
- Lip Synchronization. In Fig. 4 (bottom), we present generation results for the lower part of faces, which is a challenging task. The top, middle, and bottom rows show the ground truth, vanilla Wav2Lip [57]'s output, and MARLIN's output, respectively, along with close-up views. Here, Wav2Lip's CNN encoder failed to locate the lip region (highlighted in red in the Wav2Lip row of Fig. 4), whereas MARLIN, despite being pre-trained with the Fasking strategy, is adaptive enough to generate a more accurate spatio-temporal pattern for LS.
- 面部属性。我们使用梯度加权类激活映射 (Grad-CAM) [64] 可视化 MARLIN 关注的重要区域。在图 4 顶部,热力图结果基于 CelebV-HQ [85] 数据集(外观任务)上 MARLIN 特征的 LP (Linear Probing),表明 MARLIN 专注于头发、眼镜、帽子等面部属性。
- 唇部同步。在图 4 底部,我们展示了面部下半部分的生成结果,这是一项具有挑战性的任务。顶部、中间和底部分别显示真实情况、原始 Wav2Lip [57] 的输出和 MARLIN 的输出以及特写视图。此处,Wav2Lip 的 CNN 编码器未能定位唇部区域(如图 4 Wav2Lip 行红色高亮所示),而 MARLIN 尽管在掩码策略上进行了预训练,但仍能自适应地生成更准确的结果。

Figure 4. Qualitative Analysis. Top: Qualitative results of MARLIN for the facial attribute recognition task. Bottom: Qualitative results of MARLIN for the facial lip synchronization task.

图 4: 定性分析。上: MARLIN 在面部属性识别任务中的定性结果。下: MARLIN 在唇语同步任务中的定性结果。
5. Conclusion
5. 结论
In this paper, we aim to learn a universal and generic facial encoder, MARLIN, which is adaptive, robust, and transferable across different facial analysis tasks. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions to capture local and global aspects, which in turn helps in encoding generic and transferable features.
Broader Impact. We believe that MARLIN can act as a good feature extractor for different downstream facial analysis tasks. Owing to its rich facial features, MARLIN would be easy to deploy on low-resource devices (e.g. mobile devices, Jetson Nano platforms) for real-world applications.
Limitations. As the model is trained on the YouTube Faces dataset [78], there could be potential bias in terms of the race and cultural background of the identities. Potential bias can also be introduced by the existing face detection library [75] we use. We will address these limitations in future versions.
本文旨在学习一种通用且多用途的面部编码器MARLIN,该编码器具有适应性、鲁棒性,并可迁移至不同面部分析任务。作为一项具有挑战性的辅助任务,MARLIN通过从密集掩蔽的面部区域重建时空细节,来捕捉局部和全局特征,从而帮助编码通用且可迁移的特征。
更广泛的影响。我们相信MARLIN可以作为不同下游面部分析任务的良好特征提取器。得益于丰富的面部特征,MARLIN可以轻松部署在低资源(如移动设备、Jetson Nano平台)设备中,用于实际应用。
局限性。由于模型在YouTube Faces数据集[78]上训练,可能在人物身份的种族和文化背景方面存在潜在偏差。此外,由于使用了现有的面部检测库[75],模型也可能引入潜在偏差。我们将在后续版本中消除这些局限性。
Acknowledgements.
致谢。
This material is based on research sponsored by DARPA under agreement number HR001122C0029. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. M. Hayat is supported by the ARC DECRA Fellowship DE200101100.
本材料基于DARPA根据协议号HR001122C0029资助的研究。美国政府有权为政府目的复制和分发重印本,无论其上是否有版权标记。M. Hayat由ARC DECRA Fellowship DE200101100支持。
