Generalizable Method for Face Anti-Spoofing with Semi-Supervised Learning
Abstract
Face anti-spoofing has drawn considerable attention due to the high security requirements of biometric authentication systems. Bringing face biometrics to commercial hardware now depends largely on developing reliable methods for detecting fake login sessions without specialized sensors. Current CNN-based methods perform well on the domains they were trained on, but often generalize poorly to previously unseen datasets. In this paper we describe a method that uses unsupervised pretraining to improve performance across multiple datasets without any adaptation, introduce the Entry Antispoofing Dataset for supervised fine-tuning, and propose a multi-class auxiliary classification layer that augments the binary task of detecting spoofing attempts with explicit, interpretable signals. We demonstrate the effectiveness of our model by achieving state-of-the-art results in cross-dataset testing on the MSU-MFSD, Replay-Attack, and OULU-NPU datasets.
1. Introduction
Biometric authentication systems based on face recognition are gaining prominence both in everyday life and in high-value transactions with strong security requirements. However, despite recent advances in computer vision, these systems are still confined to specialized hardware that uses depth or NIR sensors to prevent Presentation Attacks. Among Presentation Attacks, print and replay attacks are the most common, and detecting them has become an important problem in biometric authentication. From a practical perspective, a robust algorithm that detects such attacks on commercial webcams using only signals from the video stream would enable wide adoption of face-based authentication and verification.
Recently published results show that deep learning-based models can achieve good results on the datasets they were trained on [17] [14] [8], but generalization to other datasets is much harder to achieve. When a model that achieved a state-of-the-art result on benchmark A is tested on benchmark B (a protocol we refer to as "cross-test"), the discrepancy in scores is significant, even with the latest advances in domain adaptation that were aimed at this problem. In the real world, such a discrepancy is damaging to application security.
The key motivation behind this work is that strong generalization in cross-testing on multiple strong benchmarks would reliably reflect the effectiveness of the algorithm in the wild. We propose to achieve this generalization by changing the approach to collecting training data. Moreover, motivated by previous work in self-supervised learning [3] [7], we experiment with pretraining the network on larger datasets to improve the results further.
To validate the effectiveness of the developed method, we report evaluation results in two experimental settings: the intra-dataset test is evaluated on a test portion of our internal dataset, and the cross-test is evaluated on well-established benchmarks: MSU-MFSD [16], Replay-Attack [4], and OULU-NPU [2].
1.1. Contribution
In this paper we present the following results:
2. Related work
Recent publications on task-agnostic self-supervised and semi-supervised pretraining show that using large unlabeled datasets for unsupervised pretraining, followed by supervised fine-tuning, can outperform standard supervised learning methods [3] [7]. This is especially promising for face anti-spoofing, where the lack of comprehensive labeled datasets suitable for building models viable for security applications is particularly severe. There have been efforts to alleviate this problem by using rich semantic annotations [18]. Alternative paradigms that circumvent the data shortage through domain adaptation and synthetic data generation were demonstrated by [15] [8], which cite domain shift as one of the most critical problems for anti-spoofing.
The main problem of existing methods remains domain shift and the lack of generalization between different datasets, as shown in cross-dataset tests, even in works that demonstrate state-of-the-art results on intra-dataset tests [15] [8] [14] [17].
3. Method
3.1. Entry Antispoofing Dataset
Existing datasets for face anti-spoofing cover too narrow a domain compared to the diversity of camera, lighting, distance, and attack conditions seen in the real world. While low error scores on open-source benchmarks under a cross-dataset protocol indicate good performance of a network under some subset of conditions, they turned out to be an unreliable predictor of stability and accuracy in a real application. Specifically, regardless of the Attack Presentation Classification Error Rate (APCER) and Bona Fide Presentation Classification Error Rate (BPCER) shown by a candidate model trained on open-source datasets, there were always multiple sets of conditions under which the model's predictions became inconsistent. We addressed this problem by building an internal dataset consisting of a training portion and a test subset comprehensive enough to cover the subdomains missing from other benchmarks.
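As a reference for the two error rates named above, here is a minimal sketch of how APCER and BPCER can be computed from per-session spoofing scores. It assumes a single pooled attack class and a fixed decision threshold of 0.5; the function name `apcer_bpcer` and the sample scores are illustrative only and do not come from the paper's evaluation code.

```python
from typing import Sequence, Tuple


def apcer_bpcer(
    scores: Sequence[float],
    is_attack: Sequence[bool],
    threshold: float = 0.5,  # assumed operating point, not from the paper
) -> Tuple[float, float]:
    """Return (APCER, BPCER); a score >= threshold is classified as an attack."""
    attacks = [s for s, a in zip(scores, is_attack) if a]
    bona_fide = [s for s, a in zip(scores, is_attack) if not a]
    if not attacks or not bona_fide:
        raise ValueError("need at least one attack and one bona fide presentation")
    # APCER: fraction of attack presentations wrongly accepted as bona fide.
    apcer = sum(s < threshold for s in attacks) / len(attacks)
    # BPCER: fraction of bona fide presentations wrongly rejected as attacks.
    bpcer = sum(s >= threshold for s in bona_fide) / len(bona_fide)
    return apcer, bpcer


# Illustrative scores: three attacks followed by three bona fide sessions.
example_scores = [0.9, 0.2, 0.7, 0.1, 0.6, 0.3]
example_labels = [True, True, True, False, False, False]
apcer, bpcer = apcer_bpcer(example_scores, example_labels)
```

Standardized evaluations usually report APCER per attack type and take the worst case; the pooled version here is a simplification.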
The Entry Antispoofing Dataset consists of 83000 live video recordings collected via a custom-built UI that simulates the process of logging into a web-based biometric authentication system like Entry. The recordings were collected and labeled via a crowdsourcing data labeling service, from 45000 participants from more than 20 countries on five continents. Each recording was made on a mobile or laptop webcam (roughly 30% of recordings come from laptop cameras and 70% from mobile devices), with the subject's face visible from multiple angles. Subjects' genders, ages, and ethnicities are not correlated with whether their recordings are spoofing or bona fide sessions, but the overall distribution was not restricted, in order to stay as close as possible to the real-life demographics of potential users. The process of collecting the dataset was iterative: each version of the dataset was produced after identifying a "blind spot subdomain", a specific set of camera, lighting, distance, or other conditions that led to unstable performance of the model. These blind spots were identified by crowdsourcing attacks on successive iterations of earlier anti-spoofing models provided by our research team.
Dataset | Number of videos | Number of subjects |
---|---|---|
Entry | 83000 | 45000 |
MSU-MFSD [16] | 280 | 35 |
Replay-Attack [4] | 1300 | 50 |
OULU-NPU [2] | 4950 | 55 |
3.2. Architecture and algorithm
Figure 1. Network architecture. We apply the network to each frame in a video stream sequentially, then aggregate the resulting predictions to get the final spoofing score for the entire session.
Table 1. Comparison of existing datasets for face anti-spoofing.
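The per-frame inference described in the Figure 1 caption can be sketched as follows. The text does not state which aggregation function is used, so averaging the per-frame scores is an assumption here, and `session_spoof_score` and `toy_frame_model` are hypothetical names used only for illustration.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence

Frame = Sequence[float]  # placeholder for a decoded video frame


def session_spoof_score(
    frames: Iterable[Frame],
    frame_model: Callable[[Frame], float],
) -> float:
    """Score frames one by one, then aggregate into a single session score.

    Aggregation by arithmetic mean is an assumption; the source only says
    that per-frame predictions are aggregated.
    """
    scores = [frame_model(frame) for frame in frames]
    if not scores:
        raise ValueError("session contains no frames")
    return mean(scores)


def toy_frame_model(frame: Frame) -> float:
    """Stand-in for the per-frame network: returns the mean pixel value."""
    return sum(frame) / len(frame)


session = [[0.2, 0.4], [0.6, 0.8], [0.5, 0.5]]
score = session_spoof_score(session, toy_frame_model)
```

Other aggregations (median, maximum, or a trimmed mean) fit the same interface by swapping the final reduction.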
The training process is split into two steps in the following way:
3.2.1 Task-aware fine-tuning
Our training process follows the general approach described by Chen et al. in [3]. To recap, their training pipeline starts with unsupervised pretraining of a large network on a large amount of unlabeled data, followed by supervised fine-tuning on a (typically smaller) task-specific labeled dataset. In our experiments, an open-source pretrained network provided by Meta [7] takes the place of the pretraining step. We do not fine-tune the layers imported from RegNet, following the established practice [3] [7] based on the observation that a large self-supervised pretrained network improves generalization.
Our network structure is modified from RegNet-32g [7]: for the final layer we use a multi-headed classifier composed of 8 independent binary classifiers with the following semantics:
Training is done with the pretrained layers from [7] frozen. We optimize the Reduced Focal Loss function [12] with AdamW [10] for 3 epochs, with the learning rate set to $10^{-6}$. Frames sampled from the source videos are heavily augmented to further reduce domain shift, using the following set of augmentations:
During training, we optimize a multi-class classification loss function over heads 1-8, but during inference only the probability from Head 8 is used. We observe that introducing an explicit classification of the visible signals in a spoofing attempt improves convergence speed and stabilizes training, while the actual spoofing detection does not require fine-grained classification.
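The interplay between the eight heads at training and inference time can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the heads are modeled as linear layers on a feature vector from the frozen backbone, the per-head losses are simply summed, and `reduced_focal_loss` uses a piecewise form (plain cross-entropy below a threshold, focal down-weighting above it) with `gamma=2.0` and `threshold=0.5` chosen for illustration. All names and default values here are hypothetical.

```python
import math
from typing import List, Sequence

NUM_HEADS = 8
SPOOF_HEAD = NUM_HEADS - 1  # zero-based index of Head 8, the only head read at inference


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def head_probabilities(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Apply independent linear binary heads to a frozen-backbone feature vector."""
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


def reduced_focal_loss(p: float, target: int, gamma: float = 2.0, threshold: float = 0.5) -> float:
    """Binary reduced focal loss for one head (assumed piecewise form)."""
    p_t = p if target == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # Hard examples (p_t < threshold) keep full weight; easy ones are down-weighted.
    weight = 1.0 if p_t < threshold else ((1.0 - p_t) / threshold) ** gamma
    return -weight * math.log(p_t)


def training_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Sum the per-head losses over all eight heads (summation is an assumption)."""
    return sum(reduced_focal_loss(p, t) for p, t in zip(probs, targets))


def spoof_score(probs: Sequence[float]) -> float:
    """At inference, only the probability from Head 8 is used."""
    return probs[SPOOF_HEAD]


features = [1.0, -1.0]
weights = [[0.0, 0.0]] * 7 + [[2.0, 0.0]]
biases = [0.0] * NUM_HEADS
probs = head_probabilities(features, weights, biases)
loss = training_loss(probs, [0, 0, 0, 0, 0, 0, 0, 1])
score = spoof_score(probs)
```

Only the head parameters (`weights`, `biases`) would receive gradient updates in this setup, since the backbone producing `features` stays frozen.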
3.2.2 Network distillation
After the network is trained, we perform the distillation procedure with a smaller architecture to make real-time inference feasible. Since we have a large number of labeled videos in our training dataset $\mathcal{D}$, we leverage the weighted distillation loss from [3]:
$$
\begin{aligned}
\mathcal{L} = {} & -(1-\alpha) \sum_{(x_i, y_i) \in \mathcal{D}^{L}} \Big[ \log P^{S}(y_i \mid x_i) \Big] \\
& - \alpha \sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^{T}(y \mid x_i; \tau) \log P^{S}(y \mid x_i; \tau) \Big]
\end{aligned}
$$
where $\tau$ is a scalar temperature parameter, $\alpha$ is a balancing parameter, $P^{T}(y|x_{i})$ is the output of the teacher network, which is frozen after training, and $P^{S}(y|x_{i})$ is the output of the student network.
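The weighted distillation loss can be evaluated directly from teacher and student logits. The sketch below follows the two terms of the formula: a supervised log-likelihood term over labeled samples and a temperature-scaled teacher-student cross-entropy term over all samples. The function names, the batch layout as `(teacher_logits, student_logits, label_or_None)` tuples, and the use of a softmax over logits are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], Sequence[float], Optional[int]]


def softmax(logits: Sequence[float], tau: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / tau for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(batch: Sequence[Sample], alpha: float, tau: float) -> float:
    """Weighted distillation loss over (teacher_logits, student_logits, label) samples.

    Samples with label None contribute only to the distillation term.
    """
    supervised = 0.0
    distill = 0.0
    for teacher_logits, student_logits, label in batch:
        if label is not None:
            # First term: log P^S(y_i | x_i) on labeled data, no temperature.
            supervised += math.log(softmax(student_logits)[label])
        # Second term: sum_y P^T(y | x_i; tau) * log P^S(y | x_i; tau).
        p_teacher = softmax(teacher_logits, tau)
        p_student = softmax(student_logits, tau)
        distill += sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return -(1.0 - alpha) * supervised - alpha * distill


matched = distillation_loss([([2.0, 0.0], [2.0, 0.0], None)], alpha=1.0, tau=1.0)
mismatched = distillation_loss([([2.0, 0.0], [0.0, 2.0], None)], alpha=1.0, tau=1.0)
```

With `alpha=1.0` the loss reduces to pure teacher-student cross-entropy, so a student that reproduces the teacher's logits (`matched`) scores lower than one that inverts them (`mismatched`).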
The architecture for the student network is EfficientNet-B3 [13].
4. Experiments
4.1. Benchmarks
We evaluate the effectiveness of our network in two settings: an evaluation portion of the Entry Antispoofing Dataset for the intra-dataset test, and MSU-MFSD, Replay-Attack, and OULU-NPU for cross-dataset tests. When it comes to the baselines, we compare the results of our models that