[Paper Translation] Dual Variational Generation for Low-Shot Heterogeneous Face Recognition


Original paper: https://arxiv.org/pdf/1903.10203v3


Dual Variational Generation for Low-Shot Heterogeneous Face Recognition


Abstract


Heterogeneous Face Recognition (HFR) is a challenging issue because of the large domain discrepancy and a lack of heterogeneous data. This paper considers HFR as a dual generation problem, and proposes a novel Dual Variational Generation (DVG) framework. It generates large-scale new paired heterogeneous images with the same identity from noise, in order to reduce the domain gap of HFR. Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images. Then, in order to ensure the identity consistency of the generated paired heterogeneous images, we impose a distribution alignment in the latent space and a pairwise identity preserving constraint in the image space. Moreover, the HFR network reduces the domain discrepancy by constraining the pairwise feature distances between the generated paired heterogeneous images. Extensive experiments on four HFR databases show that our method can significantly improve state-of-the-art results. The related code is available at https://github.com/BradyFU/DVG.


1 Introduction


With the development of deep learning, face recognition has made significant progress [34, 2] in recent years. However, in many real-world applications, such as video surveillance, facial authentication on mobile devices and computer forensics, it is still a great challenge to match heterogeneous face images in different modalities, including sketch images [37], near infrared images [24] and polarimetric thermal images [36]. Heterogeneous face recognition (HFR) has therefore attracted much attention in the face recognition community. Due to the large domain gap, one challenge is that a face recognition model trained on VIS data often degrades significantly for HFR. Therefore, many cross-domain feature matching methods [10] have been introduced to reduce the large domain gap between heterogeneous face images. However, since it is expensive and time-consuming to collect a large number of heterogeneous face images, there is no public large-scale heterogeneous face database. With such limited training data, CNNs trained for HFR tend to overfit.


Recently, the great progress of high-quality face synthesis [38, 5, 33, 39] has made “recognition via generation” possible. TP-GAN [16] and CAPG-GAN [13] introduce face synthesis to improve the quantitative performance of large pose face recognition. For HFR, [32] proposes a two-path model to synthesize VIS images from NIR images. [36] utilizes a GAN based multi-stream feature fusion technique to generate VIS images from polarimetric thermal faces. However, all these methods are based on the conditional image-to-image translation framework, leading to two potential challenges: 1) Diversity: Given one image, a generator only synthesizes one new image of the target domain [32]. It means such conditional image-to-image translation methods can only generate a limited number of images. In addition, as shown in the left part of Fig. 1, the two images before and after translation share the same attributes (e.g., the pose and the expression) except for the spectral information, which means it is difficult for such conditional image-to-image translation methods to promote intra-class diversity. These problems are especially prominent in low-shot heterogeneous face recognition, i.e., learning from only a few heterogeneous samples. 2) Consistency: When generating large-scale samples, it is challenging to guarantee that the synthesized face images belong to the same identity as the input images. Although an identity preserving loss [13] constrains the distances between features of the input and synthesized images, it does not constrain the intra-class and inter-class distances of the embedding space.



Figure 1: The diversity comparisons between the conditional image-to-image translation [32] (left part, the top row is the input NIR image and the bottom row is the corresponding translated VIS image) and our unconditional DVG (right part, all paired heterogeneous images are generated from noise). For the conditional image-to-image translation methods, given one NIR image, a generator only synthesizes one new VIS image with the same attributes (e.g., the pose and the expression) except for the spectral information. Differently, DVG generates massive new paired images with rich intra-class diversity from noise.


To tackle the above challenges, we propose a novel unconditional Dual Variational Generation (DVG) framework (shown in Fig. 3) that generates large-scale paired heterogeneous images with the same identity from noise. Unconditional generative models can generate new images (one image at a time) from noise [21], but since these images do not have identity labels, it is difficult to use them to train recognition networks. DVG makes use of this image-generating property of unconditional generative models [21], and adopts a dual generation manner to obtain paired heterogeneous images with the same identity each time. This enables DVG to generate large-scale images and makes the generated images usable for optimizing recognition networks. Meanwhile, DVG also absorbs the various intra-class changes of the training database, so that the generated paired images have abundant intra-class diversity. For instance, as presented in the right part of Fig. 1, the first four image pairs have different poses, and the fifth pair has different expressions. Furthermore, DVG only pays attention to the identity consistency of the paired heterogeneous images rather than the specific identity to which they belong, which avoids the consistency problem of previous methods. Specifically, we introduce a dual variational autoencoder to learn a joint distribution of paired heterogeneous images. In order to constrain the generated paired images to belong to the same identity, we impose both a distribution alignment in the latent space and a pairwise identity preserving constraint in the image space. New paired images are generated by sampling and copying a noise vector from a standard Gaussian distribution, as displayed in the left part of Fig. 3. These generated paired images are used to optimize the HFR network via a pairwise distance constraint, aiming at reducing the domain discrepancy.


In summary, the main contributions are as follows:


• We provide a new insight into the problems of HFR: we consider HFR as a dual generation problem, and propose a novel dual variational generation framework. This framework generates new paired heterogeneous images with abundant intra-class diversity to reduce the domain gap of HFR.
• In order to guarantee that the generated paired images belong to the same identity, we constrain the consistency of the paired images in both the latent space and the image space. This allows new images sampled from noise to be used for training recognition networks.



Figure 2: The dual generation results via noise ($256\times256$ resolution). For each pair, the left is the NIR image and the right is the paired VIS image.


2 Background and Related Work


2.1 Heterogeneous Face Recognition


Many researchers have paid attention to Heterogeneous Face Recognition (HFR). For feature-level learning, [22] employs HOG features with sparse representation for HFR. [7] utilizes LBP histograms with Linear Discriminant Analysis to obtain domain-invariant features. [10] proposes Invariant Deep Representation (IDR) to disentangle representations into two orthogonal subspaces for NIR-VIS HFR. Further, [11] extends IDR by introducing the Wasserstein distance to obtain domain-invariant features for HFR. For image-level learning, the common idea is to transform heterogeneous face images from one modality into another via image synthesis. [19] utilizes joint dictionary learning to reconstruct face images for boosting the performance of face matching. [23] proposes a cross-spectral hallucination and low-rank embedding to synthesize a heterogeneous image in a patch-wise manner.


2.2 Generative Models


Variational autoencoders (VAEs) [21] and generative adversarial networks (GANs) [6] are the most prominent generative models. VAEs consist of an encoder network $q_{\phi}(z|x)$ and a decoder network $p_{\theta}(x|z)$. $q_{\phi}(z|x)$ maps input images $x$ to latent variables $z$ that match a prior $p(z)$, and $p_{\theta}(x|z)$ samples images $x$ from the latent variables $z$. The evidence lower bound objective (ELBO) of VAEs is:


$$
\log p_{\theta}(x)\geq\mathbb{E}_{q_{\phi}(z|x)}\log p_{\theta}(x|z)-D_{\mathrm{KL}}\big(q_{\phi}(z|x)\,||\,p(z)\big).
$$

The two terms of the ELBO are a reconstruction error and a Kullback-Leibler divergence, respectively.

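To make the two ELBO terms concrete, here is a minimal NumPy sketch that evaluates them for a diagonal Gaussian posterior and a unit-variance Gaussian decoder (a common modeling choice; the function names are ours, not from the paper):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed form of D_KL(q(z|x) || N(0, I)) for a diagonal Gaussian
    # posterior with mean mu and log-variance log_var.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def recon_log_likelihood(x, x_hat):
    # log p(x|z) under a unit-variance Gaussian decoder, up to an
    # additive constant: a negative squared reconstruction error.
    return -0.5 * np.sum((x - x_hat) ** 2)

def elbo(x, x_hat, mu, log_var):
    # ELBO = E[log p(x|z)] - D_KL(q(z|x) || p(z)), evaluated here for a
    # single reconstruction sample x_hat.
    return recon_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)
```

With a perfect reconstruction and a posterior equal to the prior ($u=0$, $\log\sigma^2=0$), both terms vanish and the ELBO is zero.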

Differently, GANs adopt a generator $G$ and a discriminator $D$ to play a min-max game. $G$ generates images from a prior $p(z)$ to confuse $D$, and $D$ is trained to distinguish between generated data and real data. This adversarial rule takes the form:


$$
\underset{G}{\mathrm{min}}\,\underset{D}{\mathrm{max}}\,\mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right]+\mathbb{E}_{z\sim p_{z}(z)}\left[\log(1-D(G(z)))\right].
$$

They have achieved remarkable success in various applications, such as unconditional image generation that generates images from noise [20, 15], and conditional image generation that synthesizes images according to the given condition [32, 16]. According to [15], VAEs have nice manifold representations, while GANs are better at generating sharper images.

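As a concrete reading of the min-max objective above, the small NumPy sketch below evaluates the two players' objective values from discriminator outputs (probabilities in $(0,1)$); the function names are illustrative only:

```python
import numpy as np

def discriminator_value(d_real, d_fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))].
    return np.mean(np.log(d_real)) + np.mean(np.log1p(-d_fake))

def generator_value(d_fake):
    # G minimizes E[log(1 - D(G(z)))]; many implementations instead
    # maximize E[log D(G(z))], the non-saturating variant.
    return np.mean(np.log1p(-d_fake))
```

At the equilibrium point where $D$ outputs 0.5 everywhere, the discriminator value is $2\log 0.5 \approx -1.386$.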

Another work addressing a problem similar to ours is CoGAN [25], which uses a weight-sharing manner to generate paired images in two different modalities. However, CoGAN explicitly constrains the identity consistency of paired images in neither the latent space nor the image space. It is therefore challenging for the weight-sharing manner of CoGAN to generate paired images with the same identity, as shown in Fig. 4.



Figure 3: The purpose (left part) and training model (right part) of our unconditional DVG framework. DVG generates large-scale new paired heterogeneous images with the same identity from standard Gaussian noise, aiming at reducing the domain discrepancy for HFR. In order to achieve this purpose, we elaborately design a dual variational autoencoder. Given a pair of heterogeneous images of the same identity, the dual variational autoencoder learns a joint distribution in the latent space. In order to guarantee the identity consistency of the generated paired images, we impose a distribution alignment in the latent space and a pairwise identity preserving constraint in the image space.


3 Proposed Method


In this section, we introduce our method in detail, including the dual variational generation and the heterogeneous face recognition. Note that we specifically discuss NIR-VIS images for clearer presentation; other heterogeneous modalities are also applicable.


3.1 Dual Variational Generation


As shown in the right part of Fig. 3, DVG consists of a feature extractor $F_{ip}$ and a dual variational autoencoder: two encoder networks and a decoder network, all of which play the same roles as in VAEs [21]. Specifically, $F_{ip}$ extracts the semantic information of the generated images to preserve the identity information. The encoder network $E_{N}$ maps NIR images $x_{N}$ to a latent space $z_{N}=q_{\phi_{N}}(z_{N}|x_{N})$ via the reparameterization trick: $z_{N}=u_{N}+\sigma_{N}\odot\epsilon$, where $u_{N}$ and $\sigma_{N}$ denote the mean and standard deviation of NIR images, respectively. In addition, $\epsilon$ is sampled from a multivariate standard Gaussian and $\odot$ denotes the Hadamard product. The encoder network $E_{V}$ works in the same manner as $E_{N}$ for VIS images $x_{V}$: $z_{V}=q_{\phi_{V}}(z_{V}|x_{V})$. After obtaining the two independent distributions, we concatenate $z_{N}$ and $z_{V}$ to get the joint distribution $z_{I}$.

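A minimal NumPy sketch of the reparameterization step and the joint latent code, using hypothetical 4-dimensional latents (all shapes and names are ours, for illustration only):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # z = mu + sigma ⊙ eps with eps ~ N(0, I): the sample depends on
    # (mu, sigma) deterministically, so gradients can flow through them.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for one NIR/VIS image pair.
mu_n, sigma_n = np.zeros(4), np.ones(4)
mu_v, sigma_v = np.zeros(4), np.ones(4)

z_n = reparameterize(mu_n, sigma_n, rng)
z_v = reparameterize(mu_v, sigma_v, rng)
z_i = np.concatenate([z_n, z_v])  # joint latent z_I fed to the shared decoder
```

Note that with $\sigma=0$ the sample collapses to the mean, which is why the standard deviation alone carries the stochasticity.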

Distribution Learning We utilize VAEs to learn the joint distribution of the paired NIR-VIS images. Given a pair of NIR-VIS images $\{x_{N},x_{V}\}$, we constrain the posterior distributions $q_{\phi_{N}}(z_{N}|x_{N})$ and $q_{\phi_{V}}(z_{V}|x_{V})$ by the Kullback-Leibler divergence:


$$
\mathcal{L}_{\mathrm{kl}}=D_{\mathrm{KL}}\big(q_{\phi_{N}}(z_{N}|x_{N})\,||\,p(z_{N})\big)+D_{\mathrm{KL}}\big(q_{\phi_{V}}(z_{V}|x_{V})\,||\,p(z_{V})\big),
$$

where the prior distributions $p(z_{N})$ and $p(z_{V})$ are both the multi-variate standard Gaussian distributions. Like the original VAEs, we require the decoder network $p_{\theta}(x_{N},x_{V}|z_{I})$ to be able to reconstruct the input images $x_{N}$ and $x_{V}$ from the learned distribution:


$$
\mathcal{L}_{\mathrm{rec}}=-\mathbb{E}_{q_{\phi_{N}}(z_{N}|x_{N})\cup q_{\phi_{V}}(z_{V}|x_{V})}\log p_{\theta}(x_{N},x_{V}|z_{I}).
$$

Distribution Alignment We expect a pair of NIR-VIS images $\{x_{N},x_{V}\}$ to be projected into a common latent space by the encoders $E_{N}$ and $E_{V}$, i.e., the NIR distribution $p(z_{N}^{(i)})$ is the same as the VIS distribution $p(z_{V}^{(i)})$, where $i$ denotes the identity. This means we maintain the identity consistency of the generated paired images in the latent space. Explicitly, we align the NIR and VIS distributions by minimizing the Wasserstein distance between the two distributions. Given two Gaussian distributions $p(z_{N}^{(i)})=N(u_{N}^{(i)},{\sigma_{N}^{(i)}}^{2})$ and $p(z_{V}^{(i)})=N(u_{V}^{(i)},{\sigma_{V}^{(i)}}^{2})$, the 2-Wasserstein distance between $p(z_{N}^{(i)})$ and $p(z_{V}^{(i)})$ is simplified [10] as:


$$
\mathcal{L}_{\mathrm{dist}}=\frac{1}{2}\left[||u_{N}^{(i)}-u_{V}^{(i)}||_{2}^{2}+||\sigma_{N}^{(i)}-\sigma_{V}^{(i)}||_{2}^{2}\right].
$$
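Since this simplified 2-Wasserstein distance involves only the means and standard deviations, it is cheap to compute; a NumPy sketch (the function name is ours):

```python
import numpy as np

def w2_alignment_loss(mu_n, sigma_n, mu_v, sigma_v):
    # L_dist = 0.5 * (||mu_N - mu_V||^2 + ||sigma_N - sigma_V||^2):
    # the simplified 2-Wasserstein distance between two diagonal Gaussians.
    return 0.5 * (np.sum((mu_n - mu_v) ** 2) + np.sum((sigma_n - sigma_v) ** 2))
```

The loss is zero exactly when the two posteriors coincide, which is the alignment the method aims for.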

Pairwise Identity Preserving In previous image-to-image translation works [16, 13], identity preserving is usually introduced to maintain identity information. The traditional approach uses a pre-trained feature extractor to enforce the features of the generated images to be close to those of the target images. However, owing to the lack of intra-class and inter-class constraints, it is challenging to guarantee that the synthesized images belong to the specific categories of the target images. Considering that DVG generates a pair of heterogeneous images at a time, we only need to consider the identity consistency of the paired images.


Specifically, we adopt LightCNN [34] as the feature extractor $F_{ip}$ to constrain the feature distance between the reconstructed paired images:


$$
\mathcal{L}_{\mathrm{ip-pair}}=||F_{ip}(\hat{x}_{N})-F_{ip}(\hat{x}_{V})||_{2}^{2},
$$

where $F_{ip}(\cdot)$ denotes the normalized output of the last fully connected layer of $F_{ip}$. In addition, we also use $F_{ip}$ to make the features of the reconstructed images and the original input images close enough, as in previous works [16, 13]:


$$
\mathcal{L}_{\mathrm{ip-rec}}=||F_{ip}(\hat{x}_{N})-F_{ip}(x_{N})||_{2}^{2}+||F_{ip}(\hat{x}_{V})-F_{ip}(x_{V})||_{2}^{2},
$$

where $\hat{x}_{N}$ and $\hat{x}_{V}$ denote the reconstructions of the input paired images $x_{N}$ and $x_{V}$, respectively.

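Both identity-preserving terms are squared distances between feature vectors; a NumPy sketch with L2-normalized vectors standing in for the LightCNN features (all names are ours, for illustration):

```python
import numpy as np

def l2_normalize(f):
    return f / np.linalg.norm(f)

def ip_pair_loss(f_n_hat, f_v_hat):
    # L_ip-pair: the reconstructed NIR and VIS images should share one
    # identity, i.e. have close normalized features.
    return np.sum((l2_normalize(f_n_hat) - l2_normalize(f_v_hat)) ** 2)

def ip_rec_loss(f_n_hat, f_n, f_v_hat, f_v):
    # L_ip-rec: each reconstruction should also stay close to its own
    # input in feature space.
    return (np.sum((l2_normalize(f_n_hat) - l2_normalize(f_n)) ** 2)
            + np.sum((l2_normalize(f_v_hat) - l2_normalize(f_v)) ** 2))
```

Because the features are normalized, only the direction of the embedding matters, matching the usual cosine-style face-feature comparison.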

Diversity Constraint In order to further increase the diversity of the generated images, we also introduce a diversity loss [27]. In the sampling stage, when two sampled noise vectors $z_{I_{1}}$ and $z_{I_{2}}$ are close, the generated images $x_{I_{1}}$ and $x_{I_{2}}$ will be similar. We maximize the following loss to encourage the decoder $D_{I}$ to generate more diverse images:


$$
\mathcal{L}_{\mathrm{div}}=\operatorname*{max}_{D_{I}}\frac{|F_{ip}(x_{I_{1}})-F_{ip}(x_{I_{2}})|}{|z_{I_{1}}-z_{I_{2}}|}.
$$
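The diversity term is simply a ratio of feature change to latent change; a NumPy sketch (with a small epsilon guard that we add to avoid division by zero):

```python
import numpy as np

def diversity_ratio(f1, f2, z1, z2, eps=1e-8):
    # ||F_ip(x_1) - F_ip(x_2)|| / ||z_1 - z_2||: maximizing this pushes
    # nearby latent codes to decode into visibly different images.
    return np.linalg.norm(f1 - f2) / (np.linalg.norm(z1 - z2) + eps)
```

A large ratio means the decoder changes its output quickly as the latent code moves, i.e. richer sample diversity.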

Overall Loss Moreover, in order to increase the sharpness of the generated images, we also adopt an adversarial loss $\mathcal{L}_{\mathrm{adv}}$ as in [31]. Hence, the overall loss used to optimize the dual variational autoencoder can be formulated as


$$
\mathcal{L}_{\mathrm{gen}}=\mathcal{L}_{\mathrm{rec}}+\mathcal{L}_{\mathrm{kl}}+\mathcal{L}_{\mathrm{adv}}+\lambda_{1}\mathcal{L}_{\mathrm{dist}}+\lambda_{2}\mathcal{L}_{\mathrm{ip-pair}}+\lambda_{3}\mathcal{L}_{\mathrm{ip-rec}}+\lambda_{4}\mathcal{L}_{\mathrm{div}},
$$

where $\lambda_{1},\lambda_{2},\lambda_{3}$ and $\lambda_{4}$ are the trade-off parameters.


3.2 Heterogeneous Face Recognition


For the heterogeneous face recognition, our training data contains the original limited labeled data $x_{i}\,(i\in\{N,V\})$ and the large-scale generated unlabeled paired NIR-VIS data $\tilde{x}_{i}\,(i\in\{N,V\})$. Here, we define a heterogeneous face recognition network $F$ to extract features $f_{i}=F(x_{i};\Theta)$, where $i\in\{N,V\}$ and $\Theta$ denotes the parameters of $F$. For the original labeled NIR and VIS images, we utilize a softmax loss:


$$
\mathcal{L}_{\mathrm{cls}}=\sum_{i\in\{N,V\}}\operatorname{softmax}(F(x_{i};\Theta),y),
$$

where $y$ is the label of identity.


For the generated paired heterogeneous images, since they are generated from noise, there are no specific classes for the paired images. But as mentioned in section 3.1, DVG ensures that the generated paired images belong to the same identity. Therefore, a pairwise distance loss between the paired heterogeneous samples is formulated as follows:


$$
\mathcal{L}_{\mathrm{pair}}=||F(\tilde{x}_{N};\Theta)-F(\tilde{x}_{V};\Theta)||_{2}^{2}.
$$

In this way, we can efficiently minimize the domain discrepancy by generating large-scale unlabeled paired heterogeneous images. As stated above, the final loss for optimizing the heterogeneous face recognition network can be written as


$$
\mathcal{L}_{\mathrm{hfr}}=\mathcal{L}_{\mathrm{cls}}+\alpha_{1}\mathcal{L}_{\mathrm{pair}},
$$

where $\alpha_{1}$ is the trade-off parameter.

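Putting the two recognition losses together, a NumPy sketch of one training step's loss value, assuming precomputed logits for labeled images and features for a generated pair (all names and toy shapes are ours):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    # L_cls for one labeled sample: negative log-probability of the
    # ground-truth identity under a softmax over the logits.
    shifted = logits - logits.max()  # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def pair_loss(f_n, f_v):
    # L_pair: squared feature distance for one generated NIR/VIS pair.
    return np.sum((f_n - f_v) ** 2)

def hfr_loss(logits_list, labels, f_n_gen, f_v_gen, alpha1=0.001):
    # L_hfr = L_cls + alpha_1 * L_pair, with alpha_1 = 0.001 in the paper.
    cls = sum(softmax_cross_entropy(l, y) for l, y in zip(logits_list, labels))
    return cls + alpha1 * pair_loss(f_n_gen, f_v_gen)
```

When the generated pair's features coincide, only the classification term remains, which is the desired behavior: the pairwise term purely penalizes cross-modality feature gaps.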

(a)
Method   MD     FID    Rank-1
CoGAN    0.61   10.6   95.2
VAE      0.54   8.2    94.6
DVG      0.24   7.1    99.2

(b)
Method                                Rank-1
w/o $\mathcal{L}_{\mathrm{dist}}$     94.3
w/o $\mathcal{L}_{\mathrm{ip-pair}}$  96.1
w/o $\mathcal{L}_{\mathrm{div}}$      98.5
DVG                                   99.2

Table 1: Experimental analyses on the CASIA NIR-VIS 2.0 database. The backbone is LightCNN-9. (a) The quantitative comparisons of different methods. MD (lower is better) means the mean feature distance between the generated paired NIR and VIS images. FID (lower is better) is measured based on the features of LightCNN-9, instead of the traditional Inception model. (b) The ablation study.


4 Experiments


4.1 Databases and Protocols


Three NIR-VIS heterogeneous face databases and one Sketch-Photo heterogeneous face database are used to evaluate our proposed method. For NIR-VIS face recognition, following [35], we report the Rank-1 accuracy and the verification rate (VR) at a given false accept rate (FAR) for the CASIA NIR-VIS 2.0 [24], Oulu-CASIA NIR-VIS [18] and BUAA-VisNir Face [14] databases. Note that, for the Oulu-CASIA NIR-VIS database, only 20 subjects are selected as the training set. In addition, the IIIT-D Viewed Sketch database [1] is employed for Sketch-Photo face recognition. Due to the small number of images in the IIIT-D Viewed Sketch database, following the protocols of [3], we use the CUHK Face Sketch FERET (CUFSF) [37] database as the training set and report the Rank-1 accuracy and VR@FAR$=1\%$ for comparisons.


4.2 Experimental Details


For the dual variational generation, the architectures of the encoder and decoder networks are the same as [15], and the architecture of our discriminator is the same as [31]. These networks are trained using the Adam optimizer with a fixed learning rate of 0.0002. The parameters $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ and $\lambda_{4}$ in Eq. (9) are set to 50, 5, 1000 and 0.2, respectively. For the heterogeneous face recognition, we utilize both LightCNN-9 and LightCNN-29 [34] as the backbones. The models are pre-trained on the MS-Celeb-1M database [9] and fine-tuned on the HFR training sets. All face images are aligned to $144\times144$ and randomly cropped to $128\times128$ as the input for training. Stochastic gradient descent (SGD) is used as the optimizer, where the momentum is set to 0.9 and the weight decay to $5e{-}4$. The learning rate is set to $1e{-}3$ initially and gradually reduced to $5e{-}4$. The batch size is set to 64 and the dropout ratio to 0.5. The trade-off parameter $\alpha_{1}$ in Eq. (12) is set to 0.001 during training.


4.3 Experimental Analyses


In this section, we analyze three metrics, namely identity consistency, distribution consistency and visual quality, to demonstrate the effectiveness of DVG. The compared methods include CoGAN [25] and VAE [21]. For the VAE model, the input is the concatenated NIR-VIS image pair.


Identity Consistency. In order to analyze the identity consistency, we measure the feature distance between the generated paired images on the CASIA NIR-VIS 2.0 database. Specifically, we first use a pre-trained LightCNN-9 [34] to extract features and then measure the mean distance (MD) of the paired images. The results are reported in Table 1a. MD is computed from 50K generated image pairs, and the MD value of the original database is 0.26. We can clearly see that the MD value of DVG is even smaller than that of the original database, which means that our method can effectively guarantee the identity consistency of the generated paired images. The recognition performance of different methods is also reported in Table 1a; DVG correspondingly achieves the best results.

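The MD metric described above is just the average L2 distance between matched features; a NumPy sketch, assuming the features have already been extracted and row-aligned into NIR/VIS pairs (the function name is ours):

```python
import numpy as np

def mean_pair_distance(feats_nir, feats_vis):
    # MD: mean L2 distance between the features of each generated
    # NIR/VIS pair; lower values indicate stronger identity consistency.
    return np.linalg.norm(feats_nir - feats_vis, axis=1).mean()
```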

Distribution Consistency. On the CASIA NIR-VIS 2.0 database, we take the Fréchet Inception Distance (FID) [12] to measure the Fréchet distance of two distributions in the feature space, reflecting the distribution consistency. We first measure the FID between the generated VIS images and the real VIS images, and the FID between the generated NIR images and the real NIR images, respectively.



Figure 4: Visual comparisons of dual image generation results on the CASIA NIR-VIS 2.0 database. The generated paired images of DVG are more similar than those of CoGAN and VAE.



Figure 5: The dual generation results on the Oulu-CASIA NIR-VIS, the BUAA-VisNir and the CUHK Face Sketch FERET (CUFSF) databases.


Then we calculate the mean FID as the final result, which is reported in Table 1a. Considering that a face recognition network better extracts features of face images, we use a LightCNN-9 to extract features for calculating FID instead of the traditional Inception model. Similarly, the FID results are computed from 50K generated image pairs. As shown in Table 1a, DVG achieves the best results, demonstrating that DVG has indeed learned the distributions of the two modalities.

然后我们计算平均FID作为最终结果,结果报告在表1a中。考虑到人脸识别网络能更好地提取人脸图像特征,我们使用LightCNN-9替代传统的Inception模型来提取特征计算FID。同样地,FID结果基于5万张生成图像对计算得出。如表1a所示,DVG取得了最佳结果,这表明DVG确实学习到了两种模态的分布。
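For reference, the Fréchet distance between two feature sets, each modeled as a Gaussian $(\mu, \Sigma)$, can be computed with NumPy alone via $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}\big)$. This is a generic FID sketch, not the authors' code; the LightCNN-9 feature extraction step is omitted:

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between two (n_samples, dim) feature sets, each fit with (mu, Sigma)."""
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    # Tr((S1 S2)^{1/2}) computed via the symmetric form (S1^{1/2} S2 S1^{1/2})^{1/2},
    # which keeps everything in symmetric PSD land so eigh/eigvalsh apply.
    w, v = np.linalg.eigh(s1)
    sqrt_s1 = (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T
    w2 = np.linalg.eigvalsh(sqrt_s1 @ s2 @ sqrt_s1)
    tr_sqrt = np.sum(np.sqrt(np.clip(w2, 0.0, None)))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)
```

With identical inputs the distance is numerically zero; a pure mean shift of 1 in each of the 8 dimensions yields a distance of 8, since the covariances cancel.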

Visual Quality. In Fig. 4, we compare the dual generation results ($128\times128$ resolution) of different methods on the CASIA NIR-VIS 2.0 database. Our visual results are clearly better than those of CoGAN and VAE. Moreover, the generated paired images of VAE and CoGAN are not similar to each other, which leads to worse Rank-1 accuracy when optimizing the HFR network (see Table 1a). More dual generation results of DVG are shown in Fig. 2 ($256\times256$ resolution) and Fig. 5.

视觉质量。在图4中,我们比较了不同方法在CASIA NIR-VIS 2.0数据库上的双生成结果(128×128分辨率)。我们的视觉结果明显优于CoGAN和VAE。此外,可以观察到VAE和CoGAN生成的成对图像相似度较低,这导致优化HFR网络时Rank-1准确率下降(见表1a)。DVG的更多双生成结果如图2(256×256分辨率)和图5所示。

Ablation Study. Table 1b presents the comparison results of our DVG and its three variants on the CASIA NIR-VIS 2.0 database. We observe that recognition performance decreases whenever a component is removed. In particular, the accuracy drops significantly when the distribution alignment loss $\mathcal{L}_{\mathrm{dist}}$ or the pairwise identity preserving loss $\mathcal{L}_{\mathrm{ip-pair}}$ is not used. These results suggest that every component is crucial in our model.

消融实验。表 1b 展示了我们的 DVG 及其三个变体在 CASIA NIR-VIS 2.0 数据库上的对比结果。我们发现,如果未采用某个组件,识别性能会下降。特别是当不使用分布对齐损失 $\mathcal{L}_{\mathrm{dist}}$ 或成对身份保留损失 $\mathcal{L}_{\mathrm{ip-pair}}$ 时,准确率显著下降。这些结果表明,每个组件在我们的模型中都是至关重要的。

Moreover, we analyze how the number of generated samples influences the HFR network on the Oulu-CASIA NIR-VIS database, which contains only 20 identities with about 1,000 training images. We generate 1K, 5K, 10K and 50K pairs of heterogeneous images via DVG, obtaining $68.7\%$, $85.9\%$, $89.5\%$ and $89.4\%$ on $\operatorname{VR}@\operatorname{FAR}=0.1\%$ with LightCNN-9, respectively. Performance improves markedly as the number of generated pairs increases, suggesting that DVG can boost low-shot heterogeneous face recognition.

此外,我们分析了在仅包含20个身份约1,000张训练图像的Oulu-CASIA NIR-VIS数据库上,生成样本数量如何影响HFR网络性能。通过DVG生成1K、5K、10K和50K对异构图像后,LightCNN-9在$\operatorname{VR}@\operatorname{FAR}=0.1\%$指标下分别达到$68.7\%$、$85.9\%$、$89.5\%$和$89.4\%$。随着生成图像对数量的增加,结果显著提升,表明DVG能够有效增强少样本异构人脸识别性能。
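The $\operatorname{VR}@\operatorname{FAR}$ metric quoted throughout can be sketched as below. The threshold-selection rule used here (the highest impostor score still within the FAR budget) is one common convention, assumed for illustration rather than taken from the paper:

```python
import numpy as np

def vr_at_far(genuine: np.ndarray, impostor: np.ndarray, far: float = 1e-3) -> float:
    """Verification rate at a fixed false accept rate (e.g. VR@FAR=0.1%).

    genuine:  similarity scores of same-identity pairs
    impostor: similarity scores of different-identity pairs
    """
    desc = np.sort(impostor)[::-1]               # impostor scores, descending
    k = max(int(np.floor(far * len(desc))), 1)   # how many impostors FAR allows
    threshold = desc[k - 1]                      # accept only scores above this
    return float(np.mean(genuine > threshold))
```

For example, with 1,000 impostor scores and `far=0.01`, the threshold is set at the 10th-highest impostor score, and the returned value is the fraction of genuine pairs scoring above it.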

4.4 Comparisons with State-of-the-art Methods

4.4 与最先进方法的比较

This section demonstrates the recognition performance of the proposed DVG on four heterogeneous face recognition databases. Table 2 compares state-of-the-art methods, including IDNet [29], HFR-CNN [30], Hallucination [23], DLFace [28], TRIVET [26], IDR [10], W-CNN [11], RCN [4], MC-CNN [3] and DVR [35]. In addition, LightCNN-9 and LightCNN-29 serve as our baselines.

本节展示了我们提出的DVG在四个异构人脸识别数据库上的识别性能表现。表2对比了IDNet [29]、HFR-CNN [30]、Hallucination [23]、DLFace [28]、TRIVET [26]、IDR [10]、W-CNN [11]、RCN [4]、MC-CNN [3]和DVR [35]等前沿方法的性能。此外,LightCNN-9和LightCNN-29是我们的基线方法。

Table 2: Comparisons with other state-of-the-art deep HFR methods on the CASIA NIR-VIS 2.0, the Oulu-CASIA NIR-VIS, the BUAA-VisNir and the IIIT-D Viewed Sketch databases.

表 2: 在CASIA NIR-VIS 2.0、Oulu-CASIA NIR-VIS、BUAA-VisNir和IIIT-D Viewed Sketch数据库上与其他先进深度HFR方法的对比。

| Method | CASIA NIR-VIS 2.0 Rank-1 | CASIA VR@FAR=0.1% | Oulu-CASIA Rank-1 | Oulu VR@FAR=1% | Oulu VR@FAR=0.1% | BUAA Rank-1 | BUAA VR@FAR=1% | BUAA VR@FAR=0.1% | IIIT-D Rank-1 | IIIT-D VR@FAR=1% |
|---|---|---|---|---|---|---|---|---|---|---|
| IDNet [29] | 87.1 ± 0.9 | 74.5 | - | - | - | - | - | - | - | - |
| HFR-CNN [30] | 85.9 ± 0.9 | 78.0 | - | - | - | - | - | - | - | - |
| Hallucination [23] | 89.6 ± 0.9 | - | - | - | - | - | - | - | - | - |
| DLFace [28] | 98.68 | - | - | - | - | - | - | - | - | - |
| TRIVET [26] | 95.7 ± 0.5 | 91.0 ± 1.3 | 92.2 | 67.9 | 33.6 | 93.9 | 93.0 | 80.9 | - | - |
| IDR [10] | 97.3 ± 0.4 | 95.7 ± 0.7 | 94.3 | 73.4 | 46.2 | 94.3 | 93.4 | 84.7 | - | - |
| W-CNN [11] | 98.7 ± 0.3 | 98.4 ± 0.4 | 98.0 | 81.5 | 54.6 | 97.4 | 96.0 | 91.9 | - | - |
| DVR [35] | 99.7 ± 0.1 | 99.6 ± 0.3 | 100.0 | 97.2 | 84.9 | 99.2 | 98.5 | 96.9 | - | - |
| RCN [4] | 99.3 ± 0.2 | 98.7 ± 0.2 | - | - | - | - | - | - | 90.34 | - |
| MC-CNN [3] | 99.4 ± 0.1 | 99.3 ± 0.1 | - | - | - | - | - | - | 87.40 | - |
| LightCNN-9 | 97.1 ± 0.7 | 93.7 ± 0.8 | 93.8 | 80.4 | 43.8 | 94.8 | 94.3 | 83.5 | 84.07 | 75.30 |
| LightCNN-9 + DVG | 99.2 ± 0.3 | 98.8 ± 0.3 | 100.0 | 97.6 | 89.5 | 98.0 | 97.1 | 93.1 | 86.65 | 92.24 |
| LightCNN-29 | 98.1 ± 0.4 | 97.4 ± 0.5 | 99.0 | 93.1 | 68.3 | 96.8 | 97.0 | 89.4 | 83.24 | 81.04 |
| LightCNN-29 + DVG | 99.8 ± 0.1 | 99.8 ± 0.1 | 100.0 | 98.5 | 92.9 | 99.3 | 98.5 | 97.3 | 96.99 | 97.86 |


Figure 6: The ROC curves on the CASIA NIR-VIS 2.0, the Oulu-CASIA NIR-VIS and the BUAAVisNir databases, respectively

图 6: 分别在 CASIA NIR-VIS 2.0、Oulu-CASIA NIR-VIS 和 BUAAVisNir 数据库上的 ROC 曲线

Fig. 6 presents the ROC curves of TRIVET, IDR, W-CNN, DVR, and the proposed DVG. For clarity, we only plot the ROC curves of DVG trained with LightCNN-29. It is obvious that DVG outperforms the other state-of-the-art methods, especially on low-shot heterogeneous databases such as the Oulu-CASIA NIR-VIS database.

图 6 展示了 ROC 曲线,包括 TRIVET、IDR、W-CNN、DVR 以及提出的 DVG。为了更好地展示结果,我们仅呈现基于 LightCNN-29 训练的 DVG 的 ROC 曲线。显然,DVG 优于其他最先进方法,尤其是在低样本异构数据库(如 Oulu-CASIA NIR-VIS 数据库)上表现突出。

Apart from the commonly used NIR-VIS and Sketch-Photo settings above, we further explore other potential applications, including face recognition under different resolutions on the NJU-ID database [17] and under different poses on the Multi-PIE database [8]. The NJU-ID database consists of 256 identities, with one ID card image ($102\times126$ resolution) and one camera image ($640\times480$ resolution) per identity. Considering the small number of images in the NJU-ID database, we use our collected ID-Photo database (1,000 identities) as the training set and the NJU-ID database as the testing set. The Multi-PIE database contains 337 subjects with different poses. We use profiles $(\pm75^{\circ},\pm90^{\circ})$

除了上述常用的近红外-可见光(NIR-VIS)和素描-照片(Sketch-Photo)跨模态场景外,我们还探索了其他潜在应用,包括NJU-ID数据库[17]中不同分辨率的人脸识别以及Multi-PIE数据库[8]中不同姿态的人脸识别。NJU-ID数据库包含256个身份,每个身份包含一张身份证图像(102×126分辨率)和一张摄像头图像(640×480分辨率)。鉴于NJU-ID数据库图像数量较少,我们使用自建的证件照数据库(1000个身份)作为训练集,NJU-ID数据库作为测试集。Multi-PIE数据库包含337个不同姿态的受试者,我们采用侧面轮廓(±75°,±90°)作为测试姿态。

and frontal faces as different modalities. 200 persons are used as the training set and the remaining 137 persons form the testing set (Setting 2 of [13]). On the NJU-ID database, we improve Rank-1 by $5.5\%$ (DVG $96.8\%$ vs. baseline $91.3\%$) and $\operatorname{VR}@\operatorname{FAR}=1\%$ by $6.2\%$ (DVG $96.7\%$ vs. baseline $90.5\%$) over the baseline LightCNN-29. On the Multi-PIE database, the Rank-1 of $\pm90^{\circ}$ and $\pm75^{\circ}$ is increased by $18.5\%$ (DVG $83.9\%$ vs. baseline $65.4\%$) and $4.3\%$ (DVG $97.3\%$ vs. baseline $93.0\%$), respectively. We will continue to explore more applications in future work.

并将正面人脸作为不同模态。训练集使用200人,其余137人作为测试集([13]的设置2)。在NJU-ID数据库上,我们将Rank-1准确率提升了$5.5%$(DVG $96.8%$ - 基线 $91.3%$),$\operatorname{VR}@\operatorname{FAR}=1%$提升了$6.2%$(DVG $96.7%$ - 基线 $90.5%$),均优于基线LightCNN-29模型。在Multi-PIE数据库上,$\pm90^{o}$和$\pm75^{o}$的Rank-1准确率分别提高了$18.5%$(DVG $83.9%$ - 基线 $65.4%$)和$4.3%$(DVG $97.3%$ - 基线 $93.0%$)。我们将在未来工作中继续探索更多应用场景。

5 Conclusion

5 结论

This paper has developed a novel dual variational generation framework that generates large-scale new paired heterogeneous images with abundant intra-class diversity from noise, providing a new insight into the problems of HFR. A dual variational autoencoder is first proposed to learn a joint distribution of paired heterogeneous images. Then, both the distribution alignment in the latent space and the pairwise distance constraint in the image space are utilized to ensure the identity consistency of the generated image pairs. Finally, DVG generates diverse paired heterogeneous images with the same identity from noise to boost the HFR network. Extensive qualitative and quantitative experimental results on four databases have shown the superiority of our method.

本文提出了一种新颖的双变分生成框架,能够从噪声中生成具有丰富类内多样性的大规模新型配对异质图像,为异质人脸识别(HFR)问题提供了新思路。首先提出双变分自编码器来学习配对异质图像的联合分布;随后通过潜在空间的分布对齐和图像空间的成对距离约束,确保生成图像对的身份一致性;最终DVG框架实现了从噪声生成具有相同身份且多样化的配对异质图像,有效提升了HFR网络性能。在四个数据库上的大量定性与定量实验结果验证了本方法的优越性。
