[Paper Translation] Dual Variational Generation for Low-Shot Heterogeneous Face Recognition


Original paper: https://arxiv.org/pdf/1903.10203v3


Dual Variational Generation for Low-Shot Heterogeneous Face Recognition

Abstract


Heterogeneous Face Recognition (HFR) is a challenging issue because of the large domain discrepancy and a lack of heterogeneous data. This paper considers HFR as a dual generation problem and proposes a novel Dual Variational Generation (DVG) framework. It generates large-scale new paired heterogeneous images with the same identity from noise, for the sake of reducing the domain gap of HFR. Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images. Then, in order to ensure the identity consistency of the generated paired heterogeneous images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space. Moreover, the HFR network reduces the domain discrepancy by constraining the pairwise feature distances between the generated paired heterogeneous images. Extensive experiments on four HFR databases show that our method can significantly improve state-of-the-art results. The related code is available at https://github.com/BradyFU/DVG.


1 Introduction


With the development of deep learning, face recognition has made significant progress [34, 2] in recent years. However, in many real-world applications, such as video surveillance, facial authentication on mobile devices, and computer forensics, it is still a great challenge to match heterogeneous face images across different modalities, including sketch images [37], near infrared images [24], and polarimetric thermal images [36]. Heterogeneous face recognition (HFR) has therefore attracted much attention in the face recognition community. Due to the large domain gap, one challenge is that a face recognition model trained on VIS data often degrades significantly on HFR. Consequently, many cross-domain feature matching methods [10] have been introduced to reduce the large domain gap between heterogeneous face images. However, since it is expensive and time-consuming to collect a large number of heterogeneous face images, there is no public large-scale heterogeneous face database. With such limited training data, CNNs trained for HFR tend to overfit.


Recently, the great progress of high-quality face synthesis [38, 5, 33, 39] has made "recognition via generation" possible. TP-GAN [16] and CAPG-GAN [13] introduce face synthesis to improve the quantitative performance of large-pose face recognition. For HFR, [32] proposes a two-path model to synthesize VIS images from NIR images. [36] utilizes a GAN-based multi-stream feature fusion technique to generate VIS images from polarimetric thermal faces. However, all these methods are based on the conditional image-to-image translation framework, leading to two potential challenges: 1) Diversity: Given one image, a generator only synthesizes one new image of the target domain [32], which means such conditional image-to-image translation methods can only generate a limited number of images. In addition, as shown in the left part of Fig. 1, the two images before and after translation have the same attributes (e.g., the pose and the expression) except for the spectral information, so it is difficult for such conditional image-to-image translation methods to promote intra-class diversity. These problems become especially prominent in low-shot heterogeneous face recognition, i.e., learning from few heterogeneous data. 2) Consistency: When generating large-scale samples, it is challenging to guarantee that the synthesized face images belong to the same identity as the input images. Although the identity preserving loss [13] constrains the distances between features of the input and synthesized images, it does not constrain the intra-class and inter-class distances of the embedding space.



Figure 1: The diversity comparisons between the conditional image-to-image translation [32] (left part: the top is the input NIR image and the bottom is the corresponding translated VIS image) and our unconditional DVG (right part: all paired heterogeneous images are generated from noise). For the conditional image-to-image translation methods, given one NIR image, a generator only synthesizes one new VIS image with the same attributes (e.g., the pose and the expression) except for the spectral information. Differently, DVG generates massive new paired images with rich intra-class diversity from noise.


To tackle the above challenges, we propose a novel unconditional Dual Variational Generation (DVG) framework (shown in Fig. 3) that generates large-scale paired heterogeneous images with the same identity from noise. Unconditional generative models can generate new images from noise [21] (one image at a time), but since these images have no identity labels, it is difficult to use them for recognition networks. DVG makes use of this generative property of the unconditional model [21] and adopts a dual generation manner to obtain a pair of heterogeneous images with the same identity each time. This enables DVG to generate large-scale images and makes the generated images usable for optimizing recognition networks. Meanwhile, DVG also absorbs the various intra-class changes of the training database, so that the generated paired images have abundant intra-class diversity. For instance, as presented in the right part of Fig. 1, the first four paired images have different poses, and the fifth pair has different expressions. Furthermore, DVG only pays attention to the identity consistency of the paired heterogeneous images rather than the specific identity to which they belong, which avoids the consistency problem of previous methods. Specifically, we introduce a dual variational autoencoder to learn a joint distribution of paired heterogeneous images. In order to constrain the generated paired images to belong to the same identity, we impose both a distribution alignment in the latent space and a pairwise identity preserving in the image space. New paired images are generated by sampling and copying a noise vector from a standard Gaussian distribution, as displayed in the left part of Fig. 3. These generated paired images are used to optimize the HFR network via a pairwise distance constraint, aiming at reducing the domain discrepancy.


In summary, the main contributions are as follows:


• We provide a new insight into the problems of HFR. That is, we consider HFR as a dual generation problem, and propose a novel dual variational generation framework. This framework generates new paired heterogeneous images with abundant intra-class diversity to reduce the domain gap of HFR.
• In order to guarantee that the generated paired images belong to the same identity, we constrain the consistency of the paired images in both the latent space and the image space. This allows new images sampled from noise to be used for recognition networks.



Figure 2: The dual generation results from noise ($256\times256$ resolution). For each pair, the left is the NIR image and the right is the paired VIS image.


2 Background and Related Work


2.1 Heterogeneous Face Recognition


Many researchers have paid attention to Heterogeneous Face Recognition (HFR). For feature-level learning, [22] employs HOG features with sparse representation for HFR. [7] utilizes LBP histograms with Linear Discriminant Analysis to obtain domain-invariant features. [10] proposes Invariant Deep Representation (IDR) to disentangle representations into two orthogonal subspaces for NIR-VIS HFR. Further, [11] extends IDR by introducing the Wasserstein distance to obtain domain-invariant features for HFR. For image-level learning, the common idea is to transform heterogeneous face images from one modality into another via image synthesis. [19] utilizes joint dictionary learning to reconstruct face images for boosting the performance of face matching. [23] proposes a cross-spectral hallucination and low-rank embedding to synthesize heterogeneous images in a patch-wise manner.


2.2 Generative Models


Variational autoencoders (VAEs) [21] and generative adversarial networks (GANs) [6] are the most prominent generative models. VAEs consist of an encoder network $q_{\phi}(z|x)$ and a decoder network $p_{\theta}(x|z)$. $q_{\phi}(z|x)$ maps input images $x$ to latent variables $z$ that match a prior $p(z)$, and $p_{\theta}(x|z)$ samples images $x$ from the latent variables $z$. The evidence lower bound objective (ELBO) of VAEs is:


$$
\log p_{\theta}(x)\geq\mathbb{E}_{q_{\phi}(z|x)}\log p_{\theta}(x|z)-D_{\mathrm{KL}}\big(q_{\phi}(z|x)\,||\,p(z)\big).
$$

The two terms in the ELBO are a reconstruction error and a Kullback-Leibler divergence, respectively.

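To make these two terms concrete, here is a minimal PyTorch sketch of a diagonal-Gaussian VAE loss; the pixel-wise MSE stand-in for the log-likelihood and the function names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I)
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_elbo_loss(x, x_recon, mu, log_var):
    # Reconstruction term: -E_q[log p(x|z)], approximated pixel-wise (assumption)
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Minimizing this sum maximizes the ELBO
    return recon + kl
```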

Differently, GANs adopt a generator $G$ and a discriminator $D$ to play a min-max game. $G$ generates images from a prior $p(z)$ to confuse $D$, and $D$ is trained to distinguish between generated data and real data. This adversarial rule takes the form:


$$
\underset{G}{\mathrm{min}}\,\underset{D}{\mathrm{max}}\ \mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right]+\mathbb{E}_{z\sim p_{z}(z)}\left[\log(1-D(G(z)))\right].
$$

These models have achieved remarkable success in various applications, such as unconditional image generation, which generates images from noise [20, 15], and conditional image generation, which synthesizes images according to a given condition [32, 16]. According to [15], VAEs have nice manifold representations, while GANs are better at generating sharper images.

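For reference, the min-max objective above is typically implemented as two binary cross-entropy losses; the PyTorch sketch below uses the common non-saturating generator variant, which is an assumption rather than something specified here.

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real_logits, d_fake_logits):
    # Discriminator side: maximize log D(x) + log(1 - D(G(z))),
    # i.e. minimize BCE against "real" and "fake" targets
    d_loss = (
        F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
        + F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    )
    # Generator side: the non-saturating trick maximizes log D(G(z))
    # instead of minimizing log(1 - D(G(z)))
    g_loss = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    return d_loss, g_loss
```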

Another work that addresses a problem similar to ours is CoGAN [25], which uses a weight-sharing manner to generate paired images in two different modalities. However, CoGAN does not explicitly constrain the identity consistency of paired images in either the latent space or the image space. It is therefore challenging for the weight-sharing manner of CoGAN to generate paired images with the same identity, as shown in Fig. 4.



Figure 3: The purpose (left part) and training model (right part) of our unconditional DVG framework. DVG generates large-scale new paired heterogeneous images with the same identity from standard Gaussian noise, aiming at reducing the domain discrepancy for HFR. To achieve this purpose, we elaborately design a dual variational autoencoder. Given a pair of heterogeneous images from the same identity, the dual variational autoencoder learns a joint distribution in the latent space. In order to guarantee the identity consistency of the generated paired images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space.


3 Proposed Method


In this section, we introduce our method in detail, including the dual variational generation and heterogeneous face recognition. Note that we specifically discuss NIR-VIS images for clearer presentation; other heterogeneous modalities are also applicable.


3.1 Dual Variational Generation


As shown in the right part of Fig. 3, DVG consists of a feature extractor $F_{ip}$ and a dual variational autoencoder: two encoder networks and a decoder network, all of which play the same roles as in VAEs [21]. Specifically, $F_{ip}$ extracts the semantic information of the generated images to preserve the identity information. The encoder network $E_{N}$ maps NIR images $x_{N}$ to a latent distribution $q_{\phi_{N}}(z_{N}|x_{N})$, from which $z_{N}$ is sampled via the reparameterization trick: $z_{N}=u_{N}+\sigma_{N}\odot\epsilon$, where $u_{N}$ and $\sigma_{N}$ denote the mean and standard deviation predicted for the NIR images, respectively. In addition, $\epsilon$ is sampled from a multivariate standard Gaussian and $\odot$ denotes the Hadamard product. The encoder network $E_{V}$ operates in the same manner for VIS images $x_{V}$: $z_{V}=q_{\phi_{V}}(z_{V}|x_{V})$. After obtaining the two independent distributions, we concatenate $z_{N}$ and $z_{V}$ to get the joint distribution $z_{I}$.

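The following PyTorch sketch illustrates this dual encoding step under assumed, simplified architectures; the layer sizes, the $64\times64$ input resolution, and the class name `DualEncoder` are all illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Two modality-specific encoders producing the joint latent code z_I."""
    def __init__(self, latent_dim=128):
        super().__init__()
        def make_encoder():
            # Assumes 3x64x64 inputs; two stride-2 convs give 128x16x16 features
            return nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(128 * 16 * 16, 2 * latent_dim),  # -> (mu, log_var)
            )
        self.enc_nir, self.enc_vis = make_encoder(), make_encoder()

    @staticmethod
    def reparameterize(mu, log_var):
        # z = mu + sigma * eps, eps ~ N(0, I)
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    def forward(self, x_nir, x_vis):
        mu_n, logv_n = self.enc_nir(x_nir).chunk(2, dim=1)
        mu_v, logv_v = self.enc_vis(x_vis).chunk(2, dim=1)
        z_n = self.reparameterize(mu_n, logv_n)
        z_v = self.reparameterize(mu_v, logv_v)
        z_i = torch.cat([z_n, z_v], dim=1)  # joint latent code z_I
        return z_i, (mu_n, logv_n), (mu_v, logv_v)
```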

Distribution Learning We utilize VAEs to learn the joint distribution of paired NIR-VIS images. Given a pair of NIR-VIS images $\{x_{N},x_{V}\}$, we constrain the posterior distributions $q_{\phi_{N}}(z_{N}|x_{N})$ and $q_{\phi_{V}}(z_{V}|x_{V})$ by the Kullback-Leibler divergence:


$$
\mathcal{L}_{\mathrm{kl}}=D_{\mathrm{KL}}\big(q_{\phi_{N}}(z_{N}|x_{N})\,||\,p(z_{N})\big)+D_{\mathrm{KL}}\big(q_{\phi_{V}}(z_{V}|x_{V})\,||\,p(z_{V})\big),
$$

where the prior distributions $p(z_{N})$ and $p(z_{V})$ are both multivariate standard Gaussian distributions. Like the original VAEs, we require the decoder network $p_{\theta}(x_{N},x_{V}|z_{I})$ to be able to reconstruct the input images $x_{N}$ and $x_{V}$ from the learned distribution:


$$
\mathcal{L}_{\mathrm{rec}}=-\mathbb{E}_{q_{\phi_{N}}(z_{N}|x_{N})\cup q_{\phi_{V}}(z_{V}|x_{V})}\log p_{\theta}(x_{N},x_{V}|z_{I}).
$$
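A minimal PyTorch sketch of these two terms, assuming diagonal Gaussian posteriors parameterized by `(mu, log_var)` from each encoder and an L1 pixel loss as a stand-in for the log-likelihood (both are assumptions, not the paper's exact choices):

```python
import torch
import torch.nn.functional as F

def distribution_learning_loss(x_n, x_v, x_n_rec, x_v_rec,
                               mu_n, logv_n, mu_v, logv_v):
    # L_kl: KL(q(z_N|x_N) || N(0, I)) + KL(q(z_V|x_V) || N(0, I))
    kl_n = -0.5 * torch.sum(1 + logv_n - mu_n.pow(2) - logv_n.exp())
    kl_v = -0.5 * torch.sum(1 + logv_v - mu_v.pow(2) - logv_v.exp())
    # L_rec: negative log-likelihood, approximated by a pixel-wise L1 loss
    rec = (F.l1_loss(x_n_rec, x_n, reduction="sum")
           + F.l1_loss(x_v_rec, x_v, reduction="sum"))
    return kl_n + kl_v, rec
```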

Distribution Alignment We expect a pair of NIR-VIS images $\{x_{N},x_{V}\}$ to be projected into a common latent space by the encoders $E_{N}$ and $E_{V}$, i.e., the NIR distribution $p(z_{N}^{(i)})$ is the same as the VIS distribution $p(z_{V}^{(i)})$, where $i$ denotes the identity information. That means we maintain the identity consistency of the generated paired images in the latent space. Explicitly, we align the NIR and VIS distributions by minimizing the Wasserstein distance between the two distributions. Given two Gaussian distributions $p(z_{N}^{(i)})=N(u_{N}^{(i)},{\sigma_{N}^{(i)}}^{2})$ and $p(z_{V}^{(i)})=N(u_{V}^{(i)},{\sigma_{V}^{(i)}}^{2})$, the 2-Wasserstein distance between $p(z_{N}^{(i)})$ and $p(z_{V}^{(i)})$ is simplified [10] as:


$$
\mathcal{L}_{\mathrm{dist}}=\frac{1}{2}\left[||u_{N}^{(i)}-u_{V}^{(i)}||_{2}^{2}+||\sigma_{N}^{(i)}-\sigma_{V}^{(i)}||_{2}^{2}\right].
$$
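Under the same `(mu, log_var)` parameterization as in the earlier sketches, this alignment term has a direct closed form; the snippet below is an illustrative implementation, not the authors' code.

```python
import torch

def distribution_alignment_loss(mu_n, logv_n, mu_v, logv_v):
    # Simplified 2-Wasserstein distance between two diagonal Gaussians:
    # 1/2 * (||u_N - u_V||_2^2 + ||sigma_N - sigma_V||_2^2)
    sigma_n, sigma_v = torch.exp(0.5 * logv_n), torch.exp(0.5 * logv_v)
    return 0.5 * ((mu_n - mu_v).pow(2).sum()
                  + (sigma_n - sigma_v).pow(2).sum())
```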

Pairwise Identity Preserving In previous image-to-image translation works [16, 13], identity preserving is usually introduced to maintain identity information. The traditional approach uses a pre-trained feature extractor to enforce the features of the generated images to be close to those of the target ones. However, since intra-class and inter-class constraints are lacking, it is challenging to guarantee that the synthesized images belong to the specific categories of the target images. Considering that DVG generates a pair of heterogeneous images at a time, we only need to consider the identity consistency of the paired images.


Specifically, we adopt LightCNN [34] as the feature extractor $F_{ip}$ to constrain the feature distance between the reconstructed paired images:


$$
\mathcal{L}_{\mathrm{ip-pair}}=||F_{ip}(\hat{x}_{N})-F_{ip}(\hat{x}_{V})||_{2}^{2},
$$

where $F_{ip}(\cdot)$ denotes the normalized output of the last fully connected layer of $F_{ip}$. In addition, we also use $F_{ip}$ to make the features of the reconstructed images and the original input images close enough, as in previous works [16, 13]:


$$
\mathcal{L}_{\mathrm{ip-rec}}=||F_{ip}(\hat{x}_{N})-F_{ip}(x_{N})||_{2}^{2}+||F_{ip}(\hat{x}_{V})-F_{ip}(x_{V})||_{2}^{2},
$$

where $\hat{x}_{N}$ and $\hat{x}_{V}$ denote the reconstructions of the input paired images $x_{N}$ and $x_{V}$, respectively.

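Both identity terms reduce to squared distances between embeddings from the frozen extractor; a possible PyTorch sketch follows, where the callable `f_ip` standing in for the pretrained LightCNN-style extractor (returning normalized embeddings) is an assumption.

```python
import torch

def identity_preserving_losses(f_ip, x_n, x_v, x_n_rec, x_v_rec):
    # f_ip: pretrained, frozen feature extractor returning normalized
    # embeddings of the last fully connected layer (assumed interface)
    with torch.no_grad():
        feat_n, feat_v = f_ip(x_n), f_ip(x_v)  # targets, no gradient needed
    feat_n_rec, feat_v_rec = f_ip(x_n_rec), f_ip(x_v_rec)
    # L_ip-pair: the reconstructed NIR/VIS pair should share one identity
    ip_pair = (feat_n_rec - feat_v_rec).pow(2).sum()
    # L_ip-rec: reconstructions should keep the inputs' identity features
    ip_rec = ((feat_n_rec - feat_n).pow(2).sum()
              + (feat_v_rec - feat_v).pow(2).sum())
    return ip_pair, ip_rec
```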

Diversity Constraint In order to further increase the diversity of the generated images, we also introduce a diversity loss [27]. In the sampling stage, when two sampled noise vectors $z_{I_{1}}$ and $z_{I_{2}}$ are close, the generated images $x_{I_{1}}$ and $x_{I_{2}}$ will be similar. We maximize the following loss to encourage the decoder $D_{I}$ to generate more diverse images:


$$
\mathcal{L}_{\mathrm{div}}=\operatorname*{max}_{D_{I}}\frac{|F_{ip}(x_{I_{1}})-F_{ip}(x_{I_{2}})|}{|z_{I_{1}}-z_{I_{2}}|}.
$$
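A sketch of this ratio in PyTorch, using L1 norms and a small epsilon for numerical stability (both are assumptions layered on the formula above):

```python
def diversity_loss(f_ip, x1, x2, z1, z2, eps=1e-8):
    # L_div = |F_ip(x_I1) - F_ip(x_I2)| / |z_I1 - z_I2|, maximized w.r.t.
    # the decoder so that nearby noise vectors yield distinct images
    num = (f_ip(x1) - f_ip(x2)).abs().sum()
    den = (z1 - z2).abs().sum() + eps  # eps guards against division by zero
    return num / den  # maximize this (equivalently, minimize its negative)
```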

Overall Loss Moreover, in order to increase the sharpness of the generated images, we also adopt an adversarial loss $\mathcal{L}_{\mathrm{adv}}$ as in [31]. Hence, the overall loss to optimize the dual variational autoencoder can be formulated as


$$
\mathcal{L}_{\mathrm{gen}}=\mathcal{L}_{\mathrm{rec}}+\mathcal{L}_{\mathrm{kl}}+\mathcal{L}_{\mathrm{adv}}+\lambda_{1}\mathcal{L}_{\mathrm{dist}}+\lambda_{2}\mathcal{L}_{\mathrm{ip-pair}}+\lambda_{3}\mathcal{L}_{\mathrm{ip-rec}}+\lambda_{4}\mathcal{L}_{\mathrm{div}},
$$

where $\lambda_{1},\lambda_{2},\lambda_{3}$ and $\lambda_{4}$ are the trade-off parameters.

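Combining the pieces above into one training objective might look as follows; since $\mathcal{L}_{\mathrm{div}}$ is maximized, it enters the minimized total with a negative sign, which is our reading stated as an assumption.

```python
def generation_loss(l_rec, l_kl, l_adv, l_dist, l_ip_pair, l_ip_rec, l_div,
                    lambda1, lambda2, lambda3, lambda4):
    # L_gen = L_rec + L_kl + L_adv + lambda1*L_dist + lambda2*L_ip-pair
    #       + lambda3*L_ip-rec + lambda4*L_div; the diversity term is
    #       maximized, so it is subtracted here (assumed sign convention)
    return (l_rec + l_kl + l_adv
            + lambda1 * l_dist + lambda2 * l_ip_pair
            + lambda3 * l_ip_rec - lambda4 * l_div)
```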

3.2 Heterogeneous Face Recognition


For heterogeneous face recognition, our training data contain the original limited labeled data $x_{i}\,(i\in\{N,V\})$ and the large-scale generated unlabeled paired NIR-VIS data $\tilde{x}_{i}\,(i\in\{N,V\})$. Here, we define a heterogeneous face recognition network $F$ to extract features $f_{i}=F(x_{i};\Theta)$, where $i\in\{N,V\}$ and $\Theta$ denotes the parameters of $F$. For the original labeled NIR and VIS images, we utilize a softmax loss:


$$
\mathcal{L}_{\mathrm{cls}}=\sum_{i\in\{N,V\}}\operatorname{softmax}(F(x_{i};\Theta),y),
$$

where $y$ is the identity label.


For the generated paired heterogeneous images, since they are generated from noise, there are no specific classes for the paired images. But as mentioned in Section 3.1, DVG ensures that the generated paired images belong to the same identity. Therefore, a pairwise distance loss between the paired heterogeneous samples is formulated as follows:


$$
\mathcal{L}_{\mathrm{pair}}=||F(\tilde{x}_{N};\Theta)-F(\tilde{x}_{V};\Theta)||_{2}^{2}.
$$

In this way, we can efficiently minimize the domain discrepancy by generating large-scale unlabeled paired heterogeneous data.
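Putting the two recognition losses together, a hedged PyTorch sketch of one training step's objective; the helper `model.extract_features`, the balance weight `alpha`, and the mean reduction are illustrative assumptions rather than the paper's specification.

```python
import torch.nn.functional as F

def hfr_step_loss(model, x_nir, x_vis, labels, gen_nir, gen_vis, alpha):
    # L_cls: softmax (cross-entropy) loss on the original labeled images
    l_cls = (F.cross_entropy(model(x_nir), labels)
             + F.cross_entropy(model(x_vis), labels))
    # L_pair: generated pairs carry no class labels but share an identity
    # by construction, so only their embedding distance is constrained
    f_n = model.extract_features(gen_nir)  # hypothetical feature hook
    f_v = model.extract_features(gen_vis)
    l_pair = (f_n - f_v).pow(2).sum(dim=1).mean()
    return l_cls + alpha * l_pair  # alpha: assumed trade-off weight
```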