Free-Form Image Inpainting with Gated Convolution
Figure 1: Free-form image inpainting results by our system built on gated convolution. Each triad shows original image, free-form input and our result from left to right. The system supports free-form mask and guidance like user sketch. It helps user remove distracting objects, modify image layouts and edit faces in images.
Abstract
We present a generative image inpainting system to complete images with free-form mask and guidance. The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, and generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not applicable. Thus, we also present a patch-based GAN loss, named SN-PatchGAN, by applying a spectral-normalized discriminator on dense image patches. SN-PatchGAN is simple in formulation, fast and stable in training. Results on automatic image inpainting and user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. Our system helps users quickly remove distracting objects, modify image layouts, clear watermarks and edit faces. Code, demo and models are available at: https://github.com/JiahuiYu/generative_inpainting
1. Introduction
Image inpainting (a.k.a. image completion or image hole-filling) is the task of synthesizing alternative contents in missing regions such that the modification is visually realistic and semantically correct. It allows users to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.
In computer vision, two broad approaches to image inpainting exist: patch matching using low-level image features, and feed-forward generative models with deep convolutional networks. The former approach [3, 8, 9] can synthesize plausible stationary textures, but usually fails critically in non-stationary cases like complicated scenes, faces and objects. The latter approach [15, 49, 45, 46, 38, 37, 48, 26, 52, 33, 35, 19] can exploit semantics learned from large-scale datasets to synthesize contents in non-stationary images in an end-to-end fashion.
However, deep generative models based on vanilla convolutions are naturally ill-fitted for image hole-filling because the spatially shared convolutional filters treat all input pixels or features as equally valid. For hole-filling, the input to each layer is composed of valid pixels/features outside holes and invalid ones in masked regions. Vanilla convolutions apply the same filters on all valid, invalid and mixed (for example, those on the hole boundary) pixels/features, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks [15, 49].
To address this limitation, partial convolution [23] was recently proposed, in which the convolution is masked and normalized to be conditioned only on valid pixels. It is then followed by a rule-based mask-update step to update valid locations for the next layer. Partial convolution categorizes all input locations as either invalid or valid, and multiplies a zero-or-one mask with the inputs throughout all layers. The mask can also be viewed as a single un-learnable feature gating channel. However, this assumption has several limitations. First, considering the input spatial locations across different layers of a network, they may include (1) valid pixels in the input image, (2) masked pixels in the input image, (3) neurons whose receptive field covers no valid pixel of the input image, (4) neurons whose receptive field covers different numbers of valid pixels of the input image (these valid image pixels may also have different relative locations), and (5) synthesized pixels in deep layers. Heuristically categorizing all locations as either invalid or valid ignores this important information. Second, if we extend to user-guided image inpainting where users provide a sparse sketch inside the mask, should these pixel locations be considered valid or invalid? How should the mask be properly updated for the next layer? Third, for partial convolution the "invalid" pixels progressively disappear layer by layer and the rule-based mask will be all ones in deep layers. However, to synthesize pixels in the hole, these deep layers may also need the information of whether current locations are inside or outside the hole. Partial convolution with an all-ones mask cannot provide such information. We will show that if we allow the network to learn the mask automatically, the mask may have different values based on whether current locations are masked or not in the input image, even in deep layers.
We propose gated convolution for free-form image inpainting. It learns a dynamic feature gating mechanism for each channel and each spatial location (for example, inside or outside masks, RGB channels or user-guidance channels). Specifically, we consider the formulation where the input feature is first used to compute gating values $g=\sigma(w_{g}x)$ ($\sigma$ is the sigmoid function, $w_{g}$ is a learnable parameter). The final output is a multiplication of the learned feature and the gating values, $y=\phi(wx)\odot g$, where $\phi$ can be any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also have conditional inputs like a sparse sketch. For network architectures, we stack gated convolutions to form an encoder-decoder network following [49]. Our inpainting network also integrates a contextual attention module within the same refinement network [49] to better capture long-range dependencies.
Without compromising performance, we also significantly simplify the training objectives to two terms: a pixel-wise reconstruction loss and an adversarial loss. The modification is mainly designed for free-form image inpainting. As the holes may appear anywhere in images with any shape, global and local GANs [15] designed for a single rectangular mask are not applicable. Instead, we propose a variant of generative adversarial networks, named SN-PatchGAN, motivated by global and local GANs [15], MarkovianGANs [21], perceptual loss [17] and recent work on spectral-normalized GANs [24]. The discriminator of SN-PatchGAN directly computes hinge loss on each point of the output map of shape $\mathbb{R}^{h\times w\times c}$, formulating $h\times w\times c$ GANs focusing on different locations and different semantics (represented in different channels). SN-PatchGAN is simple in formulation, fast and stable in training, and produces high-quality inpainting results.
Table 1: Comparison of different approaches including PatchMatch [3], Global&Local [15], ContextAttention [49], PartialConv [23] and our approach. The comparison of image inpainting is based on four dimensions: Semantic Understanding, Non-Local Algorithm, Free-Form Masks and User-Guided Option.
|  | PM [3] | GL [15] | CA [49] | PC [23] | Ours |
|---|---|---|---|---|---|
| Semantic Understanding |  | √ | √ | √ | √ |
| Non-Local Algorithm | √ |  | √ |  | √ |
| Free-Form Masks | √ |  |  | √ | √ |
| User-Guided Option | √ |  |  |  | √ |
For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user sketch as guided input. A comparison to other methods is summarized in Table 1. Our main contributions are as follows: (1) We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs. (2) We present a more practical patch-based GAN discriminator,
SN-PatchGAN, for free-form image inpainting. It is simple, fast and produces high-quality inpainting results. (3) We extend our inpainting model to an interactive one, enabling user sketch as guidance to obtain more user-desired inpainting results. (4) Our proposed inpainting system achieves higher-quality free-form inpainting than the previous state of the art on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps users quickly remove distracting objects, modify image layouts, clear watermarks and edit faces in images.
2. Related Work
2.1. Automatic Image Inpainting
A variety of approaches have been proposed for image inpainting. Traditionally, patch-based algorithms [8, 9] progressively extend pixels close to the hole boundaries based on low-level features (for example, features of mean square difference in RGB space) to search for and paste the most similar image patch. These algorithms work well on stationary textural regions but often fail on non-stationary images. Further, Simakov et al. propose a bidirectional similarity synthesis approach [36] to better capture and summarize non-stationary visual data. To reduce the high cost of memory and computation during search, tree-based acceleration structures [25] and randomized algorithms [3] are proposed. Moreover, inpainting results are improved by matching local features like image gradients [2, 5] and offset statistics of similar patches [11]. Recently, image inpainting systems based on deep learning have been proposed to directly predict pixel values inside masks. A significant advantage of these models is the ability to learn adaptive image features for different semantics. Thus they can synthesize more visually plausible contents, especially for images like faces [22, 47], objects [29] and natural scenes [15, 49]. Among these methods, Iizuka et al. [15] propose a fully convolutional image inpainting network with both global and local consistency to handle high-resolution images on a variety of datasets [18, 32, 53]. This approach, however, still relies heavily on Poisson image blending with traditional patch-based inpainting results [11]. Yu et al. [49] propose an end-to-end image inpainting model that adopts stacked generative networks to further ensure the color and texture consistency of generated regions with their surroundings. Moreover, for capturing long-range spatial dependencies, a contextual attention module [49] is proposed and integrated into networks to explicitly borrow information from distant spatial locations. However, this approach is mainly trained on large rectangular masks and does not generalize well to free-form masks. To better handle irregular masks, partial convolution [23] is proposed, where the convolution is masked and re-normalized to utilize valid pixels only. It is then followed by a rule-based mask-update step to recompute new masks layer by layer.
2.2. Guided Image Inpainting and Synthesis
To improve image inpainting, user guidance has been explored, including dots or lines [1, 3, 7, 40], structures [13], transformation or distortion information [14, 30] and image exemplars [4, 10, 20, 43, 51]. Notably, Hays and Efros [10] first utilize millions of photographs as a database to search for an example image which is most similar to the input, and then complete the image by cutting and pasting the corresponding regions from the matched image.
Recent advances in conditional generative networks empower user-guided image processing, synthesis and manipulation learned from large-scale datasets. Here we selectively review several related works. Zhang et al. [50] propose colorization networks which can take user guidance as additional inputs. Wang et al. [42] propose to synthesize high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks. Scribbler [34] explores a deep generative network conditioned on sketched boundaries and sparse color strokes to synthesize cars, bedrooms, or faces.
2.3. Feature-wise Gating
Feature-wise gating has been explored widely in vision [12, 28, 39, 41], language [6], speech [27] and many other tasks. For example, Highway Networks [39] utilize feature gating to ease gradient-based training of very deep networks. Squeeze-and-Excitation Networks re-calibrate feature responses by explicitly multiplying each channel with learned sigmoidal gating values. WaveNets [27] achieve better results by employing a special feature gating $y=\tanh(w_{1}x)\cdot\operatorname{sigmoid}(w_{2}x)$ for modeling audio signals.
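To make the mechanism concrete, here is a minimal Python sketch of this kind of feature gating (illustrative only; real WaveNets apply $w_1$ and $w_2$ as dilated convolutions rather than the plain linear maps used here):

```python
import torch

def gated_activation(x, w1, w2):
    """WaveNet-style feature gating: y = tanh(w1 x) * sigmoid(w2 x).

    x:      (batch, d_in) features; w1, w2: (d_in, d_out) weight matrices.
    The sigmoid branch acts as a learned soft gate on the tanh features.
    """
    return torch.tanh(x @ w1) * torch.sigmoid(x @ w2)

# Example usage with random tensors:
x = torch.randn(4, 16)
w1, w2 = torch.randn(16, 32), torch.randn(16, 32)
y = gated_activation(x, w1, w2)  # shape (4, 32)
```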
3. Approach
In this section, we describe our approach from bottom to top. We first introduce the details of gated convolution and SN-PatchGAN, then present the overview of the inpainting network in Figure 3 and our extension to allow optional user guidance.
3.1. Gated Convolution
We first explain why vanilla convolutions used in [15, 49] are ill-fitted for the task of free-form image inpainting. We consider a convolutional layer in which a bank of filters is applied to the input feature map to produce the output. Assume the input has $C$ channels; each pixel located at $(y,x)$ in the $C'$-channel output map is computed as
$$
O_{y,x}=\sum_{i=-k_{h}^{\prime}}^{k_{h}^{\prime}}\sum_{j=-k_{w}^{\prime}}^{k_{w}^{\prime}}W_{k_{h}^{\prime}+i,k_{w}^{\prime}+j}\cdot I_{y+i,x+j},
$$
where $x, y$ represent the x-axis and y-axis of the output map, $k_{h}$ and $k_{w}$ are the kernel sizes (e.g. $3\times 3$), $k_{h}^{\prime}=\frac{k_{h}-1}{2}$, $k_{w}^{\prime}=\frac{k_{w}-1}{2}$, $W\in\mathbb{R}^{k_{h}\times k_{w}\times C^{\prime}\times C}$ represents the convolutional filters, and $I_{y+i,x+j}\in\mathbb{R}^{C}$ and $O_{y,x}\in\mathbb{R}^{C^{\prime}}$ are inputs and outputs. For simplicity, the bias in convolution is ignored.
The equation shows that for all spatial locations $(y,x)$, the same filters are applied to produce the output in vanilla convolutional layers. This makes sense for tasks such as image classification and object detection, where all pixels of the input image are valid, to extract local features in a sliding-window fashion. However, for image inpainting, the input is composed of both regions with valid pixels/features outside holes and invalid pixels/features (in shallow layers) or synthesized pixels/features (in deep layers) in masked regions. This causes ambiguity during training and leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses during testing, as reported in [23].
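To make the spatial weight sharing explicit, the following NumPy sketch is a direct transcription of the equation above (ours, for illustration; note that it applies the same filter bank at every location, valid or not):

```python
import numpy as np

def vanilla_conv2d(I, W):
    """Direct evaluation of O[y, x] = sum_{i, j} W[kh' + i, kw' + j] . I[y + i, x + j].

    I: input feature map of shape (H, W_in, C).
    W: filter bank of shape (kh, kw, C_out, C); odd kernel sizes assumed.
    Returns O of shape (H, W_in, C_out); bias is ignored as in the text.
    """
    kh, kw, c_out, _ = W.shape
    kh2, kw2 = (kh - 1) // 2, (kw - 1) // 2
    H, W_in, _ = I.shape
    Ipad = np.pad(I, ((kh2, kh2), (kw2, kw2), (0, 0)))  # zero padding
    O = np.zeros((H, W_in, c_out))
    for y in range(H):
        for x in range(W_in):
            patch = Ipad[y:y + kh, x:x + kw, :]
            # Identical filters at every (y, x): no notion of valid/invalid pixels.
            O[y, x] = np.einsum('ijc,ijoc->o', patch, W)
    return O
```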
Recently, partial convolution was proposed [23], which adapts a masking and re-normalization step to make the convolution dependent only on valid pixels, as

$$
O_{y,x}=\begin{cases}\sum\sum W\cdot(I\odot M)\cdot\frac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(\mathbf{M})}, & \text{if }\operatorname{sum}(\mathbf{M})>0\\ 0, & \text{otherwise}\end{cases}
$$
Figure 2: Illustration of partial convolution (left) and gated convolution (right).
updated with rules, gated convolutions learn soft mask auto mati call y from data. It is formulated as:
更新规则后,门控卷积从数据中自动学习软掩码。其公式为:
in which $M$ is the corresponding binary mask, 1 represents that the pixel at location $(y,x)$ is valid, 0 represents that the pixel is invalid, and $\odot$ denotes element-wise multiplication. After each partial convolution operation, a mask-update step is required to propagate the new $M$ with the following rule: $m_{y,x}^{\prime}=1$, iff $\operatorname{sum}(\mathbf{M})>0$.
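For concreteness, a minimal PyTorch sketch of this masked, re-normalized convolution and the rule-based mask update might look as follows (our illustrative re-implementation, not the official PartialConv code):

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None):
    """Partial convolution [23], sketched: convolve only valid pixels, then
    re-normalize by sum(1)/sum(M) and update the binary mask by rule.

    x: (N, C, H, W) features; mask: (N, 1, H, W), 1 = valid, 0 = invalid.
    weight: (C_out, C, kh, kw) with odd kernel sizes ('same' padding).
    """
    kh, kw = weight.shape[2:]
    pad = (kh // 2, kw // 2)
    out = F.conv2d(x * mask, weight, bias=None, padding=pad)
    # Count valid locations under each filter window.
    ones = torch.ones(1, 1, kh, kw, device=x.device, dtype=x.dtype)
    valid = F.conv2d(mask, ones, padding=pad)
    # Re-normalize; zero the output where the window saw no valid pixel.
    scale = (kh * kw) / valid.clamp(min=1.0)
    out = torch.where(valid > 0, out * scale, torch.zeros_like(out))
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    # Rule-based update: m' = 1 iff sum(M) > 0 within the window.
    new_mask = (valid > 0).to(x.dtype)
    return out, new_mask
```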
Partial convolution [23] improves the quality of inpainting on irregular masks, but it still has remaining issues: (1) It heuristically classifies all spatial locations as either valid or invalid. The mask in the next layer will be set to ones no matter how many pixels are covered by the filter range in the previous layer (for example, 1 valid pixel and 9 valid pixels are treated the same when updating the current mask). (2) It is incompatible with additional user inputs. We aim at a user-guided image inpainting system where users can optionally provide a sparse sketch inside the mask as conditional channels. In this situation, should these pixel locations be considered valid or invalid? How should the mask be properly updated for the next layer? (3) For partial convolution, the invalid pixels will progressively disappear in deep layers, gradually converting all mask values to ones. However, our study shows that if we allow the network to learn the optimal mask automatically, the network assigns soft mask values to every spatial location even in deep layers. (4) All channels in each layer share the same mask, which limits the flexibility. Essentially, partial convolution can be viewed as un-learnable single-channel feature hard-gating.
We propose gated convolution for the image inpainting network, as shown in Figure 2. Instead of a hard-gating mask updated with rules, gated convolution learns a soft mask automatically from data. It is formulated as:

$$
\begin{aligned}
Gating_{y,x}&=\sum\sum W_{g}\cdot I,\\
Feature_{y,x}&=\sum\sum W_{f}\cdot I,\\
O_{y,x}&=\phi(Feature_{y,x})\odot\sigma(Gating_{y,x}),
\end{aligned}
$$

where $\sigma$ is the sigmoid function, thus the output gating values are between zero and one, $\phi$ can be any activation function (for example, ReLU, ELU and LeakyReLU), and $W_{g}$ and $W_{f}$ are two different convolutional filters.
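A minimal PyTorch sketch of such a layer is shown below (the ELU activation, 'same' padding and the use of two separate convolutions are our illustrative choices; the released model is implemented in TensorFlow):

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution sketch: O = phi(Feature) * sigmoid(Gating), with
    Feature = W_f * I and Gating = W_g * I computed by two filter banks."""

    def __init__(self, in_ch, out_ch, ksize=3, stride=1, dilation=1):
        super().__init__()
        pad = dilation * (ksize - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, ksize, stride, pad, dilation)
        self.gating = nn.Conv2d(in_ch, out_ch, ksize, stride, pad, dilation)
        self.phi = nn.ELU()

    def forward(self, x):
        # Soft gate in (0, 1), learned per channel and per spatial location,
        # replacing partial convolution's hard zero-or-one mask.
        return self.phi(self.feature(x)) * torch.sigmoid(self.gating(x))
```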
The proposed gated convolution learns a dynamic feature selection mechanism for each channel and each spatial location. Interestingly, visualization of intermediate gating values shows that it learns to select features not only according to background, mask and sketch, but also considering semantic segmentation in some channels. Even in deep layers, gated convolution learns to highlight the masked regions and sketch information in separate channels to better generate inpainting results.
3.2. Spectral-Normalized Markovian Discriminator (SN-PatchGAN)
For previous inpainting networks which try to fill a single rectangular hole, an additional local GAN is used on the masked rectangular region to improve results [15, 49]. However, we consider the task of free-form image inpainting where there may be multiple holes with any shape at any location. Motivated by global and local GANs [15], MarkovianGANs [16, 21], perceptual loss [17] and recent work on spectral-normalized GANs [24], we present a simple and effective GAN loss, SN-PatchGAN, for training free-form image inpainting networks.
A convolutional network is used as the discriminator, where the input consists of image, mask and guidance channels, and the output is a 3-D feature map of shape $\mathbb{R}^{h\times w\times c}$ ($h$, $w$, $c$ representing the height, width and number of channels respectively). As shown in Figure 3, six strided convolutions with kernel size 5 and stride 2 are stacked to capture the feature statistics of Markovian patches [21]. We then directly apply GANs for each feature element in this feature map, formulating $h\times w\times c$ GANs focusing on different locations and different semantics (represented in different channels) of the input image. It is noteworthy that the receptive field of each neuron in the output map can cover the entire input image in our training setting, thus a global discriminator is not necessary.
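The discriminator can be sketched as follows (the channel widths and the LeakyReLU slope are assumptions for illustration; the input channel count corresponds to RGB + mask + sketch):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def build_sn_patchgan_discriminator(in_ch=5, width=64):
    """Six spectral-normalized, kernel-5, stride-2 convolutions, as described
    above. The output is an (N, c, h, w) feature map; the hinge loss is then
    applied to every point of this map, so no global branch is needed."""
    chs = [width, 2 * width, 4 * width, 4 * width, 4 * width, 4 * width]
    layers, prev = [], in_ch
    for ch in chs:
        layers.append(spectral_norm(nn.Conv2d(prev, ch, 5, stride=2, padding=2)))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        prev = ch
    return nn.Sequential(*layers)
```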
Figure 3: Overview of our framework with gated convolution and SN-PatchGAN for free-form image inpainting.
We also adapt the recently proposed spectral normalization [24] to further stabilize the training of GANs. We use the default fast approximation algorithm of spectral normalization described in SN-GANs [24]. To discriminate whether the input is real or fake, we use the hinge loss as the objective function: $\mathcal{L}_{G}=-\mathbb{E}_{z\sim\mathbb{P}_{z}(z)}[D^{sn}(G(z))]$ for the generator and $\mathcal{L}_{D^{sn}}=\mathbb{E}_{x\sim\mathbb{P}_{data}(x)}[\operatorname{ReLU}(\mathbb{1}-D^{sn}(x))]+\mathbb{E}_{z\sim\mathbb{P}_{z}(z)}[\operatorname{ReLU}(\mathbb{1}+D^{sn}(G(z)))]$ for the discriminator, where $D^{sn}$ represents the spectral-normalized discriminator and $G$ is the image inpainting network that takes the incomplete image $z$.
With SN-PatchGAN, our inpainting network trains faster and more stably than the baseline model [49]. Perceptual loss is not used since similar patch-level information is already encoded in SN-PatchGAN. Compared with PartialConv [23], in which 6 different loss terms and balancing hyper-parameters are used, our final objective function for the inpainting network is composed only of a pixel-wise $\ell_{1}$ reconstruction loss and the SN-PatchGAN loss, with the default loss balancing hyper-parameter set to $1:1$.
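Putting the two terms together, a hedged sketch of the objectives (hinge losses averaged over all $h\times w\times c$ output points, plus the $\ell_{1}$ term balanced 1:1 by default) could read:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    """Hinge loss over every point of the discriminator's 3-D output map."""
    return torch.mean(F.relu(1.0 - d_real)) + torch.mean(F.relu(1.0 + d_fake))

def generator_loss(d_fake, completed, target, l1_weight=1.0):
    """Pixel-wise l1 reconstruction plus the SN-PatchGAN generator term,
    balanced 1:1 by default as stated above."""
    return l1_weight * F.l1_loss(completed, target) - torch.mean(d_fake)
```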
3.3. Inpainting Network Architecture
We customize a generative inpainting network [49] with the proposed gated convolution and SN-PatchGAN loss. Specifically, we adapt the full model architecture in [49] with both coarse and refinement networks. The full framework is summarized in Figure 3.
For the coarse and refinement networks, we use a simple encoder-decoder network [49] instead of the U-Net used in PartialConv [23]. We found that skip connections in a U-Net [31] have no significant effect for non-narrow masks. This is mainly because for the center of a masked region, the inputs of these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder of that region. For hole boundaries, our encoder-decoder architecture equipped with gated convolution is sufficient to generate seamless results.
We replace all vanilla convolutions with gated convolutions [49]. One potential problem is that gated convolutions introduce additional parameters. To maintain the same efficiency as our baseline model [49], we slim the model width by 25% and have not found an obvious performance drop either quantitatively or qualitatively. The inpainting network is trained end-to-end and can be tested on free-form holes at arbitrary locations. Our network is fully convolutional and supports different input resolutions in inference.
3.4. Free-Form Mask Generation
The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks, in essence, should be (1) similar to masks drawn in real use-cases, (2) diverse enough to avoid over-fitting, (3) efficient in computation and storage, and (4) controllable and flexible. The previous method [23] collects a fixed set of irregular masks from an occlusion estimation method between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase its diversity, the method does not meet the other requirements listed above.
We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users behave like using an eraser to brush back and forth to mask out undesired regions. This behavior can be simply simulated with a randomized algorithm that repeatedly draws lines and rotates angles. To ensure the smoothness of two lines, we also draw a circle at the joint between the two lines. More details are included in the supplementary materials due to space limits.
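A hedged sketch of such an on-the-fly mask sampler is given below; the sampling ranges are placeholders of our own choosing (the actual values are given in the paper's supplementary materials):

```python
import numpy as np
import cv2

def random_freeform_mask(h, w, max_strokes=8, max_vertices=12,
                         max_length=60, max_width=20):
    """Simulate eraser-like brushing: repeatedly draw lines with random
    angles and lengths, plus a circle at each joint for smoothness.
    Returns an (h, w) float32 mask with 1 inside the holes."""
    mask = np.zeros((h, w), np.uint8)
    for _ in range(np.random.randint(1, max_strokes + 1)):
        x, y = np.random.randint(0, w), np.random.randint(0, h)
        width = np.random.randint(5, max_width + 1)
        for _ in range(np.random.randint(1, max_vertices + 1)):
            angle = np.random.uniform(0, 2 * np.pi)
            length = np.random.randint(10, max_length + 1)
            nx = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            ny = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (nx, ny), 1, thickness=width)
            cv2.circle(mask, (nx, ny), width // 2, 1, thickness=-1)
            x, y = nx, ny
    return mask.astype(np.float32)
```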
3.5. Extension to User-Guided Image Inpainting
We use sketch as an example user guidance to extend our image inpainting network into a user-guided system. Sketch (or edge) is simple and intuitive for users to draw. We show cases with both faces and natural scenes. For faces, we extract landmarks and connect related landmarks. For natural scene images, we directly extract edge maps using the HED edge detector [44] and set all values above a certain threshold (i.e. 0.6) to ones. Sketch examples are shown in the supplementary materials due to space limits.
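The binarization step itself is a one-liner; a small sketch follows (running HED to obtain the edge probability map is assumed to happen upstream):

```python
import numpy as np

def edges_to_sketch(hed_edges, threshold=0.6):
    """Turn an (H, W) HED edge probability map in [0, 1] into the binary
    sketch channel; 0.6 is the threshold quoted in the text."""
    return (hed_edges >= threshold).astype(np.float32)
```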
For training the user-guided image inpainting system, intuitively we would need an additional constraint loss to enforce that the network generates results conditioned on the user guidance. However, with the same combination of pixel-wise reconstruction loss and GAN loss (with the conditional channels as input to the discriminator), we are able to learn a conditional generative network in which the generated results respect the user guidance faithfully. We also tried to use an additional pixel-wise loss on HED [44] output features, with the raw image or the generated result as input, to enforce constraints, but the inpainting quality was similar. The user-guided inpainting model is trained separately with a 5-channel input (R, G, B color channels, mask channel and sketch channel).
4. Results
We evaluate the proposed free-form image inpainting system on Places2 [53] and CelebA-HQ faces [18]. Our model has 4.1M parameters in total, and is trained with TensorFlow v1.8, CUDNN v7.0 and CUDA v9.0. For testing, it runs at 0.21 seconds per image on a single NVIDIA(R) Tesla(R) V100 GPU and 1.9 seconds on an Intel(R) Xeon(R) CPU @ 2.00GHz for images of resolution $512\times512$ on average, regardless of hole size.
4.1. Quantitative Results
As mentioned in [49], image inpainting lacks good quantitative evaluation metrics. Nevertheless, we report in Table 2 our evaluation results in terms of mean $\ell_{1}$ error and mean $\ell_{2}$ error on validation images of Places2, with both center rectangle masks and free-form masks. As shown in the table, learning-based methods perform better than PatchMatch [3] in terms of mean $\ell_{1}$ and $\ell_{2}$ errors. Moreover, partial convolution implemented within the same framework obtains worse performance, which may be due to its un-learnable rule-based gating.
Table 2: Results of mean $\ell_{1}$ error and mean $\ell_{2}$ error on validation images of Places2 with both rectangle masks and free-form masks. Both PartialConv* and ours are trained on the same random combination of rectangle and free-form masks. No edge guidance is utilized in training/inference to ensure a fair comparison. * denotes our implementation within the same framework due to the unavailability of the official implementation and models.
| Method | $\ell_1$ error (rect. mask) | $\ell_2$ error (rect. mask) | $\ell_1$ error (free-form mask) | $\ell_2$ error (free-form mask) |
|---|---|---|---|---|
| PatchMatch [3] | 16.1% | 3.9% | 11.3% | 2.4% |
| Global&Local [15] | 9.3% | 2.2% | 21.6% | 7.1% |
| ContextAttention [49] | 8.6% | 2.1% | 17.2% | 4.7% |
| PartialConv* [23] | 9.8% | 2.3% | 10.4% | 1.9% |
| Ours | 8.6% | 2.0% | 9.1% | 1.6% |
4.2. Qualitative Comparisons
Next, we compare our model with previous state-of-the-art methods [15, 23, 49]. Figure 4 and Figure 5 show automatic