[Paper Translation] Free-Form Image Inpainting with Gated Convolution


Original paper: https://arxiv.org/pdf/1806.03589v2


Free-Form Image Inpainting with Gated Convolution


Figure 1: Free-form image inpainting results by our system built on gated convolution. Each triad shows the original image, the free-form input and our result, from left to right. The system supports free-form masks and guidance such as user sketches. It helps users remove distracting objects, modify image layouts and edit faces in images.

Abstract


We present a generative image inpainting system to complete images with free-form mask and guidance. The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, and generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not applicable. Thus, we also present a patch-based GAN loss, named SN-PatchGAN, by applying a spectral-normalized discriminator on dense image patches. SN-PatchGAN is simple in formulation, fast and stable in training. Results on automatic image inpainting and user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. Our system helps users quickly remove distracting objects, modify image layouts, clear watermarks and edit faces. Code, demo and models are available at: https://github.com/JiahuiYu/generative_inpainting

1. Introduction


Image inpainting (a.k.a. image completion or image hole-filling) is the task of synthesizing alternative contents in missing regions such that the modification is visually realistic and semantically correct. It allows users to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.

In computer vision, two broad approaches to image inpainting exist: patch matching using low-level image features and feed-forward generative models with deep convolutional networks. The former approach [3, 8, 9] can synthesize plausible stationary textures, but usually makes critical failures in non-stationary cases like complicated scenes, faces and objects. The latter approach [15, 49, 45, 46, 38, 37, 48, 26, 52, 33, 35, 19] can exploit semantics learned from large-scale datasets to synthesize contents in non-stationary images in an end-to-end fashion.

However, deep generative models based on vanilla convolutions are naturally ill-fitted for image hole-filling because the spatially shared convolutional filters treat all input pixels or features as equally valid. For hole-filling, the input to each layer is composed of valid pixels/features outside holes and invalid ones in masked regions. Vanilla convolutions apply the same filters on all valid, invalid and mixed (for example, the ones on hole boundary) pixels/features, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks [15, 49].

To address this limitation, partial convolution [23] was recently proposed, where the convolution is masked and normalized to be conditioned only on valid pixels. It is then followed by a rule-based mask-update step to update valid locations for the next layer. Partial convolution categorizes all input locations to be either invalid or valid, and multiplies a zero-or-one mask to inputs throughout all layers. The mask can also be viewed as a single un-learnable feature gating channel. However, this assumption has several limitations. First, considering the input spatial locations across different layers of a network, they may include (1) valid pixels in the input image, (2) masked pixels in the input image, (3) neurons with a receptive field covering no valid pixel of the input image, (4) neurons with a receptive field covering a different number of valid pixels of the input image (these valid image pixels may also have different relative locations), and (5) synthesized pixels in deep layers. Heuristically categorizing all locations to be either invalid or valid ignores this important information. Second, if we extend to user-guided image inpainting where users provide a sparse sketch inside the mask, should these pixel locations be considered as valid or invalid? How should the mask be updated for the next layer? Third, for partial convolution the "invalid" pixels will progressively disappear layer by layer and the rule-based mask will be all ones in deep layers. However, to synthesize pixels in the hole these deep layers may also need the information of whether current locations are inside or outside the hole. The partial convolution with an all-ones mask cannot provide such information. We will show that if we allow the network to learn the mask automatically, the mask may have different values based on whether current locations are masked or not in the input image, even in deep layers.

We propose gated convolution for free-form image inpainting. It learns a dynamic feature gating mechanism for each channel and each spatial location (for example, inside or outside masks, RGB channels or user-guidance channels). Specifically, we consider the formulation where the input feature is first used to compute gating values $g=\sigma(w_{g}x)$ ($\sigma$ is the sigmoid function, $w_{g}$ is a learnable parameter). The final output is a multiplication of learned feature and gating values $y=\phi(wx)\odot g$, where $\phi$ can be any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also have conditional inputs like sparse sketch. For network architectures, we stack gated convolutions to form an encoder-decoder network following [49]. Our inpainting network also integrates the contextual attention module within the same refinement network [49] to better capture long-range dependencies.

Without compromising performance, we also significantly simplify the training objectives to two terms: a pixel-wise reconstruction loss and an adversarial loss. The modification is mainly designed for free-form image inpainting. As the holes may appear anywhere in images with any shape, global and local GANs [15] designed for a single rectangular mask are not applicable. Instead, we propose a variant of generative adversarial networks, named SN-PatchGAN, motivated by global and local GANs [15], Markovian GANs [21], perceptual loss [17] and recent work on spectral-normalized GANs [24]. The discriminator of SN-PatchGAN directly computes hinge loss on each point of the output map with format $\mathbb{R}^{h\times w\times c}$, formulating $h\times w\times c$ number of GANs focusing on different locations and different semantics (represented in different channels). SN-PatchGAN is simple in formulation, fast and stable in training and produces high-quality inpainting results.

Table 1: Comparison of different approaches including PatchMatch [3], Global&Local [15], Context Attention [49], PartialConv [23] and our approach. The comparison of image inpainting is based on four dimensions: Semantic Understanding, Non-Local Algorithm, Free-Form Masks and User-Guided Option.

PM [3] GL [15] CA [49] PC [23] Ours
Semantic Understanding
Non-Local Algorithm
Free-Form Masks
User-Guided Option

For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user sketch as guided input. Comparison to other methods is summarized in Table 1. Our main contributions are as follows: (1) We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs. (2) We present a more practical patch-based GAN discriminator, SN-PatchGAN, for free-form image inpainting. It is simple, fast and produces high-quality inpainting results. (3) We extend our inpainting model to an interactive one, enabling user sketch as guidance to obtain more user-desired inpainting results. (4) Our proposed inpainting system achieves higher-quality free-form inpainting than the previous state of the art on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps users quickly remove distracting objects, modify image layouts, clear watermarks and edit faces in images.

2. Related Work


2.1. Automatic Image Inpainting


A variety of approaches have been proposed for image inpainting. Traditionally, patch-based [8, 9] algorithms progressively extend pixels close to the hole boundaries based on low-level features (for example, features of mean square difference on RGB space), to search and paste the most similar image patch. These algorithms work well on stationary textural regions but often fail on non-stationary images. Further, Simakov et al. propose a bidirectional similarity synthesis approach [36] to better capture and summarize non-stationary visual data. To reduce the high cost of memory and computation during search, tree-based acceleration structures of memory [25] and randomized algorithms [3] are proposed. Moreover, inpainting results are improved by matching local features like image gradients [2, 5] and offset statistics of similar patches [11]. Recently, image inpainting systems based on deep learning have been proposed to directly predict pixel values inside masks. A significant advantage of these models is the ability to learn adaptive image features for different semantics. Thus they can synthesize more visually plausible contents, especially for images like faces [22, 47], objects [29] and natural scenes [15, 49]. Among all these methods, Iizuka et al. [15] propose a fully convolutional image inpainting network with both global and local consistency to handle high-resolution images on a variety of datasets [18, 32, 53]. This approach, however, still heavily relies on Poisson image blending with traditional patch-based inpainting results [11]. Yu et al. [49] propose an end-to-end image inpainting model by adopting stacked generative networks to further ensure the color and texture consistency of generated regions with surroundings. Moreover, for capturing long-range spatial dependencies, the contextual attention module [49] is proposed and integrated into networks to explicitly borrow information from distant spatial locations. However, this approach is mainly trained on large rectangular masks and does not generalize well on free-form masks. To better handle irregular masks, partial convolution [23] is proposed, where the convolution is masked and re-normalized to utilize valid pixels only. It is then followed by a rule-based mask-update step to recompute new masks layer by layer.

2.2. Guided Image Inpainting and Synthesis


To improve image inpainting, user guidance is explored including dots or lines [1, 3, 7, 40], structures [13], transformation or distortion information [14, 30] and image exemplars [4, 10, 20, 43, 51]. Notably, Hays and Efros [10] first utilize millions of photographs as a database to search for an example image which is most similar to the input, and then complete the image by cutting and pasting the corresponding regions from the matched image.


Recent advances in conditional generative networks empower user-guided image processing, synthesis and manipulation learned from large-scale datasets. Here we selectively review several related works. Zhang et al. [50] propose colorization networks which can take user guidance as additional inputs. Wang et al. [42] propose to synthesize high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks. The Scribbler [34] explores a deep generative network conditioned on sketched boundaries and sparse color strokes to synthesize cars, bedrooms, or faces.

2.3. Feature-wise Gating


Feature-wise gating has been explored widely in vision [12, 28, 39, 41], language [6], speech [27] and many other tasks. For example, Highway Networks [39] utilize feature gating to ease gradient-based training of very deep networks. Squeeze-and-Excitation Networks re-calibrate feature responses by explicitly multiplying each channel with learned sigmoidal gating values. WaveNets [27] achieve better results by employing a special feature gating $y=\tanh(w_{1}x)\cdot\operatorname{sigmoid}(w_{2}x)$ for modeling audio signals.

3. Approach


In this section, we describe our approach from bottom to top. We first introduce the details of the Gated Convolution, SN-PatchGAN, and then present the overview of inpainting network in Figure 3 and our extension to allow optional user guidance.


3.1. Gated Convolution


We first explain why vanilla convolutions used in [15, 49] are ill-fitted for the task of free-form image inpainting. We consider a convolutional layer in which a bank of filters is applied to the input feature map to produce the output. Assuming the input has $C$ channels, each pixel located at $(y,x)$ in the $C^{\prime}$-channel output map is computed as

$$
O_{y,x}=\sum_{i=-k_{h}^{\prime}}^{k_{h}^{\prime}}\sum_{j=-k_{w}^{\prime}}^{k_{w}^{\prime}}W_{k_{h}^{\prime}+i,k_{w}^{\prime}+j}\cdot I_{y+i,x+j},
$$


where $x, y$ represent the x-axis and y-axis of the output map, $k_{h}$ and $k_{w}$ are the kernel sizes (e.g. $3\times3$), $k_{h}^{\prime}=\frac{k_{h}-1}{2}$, $k_{w}^{\prime}=\frac{k_{w}-1}{2}$, $W\in\mathbb{R}^{k_{h}\times k_{w}\times C^{\prime}\times C}$ represents the convolutional filters, and $I_{y+i,x+j}\in\mathbb{R}^{C}$ and $O_{y,x}\in\mathbb{R}^{C^{\prime}}$ are inputs and outputs. For simplicity, the bias in convolution is ignored.

The equation shows that for all spatial locations $(y,x)$, the same filters are applied to produce the output in vanilla convolutional layers. This makes sense for tasks such as image classification and object detection, where all pixels of the input image are valid, to extract local features in a sliding-window fashion. However, for image inpainting, the input is composed of both regions with valid pixels/features outside holes and invalid pixels/features (in shallow layers) or synthesized pixels/features (in deep layers) in masked regions. This causes ambiguity during training and leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses during testing, as reported in [23].
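The vanilla convolution equation above can be evaluated directly, which makes the point of this paragraph concrete: the same filter bank is applied at every location, whether or not the pixels under it are valid. This is a minimal NumPy sketch, not the paper's implementation; the function name `vanilla_conv2d`, the zero padding and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def vanilla_conv2d(I, W):
    """Evaluate O[y, x] = sum_{i,j} W[kh'+i, kw'+j] . I[y+i, x+j].

    I: input of shape (H, W_in, C); W: filters of shape (kh, kw, C_out, C).
    Zero padding (an assumption) keeps the output at the input resolution.
    The bias is ignored, as in the text.
    """
    kh, kw, c_out, _ = W.shape
    ph, pw = (kh - 1) // 2, (kw - 1) // 2            # k_h', k_w'
    H, Wd, _ = I.shape
    padded = np.pad(I, ((ph, ph), (pw, pw), (0, 0)))
    O = np.zeros((H, Wd, c_out))
    for y in range(H):
        for x in range(Wd):
            for i in range(-ph, ph + 1):
                for j in range(-pw, pw + 1):
                    # The same W is used at every (y, x), valid or not.
                    O[y, x] += W[ph + i, pw + j] @ padded[y + ph + i, x + pw + j]
    return O

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))                 # toy 6x6 RGB input
filters = rng.standard_normal((3, 3, 4, 3))   # 3x3 kernel, 3 -> 4 channels
out = vanilla_conv2d(image, filters)
```

Nothing in the loop depends on a mask, so hole pixels and valid pixels contribute to the output in exactly the same way.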

Recently partial convolution is proposed [23], which adopts a masking and re-normalization step to make the convolution dependent only on valid pixels as

$$
O_{y,x}=\begin{cases}\sum\sum W_{k_{h}^{\prime}+i,k_{w}^{\prime}+j}\cdot\dfrac{I_{y+i,x+j}\odot M_{y+i,x+j}}{\mathrm{sum}(\mathbf{M})}, & \text{if } \mathrm{sum}(\mathbf{M})>0\\ 0, & \text{otherwise}\end{cases}
$$

in which $M$ is the corresponding binary mask, 1 represents that the pixel at location $(y,x)$ is valid, 0 represents that the pixel is invalid, and $\odot$ denotes element-wise multiplication. After each partial convolution operation, the mask-update step is required to propagate the new $M$ with the following rule: $m_{y,x}^{\prime}=1$, iff $\mathrm{sum}(\mathbf{M})>0$.

Figure 2: Illustration of partial convolution (left) and gated convolution (right).

Partial convolution [23] improves the quality of inpainting on irregular masks, but it still has remaining issues: (1) It heuristically classifies all spatial locations to be either valid or invalid. The mask in the next layer will be set to ones no matter how many pixels are covered by the filter range in the previous layer (for example, 1 valid pixel and 9 valid pixels are treated the same when updating the current mask). (2) It is incompatible with additional user inputs. We aim at a user-guided image inpainting system where users can optionally provide a sparse sketch inside the mask as conditional channels. In this situation, should these pixel locations be considered as valid or invalid? How should the mask be updated for the next layer? (3) For partial convolution the invalid pixels will progressively disappear in deep layers, gradually converting all mask values to ones. However, our study shows that if we allow the network to learn the optimal mask automatically, the network assigns soft mask values to every spatial location even in deep layers. (4) All channels in each layer share the same mask, which limits the flexibility. Essentially, partial convolution can be viewed as un-learnable single-channel feature hard-gating.

We propose gated convolution for image inpainting networks, as shown in Figure 2. Instead of a hard-gating mask updated with rules, gated convolutions learn a soft mask automatically from data. It is formulated as:

$$
\begin{aligned}
\mathrm{Gating}_{y,x}&=\sum\sum W_{g}\cdot I,\\
\mathrm{Feature}_{y,x}&=\sum\sum W_{f}\cdot I,\\
O_{y,x}&=\phi(\mathrm{Feature}_{y,x})\odot\sigma(\mathrm{Gating}_{y,x}),
\end{aligned}
$$

where $\sigma$ is the sigmoid function, thus the output gating values are between zeros and ones. $\phi$ can be any activation function (for example, ReLU, ELU and LeakyReLU). $W_{g}$ and $W_{f}$ are two different convolutional filters.
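To contrast the two formulations, the sketch below implements a partial convolution step (re-normalization by the number of valid pixels plus the rule-based mask update) and a gated convolution step side by side in NumPy. It is an illustration under stated assumptions rather than the authors' code: the helper names, the zero padding, ELU as $\phi$, and feeding the mask as an extra input channel to the gated layer are all choices made here.

```python
import numpy as np

def conv2d(I, W):
    """Zero-padded convolution. I: (H, W, C); W: (kh, kw, C_out, C)."""
    kh, kw, c_out, _ = W.shape
    ph, pw = (kh - 1) // 2, (kw - 1) // 2
    H, Wd, _ = I.shape
    P = np.pad(I, ((ph, ph), (pw, pw), (0, 0)))
    O = np.zeros((H, Wd, c_out))
    for i in range(kh):
        for j in range(kw):
            O += P[i:i + H, j:j + Wd] @ W[i, j].T
    return O

def partial_conv2d(I, M, W):
    """M: binary mask (H, W, 1), 1 = valid. Returns (output, updated mask)."""
    kh, kw = W.shape[:2]
    valid_count = conv2d(M, np.ones((kh, kw, 1, 1)))   # sum(M) per window
    has_valid = valid_count > 0
    out = conv2d(I * M, W) / np.where(has_valid, valid_count, 1.0)
    out = np.where(has_valid, out, 0.0)
    new_mask = has_valid.astype(I.dtype)               # m' = 1 iff sum(M) > 0
    return out, new_mask

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z):
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def gated_conv2d(I, W_f, W_g, phi=elu):
    """O = phi(Feature) * sigmoid(Gating), with two separate filter banks."""
    return phi(conv2d(I, W_f)) * sigmoid(conv2d(I, W_g))

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.ones((8, 8, 1))
mask[2:7, 2:7] = 0.0                                   # a 5x5 hole

pc_out, pc_mask = partial_conv2d(image, mask, rng.standard_normal((3, 3, 4, 3)))

x = np.concatenate([image * mask, mask], axis=-1)      # mask as a 4th channel
gc_out = gated_conv2d(x, rng.standard_normal((3, 3, 4, 4)),
                      rng.standard_normal((3, 3, 4, 4)))
```

After one 3x3 partial convolution the 5x5 hole shrinks to the 3x3 block whose windows see no valid pixel, and every other location is marked fully valid regardless of how many valid pixels it saw. The gated layer instead produces a separate soft gate in (0, 1) for every channel and location.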

The proposed gated convolution learns a dynamic feature selection mechanism for each channel and each spatial location. Interestingly, visualization of intermediate gating values shows that it learns to select the feature not only according to background, mask and sketch, but also considering semantic segmentation in some channels. Even in deep layers, gated convolution learns to highlight the masked regions and sketch information in separate channels to better generate inpainting results.

3.2. Spectral-Normalized Markovian Discriminator (SN-PatchGAN)


For previous inpainting networks which try to fill a single rectangular hole, an additional local GAN is used on the masked rectangular region to improve results [15, 49]. However, we consider the task of free-form image inpainting where there may be multiple holes with any shape at any location. Motivated by global and local GANs [15], Markovian GANs [16, 21], perceptual loss [17] and recent work on spectral-normalized GANs [24], we present a simple and effective GAN loss, SN-PatchGAN, for training free-form image inpainting networks.

A convolutional network is used as the discriminator, where the input consists of image, mask and guidance channels, and the output is a 3-D feature of shape $\mathbb{R}^{h\times w\times c}$ ($h$, $w$, $c$ representing the height, width and number of channels respectively). As shown in Figure 3, six strided convolutions with kernel size 5 and stride 2 are stacked to capture the feature statistics of Markovian patches [21]. We then directly apply GANs for each feature element in this feature map, formulating $h\times w\times c$ number of GANs focusing on different locations and different semantics (represented in different channels) of the input image. It is noteworthy that the receptive field of each neuron in the output map can cover the entire input image in our training setting, thus a global discriminator is not necessary.
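The receptive-field remark can be checked with a little arithmetic. The sketch below computes the output resolution and the receptive field of a stack of six convolutions with kernel size 5 and stride 2. The 256-pixel training resolution and the 'same'-style padding are assumptions made here; this section does not state them.

```python
def stack_stats(input_size, num_layers=6, kernel=5, stride=2):
    """Output size and receptive field of a stack of strided convolutions."""
    size, receptive_field, jump = input_size, 1, 1
    for _ in range(num_layers):
        size = -(-size // stride)                 # ceil division ('same' padding)
        receptive_field += (kernel - 1) * jump    # growth in input pixels
        jump *= stride                            # input pixels per output step
    return size, receptive_field

out_size, receptive_field = stack_stats(256)      # 256 is an assumed input size
print(out_size, receptive_field)                  # 4 253
```

Under these assumptions each output neuron sees a 253-pixel window, which is nearly the whole 256-pixel input, consistent with the statement that a separate global discriminator is unnecessary.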


Figure 3: Overview of our framework with gated convolution and SN-PatchGAN for free-form image inpainting.


We also adopt the recently proposed spectral normalization [24] to further stabilize the training of GANs. We use the default fast approximation algorithm of spectral normalization described in SN-GANs [24]. To discriminate if the input is real or fake, we also use the hinge loss as the objective function for the generator, $\mathcal{L}_{G}=-\mathbb{E}_{z\sim\mathbb{P}_{z}(z)}[D^{sn}(G(z))]$, and for the discriminator, $\mathcal{L}_{D^{sn}}=\mathbb{E}_{x\sim\mathbb{P}_{data}(x)}[\mathrm{ReLU}(\mathbb{1}-D^{sn}(x))]+\mathbb{E}_{z\sim\mathbb{P}_{z}(z)}[\mathrm{ReLU}(\mathbb{1}+D^{sn}(G(z)))]$, where $D^{sn}$ represents the spectral-normalized discriminator and $G$ is the image inpainting network that takes the incomplete image $z$.
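The fast approximation used in spectral normalization is commonly implemented as power iteration on the weight matrix, with the vector `u` carried over between training steps. The sketch below illustrates that idea on a plain 2-D matrix. It is an assumption-laden illustration: the function name, the matrix shape and the reshaping of a convolution kernel to 2-D are not taken from this paper.

```python
import numpy as np

def spectral_normalize(W, u, n_iters=1, eps=1e-12):
    """Divide a 2-D weight matrix by a power-iteration estimate of its
    largest singular value. Returns the normalized matrix and the updated u."""
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    sigma = u @ W @ v                 # estimate of the spectral norm
    return W / sigma, u

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 75))     # e.g. a 5x5x3 kernel bank flattened to 2-D
u = rng.standard_normal(16)

# One cheap iteration per step; u persists, so the estimate improves over time.
for _ in range(500):
    W_sn, u = spectral_normalize(W, u)
```

Once `u` has converged, the normalized matrix has a largest singular value close to 1, which bounds the Lipschitz constant of that layer.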

With SN-PatchGAN, our inpainting network trains faster and more stably than the baseline model [49]. Perceptual loss is not used since similar patch-level information is already encoded in SN-PatchGAN. Compared with PartialConv [23], in which 6 different loss terms and balancing hyper-parameters are used, our final objective function for the inpainting network is only composed of a pixel-wise $\ell_{1}$ reconstruction loss and the SN-PatchGAN loss, with the default loss balancing hyper-parameter as $1:1$.
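The two-term objective can be sketched as follows: hinge losses averaged over every element of the discriminator's $h\times w\times c$ output map, plus a pixel-wise $\ell_{1}$ term weighted 1:1. This is a minimal NumPy illustration; the function names and the use of a plain mean for the expectations and for the $\ell_{1}$ term are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss over all h*w*c outputs."""
    return np.mean(relu(1.0 - d_real)) + np.mean(relu(1.0 + d_fake))

def g_hinge_loss(d_fake):
    """Generator adversarial loss: -E[D(G(z))]."""
    return -np.mean(d_fake)

def generator_objective(pred, target, d_fake, l1_weight=1.0, gan_weight=1.0):
    """Pixel-wise l1 reconstruction plus adversarial term, weighted 1:1."""
    l1 = np.mean(np.abs(pred - target))
    return l1_weight * l1 + gan_weight * g_hinge_loss(d_fake)

# Toy discriminator maps of shape (h, w, c) = (1, 2, 1).
d_real = np.array([[[2.0], [0.5]]])
d_fake = np.array([[[-2.0], [0.0]]])
loss_d = d_hinge_loss(d_real, d_fake)        # 0.25 + 0.5 = 0.75

pred = np.full((4, 4, 3), 0.5)
target = np.full((4, 4, 3), 0.25)
loss_g = generator_objective(pred, target, d_fake)   # 0.25 + 1.0 = 1.25
```

Because the loss is averaged over the whole output map, each element acts as its own small GAN focused on one location and one channel.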

3.3. Inpainting Network Architecture


We customize a generative inpainting network [49] with the proposed gated convolution and SN-PatchGAN loss. Specifically, we adapt the full model architecture in [49] with both coarse and refinement networks. The full framework is summarized in Figure 3.


For coarse and refinement networks, we use a simple encoder-decoder network [49] instead of the U-Net used in PartialConv [23]. We found that skip connections in a U-Net [31] have no significant effect for non-narrow masks. This is mainly because for the center of a masked region, the inputs of these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder of that region. For hole boundaries, our encoder-decoder architecture equipped with gated convolution is sufficient to generate seamless results.

We replace all vanilla convolutions with gated convolutions [49]. One potential problem is that gated convolutions introduce additional parameters. To maintain the same efficiency as our baseline model [49], we slim the model width by $25\%$ and have not found an obvious performance drop either quantitatively or qualitatively. The inpainting network is trained end-to-end and can be tested on free-form holes at arbitrary locations. Our network is fully convolutional and supports different input resolutions in inference.
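A rough parameter count shows why slimming the width by 25% roughly offsets the extra gating filters. The sketch assumes a gated layer holds two full filter banks of the same size (feature and gating) and uses a hypothetical 64-channel layer; biases are ignored, and the actual layer widths are not given in this section.

```python
def conv_params(kernel, c_in, c_out):
    """Weights of one vanilla convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def gated_conv_params(kernel, c_in, c_out):
    """Assumption: one feature bank W_f plus one gating bank W_g."""
    return 2 * conv_params(kernel, c_in, c_out)

kernel, width = 3, 64                        # hypothetical baseline layer
baseline = conv_params(kernel, width, width)
slim = int(width * 0.75)                     # width reduced by 25%
gated = gated_conv_params(kernel, slim, slim)
ratio = gated / baseline
print(baseline, gated, ratio)                # 36864 41472 1.125
```

Under these assumptions the slimmed gated layer has about 12.5% more weights than the vanilla layer it replaces, since $2\times0.75^{2}=1.125$, so the cost stays in the same range as the baseline.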

3.4. Free-Form Mask Generation


The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks, in essence, should be (1) similar to masks drawn in real use-cases, (2) diverse to avoid over-fitting, (3) efficient in computation and storage, (4) controllable and flexible. Previous method [23] collects a fixed set of irregular masks from an occlusion estimation method between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase its diversity, the method does not meet other requirements listed above.


We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users typically behave as if using an eraser, brushing back and forth to mask out undesired regions. This behavior can be simply simulated with a randomized algorithm by drawing lines and rotating angles repeatedly. To ensure a smooth transition between two lines, we also draw a circle at the joint between them. More details are included in the supplementary materials due to space limits.
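The brush-stroke idea can be sketched in a few lines of NumPy. This is not the algorithm from the supplementary materials: the function name, all parameter values and the rasterization method (stamping discs along each segment, which also rounds the joints) are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def random_free_form_mask(height, width, rng, num_strokes=4, max_vertices=8,
                          max_length=40.0, max_width=12.0):
    """Return a (height, width) float32 mask with 1 marking hole pixels."""
    mask = np.zeros((height, width), dtype=np.float32)
    yy, xx = np.mgrid[0:height, 0:width]

    def stamp(cy, cx, radius):
        # Fill a disc; repeated stamps along a segment make a thick line.
        mask[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1.0

    for _ in range(num_strokes):
        y, x = rng.uniform(0, height), rng.uniform(0, width)
        angle = rng.uniform(0, 2 * np.pi)
        radius = rng.uniform(2.0, max_width) / 2.0
        for _ in range(rng.integers(1, max_vertices + 1)):
            angle += rng.uniform(-np.pi / 2, np.pi / 2)   # rotate the brush
            length = rng.uniform(5.0, max_length)
            ny = np.clip(y + length * np.sin(angle), 0, height - 1)
            nx = np.clip(x + length * np.cos(angle), 0, width - 1)
            # Discs at both endpoints round off the joints between lines.
            for t in np.linspace(0.0, 1.0, int(length) + 1):
                stamp(y + t * (ny - y), x + t * (nx - x), radius)
            y, x = ny, nx
    return mask

rng = np.random.default_rng(0)
mask = random_free_form_mask(64, 64, rng)
```

Since each mask is drawn from a handful of random numbers, masks can be generated on the fly and need not be stored, and the stroke count, length and width give direct control over hole size.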

3.5. Extension to User-Guided Image Inpainting


We use sketch as an example user guidance to extend our image inpainting network as a user-guided system. Sketch (or edge) is simple and intuitive for users to draw. We show both cases with faces and natural scenes. For faces, we extract landmarks and connect related landmarks. For natural scene images, we directly extract edge maps using the HED edge detector [44] and set all values above a certain threshold (i.e. 0.6) to ones. Sketch examples are shown in the supplementary materials due to space limit.


For training the user-guided image inpainting system, one would intuitively expect to need an additional constraint loss that forces the network to generate results conditioned on the user guidance. However, with the same combination of pixel-wise reconstruction loss and GAN loss (with conditional channels as input to the discriminator), we are able to learn a conditional generative network whose results respect the user guidance faithfully. We also tried an additional pixel-wise loss on HED [44] output features, with the raw image or the generated result as input, to enforce the constraint, but the inpainting quality was similar. The user-guided inpainting model is trained separately with a 5-channel input (R, G, B color channels, mask channel and sketch channel).

为了训练用户引导的图像修复系统,直观上我们需要额外的约束损失来强制网络生成基于用户引导条件的结果。然而,通过相同的像素级重建损失和 GAN 损失(将条件通道作为判别器的输入)的组合,我们能够学习到条件生成网络,其中生成的结果忠实地遵循用户引导。我们还尝试在 HED [44] 输出特征上使用额外的像素级损失,以原始图像或生成结果作为输入来强制约束,但修复质量相似。用户引导的修复模型是单独训练的,输入为 5 个通道(R、G、B 颜色通道、掩码通道和草图通道)。
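The 5-channel input described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `binarize_edges` and `build_guided_input` are hypothetical, the mask convention (1 = hole) is an assumption, and so is the choice to zero out hole pixels and keep the sketch only inside the hole.

```python
import numpy as np

def binarize_edges(edge_prob, threshold=0.6):
    """Turn an edge-probability map (e.g. an HED output) into a binary sketch."""
    return (edge_prob > threshold).astype(np.float32)

def build_guided_input(image, mask, sketch):
    """Stack R, G, B, mask and sketch into one H x W x 5 array.

    image:  H x W x 3, values in [0, 1]
    mask:   H x W, 1 inside the hole, 0 elsewhere (assumed convention)
    sketch: H x W, binary edge map
    """
    mask = mask[..., None]
    # Assumption: guidance is only provided inside the hole.
    sketch = sketch[..., None] * mask
    # Assumption: hole pixels are erased before being fed to the network.
    incomplete = image * (1.0 - mask)
    return np.concatenate([incomplete, mask, sketch], axis=-1)
```

The threshold default of 0.6 is the value quoted in the text; everything else is illustrative.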

4. Results

4. 结果

We evaluate the proposed free-form image inpainting system on Places2 [53] and CelebA-HQ faces [18]. Our model has 4.1M parameters in total and is trained with TensorFlow v1.8, CUDNN v7.0 and CUDA v9.0. At test time, it runs on average at 0.21 seconds per image on a single NVIDIA(R) Tesla(R) V100 GPU and 1.9 seconds per image on an Intel(R) Xeon(R) CPU @ 2.00 GHz for images of resolution $512\times512$, regardless of hole size.

我们在 Places2 [53] 和 CelebA-HQ 人脸数据集 [18] 上评估了提出的自由形式图像修复系统。我们的模型共有 4.1M 参数,使用 TensorFlow v1.8、CUDNN v7.0 和 CUDA v9.0 进行训练。在测试中,对于分辨率为 $512\times512$ 的图像,无论空洞大小如何,它在单个 NVIDIA(R) Tesla(R) V100 GPU 上每张图像平均运行时间为 0.21 秒,在 Intel(R) Xeon(R) CPU $\ @\ 2.00\mathrm{GHz}$ 上为 1.9 秒。

4.1. Quantitative Results

4.1. 定量结果

As mentioned in [49], image inpainting lacks good quantitative evaluation metrics. Nevertheless, we report in Table 2 our evaluation results in terms of mean $\ell_{1}$ error and mean $\ell_{2}$ error on validation images of Places2, with both center rectangle masks and free-form masks. As shown in the table, learning-based methods perform better than PatchMatch [3] in terms of mean $\ell_{1}$ and $\ell_{2}$ errors. Moreover, partial convolution implemented within the same framework obtains worse performance, which may be due to its un-learnable rule-based gating.

如 [49] 所述,图像修复缺乏良好的定量评估指标。尽管如此,我们在表 2 中报告了在 Places2 验证图像上使用中心矩形掩码和自由形式掩码的平均 $\ell_{1}$ 误差和平均 $\ell_{2}$ 误差的评估结果。如表所示,基于学习的方法在平均 $\ell_{1}$ 和 $\ell_{2}$ 误差方面优于 PatchMatch [3]。此外,在相同框架下实现的部分卷积性能较差,这可能是由于基于规则的不可学习门控机制所致。
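As a rough illustration of the two metrics reported in Table 2, the sketch below computes mean $\ell_{1}$ and mean $\ell_{2}$ errors as percentages. The normalization is an assumption (pixel values in [0, 1], averaged over all pixels and channels), since the text does not spell it out, and the function name `mean_l1_l2_error` is hypothetical.

```python
import numpy as np

def mean_l1_l2_error(pred, target):
    """Return (mean l1 error, mean l2 error) in percent.

    Assumes both images hold values in [0, 1]; the errors are averaged
    over every pixel and channel, then scaled by 100.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    diff = pred - target
    l1 = 100.0 * np.mean(np.abs(diff))   # mean absolute error
    l2 = 100.0 * np.mean(diff ** 2)      # mean squared error
    return l1, l2
```

With this convention, a prediction that is off by 0.1 at every pixel gives a mean $\ell_{1}$ error of 10% and a mean $\ell_{2}$ error of 1%.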

Table 2: Results of mean $\ell_{1}$ error and mean $\ell_{2}$ error on validation images of Places2 with both rectangle masks and free-form masks. Both PartialConv* and ours are trained on the same random combination of rectangle and free-form masks. No edge guidance is utilized in training/inference to ensure fair comparison. * denotes our implementation within the same framework due to unavailability of official implementation and models.

表 2: 在 Places2 验证集上使用矩形掩码和自由形状掩码的平均 $\ell_{1}$ 误差和平均 $\ell_{2}$ 误差结果。PartialConv* 和我们的方法均在相同的随机组合矩形和自由形状掩码上进行训练。为确保公平比较,训练/推理过程中未使用边缘引导。* 表示由于官方实现和模型不可用,我们在相同框架内的实现。

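| Method | Rectangular mask $\ell_{1}$ error | Rectangular mask $\ell_{2}$ error | Free-form mask $\ell_{1}$ error | Free-form mask $\ell_{2}$ error |
| --- | --- | --- | --- | --- |
| PatchMatch [3] | 16.1% | 3.9% | 11.3% | 2.4% |
| Global&Local [15] | 9.3% | 2.2% | 21.6% | 7.1% |
| ContextAttention [49] | 8.6% | 2.1% | 17.2% | 4.7% |
| PartialConv* [23] | 9.8% | 2.3% | 10.4% | 1.9% |
| Ours | 8.6% | 2.0% | 9.1% | 1.6% |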

4.2. Qualitative Comparisons

4.2. 定性比较

Next, we compare our model with previous state-of-the-art methods [15, 23, 49]. Figure 4 and Figure 5 show automatic and user-guided inpainting results on several representative images. For automatic image inpainting, the result of PartialConv is obtained from its online demo. For user-guided image inpainting, we train PartialConv* with exactly the same settings as GatedConv, except for the convolution type (sketch regions are treated as valid pixels for rule-based mask updating). For all learning-based methods, no post-processing step is performed to ensure fairness.

接下来,我们将我们的模型与之前的最先进方法 [15, 23, 49] 进行比较。图 4 和图 5 展示了若干代表性图像上的自动和用户引导的修复结果。对于自动图像修复,PartialConv 的结果是从其在线演示中获得的。对于用户引导的图像修复,我们使用与 GatedConv 完全相同的设置训练了 PartialConv*,唯一的区别是卷积类型(草图区域被视为有效像素以进行基于规则的掩码更新)。为了确保公平性,所有基于学习的方法均未进行后处理步骤。


Figure 4: Example cases of qualitative comparison on the Places2 and CelebA-HQ validation sets. More comparisons are included in supplementary materials due to space limit. Best viewed (e.g., shadows in uniform region) with zoom-in.

图 4: Places2 和 CelebA-HQ 验证集上的定性对比示例。由于篇幅限制,更多对比见补充材料。建议放大查看(例如,均匀区域中的阴影)。


Figure 5: Comparison of user-guided image inpainting.

图 5: 用户引导的图像修复对比。

As reported in [15], simple uniform regions (last row of Figure 4 and Figure 5) are hard cases for learning-based image inpainting networks. Previous methods with vanilla convolution have obvious visual artifacts and edge responses in and around holes. PartialConv produces better results but still exhibits observable color discrepancy. Our method based on gated convolution obtains more visually pleasing results without noticeable color inconsistency. In Figure 5, given a sparse sketch, our method produces realistic results with seamless boundary transitions.

正如[15]中所报道的,简单的均匀区域(图4和图5的最后一行)对于基于学习的图像修复网络来说是困难的案例。使用普通卷积的先前方法在孔洞内部/周围有明显的视觉伪影和边缘响应。Partial Conv产生了更好的结果,但仍然表现出可观察到的颜色差异。我们基于门控卷积的方法获得了更视觉上令人愉悦的结果,没有明显的颜色不一致。在图5中,给定稀疏的草图,我们的方法产生了具有无缝边界过渡的逼真结果。

4.3. Object Removal and Creative Editing

4.3. 目标移除与创意编辑

Moreover, we study two important real use cases of image inpainting: object removal and creative editing.

此外,我们研究了图像修复的两个重要实际用例:物体移除和创意编辑。

Object Removal. In the first example, we try to remove the distracting person in Figure 6. We compare our method with commercial product Photoshop (based on PatchMatch [3]) and the previous state-of-the-art generative inpainting network (official released model trained on Places2) [49]. The results show that Content-Aware Fill function from Photoshop incorrectly copies half of face from left. This example reflects the fact that traditional methods without learning from large-scale data ignore the semantics of an image, which leads to critical failures in non-stationary/complicated scenes. For learning-based methods with vanilla convolution [49], artifacts exist near hole boundaries.

物体移除。在第一个例子中,我们尝试移除图 6 中干扰的人物。我们将我们的方法与商用产品 Photoshop(基于 PatchMatch [3])以及之前最先进的生成修复网络(官方发布的模型,在 Places2 上训练)[49] 进行了比较。结果显示,Photoshop 的 Content-Aware Fill 功能错误地复制了左侧的半张脸。这个例子反映了传统方法在没有从大规模数据中学习的情况下会忽略图像的语义,从而导致在非静态/复杂场景中的严重失败。对于基于普通卷积的学习方法 [49],空洞边界附近存在伪影。

Creative Editing. Next we study the case where the user interacts with the inpainting system to produce more desirable results. Examples on both faces and natural scenes are shown in Figure 7. Our inpainting results nicely follow the user sketch, which is useful for creatively editing image layouts, faces and many others.

创意编辑。接下来我们研究用户与修复系统交互以产生更理想结果的情况。人脸和自然场景的示例如图 7 所示。我们的修复结果很好地遵循了用户草图,这对于创造性地编辑图像布局、人脸等非常有用。


Figure 6: Object removal case study with comparison.

图 6: 物体移除案例研究对比。


Figure 7: Examples of user-guided inpainting/editing of faces and natural scenes.

图 7: 用户引导的人脸和自然场景修复/编辑示例。

4.4. User Study

4.4. 用户研究

We performed a user study by first collecting 30 test images (with holes but no sketches) from the Places2 validation dataset without knowing their inpainting results on each model. We then computed results of the following four methods for comparison: (1) ground truth, (2) our model, (3) PartialConv [23] re-implemented within the same framework, and (4) official PartialConv [23]. We did two types of user study. (A) We evaluate each method individually to rate the naturalness/inpainting quality of results (from 1 to 10, the higher the better), and (B) we compare our model and the official PartialConv model to evaluate which method produces better results. 104 users finished the user study, with the results shown as follows.

我们进行了一项用户研究,首先从Places2验证数据集中收集了30张测试图像(有孔洞但没有草图),并且事先不知道每个模型在这些图像上的修复结果。然后我们计算了以下四种方法的修复结果进行比较:(1) 真实图像,(2) 我们的模型,(3) 在同一框架下重新实现的 PartialConv [23],以及(4) 官方的 PartialConv [23]。我们进行了两种类型的用户研究。(A) 我们单独评估每种方法,对结果的自然性/修复质量进行评分(从1到10,分数越高越好),(B) 我们比较了我们的模型和官方的 PartialConv 模型,评估哪种方法产生更好的结果。104名用户完成了用户研究,结果如下所示。

(B) Pairwise comparison of (2) our model vs. (4) official PartialConv model: 79.4% vs. 20.6% (the higher the better).

(B) 对比我们的模型与官方 PartialConv 模型的配对比较:79.4% vs. 20.6% (数值越高越好)。

Figure 8: Ablation study of SN-PatchGAN. From left to right, we show original image, masked input, results with one global GAN and our results with SN-PatchGAN.

图 8: SN-PatchGAN 的消融研究。从左到右,我们展示了原始图像、带掩码的输入、使用单一全局 GAN 的结果以及使用 SN-PatchGAN 的结果。

4.5. Ablation Study of SN-PatchGAN

4.5. SN-PatchGAN 消融研究

SN-PatchGAN is proposed because free-form masks may appear anywhere in images with any shape. Previously introduced global and local GANs [15] designed for a single rectangular mask are not applicable. We provide ablation experiments of SN-PatchGAN in the context of image inpainting in Figure 8. SN-PatchGAN leads to significantly better results, which verifies that (1) one vanilla global discriminator has worse performance [15], and (2) GAN with spectral normalization has better stability and performance [24]. Although introducing more loss functions may help in training free-form image inpainting networks [23], we demonstrate that a simple combination of SN-PatchGAN loss and pixel-wise $\ell_{1}$ loss, with the default loss balancing hyper-parameter set to 1:1, produces photo-realistic inpainting results. More comparison examples are shown in the supplementary materials.

SN-PatchGAN 的提出是因为自由形状的掩码可能出现在图像的任意位置且具有任意形状。之前引入的针对单一矩形掩码设计的全局和局部 GAN [15] 并不适用。我们在图 8 中提供了 SN-PatchGAN 在图像修复背景下的消融实验。SN-PatchGAN 带来了显著更好的结果,验证了 (1) 单一的全局判别器性能较差 [15],以及 (2) 使用谱归一化的 GAN 具有更好的稳定性和性能 [24]。尽管引入更多的损失函数可能有助于训练自由形状的图像修复网络 [23],但我们证明了 SN-PatchGAN 损失和逐像素 $\ell_{1}$ 损失的简单组合(默认损失平衡超参数为 1:1)能够产生逼真的修复结果。更多对比示例见补充材料。
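The 1:1 combination of pixel-wise $\ell_{1}$ loss and SN-PatchGAN loss can be sketched as below. The hinge form of the GAN objective is an assumption here, since this section does not restate the loss, and the function names are hypothetical. The discriminator outputs `d_real` and `d_fake` are treated as dense score maps, so that each element acts as a GAN on one local patch.

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss averaged over every element of the patch score maps (assumed form)."""
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term

def generator_loss(pred, target, d_fake, l1_weight=1.0, gan_weight=1.0):
    """Pixel-wise l1 loss plus the generator GAN term, balanced 1:1 by default."""
    l1 = np.mean(np.abs(pred - target))
    gan = -np.mean(d_fake)  # the generator wants high scores on its outputs
    return l1_weight * l1 + gan_weight * gan
```

Because the loss is a plain mean over the score map, it needs no knowledge of where the hole is, which is why this formulation suits masks of arbitrary shape and position.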

5. Conclusions

5. 结论

We presented a novel free-form image inpainting system based on an end-to-end generative network with gated convolution, trained with pixel-wise $\ell_{1}$ loss and SN-PatchGAN. We demonstrated that gated convolutions significantly improve inpainting results with free-form masks and user guidance input. We showed user sketch as an exemplar guidance to help users quickly remove distracting objects, modify image layouts, clear watermarks, edit faces and interactively create novel objects in photos. Quantitative results, qualitative comparisons and user studies demonstrated the superiority of our proposed free-form image inpainting system.

我们提出了一种基于端到端生成网络和门控卷积的新型自由形式图像修复系统,该系统使用像素级 $\ell_{1}$ 损失和 SN-PatchGAN 进行训练。我们证明了门控卷积在自由形式掩码和用户引导输入下显著改善了修复结果。我们展示了用户草图作为一种示例引导,帮助用户快速移除分散注意力的物体、修改图像布局、清除水印、编辑面部以及交互式地在照片中创建新物体。定量结果、定性比较和用户研究证明了我们提出的自由形式图像修复系统的优越性。

References

参考文献

In this supplementary material, we first provide details of our free-form mask generation algorithm in Section A and sketch generation algorithm in Section B. We then study the effects of sketch input in Section C with an example where the input image uses the same mask but different sketches. Next we provide visualization and interpretation of learned gating values in Section D. We show an additional ablation study of our proposed SN-PatchGAN in Section E. We show more comparison results of Global&Local [15], ContextAttention [49], PartialConv [23] (both our implementation within the same framework and the official model via online demo) and our GatedConv in Section F. We finally show more inpainting results of our system with support of free-form masks and user guidance on both natural scenes and faces in Section G. Moreover, a recorded real-time video demo is available at: https://youtu.be/uZkEi9Y2dj4.

在本补充材料中,我们首先在A节中提供了自由形式掩码生成算法的详细信息,并在B节中介绍了草图生成算法。接着,我们在C节中通过一个示例研究了草图输入的影响,该示例中输入图像使用相同的掩码但不同的草图。然后在D节中,我们提供了学习到的门控值的可视化和解释。在E节中,我们展示了所提出的SN-PatchGAN的额外消融研究。在F节中,我们展示了与Global&Local [15]、Context Attention [49]、Partial Conv [23](包括我们在相同框架内的实现和通过在线演示3的官方模型)以及我们的GatedConv的更多对比结果。最后在G节中,我们展示了在自然场景和人脸上使用自由形式掩码和用户引导的更多修复结果。此外,实时的视频演示可在以下网址观看:https://youtu.be/uZkEi9Y2dj4。

A. Free-Form Mask Generation

A. 自由形式掩码生成

Figure 9: Sampled free-form masks with previous work [23] (1st row) and our automatic algorithm (2nd row).

图 9: 使用之前工作 [23] 采样的自由形式掩码 (第 1 行) 和我们的自动算法 (第 2 行)。

The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks, in essence, should be (1) similar in shape to holes drawn in real use-cases, (2) diverse to avoid over-fitting, (3) efficient in computation and storage, (4) controllable and flexible. Previous method [23] collects a fixed set of irregular masks from an occlusion estimation method between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase its diversity, the method does not meet other requirements listed above.

自动生成自由形态掩码的算法既重要又复杂。采样的掩码本质上应满足以下条件:(1) 形状与实际使用场景中绘制的孔洞相似,(2) 具有多样性以避免过拟合,(3) 计算和存储高效,(4) 可控且灵活。先前的方法 [23] 从视频连续帧之间的遮挡估计方法中收集了一组固定的不规则掩码。尽管通过随机膨胀、旋转和裁剪增加了其多样性,但该方法未能满足上述其他要求。

We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users behave as if using an eraser, brushing back and forth to mask out undesired regions. This behavior can be simulated with a randomized algorithm that repeatedly draws lines and rotates angles. To ensure a smooth transition between two consecutive lines, we also draw a circle at the joint between them.

我们引入了一种简单的算法,在训练过程中自动实时生成随机的自由形式掩码。对于填补空洞的任务,用户的行为就像使用橡皮擦来回刷动以掩盖不需要的区域。这种行为可以通过随机算法简单地模拟,通过重复绘制线条和旋转角度来实现。为了确保两条线条的平滑连接,我们还在两条线条的连接处绘制一个圆。

Algorithm 1: Algorithm for sampling free-form training masks. maxVertex, maxLength, maxBrushWidth, maxAngle are four hyper-parameters to control the mask generation.

算法 1:自由形式训练掩码采样算法。maxVertex、maxLength、maxBrushWidth、maxAngle 是控制掩码生成的四个超参数。

mask = random.flipLeftRight(mask)
mask = random.flipTopBottom(mask)

We use maxVertex, maxLength, maxBrushWidth and maxAngle as four hyper-parameters to provide a large variety of sampled masks. Moreover, our algorithm generates masks on-the-fly with little computational overhead, and no storage is required. In practice, the computation of free-form masks on CPU can easily be hidden behind network training on GPU in modern deep learning frameworks. The overall mask generation algorithm is illustrated in Algorithm 1. Additionally, we can sample multiple strokes in a single image to mask multiple regions, and add regular masks (e.g. rectangular) on top of sampled free-form masks. Example masks compared with the previous method [23] are shown in Figure 9.

我们使用 maxVertex、maxLength、maxBrushWidth 和 maxAngle 作为四个超参数来提供多样化的采样掩码。此外,我们的算法实时生成掩码,计算开销小且无需存储。在实际应用中,在现代深度学习框架中,CPU 上的自由形态掩码计算可以轻松隐藏在 GPU 上的网络训练之后。整体掩码生成算法如算法 1 所示。此外,我们可以在单张图像中采样多个笔触以掩码多个区域,并在采样的自由形态掩码之上添加常规掩码(例如矩形)。与之前方法 [23] 相比的示例掩码如图 9 所示。
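The stroke-sampling procedure described in this section (draw a line, rotate, draw a circle at the joint, then randomly flip) might look like the following. This is a sketch under stated assumptions rather than the paper's Algorithm 1: the default hyper-parameter values, the minimum brush width of 4 pixels, the 50% flip probability and the helper names `draw_disc`, `draw_line` and `sample_free_form_mask` are all illustrative choices, and thick lines are approximated by stamping discs with NumPy instead of using a drawing library.

```python
import math
import numpy as np

def draw_disc(mask, cx, cy, radius):
    """Set every pixel within `radius` of (cx, cy) to 1 (in place)."""
    h, w = mask.shape
    ys, xs = np.ogrid[:h, :w]
    mask[(xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2] = 1.0

def draw_line(mask, x0, y0, x1, y1, width):
    """Approximate a thick line by stamping discs along the segment."""
    steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
    for t in np.linspace(0.0, 1.0, steps + 1):
        draw_disc(mask, x0 + t * (x1 - x0), y0 + t * (y1 - y0), width / 2.0)

def sample_free_form_mask(height, width, rng, max_vertex=8, max_length=40,
                          max_brush_width=12, max_angle=2 * math.pi):
    """Sample one brush stroke as a binary mask (1 = hole)."""
    mask = np.zeros((height, width), dtype=np.float32)
    num_vertex = int(rng.integers(1, max_vertex + 1))
    x = float(rng.uniform(0, width))
    y = float(rng.uniform(0, height))
    brush_width = float(rng.uniform(4, max_brush_width))
    for i in range(num_vertex):
        angle = float(rng.uniform(0, max_angle))
        if i % 2 == 0:
            # Reverse direction on alternate segments to mimic
            # brushing back and forth.
            angle = 2 * math.pi - angle
        length = float(rng.uniform(0, max_length))
        nx = x + length * math.sin(angle)
        ny = y + length * math.cos(angle)
        draw_line(mask, x, y, nx, ny, brush_width)
        draw_disc(mask, nx, ny, brush_width / 2.0)  # circle at the joint
        x, y = nx, ny
    # Random flips, matching the two flip steps of Algorithm 1.
    if rng.random() < 0.5:
        mask = mask[:, ::-1]
    if rng.random() < 0.5:
        mask = mask[::-1, :]
    return np.ascontiguousarray(mask)
```

A call such as `sample_free_form_mask(256, 256, np.random.default_rng(0))` yields one stroke. Several strokes, or a rectangle, can be combined with `np.maximum` to mask multiple regions, in the spirit of the extension mentioned above.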

B. Sketch Generation

B. 草图生成

Figure 10: For face dataset (on the left), we directly detect landmarks of faces and connect related nearby landmarks as training sketch, which is extremely robust and useful for editing faces. We use HED [44] model with threshold 0.6 to extract binary sketch for natural scenes (on the right).

图 10: 对于人脸数据集(左侧),我们直接检测人脸的关键点并将附近相关的关键点连接起来作为训练草图,这种方法非常鲁棒且对人脸编辑非常有用。我们使用 HED [44] 模型,阈值为 0.6,来提取自然场景的二进制草图(右侧)。


Figure 11: Image inpainting examples where the input image uses same mask but different sketches.

图 11: 图像修复示例,输入图像使用相同的掩码但不同的草图。

We use sketch as an example user guidance to extend our image inpainting network as a user-guided system. We show both cases on faces and natural scenes. For faces, we extract landmarks and connect related landmarks. For natural scene images, we directly extract edge maps using the HED [44] edge detector and set all values above a certain threshold (i.e. 0.6) to ones. Sketch examples are shown in Figure 10. Alternative methods to generate better sketches or other user guidance should also work well with our user-guided image inpainting system.

我们以草图为例,将图像修复网络扩展为用户引导系统。我们展示了人脸和自然场景的两种情况。对于人脸,我们提取关键点并连接相关关键点。对于自然场景图像,我们直接使用 HED [44] 边缘检测器提取边缘图,并将所有高于某个阈值(即 0.6)的值设置为 1。草图示例如图 10 所示。生成更好草图或其他用户引导的替代方法也适用于我们的用户引导图像修复系统。

C. The Effects of Sketch Input

C. 草图输入的效果

As shown in Section 4.3, our inpainting network can nicely follow the user sketch, which is useful for creative editing of images. We show in Figure 11 an additional comparison case where the input image uses the same mask but different sketches.

如第4.3节所示,我们的修复网络能够很好地遵循用户绘制的草图,这对于图像的创意编辑非常有用。我们在图11中展示了一个额外的对比案例,其中输入图像使用了相同的掩码但不同的草图。

D. Visualization and Interpretation

D. 可视化与解释

In Figure 12, we provide the visualization and interpretation of learned gating values in our inpainting network, and compare them with those of PartialConv [23].

在图12中,我们提供了修复网络中学习到的门控值的可视化和解释,并将其与 PartialConv [23] 进行了比较。

E. Ablation Study of SN-PatchGAN

E. SN-PatchGAN 消融研究

In this section, we present an ablation study to demonstrate the effectiveness of SN-PatchGAN. It is noteworthy that SN-PatchGAN is proposed because free-form masks may appear anywhere in images with any shape. Global and local GANs [15] designed for a single rectangular mask are not applicable. Previous work has already shown that (1) one vanilla global discriminator has much worse performance than two local and global discriminators [15], and (2) GAN with spectral normalization has better stability and performance. We also provide experiments of SN-PatchGAN in the context of image inpainting in Figure 13. Our image inpainting network trained on a global GAN without spectral normalization has significantly worse performance on all examples.

在本节中,我们通过消融研究来展示 SN-PatchGAN 的有效性。值得注意的是,SN-PatchGAN 的提出是因为自由形式的掩码可能以任何形状出现在图像的任何位置。为单一矩形掩码设计的全局和局部 GAN [15] 并不适用。先前的工作已经表明:(1) 一个普通的全局判别器比局部和全局判别器的组合表现要差得多 [15],(2) 使用谱归一化的 GAN 具有更好的稳定性和性能。我们还在图 13 中提供了 SN-PatchGAN 在图像修复背景下的实验。在没有使用谱归一化的全局 GAN 上训练的图像修复网络在所有示例中的表现明显更差。
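For readers unfamiliar with spectral normalization, the sketch below estimates the largest singular value of a weight tensor by power iteration and divides the weights by it. It is a generic illustration, not the authors' training code: the function name `spectral_normalize`, the iteration count and the kernel layout (output channels first) are assumptions. Practical implementations usually run a single iteration per training step and keep the vector `u` between steps.

```python
import numpy as np

def spectral_normalize(weight, n_iters=50, eps=1e-12, seed=0):
    """Return (weight / sigma, sigma), where sigma approximates the
    largest singular value of the flattened weight matrix."""
    # Flatten a conv kernel of shape (out_channels, ...) to a 2-D matrix.
    w = weight.reshape(weight.shape[0], -1)
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v) + eps
        u = w @ v
        u /= np.linalg.norm(u) + eps
    sigma = u @ w @ v
    return weight / sigma, sigma
```

After normalization the layer has spectral norm close to 1, which bounds how much it can amplify its input and is the property credited with stabilizing GAN training.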

F. More Comparison Results

F. 更多比较结果

In this section, we show more comparison results of learning-based image inpainting systems including Global&Local [15], ContextAttention [49], PartialConv [23] (both our implementation within the same framework and the official model via online demo) and our proposed method based on gated convolution. Note that the models for scenes and faces are trained separately, following all other methods [15, 23, 49]. None of the testing images are in the training set. Results are shown in Figure 14 and Figure 15. Compared with our baseline PartialConv, our inpainting system generates higher-quality inpainting results. Although PartialConv significantly improves over previous baselines like Global&Local [15] and ContextAttention [49], it still produces observable color inconsistency or shadows in both the official online demo and our reproduced version (best viewed with zoom-in on PDF to see color shadows and artifacts). Moreover, PartialConv fails especially in cases (1) when holes are large and involve transitions between two segments (e.g., a mask covering both sky and ground), and (2) when the image has a strong structure/contour/edge prior. The reason, discussed in the introduction of the main paper, is that the un-learnable rule-based hard gating heuristically categorizes all input locations as either invalid or valid, ignoring much other important information. Gated convolution is able to leverage this information by learning a soft gating end-to-end.

在本节中,我们展示了更多基于学习的图像修复系统的比较结果,包括 Global&Local [15]、Context Attention [49]、Partial-Conv [23](我们在相同框架内的实现和通过在线演示的官方模型)以及我们提出的基于门控卷积的方法。请注意,场景和面部模型是按照所有其他方法 [15, 23, 49] 分别训练的。所有测试图像均不在训练集中。结果如图 14 和图 15 所示。与我们的基线 Partial Conv 相比,我们的修复系统生成了更高质量的修复结果。尽管 PartialConv 显著改进了之前的基线,如 Global&Local [15] 和 Context Attention [49],但在官方在线演示和我们的复现版本中,它仍然产生了可观察到的颜色不一致或阴影(最好在 PDF 上放大查看颜色阴影和伪影)。此外,Partial Conv 在以下情况下尤其失败:(1) 当孔洞较大且涉及两个片段的过渡时(例如,覆盖天空和地面的掩码),以及 (2) 当图像具有强烈的结构/轮廓/边缘先验时。原因在主论文的引言中讨论过,即基于规则的硬门控启发式方法将所有输入位置分类为无效或有效,忽略了许多其他重要信息。门控卷积能够通过端到端学习软门控来利用这些信息。
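The contrast between rule-based hard gating and learned soft gating can be made concrete with a small sketch. Everything here is illustrative: the convolution is reduced to a 1×1 (per-pixel) case for brevity, ELU is an arbitrary choice of activation, the function names are hypothetical, and for the partial-convolution rule the mask convention is 1 = valid pixel.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gated_conv1x1(x, w_feature, w_gate):
    """Soft gating: activation(feature) * sigmoid(gating).

    x: H x W x C_in; w_feature, w_gate: C_in x C_out.
    Returns the output and the gates, which lie in (0, 1) and are
    learned per channel and per spatial location.
    """
    feature = x @ w_feature
    gate = sigmoid(x @ w_gate)
    return elu(feature) * gate, gate

def partial_conv_mask_update(mask, k=3):
    """Hard gating: a location becomes valid (1) if its k x k window
    contains at least one valid pixel, otherwise it stays invalid (0)."""
    h, w = mask.shape
    padded = np.pad(mask, k // 2)
    out = np.zeros_like(mask)
    for i in range(h):
        for j in range(w):
            out[i, j] = 1.0 if padded[i:i + k, j:j + k].sum() > 0 else 0.0
    return out
```

The hard rule has no parameters and produces one binary mask shared by all channels, whereas the gate in `gated_conv1x1` depends on trainable weights and can take a different value in every channel.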

G. More Inpainting Results of Our System

G. 我们系统的更多修复结果

In this section, we present more examples towards real use cases based on our proposed image inpainting system. We show inpainting results on both natural scenes and faces in Figure 16, Figure 17 and Figure 18. We show our inpainting system helps user quickly remove distracting objects, modify image layouts, edit faces and interactively create novel objects in images.

在本节中,我们将基于提出的图像修复系统展示更多真实用例的示例。我们在图 16、图 17 和图 18 中展示了自然场景和人脸的修复结果。我们的修复系统帮助用户快速移除干扰对象、修改图像布局、编辑人脸并交互式地在图像中创建新对象。

Figure 12: Comparisons of gated convolution and partial convolution with visualization and interpretation of learned gating values. We first show our inpainting network architecture based on [49] by replacing all convolutions with gated convolutions in the 1st row. Note that for simplicity, the following refinement network in [49] is ignored in the figure. With same settings, we train two models based on gated convolution and partial convolution separately. We then directly visualize intermediate un-normalized gating values in the 2nd row. The values differ mainly based on three parts: background, mask and sketch. In the 3rd row, we provide an interpretation based on which part(s) have higher gating values. Interestingly we also find that for some channels (e.g. channel-31 of the layer after dilated convolution), the learned gating values are based on foreground/background semantic segmentation. For comparison, we also visualize the un-learnable fixed binary mask $M$ of partial convolution in the 4th row.

图 12: 门控卷积和部分卷积的对比,以及对学习到的门控值的可视化和解释。我们首先在第一行展示了基于 [49] 的修复网络架构,将所有卷积替换为门控卷积。注意,为简化起见,图中忽略了 [49] 中的后续细化网络。在相同设置下,我们分别训练了基于门控卷积和部分卷积的两个模型。然后,我们在第二行直接可视化了中间未归一化的门控值。这些值主要基于三个部分:背景、掩码和草图。在第三行,我们提供了基于哪些部分具有较高门控值的解释。有趣的是,我们还发现对于某些通道(例如膨胀卷积后的层的 channel-31),学习到的门控值基于前景/背景语义分割。为了比较,我们还在第四行可视化了部分卷积中不可学习的固定二值掩码 $M$。


Figure 13: Ablation Study of SN-PatchGAN. From left to right, we show original image, masked input, results with one global GAN and our results with SN-PatchGAN. SN-PatchGAN is proposed because free-form masks may appear anywhere in images with any shape. Global and local GANs [15] designed for a single rectangular mask are not applicable.

图 13: SN-PatchGAN 的消融研究。从左到右,我们展示了原始图像、带掩码的输入、使用一个全局 GAN 的结果以及我们使用 SN-PatchGAN 的结果。SN-PatchGAN 被提出是因为自由形式的掩码可能以任何形状出现在图像的任何位置。为单一矩形掩码设计的全局和局部 GAN [15] 不适用。


Figure 14: More comparison results on natural scenes. Best-viewed with zoom-in on PDF to see color shadows and artifacts.

图 14: 自然场景的更多对比结果。建议在 PDF 中放大查看,以便观察色彩阴影和伪影。


Figure 15: More comparison results on faces. Best-viewed with zoom-in on PDF to see color shadows and artifacts.

图 15: 更多面部对比结果。建议在 PDF 中放大查看颜色阴影和伪影。


Figure 16: More results from our free-form inpainting system on natural images (1).

图 16: 我们的自由形式修复系统在自然图像上的更多结果 (1)。


Figure 17: More results from our free-form inpainting system on natural images (2).

图 17: 自然图像上自由形式修复系统的更多结果 (2)。


Figure 18: More results from our free-form inpainting system on faces.

图 18: 我们自由形式修复系统在人脸上的更多结果。
