[Paper Translation] ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks


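Original paper: https://arxiv.org/pdf/1809.00219v2.pdf

Code: https://github.com/xinntao/ESRGAN

Translator's note: this is a super-resolution algorithm that has recently been widely used for restoring old films.

Chinese-English side-by-side version: https://aiqianji.com/blog/article/1
Collection of Chinese-English side-by-side papers: https://aiqianji.com/blog/articles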

ABSTRACT

The Super-Resolution Generative Adversarial Network (SRGAN) [1] is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN [2] to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge [3] (first place in region 3, with the best perceptual index). The code is available at https://github.com/xinntao/ESRGAN.

In this supplementary file, we first show more examples of Batch-Normalization (BN) related artifacts in Section 6. Then we introduce several useful techniques that facilitate training very deep models in Section 7. The analysis of the influence of different datasets and training patch size is depicted in Section 8 and Section 9, respectively. Finally, in Section 10, we provide more qualitative results for visual comparison.

1 Introduction

Single image super-resolution (SISR), as a fundamental low-level vision problem, has attracted increasing attention in the research community and AI companies. SISR aims at recovering a high-resolution (HR) image from a single low-resolution (LR) one. Since the pioneer work of SRCNN proposed by Dong et al. [4], deep convolution neural network (CNN) approaches have brought prosperous development. Various network architecture designs and training strategies have continuously improved the SR performance, especially the Peak Signal-to-Noise Ratio (PSNR) value [5, 6, 7, 1, 8, 9, 10, 11, 12]. However, these PSNR-oriented approaches tend to output over-smoothed results without sufficient high-frequency details, since the PSNR metric fundamentally disagrees with the subjective evaluation of human observers [1].
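
To make the PSNR metric mentioned above concrete, here is a minimal sketch of its standard definition, PSNR = 10 log10(MAX^2 / MSE). The function name `psnr` and the flat-list image representation are assumptions made for illustration. Because the score depends only on the pixel-wise mean squared error, it rewards results that are close to the average of plausible outputs, which is consistent with the over-smoothing described in the paragraph.

```python
import math


def psnr(reference, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equally sized pixel lists.

    `reference` and `test` are flat lists of pixel intensities
    (an illustrative simplification of an image).
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# Hypothetical pixel values, chosen only to illustrate the arithmetic.
close_result = psnr([100.0, 100.0], [110.0, 90.0])      # MSE = 100
worst_case = psnr([0.0, 0.0, 0.0, 0.0], [255.0] * 4)    # MSE = 255^2
```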

Several perceptual-driven methods have been proposed to improve the visual quality of SR results. For instance, perceptual loss [13, 14] is proposed to optimize super-resolution model in a feature space instead of pixel space. Generative adversarial network [15] is introduced to SR by [1, 16] to encourage the network to favor solutions that look more like natural images. The semantic image prior is further incorporated to improve recovered texture details [17]. One of the milestones in the way pursuing visually pleasing results is SRGAN [1]. The basic model is built with residual blocks [18] and optimized using perceptual loss in a GAN framework. With all these techniques, SRGAN significantly improves the overall visual quality of reconstruction over PSNR-oriented methods.

However, there still exists a clear gap between SRGAN results and the ground-truth (GT) images, as shown in Fig. 1. In this study, we revisit the key components of SRGAN and improve the model in three aspects. First, we improve the network structure by introducing the Residual-in-Residual Dense Block (RRDB), which is of higher capacity and easier to train. We also remove Batch Normalization (BN) [19] layers as in [20] and use residual scaling [21, 20] and smaller initialization to facilitate training a very deep network. Second, we improve the discriminator using Relativistic average GAN (RaGAN) [2], which learns to judge “whether one image is more realistic than the other” rather than “whether one image is real or fake”. Our experiments show that this improvement helps the generator recover more realistic texture details. Third, we propose an improved perceptual loss by using the VGG features before activation instead of after activation as in SRGAN. We empirically find that the adjusted perceptual loss provides sharper edges and more visually pleasing results, as will be shown in Sec. 4.4. Extensive experiments show that the enhanced SRGAN, termed ESRGAN, consistently outperforms state-of-the-art methods in both sharpness and details (see Fig. 1 and Fig. 7).

Figure 1 (image: esrgan_pic1.png)

Figure 2 (caption): We show the baselines of EDSR [20], RCAN [12] and EnhanceNet [16], and the submitted ESRGAN model. The blue dots are produced by image interpolation.
We take a variant of ESRGAN to participate in the PIRM-SR Challenge [3]. This challenge is the first SR competition that evaluates the performance in a perceptual-quality aware manner based on [22], where the authors claim that distortion and perceptual quality are at odds with each other. The perceptual quality is judged by the non-reference measures of Ma’s score [23] and NIQE [24], i.e., perceptual index = $\frac{1}{2}\left((10 - \mathrm{Ma}) + \mathrm{NIQE}\right)$. A lower perceptual index represents a better perceptual quality.
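
The perceptual index defined above is simple arithmetic once the two scores are available. The sketch below assumes Ma's score and NIQE have already been computed by some external tool; the function name `perceptual_index` and the sample scores are hypothetical and serve only to show how the two measures combine.

```python
def perceptual_index(ma_score: float, niqe: float) -> float:
    """Perceptual index = 0.5 * ((10 - Ma) + NIQE); lower is better."""
    return 0.5 * ((10.0 - ma_score) + niqe)


# Hypothetical scores, not measurements from the paper.
sharp_result = perceptual_index(ma_score=8.0, niqe=3.0)    # 0.5 * (2 + 3)
blurry_result = perceptual_index(ma_score=5.0, niqe=6.0)   # 0.5 * (5 + 6)
```

A higher Ma's score and a lower NIQE both push the index down, so the two measures agree in direction after the `10 - Ma` flip.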

As shown in Fig. 2, the perception-distortion plane is divided into three regions defined by thresholds on the Root-Mean-Square Error (RMSE), and the algorithm that achieves the lowest perceptual index in each region becomes the regional champion. We mainly focus on region 3 as we aim to bring the perceptual quality to a new high. Thanks to the aforementioned improvements and some other adjustments as discussed in Sec. 4.6, our proposed ESRGAN won the first place in the PIRM-SR Challenge (region 3) with the best perceptual index.

In order to balance the visual quality and RMSE/PSNR, we further propose the network interpolation strategy, which could continuously adjust the reconstruction style and smoothness. Another alternative is image interpolation, which directly interpolates images pixel by pixel. We employ this strategy to participate in region 1 and region 2. The network interpolation and image interpolation strategies and their differences are discussed in Sec. 3.4.
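
The difference between the two strategies can be sketched as follows, assuming a linear blend with a weight `alpha` in [0, 1] between a PSNR-oriented model and a GAN-based model. The dictionary-of-lists parameter format and both function names are illustrative assumptions, not the authors' implementation. Network interpolation blends the parameters and then runs a single network, whereas image interpolation blends the two output images pixel by pixel.

```python
def interpolate_params(theta_psnr, theta_gan, alpha):
    """Network interpolation: blend corresponding parameters of two models.

    Both arguments map parameter names to flat lists of floats
    (an assumed, simplified stand-in for real weight tensors).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if theta_psnr.keys() != theta_gan.keys():
        raise ValueError("both models must have the same parameters")
    return {
        name: [(1.0 - alpha) * p + alpha * g
               for p, g in zip(theta_psnr[name], theta_gan[name])]
        for name in theta_psnr
    }


def interpolate_images(img_psnr, img_gan, alpha):
    """Image interpolation: blend two output images pixel by pixel."""
    return [(1.0 - alpha) * p + alpha * g for p, g in zip(img_psnr, img_gan)]


# Hypothetical toy parameters and pixel values.
theta_psnr = {"conv.weight": [0.0, 1.0], "conv.bias": [2.0]}
theta_gan = {"conv.weight": [1.0, 3.0], "conv.bias": [4.0]}
theta_mid = interpolate_params(theta_psnr, theta_gan, alpha=0.5)
img_mid = interpolate_images([0.0, 100.0], [50.0, 200.0], alpha=0.5)
```

With `alpha = 0` the blend reproduces the PSNR-oriented model and with `alpha = 1` the GAN-based one, so intermediate values trade smoothness against texture.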

2 Related Work

We focus on deep neural network approaches to solve the SR problem. As a pioneer work, Dong et al. [4, 25] propose SRCNN to learn the mapping from LR to HR images in an end-to-end manner, achieving superior performance against previous works. Later on, the field has witnessed a variety of network architectures, such as a deeper network with residual learning [5], Laplacian pyramid structure [6], residual blocks [1], recursive learning [7, 8], densely connected network [9], deep back projection [10] and residual dense network [11]. Specifically, Lim et al. [20] propose EDSR model by removing unnecessary BN layers in the residual block and expanding the model size, which achieves significant improvement. Zhang et al. [11] propose to use effective residual dense block in SR, and they further explore a deeper network with channel attention [12], achieving the state-of-the-art PSNR performance. Besides supervised learning, other methods like reinforcement learning [26] and unsupervised learning [27] are also introduced to solve general image restoration problems.

Several methods have been proposed to stabilize training a very deep model. For instance, residual path is developed to stabilize the training and improve the performance [18, 5, 12]. Residual scaling is first employed by Szegedy et al. [21] and also used in EDSR. For general deep networks, He et al. [28] propose a robust initialization method for VGG-style networks without BN. To facilitate training a deeper network, we develop a compact and effective residual-in-residual dense block, which also helps to improve the perceptual quality by a large margin.

Perceptual-driven approaches have also been proposed to improve the visual quality of SR results. Based on the idea of being closer to perceptual similarity [29, 14], perceptual loss [13] is proposed to enhance the visual quality by minimizing the error in a feature space instead of pixel space. Contextual loss [30] is developed to generate images with natural image statistics by using an objective that focuses on the feature distribution rather than merely comparing the appearance. Ledig et al. [1] propose SRGAN model that uses perceptual loss and adversarial loss to favor outputs residing on the manifold of natural images. Sajjadi et al. [16] develop a similar approach and further explored the local texture matching loss. Based on these works, Wang et al. [17] propose spatial feature transform to effectively incorporate semantic prior in an image and improve the recovered textures.

Throughout the literature, photo-realism is usually attained by adversarial training with GAN [15]. Recently there are a bunch of works that focus on developing more effective GAN frameworks. WGAN [31] proposes to minimize a reasonable and efficient approximation of Wasserstein distance and regularizes discriminator by weight clipping. Other improved regularization for discriminator includes gradient clipping [32] and spectral normalization [33]. Relativistic discriminator [2] is developed not only to increase the probability that generated data are real, but also to simultaneously decrease the probability that real data are real. In this work, we enhance SRGAN by employing a more effective relativistic average GAN.
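
As a rough illustration of the relativistic average idea, the sketch below contrasts a standard discriminator, which squashes one raw score C(x) through a sigmoid, with a relativistic average one, which first subtracts the mean raw score of the opposite class: sigmoid(C(x) - mean(C(x_other))). The function names and the toy scores are assumptions; a real discriminator would produce the raw scores from images.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def standard_d(score):
    """Standard discriminator: estimated probability that one input is real."""
    return sigmoid(score)


def relativistic_avg_d(score, other_scores):
    """Relativistic average discriminator: probability that an input is
    more realistic than the average input of the opposite class."""
    mean_other = sum(other_scores) / len(other_scores)
    return sigmoid(score - mean_other)


# Hypothetical raw discriminator scores for a small batch.
real_scores = [2.0, 3.0]
fake_scores = [1.0, 2.0]

# A fake scoring 2.0 looks "real" in absolute terms...
absolute = standard_d(2.0)
# ...but is judged less realistic than the average real image (mean 2.5).
relative = relativistic_avg_d(2.0, real_scores)
```

Because the real scores enter the generator's objective through the subtraction, the generator receives gradients from both real and generated data.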

SR algorithms are typically evaluated by several widely used distortion measures, e.g., PSNR and SSIM. However, these metrics fundamentally disagree with the subjective evaluation of human observers [1]. Non-reference measures are used for perceptual quality evaluation, including Ma’s score [23] and NIQE [24], both of which are used to calculate the perceptual index in the PIRM-SR Challenge [3]. In a recent study, Blau et al. [22] find that the distortion and perceptual quality are at odds with each other.

3 Proposed Methods

Our main aim is to improve the overall perceptual quality for SR. In this section, we first describe our proposed network architecture and then discuss the improvements from the discriminator and perceptual loss. At last, we describe the network interpolation strategy for balancing perceptual quality and PSNR.

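Figure 3: We employ the basic architecture of SRResNet [1], where most computation is done in the LR feature space. We could select or design "basic blocks" (e.g., residual block [18], dense block [34], RRDB) for better performance.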

3.1 Network Architecture

In order to further improve the recovered image quality of SRGAN, we mainly make two modifications to the structure of generator G: 1) remove all BN layers; 2) replace the original basic block with the proposed Residual-in-Residual Dense Block (RRDB), which combines multi-level residual network and dense connections as depicted in Fig. 4.

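Figure 4: Left: We remove the BN layers in the residual block in SRGAN. Right: The RRDB block is used in our deeper model, and β is the residual scaling parameter.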
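
The residual-in-residual structure with scaling parameter β can be sketched in a few lines. This is a structural illustration only: the dense blocks are replaced by arbitrary callables, feature maps by flat lists, the count of three inner blocks and the default `beta=0.2` are assumptions, and the names `scaled_residual` and `rrdb` are hypothetical. No normalization layer appears anywhere in the block.

```python
def scaled_residual(x, fx, beta):
    """Return x + beta * F(x), the residual-scaling step."""
    return [xi + beta * fi for xi, fi in zip(x, fx)]


def rrdb(x, dense_blocks, beta=0.2):
    """Residual-in-residual sketch: each inner block has its own scaled
    residual connection, and the whole stack is wrapped in an outer one."""
    out = x
    for block in dense_blocks:
        out = scaled_residual(out, block(out), beta)   # inner residual
    return scaled_residual(x, out, beta)               # outer residual


def toy_block(features):
    """Stand-in for a dense block: returns a constant feature map."""
    return [1.0 for _ in features]


# With beta = 0.5 and x = [0.0]: inner stages give 0.5, 1.0, 1.5,
# and the outer residual gives 0.0 + 0.5 * 1.5.
y = rrdb([0.0], [toy_block, toy_block, toy_block], beta=0.5)
```

Scaling each residual by a constant below 1 keeps the magnitude of the summed features from growing as blocks are stacked, which is the stabilizing effect attributed to residual scaling.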

Removing BN layers has proven to increase performance and reduce computational complexity in different PSNR-oriented tasks including SR [20] and deblurring [35]. BN layers normalize the features using mean and variance in a batch during training and use estimated mean and variance of the whole training dataset during testing. When the statistics of training and testing datasets differ a lot, BN layers tend to introduce unpleasant artifacts and limit the generalization ability. We empirically observe that BN layers are more likely to bring artifacts when the network is deeper and trained under a GAN framework. These artifacts occasionally appear among iterations and different settings, violating the needs for a stable performance over training. We therefore remove BN layers for stable training and consistent performance. Furthermore, removing BN layers helps to improve generalization ability and to reduce computational complexity and memory usage.
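
The train/test mismatch described above can be reproduced with a toy example. The sketch below normalizes features with stored statistics, as a BN layer does at test time, and applies them to a test batch whose distribution differs from the training data. All values are hypothetical, and the learnable scale and shift of a real BN layer are omitted.

```python
import math


def mean_var(values):
    """Mean and (biased) variance of a list of feature values."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)


def batch_norm(values, mean, var, eps=1e-5):
    """Normalize features with the given statistics."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Statistics estimated on training data (stand-in for running estimates).
train_batch = [0.0, 1.0, 2.0, 3.0]
stored_mean, stored_var = mean_var(train_batch)

# Training-like inputs are centred around zero after normalization...
train_out = batch_norm(train_batch, stored_mean, stored_var)

# ...but a test batch with different statistics is shifted far from zero.
test_batch = [10.0, 11.0, 12.0, 13.0]
test_out = batch_norm(test_batch, stored_mean, stored_var)
test_out_mean = sum(test_out) / len(test_out)
```

The later layers were trained on roughly zero-centred features, so a shift of this size at test time is one plausible route to the artifacts reported in the paragraph.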

We keep the high-level architecture design of SRGAN (see Fig. 3), and use a novel basic block namely RRDB as depicted in Fig. 4. Based on the observation that more layers and conne