# [Paper Translation] ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks

## ABSTRACT

The Super-Resolution Generative Adversarial Network (SRGAN) [1] is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss – and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN [2] to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge [3] (first place in region 3, with the best perceptual index). The code is available at https://github.com/xinntao/ESRGAN.


## 1 Introduction

Single image super-resolution (SISR), as a fundamental low-level vision problem, has attracted increasing attention in the research community and AI companies. SISR aims at recovering a high-resolution (HR) image from a single low-resolution (LR) one. Since the pioneer work of SRCNN proposed by Dong et al. [4], deep convolution neural network (CNN) approaches have brought prosperous development. Various network architecture designs and training strategies have continuously improved the SR performance, especially the Peak Signal-to-Noise Ratio (PSNR) value [5, 6, 7, 1, 8, 9, 10, 11, 12]. However, these PSNR-oriented approaches tend to output over-smoothed results without sufficient high-frequency details, since the PSNR metric fundamentally disagrees with the subjective evaluation of human observers [1].

Several perceptual-driven methods have been proposed to improve the visual quality of SR results. For instance, perceptual loss [13, 14] is proposed to optimize super-resolution model in a feature space instead of pixel space. Generative adversarial network [15] is introduced to SR by [1, 16] to encourage the network to favor solutions that look more like natural images. The semantic image prior is further incorporated to improve recovered texture details [17]. One of the milestones in the way pursuing visually pleasing results is SRGAN [1]. The basic model is built with residual blocks [18] and optimized using perceptual loss in a GAN framework. With all these techniques, SRGAN significantly improves the overall visual quality of reconstruction over PSNR-oriented methods.

However, there still exists a clear gap between SRGAN results and the ground-truth (GT) images, as shown in Fig. 1. In this study, we revisit the key components of SRGAN and improve the model in three aspects. First, we improve the network structure by introducing the Residual-in-Residual Dense Block (RRDB), which is of higher capacity and easier to train. We also remove Batch Normalization (BN) [19] layers as in [20] and use residual scaling [21, 20] and smaller initialization to facilitate training a very deep network. Second, we improve the discriminator using Relativistic average GAN (RaGAN) [2], which learns to judge “whether one image is more realistic than the other” rather than “whether one image is real or fake”. Our experiments show that this improvement helps the generator recover more realistic texture details. Third, we propose an improved perceptual loss by using the VGG features before activation instead of after activation as in SRGAN. We empirically find that the adjusted perceptual loss provides sharper edges and more visually pleasing results, as will be shown in Sec. 4.4. Extensive experiments show that the enhanced SRGAN, termed ESRGAN, consistently outperforms state-of-the-art methods in both sharpness and details (see Fig. 1 and Fig. 7).


We take a variant of ESRGAN to participate in the PIRM-SR Challenge [3]. This challenge is the first SR competition that evaluates the performance in a perceptual-quality aware manner based on [22], where the authors claim that distortion and perceptual quality are at odds with each other. The perceptual quality is judged by the non-reference measures of Ma’s score [23] and NIQE [24], i.e., perceptual index = ½ ((10 − Ma) + NIQE). A lower perceptual index represents a better perceptual quality.
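The perceptual index can be computed directly from the two non-reference scores; a minimal sketch (the input values below are illustrative, not challenge results):

```python
def perceptual_index(ma_score: float, niqe: float) -> float:
    """Perceptual index used in the PIRM-SR Challenge:
    0.5 * ((10 - Ma) + NIQE). Lower is better."""
    return 0.5 * ((10.0 - ma_score) + niqe)

# A higher Ma score and a lower NIQE both reduce (improve) the index.
print(perceptual_index(8.0, 3.0))  # → 2.5
```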

As shown in Fig. 2, the perception-distortion plane is divided into three regions defined by thresholds on the Root-Mean-Square Error (RMSE), and the algorithm that achieves the lowest perceptual index in each region becomes the regional champion. We mainly focus on region 3 as we aim to bring the perceptual quality to a new high. Thanks to the aforementioned improvements and some other adjustments as discussed in Sec. 4.6, our proposed ESRGAN won the first place in the PIRM-SR Challenge (region 3) with the best perceptual index.

In order to balance the visual quality and RMSE/PSNR, we further propose the network interpolation strategy, which could continuously adjust the reconstruction style and smoothness. Another alternative is image interpolation, which directly interpolates images pixel by pixel. We employ this strategy to participate in region 1 and region 2. The network interpolation and image interpolation strategies and their differences are discussed in Sec. 3.4.
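Concretely, network interpolation blends the parameters of the PSNR-oriented and GAN-trained generators, while image interpolation blends their output images pixel by pixel; a minimal sketch (the dict-of-arrays parameter representation is illustrative):

```python
import numpy as np

def interpolate_networks(params_psnr, params_gan, alpha):
    """Network interpolation: blend corresponding parameters of the
    PSNR-oriented and the GAN-based generator,
    theta = (1 - alpha) * theta_psnr + alpha * theta_gan."""
    return {k: (1.0 - alpha) * params_psnr[k] + alpha * params_gan[k]
            for k in params_psnr}

def interpolate_images(img_psnr, img_gan, alpha):
    """Image interpolation: blend the two outputs pixel by pixel."""
    return (1.0 - alpha) * img_psnr + alpha * img_gan
```

With alpha = 0 the result is the smooth PSNR-oriented model; with alpha = 1 it is the GAN-based model; intermediate values trade RMSE against perceptual quality continuously.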

## 2 Related Work

We focus on deep neural network approaches to solve the SR problem. As a pioneer work, Dong et al. [4, 25] propose SRCNN to learn the mapping from LR to HR images in an end-to-end manner, achieving superior performance against previous works. Later on, the field has witnessed a variety of network architectures, such as a deeper network with residual learning [5], Laplacian pyramid structure [6], residual blocks [1], recursive learning [7, 8], densely connected network [9], deep back projection [10] and residual dense network [11]. Specifically, Lim et al. [20] propose EDSR model by removing unnecessary BN layers in the residual block and expanding the model size, which achieves significant improvement. Zhang et al. [11] propose to use effective residual dense block in SR, and they further explore a deeper network with channel attention [12], achieving the state-of-the-art PSNR performance. Besides supervised learning, other methods like reinforcement learning [26] and unsupervised learning [27] are also introduced to solve general image restoration problems.


Several methods have been proposed to stabilize training a very deep model. For instance, residual path is developed to stabilize the training and improve the performance [18, 5, 12]. Residual scaling is first employed by Szegedy et al. [21] and also used in EDSR. For general deep networks, He et al. [28] propose a robust initialization method for VGG-style networks without BN. To facilitate training a deeper network, we develop a compact and effective residual-in-residual dense block, which also helps to improve the perceptual quality by a large margin.

Perceptual-driven approaches have also been proposed to improve the visual quality of SR results. Based on the idea of being closer to perceptual similarity [29, 14], perceptual loss [13] is proposed to enhance the visual quality by minimizing the error in a feature space instead of pixel space. Contextual loss [30] is developed to generate images with natural image statistics by using an objective that focuses on the feature distribution rather than merely comparing the appearance. Ledig et al. [1] propose SRGAN model that uses perceptual loss and adversarial loss to favor outputs residing on the manifold of natural images. Sajjadi et al. [16] develop a similar approach and further explored the local texture matching loss. Based on these works, Wang et al. [17] propose spatial feature transform to effectively incorporate semantic prior in an image and improve the recovered textures.


Throughout the literature, photo-realism is usually attained by adversarial training with GAN [15]. Recently there are a bunch of works that focus on developing more effective GAN frameworks. WGAN [31] proposes to minimize a reasonable and efficient approximation of Wasserstein distance and regularizes discriminator by weight clipping. Other improved regularization for discriminator includes gradient clipping [32] and spectral normalization [33]. Relativistic discriminator [2] is developed not only to increase the probability that generated data are real, but also to simultaneously decrease the probability that real data are real. In this work, we enhance SRGAN by employing a more effective relativistic average GAN.



SR algorithms are typically evaluated by several widely used distortion measures, e.g., PSNR and SSIM. However, these metrics fundamentally disagree with the subjective evaluation of human observers [1]. Non-reference measures are used for perceptual quality evaluation, including Ma’s score [23] and NIQE [24], both of which are used to calculate the perceptual index in the PIRM-SR Challenge [3]. In a recent study, Blau et al. [22] find that the distortion and perceptual quality are at odds with each other.


## 3 Proposed Methods

Our main aim is to improve the overall perceptual quality for SR. In this section, we first describe our proposed network architecture and then discuss the improvements from the discriminator and perceptual loss. At last, we describe the network interpolation strategy for balancing perceptual quality and PSNR.

### 3.1 Network Architecture

In order to further improve the recovered image quality of SRGAN, we mainly make two modifications to the structure of generator G: 1) remove all BN layers; 2) replace the original basic block with the proposed Residual-in-Residual Dense Block (RRDB), which combines multi-level residual network and dense connections as depicted in Fig. 4.


Removing BN layers has proven to increase performance and reduce computational complexity in different PSNR-oriented tasks including SR [20] and deblurring [35]. BN layers normalize the features using mean and variance in a batch during training and use estimated mean and variance of the whole training dataset during testing. When the statistics of training and testing datasets differ a lot, BN layers tend to introduce unpleasant artifacts and limit the generalization ability. We empirically observe that BN layers are more likely to bring artifacts when the network is deeper and trained under a GAN framework. These artifacts occasionally appear among iterations and different settings, violating the needs for a stable performance over training. We therefore remove BN layers for stable training and consistent performance. Furthermore, removing BN layers helps to improve generalization ability and to reduce computational complexity and memory usage.
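A toy numpy illustration of the train/test statistics mismatch described above (the distributions are synthetic, not measured from a real network):

```python
import numpy as np

# BN normalizes with batch statistics during training but with stored
# dataset-wide (running) statistics at test time. If a test batch's
# statistics differ from the stored ones, the normalized features shift.
rng = np.random.default_rng(0)
train_feats = rng.normal(loc=0.0, scale=1.0, size=10_000)
running_mean, running_var = train_feats.mean(), train_feats.var()

test_feats = rng.normal(loc=0.5, scale=2.0, size=256)  # shifted statistics
bn_test = (test_feats - running_mean) / np.sqrt(running_var + 1e-5)

# The output is no longer zero-mean / unit-variance — the kind of
# distribution shift that can surface as artifacts in deep GAN-trained models.
print(bn_test.mean(), bn_test.std())
```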

We keep the high-level architecture design of SRGAN (see Fig. 3), and use a novel basic block namely RRDB as depicted in Fig. 4. Based on the observation that more layers and connections could always boost performance [20, 11, 12], the proposed RRDB employs a deeper and more complex structure than the original residual block in SRGAN. Specifically, as shown in Fig. 4, the proposed RRDB has a residual-in-residual structure, where residual learning is used in different levels. A similar network structure is proposed in [36] that also applies a multi-level residual network. However, our RRDB differs from [36] in that we use dense block [34] in the main path as [11], where the network capacity becomes higher benefiting from the dense connections.
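The residual-in-residual structure can be sketched as follows, abstracting each dense block as a callable (its internal convolutions and dense connections are omitted) and using 0.2 as the residual scaling constant:

```python
import numpy as np

BETA = 0.2  # residual scaling constant (see the training techniques below)

def rrdb(x, dense_blocks):
    """Residual-in-Residual Dense Block forward pass (structural sketch).

    Residual learning is applied at two levels: a scaled residual around
    every dense block, and a scaled residual around the whole trunk.
    """
    out = x
    for dense in dense_blocks:
        out = out + BETA * dense(out)   # inner residual per dense block
    return x + BETA * out               # outer residual over the trunk
```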

In addition to the improved architecture, we also exploit several techniques to facilitate training a very deep network: 1) residual scaling [21, 20], i.e., scaling down the residuals by multiplying a constant between 0 and 1 before adding them to the main path to prevent instability; 2) smaller initialization, as we empirically find residual architecture is easier to train when the initial parameter variance becomes smaller. More discussion can be found in the supplementary material.

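The smaller-initialization technique can be sketched by drawing He/MSRA weights [28] and scaling them down by a constant (the factor 0.1 used here is illustrative):

```python
import numpy as np

def scaled_msra_init(fan_in, fan_out, scale=0.1, rng=None):
    """Smaller initialization (sketch): MSRA/He weights scaled down so the
    initial parameter variance is smaller, easing residual-network training."""
    rng = rng if rng is not None else np.random.default_rng()
    std = np.sqrt(2.0 / fan_in)  # He initialization for ReLU networks
    return scale * rng.normal(0.0, std, size=(fan_out, fan_in))
```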

The training details and the effectiveness of the proposed network will be presented in Sec. 4.

### 3.2 Relativistic Discriminator

Besides the improved structure of generator, we also enhance the discriminator based on the Relativistic GAN [2]. Different from the standard discriminator D in SRGAN, which estimates the probability that one input image x is real and natural, a relativistic discriminator tries to predict the probability that a real image xr is relatively more realistic than a fake one xf, as shown in Fig. 5.
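In the relativistic average formulation, σ(C(x)) is replaced by σ(C(x_r) − E[C(x_f)]), where C denotes the raw (non-transformed) discriminator output; a numpy sketch of the resulting discriminator loss (the score values in the usage below are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ragan_d_loss(c_real, c_fake):
    """Relativistic average discriminator loss (sketch).

    c_real / c_fake: raw discriminator outputs C(x) on batches of real and
    generated images. Instead of "is x real?", the discriminator estimates
    "is the real image more realistic than the average fake one?"
    """
    d_real = sigmoid(c_real - c_fake.mean())  # real vs. average fake
    d_fake = sigmoid(c_fake - c_real.mean())  # fake vs. average real
    eps = 1e-12
    return -(np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean())
```

A discriminator that ranks real images above fakes (e.g. `c_real` around +5, `c_fake` around −5) attains a loss near zero, while the reversed ranking is heavily penalized.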

### 3.3 Perceptual Loss

We also develop a more effective perceptual loss Lpercep by constraining on features before activation rather than after activation as practiced in SRGAN.

Based on the idea of being closer to perceptual similarity [29, 14], Johnson et al. [13] propose perceptual loss and it is extended in SRGAN [1]. Perceptual loss is previously defined on the activation layers of a pre-trained deep network, where the distance between two activated features is minimized. Contrary to the convention, we propose to use features before the activation layers, which will overcome two drawbacks of the original design. First, the activated features are very sparse, especially after a very deep network, as depicted in Fig. 6. For example, the average percentage of activated neurons for image ‘baboon’ after the VGG19-54 layer is merely 11.17% (we use the pre-trained 19-layer VGG network [37], where 54 indicates features obtained by the 4th convolution before the 5th maxpooling layer, representing high-level features; similarly, 22 represents low-level features). The sparse activation provides weak supervision and thus leads to inferior performance. Second, using features after activation also causes inconsistent reconstructed brightness compared with the ground-truth image, which we will show in Sec. 4.4.
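A toy illustration of the sparsity argument: for roughly Gaussian pre-activation responses, ReLU zeroes out about half of the neurons, so a loss defined after activation receives supervision from far fewer features (the feature shapes here are illustrative, not VGG's):

```python
import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.normal(size=(512, 14, 14))   # toy pre-activation feature map
post_act = np.maximum(pre_act, 0.0)        # the same features after ReLU

sparsity = (post_act == 0).mean()
print(f"fraction of inactive neurons after ReLU: {sparsity:.2%}")

def perceptual_loss(feat_sr, feat_hr):
    """L1 distance between (pre-activation) feature maps of SR and HR images."""
    return np.abs(feat_sr - feat_hr).mean()
```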

## 4 Experiments

### 4.1 Training Details

Pre-training the generator with a pixel-wise loss before GAN training brings two benefits:

1. It helps the generator avoid undesired local optima;
2. After pre-training, the discriminator receives relatively good super-resolved images from the very beginning instead of extreme fake ones (black or noisy images), which helps it focus more on texture discrimination.

### 4.2 Data

For training, we mainly use the DIV2K dataset [40], which is a high-quality (2K resolution) dataset for image restoration tasks. Beyond the training set of DIV2K that contains 800 images, we also seek for other datasets with rich and diverse textures for our training. To this end, we further use the Flickr2K dataset [41] consisting of 2650 2K high-resolution images collected on the Flickr website, and the OutdoorSceneTraining (OST) [17] dataset to enrich our training set. We empirically find that using this large dataset with richer textures helps the generator to produce more natural results, as shown in Fig. 9.

We train our models in RGB channels and augment the training dataset with random horizontal flips and 90 degree rotations. We evaluate our models on widely used benchmark datasets – Set5 [42], Set14 [43], BSD100 [44], Urban100 [45], and the PIRM self-validation dataset that is provided in the PIRM-SR Challenge.
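The augmentation described above can be sketched as follows (an HWC image layout is assumed):

```python
import numpy as np

def augment(img, rng=None):
    """Random horizontal flip and random 90-degree rotation of an HWC image."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                         # horizontal flip
    img = np.rot90(img, k=rng.integers(4), axes=(0, 1))  # 0/90/180/270 degrees
    return img
```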

### 4.3 Qualitative Results

ESRGAN is able to produce more detailed structures in buildings (see image 102061), while other methods either fail to produce enough details (SRGAN) or add undesired textures (EnhanceNet). Moreover, previous GAN-based methods sometimes introduce unpleasant artifacts, e.g., SRGAN adds wrinkles to the face. Our ESRGAN gets rid of these artifacts and produces natural results.

ESRGAN produces more natural textures, e.g., animal fur, building structures and grass, with fewer unpleasant artifacts such as those SRGAN introduces on faces. As can be seen from Fig. 7, our proposed ESRGAN outperforms previous approaches in both sharpness and details.

### 4.4 Ablation Study

BN removal

RaGAN
RaGAN uses an improved relativistic discriminator, which helps the model learn sharper edges and more detailed textures.

### 4.6 The PIRM-SR Challenge

1. The MINC loss is used as a variant of perceptual loss, as discussed in Sec. 3.3. Despite the slight increase of the perceptual index, we still believe that exploring perceptual loss that focuses on texture is crucial for SR;
2. The pristine dataset [24], which is used for learning the perceptual index, is also employed in our training;
3. A loss L1 with a high weight up to η = 10 is used due to the PSNR constraint;
4. We also use back projection [46] as post-processing, which can improve PSNR and sometimes lower the perceptual index.
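Back projection [46] iteratively corrects the SR estimate so that it stays consistent with the LR input; a sketch using average-pool downsampling and nearest-neighbour upsampling as stand-ins for the true degradation operator:

```python
import numpy as np

def downsample(img, s):
    """s x s average-pool downsampling (stand-in for the real degradation)."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(img, s):
    """Nearest-neighbour upsampling."""
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

def back_projection(sr, lr, scale, n_iter=10):
    """Iterative back projection (sketch): repeatedly push the SR estimate
    toward consistency with the LR input, which raises PSNR."""
    for _ in range(n_iter):
        residual = lr - downsample(sr, scale)
        sr = sr + upsample(residual, scale)
    return sr
```

After the update, downsampling the refined SR image reproduces the LR input, which is the consistency constraint that improves PSNR.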

## 5 Conclusion

Xintao Wang1, Ke Yu1, Shixiang Wu2, Jinjin Gu3, Yihao Liu4,
Chao Dong2, Chen Change Loy5, Yu Qiao2, Xiaoou Tang1
ECCV 2018
https://arxiv.org/pdf/1809.00219v2.pdf