Fine-grained Attention and Feature-sharing Generative Adversarial Networks for Single Image Super-Resolution



Yitong Yan, Chuangchuang Liu, Changyou Chen, Xianfang Sun, Longcun Jin*, Xinyi Peng, and Xiang Zhou


Traditional super-resolution (SR) methods that minimize the mean squared error usually produce over-smoothed images with blurry edges because they fail to recover high-frequency details. In this paper, we propose two novel techniques within the generative adversarial network framework to encourage the generation of photo-realistic images for image super-resolution. First, instead of producing a single score to discriminate real and fake images, we propose a variant, called Fine-grained Attention Generative Adversarial Network (FASRGAN), that discriminates each pixel of real and fake images. FASRGAN adopts a UNet-like network as the discriminator with two outputs: an image score and an image score map. The score map has the same spatial size as the HR/SR images and serves as fine-grained attention that represents the degree of reconstruction difficulty for each pixel. Second, instead of using separate networks for the generator and the discriminator, we introduce a feature-sharing variant (denoted as Fs-SRGAN) in which the generator and the discriminator share a feature extractor. The sharing mechanism maintains the model's expressive power while making the model more compact, and thus improves its ability to produce high-quality images. Quantitative and visual comparisons with state-of-the-art methods on benchmark datasets demonstrate the superiority of our methods. We further apply our super-resolved images to object recognition, which further demonstrates the effectiveness of our proposed methods. The code is available at https://github.com/Rainyfish/FASRGAN-and-Fs-SRGAN.




Single image super-resolution (SISR), which aims to recover a high-resolution (HR) image from its low-resolution (LR) version, has been an active research topic in computer graphics and vision for decades. SISR has also attracted increasing attention in both academia and industry, with applications in fields such as medical imaging, security surveillance, and object recognition. However, SISR is a typical ill-posed problem because the image degradation process is irreversible, i.e., multiple HR images can correspond to a single LR image. Learning the mapping between HR and LR images plays an important role in addressing this problem.

Recently, deep convolutional neural networks (CNNs) have shown great success in many vision tasks, such as image classification, object detection, and image restoration. Dong et al. first proposed a three-layer CNN for single image super-resolution (SRCNN) to directly learn the mapping from LR to HR images. Since then, CNN-based methods have dominated the SR problem because they greatly improve reconstruction performance. Kumar et al. tapped into the ability of polynomial neural networks to hierarchically learn refinements of a function that maps LR to HR patches. VDSR obtained remarkable performance by increasing the depth of the network to 20, demonstrating the importance of network depth for extracting effective image features. EDSR removed the unnecessary batch normalization layers in the ResNet architecture and widened the channels, significantly improving performance. RCAN applied a residual-in-residual structure to construct a very deep network and used a channel attention mechanism to adaptively rescale features. The aforementioned methods minimize the mean squared error (MSE) between the recovered SR image and the corresponding HR image. Such methods are designed to maximize the peak signal-to-noise ratio (PSNR).

However, they typically produce over-smoothed edges and lack fine textures. To produce photo-realistic SR images, Ledig et al. first used the generative adversarial network (GAN) to match the underlying distributions of HR and SR images. ESRGAN further extended the generator network and used the Relativistic Discriminator to produce more photo-realistic images. However, as shown in Fig. 1, the discriminator in these GAN-based methods only outputs a single score for the whole input SR/HR image, which is a coarse way to guide the generator. Furthermore, the generator and the discriminator in these works are treated as independent networks, while we believe there is significant information to be shared between them. For example, the lower-level parts of the two networks both aim at extracting fine features such as corners and edges, which we believe should be correlated.




To address these limitations, we propose two novel techniques based on the GAN framework for image super-resolution: a fine-grained attention mechanism for the discriminator and a feature-sharing network component for both the generator and the discriminator. Specifically, we use a UNet-like discriminator (Fig. 2) to introduce fine-grained attention into the GAN (FASRGAN). Our discriminator produces two outputs: a score for the whole input image and a fine-grained score map covering every pixel in the image. The score map has the same spatial size as the input image and measures, at each pixel, the degree of difference between the generated and the true distributions. To produce images of better visual quality, we incorporate the score map into the loss function as an attention mechanism so that the generator pays more attention to the hard parts of an image instead of treating all parts equally. In addition, we propose a feature-sharing mechanism (Fig. 3) to align the low-level feature extraction of the generator and the discriminator (Fs-SRGAN). This structure reduces the number of parameters and improves performance. Overall, our main contributions are three-fold:

  • We propose a novel UNet-like discriminator that generates a single score for the whole image and a pixel-wise score map of the input image. We further incorporate the score map into the generator's loss function as an attention mechanism. This attention mechanism makes the generator focus on the hard parts of an image.
  • We introduce a feature-sharing mechanism that defines a shared low-level feature extractor for the generator and the discriminator. This reduces the number of model parameters and helps the generator and the discriminator extract more effective features.
  • The two proposed components are general and can be applied to other GAN-based SR models.

Extensive experiments on benchmark datasets illustrate the superiority of our proposed methods compared with current state-of-the-art methods. The remainder of the paper is organized as follows. Section II describes related work. The proposed GAN-based methods are presented in Section III. Experimental results are discussed in Section IV. Finally, conclusions are drawn in Section V.




Traditional SISR methods are exemplar or dictionary based. However, these methods are limited by the size of datasets or dictionaries, and are usually time-consuming. These shortcomings can be greatly alleviated by the recent CNN-based methods [9044873].

In their pioneering work, Dong et al. [dong2015image] applied a three-layer convolutional neural network to SISR, namely SRCNN, to learn a mapping from LR to HR images in an end-to-end manner. Kim et al. [kim2016accurate] increased the depth of the network to 20, achieving a large improvement in accuracy over SRCNN. Instead of using interpolated LR images as the network input, FSRCNN [dong2016accelerating] extracted features from the original LR images and upscaled the spatial size with upsampling layers at the tail of the network. This architecture is widely used in subsequent SR methods. Various advanced upsampling structures have been proposed recently, for instance, the deconvolutional layer [zeiler2010deconvolutional], sub-pixel convolution [shi2016real], and EUSR [kim2018deep]. LapSRN [lai2017deep] progressively reconstructed an HR image at increasing scales of an input image using a Laplacian pyramid structure. Lim et al. [lim2017enhanced] proposed a very large network (EDSR) and its multi-scale version (MDSR), which removed the unnecessary batch normalization layers in ResNet [he2016deep] and greatly improved super-resolution performance. D-DBPN [haris2018deep] introduced an error-correcting feedback mechanism to learn relationships between LR features and SR features. ZSSR [shocher2018zero] used an unsupervised method to learn the mapping between HR images and LR images. DIP [ulyanov2018deep] showed that the structure of a generator network can capture a large amount of low-level image statistics before any learning is performed, which can be used as a handcrafted prior with excellent results in super-resolution and other standard inverse problems. To address the real-world LR image problem, Fritsche et al. [fritsche2019frequency] proposed to separate the low and high image frequencies and treat them differently during training. Adversarial training is used to modify only the high frequencies, not the low ones.

RDN [zhang2018residual] combined dense and residual connections to make full use of the information in LR images. Different from RDN, MS-RHDN [liu2019multi] proposed multi-scale residual hierarchical dense networks to extract multi-scale and hierarchical feature maps. Yang et al. [yang2018drfn] proposed a deep recurrent fusion network (DRFN) for SR with large scale factors, which used transposed convolution to jointly extract and upsample raw features from the input and used multi-level fusion for reconstruction. SeaNet [fang2020soft] proposed a soft-edge assisted network to reconstruct high-quality SR images with the help of image soft edges. Zhang et al. [zhang2020adaptive] proposed an adaptive importance learning scheme to enhance the performance of lightweight SISR network architectures. RCAN [zhang2018image] applied a channel-attention mechanism to adaptively rescale channel-wise features. SAN [Dai_2019_CVPR] further proposed a second-order channel attention (SOCA) module that rescales the features using second-order statistics instead of global average pooling.


The aforementioned methods aim to achieve high PSNR and SSIM values. However, optimizing for these criteria usually causes heavily over-smoothed edges and artifacts. Images generated by these MSE-based SR methods lose many high-frequency details and have poor perceptual quality. To generate more photo-realistic images, Ledig et al. first introduced the generative adversarial network into image super-resolution, in a model called SRGAN. SRGAN combined a perceptual loss and an adversarial loss to improve the realism of generated images. However, visually implausible artifacts could still be found in some generated images. To reduce the artifacts, EnhanceNet combined a pixel-wise loss in the image space, a perceptual loss in the feature space, a texture matching loss, and an adversarial loss. The contextual loss is a kind of perceptual loss that makes the generated images as similar as possible to the ground-truth images. Yan et al. first trained a novel full-reference image quality assessment (FR-IQA) approach for SISR, then employed the proposed loss function (SR-IQA) to train their SR network, which contains their proposed highway unit. In addition, they integrated the SR-IQA loss into a GAN-based SR method to achieve better results in both accuracy and perceptual quality. Based on SRGAN, ESRGAN (i) substituted the standard residual block with a residual-in-residual dense block, (ii) removed batch normalization layers, (iii) used VGG features before activation for the perceptual loss, and (iv) replaced the standard discriminator with the Relativistic Discriminator proposed in RaGAN. Notably, ESRGAN won first place in the 2018 PIRM Challenge on Perceptual Image Super-Resolution, which targeted images of high perceptual quality. RankSRGAN first trained a ranker to learn the behavior of perceptual metrics and then introduced a rank-content loss to optimize the perceptual quality.

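Figure 2: The discriminator architecture of FASRGAN, where K, S, and G denote the kernel size, stride, and number of filters of each Conv layer, respectively. FC denotes a fully connected layer. The mask is a score map with values in [0,1] that represents the reconstruction difficulty of each pixel in the image.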


Proposed Methods

Overview

Our methods aim to reconstruct a high-resolution image $ I^{\text{SR}}\in R^{Wr\times Hr\times C} $ from a low-resolution image $ I^{\text{LR}}\in R^{W\times H\times C} $ , where $ W $ and $ H $ are the width and height of the LR image, $ r $ is the upscaling factor, and $ C $ is the number of channels of the color space. This section details, in order, our two strategies within the GAN framework for image super-resolution: FASRGAN and Fs-SRGAN. Specifically, we propose a fine-grained attention mechanism in FASRGAN that makes the generator focus on the difficult parts of image reconstruction, based on the score map from the UNet-like discriminator, instead of treating each part equally. We further propose a feature-sharing mechanism in Fs-SRGAN, in which the generator and the discriminator share a low-level feature extractor. Both networks contribute gradients to the shared low-level feature extractor during training, which makes the model more compact while keeping enough representation power. Each of the two strategies contributes to the overall perceptual quality of SR. For simplicity, we use the same generator architecture as ESRGAN to generate the SR images from the input LR images.


Fine-grained Attention Generative Adversarial Networks

Our proposed fine-grained attention GAN (FASRGAN) designs a specific discriminator for SISR. As discussed above and shown in Fig. 1, the discriminator in a standard GAN-based SR model outputs a single score for the whole input SR/HR image. This is a coarse way to judge an input image and cannot discriminate its local features. To tackle this problem, the proposed FASRGAN defines a UNet-like discriminator with two outputs, corresponding to a score for the whole image and a fine-grained score map. The score map has the same size as the input image and is used for pixel-wise discrimination. The proposed discriminator is illustrated in Fig. 2.


A UNet-like Discriminator

The UNet-like discriminator with two outputs can be divided into two parts: an encoder and a decoder. Similar to the standard discriminator D in ESRGAN, the encoder part of the proposed UNet-like discriminator uses a standard max-pooling layer with a stride of 2 to reduce the spatial size of a feature map and enlarge the receptive field. At the same time, the number of channels is increased to improve representational ability. At the end of the encoder, two fully connected layers are added to output a score that measures the overall probability of an input image $ x $ being real or fake. We further enhance the discriminator based on the Relativistic GAN, which has also been used in ESRGAN. The loss function is defined as:



$$ L_{adv}^D = \mathbb{E}_{x_r}[\log(1-D_{Ra}(x_r,x_f))]+ \mathbb{E}_{x_f}[\log(D_{Ra}(x_f,x_r))] = \mathbb{E}_{x_r}[\log(1-\sigma(C(x_r)-\mathbb{E}_{x_f}[C(x_f)]))]+ \mathbb{E}_{x_f}[\log(\sigma(C(x_f)-\mathbb{E}_{x_r}[C(x_r)]))], $$


where $ x_r $ and $ x_f $ stand for the ground-truth image and the generated SR image, respectively. $ D_{Ra}(\cdot) $ refers to the relativistic discriminator, which tries to predict the probability that a real image $ x_r $ is more realistic than a fake one $ x_f $ ; $ C(x) $ is the discriminator output before the sigmoid function, and $ \sigma $ is the sigmoid function.

In the decoder, we use an upsampling layer to gradually enlarge the spatial size of the feature maps, followed by two convolutional layers to extract more information. To make full use of the features, we concatenate each decoder feature map with the encoder output that has the same spatial size. As shown in Fig. 2, the feature maps at the end of the decoder have the same spatial size as the input images. Finally, we use the sigmoid function to produce a score map $ M\in R^{Wr\times Hr \times C} $ that provides pixel-wise discrimination between real and fake pixels of an input image. The fine-grained adversarial loss function $ L_{M}^D $ for the discriminator (Eq.mask_d) is defined on $ M_r $ and $ M_f $ , the score maps of the HR image and the generated SR image, respectively. The overall loss function for the discriminator is defined as:


$ L^D = L^D_{adv} + L^D_M $ .
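
The whole-image term $ L^D_{adv} $ can be sketched numerically. The following is a minimal illustration in plain Python, not the authors' implementation: it assumes the raw discriminator outputs $ C(x) $ for a batch are available as lists of floats, approximates the expectations by batch means, and adds a small `eps` (an assumption) for numerical stability. The function and argument names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written sigma in the text."""
    return 1.0 / (1.0 + math.exp(-z))


def mean(values):
    return sum(values) / len(values)


def ra_discriminator_loss(c_real, c_fake, eps=1e-12):
    """L_adv^D with the sign convention used in the text.

    c_real, c_fake: raw (pre-sigmoid) discriminator outputs C(x_r), C(x_f).
    """
    mean_real = mean(c_real)
    mean_fake = mean(c_fake)
    # D_Ra(x_r, x_f) = sigma(C(x_r) - E[C(x_f)])
    real_term = mean(
        [math.log(1.0 - sigmoid(c - mean_fake) + eps) for c in c_real]
    )
    # D_Ra(x_f, x_r) = sigma(C(x_f) - E[C(x_r)])
    fake_term = mean(
        [math.log(sigmoid(c - mean_real) + eps) for c in c_fake]
    )
    return real_term + fake_term
```

With this sign convention, a discriminator that scores real images above fake ones obtains a lower (more negative) loss than one that cannot tell them apart, so minimizing the expression pushes $ D_{Ra}(x_r,x_f) $ toward 1 and $ D_{Ra}(x_f,x_r) $ toward 0.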


Generator Objective Function

In GAN-based SR methods, the generator produces the SR images from the LR images. ESRGAN introduced the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit, which has higher capacity and is easier to train than the ResBlock in SRGAN. In this paper, we also adopt RRDB to construct our generator for a fair comparison with ESRGAN. The generator is trained with several losses, defined as follows.

We use an $ L_1 $ loss function to constrain the content of a generated SR image to be close to the HR image. The loss (Eq.L1) is defined as:

$$ L_1 = \parallel F_{\theta}^{G}(I_i^{LR})-I_i^{HR}\parallel_1, $$

where $ F_\theta^G(\cdot) $ represents the function of the generator, $ \theta $ denotes the parameters of the generator, and $ I_i $ means the $ i $ -th image.

The perceptual loss aims to make the SR image close to the corresponding HR image based on high-level features extracted from a pre-trained network. As in prior work, we feed both the SR and HR images to the pre-trained VGG19 and extract the VGG19-54 layer features. The perceptual loss is defined as:

$$ L_{percep} = \parallel F_{\theta}^{VGG}(G(I_i^{LR}))-F_{\theta}^{VGG}(I_i^{HR})\parallel_1, $$

where $ F_{\theta}^{VGG}(\cdot) $ is the function of VGG, $ I_i $ is the $ i $ -th image, and $ G(\cdot) $ is the function of the generator. The discriminator has two outputs: an overall estimate for the entire image and pixel-wise fine-grained estimates of the input image. The total adversarial loss function for the generator is defined as:

$$ L_{adv}^G = L_{entire}^G+L_{fine}^G, $$

As shown in Eq.mask_d, the discriminator tries to distinguish real and fake images in a fine-grained way, while the generator aims to fool the discriminator. Thus the fine-grained adversarial loss function $ L^G_{fine} $ for the generator is the symmetrical form of Eq.mask_d. Likewise, $ L_{entire}^G $ (Eq.entire_adv) is the symmetrical form of Eq.discriminator_adversarial.
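
One plausible reading of the whole-image term is sketched below. It is an assumption rather than the authors' stated formula: "symmetrical form" is read as swapping the roles of the two logarithm terms in Eq.discriminator_adversarial (as in the relativistic generator loss of ESRGAN) while keeping the same sign convention and symbols:

$$ L_{entire}^G = \mathbb{E}_{x_r}[\log(D_{Ra}(x_r,x_f))]+ \mathbb{E}_{x_f}[\log(1-D_{Ra}(x_f,x_r))] $$

Under this reading, minimizing $ L_{entire}^G $ pushes $ D_{Ra}(x_f,x_r) $ toward 1, i.e., the generator tries to make the SR image look more realistic than the HR image to the discriminator, which is the opposite of the discriminator's objective.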

The score map generated by the UNet-like discriminator gives pixel-wise discrimination scores for an input image, with values $ M(w,h,c) $ in $ [0,1] $ . A higher score means the corresponding pixel of the input image is closer to that of the ground-truth image. In this manner, the score map indicates which parts of an image are more difficult to generate and which parts are easier. For instance, the structured background of an image is often simpler to reconstruct, and we would expect the discriminator to reflect this to the generator when the generator is updated. In other words, the parts with lower scores (more difficult to generate) should receive more attention when updating the generator. We therefore incorporate the score map as a fine-grained attention mechanism into an $ L_1 $ loss function, constituting a weighted attention loss function:

$$ L_{attention}=\frac{1}{Wr\times Hr \times C}\sum_{w=1}^{Wr}\sum_{h=1}^{Hr}\sum_{c=1}^C (1-M_f(w,h,c)) \times \parallel F_{\theta}^{G} (I_i^{LR})(w,h,c)-I_i^{HR}(w,h,c)\parallel_1, $$

where $ M_f(w,h,c) $ is the score map of the generated image given by the discriminator. Instead of treating every pixel of an image equally, $ L_{attention} $ pays more attention to the hard-to-recover parts of an image, such as textures with rich semantic information. Combining the above losses with different weights, the total loss of the generator is:
$$ \label{eq:generator_loss} L^{G}= L_{percep} + \lambda_1 L_{adv}^G + \lambda_2 L_{attention} + \lambda_3 L_1, $$
where $ \lambda_1 $ , $ \lambda_2 $ , $ \lambda_3 $ are the coefficients to balance different loss terms.
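
The attention-weighted loss and the weighted sum in Eq.generator_loss can be illustrated with a small sketch. This is not the authors' code: images and the score map are assumed to be flattened into equal-length lists of floats, the function names are hypothetical, and the default weights are the FASRGAN values reported in the training details. Whether gradients flow through $ M_f $ is not stated in the text and is not modeled here.

```python
def attention_l1_loss(sr, hr, score_map):
    """Mean over all elements of (1 - M_f) * |SR - HR|.

    sr, hr, score_map: flattened lists of equal length (Wr * Hr * C values).
    """
    if not (len(sr) == len(hr) == len(score_map)):
        raise ValueError("sr, hr and score_map must have the same length")
    total = 0.0
    for s, h, m in zip(sr, hr, score_map):
        # Low score (hard pixel) -> weight close to 1; high score -> close to 0.
        total += (1.0 - m) * abs(s - h)
    return total / len(sr)


def total_generator_loss(percep, adv, attention, l1,
                         lam1=5e-3, lam2=1e-2, lam3=1e-2):
    """L^G = L_percep + lam1 * L_adv^G + lam2 * L_attention + lam3 * L_1."""
    return percep + lam1 * adv + lam2 * attention + lam3 * l1
```

For two pixels with the same absolute error of 0.5, a score map of `[1.0, 0.0]` removes the first pixel from the loss and keeps the second at full weight, so `attention_l1_loss([0.5, 0.5], [0.0, 0.0], [1.0, 0.0])` evaluates to 0.25.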



Feature-sharing Generative Adversarial Networks

In standard GANs, the generator and the discriminator are usually defined as two independent networks. Based on the observation that the shallow parts of the two networks both extract low-level textures such as edges and corners, we propose a new network structure (Fs-SRGAN) to enable low-level feature sharing between the generator and the discriminator. This can reduce the number of parameters and help both networks extract more effective features. Consequently, our Fs-SRGAN contains three parts: a shared feature extractor, a generator, and a discriminator, as shown in Fig. 3.



Shared Feature Extractor

We first use a shared feature extractor to transform an input image from color space to feature space and then extract low-level feature maps. The feature-sharing mechanism allows the generator and the discriminator to jointly optimize the low-level feature extractor. Similar to FASRGAN, we adopt RRDB, the basic block of ESRGAN, as the basic structure. The shared feature extractor contains $ E $ RRDBs to extract helpful feature maps for both the generator and the discriminator, described as follows:
$$ \label{eq:shared_part} H_{shared} = F_{shared}(x), $$
where $ H_{shared} $ denotes the low-level feature maps extracted by the shared part, $ F_{shared} $ represents the function of the shared feature extractor, and $ x $ is the input. For the generator, the input is an LR image, while for the discriminator it is an SR image or an HR image. Because the input sizes of the generator and the discriminator differ, we use a fully convolutional network that preserves the spatial size of the feature maps, so that the different input sizes do not matter.
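
Why a size-preserving fully convolutional extractor can serve both inputs follows from the standard convolution output-size formula. The sketch below is illustrative only: the kernel size of 3 with padding 1 is an assumption (the text only states that the feature-map size is invariant), and the 48 and 192 pixel sizes assume the $ 48\times48 $ LR training patches and the $ 4\times $ scale used in the experiments.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# A stride-1, padding-1, 3x3 convolution keeps any input size unchanged,
# so the same layers accept LR patches (generator path) and
# HR/SR patches (discriminator path).
lr_size = conv_output_size(48)    # LR patch side length
hr_size = conv_output_size(192)   # HR/SR patch side length at 4x

# For contrast, a stride-2 convolution roughly halves the spatial size.
halved = conv_output_size(192, stride=2)
```

Since no layer depends on a fixed spatial size, the shared weights can be reused for both input resolutions without any resizing step.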


The Generator and the Discriminator

The remaining parts of the generator and the discriminator are the same as those in standard GAN-based methods, except that their inputs are feature maps instead of images, as shown in Fig. 3. A standard generator contains three parts: low-level feature extraction, deep feature extraction, and reconstruction. These are used, respectively, for transforming the input image from color space to feature space and extracting low-level information, for extracting high-level semantic features, and for reconstructing the SR image. Note that the generator in our Fs-SRGAN contains only the latter two parts, since the first is handled by the shared feature extractor. As in the shared low-level feature extractor, we adopt RRDB as the basic block of deep feature extraction, but use more RRDBs to increase the depth of the network in order to extract more high-frequency features for reconstruction. The reconstruction part uses an upsampling layer to upscale the high-level feature maps and a Conv layer to reconstruct an SR image. The loss function of the generator is the same as that of ESRGAN and includes a perceptual loss, an adversarial loss, and a pixel-based loss:
$$ \label{eq:fs_generator} L^{G}= L_{percep} + \lambda_1 L_{adv}^G + \lambda_2 L_1, $$
where $ \lambda_1 $ and $ \lambda_2 $ are the coefficients that balance the different loss terms, $ L_{percep} $ and $ L_1 $ are defined in Eq.percep and Eq.L1, respectively, and $ L_{adv}^G $ is the adversarial loss with the same definition as $ L^G_{entire} $ in Eq.entire_adv. Because the discriminator is a classification network that distinguishes whether the input is an SR or an HR image, we adopt a structure similar to the VGG network for the discriminator. To reduce information loss, we replace the pooling layer (used in the encoder of the UNet-like discriminator) with a Conv layer with a stride of 2 to decrease the size of the feature maps. At the tail of the discriminator, we use a Conv layer to transform the feature map into a one-dimensional vector and then use two fully connected layers to output the classification score $ s $ in $ [0, 1] $ . A value of $ s $ closer to 1 means the input is judged more real, and a value closer to 0 means more fake. The loss function of the discriminator is the same as $ L^D_{adv} $ defined in Eq.discriminator_adversarial.



Experimental Results

In this section, we first describe our model training details, then provide quantitative and visual comparisons between our two proposed methods, FASRGAN and Fs-SRGAN, and several state-of-the-art methods on benchmark datasets. We further combine the fine-grained attention and the feature-sharing mechanisms into a single model, termed FA+Fs-SRGAN.

Dataset    Metric  -       SRGAN   ESRGAN  RankSRGAN  FASRGAN  Fs-SRGAN  FA+Fs-SRGAN
Set5       PSNR    28.57   29.91   30.46   29.73      30.15    30.28     30.19
Set5       SSIM    0.8102  0.8510  0.8515  0.8398     0.8450   0.8588    0.8571
Set5       PI      2.8466  3.4322  3.5463  2.9867     3.1685   3.9143    3.7455
Set5       LPIPS   0.0488  0.0389  0.0350  0.0348     0.0325   0.0330    0.0344
Urban100   PSNR    23.54   24.39   24.36   24.49      24.51    24.55     24.67
Urban100   SSIM    0.6933  0.7309  0.7341  0.7319     0.7380   0.7509    0.7466
Urban100   PI      3.4543  3.4814  3.7312  3.3253     3.5173   3.5940    3.5819
Urban100   LPIPS   0.0777  0.0693  0.0591  0.0667     0.0588   0.0591    0.0625
BSD100     PSNR    24.94   25.50   25.32   25.51      25.41    25.61     25.87
BSD100     SSIM    0.6266  0.6528  0.6514  0.6530     0.6523   0.6726    0.6747
BSD100     PI      2.8467  2.3054  2.4150  2.0768     2.2783   2.4056    2.3749
BSD100     LPIPS   0.0982  0.0887  0.0798  0.0850     0.0796   0.0801    0.0855
DIV2K val  PSNR    27.28   28.16   28.17   28.10      28.15    28.15     28.23
DIV2K val  SSIM    0.7460  0.7753  0.7759  0.7710     0.7768   0.7903    0.7891
DIV2K val  PI      3.4953  3.1619  3.2572  3.0130     3.1034   3.3303    3.3092
DIV2K val  LPIPS   0.0753  0.0605  0.0550  0.0576     0.0539   0.0542    0.0576
PIRM val   PSNR    25.47   25.61   25.18   25.65      25.38    25.75     26.00
PIRM val   SSIM    0.6569  0.6757  0.6596  0.6726     0.6648   0.6907    0.6934
PIRM val   PI      2.6762  2.2254  2.5548  2.0183     2.2476   2.3311    2.2482
PIRM val   LPIPS   0.0838  0.0718  0.0714  0.0675     0.0685   0.0651    0.0677
Quantitative results with the bicubic degradation model for $4\times$ SR. Best and second-best results are highlighted.



Training Details

We train our models on the training set of DIV2K. The LR images are obtained from the source high-resolution images by bicubic downsampling (BI). The images are augmented by rotation and flipping, and $ 48\times48 $ patches are randomly cropped from the LR images as the network input. Our networks are optimized with the ADAM optimizer, with $ \beta_1 = 0.9 $ and $ \beta_2 = 0.999 $ . The batch size is set to 16. The generator is first pre-trained with the $ L_1 $ loss, after which the generator and the discriminator are trained with their corresponding loss functions. Following prior work, the initial learning rate is set to $ 1\times10^{-4} $ and is halved every $ 2\times10^5 $ iterations. In FASRGAN, the coefficients in Eq.generator_loss are set to $ \lambda_1 = 5\times10^{-3} $ , $ \lambda_2 = 1\times10^{-2} $ and $ \lambda_3 = 1\times10^{-2} $ . As in ESRGAN, the number of RRDBs in the generator is set to 23. In Fs-SRGAN, we set the number of RRDBs in the shared feature extractor to $ E=1 $ and in the deep feature extractor to 16. The coefficients in Eq.fs_generator are set to $ \lambda_1 = 5\times10^{-3} $ and $ \lambda_2 = 1\times10^{-2} $ . In FA+Fs-SRGAN, the number of RRDBs is 2 in the shared part and 15 in the deep feature extraction part. The discriminator and the coefficients of the loss function are the same as those of FASRGAN. We implement our models with the PyTorch framework on two NVIDIA GeForce RTX 2080Ti GPUs.
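The learning-rate schedule and patch sampling described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released training code: the function names are hypothetical, and the pairing of each $ 48\times48 $ LR patch with the $ 192\times192 $ HR patch at the same location (for $4\times$ SR) is an assumed, commonly used convention that the text does not spell out.

```python
import random


def learning_rate(iteration, base_lr=1e-4, step=200_000, gamma=0.5):
    """Step decay: start at base_lr and halve it every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


def paired_crop_boxes(lr_height, lr_width, patch=48, scale=4, rng=random):
    """Sample a random LR patch and the HR patch covering the same region.

    Returns two (top, left, size) boxes, one in LR and one in HR coordinates.
    The LR/HR alignment is an assumption made for this sketch.
    """
    top = rng.randint(0, lr_height - patch)   # randint is inclusive on both ends
    left = rng.randint(0, lr_width - patch)
    lr_box = (top, left, patch)
    hr_box = (top * scale, left * scale, patch * scale)
    return lr_box, hr_box
```

For example, `learning_rate(450_000)` evaluates two halvings of the initial rate, and `paired_crop_boxes(100, 120)` returns an LR box of size 48 together with an HR box of size 192.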



Datasets and Evaluation Metrics

In the testing phase, we use seven standard benchmark datasets to evaluate the performance of our proposed methods: Set5, Set14, BSD100, Urban100, Manga109, DIV2K validation, and the PIRM validation and test datasets. Blau et al. proved mathematically that perceptual quality does not always improve as PSNR increases, and that there is a trade-off between average distortion and perceptual quality. Hence, we not only use PSNR and SSIM to measure reconstruction accuracy, but also adopt the learned perceptual image patch similarity (LPIPS) and the perceptual index (PI) to evaluate the perceptual quality of SR images. LPIPS first uses a pre-trained network $ \mathcal{F} $ to extract features $ y, y_0 $ from the reference and target image patches $ x, x_0 $ . The activations of each layer $ l $ are scaled channel-wise by a learned weight $ w_l $ , and the squared $ L_2 $ distances are averaged spatially and summed across all layers. This gives the perceptual distance:
$$ \label{eq:LPIPS} d(x, x_0) = \sum_l \frac{1}{H_lW_l}\sum_{h,w}\parallel w_l \odot(\hat{y}_l^{hw} - \hat{y}^{hw}_{0l})\parallel_2^2, $$
where $ \hat{y}_l, \hat{y}_{0l}\in\mathbb{R}^{H_l \times W_l \times C_l} $ represent the reference and target features from layer $ l $ , unit-normalized along the channel dimension. The HR images from the public datasets are used as the reference images, and the SR images generated by our methods or by the compared methods as the target images. We use the public code and pre-trained network (AlexNet from version 0.0) for evaluation. PI is based on the no-reference image quality measures of Ma et al. and NIQE: PI $ = \frac{1}{2}((10-\mathrm{Ma})+\mathrm{NIQE}) $ . PSNR and SSIM are calculated on the luminance channel in the YCbCr color space. We also use LPIPS and root mean square error (RMSE) to measure the trade-off between perceptual quality and reconstruction accuracy. We plot LPIPS against RMSE rather than against PSNR because the result is easier to read: lower is better for both LPIPS and RMSE. Higher PSNR/SSIM and lower RMSE mean better reconstruction accuracy, while lower LPIPS/PI scores mean better perceptual quality. As shown in Fig.78004, the SR image of our FASRGAN has fewer artifacts than that of ESRGAN and is clearer than that of RankSRGAN. However, the PI value of the SR image produced by RankSRGAN is lower than that of our FASRGAN, and even lower than that of the original HR image. In terms of LPIPS, our method attains the lowest value, which is more consistent with human observation. Hence, we use LPIPS as our primary perceptual quality metric and PI as the secondary one.
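The metric definitions above can be illustrated with a small pure-Python sketch. It assumes, as in the public LPIPS code, that features are unit-normalized along the channel dimension before the weighted difference is taken. The features, the weights, the Ma score and the NIQE score are all inputs here: extracting them requires the pre-trained networks and quality models, which this sketch does not implement. All function names are hypothetical.

```python
import math


def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def lpips_style_distance(ref_feats, tgt_feats, weights):
    """LPIPS-style distance between two stacks of feature maps.

    ref_feats, tgt_feats: one H x W x C nested list per layer.
    weights: one length-C list of channel weights per layer.
    """
    total = 0.0
    for y, y0, w in zip(ref_feats, tgt_feats, weights):
        height, width = len(y), len(y[0])
        layer_sum = 0.0
        for row, row0 in zip(y, y0):
            for a, b in zip(row, row0):
                a_hat, b_hat = unit_normalize(a), unit_normalize(b)
                layer_sum += sum((wc * (p - q)) ** 2
                                 for wc, p, q in zip(w, a_hat, b_hat))
        total += layer_sum / (height * width)  # spatial average per layer
    return total


def perceptual_index(ma, niqe):
    """PI = 1/2 * ((10 - Ma) + NIQE); lower is better."""
    return 0.5 * ((10.0 - ma) + niqe)


def psnr_from_rmse(rmse, peak=255.0):
    """PSNR in dB for a given RMSE, assuming pixel values in [0, peak]."""
    return 20.0 * math.log10(peak / rmse)
```

Because PSNR is a monotonically decreasing function of RMSE, plotting LPIPS against either quantity carries the same information; RMSE simply puts both axes in a lower-is-better orientation.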




Quantitative Comparisons

We present quantitative comparisons between our methods and state-of-the-art perceptual image SR methods on several benchmark datasets. As shown in Tablequantitative, RankSRGAN obtains the lowest PI values among these methods in most cases, benefiting from the loss function with the newly added ranker that is used to optimize its generator. However, our FASRGAN obtains the best LPIPS on most datasets, and both its LPIPS and PI values are better than those of ESRGAN, whose generator has the same structure as ours. This demonstrates that the fine-grained attention in our FASRGAN transfers more information to the generator to produce better results. With fewer RRDBs in the generator, our Fs-SRGAN obtains the best SSIM results on most datasets and LPIPS results that are comparable to, and sometimes even better than, those of ESRGAN and FASRGAN. In other words, our Fs-SRGAN extracts features more effectively and efficiently, benefiting from the feature-sharing mechanism. The combined model FA+Fs-SRGAN obtains the highest PSNR on all datasets except Set5, indicating that it can recover more content in the SR images.

We also compare our methods with the state-of-the-art methods on the trade-off between reconstruction accuracy and visual quality. The results are shown in Fig.PI_RMSE. Methods in the top-left part are mostly MSE-based, with low RMSE and high LPIPS scores due to over-smoothed edges and a lack of high-frequency details. The bottom-right category includes the GAN-based methods, such as SRGAN, EnhanceNet, ESRGAN, RankSRGAN, and our methods. These methods usually produce images of high visual quality even though their RMSE values are larger than those of the MSE-based methods. Our FASRGAN achieves better visual quality and reconstruction accuracy than EnhanceNet, SRGAN and ESRGAN, and lower LPIPS than RankSRGAN. Our Fs-SRGAN attains LPIPS comparable to that of ESRGAN but lower RMSE, and better visual quality and reconstruction accuracy than RankSRGAN. The combined model FA+Fs-SRGAN obtains the lowest RMSE among the GAN-based methods. These results demonstrate the effectiveness of our fine-grained attention and feature-sharing mechanisms.

To further demonstrate the effectiveness of our FASRGAN and Fs-SRGAN, we conduct a user study to calculate the Mean Opinion Score (MOS) against the state-of-the-art SR methods SRGAN, ESRGAN and RankSRGAN. Participants are shown a side-by-side comparison of each generated SR image and the corresponding ground truth, and are asked to rate the difference between the two images on a 5-level scale: 0 - almost identical, 1 - mostly similar, 2 - similar, 3 - somewhat similar and 4 - mostly different. We randomly select 10 images from the PIRM val dataset and invite 10 participants to score each image on this scale. For a better comparison, one small patch of each image is zoomed in. The average score over all images is taken as the final result. As shown in TableMos, our FASRGAN and Fs-SRGAN obtain lower (better) MOS than all the compared methods, supporting the effectiveness of our proposed fine-grained attention and feature-sharing mechanisms.

Methods           LPIPS (PIRM Val)  MOS (PIRM Val)
SRGAN             0.0718            1.98
ESRGAN            0.0714            1.88
RankSRGAN         0.0675            1.84
FASRGAN (ours)    0.0685            1.46
Fs-SRGAN (ours)   0.0651            1.46
Comparison of LPIPS and MOS between our methods and the state-of-the-art methods on PIRM Val, where lower values mean greater similarity to the HR image. LPIPS is computed on the whole dataset, while MOS is calculated on 10 randomly selected images.
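A minimal sketch of how a Mean Opinion Score could be aggregated from such a user study is given below. The function name and the example ratings are hypothetical, and the sketch assumes that every participant rates every image on the 0 to 4 scale.

```python
def mean_opinion_score(ratings):
    """Average all ratings into one score.

    ratings[p][i] is the score that participant p gave to image i,
    an integer from 0 (almost identical) to 4 (mostly different).
    """
    flat = [score for per_participant in ratings for score in per_participant]
    if not flat:
        raise ValueError("no ratings given")
    if any(score not in (0, 1, 2, 3, 4) for score in flat):
        raise ValueError("ratings must be integers from 0 to 4")
    return sum(flat) / len(flat)


# Hypothetical ratings from 2 participants on 3 images (not data from the study).
example_mos = mean_opinion_score([[0, 1, 2], [2, 3, 4]])
```

Because 0 denotes "almost identical" on this scale, a lower MOS indicates that the SR images were judged closer to the ground truth.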



Qualitative Results

We compare our final models on several public benchmark datasets with state-of-the-art MSE-based methods (SRCNN, EDSR, RDN, RCAN) and GAN-based approaches (SRGAN, EnhanceNet, ESRGAN, RankSRGAN).



Visual Comparisons of FASRGAN

Some representative qualitative results are presented in Fig.visual_results_ASRGAN. PSNR, PI and LPIPS are also provided for reference. As shown in Fig.visual_results_ASRGAN, our proposed FASRGAN outperforms previous methods by a large margin. Images generated by FASRGAN contain more fine-grained textures and details. For example, the cropped parts of image 0828 are full of textures. All the compared MSE-based methods suffer from heavy blurring and fail to recover the structure of the stripes and the gaps between them. SRGAN, EnhanceNet, ESRGAN and RankSRGAN generate high-frequency noise and wrong textures, while our FASRGAN reduces the noise and recovers the textures more correctly, producing more faithful results that are closer to the HR images. For image img_093 in Urban100, the cropped parts of the images generated by the compared methods contain heavy blurring and lines in the wrong direction. Although the LPIPS of ESRGAN is slightly lower than that of our FASRGAN, our FASRGAN alleviates the artifacts better and recovers the zebra crossing in the right direction. More results can be seen in the supplemental material. These comparisons demonstrate the strong ability of FASRGAN to produce more photo-realistic and high-quality SR images.



Visual Comparisons of Fs-SRGAN

We also compare our Fs-SRGAN with state-of-the-art methods in Fig.visual_results_Co-SRGAN. Our Fs-SRGAN produces SR images with better sharpness and details than the other methods. For image baboon, the cropped parts of the images generated by the MSE-based methods are over-smoothed. Previous GAN-based methods not only fail to produce clear whiskers but also introduce a lot of unpleasant noise. Despite having a lower LPIPS value, ESRGAN generates too many whiskers that do not appear in the original HR image, while our Fs-SRGAN produces more correct whiskers. For image 0812, MSE-based methods still suffer from heavy blurring and generate unnatural results. GAN-based methods cannot maintain the structures of the stairs or the train tracks, and they introduce artifacts. Our proposed Fs-SRGAN outperforms the compared methods, reducing the artifacts and recovering the correct textures. More results can be seen in the supplemental material. These results also indicate that sharing the low-level feature extractor between the generator and the discriminator is beneficial.



Visual Comparisons of FA+Fs-SRGAN

We further present the visual results of our FA+Fs-SRGAN compared with ESRGAN, FASRGAN and Fs-SRGAN. As shown in Fig.FA+Fs-SRGAN, for images 57 and 'OhWareraRettouSeitokai', the results of FA+Fs-SRGAN are better than those of FASRGAN and contain more correct textures. FA+Fs-SRGAN also has the best PSNR and LPIPS values. More results can be seen in the supplemental material. These results illustrate that the combined method can restore more content in the SR images and obtains comparable or even better visual results than FASRGAN and Fs-SRGAN.


Model Analysis

This section compares the size and the time complexity of the generators of our methods and of ESRGAN, all of which use RRDB as the basic block. We do not compare with SRGAN and RankSRGAN in these respects, because their generators are built from 16 residual blocks and therefore have fewer parameters and shorter inference times than ours. In terms of the number of parameters, both ESRGAN and our FASRGAN have 23 RRDBs and 16.7M parameters, while our Fs-SRGAN and FA+Fs-SRGAN have 17 RRDBs and 12.46M parameters in their generators. In terms of inference time, we run our models and the official public test code and model of ESRGAN on Urban100, using a machine with a 4.2GHz Intel i7 CPU (32G RAM) and an Nvidia RTX 2080 GPU. We run inference on Urban100 five times and report the mean. Our Fs-SRGAN and FA+Fs-SRGAN run much faster than ESRGAN: the average time is 0.1377 seconds for Fs-SRGAN and 0.1364 seconds for FA+Fs-SRGAN, versus 0.3573 seconds for ESRGAN. Even our FASRGAN runs a little faster than ESRGAN, with an average time of 0.3160 seconds. From Tablequantitative we can see that our Fs-SRGAN has comparable or even better results than ESRGAN, which demonstrates the efficiency of our feature-sharing mechanism.

Fig.PI_process plots the curves of PI values during the training of our proposed methods, evaluated on Set14. We observe that the training process of FASRGAN is the most stable and its PI value is the lowest. The average PI value of Fs-SRGAN is higher than that of FASRGAN. As described in prior work, a deeper model has a stronger representation capacity to capture semantic information and reduce unpleasant noise. As mentioned above, Fs-SRGAN contains fewer RRDBs than FASRGAN. Hence, we speculate that, compared with FASRGAN, Fs-SRGAN captures less information for reconstruction and introduces more noise, causing higher PI values. FA+Fs-SRGAN, which adds the fine-grained attention mechanism to Fs-SRGAN, obtains lower PI values than Fs-SRGAN, which demonstrates the effectiveness of our fine-grained attention mechanism. However, the training of FA+Fs-SRGAN is not stable, which is an issue we need to address in future work.
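To make the size and speed comparison concrete, the sketch below derives the relative parameter reduction and speed-up from the figures quoted above, and shows one simple way to average wall-clock time over five runs. The helper names are hypothetical, and the timing helper is a generic CPU-side sketch rather than the authors' benchmark script; when timing GPU inference, the device would additionally need to be synchronized before each clock reading.

```python
import time


def mean_runtime(fn, repeats=5):
    """Call fn() `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Generator sizes (millions of parameters) and mean inference times (seconds)
# as reported in the text.
PARAMS_M = {"ESRGAN": 16.7, "FASRGAN": 16.7, "Fs-SRGAN": 12.46, "FA+Fs-SRGAN": 12.46}
TIME_S = {"ESRGAN": 0.3573, "FASRGAN": 0.3160, "Fs-SRGAN": 0.1377, "FA+Fs-SRGAN": 0.1364}


def relative_to_esrgan(name):
    """Return (fraction of parameters saved, speed-up factor) versus ESRGAN."""
    reduction = 1.0 - PARAMS_M[name] / PARAMS_M["ESRGAN"]
    speedup = TIME_S["ESRGAN"] / TIME_S[name]
    return reduction, speedup
```

With the reported numbers, `relative_to_esrgan("Fs-SRGAN")` gives a parameter reduction of roughly one quarter and a speed-up of about 2.6 times.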


Evaluation            Top-1 error  Top-5 error
Bicubic               0.526        0.277
SRCNN                 0.464        0.230
FSRCNN                0.488        0.252
SRGAN                 0.410        0.191
EnhanceNet            0.454        0.224
ESRGAN                0.334        0.132
RankSRGAN             0.342        0.136
Fs-SRGAN (ours)       0.338        0.136
FA+Fs-SRGAN (ours)    0.337        0.134
FASRGAN (ours)        0.323        0.124
Baseline              0.241        0.071
Object recognition results of our methods and the state-of-the-art methods for $4\times$ SR. The baseline uses the original HR images as the input of the ResNet-50 model.



Object Recognition Performance

To further demonstrate the quality of our generated SR images, we use super-resolution as a pre-processing step for object recognition. We use the same setting as EnhanceNet and evaluate the object recognition performance on the images generated by our methods and by other state-of-the-art methods: SRCNN, FSRCNN, SRGAN, EnhanceNet, ESRGAN and RankSRGAN. We use a ResNet-50 pre-trained on ImageNet as the evaluation model and take the first 1000 images of the ImageNet CLS-LOC validation dataset for evaluation. The test images are first downsampled by bicubic interpolation and then upscaled by our methods and the compared methods. These SR images are then used as inputs to the ResNet-50 model to calculate the Top-1 and Top-5 errors. As shown in Tableobject_recognition_SRGAN, our two proposed methods and the variant FA+Fs-SRGAN achieve competitive accuracy compared with the state-of-the-art methods. Among these three methods, FASRGAN achieves the lowest Top-1 and Top-5 errors, while Fs-SRGAN and FA+Fs-SRGAN obtain results comparable to those of ESRGAN, demonstrating the effectiveness of both the fine-grained attention and the feature-sharing mechanisms.
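For reference, the Top-1 and Top-5 errors used in this evaluation can be computed from classifier scores as sketched below. The function name and the toy scores are hypothetical; in the actual evaluation the scores would come from the ResNet-50 model applied to the SR images.

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose true label is not among the k highest scores.

    scores[n] is a list of per-class scores for image n,
    labels[n] is the index of the true class of image n.
    """
    wrong = 0
    for class_scores, label in zip(scores, labels):
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Toy example with 3 images and 3 classes (not data from the paper).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
```

On the toy data, only the first image is classified correctly at Top-1, and the first two are covered at Top-2, so the error shrinks as k grows.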

Dataset    Metric  w/o FA  with FA (FASRGAN)  w/o Fs  with Fs (Fs-SRGAN)
PIRM Test  PSNR    25.04   25.26              25.44   25.69
PIRM Test  SSIM    0.6454  0.6523             0.6626  0.6785
PIRM Test  PI      2.4251  2.1160             2.1420  2.2279
PIRM Test  LPIPS   0.0751  0.0718             0.0731  0.0695
Ablation study of the fine-grained attention (FA) and feature-sharing (Fs) mechanisms for $4\times$ super-resolution.



Ablation Study

To study the effects of the two mechanisms in the proposed methods, we conduct ablation experiments by removing each mechanism in turn and measuring the resulting difference. The quantitative results are given in Tableablation_result, and overall visual comparisons are presented in Fig.ablation_FASRGAN, Fig.ablation_Co-SRGAN and the supplemental material. A detailed discussion follows.



Removing the Fine-grained Attention Mechanism

We first remove the fine-grained attention (FA) mechanism from FASRGAN. In the model without FA, the attention term is removed from the loss function of the generator: the coefficients of Eq.generator_loss are set to $ \lambda_1 = 5\times10^{-3} $ , $ \lambda_2 = 0 $ and $ \lambda_3 = 1\times10^{-2} $ . The fine-grained adversarial loss functions $ L^D_M $ and $ L^G_{fine} $ are also removed. The generators of FASRGAN and of the model without FA have the same number of parameters, so the difference lies only in the loss functions used in training. From Tableablation_result we can see that FASRGAN surpasses the model without FA in all metrics. An obvious decrease in visual quality can also be observed in Fig.ablation_FASRGAN. For image img_009, the model without the FA mechanism introduces unnatural noise and undesired edges, while FASRGAN maintains the structure and produces a high-quality SR image. For image OL_Lunch, the result of the model without the FA mechanism contains more artifacts and noise and the letters cannot be recognized well, while FASRGAN reduces the artifacts and noise, and its result looks closer to the original HR image. The visual analysis indicates the effectiveness of the FA mechanism in removing unpleasant and unnatural artifacts.



Removing the Feature-sharing Mechanism

We then remove the feature-sharing (Fs) mechanism, so that the generator and the discriminator extract their low-level features separately, while the loss function remains the same as that of Fs-SRGAN. The discriminator and the generator in our Fs-SRGAN use a shared RRDB to extract low-level features, whereas in the model without the Fs mechanism each of them uses its own RRDB. Tableablation_result shows that Fs-SRGAN has lower LPIPS and higher PSNR/SSIM than the model without the Fs mechanism. Fig.ablation_Co-SRGAN presents the results of the model without the Fs mechanism and of Fs-SRGAN. We can observe that Fs-SRGAN outperforms the model without the Fs mechanism by a large margin. Removing the Fs mechanism tends to introduce unpleasant artifacts. For image zebra, Fs-SRGAN with the Fs mechanism alleviates heavy artifacts and noise, recovering the stripes on the legs more clearly and correctly. For image 292, our Fs-SRGAN generates more textures of the pane. These results illustrate the effectiveness of our Fs mechanism.



Feature Visualization of Fine-Grained Attention and Feature-sharing Mechanisms

To further verify the effectiveness of our proposed fine-grained attention and feature-sharing mechanisms, we present feature visualizations of the first RRDB of the generator (FASRGAN) and of the shared feature extraction part (Fs-SRGAN) in Fig.fea_visualization. Compared with the model without the attention mechanism, FASRGAN reduces the noise in image img_063 and extracts more texture information. The feature maps from Fs-SRGAN also contain more helpful textures, showing that our proposed methods help the networks extract more useful information.



The Block Number of the Feature-sharing Part

To further study the effect of the depth of the shared feature extractor in Fs-SRGAN, we vary the number of RRDBs in both the shared low-level feature extractor and the deep feature extractor, keeping the total number of RRDBs in the generator unchanged. As shown in Fig.change, increasing the number of RRDBs in the shared feature extractor reduces performance: the additional parameters increase the burden on the discriminator and make the model difficult to train. Among the tested configurations, E1G16 obtains the best results in both visual quality and reconstruction accuracy.


Dataset    Metric  $\lambda_2 = 0.05$  $\lambda_2 = 0.01$  $\lambda_2 = 0.005$
-          PSNR    26.44               26.29               26.41
-          SSIM    0.7103              0.7040              0.7109
-          PI      2.7639              2.7008              2.8057
-          LPIPS   0.0662              0.0640              0.0675
Urban100   PSNR    24.35               24.51               24.31
Urban100   SSIM    0.7359              0.7380              0.7364
Urban100   PI      3.5091              3.5173              3.5284
Urban100   LPIPS   0.0608              0.0588              0.0613
DIV2K val  PSNR    28.16               28.15               28.19
DIV2K val  SSIM    0.7771              0.7903              0.7803
DIV2K val  PI      3.1826              3.3303              3.2378
DIV2K val  LPIPS   0.0547              0.0542              0.0560
The ablation study of the coefficient $\lambda_2$.



Coefficient of the Fine-Grained Attention Loss in the FASRGAN Generator

We also conduct an ablation study to examine the influence of the coefficient $ \lambda_2 $ of the fine-grained attention loss $ L_{attention} $ in the generator of FASRGAN. We set $ \lambda_2 $ to 0.05, 0.01 and 0.005, while the other settings are kept the same. As shown in Tablecoefficient, the model with $ \lambda_2 = 0.01 $ has the best performance in SSIM and LPIPS on Urban100 and DIV2K val, and achieves comparable results in PSNR and PI. The visual results are shown in the supplemental material. When $ \lambda_2 $ is too small, the fine-grained feedback from the discriminator has less impact on the generator. When $ \lambda_2 $ is too large, the training of both the generator and the discriminator becomes unstable and hard to converge. These results indicate that $ \lambda_2 = 0.01 $ is a good setting in practice, and it is the value used in our FASRGAN.

Dataset   Metric  SRGAN   SRGAN_FA  SRGAN_Fs
Set5      PSNR    29.91   29.61     29.66
Set5      SSIM    0.8510  0.8437    0.8541
Set5      PI      3.4322  3.0651    3.4440
Set5      LPIPS   0.0389  0.0341    0.0368
Set14     PSNR    26.56   26.11     26.27
Set14     SSIM    0.7093  0.6977    0.7179
Set14     PI      2.8549  2.7550    2.7705
Set14     LPIPS   0.0696  0.0692    0.0669
Urban100  PSNR    24.39   24.00     24.04
Urban100  SSIM    0.7309  0.7205    0.7331
Urban100  PI      3.4814  3.4252    3.4818
Urban100  LPIPS   0.0693  0.0688    0.0691
BSD100    PSNR    25.50   25.35     25.46
BSD100    SSIM    0.6528  0.6506    0.6650
BSD100    PI      2.3054  2.2503    2.3348
BSD100    LPIPS   0.0887  0.0856    0.0876
The effect of the fine-grained attention (FA) and feature-sharing (Fs) mechanisms on SRGAN.



The Fine-grained Attention and Feature-sharing Mechanisms in SRGAN

To verify whether our proposed fine-grained attention (FA) and feature-sharing (Fs) mechanisms can improve the performance of other GAN-based SR models, we incorporate these two mechanisms into SRGAN, and denote the resulting models as SRGAN_FA and SRGAN_Fs, respectively. The generator in SRGAN_FA is the same as that of SRGAN, and the discriminator adopts our proposed UNet-like structure. In SRGAN_Fs, we use a convolution layer and a residual block (RB) as the shared low-level feature extraction part for the generator and the discriminator. The remaining parts of the generator and the discriminator are similar to those of SRGAN, except that the number of RBs in the deep feature extraction part is 13. Hence, the number of parameters is 1.48K for SRGAN_Fs, compared with 1.554K for SRGAN and SRGAN_FA. As shown in Tablesrgan, SRGAN_FA achieves the best performance in terms of PI and LPIPS on most of the test datasets. SRGAN_Fs also outperforms SRGAN in SSIM and LPIPS. These results demonstrate that our proposed FA and Fs mechanisms can be readily adapted to the SRGAN model.



Results on the Real-World Dataset

We also benchmark our proposed methods on a publicly available real-world dataset to test their robustness. We adopt the test set of RealSR(V3) as the dataset and PSNR/SSIM/PI/LPIPS as the evaluation metrics. As shown in the top part of Tablerealsr, our FASRGAN and Fs-SRGAN obtain better PI and LPIPS than ESRGAN and results comparable to those of RankSRGAN, demonstrating that our proposed models are more robust on real-world LR images. In addition, we use the training set of RealSR(V3) to fine-tune ESRGAN and our proposed methods. All models are fine-tuned for about 150k iterations, with the learning rate initially set to $ 10^{-4} $ and halved every 50k iterations. We test the fine-tuned (FT) models on the test set, and also compare them with ZSSR and with the work of Fritsche et al. proposed for the AIM 2019 Challenge on Real World, denoted as FSSR. ZSSR is the first unsupervised SR method for non-ideal LR images. The code and models of FSSR are publicly available, and we adopt the model TDSR of AIM for comparison. As shown in the bottom part of Tablerealsr, our FASRGAN and Fs-SRGAN still obtain better results in LPIPS, indicating that our models are robust on real-world images. Visual results of the fine-tuned models and the compared methods are presented in Fig.RealSR. We can observe that the results generated by ZSSR and FSSR are heavily blurred, which gives a poor visual impression. The fine-tuned results of ESRGAN contain some artifacts and noise that are unpleasant to look at. In contrast, our fine-tuned FASRGAN and Fs-SRGAN reduce the artifacts and produce more pleasing results, demonstrating the robustness of our proposed models.




We propose two GAN-based models, FASRGAN and Fs-SRGAN, for SISR to overcome the limitations of existing methods. FASRGAN introduces a fine-grained attention mechanism into the GAN framework, where the discriminator has two outputs: a score measuring the quality of the whole input and a fine-grained attention estimate for the input. The fine-grained attention provides fine-grained supervision to the generator, encouraging the generation of images that are photo-realistic at the pixel level. Fs-SRGAN shares the low-level feature extractor between the generator and the discriminator, reducing the number of parameters and improving the reconstruction performance. These two mechanisms are general and could be applied to other GAN-based SR models. Comparisons with other state-of-the-art methods on benchmark datasets demonstrate the effectiveness of our proposed methods.

