Fine-grained Attention and Feature-sharing Generative Adversarial Networks for Single Image Super-Resolution
Yitong Yan, Chuangchuang Liu, Changyou Chen, Xianfang Sun, Longcun Jin*, Xinyi Peng, and Xiang Zhou
Abstract
Traditional super-resolution (SR) methods that minimize the mean squared error usually produce images with over-smoothed and blurry edges due to the lack of high-frequency details. In this paper, we propose two novel techniques within the generative adversarial network framework to encourage the generation of photo-realistic images for image super-resolution. Firstly, instead of producing a single score to discriminate real and fake images, we propose a variant, called Fine-grained Attention Generative Adversarial Network (FASRGAN), to discriminate each pixel of real and fake images. FASRGAN adopts a UNet-like network as the discriminator with two outputs: an image score and an image score map. The score map has the same spatial size as the HR/SR images, serving as fine-grained attention that represents the degree of reconstruction difficulty at each pixel. Secondly, instead of using entirely separate networks for the generator and the discriminator, we introduce a feature-sharing variant (denoted as Fs-SRGAN) for both the generator and the discriminator. The sharing mechanism maintains model expressiveness while making the model more compact, and thus improves the ability to produce high-quality images. Quantitative and visual comparisons with state-of-the-art methods on benchmark datasets demonstrate the superiority of our methods. We further apply our super-resolution images to object recognition, which further demonstrates the effectiveness of our proposed method. The code is available at https://github.com/Rainyfish/FASRGAN-and-Fs-SRGAN.
Introduction
Single image super-resolution (SISR), which aims to recover a high-resolution (HR) image from its low-resolution (LR) version, has been an active research topic in computer graphics and vision for decades. SISR has also attracted increasing attention in both academia and industry, with applications in various fields such as medical imaging, security surveillance, and object recognition. However, SISR is a typically ill-posed problem due to the irreversible image degradation process, i.e., multiple HR images can be generated from one single LR image. Learning the mapping between HR and LR images plays an important role in addressing this problem.

Recently, deep convolutional neural networks (CNNs) have shown great success in many vision tasks, such as image classification, object detection, and image restoration. Dong et al. [dong2015image] first proposed a three-layer CNN for single image super-resolution (SRCNN) to directly learn the mapping from LR to HR images. Since then, CNN-based methods [9044873] have been dominant for the SR problem because they greatly improve the reconstruction performance. Kumar et al. [kumar2016fast] tapped into the ability of polynomial neural networks to hierarchically learn refinements of a function that maps LR to HR patches. VDSR [kim2016accurate] obtained remarkable performance by increasing the depth of the network to 20, proving the importance of network depth for detecting effective features of images. EDSR [lim2017enhanced] removed the unnecessary batch normalization layers in the ResNet [he2016deep] architecture and widened the channels, significantly improving the performance. RCAN [zhang2018image] applied a residual-in-residual structure to construct a very deep network and used a channel attention mechanism to adaptively rescale features.

The aforementioned methods follow the optimization idea of minimizing the mean squared error (MSE) between the recovered SR image and the corresponding HR image. Such methods are designed to maximize the peak signal-to-noise ratio (PSNR). However, they typically produce over-smoothed edges and lack tiny textures. To produce photo-realistic SR images, Ledig et al. [ledig2017photo] first used the generative adversarial network (GAN) [goodfellow2014generative] to match the underlying distributions of HR and SR images. ESRGAN [wang2018esrgan] further extended the generator network and used the Relativistic Discriminator [jolicoeur-martineau2018] to produce more photo-realistic images. However, as shown in Fig. 1, the discriminator in these GAN-based methods only outputs a score for the whole input SR/HR image, which is a coarse way to guide the generator. Furthermore, the generator and the discriminator of these works are treated as independent, while we believe there is significant information that could be shared. For example, the lower-level parts of the two networks both aim at extracting tiny features such as corners and edges, which we believe should be correlated.
To address these limitations, we propose two novel techniques based on the GAN framework for image super-resolution: a fine-grained attention mechanism for the discriminator and a feature-sharing network component for both the generator and the discriminator. Specifically, we use a UNet-like discriminator (Fig. 2) to introduce fine-grained attention into the GAN (FASRGAN). Our discriminator produces two outputs: a score of the whole input image and a fine-grained score map of every pixel in the image. The score map shares the same spatial size as the input image and measures the degree of difference at each pixel between the generated and the true distributions. To produce images of better visual quality, we incorporate the score map into the loss function with an attention mechanism to make the generator pay more attention to the hard parts of an image, instead of treating all parts equally. In addition, we propose a feature-sharing mechanism (Fig. 3) that shares the low-level feature extraction of both the generator and the discriminator (Fs-SRGAN). This structure can significantly reduce the number of parameters and improve the performance.

Overall, our main contributions are three-fold:

- We propose a novel UNet-like discriminator that generates a single score for the whole image and a pixel-wise score map of the input image. We further incorporate the score map into the loss function with an attention mechanism to define the generator. This attention mechanism makes the generator focus on the hard parts of an image during generation.
- We introduce a feature-sharing mechanism to define the shared low-level feature extraction for the generator and the discriminator. This reduces the number of model parameters and helps the generator and the discriminator extract more effective features.
- The two proposed components are general and can be applied to other GAN-based SR models. Extensive experiments on benchmark datasets illustrate the superiority of our proposed methods compared with current state-of-the-art methods.

The remainder of the paper is organized as follows. Section II describes related works. The proposed GAN-based methods are presented in Section III. Experimental results are discussed in Section IV. Finally, conclusions are drawn in Section V.
Fig. 1: Architecture of GAN-based super-resolution methods. The generator aims to reconstruct photo-realistic SR images, while the discriminator distinguishes SR images from the real HR images.
Related Work
Traditional SISR methods are exemplar-based or dictionary-based. However, these methods are limited by the size of the datasets or dictionaries and are usually time-consuming. These shortcomings can be greatly alleviated by recent CNN-based methods [9044873].
In their pioneering work, Dong et al. [dong2015image] applied a convolutional neural network with three layers for SISR, namely SRCNN, to learn a mapping from LR to HR images in an end-to-end manner. Kim et al. [kim2016accurate] increased the depth of the network to 20, achieving a great improvement in accuracy compared to SRCNN. Instead of using interpolated LR images as the input of the network, FSRCNN [dong2016accelerating] extracted features from the original LR images and upscaled the spatial size with upsampling layers at the tail of the network. This architecture is widely used in subsequent SR methods. Various advanced upsampling structures have been proposed recently, for instance, the deconvolutional layer [zeiler2010deconvolutional], sub-pixel convolution [shi2016real], and EUSR [kim2018deep]. LapSRN [lai2017deep] progressively reconstructed an HR image at increasing scales of an input image using a Laplacian pyramid structure. Lim et al. [lim2017enhanced] proposed a very large network (EDSR) and its multi-scale version (MDSR), which removed the unnecessary batch normalization layers in the ResNet [he2016deep] and greatly improved super-resolution performance. D-DBPN [haris2018deep] introduced an error-correcting feedback mechanism to learn the relationships between LR features and SR features. ZSSR [shocher2018zero] used an unsupervised method to learn the mapping between HR images and LR images. DIP [ulyanov2018deep] showed that the structure of a generator network can capture a large amount of low-level image statistics before any learning is performed, which can be used as a handcrafted prior with excellent results in super-resolution and other standard inverse problems. To address the real-world LR image problem, Fritsche et al. [fritsche2019frequency] proposed to separate the low and high image frequencies and treat them in different ways during training, where adversarial training is used to modify only the high, not the low, frequencies. RDN [zhang2018residual] combined dense and residual connections to make full use of the information of LR images. Different from RDN, MS-RHDN [liu2019multi] proposed multi-scale residual hierarchical dense networks to extract multi-scale and hierarchical feature maps. Yang et al. [yang2018drfn] proposed a deep recurrent fusion network (DRFN) for SR with large scaling factors, which used transposed convolutions to jointly extract and upsample raw features from the input and used multi-level fusion for reconstruction. SeaNet [fang2020soft] proposed a soft-edge assisted network to reconstruct high-quality SR images with the help of image soft-edges. Zhang et al. [zhang2020adaptive] proposed an adaptive importance learning scheme to enhance the performance of lightweight SISR network architectures. RCAN [zhang2018image] applied a channel-attention mechanism to adaptively rescale channel-wise features. SAN [Dai_2019_CVPR] further proposed a second-order channel attention (SOCA) module to rescale the features instead of using global average pooling.
The aforementioned methods aim to achieve high PSNR and SSIM values. However, these criteria usually cause heavily over-smoothed edges and artifacts. Images generated by these MSE-based SR methods lose various high-frequency details and have poor perceptual quality. To generate more photo-realistic images, Ledig et al. first introduced the generative adversarial network into image super-resolution, called SRGAN. SRGAN combined a perceptual loss and an adversarial loss to improve the realism of generated images, but some visually implausible artifacts could still be found in some generated images. To reduce the artifacts, EnhanceNet combined a pixel-wise loss in the image space, a perceptual loss in the feature space, a texture matching loss, and an adversarial loss. The contextual loss is a kind of perceptual loss that makes the generated images as similar as possible to ground-truth images. Yan et al. first trained a novel full-reference image quality assessment (FR-IQA) approach for SISR, then employed the proposed loss function (SR-IQA) to train their SR network, which contains their proposed highway unit; they also integrated the SR-IQA loss into a GAN-based SR method to achieve better results in both accuracy and perceptual quality. Based on SRGAN, ESRGAN (i) substituted the standard residual block with a residual-in-residual dense block, (ii) removed batch normalization layers, (iii) utilized VGG features before activation as the perceptual loss, and (iv) replaced the standard discriminator with the Relativistic Discriminator proposed in RaGAN. Notably, ESRGAN won first place in the 2018 PIRM Challenge on Perceptual Image Super-Resolution, which pursues high perceptual-quality images. RankSRGAN first trained a ranker to learn the behavior of perceptual metrics and then introduced a rank-content loss to optimize perceptual quality.
Fig. 2: Architecture of the discriminator in FASRGAN, where K, S, and G denote the kernel size, stride, and number of filters of a Conv layer, respectively. FC denotes a fully connected layer. The mask is a score map in $[0, 1]$ that measures the reconstruction difficulty of each pixel in the image.
Proposed Methods
Overview
Our methods aim to reconstruct a high-resolution image $ I^{\text{SR}}\in \mathbb{R}^{Wr\times Hr\times C} $ from a low-resolution image $ I^{\text{LR}}\in \mathbb{R}^{W\times H\times C} $, where $ W $ and $ H $ are the width and height of the LR image, $ r $ is the upscaling factor, and $ C $ is the number of channels of the color space. This section details our two strategies within the GAN framework for image super-resolution in turn: FASRGAN and Fs-SRGAN. Specifically, in FASRGAN we propose a fine-grained attention mechanism that makes the generator focus on the difficult parts of image reconstruction based on the score map from the UNet-like discriminator, instead of treating each part equally. In Fs-SRGAN we further propose a feature-sharing mechanism that shares the low-level feature extractor of the generator and the discriminator. Both networks update the gradient of the shared low-level feature extractor in the training phase, which makes the model more compact while keeping enough representation power. Each of these two strategies contributes to the overall perceptual quality of SR. For simplicity, we use the same generator architecture as ESRGAN [wang2018esrgan] to generate the SR images from the input LR images.
Fine-grained Attention Generative Adversarial Networks
Our proposed fine-grained attention GAN (FASRGAN) designs a discriminator specifically for SISR. As discussed above and shown in Fig. 1, the discriminator in a standard GAN-based SR model outputs a single score for the whole input SR/HR image. This is a coarse way to judge an input image and cannot discriminate local features of the input. To tackle this problem, the proposed FASRGAN defines a UNet-like discriminator with two outputs, corresponding to a score of the whole image and a fine-grained score map. The score map has the same size as the input image and is used for pixel-wise discrimination. The proposed discriminator is illustrated in Fig. 2.
A UNet-like Discriminator
The UNet-like discriminator with two outputs can be divided into two parts: an encoder and a decoder. Similar to the standard discriminator D in ESRGAN, the encoder part of the proposed UNet-like discriminator uses standard max-pooling layers with a stride of 2 to reduce the spatial size of the feature maps and increase the receptive field. At the same time, the number of channels is increased to improve representational ability. At the end of the encoder, two fully connected layers are added to output a score, measuring the overall probability of an input image $ x $ being real or fake. We further enhance the discriminator based on the Relativistic GAN [jolicoeur-martineau2018], which has also been used in ESRGAN [wang2018esrgan]. The loss function is defined as:
$$
L_{adv}^D = \mathbb{E}_{x_r}[\log(1-D_{Ra}(x_r,x_f))]+ \mathbb{E}_{x_f}[\log(D_{Ra}(x_f,x_r))]
= \mathbb{E}_{x_r}[\log(1-\sigma(C(x_r)-\mathbb{E}_{x_f}[C(x_f)]))]+ \mathbb{E}_{x_f}[\log(\sigma(C(x_f)-\mathbb{E}_{x_r}[C(x_r)]))],
$$
where $ x_r $ and $ x_f $ stand for the ground-truth image and the generated SR image, respectively; $ D_{Ra}(\cdot) $ refers to the relativistic discriminator, which tries to predict the probability that a real image $ x_r $ is more realistic than a fake one $ x_f $; $ C(x) $ is the discriminator output before the sigmoid function and $ \sigma $ is the sigmoid function. In the decoder, we exploit upsampling layers to gradually extend the spatial size of the feature maps, followed by two convolutional layers to extract more information. To make full use of the features, we concatenate the corresponding outputs from the encoder, which have the same spatial size as the current ones. As shown in Fig. 2, the feature maps at the end of the decoder have the same spatial size as the input images. Finally, we use the sigmoid function to produce a score map $ M\in \mathbb{R}^{Wr\times Hr \times C} $ that provides pixel-wise discrimination between real and fake pixels of an input image. The fine-grained adversarial loss $ L_{M}^D $ for the discriminator is defined analogously over the score maps, where $ M_r $ and $ M_f $ represent the score maps of the HR image and the generated SR image, respectively. Finally, the loss function for the discriminator is defined as:
$$ L^D = L^D_{adv} + L^D_M. $$
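To make the two-headed objective concrete, the sketch below shows, under stated assumptions, how the whole-image relativistic loss and a pixel-wise score-map loss could be computed in PyTorch. The relativistic term follows the equation above; the exact form of the score-map term $L_M^D$ is not reproduced here, so a simple per-pixel binary cross-entropy stands in for it, and all helper names are illustrative only.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(c_real, c_fake):
    """Whole-image loss L_adv^D following the equation above.
    c_real, c_fake: raw scores C(x_r), C(x_f) of shape (N, 1), before the sigmoid."""
    eps = 1e-8  # numerical safety for the log
    term_real = torch.log(1.0 - torch.sigmoid(c_real - c_fake.mean()) + eps).mean()
    term_fake = torch.log(torch.sigmoid(c_fake - c_real.mean()) + eps).mean()
    return term_real + term_fake

def score_map_d_loss(m_real, m_fake):
    """Stand-in for the fine-grained loss L_M^D (assumed BCE form): push the HR
    score map toward 1 and the SR score map toward 0 at every pixel."""
    return (F.binary_cross_entropy(m_real, torch.ones_like(m_real))
            + F.binary_cross_entropy(m_fake, torch.zeros_like(m_fake)))

def discriminator_loss(c_real, c_fake, m_real, m_fake):
    """Total discriminator objective L^D = L^D_adv + L^D_M."""
    return relativistic_d_loss(c_real, c_fake) + score_map_d_loss(m_real, m_fake)
```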
Generator Objective Function
In GAN-based SR methods, the generator is used to generate the SR images from the LR images. ESRGAN introduced the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit, which is of higher capacity and easier to train compared with the ResBlock in SRGAN. In this paper, we also adopt RRDB to construct our generator for a fair comparison with ESRGAN. The generator is trained with several losses, defined as follows. Following previous works, we use an $ L_1 $ loss to constrain the content of a generated SR image to be close to the HR image:
$$ L_1 = \parallel F_\theta^{G}(I_i^{LR})-I_i^{HR}\parallel_1, $$
where $ F_\theta^G(\cdot) $ represents the function of the generator, $ \theta $ denotes the parameters of the generator, and $ I_i $ means the $ i $-th image. The perceptual loss aims to make the SR image close to the corresponding HR image in terms of high-level features extracted from a pre-trained network. Similar to ESRGAN, we feed both the SR and HR images into the pre-trained VGG19 and extract the VGG19-54 layer features. The perceptual loss is defined as:
$$ L_{percep} = \parallel F_{\theta}^{VGG}(G(I_i^{LR}))-F_{\theta}^{VGG}(I_i^{HR})\parallel_1, $$
where $ F_{\theta}^{VGG}(\cdot) $ is the function of the VGG network, $ I_i $ is the $ i $-th image, and $ G(\cdot) $ is the function of the generator. The discriminator contains two outputs: a whole estimation of the entire image and pixel-wise fine-grained estimations of an input image. The total adversarial loss for the generator is thus defined as:
$$ L_{adv}^G = L_{entire}^G+L_{fine}^G, $$
Through $ L_M^D $, the discriminator tries to distinguish real and fake images in a fine-grained way, while the generator aims to fool the discriminator. Thus the fine-grained adversarial loss $ L^G_{fine} $ for the generator is the symmetric form of $ L_M^D $. Similarly, $ L_{entire}^G $ is the symmetric form of the discriminator adversarial loss $ L_{adv}^D $ and is defined as:
$$
L_{entire}^G=\mathbb{E}_{x_r}[\log(\sigma(C(x_r)-\mathbb{E}_{x_f}[C(x_f)]))]
+\mathbb{E}_{x_f}[\log(1-\sigma(C(x_f)-\mathbb{E}_{x_r}[C(x_r)]))].
$$
The score map generated by the UNet-like discriminator provides pixel-wise discrimination scores of an input image, with values $ M(w,h,c) $ in $ [0,1] $. A higher score means the corresponding pixel of the input image is closer to that of the ground-truth image. In this manner, the score map can indicate which parts of an image are more difficult to generate and which parts are easier. For instance, the structured background of an image is often simpler, and we would expect the discriminator to reflect this to the generator when the generator is updated. In other words, the parts with lower scores (more difficult to generate) should receive more attention when updating the generator. As a result, we incorporate the score map as a fine-grained attention mechanism into an $ L_1 $ loss, constituting a weighted attention loss:
$$
L_{attention}=\frac{1}{Wr\times Hr \times C}\sum_{w=1}^{Wr}\sum_{h=1}^{Hr}\sum_{c=1}^C (1-M_f(w,h,c))
\times \parallel F_{\theta}^{G} (I_i^{LR})(w,h,c)-I_i^{HR}(w,h,c)\parallel_1,
$$
where $ M_f(w,h,c) $ is the score map of the generated image given by the discriminator. Instead of treating every pixel of an image equally, $ L_{attention} $ pays more attention to the hard-to-recover parts of an image, such as textures with rich semantic information. Combining the above losses with different weights, the total loss of the generator is:
$$ L^{G}= L_{percep} + \lambda_1 L_{adv}^G + \lambda_2 L_{attention} + \lambda_3 L_1, $$
where $ \lambda_1 $ , $ \lambda_2 $ , $ \lambda_3 $ are the coefficients to balance different loss terms.
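To make the objective concrete, the snippet below sketches the attention-weighted $L_1$ term and the total generator loss in PyTorch, using the coefficient values reported later in the training details; the perceptual and adversarial terms are assumed to be computed elsewhere, and all names are illustrative.

```python
import torch

def attention_l1_loss(sr, hr, m_fake):
    """Fine-grained attention loss L_attention: an L1 loss weighted by (1 - M_f),
    so pixels the discriminator scores as far from real (low M_f) contribute more.
    sr, hr, m_fake: tensors of shape (N, C, H*r, W*r); m_fake lies in [0, 1] and is
    typically detached from the discriminator graph (an assumption)."""
    return torch.mean((1.0 - m_fake) * torch.abs(sr - hr))

def generator_loss(l_percep, l_adv, l_attention, l1,
                   lam1=5e-3, lam2=1e-2, lam3=1e-2):
    """Total generator objective L^G = L_percep + λ1 L_adv^G + λ2 L_attention + λ3 L_1."""
    return l_percep + lam1 * l_adv + lam2 * l_attention + lam3 * l1
```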
Feature-sharing Generative Adversarial Networks
In standard GANs, the generator and the discriminator are usually defined as two independent networks. Based on the observation that the shallow parts of the two networks both extract low-level textures such as edges and corners, we propose a new network structure (Fs-SRGAN) that enables low-level feature sharing between the generator and the discriminator. This can reduce the number of parameters and help both networks extract more effective features. Consequently, our Fs-SRGAN contains three parts: a shared feature extractor, a generator, and a discriminator, as shown in Fig. 3.
Shared Feature Extractor
We first use the shared feature extractor to transform an input image from the color space to the feature space and extract low-level feature maps. The feature-sharing mechanism allows the generator and the discriminator to jointly optimize the low-level feature extractor. Similar to FASRGAN, we adopt RRDB, the basic block of ESRGAN, as the basic structure. The shared feature extractor contains $ E $ RRDBs to extract helpful feature maps for both the generator and the discriminator, described as follows:
$$ H_{shared} = F_{shared}(x), $$
where $ H_{shared} $ denotes the low-level feature maps extracted by the shared part, $ F_{shared} $ represents the function of the shared feature extractor, and $ x $ is the input. For the generator the input is an LR image, while for the discriminator it is an SR or HR image. Since the input sizes of the generator and the discriminator differ, we use a fully convolutional network whose feature maps keep the spatial size of its input, so that the different input sizes do not matter.
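A minimal sketch of such a size-invariant shared extractor is given below, with a plain residual block standing in for the RRDB used in the paper; the module and parameter names (e.g., `nf`) are illustrative assumptions.

```python
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    """Sketch of the shared low-level extractor: fully convolutional with stride 1
    and no pooling, so the feature maps keep the input's spatial size and the same
    module can take LR inputs (generator path) or SR/HR inputs (discriminator path)."""

    def __init__(self, in_ch=3, nf=64, num_blocks=1):  # E = 1 in Fs-SRGAN
        super().__init__()
        self.head = nn.Conv2d(in_ch, nf, 3, 1, 1)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(nf, nf, 3, 1, 1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(nf, nf, 3, 1, 1),
            ) for _ in range(num_blocks)
        ])

    def forward(self, x):
        h = self.head(x)
        for block in self.blocks:
            h = h + block(h)  # residual connection, spatial size unchanged
        return h
```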
The Generator and the Discriminator
The remaining parts of the generator and the discriminator are the same as those in standard GAN-based methods, except that the inputs are feature maps instead of images, as shown in Fig. 3. A typical generator contains three parts: low-level feature extraction, deep feature extraction, and reconstruction, which transform the input image from the color space to the feature space and extract low-level information, extract high-level semantic features, and reconstruct the SR image, respectively. Note that the generator in our Fs-SRGAN only contains the latter two parts. Similar to the shared low-level feature extraction, we adopt RRDB as the basic block of the deep feature extraction, except that more RRDBs are used to increase the depth of the network in order to extract more high-frequency features for reconstruction. The reconstruction part utilizes an upsampling layer to upscale the high-level feature maps and a Conv layer to reconstruct the SR image. The loss function of the generator is the same as that of ESRGAN, which includes a perceptual loss, an adversarial loss, and a pixel-based loss:
$$ L^{G}= L_{percep} + \lambda_1 L_{adv}^G + \lambda_2 L_1, $$
where $ \lambda_1 $ and $ \lambda_2 $ are the coefficients to balance the different loss terms; $ L_{percep} $ and $ L_1 $ are defined as above, and $ L_{adv}^G $ is the adversarial loss with the same definition as $ L^G_{entire} $. Because the discriminator is a classification network that distinguishes whether the input is an SR or an HR image, we apply a structure similar to the VGG network as the discriminator. To reduce information loss, we replace the pooling layers (used in the encoder of the UNet-like discriminator) with Conv layers with a stride of 2 to decrease the size of the feature maps. At the tail of the discriminator, we use a Conv layer to transform the feature maps into a one-dimensional vector, and then use two fully connected layers to output a classification score $ s $ in $ [0, 1] $. A value of $ s $ closer to 1 means the input is more likely real, and a value closer to 0 means it is more likely fake. The loss function of the discriminator is the same as $ L^D_{adv} $ defined above.
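The following sketch illustrates, under the same assumptions, how the shared extractor could feed both networks in one training step; `shared`, `generator`, and `discriminator` are hypothetical modules, and the gradients from both the generator and the discriminator losses flow back into `shared`.

```python
def fs_srgan_forward(shared, generator, discriminator, lr_img, hr_img):
    """One joint forward pass of Fs-SRGAN (sketch, names are placeholders)."""
    # Generator path: shared low-level features extracted from the LR image.
    sr_img = generator(shared(lr_img))

    # Discriminator path: the same shared extractor processes SR and HR images;
    # it works at either resolution because it is fully convolutional.
    score_sr = discriminator(shared(sr_img))
    score_hr = discriminator(shared(hr_img))
    return sr_img, score_sr, score_hr
```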
Experimental Results
In this section, we first describe our model training details, and then provide quantitative and visual comparisons with several state-of-the-art methods on benchmark datasets for our two proposed methods, FASRGAN and Fs-SRGAN. We further combine the fine-grained attention and the feature-sharing mechanisms into one single model, termed FA+Fs-SRGAN.
| Dataset | Metric | EnhanceNet | SRGAN | ESRGAN | RankSRGAN | FASRGAN | Fs-SRGAN | FA+Fs-SRGAN |
|---|---|---|---|---|---|---|---|---|
| Set5 | PSNR | 28.57 | 29.91 | 30.46 | 29.73 | 30.15 | 30.28 | 30.19 |
| Set5 | SSIM | 0.8102 | 0.8510 | 0.8515 | 0.8398 | 0.8450 | 0.8588 | 0.8571 |
| Set5 | PI | 2.8466 | 3.4322 | 3.5463 | 2.9867 | 3.1685 | 3.9143 | 3.7455 |
| Set5 | LPIPS | 0.0488 | 0.0389 | 0.0350 | 0.0348 | 0.0325 | 0.0330 | 0.0344 |
| Urban100 | PSNR | 23.54 | 24.39 | 24.36 | 24.49 | 24.51 | 24.55 | 24.67 |
| Urban100 | SSIM | 0.6933 | 0.7309 | 0.7341 | 0.7319 | 0.7380 | 0.7509 | 0.7466 |
| Urban100 | PI | 3.4543 | 3.4814 | 3.7312 | 3.3253 | 3.5173 | 3.5940 | 3.5819 |
| Urban100 | LPIPS | 0.0777 | 0.0693 | 0.0591 | 0.0667 | 0.0588 | 0.0591 | 0.0625 |
| BSD100 | PSNR | 24.94 | 25.50 | 25.32 | 25.51 | 25.41 | 25.61 | 25.87 |
| BSD100 | SSIM | 0.6266 | 0.6528 | 0.6514 | 0.6530 | 0.6523 | 0.6726 | 0.6747 |
| BSD100 | PI | 2.8467 | 2.3054 | 2.4150 | 2.0768 | 2.2783 | 2.4056 | 2.3749 |
| BSD100 | LPIPS | 0.0982 | 0.0887 | 0.0798 | 0.0850 | 0.0796 | 0.0801 | 0.0855 |
| DIV2K val | PSNR | 27.28 | 28.16 | 28.17 | 28.10 | 28.15 | 28.15 | 28.23 |
| DIV2K val | SSIM | 0.7460 | 0.7753 | 0.7759 | 0.7710 | 0.7768 | 0.7903 | 0.7891 |
| DIV2K val | PI | 3.4953 | 3.1619 | 3.2572 | 3.0130 | 3.1034 | 3.3303 | 3.3092 |
| DIV2K val | LPIPS | 0.0753 | 0.0605 | 0.0550 | 0.0576 | 0.0539 | 0.0542 | 0.0576 |
| PIRM val | PSNR | 25.47 | 25.61 | 25.18 | 25.65 | 25.38 | 25.75 | 26.00 |
| PIRM val | SSIM | 0.6569 | 0.6757 | 0.6596 | 0.6726 | 0.6648 | 0.6907 | 0.6934 |
| PIRM val | PI | 2.6762 | 2.2254 | 2.5548 | 2.0183 | 2.2476 | 2.3311 | 2.2482 |
| PIRM val | LPIPS | 0.0838 | 0.0718 | 0.0714 | 0.0675 | 0.0685 | 0.0651 | 0.0677 |
Quantitative results with the Bicubic degradation model for $4\times$ SR. Best and second best results are highlighted
Training Details
In training, we use the training set of DIV2K to train our models. The LR images are obtained by bicubic downsampling (BI) from the source high-resolution images. The images are augmented by rotation and flipping. We randomly crop $ 48\times48 $ patches from the LR images as the input of the network. Our networks are optimized with the ADAM optimizer, whose hyper-parameters are set to $ \beta_1 = 0.9 $ and $ \beta_2=0.999 $. The batch size is set to 16. The generator is first pre-trained with the $ L_1 $ loss, after which the generator and the discriminator are trained with the corresponding loss functions. The initial learning rate is set to $ 1\times10^{-4} $ and decays to half every $ 2\times10^5 $ iterations. In FASRGAN, the coefficients of the total generator loss are set to $ \lambda_1 = 5\times10^{-3} $, $ \lambda_2 = 1\times10^{-2} $, and $ \lambda_3 = 1\times10^{-2} $. Similar to ESRGAN, the number of RRDBs in the generator is set to 23. In Fs-SRGAN, we set the number of RRDBs in the shared feature extractor to $ E=1 $ and in the deep feature extractor to 16; the coefficients of its generator loss are set to $ \lambda_1 = 5\times10^{-3} $ and $ \lambda_2 = 1\times10^{-2} $. In FA+Fs-SRGAN, the number of RRDBs in the shared part is set to 2 and in the deep feature extraction part to 15; the discriminator and the loss coefficients are the same as those of FASRGAN. We implement our models with the PyTorch framework on two NVIDIA GeForce RTX 2080Ti GPUs.
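As a rough illustration, the following PyTorch snippet sketches the optimizer and learning-rate schedule described above; `generator` and `discriminator` are placeholders for the actual networks.

```python
import torch

def build_optimizers(generator, discriminator, lr=1e-4):
    """ADAM with beta1 = 0.9, beta2 = 0.999, and a schedule that halves the
    learning rate every 2e5 iterations (schedulers stepped once per iteration)."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.9, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.9, 0.999))
    sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=200_000, gamma=0.5)
    sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=200_000, gamma=0.5)
    return opt_g, opt_d, sched_g, sched_d
```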
Datasets and Evaluation Metrics
In the testing phase, we use seven standard benchmark datasets to evaluate the performance of our proposed methods: Set5, Set14, BSD100, Urban100, Manga109, the DIV2K validation set, and the PIRM validation and test datasets. Blau et al. proved mathematically that perceptual quality is not always improved with an increase of the PSNR value, and that there is a trade-off between average distortion and perceptual quality. Hence, we not only use PSNR and SSIM to measure reconstruction accuracy, but also adopt the learned perceptual image patch similarity (LPIPS) and the perceptual index (PI) to evaluate the perceptual quality of SR images. LPIPS first adopts a pre-trained network $ \mathcal{F} $ to extract features $ y, y_0 $ from the reference and target images $ x, x_0 $. The network $ \mathcal{F} $ computes the activations of the image patches, scales each by a learned weight $ w_l $, and then sums the $ L_2 $ distances across all layers. Finally, it computes the perceptual distance as follows:
$$
d(x, x_0) = \sum_l \frac{1}{H_lW_l}\sum_{h,w}\parallel w_l \odot(y_l^{hw} - \hat{y}^{hw}_{0l})\parallel_2^2,
$$
where $ y_l, \hat{y}_{0l}\in\mathbb{R}^{H_l \times W_l \times C_l} $ represent the reference and target features from layer $ l $. The HR images from the public datasets are regarded as the reference images, and the SR images generated by our methods or the compared methods are the target images. We use the public code and pre-trained network (AlexNet from version 0.0) for evaluation. PI is based on the no-reference image quality measures of Ma et al. and NIQE: PI $ = \frac{1}{2} $ ((10-Ma)+NIQE). PSNR and SSIM are calculated on the luminance channel in the YCbCr color space. We also use LPIPS and the root mean square error (RMSE) to measure the trade-off between perceptual quality and reconstruction accuracy; we use LPIPS/RMSE rather than LPIPS/PSNR to evaluate the trade-off for better observation, where a lower LPIPS/RMSE means a better result. Higher PSNR/SSIM and lower RMSE mean better reconstruction accuracy, while lower LPIPS/PI scores imply that the images are perceptually more similar to the HR images. As shown in the visual comparison on image 78004, the SR image of our FASRGAN has fewer artifacts than that of ESRGAN and is clearer than that of RankSRGAN; however, the PI value of the SR image produced by RankSRGAN is lower than that of our FASRGAN, and even lower than that of the original HR image. In terms of LPIPS, our method attains the lowest value, which is more consistent with human observation. Hence, we use LPIPS as our primary perceptual quality metric and PI as the secondary one.
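For reference, a minimal sketch of how such LPIPS scores are typically computed with the publicly available lpips package (AlexNet backbone, as mentioned above) is shown below; the tensors are placeholders for actual SR/HR image pairs.

```python
import torch
import lpips  # pip install lpips

# Pre-trained AlexNet-based LPIPS metric.
loss_fn = lpips.LPIPS(net='alex')

# sr and hr: (N, 3, H, W) RGB tensors scaled to [-1, 1], as the package expects.
sr = torch.rand(1, 3, 128, 128) * 2 - 1
hr = torch.rand(1, 3, 128, 128) * 2 - 1

with torch.no_grad():
    distance = loss_fn(sr, hr)  # lower means perceptually closer to the HR image
print(distance.item())
```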
Quantitative Comparisons
We present the quantitative comparisons between our meth