KOALAnet: Blind Super-Resolution using Kernel-Oriented Adaptive Local Adjustment
Abstract
Blind super-resolution (SR) methods aim to generate a high quality high resolution image from a low resolution image containing unknown degradations. However, natural images contain various types and amounts of blur: some may be due to the inherent degradation characteristics of the camera, but some may even be intentional, for aesthetic purposes (e.g. Bokeh effect). In the case of the latter, it becomes highly difficult for SR methods to disentangle the blur to remove, and that to leave as is. In this paper, we propose a novel blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns spatially-variant degradation and restoration kernels in order to adapt to the spatially-variant blur characteristics in real images. Our KOALAnet outperforms recent blind SR methods for synthesized LR images obtained with randomized degradations, and we further show that the proposed KOALAnet produces the most natural results for artistic photographs with intentional blur, which are not over-sharpened, by effectively handling images mixed with in-focus and out-of-focus areas.
Introduction
When a deep neural network is trained under a specific scenario, its generalization ability tends to be limited to that particular setting, and its performance deteriorates under different conditions. This is a major problem in single image super-resolution (SR), where, until very recently, most neural-network-based methods focused on upscaling low resolution (LR) images to high resolution (HR) images solely under the bicubic downsampling setting. Naturally, their performance tends to drop severely if the input LR image is degraded by even a slightly different downsampling kernel, which is often the case in real images. Hence, more recent SR methods aim for blind SR, where the true degradation kernels are unknown. However, this unknown blur may be of various types with different characteristics. Often, images are captured with different depths of field (DoF) by manipulating the aperture sizes and focal lengths of camera lenses for aesthetic purposes (e.g. Bokeh effect), as shown in Fig. dof_imgs. Recent mobile devices even try to simulate this synthetically (e.g. portrait mode) for artistic effects. Although a camera-specific degradation could be spatially-equivariant (similar to the way LR images are generated for SR), the blur generated due to the DoF of the camera would be spatially-variant, where some areas are in focus and others are out of focus.
These types of LR images are extremely challenging for SR, since ideally, the intentional blur must be left unaltered (should not be over-sharpened) to maintain the photographer's intent after SR. However, the SR results of such images are yet to be analyzed in literature.
In this paper, we propose a blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns the degradation and restoration kernels. The KOALAnet consists of two networks: a downsampling network that estimates blur kernels, and an upsampling network that fuses this information by mapping the predicted degradation kernels to the feature kernel space, predicting degradation-specific local feature adjustment parameters that are applied by local filtering on the SR feature maps. After training under a random anisotropic Gaussian degradation setting, our KOALAnet is able to accurately predict the underlying degradation kernels and effectively leverage this information for SR. Moreover, it demonstrates good generalization ability on historic images containing unknown degradations compared to previous blind SR methods. We further provide comparisons on real aesthetic DoF images, and show that our KOALAnet effectively handles images with intentional blur. Our contributions are three-fold:
- We propose a blind SR framework that jointly learns spatially-variant degradation and restoration kernels. The restoration (upsampling) network leverages novel KOALA modules to adaptively adjust the SR features based on the predicted degradation kernels. The KOALA modules are extensible, and can be inserted into any CNN architecture for image restoration tasks.
- We empirically show that the proposed KOALAnet outperforms the recent state-of-the-art blind SR methods for synthesized LR images obtained under randomized degradation conditions, as well as for historic LR images with unknown degradations.
- We are the first to analyze SR results on images that mix in-focus and out-of-focus regions, showing that our KOALAnet is able to discern intentionally blurry areas and process them accordingly, leaving the photographer's intent unchanged after SR.
Related Work
Since the first CNN-based SR method by Dong et al., highly sophisticated deep learning networks have been proposed for image SR, achieving remarkable quantitative or qualitative performance. In particular, Wang et al. introduced feature-level affine transformation based on a segmentation prior to generate class-specific texture in the SR result. Although these methods perform promisingly under the ideal bicubic-degraded setting, they tend to produce over-sharpened or blurry results if the degradations present in the test images deviate from bicubic degradation.
Recent methods handling multiple types of degradations can be categorized into non-blind SR, where the LR images are coupled with the ground truth degradation information (blur kernel or noise level), and blind SR, where only the LR images are given and the degradation information must be estimated. Among the former, Zhang et al. provided the principal components of the Gaussian blur kernel and the level of additive Gaussian noise by concatenating them with the LR input for degradation-aware SR. Xu et al. integrated the degradation information in the same way, but with a backbone network using dynamic upsampling filters, raising the SR performance. However, these methods require ground truth blur information at test time, which is unrealistic for practical application scenarios. Among blind SR methods that predict the degradation information, an inspiring work by Gu et al. inserted spatial feature transform modules in the CNN architecture to integrate the degradation information with iterative kernel correction. However, the iterative framework can be time-consuming, since the entire framework must be repeated many times during inference, and the optimal number of iteration loops varies among input images, requiring human intervention for maximal performance. Furthermore, their network generates vector kernels that are eventually stretched with repeated values to be inserted into the SR network, limiting the capability to model local degradation characteristics.
Another prominent work is KernelGAN, which generates downscaled LR images by learning the internal patch distribution of the test LR image. The downscaled LR patches, or the kernel information, and the original test LR images are then plugged into zero-shot SR or non-blind SR methods. There also exist methods that employ GANs to generate realistic kernels for data augmentation, or learn to synthesize LR images along with the SR image. In comparison, our downsampling network predicts the underlying blur kernels that are used to modulate and locally filter the upsampling features.
Jia et al. first proposed dynamic filter networks that generate image- and location-specific filters, which filter the input images in a locally adaptive manner to better handle the non-stationary property of natural images, in contrast to conventional convolution layers with spatially-equivariant filters. Application-wise, Niklaus et al. and Jo et al. successfully employed dynamic filtering networks for video frame interpolation and video SR, respectively. The recent non-blind SR method by Xu et al. also employed a two-branch dynamic upsampling architecture. However, the provided ground truth degradation kernel is still restricted to spatially uniform kernels and is entered naively by simple concatenation, unlike our proposed KOALAnet, which estimates the underlying blur kernels from input LR images and effectively integrates this information for SR.
Proposed Method
We propose a blind SR framework with (i) a downsampling network that predicts spatially-variant degradation kernels, and (ii) an upsampling network that contains KOALA modules, which adaptively fuse the degradation kernel information for enhanced blind SR.
Downsampling Network
During training, an LR image, $ X $ , is generated by applying a random anisotropic Gaussian blur kernel, $ k_{g} $ , on an HR image, $ Y $ , and downsampling it with the bicubic kernel, $ k_{b} $ , given as $ X = (Y * k_{g} * k_{b})\downarrow_s $ , where $ \downarrow_s $ denotes downsampling by scale factor $ s $ . Hence, the downsampling kernel $ k_{d} $ can be obtained as $ k_{d}=k_{g}*k_{b} $ , and the degradation process can be implemented by an $ s $ -stride convolution of $ Y $ by $ k_{d} $ . We believe that anisotropic Gaussian kernels are a more suitable choice than isotropic Gaussian kernels for blind SR, as anisotropic kernels are the more general superset. We do not apply any additional anti-aliasing measures (like in the default Matlab imresize function), since $ Y $ is already low-pass filtered by $ k_g $ . The downsampling network, shown in the upper part of Fig. net_arch, takes a degraded LR RGB image, $ X $ , as input, and aims to predict the underlying degradation kernel that is assumed to have been used to obtain $ X $ from its HR counterpart, $ Y $ , through a U-Net-based architecture with ResBlocks. The output, $ F_d $ , is a 3D tensor of size $ H\times W\times 400 $ , composed of $ 20\times20 $ local filters at every $ (h, w) $ pixel location. The local filters are normalized to have a sum of 1 (denoted as Normalize in Fig. net_arch) by subtracting each of their mean values and adding a bias of $ 1/400 $ . With $ F_d $ , the LR image, $ \hat{X} $ , can be reconstructed by $ \hat{X} = Y \oast\downarrow_s F_d $ , where $ \oast\downarrow_s $ represents $ 20\times20 $ local filtering at each pixel location with stride $ s $ , as illustrated in Fig. net_arch.
For training, we propose to use an LR reconstruction loss, $ L_r=l_1(\hat{X}, X) $ , which indirectly enforces the downsampling network to predict a valid downsampling kernel at each pixel location based on image prior. To bring flexibility to the spatially-variant kernel estimation, the loss with the ground truth kernel is only applied to the spatial-wise mean of $ F_d $ . Then, the total loss for the downsampling network is given as $ L_d = l_1(\hat{X}, X) + l_1(E_{hw}[F_d], k_d) $ , where $ E_{hw}[\cdot] $ denotes a spatial-wise mean over $ (h, w) $ , and $ k_d $ is reshaped to $ 1\times1\times400 $ from the original size of $ 20\times 20 $ . Estimating the blur kernel for a smooth region in an LR image is difficult, since dissimilar blur kernels may produce similar smooth pixel values. Consequently, if the network aims to directly predict the true blur kernel, the gradient of a kernel matching loss may not back-propagate a desirable signal. Meanwhile, for highly textured regions of HR images, the induced LR images are largely influenced by the blur kernels, which enables the downsampling network to find inherent degradation cues from the LR images. In this case, the degradation information can also be highly helpful in reconstructing the SR image, since most of the SR reconstruction error tends to occur in these regions.
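The degradation model $ k_{d}=k_{g}*k_{b} $ followed by an $ s $ -stride filtering can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the Gaussian widths and angle, the Keys cubic coefficient $ a=-0.5 $ , the symmetric zero-padding from $ 18\times18 $ to $ 20\times20 $ , and the edge padding of the image are all assumptions made for illustration.

```python
import numpy as np

def anisotropic_gaussian(size=15, sigma_x=3.0, sigma_y=1.0, theta=0.6):
    """Rotated bivariate Gaussian kernel k_g, normalized to sum to 1 (values are illustrative)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    c, s = np.cos(theta), np.sin(theta)
    xr, yr = c * xx + s * yy, -s * xx + c * yy      # rotate the coordinate grid
    k = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    return k / k.sum()

def bicubic_kernel_4x4(a=-0.5):
    """4x4 bicubic kernel k_b sampled at tap distances 1.5, 0.5, 0.5, 1.5 (assumed cubic with a=-0.5)."""
    d = np.array([1.5, 0.5, 0.5, 1.5])
    w = np.where(d <= 1,
                 (a + 2) * d**3 - (a + 3) * d**2 + 1,
                 a * d**3 - 5 * a * d**2 + 8 * a * d - 4 * a)
    return np.outer(w, w)

def conv2d_full(a, b):
    """Full 2D convolution of two small kernels."""
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            out[i:i + a.shape[0], j:j + a.shape[1]] += b[i, j] * a
    return out

def degrade(Y, k_d, s):
    """s-stride filtering of a single-channel image Y by k_d.
    Implemented as correlation; a strict convolution would flip k_d first."""
    kh, kw = k_d.shape
    ph, pw = kh // 2, kw // 2
    Yp = np.pad(Y, ((ph, kh - 1 - ph), (pw, kw - 1 - pw)), mode="edge")
    H, W = Y.shape[0] // s, Y.shape[1] // s
    X = np.zeros((H, W))
    for h in range(H):
        for w in range(W):
            X[h, w] = np.sum(Yp[h * s:h * s + kh, w * s:w * s + kw] * k_d)
    return X

k_g = anisotropic_gaussian()                               # 15x15
k_d = np.pad(conv2d_full(k_g, bicubic_kernel_4x4()), 1)    # 18x18, zero-padded to 20x20
X = degrade(np.random.default_rng(0).random((32, 32)), k_d, s=4)
```

Because both $ k_{g} $ and $ k_{b} $ sum to 1, so does $ k_{d} $ , which means a constant image is left unchanged by the degradation.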
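The filter normalization, the strided local filtering $ \oast\downarrow_s $ , and the loss $ L_d $ described above can be sketched as follows. This is a hedged NumPy illustration only. The function names, the channel-last array layout, the edge padding, and the use of a mean-reduced $ l_1 $ are assumptions, and the raw network output is replaced here by random numbers.

```python
import numpy as np

def normalize_filters(raw):
    """Make each local filter (last axis) sum to 1: subtract its mean, add 1/n."""
    n = raw.shape[-1]
    return raw - raw.mean(axis=-1, keepdims=True) + 1.0 / n

def local_filter_downsample(Y, F_d, s, ksize=20):
    """X_hat[h, w] = inner product of F_d[h, w] with the ksize x ksize patch of Y
    located at stride-s position (h, w); the same filter is used for all channels."""
    H, W, _ = F_d.shape
    p = ksize // 2
    Yp = np.pad(Y, ((p, ksize - 1 - p), (p, ksize - 1 - p), (0, 0)), mode="edge")
    X_hat = np.zeros((H, W, Y.shape[2]))
    for h in range(H):
        for w in range(W):
            patch = Yp[h * s:h * s + ksize, w * s:w * s + ksize, :]
            X_hat[h, w] = (patch * F_d[h, w].reshape(ksize, ksize, 1)).sum(axis=(0, 1))
    return X_hat

def l1(a, b):
    """Mean absolute error (assumed reduction)."""
    return np.mean(np.abs(a - b))

def downsampling_loss(F_d, k_d, X_hat, X):
    """L_d = l1(X_hat, X) + l1(E_hw[F_d], k_d), with k_d flattened to 400 values."""
    mean_kernel = F_d.mean(axis=(0, 1))          # spatial-wise mean over (h, w)
    return l1(X_hat, X) + l1(mean_kernel, k_d.reshape(-1))

rng = np.random.default_rng(0)
F_d = normalize_filters(rng.normal(size=(4, 4, 400)))    # stand-in for the network output
Y = rng.random((8, 8, 3))                                 # HR image for s = 2
X_hat = local_filter_downsample(Y, F_d, s=2)
```

Only the spatial mean of $ F_d $ is compared with $ k_d $ , so individual pixel locations remain free to deviate from the ground truth kernel, which is the flexibility the text refers to.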
Upsampling Network
We consider the upsampling process to be the inverse of the downsampling process, and thus design an upsampling network in correspondence with the downsampling network, as shown in Fig. net_arch. The upsampling network takes in the degraded LR input, $ X $ , of size $ H\times W\times 3 $ , and generates an SR output, $ \hat{Y} $ , of size $ sH\times sW\times 3 $ , where $ s $ is a scale factor. In the early convolution layers of the upsampling network, the SR feature maps are adjusted by five cascaded KOALA modules, K, which are explained in detail in the next section. Then, after seven cascaded residual blocks, R, the resulting feature map, $ f_u $ , is given by $ f_u = (RL\circ R^7\circ K^5\circ Conv)(X) $ , where RL is ReLU activation. $ f_u $ is fed separately into a residual branch and a filter generation branch, where the residual map, $ r $ , and local upsampling filters, $ F_u $ , are obtained. For $ s=4 $ , the residual branch contains two pixel shufflers (PS), each of scale 2, and the filter generation branch applies Normalize, which normalizes each $ 5\times 5 $ local filter by subtracting its mean and adding a bias of $ 1/25 $ . The second PS and its preceding convolution layer are removed when generating $ r $ for $ s=2 $ . When applying the generated $ F_u $ of size $ H\times W\times(25\times s\times s) $ on the input $ X $ , $ F_u $ is split into $ s\times s $ tensors in the channel direction, and each $ H\times W\times 25 $ chunk is interpreted as a $ 5\times 5 $ local filter at every $ (h, w) $ pixel location. The filters are applied on $ X $ (the same filters for all RGB channels) by computing the local inner product at the corresponding grid position $ (h, w) $ . After filtering with all of the $ s\times s $ chunks, the produced $ H\times W\times(s\times s\times 3) $ tensor is pixel-shuffled with scale $ s $ to generate the enlarged $ \tilde{Y} $ of size $ sH\times sW\times 3 $ .
Finally, $ \hat{Y} $ is computed as $ \hat{Y}=\tilde{Y}+r $ , and the upsampling network is trained with $ l_1(\hat{Y}, Y) $ .
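The application of $ F_u $ to $ X $ and the subsequent pixel shuffle can be sketched in NumPy as below. This is an illustrative sketch, not the authors' code. The function names, the channel ordering inside $ F_u $ (chunks stored contiguously, 25 taps each), the pixel-shuffle layout, and the edge padding are assumptions.

```python
import numpy as np

def pixel_shuffle(t, s):
    """Rearrange (H, W, s*s*C) into (s*H, s*W, C); chunk i*s+j goes to sub-pixel (i, j)."""
    H, W, ch = t.shape
    C = ch // (s * s)
    t = t.reshape(H, W, s, s, C)
    return t.transpose(0, 2, 1, 3, 4).reshape(H * s, W * s, C)

def dynamic_upsample(X, F_u, s, ksize=5):
    """Apply s*s per-pixel ksize x ksize filters to X (shared across RGB), then pixel-shuffle."""
    H, W, C = X.shape
    p = ksize // 2
    Xp = np.pad(X, ((p, p), (p, p), (0, 0)), mode="edge")
    filters = F_u.reshape(H, W, s * s, ksize * ksize)      # split into s*s chunks of 25 taps
    out = np.zeros((H, W, s * s, C))
    for h in range(H):
        for w in range(W):
            patch = Xp[h:h + ksize, w:w + ksize, :].reshape(ksize * ksize, C)
            out[h, w] = filters[h, w] @ patch              # (s*s, 25) @ (25, C)
    return pixel_shuffle(out.reshape(H, W, s * s * C), s)

# Sanity check with "delta" filters (centre tap = 1): the result is nearest-neighbour upsampling.
s = 2
X = np.random.default_rng(0).random((3, 4, 3))
delta = np.zeros((3, 4, s * s, 25))
delta[..., 12] = 1.0                                       # centre of a 5x5 filter
Y_tilde = dynamic_upsample(X, delta.reshape(3, 4, s * s * 25), s)
```

In the full network, the residual map $ r $ would then be added to obtain $ \hat{Y}=\tilde{Y}+r $ .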
Method ( $ \times2 $ ) | Set5 (PSNR/SSIM) | Set14 (PSNR/SSIM) | BSD100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM) | DIV2K-val (PSNR/SSIM) | DIV2KRK (PSNR/SSIM) | Complexity (Time (s)/GFLOPs) |
---|---|---|---|---|---|---|---|---|
Bicubic | 27.11/0.7850 | 26.00/0.7222 | 26.09/0.6838 | 22.82/0.6537 | 24.87/0.7911 | 28.27/0.7835 | 28.73/0.8040 | - / - |
ZSSR | 27.30/0.7952 | 26.55/0.7402 | 26.46/0.7020 | 23.13/0.6706 | 25.43/0.8041 | 28.69/0.7958 | 29.10/0.8215 | 24.69/ |
KernelGAN + ZSSR | 27.35/0.7839 | 24.57/0.7061 | 25.56/0.6990 | 23.12/0.6907 | 25.99/0.8270 | 27.66/0.7892 | / | 230.66/10,219 |
BlindSR | / | / | / | / | / | / | 29.44/0.8464 | /13,910 |
Ours | 33.08/0.9137 | 30.35/0.8568 | 29.70/0.8248 | 27.19/0.8318 | 32.61/0.9369 | 32.55/0.8902 | 31.89/0.8852 | 0.71/201 |

Method ( $ \times4 $ ) | Set5 (PSNR/SSIM) | Set14 (PSNR/SSIM) | BSD100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM) | DIV2K-val (PSNR/SSIM) | DIV2KRK (PSNR/SSIM) | Complexity (Time (s)/GFLOPs) |
---|---|---|---|---|---|---|---|---|
Bicubic | 26.41/0.7511 | 24.73/0.6641 | 25.12/0.6321 | 22.04/0.6061 | 23.60/0.7482 | 27.04/0.7417 | 25.33/0.6795 | - / - |
ZSSR | 26.49/0.7530 | 24.93/0.6812 | 25.36/0.6526 | 22.39/0.6327 | 24.43/0.7813 | 27.39/0.7590 | 25.61/0.6911 | 16.91/6,091 |
KernelGAN + ZSSR | 22.12/0.5989 | 19.73/0.5194 | 21.02/0.5377 | 20.12/0.5743 | 22.61/0.7345 | 23.75/0.6830 | 26.81/0.7316 | 357.70/11,908 |
IKC (last) | 27.73/0.8024 | 25.38/0.7162 | 25.68/0.6844 | 23.03/0.6852 | 25.44/0.8273 | 27.61/0.7843 | 27.39/ | / |
IKC (max) | / | / | / | / | / | / | /0.7684 | / |
Ours | 30.28/0.8658 | 27.20/0.7541 | 26.97/0.7172 | 24.71/0.7427 | 28.48/0.8814 | 29.44/0.8156 | 27.77/0.7637 | 0.59/57 |
Quantitative comparison on various datasets. We also provide a comparison on computational complexity in terms of the average inference time on Set5, and GFLOPs on baby
We propose a novel feature transformation module, KOALA, that adaptively adjusts the intermediate features in the upsampling network based on the degradation kernels predicted by the downsampling network. The KOALA modules are placed at the earlier stage of feature extraction in order to calibrate the anisotropically degraded LR features before the reconstruction phase.
Specifically, when the input feature, $ x $ , is entered into a KOALA module, K, it goes through 2 convolution layers, and is adjusted by a set of multiplicative parameters, $ m $ , followed by a set of local kernels, $ k $ , generated based on the predicted degradation kernels, $ F_d $ . Instead of directly feeding $ F_d $ into K, the kernel features, $ f_d $ , extracted from $ F_d $ by 3 convolution layers, are entered. After a local residual connection, the output, $ y $ , of the KOALA module is the sum of $ x $ and the adjusted features, i.e. the output of the 2 convolution layers multiplied by $ m $ and then filtered by $ k $ (Eq. 4), where $ \otimes $ and $ \oast $ denote element-wise multiplication and local feature filtering, respectively. For generating $ k $ , $ 1\times1 $ convolutions are employed so that spatially adjacent values of the kernel features, $ f_d $ , are not mixed by convolution operations. The kernel values of $ k $ are constrained to have a sum of 1 (Normalize), as for $ F_d $ and $ F_u $ . The local feature filtering operation, $ \oast $ , is applied by first reshaping a $ 1\times 1\times 49 $ vector at each grid position $ (h, w) $ to a $ 7\times 7 $ 2D local kernel, and then computing the local inner product at each $ (h, w) $ position of the input feature. Since the same $ 7\times 7 $ kernels are applied channel-wise, the multiplicative parameter, $ m $ , introduces element-wise scaling for the features over the channel depth. This is also efficient in terms of the number of parameters, compared to predicting the per-pixel local kernels for every channel ( $ 49+64 $ vs. $ 49\times64 $ filter parameters). By placing the residual connection after the feature transformations (Eq. 4), the adjustment parameters can be considered as removing the unwanted feature residuals related to degradation from the original input features.
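The adjustment performed inside a KOALA module can be sketched as follows, assuming 64 feature channels as implied by the $ 49+64 $ parameter count. This is a minimal NumPy illustration under stated assumptions. The function names, zero padding at the borders, and the array layout are hypothetical, and `conv_out` stands in for the output of the two convolution layers, which are not modeled here.

```python
import numpy as np

def local_feature_filter(feat, k, ksize=7):
    """Apply a per-pixel ksize x ksize kernel (k has shape H x W x 49), shared by all channels."""
    H, W, C = feat.shape
    p = ksize // 2
    fp = np.pad(feat, ((p, p), (p, p), (0, 0)), mode="constant")   # zero padding (assumed)
    out = np.zeros_like(feat)
    for h in range(H):
        for w in range(W):
            patch = fp[h:h + ksize, w:w + ksize, :].reshape(ksize * ksize, C)
            out[h, w] = k[h, w] @ patch                              # (49,) @ (49, C) -> (C,)
    return out

def koala_adjust(x, conv_out, m, k):
    """y = x + local_filter(conv_out * m, k): element-wise scaling by m, local filtering by k,
    then the local residual connection."""
    return x + local_feature_filter(conv_out * m, k)

rng = np.random.default_rng(0)
x = rng.random((5, 6, 64))
conv_out = rng.random((5, 6, 64))        # placeholder for the two convolution layers
m = np.ones((5, 6, 64))                  # per-pixel, per-channel multiplicative parameters
k = np.zeros((5, 6, 49))
k[..., 24] = 1.0                         # identity kernel: centre tap of 7x7, sums to 1
y = koala_adjust(x, conv_out, m, k)

params_shared = 49 + 64                  # one 7x7 kernel plus 64 channel scales per pixel
params_per_channel = 49 * 64             # a separate 7x7 kernel for each of 64 channels
```

With identity parameters the module reduces to a plain residual block output, $ y = x + $ `conv_out`, which makes the role of $ m $ and $ k $ as degradation-dependent corrections easier to see.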
Training Strategy
We employ a 3-stage training strategy: (i) the downsampling network is pre-trained with $ l_1(\hat{X}, X) $ ; (ii) the upsampling network is pre-trained with $ l_1(\hat{Y}, Y) $ by replacing all KOALA modules with ResBlocks; (iii) the whole framework (KOALAnet) including the KOALA modules (with convolution layers needed for generating $ f_d $ , $ m $ and $ k $ inserted on the pre-trained ResBlocks) is jointly optimized based on $ l_1(\hat{X}, X)+l_1(\hat{Y}, Y) $ . With this strategy, the KOALA modules can be effectively trained with already meaningful features obtained from the early training phases, and focus on utilizing the degradation kernel cues for SR.
Experiment Results
In our implementations, $ k_{d} $ of size $ 20\times20 $ is computed by convolving $ k_{b} $ with a random anisotropic Gaussian kernel ( $ k_{g} $ ) of size $ 15\times15 $ , following the degradation process described in the Downsampling Network section. It should be noted that $ k_{b} $ is originally a bicubic downscaling kernel of size $ 4\times4 $ , the same as in the Matlab imresize function without anti-aliasing, but is zero-padded to $ 20\times20 $ to align with the size of $ k_{d} $ as well as to avoid image shift. The Gaussian kernels for degradation are generated by randomly rotating a biva