KOALAnet: Blind Super-Resolution using Kernel-Oriented Adaptive Local Adjustment



Blind super-resolution (SR) methods aim to generate a high-quality, high-resolution image from a low-resolution image containing unknown degradations. However, natural images contain various types and amounts of blur: some may be due to the inherent degradation characteristics of the camera, but some may even be intentional, for aesthetic purposes (e.g., the Bokeh effect). In the latter case, it becomes highly difficult for SR methods to disentangle the blur that should be removed from the blur that should be left as is. In this paper, we propose a novel blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns spatially-variant degradation and restoration kernels in order to adapt to the spatially-variant blur characteristics in real images. Our KOALAnet outperforms recent blind SR methods for synthesized LR images obtained with randomized degradations, and we further show that the proposed KOALAnet produces the most natural results for artistic photographs with intentional blur, which are not over-sharpened, by effectively handling images mixed with in-focus and out-of-focus areas.




When a deep neural network is trained under a specific scenario, its generalization ability tends to be limited to that particular setting, and its performance deteriorates under different conditions. This is a major problem in single image super-resolution (SR), where, until very recently, most neural-network-based methods have focused on the upscaling of low resolution (LR) images to high resolution (HR) images solely under the bicubic degradation setting. Naturally, their performance tends to drop severely if the input LR image is degraded by even a slightly different downsampling kernel, which is often the case in real images. Hence, more recent SR methods aim for blind SR, where the true degradation kernels are unknown. However, this unknown blur may be of various types with different characteristics. Often, images are captured with a different depth-of-field (DoF) by manipulating the aperture sizes and the focal lengths of camera lenses, for aesthetic purposes (e.g., the Bokeh effect) as shown in Fig. dof_imgs. Recent mobile devices even try to simulate this synthetically (e.g., portrait mode) for artistic effects. Although a camera-specific degradation could be spatially-equivariant (similar to the way LR images are generated for SR training), the blur generated due to the DoF of the camera would be spatially-variant, where some areas are in focus, and others are out of focus.



These types of LR images are extremely challenging for SR, since ideally, the intentional blur must be left unaltered (should not be over-sharpened) to maintain the photographer's intent after SR. However, the SR results of such images are yet to be analyzed in literature.


In this paper, we propose a blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns the degradation and restoration kernels. The KOALAnet consists of two networks: a downsampling network that estimates blur kernels, and an upsampling network that fuses this information by mapping the predicted degradation kernels to the feature kernel space, predicting degradation-specific local feature adjustment parameters that are applied on the SR feature maps by channel-wise multiplication and spatially-variant local filtering. After training under a random anisotropic Gaussian degradation setting, our KOALAnet is able to accurately predict the underlying degradation kernels and effectively leverage this information for SR. Moreover, it demonstrates good generalization ability on historic images containing unknown degradations compared to previous blind SR methods. We further provide comparisons on real aesthetic DoF images, and show that our KOALAnet effectively handles images with intentional blur. Our contributions are three-fold:


  • We propose a blind SR framework that jointly learns spatially-variant degradation and restoration kernels. The restoration (upsampling) network leverages novel KOALA modules to adaptively adjust the SR features based on the predicted degradation kernels. The KOALA modules are extensible, and can be inserted into any CNN architecture for image restoration tasks.
  • We empirically show that the proposed KOALAnet outperforms the recent state-of-the-art blind SR methods for synthesized LR images obtained under randomized degradation conditions, as well as for historic LR images with unknown degradations.
  • We are the first to analyze SR results on images mixed with in-focus and out-of-focus regions, showing that our KOALAnet is able to discern intentionally blurry areas and process them accordingly, leaving the photographer's intent unchanged after SR.





Since the first CNN-based SR method by Dong et al., highly sophisticated deep learning networks have been proposed for image SR, achieving remarkable quantitative and qualitative performance. Notably, Wang et al. introduced a feature-level affine transformation based on a segmentation prior in order to generate class-specific texture in the SR result. Although these methods perform promisingly under the ideal bicubic-degraded setting, they tend to produce over-sharpened or blurry results if the degradations present in the test images deviate from bicubic degradation.


Recent methods handling multiple types of degradations can be categorized into non-blind SR, where the LR images are coupled with the ground truth degradation information (blur kernel or noise level), and blind SR, where only the LR images are given, and the degradation information must be estimated. Among the former, Zhang et al. provided the principal components of the Gaussian blur kernel and the level of additive Gaussian noise by concatenating them with the LR input for degradation-aware SR. Xu et al. also integrated the degradation information in the same way, but with a backbone network using dynamic upsampling filters, raising the SR performance. However, these methods require ground truth blur information at test time, which is unrealistic for practical application scenarios. Among blind SR methods that predict the degradation information, an inspiring work by Gu et al. inserted spatial feature transform modules into the CNN architecture to integrate the degradation information with iterative kernel correction. However, the iterative framework can be time-consuming, since the entire framework must be repeated many times during inference, and the optimal number of iteration loops varies among input images, requiring human intervention for maximal performance. Furthermore, their network generates vector kernels that are eventually stretched with repeated values before being inserted into the SR network, limiting the capability of modeling local degradation characteristics.




Another prominent work is KernelGAN, which generates downscaled LR images by learning the internal patch distribution of the test LR image. The downscaled LR patches, or the kernel information, and the original test LR images are then plugged into zero-shot SR or non-blind SR methods. There also exist methods that employ GANs to generate realistic kernels for data augmentation, or that learn to synthesize LR images along with the SR image. In comparison, our downsampling network predicts the underlying blur kernels that are used to modulate and locally filter the upsampling features. Jia et al. first proposed dynamic filter networks that generate image- and location-specific filters, which filter the input images in a locally adaptive manner to better handle the non-stationary property of natural images, in contrast to conventional convolution layers with spatially-equivariant filters. Application-wise, Niklaus et al. and Jo et al. successfully employed dynamic filtering networks for video frame interpolation and video SR, respectively. The recent non-blind SR method by Xu et al. also employed a two-branch dynamic upsampling architecture. However, the provided ground truth degradation kernel is still restricted to spatially-uniform kernels and is entered naively by simple concatenation, unlike our proposed KOALAnet that estimates the underlying blur kernels from input LR images and effectively integrates this information for SR.


Proposed Method

We propose a blind SR framework with (i) a downsampling network that predicts spatially-variant degradation kernels, and (ii) an upsampling network that contains KOALA modules, which adaptively fuses the degradation kernel information for enhanced blind SR.



Downsampling Network

During training, an LR image, $ X $ , is generated by applying a random anisotropic Gaussian blur kernel, $ k_{g} $ , on an HR image, $ Y $ , and downsampling it with the bicubic kernel, $ k_{b} $ , given as $ X = (Y * k_{g} * k_{b})\downarrow_s $ , where $ * $ denotes convolution and $ \downarrow_s $ denotes downsampling by scale factor $ s $ . Hence, the downsampling kernel $ k_{d} $ can be obtained as $ k_{d}=k_{g}*k_{b} $ , and the degradation process can be implemented by an $ s $ -stride convolution of $ Y $ by $ k_{d} $ . We believe that anisotropic Gaussian kernels are a more suitable choice than isotropic Gaussian kernels for blind SR, as anisotropic kernels are the more generalized superset. We do not apply any additional anti-aliasing measures (like in the default Matlab imresize function), since $ Y $ is already low-pass filtered by $ k_g $ . The downsampling network, shown in the upper part of Fig. net_arch, takes a degraded LR RGB image, $ X $ , as input, and aims to predict, through a U-Net-based architecture with ResBlocks, the underlying degradation kernel that is assumed to have been used to obtain $ X $ from its HR counterpart, $ Y $ . The output, $ F_d $ , is a 3D tensor of size $ H\times W\times 400 $ , composed of $ 20\times20 $ local filters at every $ (h, w) $ pixel location. The local filters are normalized to have a sum of 1 (denoted as Normalize in Fig. net_arch) by subtracting each of their mean values and adding a bias of $ 1/400 $ . With $ F_d $ , the LR image, $ \hat{X} $ , can be reconstructed as $ \hat{X} = F_d \oast\downarrow_s Y $ , where $ \oast\downarrow_s $ represents $ 20\times20 $ local filtering at each pixel location with stride $ s $ , as illustrated in Fig. net_arch.
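The degradation process above ($s$-stride convolution of $Y$ by $k_d$) can be sketched in a few lines of NumPy. This is a minimal single-channel illustration for a symmetric, pre-normalized kernel, not the actual training pipeline:

```python
import numpy as np

def degrade(Y, k_d, s):
    """Sketch of the degradation X = (Y * k_d) downsampled by s,
    implemented as an s-stride valid convolution of the HR image Y
    with the downsampling kernel k_d (correlation == convolution
    here since k_d is assumed symmetric)."""
    kh, kw = k_d.shape
    H, W = Y.shape
    out_h = (H - kh) // s + 1
    out_w = (W - kw) // s + 1
    X = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = Y[i * s:i * s + kh, j * s:j * s + kw]
            X[i, j] = np.sum(patch * k_d)  # local inner product
    return X
```

For example, a constant HR image degraded with any kernel that sums to 1 remains constant, which is why the predicted local filters are normalized to a sum of 1.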



For training, we propose to use an LR reconstruction loss, $ L_r=l_1(\hat{X}, X) $ , which indirectly enforces the downsampling network to predict a valid local blur kernel at each pixel location based on the image prior. To bring flexibility into the spatially-variant kernel estimation, the loss with the ground truth kernel is only applied to the spatial-wise mean of $ F_d $ . The total loss for the downsampling network is then given as $ L_d = L_r + l_1(E_{hw}[F_d], k_d) $ , where $ E_{hw}[\cdot] $ denotes a spatial-wise mean over $ (h, w) $ , and $ k_d $ is reshaped to $ 1\times1\times400 $ from its original size of $ 20\times 20 $ . Estimating the blur kernel for a smooth region in an LR image is difficult, since dissimilar blur kernels may produce similar smooth pixel values. Consequently, if the network aims to directly predict the true blur kernel, the gradient of a kernel matching loss may not back-propagate a desirable signal. Meanwhile, for highly textured regions of HR images, the induced LR images are largely influenced by the blur kernels, which enables the downsampling network to find inherent degradation cues from the LR images. In this case, the degradation information can be highly helpful in reconstructing the SR image as well, since most of the SR reconstruction error tends to occur in these regions.
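The combination of the LR reconstruction loss and the kernel loss on the spatial mean of $F_d$ can be sketched as follows (a NumPy illustration assuming an $l_1$ kernel matching term; shapes and names are for illustration only):

```python
import numpy as np

def downsampling_loss(F_d, X_hat, X, k_d):
    """Sketch of L_d = L_r + l1(E_hw[F_d], k_d).

    F_d  : (H, W, 400) per-pixel predicted 20x20 kernels, flattened
    X_hat: reconstructed LR image, X: ground truth LR image
    k_d  : (20, 20) ground truth downsampling kernel
    """
    # L_r: l1 reconstruction loss between reconstructed and true LR image
    L_r = np.mean(np.abs(X_hat - X))
    # kernel loss applied only to the spatial-wise mean of the kernels
    k_mean = F_d.reshape(-1, F_d.shape[-1]).mean(axis=0)  # E_hw[F_d], (400,)
    L_k = np.mean(np.abs(k_mean - k_d.reshape(-1)))
    return L_r + L_k
```

Because only the spatial mean is matched to $k_d$, individual per-pixel kernels are free to deviate, which is what allows the spatially-variant estimation described above.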


Upsampling Network

We consider the upsampling process to be the inverse of the downsampling process, and thus design the upsampling network in correspondence with the downsampling network, as shown in Fig. net_arch. The upsampling network takes in the degraded LR input, $ X $ , of size $ H\times W\times 3 $ , and generates an SR output, $ \hat{Y} $ , of size $ sH\times sW\times 3 $ , where $ s $ is the scale factor. In the early convolution layers of the upsampling network, the SR feature maps are adjusted by five cascaded KOALA modules, K, which are explained in detail in the next section. Then, after seven cascaded residual blocks, R, the resulting feature map, $ f_u $ , is given by $ f_u = (RL\circ R^7\circ K^5\circ Conv)(X), $ where RL is the ReLU activation. $ f_u $ is fed separately into a residual branch and a filter generation branch, where the residual map, $ r $ , and the local upsampling filters, $ F_u $ , are obtained through separate convolution and pixel shuffling layers for $ s=4 $ , where PS is a pixel shuffler of $ s=2 $ and Normalize denotes normalizing by subtracting the mean and adding a bias of $ 1/25 $ for each $ 5\times 5 $ local filter. The second PS and its preceding convolution layer are removed when generating $ r $ for $ s=2 $ . When applying the generated $ F_u $ of size $ H\times W\times(25\times s\times s) $ on the input $ X $ , $ F_u $ is split into $ s\times s $ tensors in the channel direction, and each chunk of $ H\times W\times 25 $ tensor is interpreted as a $ 5\times 5 $ local filter at every $ (h, w) $ pixel location. These filters are applied on $ X $ (the same filters for all RGB channels) by computing the local inner product at the corresponding grid position $ (h, w) $ . After filtering all of the $ s\times s $ chunks, the produced $ H\times W\times(s\times s\times 3) $ tensor is pixel-shuffled with scale $ s $ to generate the enlarged $ \tilde{Y} $ of size $ sH\times sW\times 3 $ .
Finally, $ \hat{Y} $ is computed as $ \hat{Y}=\tilde{Y}+r $ , and the upsampling network is trained with $ l_1(\hat{Y}, Y) $ .
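The per-pixel local filtering and pixel shuffling described above can be sketched as follows. This is a minimal single-channel NumPy illustration (the actual network predicts $F_u$ with convolution layers and applies the same filters to all RGB channels):

```python
import numpy as np

def dynamic_upsample(X, F_u, s, ksize=5):
    """Apply per-pixel local filters F_u (H, W, ksize*ksize*s*s) to the
    single-channel LR image X (H, W), then pixel-shuffle to (sH, sW)."""
    H, W = X.shape
    p = ksize // 2
    Xp = np.pad(X, p, mode='edge')
    out = np.empty((H, W, s * s))
    for i in range(H):
        for j in range(W):
            patch = Xp[i:i + ksize, j:j + ksize].reshape(-1)  # 5x5 neighborhood
            filt = F_u[i, j].reshape(s * s, ksize * ksize)    # s*s local filters
            out[i, j] = filt @ patch                          # local inner products
    # pixel shuffle: (H, W, s*s) -> (sH, sW)
    return out.reshape(H, W, s, s).transpose(0, 2, 1, 3).reshape(H * s, W * s)
```

As a sanity check, if every predicted filter is a centered delta, the operation reduces to nearest-neighbor upsampling of $X$.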

Method $ \times2 $ Set5 Set14 BSD100 Urban100 Manga109 DIV2K-val DIV2KRK Complexity
Bicubic 27.11/0.7850 26.00/0.7222 26.09/0.6838 22.82/0.6537 24.87/0.7911 28.27/0.7835 28.73/0.8040 - / -
ZSSR 27.30/0.7952 26.55/0.7402 26.46/0.7020 23.13/0.6706 25.43/0.8041 28.69/0.7958 29.10/0.8215 24.69/
KernelGAN 27.35/0.7839 24.57/0.7061 25.56/0.6990 23.12/0.6907 25.99/0.8270 27.66/0.7892 / 230.66/10,219
BlindSR / / / / / / 29.44/0.8464 /13,910
KOALAnet (Ours) 33.08/0.9137 30.35/0.8568 29.70/0.8248 27.19/0.8318 32.61/0.9369 32.55/0.8902 31.89/0.8852 0.71/201
Method $ \times4 $ Set5 Set14 BSD100 Urban100 Manga109 DIV2K-val DIV2KRK Complexity
Bicubic 26.41/0.7511 24.73/0.6641 25.12/0.6321 22.04/0.6061 23.60/0.7482 27.04/0.7417 25.33/0.6795 - / -
ZSSR 26.49/0.7530 24.93/0.6812 25.36/0.6526 22.39/0.6327 24.43/0.7813 27.39/0.7590 25.61/0.6911 16.91/6,091
KernelGAN 22.12/0.5989 19.73/0.5194 21.02/0.5377 20.12/0.5743 22.61/0.7345 23.75/0.6830 26.81/0.7316 357.70/11,908
IKC (last) 27.73/0.8024 25.38/0.7162 25.68/0.6844 23.03/0.6852 25.44/0.8273 27.61/0.7843 27.39/ /
IKC (max) / / / / / / /0.7684 /
KOALAnet (Ours) 30.28/0.8658 27.20/0.7541 26.97/0.7172 24.71/0.7427 28.48/0.8814 29.44/0.8156 27.77/0.7637 0.59/57
Quantitative comparison (PSNR/SSIM) on various datasets. We also provide a comparison of computational complexity in terms of the average inference time (s) on Set5, and GFLOPs on the baby image of Set5.

We propose a novel feature transformation module, KOALA, that adaptively adjusts the intermediate features in the upsampling network based on the degradation kernels predicted by the downsampling network. The KOALA modules are placed at the earlier stage of feature extraction in order to calibrate the anisotropically degraded LR features before the reconstruction phase.




Specifically, when the input feature, $ x $ , is entered into a KOALA module, K, it goes through 2 convolution layers, and is adjusted by a set of multiplicative parameters, $ m $ , followed by a set of local kernels, $ k $ , generated based on the predicted degradation kernels, $ F_d $ . Instead of directly feeding $ F_d $ into K, the kernel features, $ f_d $ , extracted from $ F_d $ after 3 convolution layers, are entered. After a local residual connection, the output, $ y $ , of the KOALA module is given by $ y = (\tilde{x} \otimes m) \oast k + x $ (Eq. 4), where $ \tilde{x} $ denotes the feature obtained from $ x $ after the 2 convolution layers, and $ m $ and $ k $ are generated from $ f_d $ . In Eq. 4, $ \otimes $ and $ \oast $ denote element-wise multiplication and local feature filtering, respectively. For generating $ k $ , $ 1\times1 $ convolutions are employed so that spatially adjacent values of the kernel features, $ f_d $ , are not mixed by convolution operations. The kernel values of $ k $ are constrained to have a sum of 1 (Normalize), as for $ F_d $ and $ F_u $ . The local feature filtering operation, $ \oast $ , is applied by first reshaping the $ 1\times 1\times 49 $ vector at each grid position $ (h, w) $ to a $ 7\times 7 $ 2D local kernel, and then computing the local inner product at each $ (h, w) $ position of the input feature. Since the same $ 7\times 7 $ kernels are applied channel-wise, the multiplicative parameter, $ m $ , introduces element-wise scaling for the features over the channel depth. This is also efficient in terms of the number of parameters, compared to predicting per-pixel local kernels for every channel ( $ 49+64 $ vs. $ 49\times64 $ filter parameters). By placing the residual connection after the feature transformations (Eq. 4), the adjustment parameters can be considered as removing the unwanted feature residuals related to degradation from the original input features.
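The KOALA feature adjustment can be sketched in NumPy as below. The sketch omits the learned convolution layers that produce $\tilde{x}$, $m$, and $k$, and applies the multiplicative scaling, the per-pixel $7\times7$ channel-shared filtering, and the residual connection directly:

```python
import numpy as np

def koala_adjust(x_tilde, x, m, k, ksize=7):
    """Sketch of y = (x_tilde (*) m) (filter) k + x.

    x_tilde: (H, W, C) feature after the 2 convolution layers
    x      : (H, W, C) module input (residual connection)
    m      : (H, W, C) multiplicative parameters (channel-wise scaling)
    k      : (H, W, ksize*ksize) per-pixel kernels, each summing to 1,
             shared across channels
    """
    H, W, C = x_tilde.shape
    p = ksize // 2
    scaled = x_tilde * m                                     # element-wise scaling
    sp = np.pad(scaled, ((p, p), (p, p), (0, 0)), mode='edge')
    y = np.empty_like(x_tilde)
    for i in range(H):
        for j in range(W):
            kern = k[i, j].reshape(ksize, ksize)             # per-pixel 7x7 kernel
            patch = sp[i:i + ksize, j:j + ksize, :]          # 7x7xC neighborhood
            y[i, j] = np.tensordot(kern, patch, axes=([0, 1], [0, 1]))
    return y + x                                             # local residual connection
```

With identity parameters ($m$ all ones, $k$ a centered delta), the module reduces to $y = \tilde{x} + x$, i.e., a plain residual connection, which illustrates why the adjustment can be read as removing degradation-related residuals.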


Training Strategy

We employ a 3-stage training strategy: (i) the downsampling network is pre-trained with $ l_1(\hat{X}, X) $ ; (ii) the upsampling network is pre-trained with $ l_1(\hat{Y}, Y) $ by replacing all KOALA modules with ResBlocks; (iii) the whole framework (KOALAnet) including the KOALA modules (with the convolution layers needed for generating $ f_d $ , $ m $ and $ k $ inserted into the pre-trained ResBlocks) is jointly optimized based on $ l_1(\hat{X}, X)+l_1(\hat{Y}, Y) $ . With this strategy, the KOALA modules can be trained effectively from already meaningful features obtained in the early training phases, and can focus on utilizing the degradation kernel cues for SR.
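The 3-stage schedule can be summarized as a small lookup (an illustrative sketch only; the module and loss names are shorthand for the components described above):

```python
def training_plan(stage):
    """Sketch of which parts train and which losses apply per stage."""
    plans = {
        1: {"train": ["downsampling"],
            "loss": "l1(X_hat, X)"},
        2: {"train": ["upsampling (KOALA modules replaced by ResBlocks)"],
            "loss": "l1(Y_hat, Y)"},
        3: {"train": ["downsampling", "upsampling with KOALA modules"],
            "loss": "l1(X_hat, X) + l1(Y_hat, Y)"},
    }
    return plans[stage]
```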



Experiment Results

In our implementation, $ k_{d} $ of size $ 20\times20 $ is computed by convolving $ k_{b} $ with a random anisotropic Gaussian kernel ( $ k_{g} $ ) of size $ 15\times15 $ , following Eq. degradation. It should be noted that $ k_{b} $ is originally a bicubic downscaling kernel of size $ 4\times4 $ , the same as in the Matlab imresize function with anti-aliasing, but is zero-padded to $ 20\times20 $ to align with the size of $ k_{d} $ as well as to avoid image shift. The Gaussian kernels for degradation are generated by randomly rotating a bivariate Gaussian kernel by $ \theta\sim Uniform(0, \pi/2) $ , and by randomly selecting its kernel widths, determined by a diagonal covariance matrix with $ \sigma_{11}, \sigma_{22}\sim Uniform(0.2, 4.0) $ . With $ k_{d} $ , we build our training data on the DIV2K dataset according to Eq. degradation. Testsets are generated using Set5, Set14, BSD100, Urban100, Manga109 and DIV2K-val for comparison with other methods. When generating the testsets, we ensure that different degradation parameters are selected for different images by assigning different random seed values. We additionally compare on DIV2KRK, which contains randomly degraded DIV2K images.
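The random anisotropic Gaussian kernel sampling described above can be sketched as follows (a NumPy illustration of the stated distributions; the exact discretization in the original implementation may differ):

```python
import numpy as np

def random_aniso_gaussian(size=15, rng=None):
    """Sample an anisotropic Gaussian blur kernel k_g:
    widths sigma_11, sigma_22 ~ U(0.2, 4.0), rotation theta ~ U(0, pi/2)."""
    rng = rng or np.random.default_rng()
    s1, s2 = rng.uniform(0.2, 4.0, size=2)
    theta = rng.uniform(0, np.pi / 2)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    Sigma = R @ np.diag([s1 ** 2, s2 ** 2]) @ R.T       # rotated covariance
    inv = np.linalg.inv(Sigma)
    r = (size - 1) / 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]               # kernel grid coordinates
    pts = np.stack([xx, yy], axis=-1)
    k = np.exp(-0.5 * np.einsum('...i,ij,...j->...', pts, inv, pts))
    return k / k.sum()                                  # normalize to sum 1
```

Convolving such a $k_g$ with the (zero-padded) bicubic kernel $k_b$ then yields the $20\times20$ downsampling kernel $k_d$.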



All convolution filters in the KOALAnet are of size $ 3\times 3 $ with 64 output channels, unless otherwise noted as $ 1\times 1 $ Conv or with the output channels noted next to an operation block in Fig. net_arch. All CNN-based networks used in our experiments are trained with LR patches of size $ 64\times 64 $ normalized to $ [-1, 1] $ , where each patch is randomly cropped and randomly degraded with $ k_d $ during training. The mini-batch size is 8, and the initial learning rate of $ 10^{-4} $ is decreased by a factor of 10 at 80% and 90% of the 200K iterations of each training stage. We consider $ s=2 $ and $ s=4 $ for SR in our experiments.
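The step-decay learning rate schedule described above can be written as a small helper (a sketch of the stated schedule, not code from the original implementation):

```python
def learning_rate(step, total=200_000, base=1e-4):
    """Base LR 1e-4, divided by 10 at 80% and again at 90% of the
    200K iterations of each training stage."""
    if step < 0.8 * total:
        return base
    if step < 0.9 * total:
        return base / 10
    return base / 100
```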




Comparison to Existing Blind SR Methods

We compare our method with the recent state-of-the-art blind SR methods, BlindSR and IKC. For BlindSR, we use the pre-trained model from an independent implementation by one of the authors, which provides only the $ s=2 $ model. For IKC, we use the official pre-trained model released by the authors, which provides only the $ s=4 $ model. We also compare against ZSSR with its default degradation setting, as well as with KernelGAN incorporated to provide the degradation information, both with the official codes. We compare the Y-channel PSNR and SSIM of the various methods on the six random anisotropic degradation testsets as well as on DIV2KRK in Table quant_compar. We also provide a comparison of the average inference time and GFLOPs in the rightmost column, where the inference time in seconds is measured on Set5 with an NVIDIA Titan RTX, excluding file I/O times, and the GFLOPs are computed on the baby image in Set5, which is of $ 512\times512 $ resolution in terms of the HR ground truth. The inference time includes the zero-shot training for ZSSR and the kernel optimization for KernelGAN. The inherent limitation of zero-shot models is that they cannot leverage the abundant training data that is utilized by other methods, as the image-specific CNNs are trained at test time. For IKC, we report the results of the last iteration (IKC last) as well as those producing the maximum PSNR (IKC max) over the total 7 iteration loops. Since IKC is trained under an isotropic setting, its data modeling capability tends to be limited under a more randomized superset of anisotropic blur. Note that BlindSR had been trained under an anisotropic setting like ours. On DIV2KRK, where artificial noise is injected into the synthetic degradation kernels, the internal-learning-based methods ZSSR and KernelGAN are advantageous over the other methods, as they can adapt to these unorthodox kernels. Nevertheless, our method outperforms all compared methods on DIV2KRK even though it was not trained with kernel noise, demonstrating good generalization ability.
On the other testsets, our KOALAnet outperforms the other methods by a large margin of over 1 dB in most cases. We compare the visual results on the randomized anisotropic testset in Fig. qual_comp. We have also visualized the mean of the predicted spatially-variant kernels along with the ground truth kernels. Our method is able to restore sharp edges and high-frequency details. Most importantly, in Fig. real_imgs, we also compare our method under real conditions on old historic images without ground truth labels. In this case, the results are generated using the real-image configuration for ZSSR, and we show the results generated at the last iteration for IKC. Our method performs well even on these real images with unknown degradations, demonstrating good generalization ability.



Results on Aesthetic Images

We collected several shallow DoF images from the Web containing intentional spatially-variant blur for aesthetic purposes, to compare the SR results of existing blind SR methods with ours. Before applying SR, these images are bicubic-downsampled so that we can consider the original images as ground truth, in order to gauge the intended blur characteristics in the original images. As shown in Fig. dof_imgs, IKC and ZSSR with KernelGAN tend to over-sharpen even the intentionally blurry areas that should be left blurry. ZSSR produces blurry results overall, even in the foreground (in-focus) regions. In contrast, our KOALAnet leaves the originally out-of-focus region blurry and appropriately upscales the overall image, yielding results that are closest to the original images. For further analysis, we also compare our KOALAnet with a Baseline that only has the upsampling network of our framework. As shown in Fig. dof_imgs_baseline, the regions with strong blur far away from the in-focus area remain blurry for both methods. However, the Baseline cannot correctly disentangle the intentional blur from the degradation blur in the boundary areas between the in-focus and completely out-of-focus areas, where it can be ambiguous whether the blur should be sharpened or left blurry. With shallow DoF images where only a narrow band of regions is in focus, as in Fig. dof_imgs_baseline, the Baseline tends to produce results with a deeper DoF than the original image due to over-sharpening of the boundary areas.





Ablation Study

In this Section, we analyze the effect of the different components in our framework with various ablation studies, and provide visualizations of estimated blur kernels, local upsampling filters, and local filters in the KOALA modules.

Model   Baseline (upsampler)   KOALA only $ k $   KOALAnet   KOALA +GT kernel
PSNR    29.20                  29.40              29.44      29.67
SSIM    0.8110                 0.8150             0.8156     0.8212

Ablation study on the KOALA module for $ \times $ 4 SR.

We analyzed the effect of the proposed KOALA modules by retraining the following SR models: (i) a Baseline with only the upsampling network, without using any degradation kernel information (no downsampling network, nor KOALA modules), (ii) a model that only has the local filter parameters $ k $ (not the multiplicative parameters) in the KOALA modules, and (iii) a model to which ground truth kernels are given instead of the estimated degradation kernels (KOALA+GT kernel). From the Baseline, adding KOALA modules with only the $ k $ parameters improves PSNR performance by 0.2 dB, and adding the multiplicative parameters further improves the PSNR gain by 0.04 dB, showing the effectiveness of our proposed KOALA modules in incorporating degradation kernel information. The SR performance of the KOALA modules with ground truth kernels can be considered as an upper bound, with 0.23 dB higher PSNR than the KOALAnet. Fig. koala_ab_natural compares the $ \times 4 $ SR results of the Baseline and the KOALAnet. With the predicted degradation kernel information, the KOALA modules help to effectively remove the blur induced by the random degradations, while revealing fine edges ($ 2^{nd} $ column). The images in the $ 3^{rd} $ column show the SR results when incorrect (deliberately larger or smaller) blur kernels are provided to the KOALA modules. In these cases, the wrong kernels cause the upsampling network to produce over-sharpened (for larger kernels) or blurry (for smaller kernels) results. All models were tested on the DIV2K-val testset.
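The core operation of a KOALA module, spatially-variant local filtering of the SR features, can be sketched in a few lines. The following is a minimal NumPy sketch under our own naming and shape assumptions (the function name `koala_local_filtering` is hypothetical, and the paper's network operates on learned feature maps rather than raw arrays): each spatial position of a feature map is filtered with its own $ k \times k $ kernel, shared across channels.

```python
import numpy as np

def koala_local_filtering(feat, kernels):
    """Spatially-variant filtering: each position (y, x) of `feat`
    (C, H, W) is filtered with its own k x k kernel taken from
    `kernels` (H, W, k*k), shared across all C channels.
    Hypothetical sketch, not the paper's implementation."""
    C, H, W = feat.shape
    k = int(np.sqrt(kernels.shape[-1]))
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            # Flatten the k x k neighborhood and weight it by this
            # position's own kernel.
            patch = padded[:, y:y + k, x:x + k].reshape(C, -1)  # (C, k*k)
            out[:, y, x] = patch @ kernels[y, x]
    return out
```

With identity kernels (a single 1 at the center tap), the features pass through unchanged; in the network, the per-pixel kernels are instead predicted from the degradation features, which is what lets the adjustment vary with the local blur.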

Models  (a) ResNet  (b) U-Net  (c) Uniform  (d) No Norm.  Full Model
PSNR    47.79       47.85      45.78        49.15         49.50
SSIM    0.9967      0.9967     0.9945       0.9976        0.9980

Experiment on the downsampling network architecture for $ \times $ 1/4. All networks contain 27 convolution layers.
$ l_2 $ error  KernelGAN  BlindSR  KOALAnet
DIV2K-val      0.0230     0.0152   0.0010
DIV2KRK        0.0084     0.0081   0.0044

Kernel accuracy (average $ l_2 $ error) measured on the two random anisotropic degradation testsets, DIV2K-val and DIV2KRK.

To analyze the downsampling network, we retrained four of its variants: (a) a ResNet model -- a ResNet-style architecture trained instead of a U-Net, (b) a model with the residual connections removed from the ResBlocks -- thus a common U-Net, not a ResU-Net, (c) a model that estimates uniform (spatially-equivariant) kernels, and (d) a model without the normalization (sum to 1) employed for $ F_d $. Table downsampler shows the PSNR and SSIM values measured between the original degraded LR image and the LR image reconstructed using the kernels produced by the different models, on the DIV2K-val testset. Full Model in Table downsampler denotes our final downsampling network (a ResU-Net), and there is a large drop in PSNR if any of these components is ablated.
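The sum-to-one normalization ablated in (d), and the PSNR used for the LR-reconstruction check in Table downsampler, can be sketched as follows. This is a minimal NumPy sketch under our own naming assumptions; the actual network applies the normalization to its feature-map output rather than to a standalone array.

```python
import numpy as np

def normalize_kernels(raw):
    """Normalize per-pixel degradation kernels (H, W, k*k) so that every
    kernel's taps sum to 1, as employed for F_d in the downsampling
    network (sketch only)."""
    s = raw.sum(axis=-1, keepdims=True)
    return raw / np.maximum(s, 1e-8)  # guard against all-zero kernels

def psnr(ref, rec, peak=1.0):
    """PSNR between the original degraded LR image and the LR image
    reconstructed with the predicted kernels."""
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

The sum-to-one constraint matches the behavior of a valid blur kernel (it preserves the mean intensity of the image), which is why removing it in (d) degrades the LR reconstruction.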





In particular, if spatially-equivariant kernels are estimated as in (c), PSNR performance drops drastically by 4.72 dB, showing the importance of using spatially-variant kernels. In order to evaluate the accuracy of degradation kernel estimation, we measured the average $ l_2 $ distance between the ground truth kernels and the kernels estimated by KernelGAN, BlindSR, and the downsampling network of KOALAnet, on the random anisotropic degradation testsets DIV2K-val and DIV2KRK in Table kernel_accuracy. We make sure that the centers of the estimated kernels are aligned to the center of each ground truth kernel through manual shifting. As shown in Table kernel_accuracy, KOALAnet predicts more accurate degradation kernels with lower $ l_2 $ error compared to the other kernel estimators.
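The kernel-accuracy measurement can be sketched as follows. This is our own minimal NumPy approximation of the protocol (aligning the estimated kernel's center by shifting, then taking the squared error); the exact alignment and error definition used for Table kernel_accuracy may differ.

```python
import numpy as np

def align_center(kernel):
    """Circularly shift a square 2D kernel so that its peak tap sits at
    the spatial center -- a simple stand-in for the manual center
    alignment described in the text."""
    k = kernel.shape[0]
    cy, cx = np.unravel_index(np.argmax(kernel), kernel.shape)
    return np.roll(kernel, (k // 2 - cy, k // 2 - cx), axis=(0, 1))

def kernel_l2_error(est, gt):
    """Mean squared difference between a center-aligned estimated kernel
    and the ground truth kernel (one plausible definition of the
    per-kernel l2 error, averaged over the testset in the paper)."""
    return float(np.mean((align_center(est) - gt) ** 2))
```

Without the alignment step, a perfectly shaped but slightly off-center estimate would be penalized heavily, so the shift isolates the error in kernel shape from the error in kernel position.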


In Fig. kernel_vis(a), we visualize the estimated blur kernels and upsampling kernels of our KOALAnet at two different locations in the same image. The $ 1^{st} $ column shows the spatially-variant degradation kernels predicted by the downsampling network. As discussed in Section downsampler, the predicted blur kernel is close to the true kernel in the complex area, while a non-directional kernel is obtained in the homogeneous region. In the $ 2^{nd} $ column, the $ s^2 $ 2D upsampling kernels of size $ 5\times 5 $ are also shown to be non-uniform depending on the location. We also visualize some examples of the local filters, $ k $ , of the KOALA modules in Fig. kernel_vis(b). The top row shows the degradation kernels estimated by the downsampling network, and the bottom row shows the $ 7\times 7 $ local filters ( $ k $ ) of the $ 5^{th} $ KOALA module. Even without any explicit constraint on the shape of $ k $ , the filters are learned to reflect the orientations and shapes of the blur kernel, and are thus able to adjust the SR features accordingly.
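As a sketch of how such per-pixel upsampling kernels act, the following NumPy example applies $ s^2 $ local filters of size $ k\times k $ at each LR position and rearranges the responses into an $ s\times $ larger image (a depth-to-space step). The function name and shapes are our own assumptions, not the paper's code, and a real implementation would operate on multi-channel feature maps.

```python
import numpy as np

def dynamic_upsample(img, kernels, s):
    """Upsample a single-channel LR image (H, W) by factor s: each LR
    position (y, x) owns s*s filters of size k*k (`kernels` has shape
    (H, W, s*s, k*k)); their responses form the s x s HR block at that
    position (depth-to-space). Hypothetical sketch."""
    H, W = img.shape
    k = int(np.sqrt(kernels.shape[-1]))
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros((H * s, W * s))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + k, x:x + k].ravel()  # (k*k,)
            block = kernels[y, x] @ patch             # (s*s,) responses
            out[y * s:(y + 1) * s, x * s:(x + 1) * s] = block.reshape(s, s)
    return out
```

With every filter set to an identity (a single 1 at the center tap), this reduces to nearest-neighbor upsampling; predicting the filters per pixel is what allows the upsampling to adapt to the local blur, consistent with the non-uniform kernels observed in Fig. kernel_vis(a).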




Blind SR is an important step towards generalizing learning-based SR models to diverse types of degradations and content of LR data. To achieve this goal, we designed a downsampling network that predicts spatially-variant kernels, and an upsampling network that leverages this information effectively by applying these kernels as local filtering operations to modulate the early SR features based on the degradation information. As a result, our proposed KOALAnet accurately predicts the HR images under a randomized synthetic setting as well as for historic data. Furthermore, we are the first to analyze SR results on real aesthetic photographs, for which our KOALAnet appropriately handles the intentional blur, unlike the other methods or the Baseline. Our code and data are publicly available on the web.



