# KOALAnet: Blind Super-Resolution Using Kernel-Oriented Adaptive Local Adjustment

## Abstract

Blind super-resolution (SR) methods aim to generate a high-quality, high-resolution (HR) image from a low-resolution (LR) image containing unknown degradations. However, natural images contain various types and amounts of blur: some may be due to the inherent degradation characteristics of the camera, but some may even be intentional, for aesthetic purposes (e.g., the Bokeh effect). In the latter case, it becomes highly difficult for SR methods to disentangle the blur to remove from the blur to leave as is. In this paper, we propose a novel blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns spatially-variant degradation and restoration kernels in order to adapt to the spatially-variant blur characteristics in real images. Our KOALAnet outperforms recent blind SR methods on synthesized LR images obtained with randomized degradations, and we further show that the proposed KOALAnet produces the most natural results for artistic photographs with intentional blur, which are not over-sharpened, by effectively handling images mixed with in-focus and out-of-focus areas.

## Introduction

When a deep neural network is trained under a specific scenario, its generalization ability tends to be limited to that particular setting, and its performance deteriorates under different conditions. This is a major problem in single image super-resolution (SR), where, until very recently, most neural-network-based methods have focused on the upscaling of low resolution (LR) images to high resolution (HR) images solely under the bicubic degradation setting. Naturally, their performance tends to drop severely if the input LR image is degraded by even a slightly different downsampling kernel, which is often the case in real images . Hence, more recent SR methods aim for blind SR, where the true degradation kernels are unknown . However, this unknown blur may be of various types with different characteristics. Often, images are captured with a different depth-of-field (DoF) by manipulating the aperture sizes and the focal lengths of camera lenses, for aesthetic purposes (e.g., the Bokeh effect), as shown in Fig. dof_imgs. Recent mobile devices even try to simulate this synthetically (e.g., portrait mode) for artistic effects . Although a camera-specific degradation could be spatially-equivariant (similar to the way LR images are generated for SR), the blur generated due to the DoF of the camera would be spatially-variant, where some areas are in focus, and others are out of focus.


These types of LR images are extremely challenging for SR since, ideally, the intentional blur must be left unaltered (should not be over-sharpened) to preserve the photographer's intent after SR. However, the SR results on such images are yet to be analyzed in the literature.

In this paper, we propose a blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns the degradation and restoration kernels. The KOALAnet consists of two networks: a downsampling network that estimates spatially-variant blur kernels, and an upsampling network that fuses this information by mapping the predicted degradation kernels to the feature kernel space, predicting degradation-specific local feature adjustment parameters that are applied on the SR feature maps. After training under a random anisotropic Gaussian degradation setting, our KOALAnet is able to accurately predict the underlying degradation kernels and effectively leverage this information for SR. Moreover, it demonstrates a good generalization ability on historic images containing unknown degradations compared to previous blind SR methods. We further provide comparisons on real aesthetic DoF images, and show that our KOALAnet effectively handles images with intentional blur. Our contributions are three-fold:


• We propose a blind SR framework that jointly learns spatially-variant degradation and restoration kernels. The restoration (upsampling) network leverages novel KOALA modules to adaptively adjust the SR features based on the predicted degradation kernels. The KOALA modules are extensible, and can be inserted into any CNN architecture for image restoration tasks.
• We empirically show that the proposed KOALAnet outperforms the recent state-of-the-art blind SR methods for synthesized LR images obtained under randomized degradation conditions, as well as for historic LR images with unknown degradations.
• We are the first to analyze SR results on images mixed with in-focus and out-of-focus regions, showing that our KOALAnet is able to discern intentionally blurry areas and process them accordingly, leaving the photographer's intent unchanged after SR.



## Related Work

Since the first CNN-based SR method by Dong , highly sophisticated deep learning networks have been proposed for image SR , achieving remarkable quantitative and qualitative performance. In particular, Wang introduced a feature-level affine transformation based on segmentation priors to generate class-specific texture in the SR result. Although these methods perform well under the ideal bicubic-degraded setting, they tend to produce blurry or over-sharpened results if the degradations present in the test images deviate from bicubic degradation.

Recent methods handling multiple types of degradations can be categorized into non-blind SR , where the LR images are coupled with the ground truth degradation information (blur kernel or noise level), or blind SR , where only the LR images are given and the degradation information must be estimated. Among the former, Zhang provided the principal components of the Gaussian blur kernel and the level of additive Gaussian noise by concatenating them with the LR input for degradation-aware SR. Xu also integrated the degradation information in the same way, but with a backbone network using dynamic upsampling filters , raising the SR performance. However, these methods require ground truth blur information at test time, which is unrealistic for practical application scenarios. Among blind SR methods that predict the degradation information, an inspiring work by Gu inserted spatial feature transform modules into the CNN architecture to integrate the degradation information with iterative kernel correction. However, the iterative framework can be time-consuming, since the entire framework must be repeated many times during inference, and the optimal number of iteration loops varies among input images, requiring human intervention for maximal performance. Furthermore, their network generates vector kernels that are eventually stretched with repeated values to be inserted into the SR network, limiting the modeling capability for local degradation characteristics.


Another prominent work is KernelGAN , which generates downscaled LR images by learning the internal patch distribution of the test LR image. The downscaled LR patches, or the kernel information, and the original test LR images are then plugged into zero-shot SR or non-blind SR methods. There also exist methods that employ GANs to generate realistic kernels for data augmentation , or learn to synthesize LR images along with the SR image . In comparison, our downsampling network predicts the underlying blur kernels that are used to modulate and locally filter the upsampling features. Jia first proposed dynamic filter networks that generate image- and location-specific filters, which filter the input images in a locally adaptive manner to better handle the non-stationary property of natural images, in contrast to conventional convolution layers with spatially-equivariant filters. Application-wise, Niklaus and Jo successfully employed dynamic filtering networks for video frame interpolation and video SR, respectively. The recent non-blind SR method by Xu also employed a two-branch dynamic upsampling architecture . However, the provided ground truth degradation kernel is still restricted to spatially-uniform kernels and is entered naively by simple concatenation, unlike our proposed KOALAnet, which estimates the underlying blur kernels from input LR images and effectively integrates this information for SR.

## Proposed Method

We propose a blind SR framework with (i) a downsampling network that predicts spatially-variant degradation kernels, and (ii) an upsampling network containing KOALA modules that adaptively fuse the degradation kernel information for enhanced blind SR.


### Downsampling Network

During training, an LR image, $X$ , is generated by applying a random anisotropic Gaussian blur kernel, $k_{g}$ , on an HR image, $Y$ , and downsampling it with the bicubic kernel, $k_{b}$ , similar to , given as

$$X = (Y * k_g * k_b)\downarrow_s,$$

where $\downarrow_s$ denotes downsampling by scale factor $s$ . Hence, the downsampling kernel $k_{d}$ can be obtained as $k_{d}=k_{g}*k_{b}$ , and the degradation process can be implemented as an $s$ -stride convolution of $Y$ with $k_{d}$ . We believe that anisotropic Gaussian kernels are a more suitable choice than isotropic Gaussian kernels for blind SR, as anisotropic kernels form the more general superset. We do not apply any additional anti-aliasing measures (as in the default Matlab imresize function), since $Y$ is already low-pass filtered by $k_g$ . The downsampling network, shown in the upper part of Fig. net_arch, takes a degraded LR RGB image, $X$ , as input and, through a U-Net-based architecture with ResBlocks, aims to predict the underlying degradation kernel assumed to have produced $X$ from its HR counterpart, $Y$ . The output, $F_d$ , is a 3D tensor of size $H\times W\times 400$ , composed of $20\times20$ local filters at every $(h, w)$ pixel location. The local filters are normalized to have a sum of 1 (denoted as Normalize in Fig. net_arch) by subtracting each of their mean values and adding a bias of $1/400$ . With $F_d$ , the LR image, $\hat{X}$ , can be reconstructed by

$$\hat{X} = F_d \oast\downarrow_s Y,$$

where $\oast\downarrow_s$ represents $20\times20$ local filtering at each pixel location with stride $s$ , as illustrated in Fig. net_arch.

For training, we propose an LR reconstruction loss, $L_r=l_1(\hat{X}, X)$ , which indirectly enforces the downsampling network to predict a plausible degradation kernel at each pixel location based on the image prior. To bring flexibility to the spatially-variant kernel estimation, the loss against the ground truth kernel is only applied to the spatial-wise mean of $F_d$ . The total loss for the downsampling network is then given as

$$L_d = l_1(\hat{X}, X) + l_1(E_{hw}[F_d],\, k_d),$$

where $E_{hw}[\cdot]$ denotes a spatial-wise mean over $(h, w)$ , and $k_d$ is reshaped to $1\times1\times400$ from its original size of $20\times 20$ . Estimating the blur kernel for a smooth region in an LR image is difficult, since dissimilar blur kernels may produce similarly smooth pixel values. Consequently, if the network aims to directly predict the true blur kernel, the gradient of a kernel matching loss may not back-propagate a desirable signal. Meanwhile, for highly textured regions of HR images, the induced LR images are strongly influenced by the blur kernels, which enables the downsampling network to find inherent degradation cues in the LR images. In this case, the degradation information is also highly helpful for reconstructing the SR image, since most of the SR reconstruction error tends to occur in these regions.
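To make the degradation process concrete, below is a minimal single-channel NumPy sketch that builds an anisotropic Gaussian kernel and implements the blur-then-downsample step directly as an $s$-stride valid convolution. The function names and the edge handling (valid convolution, no padding) are our own simplifications, not the authors' code.

```python
import numpy as np

def anisotropic_gaussian(size=15, sigma1=2.0, sigma2=0.5, theta=0.3):
    """Anisotropic Gaussian kernel: a diagonal covariance rotated by theta, sum = 1."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag([sigma1 ** 2, sigma2 ** 2]) @ R.T
    inv = np.linalg.inv(cov)
    r = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(r, r)
    pts = np.stack([xx, yy], axis=-1)                       # (size, size, 2)
    k = np.exp(-0.5 * np.einsum('...i,ij,...j->...', pts, inv, pts))
    return k / k.sum()                                      # normalize to sum 1

def degrade(Y, k, s=2):
    """X = (Y * k) downsampled by s, as a single s-stride valid convolution.
    The Gaussian kernel is symmetric, so correlation equals convolution here."""
    kh, kw = k.shape
    H = (Y.shape[0] - kh) // s + 1
    W = (Y.shape[1] - kw) // s + 1
    X = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            X[i, j] = np.sum(Y[i * s:i * s + kh, j * s:j * s + kw] * k)
    return X
```

In practice this per-pixel loop would be vectorized or run on GPU; the loop form only makes the stride-$s$ local inner product explicit.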


### Upsampling Network

We consider the upsampling process to be the inverse of the downsampling process, and thus design an upsampling network in correspondence with the downsampling network, as shown in Fig. net_arch. The upsampling network takes in the degraded LR input, $X$ , of size $H\times W\times 3$ , and generates an SR output, $\hat{Y}$ , of size $sH\times sW\times 3$ , where $s$ is the scale factor. In the early convolution layers of the upsampling network, the SR feature maps are adjusted by five cascaded KOALA modules, K, which are explained in detail in the next section. Then, after seven cascaded residual blocks, R, the resulting feature map, $f_u$ , is given by $f_u = (RL\circ R^7\circ K^5\circ Conv)(X),$ where RL is ReLU activation . $f_u$ is fed separately into a residual branch and a filter generation branch, similar to , where the residual map, $r$ , and the local upsampling filters, $F_u$ , are obtained as

$$r = (PS\circ Conv\circ PS\circ Conv)(f_u), \quad F_u = (Normalize\circ Conv)(f_u)$$

for $s=4$ , where PS is a pixel shuffler of $s=2$ and Normalize denotes normalizing by subtracting the mean and adding a bias of $1/25$ for each $5\times 5$ local filter. The second PS and its preceding convolution layer are removed when generating $r$ for $s=2$ . When applying the generated $F_u$ of size $H\times W\times(25\times s\times s)$ on the input $X$ , $F_u$ is split into $s\times s$ tensors in the channel direction, and each chunk of $H\times W\times 25$ tensor is interpreted as a $5\times 5$ local filter at every $(h, w)$ pixel location. They are applied on $X$ (same filters for RGB channels) by computing the local inner product at the corresponding grid position $(h, w)$ . After filtering all of the $s\times s$ chunks, the produced $H\times W\times(s\times s\times 3)$ tensor is pixel-shuffled with scale $s$ to generate the enlarged $\tilde{Y}$ of size $sH\times sW\times 3$ similar to . Finally, $\hat{Y}$ is computed as $\hat{Y}=\tilde{Y}+r$ , and the upsampling network is trained with $l_1(\hat{Y}, Y)$ .
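The per-pixel upsampling filters can be illustrated with a small NumPy sketch: each $5\times5$ filter is normalized to sum to 1, applied at its grid position, and the $s\times s$ filtered outputs are pixel-shuffled into the enlarged image. This is a single-channel sketch with edge padding (the paper's padding scheme is not stated), and the function names are our own.

```python
import numpy as np

def pixel_shuffle(t, s):
    """(H, W, s*s) -> (s*H, s*W): rearrange channel chunks into spatial blocks."""
    H, W, _ = t.shape
    return t.reshape(H, W, s, s).transpose(0, 2, 1, 3).reshape(H * s, W * s)

def dynamic_upsample(X, Fu, s):
    """Apply per-pixel 5x5 filters (one per upscaled sub-position), then shuffle.
    X: (H, W) single channel; Fu: (H, W, 25*s*s) predicted filters."""
    H, W = X.shape
    Xp = np.pad(X, 2, mode='edge')                 # padding scheme is an assumption
    Fu = Fu.reshape(H, W, s * s, 25)               # split channels into s*s chunks of 25
    # Normalize: subtract each filter's mean, add 1/25 so every filter sums to 1
    Fu = Fu - Fu.mean(axis=-1, keepdims=True) + 1.0 / 25.0
    out = np.empty((H, W, s * s))
    for i in range(H):
        for j in range(W):
            patch = Xp[i:i + 5, j:j + 5].ravel()   # 5x5 neighborhood as a 25-vector
            out[i, j] = Fu[i, j] @ patch           # local inner product per chunk
    return pixel_shuffle(out, s)
```

Because every normalized filter sums to 1, a constant input stays constant after upsampling, which is a quick sanity check on the normalization.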

| Method ($\times2$) | Set5 | Set14 | BSD100 | Urban100 | Manga109 | DIV2K-val | DIV2KRK | Time (s)/GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 27.11/0.7850 | 26.00/0.7222 | 26.09/0.6838 | 22.82/0.6537 | 24.87/0.7911 | 28.27/0.7835 | 28.73/0.8040 | - |
| ZSSR | 27.30/0.7952 | 26.55/0.7402 | 26.46/0.7020 | 23.13/0.6706 | 25.43/0.8041 | 28.69/0.7958 | 29.10/0.8215 | 24.69/- |
| KernelGAN+ZSSR | 27.35/0.7839 | 24.57/0.7061 | 25.56/0.6990 | 23.12/0.6907 | 25.99/0.8270 | 27.66/0.7892 | - | 230.66/10,219 |
| BlindSR | - | - | - | - | - | - | 29.44/0.8464 | -/13,910 |
| KOALAnet (Ours) | 33.08/0.9137 | 30.35/0.8568 | 29.70/0.8248 | 27.19/0.8318 | 32.61/0.9369 | 32.55/0.8902 | 31.89/0.8852 | 0.71/201 |

| Method ($\times4$) | Set5 | Set14 | BSD100 | Urban100 | Manga109 | DIV2K-val | DIV2KRK | Time (s)/GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 26.41/0.7511 | 24.73/0.6641 | 25.12/0.6321 | 22.04/0.6061 | 23.60/0.7482 | 27.04/0.7417 | 25.33/0.6795 | - |
| ZSSR | 26.49/0.7530 | 24.93/0.6812 | 25.36/0.6526 | 22.39/0.6327 | 24.43/0.7813 | 27.39/0.7590 | 25.61/0.6911 | 16.91/6,091 |
| KernelGAN+ZSSR | 22.12/0.5989 | 19.73/0.5194 | 21.02/0.5377 | 20.12/0.5743 | 22.61/0.7345 | 23.75/0.6830 | 26.81/0.7316 | 357.70/11,908 |
| IKC (*last*) | 27.73/0.8024 | 25.38/0.7162 | 25.68/0.6844 | 23.03/0.6852 | 25.44/0.8273 | 27.61/0.7843 | 27.39/- | - |
| IKC (*max*) | - | - | - | - | - | - | -/0.7684 | - |
| KOALAnet (Ours) | 30.28/0.8658 | 27.20/0.7541 | 26.97/0.7172 | 24.71/0.7427 | 28.48/0.8814 | 29.44/0.8156 | 27.77/0.7637 | 0.59/57 |

###### Quantitative comparison (PSNR/SSIM) on various datasets. We also provide a comparison of computational complexity in terms of the average inference time on Set5, and GFLOPs on *baby*.

### Kernel-Oriented Adaptive Local Adjustment (KOALA)

We propose a novel feature transformation module, KOALA, that adaptively adjusts the intermediate features in the upsampling network based on the degradation kernels predicted by the downsampling network. The KOALA modules are placed at the earlier stage of feature extraction in order to calibrate the anisotropically degraded LR features before the reconstruction phase.


Specifically, when the input feature, $x$ , is entered into a KOALA module, K, it goes through 2 convolution layers, and is adjusted by a set of multiplicative parameters, $m$ , followed by a set of local kernels, $k$ , both generated based on the predicted degradation kernels, $F_d$ . Instead of directly feeding $F_d$ into K, the kernel features, $f_d$ , extracted after 3 convolution layers are entered. After a local residual connection, the output, $y$ , of the KOALA module is given by

$$y = (\tilde{x} \otimes m) \oast k + x, \quad \text{(Eq. 4)}$$

where $\tilde{x}$ denotes $x$ after the two convolution layers, and $m$ and $k$ are each produced from $f_d$ by their respective convolution branches. In Eq. 4, $\otimes$ and $\oast$ denote element-wise multiplication and local feature filtering, respectively. For generating $k$ , $1\times1$ convolutions are employed so that spatially adjacent values of the kernel features, $f_d$ , are not mixed by convolution operations. The kernel values of $k$ are constrained to have a sum of 1 (Normalize), as for $F_d$ and $F_u$ . The local feature filtering operation, $\oast$ , is applied by first reshaping the $1\times 1\times 49$ vector at each grid position $(h, w)$ to a $7\times 7$ 2D local kernel, and then computing the local inner product at each $(h, w)$ position of the input feature. Since the same $7\times 7$ kernels are applied channel-wise, the multiplicative parameter, $m$ , introduces element-wise scaling of the features over the channel depth. This is also efficient in terms of the number of parameters, compared to predicting per-pixel local kernels for every channel ( $49+64$ vs. $49\times64$ filter parameters). By placing the residual connection after the feature transformations (Eq. 4), the adjustment parameters can be considered as removing the unwanted, degradation-related feature residuals from the original input features.
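A minimal NumPy sketch of the feature adjustment in Eq. 4, omitting the convolution layers that produce $\tilde{x}$, $m$, and $k$: channel-wise scaling by $m$, then $7\times7$ local filtering with a normalized per-pixel kernel shared across channels, and a residual connection. Shapes and edge padding are our assumptions.

```python
import numpy as np

def koala_adjust(x, m, k):
    """y = (x ⊗ m) ⊛ k + x, with per-pixel 7x7 kernels shared across channels.
    x: (H, W, C) input features; m: (H, W, C) multiplicative parameters;
    k: (H, W, 49) per-pixel kernel values (normalized inside)."""
    H, W, C = x.shape
    z = x * m                                              # channel-wise scaling
    # Normalize: subtract each kernel's mean, add 1/49 so every kernel sums to 1
    k = k - k.mean(axis=-1, keepdims=True) + 1.0 / 49.0
    kk = k.reshape(H, W, 7, 7)
    zp = np.pad(z, ((3, 3), (3, 3), (0, 0)), mode='edge')  # padding is an assumption
    y = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            patch = zp[i:i + 7, j:j + 7, :]                # (7, 7, C) neighborhood
            # same 7x7 kernel applied to every channel
            y[i, j] = np.tensordot(kk[i, j], patch, axes=([0, 1], [0, 1]))
    return y + x                                           # local residual connection
```

With $m$ set to all ones and a constant feature map, the normalized kernels leave the features unchanged and the residual simply doubles them, which makes the sum-to-1 constraint easy to verify.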

### Training Strategy

We employ a 3-stage training strategy: (i) the downsampling network is pre-trained with $l_1(\hat{X}, X)$ ; (ii) the upsampling network is pre-trained with $l_1(\hat{Y}, Y)$ by replacing all KOALA modules with ResBlocks; (iii) the whole framework (KOALAnet) including the KOALA modules (with convolution layers needed for generating $f_d$ , $m$ and $k$ inserted on the pre-trained ResBlocks) is jointly optimized based on $l_1(\hat{X}, X)+l_1(\hat{Y}, Y)$ . With this strategy, the KOALA modules can be effectively trained with already meaningful features obtained from the early training phases, and focus on utilizing the degradation kernel cues for SR.

## Experiment Results

In our implementations, $k_{d}$ of size $20\times20$ is computed by convolving $k_{b}$ with a random anisotropic Gaussian kernel ( $k_{g}$ ) of size $15\times15$ , following Eq. degradation. It should be noted that $k_{b}$ is originally a bicubic downscaling kernel of size $4\times4$ (same as in the imresize function of Matlab without anti-aliasing), but is zero-padded to $20\times20$ to align with the size of $k_{d}$ and to avoid image shift. The Gaussian kernels for degradation are generated by randomly rotating a bivariate Gaussian kernel by $\theta\sim Uniform(0, \pi/2)$ , and by randomly selecting its kernel width, determined by a diagonal covariance matrix with $\sigma_{11}$ and $\sigma_{22}\sim Uniform(0.2,4.0)$ . With $k_{d}$ , we build our training data from the DIV2K dataset according to Eq. degradation. Testsets are generated from Set5 , Set14 , BSD100 , Urban100 , Manga109 and DIV2K-val for comparison with other methods. When generating the testsets, we ensure that different parameters are selected for different images by assigning different random seed values. We additionally compare on DIV2KRK proposed in , which contains randomly degraded DIV2K images.
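The random kernel sampling above can be sketched as follows. Whether $\sigma_{11}, \sigma_{22}$ fill the covariance diagonal directly or as squared widths is our reading of the text, so that detail is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_anisotropic_covariance():
    """Sample theta ~ U(0, pi/2) and diagonal widths s11, s22 ~ U(0.2, 4.0),
    then rotate: cov = R diag(s11, s22) R^T. Treating s11, s22 as the diagonal
    covariance entries (rather than their squares) is our assumption."""
    theta = rng.uniform(0.0, np.pi / 2)
    s11, s22 = rng.uniform(0.2, 4.0, size=2)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return R @ np.diag([s11, s22]) @ R.T
```

The rotation preserves the eigenvalues, so every sampled covariance stays within the $[0.2, 4.0]$ width range along its principal axes.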

All convolution filters in the KOALAnet are of size $3\times 3$ with 64 output channels following , unless otherwise noted as $1\times 1$ Conv or with the output channel noted next to an operation block in Fig. net_arch. All CNN-based networks used in our experiments are trained with LR patches of size $64\times 64$ normalized to $[-1, 1]$ , where each patch is randomly cropped, and randomly degraded with $k_d$ during training. The mini-batch size is 8, and the initial learning rate of $10^{-4}$ is decreased by $1/10$ at 80% and 90% of 200K iterations for each training stage. We consider $s=2$ and $s=4$ for SR in our experiments.
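The learning rate schedule described above amounts to a simple step decay per training stage; a sketch, with the decay points hard-coded at 80% and 90% of the stage's iterations:

```python
def learning_rate(step, base_lr=1e-4, total_steps=200_000):
    """Step decay: multiply by 0.1 at 80% and again at 90% of total_steps."""
    lr = base_lr
    if step >= int(0.8 * total_steps):
        lr *= 0.1
    if step >= int(0.9 * total_steps):
        lr *= 0.1
    return lr
```

For the stated 200K-iteration stages, this yields $10^{-4}$ until iteration 160K, $10^{-5}$ until 180K, and $10^{-6}$ thereafter.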

## Conclusion

Blind SR is an important step towards generalizing learning-based SR models to diverse types of degradations and content of LR data. To achieve this goal, we designed a downsampling network that predicts spatially-variant kernels, and an upsampling network that leverages this information effectively by applying these kernels as local filtering operations to modulate the early SR features based on the degradation information. As a result, our proposed KOALAnet accurately predicts the HR images under a randomized synthetic setting as well as for historic data. Furthermore, we provided the first analysis of SR results on real aesthetic photographs, for which our KOALAnet appropriately handles the intentional blur, unlike other methods or the Baseline. Our code and data are publicly available on the web.