[Paper Translation] CARN: a fast, accurate, and lightweight cascading residual network for super-resolution


PDF: https://arxiv.org/pdf/1803.08664.pdf

Code: https://github.com/nmhkahn/CARN-pytorch

CARN: Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network

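> Translator's note: a lightweight, easy-to-use super-resolution algorithm
> Bilingual (Chinese-English) version: https://aiqianji.com/blog/article/7
> Collection of bilingual paper translations: https://aiqianji.com/blog/articles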

摘要 Abstract

近年来,深度学习方法已经成功地应用于单图像超分辨率任务。尽管深度学习方法有很好的性能,但由于计算量大的要求,它们很难适用于实际任务。
本文通过提出一种精确、轻量级的图像超分辨率深度网络来解决这一问题。
详细地说,我们在残差网络上设计了一个级联机制的架构。我们也展示了多个不同的级联残差模型来验证算法的有效性。大量实验表明,即使用很少的参数和操作,我们的模型也能达到与最先进的方法相当的性能。

关键词:超分辨率,深度卷积神经网络

ABSTRACT

In recent years, deep learning methods have been successfully applied to single-image super-resolution tasks. Despite their great performances, deep learning methods cannot be easily applied to real-world applications due to the requirement of heavy computation. In this paper, we address this issue by proposing an accurate and lightweight deep learning model for image super-resolution. In detail, we design an architecture that implements a cascading mechanism upon a residual network. We also present a variant model of the proposed cascading residual network to further improve efficiency. Our extensive experiments show that even with much fewer parameters and operations, our models achieve performance comparable to that of state-of-the-art methods.

Keywords:

Super-Resolution, Deep Convolutional Neural Network

1 介绍 Introduction

Super-resolution (SR) is a computer vision task that reconstructs a high-resolution (HR) image from a low-resolution (LR) image. Specifically, we are concerned with single image super-resolution (SISR), which performs SR using a single LR image. SISR is generally difficult to achieve because computing the HR image from an LR image is a many-to-one mapping. Despite such difficulty, SISR is a very active area because it offers the promise of overcoming resolution limitations and could be used in a variety of applications such as video streaming or surveillance systems.
Recently, convolutional neural network-based(CNN-based) methods have provided outstanding performance in SISR tasks[srcnn2014, vdsr2016, lapsrn2017]. From the SRCNN[srcnn2014] that has three convolutional layers to MDSR[mdsr2017] that has more than 160 layers, the depth of the network and the overall performance have dramatically grown over time. However, even though deep learning methods increase the quality of the SR images, they are not suitable for real-world scenarios. From this point of view, it is important to design lightweight deep learning models that are practical for real-world applications. One way to build a lean model is reducing the number of parameters. There are many ways to achieve this[han2015deep, squeezenet], but the most simple and effective approach is to use a recursive network. For example, DRCN[drcn2016] uses a recursive network to reduce redundant parameters, and DRRN[drnn2017] improves DRCN by adding a residual architecture to it. These models decrease the number of model parameters effectively when compared to the standard CNN and show good performance. However, there are two downsides to these models: 1) They first upsample the input image before feeding it to the CNN model, and 2) they increase the depth or the width of the network to compensate for the loss due to using a recursive network. These points enable the model to maintain the details of the image when reconstructed, but at the expense of increased number of operations and inference time.

超分辨率是一项计算机视觉任务,它从低分辨率图像重建高分辨率图像。具体来说,我们关注的是单图像超分辨率(SISR),它使用单个低分辨率图像来执行超分辨率。SISR 通常很难实现,因为从单个低分辨率图像计算单个高分辨率图像是一个一对多的映射(一个像素点到多个像素点)。尽管有这样的困难,但 SISR 的研究仍是非常活跃,因为它有望克服分辨率的局限,并可用于各种应用,如视频流或监控系统。
最近,基于卷积神经网络的方法在 SISR 任务中表现突出[6,20,24]。从有三个卷积层的 SRCNN [6]到有 160 多层的 MDSR [26],网络的深度和整体性能随着时间的推移而显著增长。然而,即使深度学习方法提高了超分辨率图像的质量,它们也不适合真实世界的场景。从这个角度来看,设计适用于现实世界应用的轻量级深度学习模型非常重要。建立轻量级模型的一种方法是减少参数数量。实现这一点有许多方法[11,19],但最简单有效的方法是使用递归网络(recursive network)。例如,DRCN [21]使用递归网络来减少冗余参数,而 DRRN [35]通过增加残差结构来改进 DRCN。与标准 CNN 相比,这些模型有效地减少了模型参数的数量,并且表现出良好的性能。然而,这些模型有两个缺点:1)在将输入图像送到 CNN 模型之前,它们首先对输入图像进行上采样,2)它们增加了网络的深度或宽度,以补偿由于使用递归网络而造成的损失。这些点使模型能够在重建时保持图像的细节,但代价是操作数量和推理时间的增加。
Most of the works that aim to build a lean model focused primarily on reducing the number of parameters. However, as mentioned above, the number of operations is also an important factor to consider in real-world scenarios. Consider a situation where an SR system operates on a mobile device. Then, the execution speed of the system is also of crucial importance from a user-experience perspective. Especially the battery capacity, which is heavily dependent on the amount of computation performed, becomes a major problem. In this respect, reducing the number of operations in the deep learning architectures is a challenging and necessary step that has largely been ignored until now. Another scenario relates to applying SR methods on video streaming services. The demand for streaming media has skyrocketed, and hence requires large storage to store massive multimedia data. It is therefore imperative to compress data using lossy compression techniques before storing. Then, an SR technique can be applied to restore the data to the original resolution. However, because latency is the most critical factor in streaming services, the decompression process (i.e., super-resolution) has to be performed in real-time. To do so, it is essential to make the SR methods lightweight in terms of the number of operations.

大多数旨在建立精益模型的工作主要集中在减少参数的数量上。然而,如上所述,运算操作的数量也是现实场景中要考虑的一个重要因素。考虑在移动设备上运行超分辨率算法的情况,从用户体验的角度来看,系统的执行速度也是至关重要的。还有耗电量,这是严重依赖于算法计算量的,这也将成为一个主要问题。在这方面,减少算法中的运算量是一个具有挑战性和必要的步骤,但迄今为止,这一点在很大程度上被忽略了。另一种情况涉及将超分辨率算法应用于视频流服务。对流媒体的需求激增,因此需要大容量存储来存储大量的多媒体数据。因此,在存储之前,必须使用有损压缩技术来压缩数据。然后,可以应用超分辨率技术将数据恢复到原始分辨率。因为延迟是关键的因素,所以解压缩过程(即超分辨率)必须实时执行。为了做到这一点,减少计算量,使超分辨率方法变得轻量级是至关重要的。

To handle these requirements and improve the recent models, we propose a Cascading residual network (CARN) and its variant CARN-Mobile (CARN-M). We first build our CARN model to increase the performance, and extend it to CARN-M to optimize it for speed and the number of operations. Following the FSRCNN[fsrcnn2016], CARN and CARN-M take the LR images and compute the HR counterparts as the output of the network. The middle parts of our models are designed based on the ResNet[resnet]. The ResNet architecture has been widely used in deep learning-based SR methods[drnn2017, mdsr2017] because of the ease of training and superior performance. In addition to the ResNet architecture, CARN uses a cascading mechanism at both the local and the global level to incorporate the features from multiple layers. This has the effect of reflecting various levels of input representations in order to receive more information. In addition to the CARN model, we also provide the CARN-M model that allows the designer to tune the trade-off between the performance and the heaviness of the model. It does so by means of efficient residual block (residual-E) and recursive network architecture, which we describe in more detail in Section 3.

为了处理这些需求和改进最近的模型,我们提出了一个级联残差网络(CARN)及其变种 CARN-Mobile (CARN-M)。我们首先构建我们的 CARN 模型来提高性能,并将其扩展到 CARN-M,以优化速度和操作数量。和 FSRCNN [7]一样,CARN 系列模型拍摄低分辨率图像并通过网络输出高分辨率图像。我们模型的中间部分是基于 ResNet 设计的[13]。由于易于训练和卓越的性能,ResNet 体系结构已被广泛使用[26,35]。除了 ResNet 架构之外,CARN 还在本地和全局级别使用级联机制来整合来自多个层的功能。这具有反映不同级别的输入表示以便接收更多信息的效果。除了 CARN 模型,我们还提供了 CARN-M 模型,允许设计师在模型的性能和大小之间进行权衡。它是通过 residual-E 模块和递归网络结构来实现的,我们将在第 3 节中详细描述。

In summary, our main contributions are as follows: 1) We propose CARN, a neural network for SR based on the cascading modules, which achieves high performance. Our main modules, which we call the cascading modules, effectively boost the performance via multi-level representation and multiple shortcut connections. 2) We also propose CARN-M for efficient SR by combining the efficient residual block and the recursive network scheme. 3) We show through extensive experiments, that our model uses only a modest number of operations and parameters to achieve competitive results. Our CARN-M, which is the more lightweight SR model, shows comparable results to others with much fewer operations (Fig. 1).

总之,我们的主要贡献如下:
1)提出了一种基于级联模块的神经网络 CARN,它在超分辨率任务上取得了较高的性能(图 1)。我们的级联模块通过多级表示和多条 shortcut 连接有效提升性能。
2)结合 residual-E 模块和递归网络方案,提出了有效的超分辨率算法:CARN-M。
3)我们通过广泛的实验表明,我们的模型仅使用适度数量的运算操作和参数来实现有竞争力的结果。我们的 CARN-M 是更轻量级的 SR 模型,运算操作更少,而效果显示出与其他模型相当的结果。

Figure 1: Super-resolution results of our method compared with existing methods.

2 相关工作 Related Work

Since the success of AlexNet[alexnet] in image recognition task[imagenet2009], various deep learning approaches have been applied to many computer vision tasks[ssd, fasterrcnn, deconvnet, color]. The SISR task is one such task, and we present an overview of the deep learning-based SISR in section 2.1.

Another area we deal with in this paper is model compression. Recent deep learning models focus on squeezing model parameters and operations for application in low-power computing devices, which has many practical benefits in real-world applications. We briefly review this area in section 2.2.

自从 AlexNet [23]在图像识别任务[5]中取得成功以来,许多深度学习方法已经被应用于不同的计算机视觉任务[9,27,30,40]。单图超分辨率 SISR 就是这样一个任务,我们在第 2.1 节中概述了基于深度学习的 SISR。我们在本文中处理的另一个领域是模型压缩。最近的研究集中于压缩模型参数和操作,以便在低功率计算设备中应用,这在现实世界的应用中具有许多实际好处。我们在第 2.2 节简要回顾。

2.1 基于深度学习的图像超分辨率 DEEP LEARNING BASED IMAGE SUPER-RESOLUTION

Recently, deep learning based models have shown dramatic improvements in the SISR task. Dong et al.[srcnn2014] first proposed a deep learning-based SR method, called SRCNN, which outperformed traditional algorithms. However, SRCNN has a large number of operations compared to its depth, as input images are upsampled before being fed into the network. Taking a different approach from SRCNN, FSRCNN[fsrcnn2016] and ESPCN[espcn2016] upsample images at the end of the networks. This latter approach leads to the reduction in the number of operations compared to the former. However, the overall performance could be degraded if there are not enough layers after the upsampling layer. Moreover, they cannot manage multi-scale training, as the input image size differs for each upsampling scale.

Despite the fact that the power of deep learning comes from deep layers, the aforementioned methods have settled for shallow layers because of the difficulty in training. To better harness the depth of deep learning models, Kim et al.[vdsr2016] proposed VDSR, which uses residual learning to map the LR images x to their residual images r. Then, VDSR produces the SR images y by adding the residual back into the original, i.e., y=x+r.

最近,基于深度学习的模型在 SISR 任务中显示出显著的改进。Dong et al. [6]首次提出了一种基于深度学习的超分辨率方法:SRCNN,其性能优于传统算法。然而,由于网络将向上采样的图像作为输入,导致 SRCNN 网络深度不深,却有大量的计算操作。采用与 SRCNN 不同,FSRCNN [7]和 ESPCN [33]在网络末端再进行上采样操作。通过这样做,与 SRCNN 相比,它减少了运算操作次数。但是,如果在上采样过程后没有足够的层,整体性能可能会下降。此外,它们不能处理多尺度训练,因为输入图像在进行上采样后,尺寸不同。
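The cost difference between upsampling first (SRCNN-style) and upsampling last (FSRCNN/ESPCN-style) can be made concrete with a rough count of multiply-accumulate operations. The sketch below is only an illustration: the helper `conv_mult_adds`, the 1280×720 target size, the ×4 scale, and the single 3×3 layer with 64 channels are all assumptions chosen for the example, not figures taken from the passage above.

```python
def conv_mult_adds(kernel: int, c_in: int, c_out: int, height: int, width: int) -> int:
    """Multiply-accumulate count of one stride-1, padded convolution layer."""
    return kernel * kernel * c_in * c_out * height * width


# Illustrative setting (assumed): 1280x720 HR target, x4 upscaling.
HR_H, HR_W, SCALE = 720, 1280, 4

# Upsample first: the layer runs on the full HR-sized feature map.
pre_upsampling = conv_mult_adds(3, 64, 64, HR_H, HR_W)

# Upsample last: the same layer runs on the LR-sized feature map.
post_upsampling = conv_mult_adds(3, 64, 64, HR_H // SCALE, HR_W // SCALE)

# The ratio equals SCALE ** 2 because only the spatial size changes.
print(pre_upsampling // post_upsampling)
```

Because every layer before the upsampling step works on the smaller map, the saving applies to the whole network body, which is why late upsampling matters more as the scale factor grows.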

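Note on Fig. 2: Both models take an LR image as input and upsample it to HR at the end of the network. In the CARN model, each residual block is replaced by a cascading block. The blue arrows indicate the global cascading connections.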

On the other hand, LapSRN[lapsrn2017] uses a Laplacian pyramid architecture to increase the image size gradually. By doing so, LapSRN effectively performs SR on extremely low-resolution cases with fewer number of operations compared to VDSR. The main difference is that VDSR upsamples the image at the beginning whereas LapSRN does so sequentially.

Another issue of DL-based SR is how to reduce the number of parameters and operations. For example, DRCN[drcn2016] uses a recursive network to reduce parameters by engaging in redundant usages of a small number of parameters. DRRN[drnn2017] improves upon DRCN by combining the recursive and residual network schemes to achieve better performance with fewer parameters. However, DRCN and DRRN use very deep networks to compensate for the loss of performance, and hence these models require heavy computing resources. We therefore aim to build a model that is lightweight in both size and computation. We will briefly discuss previous works that address such model efficiency issues in the following section.

尽管事实上,深层的深度学习网络才能发挥作用,但由于训练困难,上述方法都是在进行浅层的学习。为了更好地增加网络的深度,Kim 等人[20]提出了 VDSR,它使用残差学习将 LR 图像 x 映射到它们的残差图像 r。然后,VDSR 通过将残差加上原始图像来产生 SR 图像 y,即 y = x + r。另一方面,LapSRN [24]使用拉普拉斯金字塔架构来逐渐增加图像大小。通过这样做,LapSRN 可以在极低分辨率的情况下有效地执行算法,并且与 VDSR 相比,计算操作数量更少。

基于深度学习的超分辨率算法研究的另一个问题是如何减少参数和计算操作。例如,DRCN [21]使用递归网络,通过少量参数的冗余使用来减少参数。DRRN [35]通过结合递归和残差网络方案来改善 DRCN,以更少的参数实现更好的性能。然而,DRCN 和 DRRN 使用非常深的网络来补偿性能损失,因此需要大量的计算资源。因此,我们的目标是建立一个在尺寸和计算上都很轻的模型。在下一节中,我们将简要讨论以前解决这种模型效率问题的工作。

2.2 高效的神经网络 EFFICIENT NEURAL NETWORK

Lately, there has been rising interest in building small and efficient neural networks[squeezenet, han2015deep, mobilenets]. These approaches can be categorized into two groups: 1) Compressing pretrained networks, and 2) designing small but efficient models. Han et al.[han2015deep] proposed deep compressing techniques, which consist of pruning, vector quantization, and Huffman coding to reduce the size of a pretrained network. In the latter category, SqueezeNet[squeezenet] builds an AlexNet-based architecture and achieves nearly the same performance level with 50× fewer parameters compared to AlexNet. MobileNet[mobilenets] builds an efficient neural network by applying depthwise separable convolution introduced in Sifre et al.[dwconv]. With depthwise separable convolution, it is easy to build lightweight, deep neural networks, which allow end-users to choose the appropriate network size based on the application constraints. Because of this simplicity, we also apply this technique in the residual block with some modification to achieve a lean neural network.

人们对建立小型高效的神经网络越来越感兴趣[11,16,19]。这些方法可以分为两组:
1)压缩预处理网络,
2)设计小而有效的模型。
Han et al.[11]提出了深度压缩技术,包括剪枝、矢量量化和霍夫曼编码,以减小预训练网络的大小。在后一类中,SheeNet[19]构建了一个基于 AlexNet 的架构,并以少 50 倍的参数实现了相当的性能水平。MobileNet [16]通过应用 Sifre 等人[34]中介绍的深度可分卷积建立了一个有效的网络。基于这种简化方法,我们在残差块中应用这种技术,并对其进行一些修改,以实现一个精简的神经网络。

3 提出的方法 Proposed Method

As mentioned in Section 1, we propose two main models: CARN and CARN-M. CARN is designed to make a high-performing SR model while suppressing the number of operations compared to the state-of-the-art methods. Based on CARN, we design CARN-M, which is a much more efficient SR model in terms of both parameters and operations.

正如在第 1 节中提到的,我们提出了两个主要的模型:CARN 和 CARN-M。CARN 被设计成一个高性能的超分辨率模型,同时与最先进的方法相比,减少了计算操作的数量。在 CARN 的基础上,我们设计了 CARN-M,这是一个在参数和计算操作都更少的超分辨率模型。

3.1 级联残差网络 Cascading Residual Network

Our CARN model is based on ResNet[resnet2016]. The main difference between CARN and ResNet is the presence of local and global cascading modules. Fig. 2 (b) graphically depicts how the global cascading occurs. The outputs of intermediary layers are cascaded into the higher layers, and finally converge on a single 1×1 convolution layer. Note that the intermediary layers are implemented as cascading blocks, which host local cascading connections themselves. Such local cascading operations are shown in Fig. 3 (c) and (d). Local cascading is almost identical to the global one, except that the unit blocks are plain residual blocks.

我们的 CARN 模型基于 ResNet [14]。CARN 和 ResNet 的主要区别在于本地和全局级联模块的存在。
图 2 (b)图展示了全局级联方法。中间层的输出被级联到更高的层,并最终通过单个 1×1 卷积层转换。中间层通过级联模块实现,它们本身承载局部级联连接。这种局部级联操作如图 2(c)和(d)所示。局部级联几乎与全局级联相同,只是单位模块是普通的残差模块。
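The wiring described above can be illustrated with a toy sketch in which plain Python lists stand in for feature maps. Everything here is hypothetical: `cascade`, `make_fuse`, and `double` are names invented for the example, the "residual block" is a simple doubling function, and the 1×1 convolution is replaced by an all-ones fold. The connection pattern (concatenate the block input with every unit output so far, then fuse) is one possible reading of the description and may differ in detail from the authors' implementation.

```python
from typing import Callable, List, Sequence

Feature = List[float]                  # toy stand-in for a feature map
Op = Callable[[Feature], Feature]


def cascade(x: Feature, units: Sequence[Op], fuses: Sequence[Op]) -> Feature:
    """Apply each unit, concatenate its output with everything gathered so far, then fuse."""
    gathered = list(x)   # running concatenation, starts as the block input
    current = list(x)    # feature fed to the next unit
    for unit, fuse in zip(units, fuses):
        gathered = gathered + unit(current)   # concatenate along the "channel" axis
        current = fuse(gathered)              # stand-in for the 1x1 convolution
    return current


def make_fuse(width: int) -> Op:
    """Toy 1x1 'convolution' with all-ones weights: fold any length back to `width`."""
    return lambda v: [sum(v[i::width]) for i in range(width)]


def double(v: Feature) -> Feature:
    """Toy stand-in for a residual block."""
    return [2.0 * a for a in v]


FUSE = make_fuse(1)


def local_block(v: Feature) -> Feature:
    """Local cascading: the units are (toy) residual blocks."""
    return cascade(v, [double] * 3, [FUSE] * 3)


def network_body(v: Feature) -> Feature:
    """Global cascading: the units are local cascading blocks."""
    return cascade(v, [local_block] * 3, [FUSE] * 3)


print(local_block([1.0]), network_body([1.0]))
```

The same `cascade` function serves both levels, which mirrors the statement that local cascading is almost identical to global cascading except for the type of unit block.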

Figure 2: Network architectures of a plain ResNet (top) and the proposed CARN (bottom).

To express the implementation mathematically, let f be a convolution function and τ be an activation function. Then, we can define the i-th residual block R^i, which has two convolutions followed by a residual addition, as

表示为公式:设 f 为卷积函数,τ 为激活函数。然后,我们可以定义第 I 个剩余块R_i,它
有两个卷积层,后面接残差模块的加法部分
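$$R^i(H^{i-1}; W_R^i) = \tau\left(f\left(\tau\left(f(H^{i-1}; W_R^{i,1})\right); W_R^{i,2}\right) + H^{i-1}\right)$$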

Here, H^i is the output of the i-th residual block, W^i_R is the parameter set of the entire residual block, and W^{i,j}_R is the parameter of the j-th convolution layer in the i-th block. With this formulation, we denote the output feature of the final residual block of ResNet as H^{u}, which becomes the input to the upsampling block.

这里,H^i是第 i 个残差块的输出,W^i_R是残差块的参数集,W^{i,j}_R是第 i 个块中第 j 个卷积层的参数。我们把 ResNet 的最终残差块的输出特征表示为H^{u},它成为上采样块的输入。
$$H^u = R^u\left(\dots\left(R^1\left(f(X; W_c); W_R^1\right)\right)\dots; W_R^u\right)$$

Note that because our model has a single convolution layer before each residual block, the first residual block gets f(X; W_c) as input, where W_c is the parameter of the convolution layer.

In contrast to ResNet, our CARN model has a local cascading block, illustrated in block (c) of Fig. 3, instead of a plain residual block. Here, we denote B^{i,j} as the output of the j-th residual block in the i-th cascading block, and W^i_c as the set of parameters of the i-th local cascading block. Then, the i-th local cascading block B^i_local is defined as

请注意,因为我们的模型在每个残差块之前有一个卷积层,所以第一个残差块得到 f(X;Wc)作为输入,其中W_c是卷积层的参数。
与 ResNet 不一样的是,我们的 CARN 模型具有图 3 的块(c)所示的局部级联块,而不是简单的残差块。这里,我们将B^{i,j}表示为第 i 个级联块中的第 j 个残差块的输出,将W^i_c表示为第 i 个本地级联块的参数集。然后,第 i 个局部级联块B^i_local被定义为3.png

其中,B^{i,U}定义为可以递归调用:
4.png

Finally, we define the output feature of the final cascading block H^u by combining both the local and global cascading.
H^0 is the output of the first convolution layer. In our settings, we set u=3 in CARN to match its depth with that of the corresponding ResNet.

最后,我们可以通过结合本地和全局级联来定义最终级联块H^u的输出特征。H^0是第一个卷积层的输出。我们为我们的 CARN 和 CARN-M 固定 u = b = 3。
7.png


The main difference between CARN and ResNet lies in the cascading mechanism. As shown in Fig. 2, CARN has global cascading connections represented as the blue arrows, each of which is followed by a 1×1 convolution layer. Cascading on both the local and global levels has two advantages: 1) The model incorporates features from multiple layers, which allows it to learn multi-level representations. 2) Multi-level cascading connections behave as multi-level shortcut connections that quickly propagate information from lower to higher layers.

Multi-level representation is used in many deep learning methods[lee2017multi, long2015fully] because of its successful performance with simple modifications. Our CARN follows such a scheme, but we apply this arrangement to a variety of feature levels to boost performance, as shown in equation 4. By doing so, our model reconstructs the LR image based on multi-level features. This helps the model restore the details and contexts of the image simultaneously. As a result, our models effectively improve not only primitive objects such as stripes or lines, but also complex objects like hands or street lamps.

Another reason for adopting the cascading scheme is to leverage the effect of shortcut connections. The reason for the improved performance is twofold: First, the propagation of information follows multiple paths. The benefit of multi-path propagation is well discussed in many deep learning models[densenet, unet]. Second, by adding extra convolution layers, our model can learn to choose the right pathway for the given input information flows. However, the strength of multiple shortcuts is degraded when we use only one of local or global cascading, especially the local connection. We elaborate on the details and present a case study on the effects of the cascading mechanism in Section 4.4.

CARN 和 ResNet 的主要区别在于级联机制。如图 2 所示,CARN 具有用蓝色箭头表示的全局级联连接,每个连接后面是 1×1 卷积层。在局部和全局两个层次上进行级联有两个优点:
1)模型包含来自多个层次的特征,这允许学习多层次的表示。
2)多级级联连接表现为多级快捷连接,可将信息从较低层快速传播到较高层(反向传播时,反之亦然)。
CARN 采用了[25,28]中的多级表示方案,但我们如公式 4 所示将这种方案应用于不同特征级别,以提高性能。通过这样做,我们的模型基于多级特征重建低分辨率图像。这有助于模型同时恢复图像的细节和上下文。因此,我们的模型不仅在原始图像上效果很好,在复杂细节的图像上,也能有效地提高效果。
采用级联方案的另一个原因是有双重目的:
首先,信息的传播可以遵循多条路径[17,32]。
其次,通过增加额外的卷积层,我们的模型可以学习在给定的输入信息流下选择正确的路径。
然而,当我们仅使用本地或全局级联中的一个时,尤其是本地连接时,多重连接的有效性会降低。我们在第 4.4 节中详细阐述了级联机制的细节并给出了一个案例研究。

3.2 高效级联残差网络 EFFICIENT CASCADING RESIDUAL NETWORK

To improve the efficiency of CARN, we first propose an efficient residual (residual-E) block. We use an approach similar to that of MobileNet[mobilenets], but our formulation is more general. Our residual-E block consists of two 3×3 group convolutions and one 1×1 convolution, as shown in Fig. 3 (b). The latter convolution is the same as pointwise convolution, which is used in depthwise separable convolution[mobilenets]. The former convolution is a group extension of the depthwise convolution. The advantage of using group convolution over depthwise convolution is that it makes the efficiency of the model tunable. More precisely, the user can choose the group size appropriately because the group size and performance are usually in a trade-off relationship. The analysis on the cost efficiency of using the residual-E block is as follows.

Let K be the kernel size, and let C_in and C_out be the number of input and output channels, respectively. Because we retain the feature resolution of the input and output by padding, we can denote F to be both the input and output feature size. The cost of a plain residual block is given as

$$2 \times \left(K \cdot K \cdot C_{in} \cdot C_{out} \cdot F \cdot F\right)$$

为了提高 CARN 的性能,我们提出了一种有效的残差块 (residual-E)。我们使用类似于 MobileNet[16]的方法,但是使用分组卷积来代替深度卷积。我们的 residual-E 残差块由两个 3×3 的分组卷积和一个 pointwise 卷积组成,如图 3 (b)所示。

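Figure 3: Simplified structures of (a) the residual block, (b) the efficient residual block, (c) the cascading block, and (d) the recursive cascading block. The ⊕ operation in (a) and (b) is the element-wise addition used for residual learning.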

与深度卷积相比,使用分组卷积的优势在于它使得模型的性能可调。用户可以适当地选择组大小,因为组大小和性能是一种权衡关系。使用剩余 E 块的成本效益分析如下。
设 K 为内核大小,C_inC_out为输入输出通道数。因为我们通过 padding 来保持输入和输出的特征分辨率,所以我们可以将 F 表示为输入和输出特征大小。那么,一个残差模块的计算成本可以表示为10.png

Note that we only count the cost of convolution layers and ignore the addition or activation because both the plain and the efficient residual blocks have the same amount of cost in terms of addition and activation.

Let G be the group size. Then, the cost of a residual-E block, which consists of two group convolutions and one 1×1 convolution, is as given in equation 6.

请注意,我们只计算卷积层的计算成本,而忽略了加法或激活函数,因为它们在其他设计中,计算成本都是一样的。
设 G 为分组卷积大小。然后,residual-E 残差块的计算成本由两个分组卷积和一个 pointwise 卷积组成,如等式 5 所示:

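$$2 \times \left(K \cdot K \cdot \frac{C_{in} \cdot C_{out}}{G} \cdot F \cdot F\right) + C_{in} \cdot C_{out} \cdot F \cdot F$$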

By changing the plain residual block to our efficient residual block, we can reduce the computation by the ratio of

通过将普通残差块改为有效残差块,我们可以将计算量减少
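$$\frac{2 \times \left(K \cdot K \cdot \frac{C_{in} \cdot C_{out}}{G} \cdot F \cdot F\right) + C_{in} \cdot C_{out} \cdot F \cdot F}{2 \times \left(K \cdot K \cdot C_{in} \cdot C_{out} \cdot F \cdot F\right)} = \frac{1}{G} + \frac{1}{2K^{2}}$$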

Because our model uses a kernel of size 3×3 for all group convolutions, and the number of channels is constantly 64, using an efficient residual block instead of a standard residual block can reduce the computation from 1.8 up to 14 times depending on the group size. To find the best trade-off between performance and computation, we performed an extensive case study in Section 4.4.
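The "1.8 up to 14 times" range can be checked numerically. The sketch below assumes a simple cost model: each K×K convolution costs K·K·C_in·C_out·F·F multiply-adds, a group convolution divides that cost by G, and the 1×1 convolution costs C_in·C_out·F·F. The function names and the feature size F=48 are choices made for this example only (F cancels in the ratio).

```python
def plain_residual_cost(k: int, c_in: int, c_out: int, f: int) -> float:
    """Mult-Adds of a plain residual block: two KxK convolutions on an FxF map."""
    return 2.0 * (k * k * c_in * c_out * f * f)


def residual_e_cost(k: int, c_in: int, c_out: int, f: int, g: int) -> float:
    """Two KxK group convolutions (cost divided by G) plus one 1x1 convolution."""
    return 2.0 * (k * k * c_in * c_out * f * f) / g + c_in * c_out * f * f


def reduction(g: int, k: int = 3, channels: int = 64, f: int = 48) -> float:
    """How many times cheaper the residual-E block is than the plain block."""
    plain = plain_residual_cost(k, channels, channels, f)
    return plain / residual_e_cost(k, channels, channels, f, g)


for g in (2, 4, 8, 16, 32, 64):
    print(f"G={g:2d}: {reduction(g):.2f}x fewer Mult-Adds")
```

Under this model, G=2 gives the lower end of the quoted range and G=64 (one channel per group, i.e. a depthwise convolution) gives the upper end. With G=1 the block is slightly more expensive than the plain one, because the extra 1×1 convolution is not offset by any grouping.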

To further reduce the parameters, we apply a technique similar to the one used by the recursive network. That is, we share the parameters of the cascading blocks, effectively making the blocks recursive. Fig. 3 (d) shows our block after applying the recursive scheme. This approach reduces the number of parameters by up to a factor of three.

Despite the above measures, the upsampling block is another obstacle, as the number of channels has to be increased quadratically with respect to the upsampling ratio[espcn2016]. Moreover, we use multi-scale learning to boost the performance, so the parameters of the upsampling block are increased by up to 48% in CARN and 75% in CARN-M. To mitigate this problem, we replace the 3×3 convolution layer with a 1×1 convolution inside the upsampling block. This trick reduces the parameters of that layer by a factor of nine, with little degradation in performance.
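The factor of nine follows directly from the kernel area. A small parameter count makes this explicit. The sketch assumes a sub-pixel style upsampling layer that expands `channels` to `channels * scale**2` in a single convolution and ignores biases; the function name and the single-stage layout are assumptions for illustration, and an actual implementation may stage larger scales differently.

```python
def upsample_conv_params(kernel: int, channels: int, scale: int) -> int:
    """Weight count of a conv that expands `channels` to `channels * scale**2`."""
    return kernel * kernel * channels * (channels * scale * scale)


for scale in (2, 3, 4):
    p3 = upsample_conv_params(3, 64, scale)   # 3x3 kernel
    p1 = upsample_conv_params(1, 64, scale)   # 1x1 kernel
    print(f"x{scale}: 3x3 -> {p3:,} weights, 1x1 -> {p1:,} weights, ratio {p3 // p1}")
```

The output channel count grows with the square of the scale, which is the quadratic growth mentioned above, while the 3×3 to 1×1 swap removes a constant factor of 9 at every scale.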

因为我们的模型对所有分组卷积使用大小为 3×3 的核,并且信道的数量恒定为 64,所以使用有效的残差块代替标准残差块可以根据分组大小将计算从 1.8 减少到 14 倍。为了找到性能和计算之间的最佳平衡,我们在第 4.4 节中进行了广泛的案例研究。
为了进一步减少参数,我们应用了类似于递归网络所使用的技术。也就是说,我们共享级联块的参数,有效地使块递归。图 3 (d)显示了我们在应用递归方案后的块。这种方法将参数减少到原来的三倍。

3.3 与最新技术的比较 Comparison to Recent Models

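Comparison to SRDenseNet.
SRDenseNet[37] uses dense blocks and skip connections. The differences from our model are:
1) We use global cascading, which is more general than the skip connection. In SRDenseNet, features from all levels are combined at the end of the final dense block, whereas our global cascading scheme connects all blocks, which behaves as multi-level skip connections.
2) SRDenseNet preserves the local information of a dense block via concatenation operations, while we gather it progressively through 1×1 convolution layers. The additional 1×1 convolution layers give higher representation power.

Comparison to MemNet.
The motivation of MemNet [36] is similar to ours. However, there are two main differences from our mechanism.
1) Inside the memory blocks of MemNet, the output features of each recursive unit are concatenated at the end of the network and then fused with a 1×1 convolution. In contrast, we fuse the features at every possible point in the local block, which can boost the representation power through the additional convolution layers and nonlinearity. In general, this representation power is often not realized because of the difficulty of training. However, we overcome this problem by using both local and global cascading mechanisms. We discuss the details in Section 4.4.
2) MemNet takes an upsampled image as input, so its number of multiply-add operations is larger than ours. The input to our model is an LR image, which we upsample at the end of the network, so our model is more computationally efficient.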

4 实验结果 Experimental Results

4.1 数据集 DATASETS

There exist diverse single image super-resolution datasets, but the most widely used ones are the 291 image set by Yang et al.[yang2010] and the Berkeley Segmentation Dataset[b100]. However, because these two do not have sufficient images for training a deep neural network, we additionally use the DIV2K dataset[div2k]. The DIV2K dataset is a newly-proposed high-quality image dataset, which consists of 800 training images, 100 validation images, and 100 test images. Because of the richness of this dataset, recent SR models[mdsr2017, btsrn2017, cnf2017, selnet] use DIV2K as well. We use the standard benchmark datasets such as Set5[set5], Set14[yang2010], and B100[b100] for testing and benchmarking.

单幅图像的超分辨率数据集多种多样,但应用最广泛的是 Yang et al.[39]提出的 291 图像集和伯克利分割数据集[2]。 然而,因为这两个没有足够的图像来训练深度神经网络,我们另外使用了 DIV2K 数据集[1]。
DIV2K 数据集是一个新提出的高质量图像数据集,由 800 幅训练图像、100 幅验证图像和 100 幅测试图像组成。由于这个数据集的丰富性,最近的 SR 模型[4,8,26,31]也使用了 DIV2K。 我们使用标准基准数据集,如集合 5 [3]、集合 14 [39]、B100 [29]和 Urban100 [18]进行测试和基准测试。

4.2 实施和训练细节 IMPLEMENTATION AND TRAINING DETAILS

We use the RGB input patches of size 48×48 from the LR images for training. We sample the LR patches randomly and augment them with random horizontal flips and 90 degree rotation. We train our models with the ADAM optimizer[adam] by setting $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-8}$ for $5\times10^{5}$ steps. The minibatch size is 32, and the learning rate begins with $10^{-4}$ and is halved every $3\times10^{5}$ steps. All the weights and biases are initialized by equation 8. Here, $c_{in}$ denotes the number of channels of the input feature map.

我们使用将低分辨率 RGB 图像分成 64×64 尺寸进行训练。我们随机采样,并通过随机水平翻转和 90 度旋转来增强它们。我们 ADAM 优化器的设置参数为$\beta_{1}=0.9$、$\beta_{2}=0.999$和$\epsilon=10^{-8}$,迭代$6\times10^{5}$步。mini_batch 大小为 64,学习速率从$10^{-4}$开始,每 $4\times10^{5}$ 步减半。所有权重和偏差这样初始化:
11.png

其中12.pngc_{in}是输入的通道数。
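The step-wise learning-rate decay described in the training settings above can be written as a small function. This is only a sketch: the name `learning_rate` is hypothetical, and because the English and Chinese passages give different halving intervals ($3\times10^{5}$ and $4\times10^{5}$ steps), the interval is left as a parameter rather than fixed.

```python
def learning_rate(step: int, base_lr: float = 1e-4, halve_every: int = 400_000) -> float:
    """Step decay: start at `base_lr` and halve it once every `halve_every` steps."""
    return base_lr * 0.5 ** (step // halve_every)


# Rate at the start, just before the first halving, and just after it.
for step in (0, 399_999, 400_000):
    print(step, learning_rate(step))
```

Passing `halve_every=300_000` reproduces the schedule stated in the English passage instead.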

The most well known and effective weight initialization methods are given by Glorot et al.[glorot2010] and He et al.[he2015delving]. However, such initialization routines tend to set the weights of our multiple narrow 1×1 convolution layers very high, resulting in an unstable training. Therefore, we sample the initial values from a uniform distribution to alleviate the initialization problem.

To train our model in a multi-scale manner, we first select the scaling factor as one of 2×, 3×, and 4× because our model can only process a single scale for each batch. Then, we can construct and augment our input batch, as described above. We use the L1 loss as our loss function instead of the L2. The L2 loss is widely used in the image restoration task due to its relationship with the peak signal-to-noise ratio (PSNR). However, in our experiments, L1 provides better convergence and performance. The downside of the L1 loss is that the convergence speed is relatively slower than that of L2 without the residual block. However, this drawback could be mitigated by using a ResNet style architecture such as that used by ours.

Glorot 等人[10]和何等人[12]给出了最著名和最有效的权重初始化方法。然而,这种初始化程序倾向于将我们的多个窄的 1×1 卷积层的权重设置得非常高,导致训练不稳定。因此,我们从均匀分布中采样初始值,以缓解初始化问题。
为了以多尺度方式训练我们的模型,我们首先将缩放比例因子设置为 ×2、×3 和 ×4 中的一个,因为我们的模型只能为每个批次处理单个尺度。然后,我们构建并论证我们的输入批,如上所述。 我们用 L1 损失代替 L2 损失作为我们的损失函数。由于 L2 损失与峰值信噪比(PSNR)的关系,它被广泛用于图像恢复任务。然而,在我们的实验中,L1 提供了更好的收敛性和性能。L1 损失的缺点是没有残差模块的时候收敛速度比 L2 相对慢。然而,这个缺点可以通过使用 ResNet 风格的模型来减轻。
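The per-batch scale selection described above can be sketched as follows. The function names, the default LR patch size of 48 (the text gives 48 in one passage and 64 in another), and the encoding of the augmentation as a flip flag plus a count of quarter turns are all assumptions made for illustration. The code only decides what a batch should look like; it does not load or crop images.

```python
import random
from typing import Tuple

SCALES = (2, 3, 4)  # upsampling factors named in the text


def sample_batch_spec(rng: random.Random, lr_patch: int = 48) -> Tuple[int, int, int]:
    """Pick one scale for the whole batch; return (scale, LR patch size, HR patch size)."""
    scale = rng.choice(SCALES)
    return scale, lr_patch, lr_patch * scale


def sample_augmentation(rng: random.Random) -> Tuple[bool, int]:
    """Return (flip horizontally?, number of 90-degree rotations); a hypothetical encoding."""
    return rng.random() < 0.5, rng.randrange(4)


rng = random.Random(0)
for _ in range(3):
    scale, lr_size, hr_size = sample_batch_spec(rng)
    flip, quarter_turns = sample_augmentation(rng)
    print(f"x{scale}: LR {lr_size} -> HR {hr_size}, flip={flip}, rotation={90 * quarter_turns}")
```

Fixing one scale per batch keeps every LR/HR pair in the batch the same shape, which is the constraint the text refers to when it says the model can only process a single scale for each batch.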

4.3 与最先进方法的比较 COMPARISON WITH STATE-OF-THE-ART METHODS

We compare the proposed CARN and CARN-M with state-of-the-art SR methods on two commonly-used image quality metrics: PSNR and the structural similarity index (SSIM)[ssim]. One thing to note here is that we represent the number of operations by Mult-Adds. Mult-Adds is the number of composite multiply-accumulate operations for a single image. In Fig. 4, we compare CARN(-M) against the benchmark algorithms using the number of operations (by Mult-Adds) and parameters. Our CARN model achieves result comparable with SelNet[selnet], both of which outperform all state-of-the-art models except MDSR[mdsr2017]. This is not surprising, because MDSR has nearly six times more parameters than that of either ours or SelNet’s. CARN and DRCN[drcn2016] have similar number of parameters, but CARN outperforms DRCN by using 100× fewer Mult-Adds. Our CARN-M model also outperforms most of the benchmark methods with 2,319M Mult-Adds, which is nearly half the amount of SRCNN[srcnn2014].
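Since PSNR is used as an evaluation metric and is tied to the L2 loss, a compact reference definition may help. The sketch below works on flat sequences of pixel values and assumes a peak value of 255; the function names are chosen for this example, and evaluation details such as colour space or border cropping are not covered.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error (the L2 loss) between two equally sized signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error (the L1 loss)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def psnr(a: Sequence[float], b: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical signals."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(peak * peak / err)


reference = [52.0, 60.0, 61.0, 70.0]
estimate = [50.0, 63.0, 61.0, 68.0]
print(f"L1={l1(reference, estimate):.3f}  MSE={mse(reference, estimate):.3f}  "
      f"PSNR={psnr(reference, estimate):.2f} dB")
```

PSNR is a monotonically decreasing function of the MSE, so minimising the L2 loss directly maximises PSNR. The L1 loss has no such exact link, which is why choosing it is an empirical decision.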

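We compare the proposed CARN and CARN-M with state-of-the-art SR methods on two commonly used image quality metrics: PSNR and SSIM [38]. Note that we represent the number of operations by Mult-Adds, the number of composite multiply-accumulate operations for a single image. We assume the HR image size to be 720p (1280×720). In Fig. 4, we compare our CARN family against various benchmark algorithms in terms of Mult-Adds and the number of parameters on the Set14 ×4 dataset. Here, our CARN model outperforms all state-of-the-art models that have fewer than 5M parameters. In particular, CARN has a similar number of parameters to DRCN [21], SelNet [4], and SRDenseNet[37], but outperforms all three.
MDSR [26] performs better than ours, which is not surprising because MDSR has 8M parameters, nearly six times more than ours. The CARN-M model also outperforms most of the benchmark methods and shows results comparable to those of the heavy models.
Figure 4: Comparison of performance and number of parameters.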

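In addition, our models are the most efficient in terms of computation cost: CARN shows the second-best result with 90.9G Mult-Adds, which is on par with SelNet [4]. This efficiency mainly comes from the late-upsampling approach that many recent models [7,24,37] use. Moreover, our novel cascading mechanism shows higher performance than other similar approaches. For example, with a similar number of operations, CARN outperforms its most similar model, SelNet, by 0.11 PSNR. The CARN-M model also uses a number of operations similar to that of SRCNN while showing results comparable to those of many computationally heavy models.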

Table 1: Quantitative results of deep learning-based SR algorithms.

Table 1 shows the quantitative comparisons of the performances over the Set5, Set14, and B100 datasets. With only a few operations, our CARN outperforms almost all benchmark models on most of the datasets and shows similar scores to SelNet. In detail, CARN has a similar number of parameters to DRCN, but its number of operations is on par with SRCNN[srcnn2014]. Our lightweight model, CARN-M, has approximately the same number of parameters as DRRN (305K), but shows comparable results to others using approximately 2,000 times fewer operations. Note that MDSR is excluded from this table, because we only compare models that have a roughly similar number of parameters to ours. In detail, MDSR has a parameter set whose size is 4.5 times larger than that of the second-largest model, which is DRCN.

As shown in Table 1, CARN model has a slightly larger number of parameters compared to the ones with similar complexity (i.e. between CARN and SelNet). However, this is because our models have to contain all possible upsampling layers in order to perform multi-scale learning, which could take up a large portion of parameters. On the other hand, VDSR and DRRN do not require this extra burden, even if multi-scale learning is performed, because they upsample the image by bicubic upsampling before processing it.

The benefit of using multi-scale learning is that it can process multiple scales using a single trained model. Only a few known architectures, such as VDSR, DRRN, and MDSR, have this characteristic. This helps us alleviate the burden of heavy-weight model size when deploying the SR application on mobile devices; CARN(-M) only needs a single fixed model for multiple scales, whereas even the state-of-the-art algorithms require training separate models for each supported scale. This property is well-suited for real-world products because the size of the applications has to be fixed while the scale of given LR images could vary.

In Fig. 5, we visually illustrate the qualitative comparisons over three datasets (Set14[yang2010], B100[b100] and Urban100[urban100]) for 4× scale. It can be seen that our model works better than others and accurately reconstructs not only stripes and line patterns, but also complex objects such as hand and street lamps.

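Table 1 also shows the quantitative comparisons of the performance over the benchmark datasets. Note that MDSR is excluded from this table because we only compare models that have roughly the same number of parameters as ours; MDSR has a parameter set whose size is four times larger than that of the second-largest model. Our CARN exceeds all previous methods on many of the benchmark datasets. The CARN-M model achieves comparable results using very few operations. We also emphasize that although CARN-M has more parameters than SRCNN or DRRN, this is tolerable in real-world scenarios: the sizes of SRCNN and CARN-M are 200KB and 1.6MB, respectively, both of which are acceptable on recent mobile devices.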

为了使我们的模型更加轻量级,我们应用了多尺度学习方法。使用多尺度学习的好处是,它可以使用单个训练模型处理多个尺度。这有助于我们减轻在移动设备上部署服务请求应用程序时模型尺寸过大的负担;CARN(-M)对于多个尺度只需要一个单一的固定模型,而其他算法即使是最先进的也需要为每个支持的尺度训练单独的模型。
这种特性非常适合现实世界的产品,因为应用程序的大小必须是固定的,而给定的低分辨率图像的比例可能会有所不同。将多尺度学习用于我们的模型增加了参数量,因为网络必须包含需要上采样层。另一方面,即使执行多尺度学习,VDSR 和 DRRN 也不需要这种额外的负担,因为它们在处理图像之前对图像进行上采样。
在图 6 中,我们直观地展示了在 3 个数据集(集合 14、B100 和 Urban100)上对 ×4 尺度的定性比较。可以看出,我们的模型比其他模型工作得更好,不仅准确地重建了条纹和线条图案,还精确地重建了手和路灯等复杂对象。

Figure 6: Results.

4.4 模型分析 Model Analysis

To further investigate the performance behavior of the proposed methods, we analyze our models via ablation study. First, we show how local and global cascading modules affect the performance of CARN. Next, we analyze the trade-off between performance vs. parameters and operations. We train all cases for $3\times10^{5}$ steps with the same set of hyper-parameters for fair comparison.

Cascading Modules.

Table 2 presents the ablation study on the effect of local and global cascading modules. In this table, the baseline is ResNet[resnet2016], CARN-NL is CARN without local cascading and CARN-NG is CARN without global cascading. The network topologies are all the same, but because of the 1×1 convolution layers, the overall number of parameters is increased by up to 10%.
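To see where the extra parameters of the 1×1 convolutions come from, here is a minimal counting sketch for a single cascading block. The feature width (`channels = 64`), the number of units per block (`n_units = 3`), and the modelling of each unit as two 3×3 convolutions are assumptions made for illustration. The sketch counts one block only and ignores biases and all other layers, so it is not expected to reproduce the "up to 10%" figure for the whole network.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weight count of a 2D convolution, ignoring biases."""
    return in_ch * out_ch * kernel * kernel


channels = 64  # assumed feature width
n_units = 3    # assumed number of units inside one cascading block

# Plain stack: each unit is modelled as two 3x3 convolutions.
unit = 2 * conv_params(channels, channels, 3)
baseline = n_units * unit

# Cascading: after unit k, the block input and the k unit outputs so far are
# concatenated ((k + 1) * channels inputs) and fused back to `channels`
# outputs by a 1x1 convolution.
fusion = sum(
    conv_params((k + 1) * channels, channels, 1) for k in range(1, n_units + 1)
)

overhead = fusion / baseline
print(f"baseline={baseline}, fusion={fusion}, overhead={overhead:.1%}")
```

The point of the sketch is that a 1×1 kernel has nine times fewer weights per input channel than a 3×3 kernel, so the fusion layers stay cheap relative to the units they connect even though their input is a concatenation.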

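Table 2: Effects of the global and local cascading modules, measured on the Set14 ×4 dataset. CARN-NL denotes CARN without local cascading, and CARN-NG denotes CARN without global cascading.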

We see that the model with only global cascading (CARN-NL) shows better performance than the baseline because the global cascading mechanism effectively carries mid- to high-level frequency signals from shallow to deep layers. Furthermore, by gathering all features before the upsampling layers, the model can better leverage multi-level representations. By incorporating multi-level representations, the CARN model can consider a variety of information from many different receptive fields when reconstructing the image.

Somewhat surprisingly, using only local cascading blocks (CARN-NG) harms the performance. As discussed in He et al.[he2016identity], multiplicative manipulations such as 1×1 convolution on the shortcut connection can hamper information propagation, and thus lead to complications during optimization. Similarly, cascading connections in the local cascading blocks of CARN-NG behave as shortcut connections inside the residual blocks. Because these connections consist of concatenation and 1×1 convolutions, it is natural to expect performance degradation, as mentioned by He et al.[he2016identity]. That is, the advantage of multi-level representation is limited to the inside of each local cascading block. Therefore, there appears to be no benefit of using the cascading connection because of the increased number of multiplication operations in the cascading connection.

However, CARN uses both local and global cascading levels and outperforms all three models. This is because the global cascading mechanism eases the information propagation issues that CARN-NG suffers from. In detail, information propagates globally via global cascading, and information flows in the local cascading blocks are fused with the ones that come through global connections. By doing so, information is transmitted by multiple shortcuts and thus mitigates the vanishing gradient problem. In other words, the advantage of multi-level representation is leveraged by the global cascading connections, which help the information to propagate to higher layers.
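The interplay of concatenation and 1×1 convolution described above can be sketched in a few lines. The code works on the channel vector of a single pixel, where a 1×1 convolution reduces to a matrix-vector product. The helper names (`conv1x1`, `averaging_weights`, `cascading_block`) and the fixed averaging kernel are illustrative assumptions; in a real network the fusion weights are learned and the units are convolutional blocks.

```python
def conv1x1(features, weights):
    """1x1 convolution at one pixel: a linear map over the channel axis."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def averaging_weights(n_inputs, channels):
    """Toy 1x1 kernel that averages the concatenated groups channel by channel."""
    groups = n_inputs // channels
    return [
        [1.0 / groups if i % channels == c else 0.0 for i in range(n_inputs)]
        for c in range(channels)
    ]


def cascading_block(x, units):
    """Run `units` in sequence. After each unit, concatenate the block input
    and all unit outputs so far, then fuse back to len(x) channels."""
    collected = list(x)  # concatenation buffer, starts with the block input
    current = x
    for unit in units:
        out = unit(current)
        collected = collected + out  # channel-wise concatenation
        weights = averaging_weights(len(collected), len(x))
        current = conv1x1(collected, weights)
    return current


def double(v):
    """Stand-in for a residual unit (hypothetical)."""
    return [2 * a for a in v]


x = [1.0, 2.0]
local_out = cascading_block(x, [double, double])

# Global cascading reuses the same pattern, with whole local blocks as units.
def local_block(v):
    return cascading_block(v, [double, double])

global_out = cascading_block(x, [local_block, local_block])
print(local_out, global_out)
```

Because the same function serves both levels, the sketch also shows why the two mechanisms compose naturally: the global level treats each local cascading block as one unit and adds its own shortcuts around it.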

**Efficiency Trade-off.**

Fig. 5 depicts the trade-off study of PSNR vs. parameters, and PSNR vs. operations (Mult-Adds) in relation to the efficient residual block and recursive network. In this experiment, we evaluate all possible group sizes of the group convolution in the efficient residual block for both the recursive and non-recursive cases. In both graphs, the blue line represents the model that does not use the recursive scheme and the orange line is the model that uses the recursive cascading block.
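Since PSNR is the quality measure on one axis of this trade-off, a minimal sketch of its standard definition may be useful as a reminder. The function name `psnr` and the flat-list image representation are illustrative assumptions; evaluation details such as the color channel used or border handling are not specified in this section and are not modelled here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([100, 100], [110, 90]))
```

PSNR is logarithmic in the mean squared error, so differences of a few tenths of a dB, like those discussed in this section, correspond to modest relative changes in reconstruction error.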

Although all efficient models perform worse than the CARN model that shows 28.70 PSNR, the number of parameters and operations are decreased dramatically. For example, the G64 case shows a five-times reduction in both parameters and operations. However, unlike the comparable performance that is shown in Howard et al.[mobilenets], the degradation of performance is more pronounced in our experiment. There are two reasons for this: 1) We stack two consecutive group convolutions before the pointwise convolution in the efficient residual block. 2) Replacing the 3×3 convolution with a 1×1 in the upsampling layer harms the performance as well. However, despite the weakness, it is crucial to use the above two ideas to create fast, small, and accurate super-resolution models.
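The effect of the group convolution on block cost can be estimated with a short counting sketch. It models the efficient residual block as described above (two 3×3 group convolutions followed by a pointwise convolution) and compares it with a plain block of two 3×3 convolutions. The channel width of 64, the reading of "G64" as 64 groups, and the omission of biases are assumptions. The ratios are per block, so they differ from whole-model reductions such as the five-times figure quoted above, which also include layers that are not grouped.

```python
def conv_cost(in_ch, out_ch, kernel, groups=1):
    """Weights of a (grouped) convolution, ignoring biases.
    The Mult-Adds per output pixel equal the same number."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("channels must be divisible by groups")
    return (in_ch // groups) * out_ch * kernel * kernel


def residual_block_cost(channels):
    """Two standard 3x3 convolutions."""
    return 2 * conv_cost(channels, channels, 3)


def efficient_block_cost(channels, groups):
    """Two 3x3 group convolutions followed by one pointwise (1x1) convolution."""
    grouped = 2 * conv_cost(channels, channels, 3, groups)
    pointwise = conv_cost(channels, channels, 1)
    return grouped + pointwise


channels = 64  # assumed feature width
plain = residual_block_cost(channels)
for g in (2, 4, 8, 16, 32, 64):
    eff = efficient_block_cost(channels, g)
    print(f"G{g}: {eff} weights, {plain / eff:.2f}x fewer than the plain block")
```

The pointwise convolution has a fixed cost, so the savings flatten out as the number of groups grows; this is one way to read why very large group counts give diminishing returns.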

Next, we observe the case which uses the recursive scheme. As illustrated in Fig. 5(b), there is no change in Mult-Adds but the performance worsens, which seems reasonable given the decreased number of parameters in the recursive scheme. On the other hand, Fig. 5(a) shows that using the recursive scheme makes the model achieve better performance with fewer parameters. Based on these observations, we decide to choose the group size as four in the efficient residual block and use the recursive network scheme as our CARN-M model. By doing so, CARN-M reduces the parameters by five times and the operations by nearly four times, at a loss of 0.29 PSNR compared to CARN.
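The observation that the recursive scheme lowers the parameter count but leaves Mult-Adds unchanged follows directly from weight sharing, as the following sketch illustrates. The function name `model_cost` and all numbers are hypothetical; the sketch only counts a stack of identical blocks and ignores every other layer.

```python
def model_cost(block_params, block_mult_adds, n_blocks, recursive):
    """Return (parameters, mult_adds) for a stack of identical blocks.

    With the recursive scheme, one set of block weights is reused at every
    position, so the weights are stored once while every position is still
    computed at inference time.
    """
    params = block_params if recursive else block_params * n_blocks
    mult_adds = block_mult_adds * n_blocks
    return params, mult_adds


# Hypothetical per-block costs, for illustration only.
print(model_cost(100, 1000, 3, recursive=False))
print(model_cost(100, 1000, 3, recursive=True))
```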

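Fig. 5: Results of using the efficient residual block and the recursive network in terms of PSNR vs. parameters (left) and PSNR vs. operations (right). We evaluate all models on Set14 at ×4 scale. GConv represents the group size of the group convolution, and R denotes a model with the recursive network scheme (i.e., G4R represents group four with recursive cascading blocks).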

5. Conclusion

In this work, we proposed a novel cascading network architecture that can perform SISR accurately and efficiently. The main idea behind our architecture is to add multiple cascading connections starting from each intermediary layer to the others. Such connections are made on both the local (block-wise) and global (layer-wise) levels, which allows for efficient flow of information and gradient. Our experiments show that employing both types of connections greatly outperforms those using only one or none at all.

We wish to further develop this work by improving and applying our technique on video data. Many streaming services require large storage to provide high-quality videos. In conjunction with our approach, one may devise a service that stores low-quality videos that go through our SR system to produce high-quality videos on-the-fly.

Acknowledgments

Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn
ECCV 2018
https://arxiv.org/pdf/1803.08664.pdf

Initial translation: machine translation. Proofreading: 白果. First published on: AI 千集.

This research was funded by the Korean Ministry of Education and supported by the National Research Foundation of Korea (NRF): NRF-2016R1D1A1B03933875 (K.-A. Sohn) and NRF-2016R1A6A3A11932796 (B. Kang).