HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution
Abstract
Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational costs, which means that Transformer-based networks can only use input information from a limited spatial range. Therefore, this paper proposes a novel Hybrid Multi-Axis Aggregation network (HMA) to better exploit the potential information in features. HMA is constructed by stacking Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB). On the one hand, RHTB combines channel attention and self-attention to enhance non-local feature fusion and produce more visually pleasing results. On the other hand, GAB is used for cross-domain information interaction to jointly model similar features and obtain a larger receptive field. For the training phase of the super-resolution task, a novel pre-training method is designed to further enhance the model's representation capability, and the proposed model's effectiveness is validated through extensive experiments. The experimental results show that HMA outperforms the state-of-the-art methods on the benchmark datasets. We provide code and models at https://github.com/korouuuuu/HMA.
1. Introduction
Natural images have different features, such as multi-scale pattern repetition, same-scale texture similarity, and structural similarity [45]. Deep neural networks can exploit these properties for image reconstruction. However, CNNs cannot capture complex dependencies between distant elements because of their fixed local receptive fields and parameter-sharing mechanism, which limits their ability to model long-range dependencies [25]. Recent research has introduced the self-attention mechanism into computer vision [20, 23]. Researchers have exploited the long-range dependency modeling capability and multi-scale processing advantages of self-attention to enhance the joint modeling of different hierarchical structures in images.
Figure 1. The performance of the proposed HMA compared with the state-of-the-art SwinIR, ART, HAT, and GRL methods in terms of PSNR (dB). Our method outperforms the state-of-the-art methods by $0.1\mathrm{dB}{\sim}1.4\mathrm{dB}$.
Although Transformer-based methods have been successfully applied to image restoration tasks, there is still room for improvement. Existing window-based Transformer networks restrict the self-attention computation to a dense local area. This strategy leads to a limited receptive field and does not fully utilize the feature information of the original image. To generate images with more realistic details, researchers have considered using GAN networks or feeding in reference information to provide additional features [4, 11, 33]. However, the network may generate unreasonable results if the additional input features do not match.
To overcome the above problems, we propose a Hybrid Multi-Axis Aggregation network called HMA in this paper. HMA combines channel attention and self-attention, using channel attention's global information perception capability to compensate for the shortcomings of self-attention. In addition, we introduce a grid attention block to achieve modeling across distant areas of the image. Meanwhile, to further unlock the potential performance of the model, we customize a pre-training strategy for the super-resolution task. Benefiting from these designs, as shown in Fig. 1, our proposed method effectively improves model performance ($0.1\mathrm{dB}{\sim}1.4\mathrm{dB}$). The main contributions of this paper are summarized as follows:
• We propose a novel Hybrid Multi-Axis Aggregation network (HMA). HMA comprises Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB), which together account for both local and global receptive fields. GAB models similar features at different image scales to achieve better reconstruction.
• We further propose a pre-training strategy for super-resolution tasks that can effectively improve the model's performance at a small training cost. Through a series of comprehensive experiments, we substantiate that HMA attains state-of-the-art performance across various test datasets.
2. Related Works
2.1. CNN-Based SISR
CNN-based SISR methods have made significant progress in recovering image texture details. SRCNN [36] was the first to solve the super-resolution task using a CNN. Subsequently, to enhance the network's learning ability, VDSR [15] introduced the residual learning idea, which effectively solved the gradient vanishing problem in deep network training. In SRGAN [17], Christian Ledig et al. proposed using generative adversarial networks to optimize the process of generating super-resolution images: the generator of SRGAN learns the mapping from low-resolution to high-resolution images, and adversarial training improves the quality of the generated images. ESRGAN [35] introduced the Residual-in-Residual Dense Block (RRDB) as its basic network unit and improved the perceptual loss by using features before activation, so that the images generated by ESRGAN [35] have more realistic natural textures. In addition, researchers continue to propose new network architectures to recover more realistic super-resolution image details [3, 8, 38].
2.2. Transformer-Based SISR
In recent years, Transformer-based SISR has become an emerging research direction in super-resolution; it utilizes the Transformer architecture to map images from low to high resolution. Among these methods, the Swin Transformer-based SwinIR [20] model surpasses CNN-based approaches and achieves the best performance on image restoration tasks. To further investigate the effect of pre-training on internal representations, Chen et al. proposed a novel Hybrid Attention Transformer (HAT) [6]. HAT introduces overlapping cross-attention blocks to enhance the interaction between features of neighboring windows, thereby better aggregating cross-window information. Our proposed HMA network learns similar feature representations through grid-multiplexed self-attention and combines it with channel attention to enhance non-local feature fusion. Therefore, our method can provide additional support for image restoration through similar features in the original image.
2.3. Self-Similarity-Based Image Restoration
Natural images usually exhibit similar features at different hierarchies, and many CNN-based SISR methods have achieved remarkable results by exploring self-similarity [14, 29, 31]. To reduce computational complexity, the computation of self-similarity is usually restricted to local areas. Researchers have also proposed extending the search space with geometric transformations to increase global feature interactions [12]. In Transformer-based SISR, the computational complexity of non-local self-attention grows quadratically with image size. Recent studies have proposed using sparse global self-attention to reduce this complexity [40]; sparse global self-attention allows more feature interactions while reducing computational cost. The proposed GAB adopts the idea of sparse self-attention to increase global feature interactions while balancing computational complexity. Our method allows joint modeling of similar features to generate better reconstructed images.
3. Motivation
Image self-similarity is vital in image processing, computer vision, and pattern recognition. It is usually characterized by multi-scale and geometric-transformation invariance, and it can be local or global. Local self-similarity means that one area of an image is similar to another, while global self-similarity means that self-similarity exists between multiple areas within the whole image. Fig. 2 shows that texture units may be repeated at regular intervals. When recovering the features in the green rectangle, modeling the similarity of features at other locations in the input image (e.g., the yellow rectangle) can provide a reference for its reconstruction. Image self-similarity has been exploited with satisfactory performance in classical super-resolution algorithms.
Figure 2. Example of image similarity based on non-local textures. Image from DIV2K:0830.
Figure 3. Grid Attention Strategies. We divide the feature map into sparse areas at specific intervals ($X=4$) and then compute the self-attention within each set of sparse areas.
Swin Transformer [22] employs cross-window connectivity and multi-head attention mechanisms to address the long-range dependency modeling problem. However, Swin Transformer can only use a limited range of pixels when dealing with the SR task and cannot effectively use image self-similarity to enhance the reconstruction. To increase the range of pixels utilized by the Swin Transformer, we enhance its long-range dependency modeling capability with sparse attention. As shown in Fig. 3, we add grid attention to increase the interaction between patches. The feature map is divided into $K^{2}$ groups according to the interval size $K$, and each group contains $\textstyle{\frac{H}{K}}\times{\frac{W}{K}}$ patches. After the grid shuffle, we obtain the feature $F_{G}\in\mathbb{R}^{\frac{H}{K}\times\frac{W}{K}\times C}$ and compute the self-attention within each group.
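To make the grid partition concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of the grid shuffle described above: a (B, H, W, C) feature map is regrouped into $K^{2}$ sparse groups of spatial size $\frac{H}{K}\times\frac{W}{K}$, within which self-attention can then be computed. The function names and the channels-last layout are illustrative assumptions.

```python
import torch

def grid_shuffle(x: torch.Tensor, k: int) -> torch.Tensor:
    """Regroup a (B, H, W, C) feature map into K*K sparse groups.

    Pixels that are K apart along each spatial axis land in the same group,
    so every group covers the whole image at stride K and has spatial size
    (H/K, W/K)."""
    b, h, w, c = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by K"
    # Split each spatial axis into (position within group, group index).
    x = x.view(b, h // k, k, w // k, k, c)
    # Move the two group indices next to the batch dimension.
    x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
    # One sparse group per "batch" entry, ready for per-group self-attention.
    return x.view(b * k * k, h // k, w // k, c)

def grid_unshuffle(x: torch.Tensor, k: int, h: int, w: int) -> torch.Tensor:
    """Inverse of grid_shuffle."""
    bkk, hk, wk, c = x.shape
    b = bkk // (k * k)
    x = x.view(b, k, k, hk, wk, c)
    x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
    return x.view(b, h, w, c)
```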
Not all areas in a natural image have similarity relationships. To prevent non-similar features from damaging the original features, we introduce the global feature-based interaction feature $G\in\mathbb{R}^{\frac{H}{K}\times\frac{W}{K}\times\frac{C}{2}}$ and the window-based self-attention mechanism ((S)W-MSA) to capture the similarity relationships of the whole image while modeling similar features with Grid Multi-head Self-Attention (Grid-MSA). The detailed computational procedure is described in Sec. 4.3.
Figure 4. (a) CKA similarity between all G and Q in the $\times2$ SR model. (b) CKA similarity between all G and $\mathrm{K}$ in the $\times2$ SR model.
To make Grid-MSA work better, we must ensure similarity between the interaction features and the query/key structure. Therefore, we introduce centered kernel alignment (CKA) [16] to study the similarity between features. It can be observed that the CKA similarity maps in Fig. 4 present a diagonal structure, i.e., there is close structural similarity between the interaction features and the query/key in the same layer ($\mathrm{CKA}>0.9$). Therefore, interaction features can serve as a medium for query/key interaction with global features in Grid-MSA. With the benefit of these designs, our network is able to reconstruct the image while taking full advantage of the pixel information in the input image.
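For reference, linear CKA between two feature matrices can be computed as below. This is a generic sketch, not the exact protocol used to produce Fig. 4; flattening the compared features (e.g., interaction features G and queries Q) into (n, d) matrices over the same n samples is an assumption.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear centered kernel alignment between two feature matrices.

    x: (n, d1), y: (n, d2) describe the same n samples with two feature sets.
    Returns a scalar in [0, 1]; values near 1 indicate closely aligned
    representations."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.linalg.norm(y.T @ x) ** 2
    return cross / (torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y))
```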
4. Proposed Method
As shown in Fig. 5, HMA consists of three parts: shallow feature extraction, deep feature extraction, and image reconstruction. Each RHTB is a stacked combination of multiple Fused Attention Blocks (FAB) and a GAB, built with a residual-in-residual structure. We introduce these components in detail in the following sections.
4.1. Overall Architecture
For a given low-resolution (LR) input $I_{LR}\in\mathbb{R}^{H\times W\times C_{in}}$ ($H$, $W$, and $C_{in}$ are the height, width, and number of input channels of the input image, respectively), we first extract the shallow features of $I_{LR}$ using a convolutional layer that maps $I_{LR}$ to high-dimensional features $F_{0}\in\mathbb{R}^{H\times W\times C}$:
$$
F_{0}=H_{Conv}(I_{LR}),
$$
where $H_{Conv}(\cdot)$ denotes the convolutional layer and $C$ denotes the number of channels of the intermediate layer features. Subsequently, we input $F_{0}$ into $H_{DF}(\cdot)$, a deep feature extraction group consisting of M RHTBs and a $3\times3$ convolution. Each RHTB consists of a stack of N FABs, a GAB, and a convolutional layer with residual connections. Then, we fuse the deep features $F_{D}\in\mathbb{R}^{H\times W\times C}$ with $F_{0}$ by element-by-element summation to obtain $F_{REC}$. Finally, we reconstruct $F_{REC}$ into a high-resolution image $I_{HR}$:
$$
I_{HR}=H_{REC}(H_{DF}(F_{0})+F_{0}),
$$
where $H_{REC}(\cdot)$ denotes the reconstruction module.
Figure 5. The overall architecture of HMA and the structure of RHTB and GAB.
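The overall pipeline of Sec. 4.1 can be summarized by the following sketch. The RHTB internals are omitted (passed in as a builder), and the pixel-shuffle reconstruction head is an assumption based on common practice rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class HMASkeleton(nn.Module):
    """High-level forward pass implied by Sec. 4.1 (RHTB internals omitted)."""

    def __init__(self, c_in=3, c=180, num_rhtb=6, scale=4, rhtb_builder=None):
        super().__init__()
        self.conv_first = nn.Conv2d(c_in, c, 3, 1, 1)        # shallow feature extraction
        blocks = [rhtb_builder(c) if rhtb_builder else nn.Identity()
                  for _ in range(num_rhtb)]
        self.rhtbs = nn.Sequential(*blocks)                   # M residual hybrid Transformer blocks
        self.conv_after_body = nn.Conv2d(c, c, 3, 1, 1)       # 3x3 conv closing the deep feature group
        self.reconstruction = nn.Sequential(                  # assumed pixel-shuffle upsampler
            nn.Conv2d(c, c * scale * scale, 3, 1, 1),
            nn.PixelShuffle(scale),
            nn.Conv2d(c, c_in, 3, 1, 1),
        )

    def forward(self, lr):
        f0 = self.conv_first(lr)                              # F_0 = H_Conv(I_LR)
        fd = self.conv_after_body(self.rhtbs(f0))             # F_D = H_DF(F_0)
        return self.reconstruction(fd + f0)                   # I_HR = H_REC(F_D + F_0)
```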
4.2. Fused Attention Block (FAB)
Figure 6. The architecture of FAB.
Many studies have shown that adding appropriate convolutions to a Transformer can further improve network trainability [26, 37, 42]. Therefore, we insert a convolutional layer before the Swin Transformer Layer (STL) to enhance the network's learning capability. As shown in Fig. 6, we place the Fused Conv module $(H_{Fuse}(\cdot))$, built with inverted bottlenecks and squeeze-and-excitation, before the STL to strengthen global information fusion. Note that we use Layer Norm instead of Batch Norm in Fused Conv to avoid affecting the contrast and color of the image. The computational procedure of Fused Conv is:
$$
F_{Fuse}=H_{Fuse}(F_{F_{in}})+F_{F_{in}},
$$
where $F_{F_{in}}$ represents the input features and $F_{Fuse}$ represents the features output by the Fused Conv block. We then add two successive STLs after Fused Conv. In the STL, we follow the classical design of SwinIR, including window-based self-attention (W-MSA), shifted-window-based self-attention (SW-MSA), and Layer Norm. The STL is computed as follows:
$$
F_{N}=(S)W\text{-}MSA(LN(F_{W_{in}}))+F_{W_{in}},
$$
$$
F_{out}=MLP(LN(F_{N}))+F_{N},
$$
where $F_{W_{in}}$, $F_{N}$, and $F_{out}$ denote the input features, the intermediate features, and the output of the STL, respectively, and MLP denotes the multilayer perceptron. For efficient modeling, we split the feature map uniformly into $\frac{H\times W}{M^{2}}$ non-overlapping windows, each containing $M\times M$ patches. The self-attention of a local window is calculated as follows:
$$
Attention(Q,K,V)=SoftMax(\frac{QK^{T}}{\sqrt{d}}+B)V,
$$
where $Q,K,V\in\mathbb{R}^{M^{2}\times d}$ are obtained by a linear transformation of the given input feature $F_{W}\in\mathbb{R}^{M^{2}\times C}$. Here, $d$ and $B$ denote the dimension of the query/key and the relative position encoding, respectively.
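As a concrete illustration of the window partition and the attention formula above, here is a minimal single-head sketch; the Q/K/V projections and multi-head handling are omitted, and the (B, H, W, C) layout is an assumption.

```python
import torch
import torch.nn.functional as F

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into non-overlapping (B*num_windows, M*M, C) windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, m * m, c)

def window_attention(q, k, v, bias):
    """SoftMax(QK^T / sqrt(d) + B) V within each window.

    q, k, v: (num_windows, M*M, d); bias: (M*M, M*M) relative position bias."""
    d = q.shape[-1]
    attn = q @ k.transpose(-2, -1) / d ** 0.5 + bias
    return F.softmax(attn, dim=-1) @ v
```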
Figure 7. The computational flowchart of Grid Attention.
As shown in Fig. 6, Fused Conv expands the channels using a $3\times3$ convolutional kernel with a default expansion rate of 6. At the same time, a squeeze-excitation (SE) layer with a shrink rate of 0.5 is used in the channel attention layer. Finally, a $1\times1$ convolutional kernel is used to recover the channels.
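One possible reading of this description in code is sketched below. The activation function, the exact placement of the SE layer within the expanded channels, and the channels-first layout are assumptions; only the 3×3 expansion (rate 6), the SE shrink rate of 0.5, the 1×1 projection, and the use of Layer Norm come from the text.

```python
import torch
import torch.nn as nn

class FusedConv(nn.Module):
    """Sketch of Fused Conv: LayerNorm, 3x3 expansion conv, squeeze-excitation,
    and a 1x1 projection back to C channels, with the residual of
    F_Fuse = H_Fuse(F_in) + F_in applied at the end."""

    def __init__(self, c: int, expand: int = 6, se_shrink: float = 0.5):
        super().__init__()
        hidden = c * expand
        self.norm = nn.LayerNorm(c)                       # LayerNorm, not BatchNorm
        self.expand_conv = nn.Conv2d(c, hidden, 3, 1, 1)  # 3x3 kernel, expansion rate 6
        self.act = nn.GELU()                              # assumed activation
        self.se = nn.Sequential(                          # squeeze-excitation, shrink rate 0.5
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, int(hidden * se_shrink), 1),
            nn.GELU(),
            nn.Conv2d(int(hidden * se_shrink), hidden, 1),
            nn.Sigmoid(),
        )
        self.project_conv = nn.Conv2d(hidden, c, 1)       # 1x1 kernel recovers channels

    def forward(self, x):                                 # x: (B, C, H, W)
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.act(self.expand_conv(y))
        y = y * self.se(y)                                # channel attention reweighting
        y = self.project_conv(y)
        return y + x                                      # residual connection
```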
4.3. Grid Attention Block (GAB)
We introduce GAB to model cross-area similarity for enhanced image reconstruction. The GAB consists of a Mix Attention Layer (MAL) and an MLP layer. In the MAL, we first split the input feature $F_{in}$ into two parts along the channel dimension: $F_{G}\in\mathbb{R}^{H\times W\times\frac{C}{2}}$ and $F_{W}\in\mathbb{R}^{H\times W\times\frac{C}{2}}$. Subsequently, we split $F_{W}$ into two parts along the channel dimension again and input them into W-MSA and SW-MSA, respectively. Meanwhile, $F_{G}$ is input into Grid-MSA. The computation process of MAL is as follows:
$$
X_{W_{1}}=W\text{-}MSA(F_{W_{1}}),
$$
$$
X_{W_{2}}=SW\text{-}MSA(F_{W_{2}}),
$$
$$
X_{G}=Grid\text{-}MSA(F_{G}),
$$
$$
X_{MAL}=LN(Cat(X_{W_{1}},X_{W_{2}},X_{G}))+F_{in},
$$
where $X_{W_{1}}$, $X_{W_{2}}$, and $X_{G}$ are the output features of W-MSA, SW-MSA, and Grid-MSA, respectively. It should be noted that we adopt the post-norm method in GAB to enhance the network training stability. For a given input feature $F_{in}$, the computation process of GAB is:
$$
F_{M}=LN(MAL(F_{in}))+F_{in},
$$
$$
F_{out}=LN(MLP(F_{M}))+F_{M},
$$
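Putting the channel split and the post-norm residuals together, the MAL and GAB can be sketched as follows. The ordering of the channel split, the channels-last layout, and treating the three attention branches and the MLP as given callables are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MixAttentionLayer(nn.Module):
    """Sketch of the MAL routing: one channel half goes to Grid-MSA, the other
    half is split between W-MSA and SW-MSA; the outputs are concatenated,
    normalized, and added back to the input (post-norm residual)."""

    def __init__(self, c, w_msa, sw_msa, grid_msa):
        super().__init__()
        self.w_msa, self.sw_msa, self.grid_msa = w_msa, sw_msa, grid_msa
        self.norm = nn.LayerNorm(c)

    def forward(self, f_in):                              # f_in: (B, H, W, C)
        c = f_in.shape[-1]
        # Channel split order is an assumption made for illustration.
        f_w1, f_w2, f_g = torch.split(f_in, [c // 4, c // 4, c // 2], dim=-1)
        x = torch.cat([self.w_msa(f_w1), self.sw_msa(f_w2), self.grid_msa(f_g)], dim=-1)
        return self.norm(x) + f_in                        # X_MAL = LN(Cat(...)) + F_in

class GridAttentionBlock(nn.Module):
    """GAB = MAL followed by an MLP, both with post-norm residuals."""

    def __init__(self, c, mal, mlp):
        super().__init__()
        self.mal, self.mlp = mal, mlp
        self.norm1, self.norm2 = nn.LayerNorm(c), nn.LayerNorm(c)

    def forward(self, f_in):
        f_m = self.norm1(self.mal(f_in)) + f_in           # F_M = LN(MAL(F_in)) + F_in
        return self.norm2(self.mlp(f_m)) + f_m            # F_out = LN(MLP(F_M)) + F_M
```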
As shown in Fig. 7, when Grid-MSA is used, $Q$, $K$, and $V$ are obtained from the input feature $F_{G}$ after the grid shuffle, and $G\in\mathbb{R}^{H\times W\times\frac{C}{2}}$ is obtained from a linear transformation of the input feature $F_{in}$ after the grid shuffle. For Grid-MSA, the self-attention is calculated as follows:
$$
\hat{X}=SoftMax(\frac{GK^{T}}{d}+B)V,
$$
$$
Attention(Q,G,\hat{X})=SoftMax(\frac{QG^{T}}{d}+B)\hat{X},
$$
where $\hat{X}$ is the intermediate feature obtained by computing the self-attention from $G$, $K$, and $V$.
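The two-step attention above can be written compactly as follows (single head, per grid group). Flattening each grid group to N tokens and sharing the bias term across both steps are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grid_msa_attention(q, k, v, g, bias):
    """Two-stage Grid-MSA attention from the equations above (single head).

    q, k, v: (num_groups, N, d) from the grid-shuffled feature F_G;
    g:       (num_groups, N, d) interaction feature from F_in after grid shuffle;
    bias:    (N, N) relative position bias.
    Note that the equations scale by d rather than sqrt(d)."""
    d = q.shape[-1]
    x_hat = F.softmax(g @ k.transpose(-2, -1) / d + bias, dim=-1) @ v   # X_hat
    return F.softmax(q @ g.transpose(-2, -1) / d + bias, dim=-1) @ x_hat
```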
4.4. Pre-training strategy
Pre-training plays a crucial role in many visual tasks [1, 34]. Recent studies have shown that pre-training can also bring significant gains in low-level visual tasks. IPT [5] handles different visual tasks by sharing the Transformer module across different head and tail structures. EDT [18] improves the performance of the target task through multi-task pre-training. HAT [6] pre-trains directly on the same super-resolution task using a larger dataset. Instead, we propose a pre-training method better suited to super-resolution tasks, i.e., increasing the gain of pre-training by sharing model parameters among pre-trained models with different degradation levels. When pre-training on the ImageNet dataset, we first train a $\times2$ model as the initial parameter seed and use it as the initialization for the $\times3$ model. We then train the final $\times2$ and $\times4$ models using the trained $\times3$ model as their initialization. After pre-training, the $\times2$, $\times3$, and $\times4$ models are fine-tuned on the DF2K dataset. The proposed strategy brings a larger performance improvement, although it pays an extra training cost (training the seed $\times2$ model).
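A minimal sketch of the parameter-sharing step, assuming standard PyTorch state dicts: parameters with matching names and shapes are copied from the seed checkpoint, while scale-specific layers (e.g., the reconstruction head) keep their fresh initialization. The checkpoint file names are hypothetical.

```python
import torch

def init_from_seed(model: torch.nn.Module, seed_ckpt_path: str) -> None:
    """Initialize a model at one scale from a checkpoint trained at another scale.

    Only parameters whose names and shapes match are copied; the checkpoint
    layout (a flat state dict) is an assumption."""
    seed_state = torch.load(seed_ckpt_path, map_location="cpu")
    model_state = model.state_dict()
    shared = {k: v for k, v in seed_state.items()
              if k in model_state and v.shape == model_state[k].shape}
    model_state.update(shared)
    model.load_state_dict(model_state)

# e.g. seed the x3 model from the pre-trained x2 model, then later seed the
# final x2 and x4 models from the trained x3 model:
# init_from_seed(model_x3, "hma_x2_imagenet.pth")
```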
5. Experiments
5.1. Experimental Setup
We use the DF2K dataset (the DIV2K [21] dataset merged with the Flickr2K [32] dataset) as the training set, and ImageNet [10] as the pre-training dataset. For the structure of HMA, the numbers of RHTBs and FABs are set to 6, the window size is set to 16, the number of channels is set to 180, and the number of attention heads is set to 6. The numbers of attention heads are 3 and 2 for Grid-MSA and (S)W-MSA in GAB, respectively. We evaluate on the Set5 [2], Set14 [39], BSD100 [27], Urban100 [14], and Manga109 [28] datasets. Both PSNR and SSIM are computed on the Y channel.
5.2. Training Details
Low-resolution images are generated by bicubic downsampling in MATLAB. We crop the training data into $64\times64$ patches and employ horizontal flipping and random rotation for data augmentation. The training batch size is set to 32. During pre-training on ImageNet [10], the total number of training iterations is set to 800K (1K represents 1000 iterations); the learning rate is initialized to $2\times10^{-4}$ and halved at [300K, 500K, 650K, 700K, 750K]. We optimize the model using the Adam optimizer (with $\beta_{1}{=}0.9$ and $\beta_{2}{=}0.99$). Subsequently, we fine-tune the model on the DF2K dataset: the total number of training iterations is set to 250K, and the initial learning rate is set to $5\times10^{-6}$ and halved at [125K, 200K, 230K, 240K].
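The optimizer and learning-rate schedule described above map onto standard PyTorch components roughly as follows; `model` is a placeholder for the HMA network, and the scheduler is assumed to be stepped once per training iteration.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)  # placeholder for the HMA network

# Pre-training schedule: lr 2e-4, halved at the listed iteration milestones.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300_000, 500_000, 650_000, 700_000, 750_000], gamma=0.5)

# Fine-tuning on DF2K: fresh optimizer with the smaller learning rate and
# the 250K-iteration milestone schedule.
ft_optimizer = torch.optim.Adam(model.parameters(), lr=5e-6, betas=(0.9, 0.99))
ft_scheduler = torch.optim.lr_scheduler.MultiStepLR(
    ft_optimizer, milestones=[125_000, 200_000, 230_000, 240_000], gamma=0.5)
```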