HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution
Abstract
Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational cost, which means that Transformer-based networks can only exploit input information from a limited spatial range. Therefore, a novel Hybrid Multi-Axis Aggregation network (HMA) is proposed in this paper to better exploit the potential information in features. HMA is constructed by stacking Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB). On the one hand, RHTB combines channel attention and self-attention to enhance non-local feature fusion and produce more visually pleasing results. On the other hand, GAB is used for cross-domain information interaction to jointly model similar features and obtain a larger receptive field. In addition, a novel pre-training method is designed for the super-resolution task to further enhance the model's representation capability, and the effectiveness of the proposed model is validated through extensive experiments. The experimental results show that HMA outperforms the state-of-the-art methods on the benchmark datasets. We provide code and models at https://github.com/korouuuuu/HMA.
1. Introduction
Natural images have diverse properties, such as multi-scale pattern repetition, same-scale texture similarity, and structural similarity [45]. Deep neural networks can exploit these properties for image reconstruction. However, CNNs cannot capture the complex dependencies between distant elements due to their fixed local receptive field and parameter-sharing mechanism, which limits their ability to model long-range dependencies [25]. Recent research has introduced the self-attention mechanism to computer vision [20, 23]. Researchers have used the long-range dependency modeling capability and multi-scale processing advantages of the self-attention mechanism to enhance the joint modeling of different hierarchical structures in images.

Figure 1. The performance of the proposed HMA compared with the state-of-the-art SwinIR, ART, HAT, and GRL methods in terms of PSNR (dB). Our method outperforms the state-of-the-art methods by $0.1\mathrm{dB}{\sim}1.4\mathrm{dB}$.
Although Transformer-based methods have been successfully applied to image restoration tasks, there is still room for improvement. Existing window-based Transformer networks restrict the self-attention computation to a dense local area. This strategy obviously leads to a limited receptive field and does not fully utilize the feature information of the original image. To generate images with more realistic details, researchers have considered using GAN networks or feeding in reference information to provide additional features [4, 11, 33]. However, the network may generate unreasonable results if the additional input features do not match.
To overcome the above problems, we propose a hybrid multi-axis aggregation network called HMA in this paper. HMA combines channel attention and self-attention, utilizing channel attention's global information perception capability to compensate for self-attention's shortcomings. In addition, we introduce a grid attention block to model long-range similarity across the image. Meanwhile, to further unlock the potential performance of the model, we customize a pre-training strategy for the super-resolution task. Benefiting from these designs, as shown in Fig. 1, our proposed method effectively improves model performance ($0.1\mathrm{dB}{\sim}1.4\mathrm{dB}$). The main contributions of this paper are summarised as follows:
• We propose a novel Hybrid Multi-Axis Aggregation network (HMA). The HMA comprises Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB), aiming to consider both local and global receptive fields. GAB models similar features at different image scales to achieve better reconstruction.
• We further propose a pre-training strategy for super-resolution tasks that can effectively improve the model's performance at a small training cost. Through a series of comprehensive experiments, our findings substantiate that HMA attains state-of-the-art performance across various test datasets.
2. Related Works
2.1. CNN-Based SISR
CNN-based SISR methods have made significant progress in recovering image texture details. SRCNN [36] was the first to solve the super-resolution task using CNNs. Subsequently, to enhance the network's learning ability, VDSR [15] introduced the idea of residual learning, which effectively alleviated the gradient vanishing problem in deep network training. In SRGAN [17], Christian Ledig et al. proposed using generative adversarial networks to optimize the process of generating super-resolution images. The generator of SRGAN learns the mapping from low-resolution images to high-resolution images and improves the quality of the generated images through adversarial training. ESRGAN [35] introduces the Residual in Residual Dense Block (RRDB) as its basic network unit and reduces the perceptual loss by using features before activation, so that the images generated by ESRGAN [35] have a more realistic natural texture. In addition, new network architectures are constantly being proposed by researchers to recover more realistic super-resolution image details [3, 8, 38].
2.2. Transformer-Based SISR
In recent years, Transformer-based SISR has become an emerging research direction in super-resolution, utilizing the Transformer architecture to map images from low to high resolution. Among these methods, the Swin Transformer-based SwinIR [20] model surpasses CNN-based methods and achieves the best performance on image restoration tasks. To further investigate the effect of pre-training on internal representations, Chen et al. proposed a novel Hybrid Attention Transformer (HAT) [6]. HAT introduces overlapping cross-attention blocks to enhance the interactions between neighboring windows' features, thus better aggregating cross-window information. Our proposed HMA network learns similar feature representations through grid-multiplexed self-attention and combines it with channel attention to enhance non-local feature fusion. Therefore, our method can provide additional support for image restoration through similar features in the original image.
2.3. Self-similarity based image restoration
Natural images usually have similar features across different hierarchies, and many CNN-based SISR methods have achieved remarkable results by exploring self-similarity [14, 29, 31]. To reduce the computational complexity, the computation of self-similarity is usually restricted to local areas. Researchers have also proposed extending the search space through geometric transformations to increase global feature interactions [12]. In Transformer-based SISR, the computational complexity of non-local self-attention grows quadratically with image size. Recent studies have proposed using sparse global self-attention to reduce this complexity [40]. Sparse global self-attention allows more feature interactions while reducing computational complexity. The proposed GAB adopts the idea of sparse self-attention to increase global feature interactions while keeping the computational complexity in check. Our method allows joint modeling of similar features to generate better reconstructed images.
3. Motivation
Image self-similarity is vital in image processing, computer vision, and pattern recognition. It is usually characterized by multi-scale and geometric-transformation invariance, and it can be local or global. Local self-similarity means that one area of an image is similar to another, and global self-similarity means that there is self-similarity among multiple areas within the whole image. Fig. 2 shows that texture units may be repeated at regular intervals. When recovering the features in the green rectangle, similarity modeling of features at different locations in the input image (e.g., the yellow rectangle) can provide a reference for reconstructing the green rectangle. Image self-similarity has been explored with satisfactory performance in classical super-resolution algorithms.

Figure 2. Example of image similarity based on non-local textures. Image from DIV2K:0830.

Figure 3. Grid attention strategies. We divide the feature map into sparse areas at a specific interval ($K=4$) and then compute the self-attention within each set of sparse areas.
Swin Transformer [22] employs cross-window connectivity and multi-head attention mechanisms to deal with the long-range dependency modeling problem. However, the Swin Transformer can only use a limited range of pixels when dealing with the SR task and cannot effectively use image self-similarity to enhance the reconstruction. To increase the range of pixels utilized by the Swin Transformer, we enhance its long-range dependency modeling capability with sparse attention. As shown in Fig. 3, we add grid attention to increase the interaction between patches. The feature map is divided into $K^{2}$ groups according to the interval size $K$, and each group contains $\frac{H}{K}\times\frac{W}{K}$ patches. After the grid shuffle, we obtain the features $F_{G}\in\mathbb{R}^{\frac{H}{K}\times\frac{W}{K}\times C}$ and compute the self-attention within each group.
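To make the partition concrete, here is a minimal NumPy sketch of the grid-shuffle step described above (the function name and layout are illustrative, not the authors' implementation): taking every $K$-th patch in each direction yields $K^{2}$ groups, each covering the whole map with an $\frac{H}{K}\times\frac{W}{K}$ footprint.

```python
import numpy as np

def grid_shuffle(feat, K):
    """Split an (H, W, C) map into K*K groups of stride-K samples.

    Each group holds (H/K) x (W/K) patches drawn at interval K, so one
    group spans the whole image rather than a dense local window.
    """
    H, W, C = feat.shape
    assert H % K == 0 and W % K == 0
    x = feat.reshape(H // K, K, W // K, K, C)
    x = x.transpose(1, 3, 0, 2, 4)              # group by the two stride offsets
    return x.reshape(K * K, H // K, W // K, C)  # (K^2, H/K, W/K, C)

feat = np.arange(8 * 8).reshape(8, 8, 1)
groups = grid_shuffle(feat, K=4)
print(groups.shape)  # (16, 2, 2, 1)
# Group 0 samples the map at stride 4: entries 0, 4, 32, 36.
```

Self-attention within one group therefore relates pixels that are far apart in the original image, which is exactly the sparse interaction the figure illustrates.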
Not all areas in a natural image have similarity relationships. To prevent non-similar features from damaging the original features, we introduce a global interaction feature $G\in\mathbb{R}^{\frac{H}{K}\times\frac{W}{K}\times\frac{C}{2}}$ and the window-based self-attention mechanism ((S)W-MSA) to capture the similarity relationships of the whole image while modeling similar features with Grid Multi-head Self-Attention (Grid-MSA). The detailed computational procedure is described in Sec. 4.3.

Figure 4. (a) CKA similarity between all G and Q in the $\times2$ SR model. (b) CKA similarity between all G and $\mathrm{K}$ in the $\times2$ SR model.
To make Grid-MSA work better, we must ensure structural similarity between the interaction features and the queries/keys. Therefore, we introduce centered kernel alignment (CKA) [16] to study the similarity between features. It can be observed that the CKA similarity maps in Fig. 4 present a diagonal structure, i.e., there is a close structural similarity between the interaction features and the query/key in the same layer (CKA $>0.9$). Therefore, interaction features can serve as a medium for query/key interaction with global features in Grid-MSA. With the benefit of these designs, our network is able to reconstruct the image while taking full advantage of the pixel information in the input image.
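For reference, linear CKA between two feature matrices can be computed in a few lines. This is a generic NumPy sketch of the metric from [16], not the authors' evaluation code:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between (n_samples, dim) features.

    Invariant to orthogonal transforms and isotropic scaling of either
    feature space; the two dims may differ.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro")
                         * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
F = rng.normal(size=(64, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
print(round(linear_cka(F, 2.0 * F @ Q), 4))  # 1.0: rotated + scaled copy
```

A CKA near 1 between interaction features and queries/keys is what justifies using $G$ as the mediator in Grid-MSA.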
4. Proposed Method
As shown in Fig. 5, HMA consists of three parts: shallow feature extraction, deep feature extraction, and image reconstruction. Among them, RHTB is a stacked combination of multiple Fused Attention Blocks (FAB) and a GAB. The RHTB is constructed with a residual-in-residual structure. We introduce these methods in detail in the following sections.
4.1. Overall Architecture
For a given low-resolution (LR) input $I_{LR}\in\mathbb{R}^{H\times W\times C_{in}}$ ($H$, $W$, and $C_{in}$ are the height, width, and number of input channels of the input image, respectively), we first extract the shallow features of $I_{LR}$ using a convolutional layer that maps $I_{LR}$ to high-dimensional features $F_{0}\in\mathbb{R}^{H\times W\times C}$:
$$
F_{0}=H_{Conv}(I_{LR}),
$$
where $H_{Conv}(\cdot)$ denotes the convolutional layer and $C$ denotes the number of channels of the intermediate-layer features. Subsequently, we input $F_{0}$ into $H_{DF}(\cdot)$, a deep feature extraction group consisting of M RHTBs and a $3\times3$ convolution. Each RHTB consists of a stack of N FABs, a GAB, and a convolutional layer with a residual connection. Then, we fuse the deep features $F_{D}\in\mathbb{R}^{H\times W\times C}$ with $F_{0}$ by element-wise summation to obtain $F_{REC}$. Finally, we reconstruct $F_{REC}$ into a high-resolution image $I_{HR}$:

Figure 5. The overall architecture of HMA and the structure of RHTB and GAB.

Figure 6. The architecture of FAB.
$$
I_{HR}=H_{REC}(H_{DF}(F_{0})+F_{0}),
$$
where $H_{REC}(\cdot)$ denotes the reconstruction module.
4.2. Fused Attention Block (FAB)
Many studies have shown that adding appropriate convolution to the Transformer can further improve network trainability [26, 37, 42]. Therefore, we insert a convolutional layer before the Swin Transformer Layer (STL) to enhance the network's learning capability. As shown in Fig. 6, we insert the Fused Conv module $(H_{Fuse}(\cdot))$, built with inverted bottlenecks and squeeze-excitation, before the STL to achieve enhanced global information fusion. Note that we use Layer Norm instead of Batch Norm in Fused Conv to avoid affecting the contrast and color of the image. The computational procedure of Fused Conv is:
$$
F_{Fuse}=H_{Fuse}(F_{F_{in}})+F_{F_{in}},
$$
where $F_{F_{in}}$ represents the input features, and $F_{Fuse}$ represents the features output by the Fused Conv block. Then, we append two successive STLs after Fused Conv. In the STL, we follow the classical design of SwinIR, including window-based self-attention (W-MSA), shifted-window-based self-attention (SW-MSA), and Layer Norm. The computation of the STL is as follows:
$$
F_{N}=(S)W\text{-}MSA(LN(F_{W_{in}}))+F_{W_{in}},
$$
$$
F_{out}=MLP(LN(F_{N}))+F_{N},
$$
where $F_{W_{in}}$, $F_{N}$, and $F_{out}$ indicate the input features, the intermediate features, and the output of the STL, respectively, and MLP denotes the multilayer perceptron. For efficient modeling, we split the feature map uniformly into $\frac{H\times W}{M^{2}}$ non-overlapping windows, each containing $M\times M$ patches. The self-attention within a local window is calculated as follows:
$$
Attention(Q,K,V)=SoftMax(\frac{QK^{T}}{\sqrt{d}}+B)V,
$$
where $Q,K,V\in\mathbb{R}^{M^{2}\times d}$ are obtained by linear transformation of the given input feature $F_{W}\in\mathbb{R}^{M^{2}\times C}$. Here, $d$ and $B$ represent the dimension and relative position encoding of the query/key, respectively.
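The windowed attention formula can be sketched directly in NumPy. The toy below (single head, one window, illustrative names and sizes) mirrors $SoftMax(QK^{T}/\sqrt{d}+B)V$:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(F_W, Wq, Wk, Wv, B):
    """Single-head self-attention over one M*M window of tokens.

    F_W: (M*M, C) window tokens; Wq/Wk/Wv: (C, d) projections;
    B: (M*M, M*M) relative position bias.
    """
    Q, K, V = F_W @ Wq, F_W @ Wk, F_W @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + B) @ V

rng = np.random.default_rng(1)
M, C, d = 4, 8, 8
F_W = rng.normal(size=(M * M, C))
Wq, Wk, Wv = (rng.normal(size=(C, d)) for _ in range(3))
out = window_attention(F_W, Wq, Wk, Wv, np.zeros((M * M, M * M)))
print(out.shape)  # (16, 8)
```

In the full model this is applied independently to every $M\times M$ window (with multiple heads), which is what keeps the cost linear in the number of windows.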

Figure 7. The computational flowchart of Grid Attention.
As shown in Fig. 6, Fused Conv expands the channels using a convolutional kernel of size 3 with a default expansion rate of 6. A squeeze-excitation (SE) layer with a shrink rate of 0.5 is then applied as the channel attention layer. Finally, a convolutional kernel of size 1 is used to restore the number of channels.
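As a rough illustration of the squeeze-excitation step inside Fused Conv (the 3x3 expansion and 1x1 projection convolutions are omitted, and all names and shapes here are hypothetical), a shrink rate of 0.5 means the SE bottleneck halves the channel count:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excitation(x, W1, W2):
    """SE over an (H, W, C) map: pool -> C -> C/2 -> C -> per-channel gate."""
    s = x.mean(axis=(0, 1))        # squeeze: global average pool, (C,)
    z = np.maximum(W1 @ s, 0.0)    # excite: bottleneck to C/2 (shrink rate 0.5)
    gate = sigmoid(W2 @ z)         # per-channel weights in (0, 1)
    return x * gate                # rescale channels, broadcast over H and W

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 4, 6))
W1 = rng.normal(size=(3, 6))       # (C/2, C)
W2 = rng.normal(size=(6, 3))       # (C, C/2)
y = squeeze_excitation(x, W1, W2)
print(y.shape)  # (4, 4, 6)
```

The global average pool is what gives the block image-wide context, complementing the window-limited self-attention that follows it.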
4.3. Grid Attention Block (GAB)
We introduce GAB to model cross-area similarity for enhanced image reconstruction. The GAB consists of a Mix Attention Layer (MAL) and an MLP layer. In the MAL, we first split the input feature $F_{in}$ into two parts along the channel dimension: $F_{G}\in\mathbb{R}^{H\times W\times\frac{C}{2}}$ and $F_{W}\in\mathbb{R}^{H\times W\times\frac{C}{2}}$. Subsequently, we split $F_{W}$ along the channel dimension again and input the two halves into W-MSA and SW-MSA, respectively. Meanwhile, $F_{G}$ is input into Grid-MSA. The computation process of the MAL is as follows:
$$
X_{W_{1}}=W\text{-}MSA(F_{W_{1}}),
$$
$$
X_{W_{2}}=SW\text{-}MSA(F_{W_{2}}),
$$
$$
X_{G}=Grid\text{-}MSA(F_{G}),
$$
$$
X_{MAL}=LN(Cat(X_{W_{1}},X_{W_{2}},X_{G}))+F_{in},
$$
where $X_{W_{1}}$, $X_{W_{2}}$, and $X_{G}$ are the output features of W-MSA, SW-MSA, and Grid-MSA, respectively. Note that we adopt the post-norm method in GAB to enhance training stability. For a given input feature $F_{in}$, the computation process of GAB is:
$$
F_{M}=LN(MAL(F_{in}))+F_{in},
$$
$$
F_{out}=LN(MLP(F_{M}))+F_{M}.
$$
As shown in Fig. 7, when Grid-MSA is used, $Q$, $K$, and $V$ are obtained from the input feature $F_{G}$ after the grid shuffle. $G\in\mathbb{R}^{H\times W\times\frac{C}{2}}$ is obtained by a linear transformation of the input feature $F_{in}$ after the grid shuffle. For Grid-MSA, the self-attention is calculated as follows:
$$
\hat{X}=SoftMax(\frac{GK^{T}}{d}+B)V,
$$
$$
Attention(Q,G,\hat{X})=SoftMax(\frac{QG^{T}}{d}+B)\hat{X},
$$
where $\hat{X}$ is the intermediate feature obtained by computing the self-attention from $G$, $K$, and $V$.
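Putting the two equations together: $G$ first aggregates global context from $K$ and $V$, then the queries $Q$ read that context back through $G$. A shape-level NumPy sketch (single head, illustrative sizes; note the equations above scale by $d$ rather than $\sqrt{d}$):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grid_msa(Q, K, V, G, B):
    """Grid attention with the interaction feature G as a mediator.

    Step 1: G attends over K/V to pool global context into X_hat.
    Step 2: Q attends over G to read the pooled context back out.
    """
    d = Q.shape[-1]
    X_hat = softmax(G @ K.T / d + B) @ V     # first equation
    return softmax(Q @ G.T / d + B) @ X_hat  # second equation

rng = np.random.default_rng(2)
n, d = 16, 8                                 # tokens per group, head dim
Q, K, V, G = (rng.normal(size=(n, d)) for _ in range(4))
out = grid_msa(Q, K, V, G, np.zeros((n, n)))
print(out.shape)  # (16, 8)
```

Because $G$ is derived from the whole (grid-shuffled) input, this two-step routing is what lets queries interact with features well outside their own window.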
4.4. Pre-training strategy
Pre-training plays a crucial role in many visual tasks [1, 34]. Recent studies have shown that pre-training can also yield significant gains in low-level visual tasks. IPT [5] handles different visual tasks by sharing the Transformer module among different head and tail structures. EDT [18] improves the performance of the target task through multi-task pre-training. HAT [6] pre-trains the super-resolution model on the same task directly using a larger dataset. Instead, we propose a pre-training method more suitable for super-resolution tasks, i.e., increasing the gain of pre-training by sharing model parameters among pre-trained models with different degradation levels. When pre-training on the ImageNet dataset, we first train a $\times2$ model as the initial parameter seed and then use it as the initialization for the $\times3$ model. We then train the final $\times2$ and $\times4$ models using the trained $\times3$ model as their initialization. After pre-training, the $\times2$, $\times3$, and $\times4$ models are fine-tuned on the DF2K dataset. The proposed strategy brings further performance improvement, although it pays an extra training cost (training the seed $\times2$ model).
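The cascade amounts to a parameter hand-off between scales, in which backbone weights are inherited and only shape-mismatched (scale-specific) parameters keep their fresh initialization. The sketch below is a hypothetical illustration of this idea, not the authors' training code:

```python
import numpy as np

def inherit(src, dst):
    """Adopt src weights wherever the parameter name and shape match dst."""
    return {name: src[name] if name in src and src[name].shape == w.shape else w
            for name, w in dst.items()}

# Toy "models": a shared backbone plus a scale-specific upsampler.
seed_x2 = {"backbone": np.ones((4, 4)), "upsampler": np.ones((2, 2))}
init_x3 = {"backbone": np.zeros((4, 4)), "upsampler": np.zeros((3, 3))}

model_x3 = inherit(seed_x2, init_x3)  # x2 seed -> x3 init
print(model_x3["backbone"][0, 0])     # 1.0: backbone inherited from the seed
print(model_x3["upsampler"].shape)    # (3, 3): x3-specific head kept fresh
```

The same hand-off would then run from the trained ×3 model into the final ×2 and ×4 models before fine-tuning on DF2K.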
5. Experiments
5.1. Experimental Setup
We use the DF2K dataset (the DIV2K [21] dataset merged with the Flickr2K [32] dataset) as the training set. Meanwhile, we use ImageNet [10] as the pre-training dataset. For the structure of HMA, the numbers of RHTBs and FABs are both set to 6, the window size is set to 16, the number of channels is set to 180, and the number of attention heads is set to 6. The numbers of attention heads are 3 and 2 for Grid-MSA and (S)W-MSA in GAB, respectively. We evaluate on the Set5 [2], Set14 [39], BSD100 [27], Urban100 [14], and Manga109 [28] datasets. Both PSNR and SSIM are computed on the Y channel.
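For clarity, evaluating on the Y channel means converting RGB to the BT.601 luma before computing the metric. A minimal sketch (inputs assumed in [0, 1]; helper names are illustrative):

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma of an (H, W, 3) image with values in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)      # uniform error of 0.1 -> MSE of 0.01
print(round(psnr(a, b), 6))   # 20.0
```

SSIM follows the same Y-channel convention but compares local luminance, contrast, and structure statistics rather than raw pixel error.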
5.2. Training Details
Low-resolution images are generated by down-sampling with bicubic interpolation in MATLAB. We crop the training images into $64\times64$ patches and employ horizontal flipping and random rotation for data augmentation. The training batch size is set to 32. During pre-training on ImageNet [10], the total number of training iterations is set to 800K (1K represents 1000 iterations); the learning rate is initialized to $2\times10^{-4}$ and halved at [300K, 500K, 650K, 700K, 750K]. We optimize the model using the Adam optimizer (with $\beta_{1}=0.9$ and $\beta_{2}=0.99$). Subsequently, we fine-tune the model on the DF2K dataset for 250K iterations, with the initial learning rate set to $5\times10^{-6}$ and halved at [125K, 200K, 230K, 240K].
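The step schedule above can be expressed as a small helper (assuming, as is common, that each milestone halves the rate for all subsequent iterations; this is a sketch, not the released training code):

```python
def lr_at(iteration, base_lr, milestones):
    """Step schedule: halve the learning rate at each milestone passed."""
    return base_lr * 0.5 ** sum(iteration >= m for m in milestones)

# Pre-training schedule from the text: 2e-4, halved at these iterations.
ms = [300_000, 500_000, 650_000, 700_000, 750_000]
print(lr_at(0, 2e-4, ms))        # 0.0002
print(lr_at(400_000, 2e-4, ms))  # 0.0001
print(lr_at(790_000, 2e-4, ms))  # 6.25e-06
```

The fine-tuning phase uses the same shape of schedule with base rate $5\times10^{-6}$ and milestones [125K, 200K, 230K, 240K].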
Table 1. Ablation study on the proposed Fused Conv and GAB.

| Fused Conv | × | × | √ | √ |
|---|---|---|---|---|
| GAB | × | √ | × | √ |
| PSNR/SSIM | 27.49/0.8271 | 28.30/0.8370 | 28.37/0.8375 | 28.42/0.8450 |

Table 2. Ablation study on the expansion rate of Fused Conv.

| Expansion rate | 2 | 4 | 6 | 8 |
|---|---|---|---|---|
| PSNR | 28.30 | 28.34 | 28.37 | 28.39 |

Table 3. Ablation study on the shrink rate of Fused Conv.

| Shrink rate | 2 | 4 | 6 | 8 |
|---|---|---|---|---|
| PSNR | 27.39 | 28.37 | 28.32 | 28.28 |
5.3. Ablation Study
5.3.1 Effectiveness of Fused Conv and GAB
We experimentally demonstrate the effectiveness of the proposed Fused Conv and GAB. The experiments are conducted on the Urban100 [14] dataset, evaluating PSNR/SSIM, and the results are presented in Tab. 1. Compared with the baseline, the best performance is achieved when both modules are used, whereas using Fused Conv or GAB alone yields smaller gains. Although Fused Conv alone performs slightly better than GAB alone, the GAB module is applied for global image interaction, which effectively improves the model's SSIM and better restores the image's texture. This means that our proposed method not only performs well on PSNR but also excels at restoring the image's visual quality.
5.3.2 Effects of the expansion rate and shrink rate
Tab. 2 and Tab. 3 show the effect of the expansion rate and the shrink rate on performance, respectively. The data in the tables show that the expansion rate is directly proportional to performance, while the shrink rate is inversely proportional. Although performance keeps increasing as the expansion rate grows, the number of parameters and the amount of computation increase quadratically. To balance model performance and computation, we set the expansion rate to 6. Similarly, we set the shrink rate to 2 to obtain a model with as little computation as possible.
5.4. Comparison with State-of-the-Art Methods
5.4.1 Quantitative comparison
Tab. 4 shows the comparative results of our method against the state-of-the-art methods in terms of PSNR and SSIM: EDSR [21], RCAN [41], SAN [9], IGNN [43], NLSA [30], IPT [5], SwinIR [20], ESRT [24], SRFormer [44], EDT [18], HAT [6], HAT-L [6], and GRL [19]. As Tab. 4 shows, the proposed method achieves the best performance at almost all scales on the five datasets. Specifically, HMA outperforms SwinIR by $0.2\mathrm{dB}{\sim}1.43\mathrm{dB}$ across all scales. In particular, on Urban100 [14] and Manga109 [28], which contain a large number of repetitive textures, HMA improves over SwinIR by $0.98\mathrm{dB}{\sim}1.43\mathrm{dB}$. It is worth noting that both HAT [6] and GRL [19] introduce channel attention into their models, yet both perform worse than HMA, which demonstrates the effectiveness of our proposed method.
5.4.2 Visual comparison
We provide some of the visual comparison results in Fig. 8. The comparison results are selected from the Urban100 [14] dataset: ”img 011”, ”img 033”, ”img 046”, ”img 062”, $ '{mathrm{img}.067}cdot$ and $ '{mathrm{img}.092}cdots$ . In Fig. 8, PSNR and SSIM is calculated in patches marked with red boxes in the images. From the visual comparison, HMA can recover the image texture details better. Compared with other advanced methods, HMA recovers images with clearer edges. We can see many blurred areas in recovering image ”img 011” and image ”img 092” in other state-of-the-art methods, while HMA generates excellent visual effects. The comparison of the visual effects indicates that our proposed method also achieves a superior performance.
我们在图8中提供了一些视觉对比结果。这些对比结果选自Urban100 [14]数据集:"img 011"、"img 033"、"img 046"、"img 062"、"img 067"和"img 092"。在图8中,PSNR和SSIM是在图像中用红色方框标记的局部区域计算的。从视觉对比来看,HMA能更好地恢复图像纹理细节。与其他先进方法相比,HMA恢复的图像边缘更清晰。在其他最先进方法恢复的"img 011"和"img 092"图像中可以看到许多模糊区域,而HMA产生了出色的视觉效果。视觉效果对比表明,我们提出的方法也实现了卓越的性能。
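The patch-wise metric used above can be reproduced with a small sketch. This is illustrative only: the helper names, crop coordinates, and default data range are our own assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def psnr(ref: np.ndarray, sr: np.ndarray, data_range: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a restored image."""
    mse = np.mean((ref.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def patch_psnr(hr: np.ndarray, sr: np.ndarray, top: int, left: int, size: int) -> float:
    """Compute PSNR only on the patch marked by a (hypothetical) red box."""
    hr_patch = hr[top:top + size, left:left + size]
    sr_patch = sr[top:top + size, left:left + size]
    return psnr(hr_patch, sr_patch)
```

SSIM is computed analogously on the same crop; restricting both metrics to the boxed patch is what makes the numbers in Fig. 8 comparable across methods.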
5.5. NTIRE 2024 Challenge
5.5. NTIRE 2024 挑战赛
Our SR model also participated in NTIRE 2024 Image Super-Resolution $(\times4)$ [7] in the validation phase and testing phase. The respective results are shown in Tab. 5.
我们的超分辨率 (Super-Resolution, SR) 模型还参与了 NTIRE 2024 图像超分辨率 $(\times4)$ [7] 的验证阶段和测试阶段。相应结果如 表 5 所示。
6. Conclusion
6. 结论
This study proposes a Hybrid Multi-Axis Aggregation Network (HMA) for single-image super-resolution. Our model combines Fused Convolution with self-attention to better integrate different-level features during deep feature extraction. Additionally, inspired by the inherent hierarchical structural similarity of images, we introduce a Grid Attention Block for modeling long-range dependencies. The proposed network enhances multi-level structural similarity modeling by combining sparse attention with window attention. For the super-resolution task, we also design a dedicated pre-training strategy to further stimulate the model's potential capabilities. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art approaches on benchmark datasets for single-image super-resolution tasks.
本研究提出了一种用于单图像超分辨率 (Super-Resolution) 的混合多轴聚合网络 (Hybrid Multi-Axis Aggregation Network, HMA)。该模型将融合卷积 (Fused Convolution) 与自注意力 (self-attention) 相结合,以在深度特征提取过程中更好地整合不同层次的特征。此外,受图像固有层次结构相似性的启发,我们引入了网格注意力块 (Grid Attention Block) 来建模长程依赖关系。所提出的网络通过将稀疏注意力 (sparse attention) 与窗口注意力 (window attention) 相结合,增强了多级结构相似性建模。针对超分辨率任务,我们还专门设计了一种预训练策略,以进一步激发模型的潜在能力。大量实验表明,在单图像超分辨率任务的基准数据集上,我们提出的方法优于现有最先进的方法。
Table 4. Quantitative comparison (PSNR/SSIM) with state-of-the-art methods on benchmark datasets. The top three results are marked in red, blue, and green, respectively. "†" indicates that the method adopts a pre-training strategy.
表 4: 在基准数据集上与最先进方法的定量比较 (PSNR/SSIM)。前三名结果分别用红色、蓝色和绿色标出。"†" 表示该方法采用预训练策略。
| Method | Scale | Set5[2] PSNR | Set5[2] SSIM | Set14[39] PSNR | Set14[39] SSIM | BSD100[27] PSNR | BSD100[27] SSIM | Urban100[14] PSNR | Urban100[14] SSIM | Manga109[28] PSNR | Manga109[28] SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EDSR[21] | ×2 | 38.11 | 0.9602 | 33.92 | 0.9195 | 32.32 | 0.9013 | 32.93 | 0.9351 | 39.10 | 0.9773 |
| RCAN[41] | ×2 | 38.27 | 0.9614 | 34.12 | 0.9216 | 32.41 | 0.9027 | 33.34 | 0.9384 | 39.44 | 0.9786 |
| SAN[9] | ×2 | 38.31 | 0.9620 | 34.07 | 0.9213 | 32.42 | 0.9028 | 33.10 | 0.9370 | 39.32 | 0.9792 |
| IGNN[43] | ×2 | 38.24 | 0.9613 | 34.07 | 0.9217 | 32.41 | 0.9025 | 33.23 | 0.9383 | 39.35 | 0.9786 |
| NLSA[30] | ×2 | 38.34 | 0.9618 | 34.08 | 0.9231 | 32.43 | 0.9027 | 33.42 | 0.9394 | 39.59 | 0.9789 |
| IPT[5] | ×2 | 38.37 | - | 34.43 | - | 32.48 | - | 33.76 | - | - | - |
| SwinIR[20] | ×2 | 38.42 | 0.9623 | 34.46 | 0.9250 | 32.53 | 0.9041 | 33.81 | 0.9427 | 39.92 | 0.9797 |
| ESRT[24] | ×2 | - | - | - | - | - | - | - | - | - | - |
| SRFormer[44] | ×2 | 38.51 | 0.9627 | 34.44 | 0.9253 | 32.57 | 0.9046 | 34.09 | 0.9449 | 40.07 | 0.9802 |
| EDT[18] | ×2 | 38.45 | 0.9624 | 34.57 | 0.9258 | 32.52 | 0.9041 | 33.80 | 0.9425 | 39.93 | 0.9800 |
| HAT[6] | ×2 | 38.63 | 0.9630 | 34.86 | 0.9274 | 32.62 | 0.9053 | 34.45 | 0.9466 | 40.26 | 0.9809 |
| GRL[19] | ×2 | 38.67 | 0.9647 | 35.08 | 0.9303 | 32.68 | 0.9087 | 35.06 | 0.9505 | 40.67 | 0.9818 |
| HMA(ours) | ×2 | 38.79 | 0.9641 | 35.11 | 0.9286 | 32.67 | 0.9061 | 34.85 | 0.9493 | 40.73 | 0.9824 |
| HAT-L†[6] | ×2 | 38.91 | 0.9646 | 35.29 | 0.9293 | 32.74 | 0.9066 | 35.09 | 0.9505 | 41.01 | 0.9831 |
| HMA†(ours) | ×2 | 38.95 | 0.9649 | 35.33 | 0.9297 | 32.79 | 0.9071 | 35.24 | 0.9513 | 41.13 | 0.9836 |
| EDSR[21] | ×3 | 34.65 | 0.928 | 30.52 | 0.8462 | 29.25 | 0.8093 | 28.80 | 0.8653 | 34.17 | 0.9476 |
| RCAN[41] | ×3 | 34.74 | 0.9299 | 30.65 | 0.8482 | 29.32 | 0.8111 | 29.09 | 0.8702 | 34.44 | 0.9499 |
| SAN[9] | ×3 | 34.75 | 0.9300 | 30.59 | 0.8476 | 29.33 | 0.8112 | 28.93 | 0.8671 | 34.30 | 0.9494 |
| IGNN[43] | ×3 | 34.72 | 0.9298 | 30.66 | 0.8484 | 29.31 | 0.8105 | 29.03 | 0.8696 | 34.39 | 0.9496 |
| NLSA[30] | ×3 | 34.85 | 0.9306 | 30.70 | 0.8485 | 29.34 | 0.8117 | 29.25 | 0.8726 | 34.57 | 0.9508 |
| IPT[5] | ×3 | 34.81 | - | 30.85 | - | 29.38 | - | 29.49 | - | - | - |
| SwinIR[20] | ×3 | 34.97 | 0.9318 | 30.93 | 0.8534 | 29.46 | 0.8145 | 29.75 | 0.8826 | 35.12 | 0.9537 |
| ESRT[24] | ×3 | 34.42 | 0.9268 | 30.43 | 0.8433 | 29.15 | 0.8063 | 28.46 | 0.8574 | 33.95 | 0.9455 |
| SRFormer[44] | ×3 | 35.02 | 0.9323 | 30.94 | 0.8540 | 29.48 | 0.8156 | 30.04 | 0.8865 | 35.26 | 0.9543 |
| EDT[18] | ×3 | 34.97 | 0.9316 | 30.89 | 0.8527 | 29.44 | 0.8142 | 29.72 | 0.8814 | 35.13 | 0.9534 |
| HAT[6] | ×3 | 35.07 | 0.9329 | 31.08 | 0.8555 | 29.54 | 0.8167 | 30.23 | 0.8896 | 35.53 | 0.9552 |
| GRL[19] | ×3 | - | - | - | - | - | - | - | - | - | - |
| HMA(ours) | ×3 | 35.22 | 0.9336 | 31.28 | 0.8570 | 29.59 | 0.8682 | 30.65 | 0.8944 | 35.82 | 0.9567 |
| HAT-L†[6] | ×3 | 35.28 | 0.9345 | 31.47 | 0.8584 | 29.63 | 0.8191 | 30.92 | 0.8981 | 36.02 | 0.9576 |
| HMA†(ours) | ×3 | 35.35 | 0.9347 | 31.47 | 0.8585 | 29.66 | 0.8196 | 31.00 | 0.8984 | 36.10 | 0.9580 |
| EDSR[21] | ×4 | 32.46 | 0.8968 | 28.80 | 0.7876 | 27.71 | 0.7420 | 26.64 | 0.8033 | 31.02 | 0.9148 |
| RCAN[41] | ×4 | 32.63 | 0.9002 | 28.87 | 0.7889 | 27.77 | 0.7436 | 26.82 | 0.8087 | 31.22 | 0.9173 |
| SAN[9] | ×4 | 32.64 | 0.9003 | 28.92 | 0.7888 | 27.78 | 0.7436 | 26.79 | - | - | - |

Figure 8. Visual comparison on $\times4$ SR. PSNR/SSIM is calculated in patches marked with red boxes in the images.
图 8: $\times4$ 超分辨率视觉对比。PSNR/SSIM 数值基于图像中红色框标记的局部区域计算。
Table 5. NTIRE 2024 Challenge Results with $\times4$ SR in terms of PSNR and SSIM on validation phase and testing phase.
表 5: NTIRE 2024 挑战赛在验证阶段和测试阶段 $\times4$ 超分辨率 (SR) 的 PSNR 和 SSIM 结果。
| 指标 | 验证阶段 | 测试阶段 |
|---|---|---|
| PSNR | 31.44 | 31.18 |
| SSIM | 0.85 | 0.86 |
References
参考文献
for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12780–12791, 2023. 6, 7
[45] Maria Zontak and Michal Irani. Internal statistics of a single natural image. In CVPR 2011, pages 977–984, 2011. 1
用于单幅图像超分辨率。载于《IEEE/CVF国际计算机视觉会议论文集》第12780–12791页,2023年。6, 7
[45] Maria Zontak和Michal Irani。单幅自然图像的内部统计特性。载于《CVPR 2011》第977–984页,2011年。1
HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution Supplementary Material
HMANet: 用于图像超分辨率的混合多轴聚合网络补充材料
7. Training Details
7. 训练细节
7.1. Study on the pre-training strategy
7.1. 预训练策略研究
We calculate the interlayer CKA [16] similarity between the $\times2$ SR, $\times3$ SR, and $\times4$ SR models, excluding the shallow feature extraction and image reconstruction modules. In Fig. 9, we can see that Fig. 9(a) and Fig. 9(c) show high similarity on the diagonal, while Fig. 9(b) has a low similarity score on the diagonal. Therefore, we first train the $\times2$ SR model and use it to initialize the $\times3$ SR model, and then use the trained $\times3$ SR model as the initialization parameter for both the final $\times2$ SR model and the $\times4$ SR model.
我们计算了$\times2$超分辨率(SR)、$\times3$超分辨率和$\times4$超分辨率中除浅层特征提取和图像重建模块外的层间CKA [16]相似度。在图9中可以看到,图9(a)和图9(c)对角线上的相似度较高,而图9(b)对角线上的相似度得分较低。因此,我们先训练$\times2$超分辨率模型作为初始参数来训练$\times3$超分辨率模型,再将$\times3$超分辨率模型作为$\times2$超分辨率模型和$\times4$超分辨率模型的初始参数。

Figure 9. (a) CKA similarity map between layers of the $_{\times2}$ SR model and the $\times3$ SR model, (b) CKA similarity map between layers of the $\times2$ SR model and the $\times4$ SR model, (c) CKA similarity map between layers of the $\times3$ SR model and the $\times4$ SR model.
图 9: (a) $_{\times2}$ 超分辨率 (SR) 模型与 $\times3$ 超分辨率模型各层间的 CKA 相似度图谱, (b) $\times2$ 超分辨率模型与 $\times4$ 超分辨率模型各层间的 CKA 相似度图谱, (c) $\times3$ 超分辨率模型与 $\times4$ 超分辨率模型各层间的 CKA 相似度图谱。
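The interlayer similarity above is the standard linear CKA; a minimal sketch of that formula (our own helper over two activation matrices, not the paper's analysis code) looks like this:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).
    Columns are mean-centered before comparison; the result lies in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

Evaluating this for every pair of layers across two models yields similarity maps like Fig. 9; a bright diagonal means corresponding layers learn similar representations, which is what makes cross-scale weight transfer viable.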
We train the model using nine pre-training strategies to test the impact of different pre-training strategies on performance. Tab. 6 shows the training results, which are evaluated on the Set5 [2] dataset. We find that our proposed pre-training strategies can effectively improve the model performance ($0.05\mathrm{dB}{\sim}0.09\mathrm{dB}$). It can also be observed that using models trained at different degradation levels as initialization parameters motivates the model's potential to different degrees. Using the $\times3$ SR model as the initialization parameter for the $\times2$ and $\times4$ SR models maximizes the model performance, whereas using the $\times2$ SR model as the initialization parameter of the $\times4$ model, on the contrary, reduces the model performance. This suggests that a suitable pre-training strategy can lead to better performance gains for HMA.
我们采用九种预训练策略来训练模型,以测试不同预训练策略对性能的影响。表6展示了在Set5 [2]数据集上的训练结果。可以发现,我们提出的预训练策略能有效提升模型性能($0.05\mathrm{dB}{\sim}0.09\mathrm{dB}$)。同时观察到,使用不同退化等级的模型作为初始化参数对激发模型潜力有不同影响:以$\times3$超分模型作为$\times2$和$\times4$ SR模型的初始化参数时,模型性能达到最优;而使用$\times2$超分模型初始化$\times4$模型时反而会降低性能。这表明合适的预训练策略能为HMA带来更优的性能增益。
| Scale | w/o (PSNR/SSIM) | ×2 init. (PSNR/SSIM) | ×3 init. (PSNR/SSIM) | ×4 init. (PSNR/SSIM) |
|---|---|---|---|---|
| ×2 | 38.84 / 0.9642 | 38.86 / 0.9644 | 38.95 / 0.9647 | 38.78 / 0.964 |
| ×3 | 35.25 / 0.9342 | 35.35 / 0.9346 | 35.27 / 0.9343 | 35.30 / 0.9345 |
| ×4 | 33.26 / 0.9083 | 33.24 / 0.9081 | 33.38 / 0.9086 | 33.25 / 0.9083 |
Table 6. Quantitative results (PSNR (dB) / SSIM) of HMA on Set5 [2] using different pre-training strategies (initialization parameters).
表 6: 不同预训练策略(初始化参数)下 HMA 在 Set5 [2] 上的定量结果 (PSNR (dB) / SSIM)
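The cross-scale initialization described above can be sketched as a state-dict transfer that copies shape-matching weights and skips scale-dependent modules. The prefix names below are hypothetical placeholders, not HMA's actual layer names:

```python
import numpy as np

def init_from_pretrained(target: dict, source: dict,
                         skip_prefixes=("conv_first", "upsample", "conv_last")) -> dict:
    """Copy shape-matching weights from a source-scale checkpoint into a
    target-scale state dict, skipping scale-dependent modules (shallow feature
    extraction and image reconstruction). Prefix names are illustrative only."""
    out = dict(target)
    for name, weight in source.items():
        if name.startswith(skip_prefixes):
            continue  # scale-dependent module: keep the target's own weights
        if name in out and getattr(out[name], "shape", None) == np.shape(weight):
            out[name] = weight
    return out
```

In a PyTorch training script the same filtering would be applied to the dict before `model.load_state_dict(..., strict=False)`; only the shared deep-feature-extraction body is transferred between scales.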
8. Analysis of Model Complexity
8. 模型复杂度分析
We conduct experiments to analyze the Grid Attention Block (GAB) and the Fused Attention Block (FAB). We also compare our method with the Transformer-based method SwinIR. The $\times4$ SR performance on Urban100 is reported, and the number of Multiply-Add operations is computed for an input size of $64\times64$. Note that the pre-training technique is not used for any model in this section.
我们通过实验分析了网格注意力块 (GAB) 和融合注意力块 (FAB),并将我们的方法与基于Transformer的方法SwinIR进行了比较。报告了Urban100数据集上$\times4$超分辨率的性能,并计算了输入尺寸为$64\times64$时的乘加运算量。请注意,本节所有模型均未使用预训练技术。
We use SwinIR with a window size of 16 as a baseline to study the computational complexity of the proposed GAB and FAB. As shown in Tab. 7, GAB obtains performance gains with only a limited increase in parameters and Multi-Adds, which proves the effectiveness and efficiency of the proposed module. FAB also brings better performance, although at the cost of more parameters and Multi-Adds.
我们以窗口大小为16的SwinIR作为基线,研究提出的GAB和FAB的计算复杂度。如表7所示,我们的GAB通过有限增加参数和MultiAdds获得了性能提升,证明了所提出模块的有效性和高效性。此外,尽管FAB引入了更多参数和Multi-Adds,但同时也带来了更好的性能。
Table 7. Model complexity comparison of GAB and FAB.
| Method | #Params. | #Multi-Adds. | PSNR |
|---|---|---|---|
| SwinIR | 12.1M | 63.8G | 27.81dB |
| w/ GAB | 24.4M | 76.9G | 28.37dB |
| w/ FAB | 57.6M | 157.0G | 28.30dB |
| Ours | 69.9M | 170.1G | 28.42dB |
表 7: GAB 和 FAB 的模型复杂度对比
| 方法 | 参数量 (#Params.) | 乘加运算量 (#Multi-Adds.) | PSNR |
|---|---|---|---|
| SwinIR | 12.1M | 63.8G | 27.81dB |
| w/ GAB | 24.4M | 76.9G | 28.37dB |
| w/ FAB | 57.6M | 157.0G | 28.30dB |
| Ours | 69.9M | 170.1G | 28.42dB |
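Multi-Add counts of this kind can be estimated analytically. The sketch below gives rough formulas for a convolution and for window self-attention; these are simplified counts under our own assumptions, not the paper's exact profiling script:

```python
def conv_multiadds(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """Multiply-adds of one k x k convolution on an h x w map (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def window_attn_multiadds(h: int, w: int, c: int, window: int) -> int:
    """Rough multiply-adds of one window self-attention layer: the Q/K/V and output
    projections, plus the two attention matmuls inside each window x window window."""
    n = h * w                        # total tokens
    proj = 4 * n * c * c             # q, k, v and output projections
    win_tokens = window * window     # tokens per window
    attn = 2 * n * win_tokens * c    # QK^T and (attn @ V), summed over all windows
    return proj + attn
```

Summing such terms over every layer at a $64\times64$ input reproduces figures on the scale of Tab. 7's Multi-Adds column (norms, biases, and softmax are ignored here, as is conventional).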
9. Visual Comparisons with LAM
9. 与 LAM 的视觉对比
We provide visual comparisons with the LAM [13] results to compare SwinIR, HAT, and our proposed HMA. The red dots in the LAM results represent the pixels used for reconstructing the patches marked with red boxes in the HR images, and we give the Diffusion Indices (DI) in Fig. 10 to reflect the range of pixels involved. In this case, the more pixels are used to recover a specific input block, the wider the distribution of red dots in LAM, and the higher the DI. As shown in Fig. 10, both HAT and HMA can effectively extend the effective pixel range compared to the baseline SwinIR, where the pixel range is only clustered in a limited area. Compared to HAT, HMA can extend the range of utilized pixels more widely due to the introduction of the
我们提供与LAM [13] 结果的视觉对比,以比较SwinIR、HAT和我们提出的HMA。LAM结果中的红点表示用于重建HR图像中红框标记块所采用的像素,图10中给出的扩散指数(DI)反映了相关像素的覆盖范围。在这种情况下,用于恢复特定输入块的像素越多,LAM中红点的分布越广,DI值越高。如图10所示,与基线方法SwinIR相比,HAT和HMA都能有效扩展有效像素范围(SwinIR的像素范围仅聚集在有限区域)。由于引入了...

Figure 10. Comparison of LAM results between SwinIR, HAT and HMA.
图 10: SwinIR、HAT 和 HMA 的 LAM 结果对比
GAB module. Also, for quantitative metrics, HMA obtains much higher DI values than SwinIR and HAT. The visualization results and quantitative evaluation metrics show that HMA can better utilize global information for local area reconstruction. As a result, HMA is more capable of generating high-resolution images with better visual quality.
GAB模块。此外,在定量指标方面,HMA获得的DI值远高于SwinIR和HAT。可视化结果和定量评估指标表明,HMA能更好地利用全局信息进行局部区域重建。因此,HMA更擅长生成具有更优视觉效果的高分辨率图像。
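Assuming the Diffusion Index follows the Gini-coefficient formulation from the LAM paper [13] (DI is higher when attribution is spread over more pixels), it can be sketched as:

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of a non-negative array: 0 for a perfectly even spread,
    approaching 1 when mass concentrates on a few elements."""
    x = np.sort(np.abs(x).ravel().astype(np.float64))
    n = x.size
    i = np.arange(1, n + 1)
    return float(2.0 * np.sum(i * x) / (n * x.sum()) - (n + 1.0) / n)

def diffusion_index(attribution: np.ndarray) -> float:
    """DI sketch: higher when the attribution map involves more pixels.
    The (1 - Gini) * 100 scaling is an assumption, not verified against [13]."""
    return (1.0 - gini(attribution)) * 100.0
```

Under this formulation, an attribution map whose red dots cluster in a small neighborhood (as with SwinIR in Fig. 10) yields a low DI, while HMA's widely spread attributions yield a high one.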
