[Paper Translation] DRCT: Saving Image Super-Resolution away from Information Bottleneck


Original paper: https://arxiv.org/pdf/2404.00722v5


DRCT: Saving Image Super-Resolution away from Information Bottleneck


Abstract


In recent years, Vision Transformer-based approaches for low-level vision tasks have achieved widespread success. Unlike CNN-based models, Transformers are more adept at capturing long-range dependencies, enabling the reconstruction of images using non-local information. In the domain of super-resolution, Swin-Transformer-based models have become mainstream due to their capability for global spatial information modeling and their shifting-window attention mechanism, which facilitates the interchange of information between different windows. Many researchers have enhanced model performance by expanding the receptive field or designing meticulous networks, yielding commendable results. However, we observed that it is a general phenomenon for the feature map intensity to be abruptly suppressed to small values towards the network’s end. This implies an information bottleneck and a diminishment of spatial information, implicitly limiting the model’s potential. To address this, we propose the Dense-residual-connected Transformer (DRCT), aimed at mitigating the loss of spatial information and stabilizing the information flow through dense-residual connections between layers, thereby unleashing the model’s potential and saving the model away from the information bottleneck. Experimental results indicate that our approach surpasses state-of-the-art methods on benchmark datasets and performs commendably at the NTIRE-2024 Image Super-Resolution ($\times 4$) Challenge. Our source code is available at https://github.com/ming053l/DRCT

1. Introduction


The task of Single Image Super-Resolution (SISR) is to reconstruct a high-quality image from its low-resolution version. The quest for effective super-resolution algorithms has become a focal point of research within the field of computer vision, owing to its wide range of applications.

Following the foundational studies, CNN-based strategies [8, 19, 20, 35, 39, 40] have predominantly governed the super-resolution domain for an extended period. These strategies largely leverage techniques such as residual learning [12, 36, 46, 52, 61], or recursive learning [18, 21] for developing network architectures, significantly propelling the progress of super-resolution models forward.



Figure 1. The feature map intensity on various benchmark datasets. We observed that feature map intensities decrease sharply at the end of the SISR network, indicating potential information loss. In this paper, we propose DRCT to address this issue by enhancing receptive fields and adding dense connections within residual blocks to mitigate information bottlenecks, thereby improving performance with a simpler model design.

CNN-based networks have achieved notable success in terms of performance. However, the inductive bias of CNNs limits the ability of SISR models to capture long-range dependencies. Their inherent limitations stem from the parameter-dependent scaling of the receptive field and the kernel size of the convolution operator within different layers, which may neglect non-local spatial information within images.

To overcome the limitations associated with CNN-based networks, researchers have introduced Transformer-based SISR networks that leverage the capability to model long-range dependencies, thereby enhancing SISR performance. Notable examples include IPT [16] and EDT [33], which use pre-training on large-scale datasets such as ImageNet [14] to fully leverage the capabilities of the Vision Transformer [9] for achieving ideal SISR results. Afterwards, SwinIR [34] incorporated the Swin Transformer [26] into SISR, marking a significant advancement in SISR performance.

This approach significantly enhances capabilities beyond those of traditional CNN-based models across various benchmarks. Following SwinIR’s success, several works [4, 6, 32, 34, 58, 59, 63, 64] have built upon its framework. These subsequent studies leverage Transformers to innovate diverse network architectures specifically for super-resolution tasks, showcasing the evolving landscape of SISR technology through the exploration of new architectural innovations and techniques.


While using Transformer-based SISR models for inference across various datasets, we observed a common phenomenon: the intensity distribution of the feature maps undergoes more substantial changes as the network depth increases. This indicates the spatial information and attention intensity learned by the model. However, there is often a sharp decrease towards the end of the network (refer to Figure 1), where the intensity shrinks to a smaller range. This phenomenon suggests that such abrupt changes might be accompanied by a loss of spatial information, indicating the presence of an information bottleneck.
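One way to make this observation concrete is to track a per-layer intensity statistic and flag abrupt drops. The sketch below is illustrative only: the text does not say how feature map intensity is computed, so the mean absolute activation, the `ratio` threshold, and the toy layer outputs are all assumptions.

```python
def mean_abs_intensity(feature_map):
    """Mean absolute activation of a 2-D feature map (assumed intensity measure)."""
    values = [abs(v) for row in feature_map for v in row]
    return sum(values) / len(values)


def find_sharp_drops(layer_outputs, ratio=0.5):
    """Return per-layer intensities and the indices of layers whose intensity
    falls below `ratio` times that of the previous layer (hypothetical threshold)."""
    intensities = [mean_abs_intensity(f) for f in layer_outputs]
    drops = [
        i for i in range(1, len(intensities))
        if intensities[i] < ratio * intensities[i - 1]
    ]
    return intensities, drops


# Toy data: intensity grows with depth, then collapses at the last layer.
toy_layers = [
    [[1.0, -1.0], [1.0, -1.0]],
    [[2.0, -2.0], [2.0, -2.0]],
    [[3.0, -3.0], [3.0, -3.0]],
    [[0.5, -0.5], [0.5, -0.5]],
]
intensities, drops = find_sharp_drops(toy_layers)
print(intensities, drops)
```

In a real model the per-layer outputs would come from forward hooks or intermediate activations; the toy lists stand in for those here.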

Inspired by a series of works by Wang et al., such as the YOLO family [47, 50], CSPNet [48], and ELAN [49], we consider that network architectures based on SwinIR, despite significantly enlarging the receptive field through the shifted-window attention mechanism to address the small receptive fields of CNNs, are prone to gradient bottlenecks due to the loss of spatial information as network depth increases. This implicitly constrains the model’s performance and potential.

To address the issue of spatial information loss due to an increased number of network layers, we introduce the Dense-residual-connected Transformer (DRCT), designed to stabilize the forward-propagation process and prevent information bottlenecks. This is achieved by the proposed Swin-Dense-Residual-Connected Block (SDRCB), which incorporates Swin Transformer Layers and transition layers into each Residual Dense Group (RDG). Consequently, this approach enhances the receptive field with fewer parameters and a simplified model architecture, thereby resulting in improved performance. The main contributions of this paper are summarised as follows:


• We observed that as the network depth increases, the intensity of the feature map will gradually increase, then abruptly drop to a smaller range. This severe oscillation may be accompanied by information loss.

• We propose DRCT by adding dense connections within residual groups to stabilize the information flow for deep feature extraction during propagation, thereby saving the SISR model away from the information bottleneck.

• By integrating dense connections into the Swin-Transformer-based SISR model, the proposed DRCT achieves state-of-the-art performance while maintaining efficiency. This approach showcases its potential and raises the upper bound of the SISR task.

2. Related works


2.1. Vision Transformer-based Super-Resolution


IPT [16], a versatile model using the Transformer encoder-decoder architecture, has shown efficacy in several low-level vision tasks. SwinIR [34], building on the Swin Transformer [26] encoder, employs self-attention within local windows during feature extraction, obtaining larger receptive fields and better performance than traditional CNN-based approaches. UFormer [53] introduces an innovative local-enhancement window Transformer block, using a learnable multi-scale restoration modulator within the decoder to enhance the model’s ability to detect both local and global patterns. ART [64] incorporates an attention retractable module to expand its receptive field, thereby enhancing SISR performance. CAT [5] leverages rectangle-window self-attention for feature aggregation, achieving a broader receptive field. HAT [4] integrates a channel attention mechanism [51] with an overlapping cross-attention module, activating more pixels to reconstruct better SISR results, thereby setting new benchmarks in the field.

2.2. Auxiliary Supervision and Feature Fusion


Auxiliary Supervision. Deep supervision is a commonly used auxiliary supervision method [13, 31] that involves training by adding prediction layers at the intermediate levels of the model [47–49]. This approach is particularly prevalent in architectures based on Transformers that incorporate multi-layer decoders. Another popular auxiliary supervision technique involves guiding the feature maps produced by the intermediate layers with relevant metadata to ensure they possess attributes beneficial to the target task [11, 28, 30, 52, 61]. Choosing the appropriate auxiliary supervision mechanism can accelerate the model’s convergence speed, while also enhancing its efficiency and performance.


Feature Fusion. Many studies have explored the integration of features across varying dimensions or multi-level features, such as FPN [22], to obtain richer representations for different tasks [36, 55]. In CNNs, attention mechanisms have been applied to both spatial and channel dimensions to improve feature representation; examples of which include RTCS [10] and SwinFusion [37]. In ViT [9], spatial self-attention is used to model the long-range dependencies between pixels. Additionally, some researchers have investigated the incorporation of channel attention within Transformers [3, 62] to effectively amalgamate spatial and channel information, thereby improving model performance.



Figure 2. The feature map visualization displays, from top to bottom, SwinIR [34], HAT [4], and the proposed DRCT, with positions further to the right representing deeper layers within the network. For both SwinIR and HAT, the intensity of the feature maps is significant in the shallower layers but diminishes towards the network’s end. We consider that this phenomenon implies a loss of spatial information, leading to the limitation and information bottleneck in SISR tasks. As for the proposed DRCT, the learned feature maps are gradually and stably enhanced without obvious oscillations. This reflects the stability of the information flow during forward propagation, thereby yielding higher intensity in the final layer’s output. (Zoom in to better observe the color bar beside the feature maps.)

3. Problem Statement


3.1. Information Bottleneck Principle


According to the information bottleneck principle [45], the given data $X$ may lose information when passing through consecutive layers. This may lead to vanishing gradients during back-propagation when fitting the network parameters and predicting $Y$, as shown in the equation below:

$$
I(X,X)\geq I(Y,X)\geq I(Y,f_{\theta}(X))\geq I(X,g_{\phi}(f_{\theta}(X))),
$$


where $I$ indicates mutual information, $f$ and $g$ are transformation functions, and $\theta$ and $\phi$ are parameters of $f$ and $g$ , respectively.


In deep neural networks, $f_{\theta}(\cdot)$ and $g_{\phi}(\cdot)$ represent two consecutive layers of the network. From equation (1), we consider that as the network becomes deeper, information is more likely to be lost along the information flow. In terms of SISR tasks, the general goal is to find the mapping function $F$ with optimized parameters $\theta$ that maximizes the mutual information between the HR and SR images.

$$
F(\mathbf{I}_{LR};\theta)=\mathbf{I}_{SR},\qquad \max_{\theta} I(\mathbf{I}_{HR},\mathbf{I}_{SR}).
$$
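The tail of the inequality chain in this subsection, that passing data through successive layers cannot increase its mutual information with the input, can be checked numerically on a toy discrete example. The sketch below is a simplification under stated assumptions: $X$ is uniform on eight values, and the two "layers" `f` and `g` are hypothetical deterministic coarsening functions, not network layers; only the terms involving $X$ are computed.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information in bits of the empirical joint distribution of (a, b) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )


def f(x):
    """Hypothetical first layer: merges neighbouring values (8 values to 4)."""
    return x // 2


def g(h):
    """Hypothetical second layer: merges again (4 values to 2)."""
    return h // 2


xs = list(range(8))  # X uniform on {0, ..., 7}
i_xx = mutual_information([(x, x) for x in xs])
i_xf = mutual_information([(x, f(x)) for x in xs])
i_xgf = mutual_information([(x, g(f(x))) for x in xs])
print(i_xx, i_xf, i_xgf)
```

Each coarsening step discards one bit here, so the three values decrease monotonically, which is the behaviour the inequality describes.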

3.2. Spatial Information Vanish in Super-resolution


Generally speaking, SISR methods [4–6, 32, 34, 58, 63, 64] can be divided into three parts: (1) shallow feature extraction, (2) deep feature extraction, and (3) image reconstruction. Among these methods, there is almost no difference in shallow feature extraction and image reconstruction. The former is composed of simple convolution layers, while the latter consists of convolution layers and upsampling layers. Deep feature extraction, however, differs significantly between methods. Their commonality lies in being composed of various residual blocks, which can be simply defined as:

$$
X^{l+1}=X^{l}+f_{\theta}^{l+1}(X^{l}),
$$


where $X$ indicates the inputs, $f$ denotes the consecutive layers of the $l$-th residual group, and $\theta$ represents the parameters of $f^{l}$.
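As a short worked step (not stated in the original text, but following directly from the recursion), unrolling the residual update over $L$ groups gives:

$$
X^{L}=X^{0}+\sum_{l=0}^{L-1}f_{\theta}^{l+1}(X^{l}).
$$

The final feature is the input plus the sum of all residual branch outputs, yet each branch $f_{\theta}^{l+1}$ only receives the single aggregated feature $X^{l}$ as input. This is the structure that dense connections modify by giving later layers direct access to several earlier outputs.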

For the SISR task in particular, two methods of stabilizing the information flow or the training process have been introduced:

Residual connection to learn local features. Adopting residual learning allows the model to update only the differences between layers, rather than output the total information from a previous layer directly [12]. This reduces the difficulty of model training and prevents gradient vanishing locally [61]. However, according to our observations, while this design effectively transmits spatial information between different residual blocks, there may still be information loss.

Because the information within a residual block may not necessarily maintain spatial information, this ultimately leads to non-smoothness in terms of feature map intensity (refer to Fig. 2), causing an information bottleneck at the deepest layers during forward propagation. This makes it easy for spatial information to be lost as the gradient flow reaches the deeper layers of the network, resulting in reduced data efficiency or the need for more complex network designs to achieve better performance.


Dense connection to stabilize information flow. Incorporating dense connections into the Swin-Transformer-based SISR model offers two significant advantages. The first is global auxiliary supervision: it effectively fuses the spatial information across different residual groups [52, 61], preserving high-frequency features throughout deep feature extraction. The second is saving the SISR model away from the information bottleneck: by leveraging the integration of spatial information, the model ensures a smooth transmission of spatial information [46], thereby mitigating information loss and enhancing the receptive field.

4. Motivation


Dense-Residual Group auxiliary supervision. Motivated by RRDB-Net [52], Wang et al. suggested that incorporating dense-residual connections can aggregate multi-level spatial information and stabilize the training process [35, 41]. We consider that it is possible to stabilize the information flow within each residual group during propagation, thereby saving the SISR model away from the information bottleneck.

Dense connection with shifting-window mechanism. Recent studies on SwinIR-based methods have concentrated on enlarging the receptive field [4–6, 64] through sophisticated window self-attention (WSA), or on enhancing the network’s capability to extract features [32, 53] for high-quality SR images. By adding dense connections [15] within Swin-Transformer-based blocks [26, 34] in the SISR network for deep feature extraction, the proposed DRCT’s receptive field is enhanced while capturing long-range dependencies. Consequently, this approach allows for achieving outstanding performance with simpler model architectures [46], or even with shallower SISR networks.

5. Methodology


5.1. Network Architecture


As shown in Figure 3, DRCT comprises three distinct components: shallow feature extraction, deep feature extraction, and an image reconstruction module.

Shallow and deep feature extraction. Given a low-resolution (LR) input $\mathbf{I}_{LR}\in\mathbb{R}^{H\times W\times C_{in}}$ ($H$, $W$ and $C_{in}$ are the image height, width and input channel number, respectively), we use a $3\times3$ convolution layer $\mathrm{Conv}(\cdot)$ [54] to extract the shallow feature $\mathbf{F}_{0}\in\mathbb{R}^{H\times W\times C}$ as

$$
\mathbf{F}_{0}=\mathrm{Conv}(\mathbf{I}_{LR}).
$$

Then, we extract the deep feature $\mathbf{F}_{DF}\in\mathbb{R}^{H\times W\times C}$, which contains high-frequency spatial information, from $\mathbf{F}_{0}$. It can be defined as

$$
\mathbf{F}_{DF}=H_{DF}(\mathbf{F}_{0}),
$$

where $H_{DF}(\cdot)$ is the deep feature extraction module, which contains $K$ Residual Dense Groups (RDGs) and a single convolution layer $\mathrm{Conv}(\cdot)$ for feature transition. More specifically, the intermediate features $\mathbf{F}_{1},\mathbf{F}_{2},\ldots,\mathbf{F}_{K}$ and the output deep feature $\mathbf{F}_{DF}$ are extracted block by block as

$$
\mathbf{F}_{i}=\mathrm{RDG}_{i}(\mathbf{F}_{i-1}),\quad i=1,2,\ldots,K,\qquad \mathbf{F}_{DF}=\mathrm{Conv}(\mathbf{F}_{K}).
$$

Image reconstruction. We reconstruct the SR image $\mathbf{I}_{SR}\in\mathbb{R}^{H\times W\times C_{in}}$ by aggregating the shallow and deep features. It can be defined as:

$$
\mathbf{I}_{SR}=H_{\mathrm{rec}}(\mathbf{F}_{0}+\mathbf{F}_{DF}),
$$

where $H_{\mathrm{rec}}(\cdot)$ is the reconstruction function, which fuses the high-frequency deep feature $\mathbf{F}_{DF}$ and the low-frequency feature $\mathbf{F}_{0}$ together to obtain the SR result.
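The three-stage composition described in this subsection can be summarised in a few lines. The sketch below is schematic, not the authors' implementation: every stage is passed in as a plain callable, the names `conv_shallow`, `conv_transition` and `h_rec` are hypothetical, and aggregating $\mathbf{F}_{0}$ and $\mathbf{F}_{DF}$ by addition is an assumption (the text only says the two features are fused). Scalars stand in for feature maps so the data flow is easy to follow.

```python
def drct_forward(i_lr, conv_shallow, rdgs, conv_transition, h_rec):
    """Schematic forward pass: shallow features, K residual dense groups,
    a transition convolution, then reconstruction from the fused features."""
    f0 = conv_shallow(i_lr)        # shallow feature F_0
    f = f0
    for rdg in rdgs:               # F_i = RDG_i(F_{i-1}), i = 1..K
        f = rdg(f)
    f_df = conv_transition(f)      # deep feature F_DF
    return h_rec(f0 + f_df)        # fuse shallow and deep features (assumed sum)


# Toy stand-ins operating on a single float instead of a feature map.
result = drct_forward(
    1.0,
    conv_shallow=lambda x: 2.0 * x,
    rdgs=[lambda f: f + 1.0] * 3,  # K = 3 hypothetical groups
    conv_transition=lambda f: 0.5 * f,
    h_rec=lambda f: f,
)
print(result)
```

Because the stages are injected, the same skeleton works with any objects that support `+`, such as tensors from a deep learning framework.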

5.2. Deep Feature Extraction


Residual Dense Group. In developing the RDG, we take cues from RRDB-Net [52] and RDN [61], employing a residual-dense block (RDB) as the foundational unit for SISR. The reuse of feature maps acts as an enhanced receptive field in the RDG’s feed-forward mechanism. To expound further, an RDG with several SDRCBs enhances the capability to integrate information across different scales, thus allowing for more comprehensive feature extraction. The RDG facilitates the information flow within each residual group, capturing the local features and spatial information group by group.

Swin-Dense-Residual-Connected Block. To capture long-range dependencies, we use the shifting-window self-attention mechanism of the Swin-Transformer Layer (STL) [26, 34] to obtain adaptive receptive fields, complementing RRDB-Net by focusing on multi-level spatial information. This synergy leverages the STL to dynamically adjust the focus of the model based on the global content of the input, allowing for a more targeted and efficient extraction of features. This mechanism ensures that even as the depth of the network increases, global details are preserved, effectively enlarging and enhancing the receptive field without compromise. By integrating STLs with dense-residual connections, the architecture benefits from both a vast receptive field and the capability to home in on the most relevant information, thereby enhancing the model’s performance in SISR tasks requiring detailed and context-aware processing. For the input feature maps $\mathbf{Z}$ within an RDG, the SDRCB can be defined as:


Figure 3. The overall architecture of the proposed Dense-residual-connected Transformer (DRCT) and the structure of Residual-Dense Group (RDG). Each RDG contains five consecutive Swin-Dense-Residual-Connected Blocks (SDRCBs). By integrating dense-connection [15] into SwinIR [34], the efficiency can be improved for Saving Image Super-resolution away from Information Bottleneck.


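$$
\mathbf{Z}_{j}=H_{\mathrm{trans}}(\mathrm{STL}([\mathbf{Z},\ldots,\mathbf{Z}_{j-1}])),\quad j=1,2,3,4,5,
$$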

$$
\mathrm{{SDRCB}}(\mathbf{Z})=\alpha\cdot\mathbf{Z}_{5}+\mathbf{Z},
$$

where $[\cdot]$ denotes the concatenation of the multi-level feature maps produced by the previous layers. $H_{\mathrm{trans}}(\cdot)$ refers to the convolution layer with a LeakyReLU activation function for feature transition. The negative slope of LeakyReLU is set to 0.2. $\mathrm{Conv}_{1}$ is the $1\times1$ convolution layer, which is used to adaptively fuse a range of features from different levels [42]. $\alpha$ represents the residual scaling factor, which is set to 0.2 to stabilize the training process [52].
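To illustrate how the dense concatenation and the scaled residual fit together, here is a minimal pure-Python sketch. It is not the authors' implementation: `placeholder_stl` is a hypothetical identity stand-in for the Swin-Transformer Layer, the transition layer is modelled as a $1\times1$ channel-mixing convolution followed by LeakyReLU, and the `growth` channel width and random weights are assumptions made only so the example runs. A feature map is a list of channels, each a flat list of pixel values.

```python
import random


def leaky_relu(v, slope=0.2):
    """LeakyReLU with the negative slope of 0.2 given in the text."""
    return v if v >= 0.0 else slope * v


def h_trans(channels, weight):
    """Transition layer sketch: 1x1 convolution (channel mixing) plus LeakyReLU.
    `weight` has one row per output channel and one column per input channel."""
    n_pixels = len(channels[0])
    return [
        [leaky_relu(sum(w * ch[p] for w, ch in zip(row, channels)))
         for p in range(n_pixels)]
        for row in weight
    ]


def placeholder_stl(channels):
    """Hypothetical stand-in for a Swin-Transformer Layer (identity here)."""
    return channels


def sdrcb(z, weights, alpha=0.2):
    """Dense block: step j sees the concatenation [Z, Z_1, ..., Z_{j-1}];
    the output is alpha * Z_last + Z (scaled residual)."""
    feats = [z]
    for weight in weights:
        concat = [ch for group in feats for ch in group]
        feats.append(h_trans(placeholder_stl(concat), weight))
    z_last = feats[-1]
    return [[alpha * a + b for a, b in zip(ch_last, ch)]
            for ch_last, ch in zip(z_last, z)]


def make_weights(c, growth, steps=5, seed=0, scale=0.1):
    """Random weights; intermediate steps emit `growth` channels (assumed),
    the last step maps back to `c` channels so the residual sum is valid."""
    rng = random.Random(seed)
    weights = []
    for j in range(steps):
        c_in = c + j * growth
        c_out = c if j == steps - 1 else growth
        weights.append([[rng.uniform(-scale, scale) for _ in range(c_in)]
                        for _ in range(c_out)])
    return weights


z = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]  # 2 channels, 3 pixels
out = sdrcb(z, make_weights(c=2, growth=3))
print(len(out), len(out[0]))
```

The input width of each step grows by `growth` channels because every earlier output is concatenated in, while the final step returns to the input width so that the scaled residual can be added to $\mathbf{Z}$.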

5.3. Same-task Progressive Training Strategy


In recent years, the Progressive Training Strategy (PTS) [17, 25] has gained increased attention and can be seen as a method of fine-tuning. Compared to conventional training methods, PTS tends to converge model parameters to more desirable local minima. HAT [4] introduces Same-task Pre-training, which aims to train the model on a large dataset like ImageNet [14] before fine-tuning it on a specific data