[论文翻译]HUMUS-NET:混合展开多尺度网络加速磁共振成像重建的架构


原文地址:https://arxiv.org/pdf/2203.08213v2


ABSTRACT

摘要

In accelerated MRI reconstruction, the anatomy of a patient is recovered from a set of undersampled and noisy measurements. Deep learning approaches have been proven to be successful in solving this ill-posed inverse problem and are capable of producing very high quality reconstructions. However, current architectures heavily rely on convolutions, which are content-independent and have difficulty modeling long-range dependencies in images. Recently, Transformers, the workhorse of contemporary natural language processing, have emerged as powerful building blocks for a multitude of vision tasks. These models split input images into non-overlapping patches, embed the patches into lower-dimensional tokens and utilize a self-attention mechanism that does not suffer from the aforementioned weaknesses of convolutional architectures. However, Transformers incur extremely high compute and memory cost when 1) the input image resolution is high and 2) the image needs to be split into a large number of patches to preserve fine detail information; both conditions are typical in low-level vision problems such as MRI reconstruction and have a compounding effect. To tackle these challenges, we propose HUMUS-Net, a hybrid architecture that combines the beneficial implicit bias and efficiency of convolutions with the power of Transformer blocks in an unrolled and multi-scale network. HUMUS-Net extracts high-resolution features via convolutional blocks and refines low-resolution features via a novel Transformer-based multi-scale feature extractor. Features from both levels are then synthesized into a high-resolution output reconstruction. Our network establishes a new state of the art on the largest publicly available MRI dataset, the fastMRI dataset. We further demonstrate the performance of HUMUS-Net on two other popular MRI datasets and perform fine-grained ablation studies to validate our design.

在加速MRI重建中,患者解剖结构需从欠采样且含噪声的测量数据中恢复。深度学习方法已被证实能成功解决这一病态逆问题,并可生成极高品质的重建结果。然而,当前架构严重依赖与内容无关的卷积操作,难以建模图像中的长程依赖关系。近年来,作为当代自然语言处理核心的Transformer,已成为众多视觉任务的重要构建模块。这类模型将输入图像分割为不重叠的图块,将其嵌入低维token,并利用自注意力机制克服了卷积架构的上述缺陷。但Transformer在以下场景会引发极高的计算和内存开销:1) 输入图像分辨率较高时;2) 需将图像分割为大量图块以保留精细细节时——这两种情况在MRI重建等底层视觉问题中普遍存在且会产生叠加效应。为应对这些挑战,我们提出HUMUS-Net:一种在展开式多尺度网络中融合卷积有益隐式偏置与高效性、以及Transformer模块强大能力的混合架构。HUMUS-Net通过卷积块提取高分辨率特征,并借助新型基于Transformer的多尺度特征提取器优化低分辨率特征,最终将多层级特征合成为高分辨率输出重建。我们的网络在最大公开MRI数据集fastMRI上创造了新性能标杆,并在另外两个主流MRI数据集上验证了其优越性,同时通过细粒度消融实验证实了设计有效性。

1 Introduction

1 引言

Magnetic resonance imaging (MRI) is a medical imaging technique that uses strong magnetic fields to picture the anatomy and physiological processes of the patient. MRI is one of the most popular imaging modalities as it is noninvasive and doesn't expose the patient to harmful ionizing radiation. The MRI scanner obtains measurements of the body in the spatial frequency domain, also called $k$-space. The data acquisition process in MRI is often time-consuming. Accelerated MRI [Lustig et al., 2008] addresses this challenge by undersampling in the k-space domain, thus reducing the time patients need to spend in the scanner. However, recovering the underlying anatomy from undersampled measurements is an ill-posed problem (fewer measurements than unknowns) and thus incorporating some form of prior knowledge is crucial in obtaining high quality reconstructions. Classical MRI reconstruction algorithms rely on the assumption that the underlying signal is sparse in some transform domain and attempt to recover a signal that best satisfies this assumption in a technique known as compressed sensing (CS) [Candes et al., 2006, Donoho, 2006]. These classical CS techniques have slow reconstruction speed and typically enforce limited forms of image priors.

磁共振成像 (MRI) 是一种利用强磁场绘制患者解剖结构和生理过程的医学成像技术。由于无创且不会使患者暴露于有害电离辐射,MRI成为最受欢迎的成像方式之一。MRI扫描仪在空间频率域(也称为$k$空间)获取人体测量数据。MR的数据采集过程通常耗时较长。加速MRI [Lustig et al., 2008] 通过在k空间域进行欠采样来应对这一挑战,从而缩短患者在扫描仪内的停留时间。然而,从欠采样测量数据中还原底层解剖结构是一个不适定问题(测量值少于未知量),因此引入某种形式的先验知识对获得高质量重建结果至关重要。传统MRI重建算法基于底层信号在某个变换域具有稀疏性的假设,并尝试通过压缩感知 (CS) 技术 [Candes et al., 2006, Donoho, 2006] 还原最符合该假设的信号。这些传统CS技术重建速度较慢,且通常只能施加有限的图像先验形式。

With the emergence of deep learning (DL), data-driven reconstruction algorithms have far surpassed CS techniques (see Ongie et al. [2020] for an overview). DL models utilize large training datasets to extract flexible, nuanced priors directly from data, resulting in excellent reconstruction quality. In recent years, there has been a flurry of activity aimed at designing DL architectures tailored to the MRI reconstruction problem. The most popular models are convolutional neural networks (CNNs) that typically incorporate the physics of the MRI reconstruction problem, and utilize tools from mainstream deep learning (residual learning, data augmentation, self-supervised learning). Comparing the performance of such models has been difficult mainly for two reasons. First, there has been a large variation in evaluation datasets spanning different scanners, anatomies, acquisition models and undersampling patterns, rendering direct comparison challenging. Second, medical imaging datasets are often proprietary due to privacy concerns, hindering reproducibility.

随着深度学习 (DL) 的出现,数据驱动的重建算法已远超压缩感知技术 (CS) (综述见 Ongie 等人 [2020])。深度学习模型利用大规模训练数据集直接从数据中提取灵活、精细的先验信息,从而获得卓越的重建质量。近年来,针对 MRI 重建问题定制深度学习架构的研究呈现爆发式增长。最流行的模型是卷积神经网络 (CNN),这类网络通常融合了 MRI 重建问题的物理特性,并采用主流深度学习工具 (残差学习、数据增强、自监督学习)。此类模型的性能比较一直存在两大难点:首先,评估数据集存在巨大差异,涉及不同扫描设备、解剖部位、采集模型和欠采样模式,导致直接对比具有挑战性;其次,出于隐私考虑,医学影像数据集通常具有专属性,这阻碍了研究可复现性。

More recently, the fastMRI dataset [Zbontar et al., 2019], the largest publicly available MRI dataset, has been gaining ground as a standard benchmark to evaluate MRI reconstruction methods. An annual competition, the fastMRI Challenge [Muckley et al., 2021], attracts significant attention from the machine learning community and acts as a driver of innovation in MRI reconstruction. However, over the past years the public leaderboard has been dominated by a single architecture, the End-to-End VarNet [Sriram et al., 2020], with most models concentrating very closely around the same performance metrics, hinting at the saturation of current architectural choices.

最近,fastMRI数据集 [Zbontar et al., 2019] 作为最大的公开MRI数据集,正逐渐成为评估MRI重建方法的标准基准。一年一度的fastMRI挑战赛 [Muckley et al., 2021] 吸引了机器学习社区的广泛关注,并推动着MRI重建领域的创新。然而过去几年公开排行榜一直被单一架构End-to-End VarNet [Sriram et al., 2020] 主导,大多数模型的性能指标高度集中,暗示当前架构选择已趋于饱和。

In this work, we propose HUMUS-Net: a Hybrid, Unrolled, MUlti-Scale network architecture for accelerated MRI reconstruction that combines the advantages of well-established architectures in the field with the power of contemporary Transformer-based models. We utilize the strong implicit bias of convolutions, but also address their weaknesses, such as content-independence and inability to model long-range dependencies, by incorporating a novel multi-scale feature extractor that operates over embedded image patches via self-attention. Moreover, we tackle the challenge of high input resolution typical in MRI by performing the computationally most expensive operations on extracted low-resolution features. HUMUS-Net establishes a new state of the art in accelerated MRI reconstruction on the largest available MRI knee dataset. At the time of writing this paper, HUMUS-Net is the only Transformer-based architecture on the highly competitive fastMRI Public Leaderboard. Our results are fully reproducible and the source code is available online.

在本工作中,我们提出了HUMUS-Net:一种用于加速MRI重建的混合式、展开式、多尺度网络架构,它结合了该领域成熟架构的优势与基于Transformer的现代模型的能力。我们利用了卷积强大的隐式偏置,同时也通过引入一种新颖的多尺度特征提取器(该提取器通过自注意力机制在嵌入的图像块上操作)来解决其弱点,如内容无关性和无法建模长距离依赖关系。此外,我们通过在最耗计算资源的操作中提取低分辨率特征,应对了MRI中典型的高输入分辨率挑战。HUMUS-Net在现有最大的MRI膝关节数据集上建立了加速MRI重建的最新标杆。截至撰写本文时,HUMUS-Net是竞争激烈的fastMRI公共排行榜上唯一基于Transformer的架构。我们的结果完全可复现,源代码已在线公开[2]。

2 Background

2 背景

2.1 Inverse Problem Formulation of Accelerated MRI Reconstruction

2.1 加速MRI重建的逆问题表述

An MR scanner obtains measurements of the patient anatomy in the frequency domain, also referred to as $k$-space. Data acquisition is performed via various receiver coils positioned around the anatomy being imaged, each with different spatial sensitivity. Given a total number of $N$ receiver coils, the measurements obtained by the $i$th coil can be written as

磁共振 (MR) 扫描仪获取患者解剖结构的频域测量数据,也称为 $k$ 空间。数据采集通过围绕成像解剖结构布置的多个接收线圈完成,每个线圈具有不同的空间灵敏度。假设共有 $N$ 个接收线圈,第 i 个线圈获取的测量数据可表示为

$$
k_{i}=\mathcal{F}S_{i}x^{*}+z_{i},\quad i=1,\dots,N,
$$


where $\pmb{x}^{*}\in\mathbb{C}^{n}$ is the underlying patient anatomy of interest, $S_{i}$ is a diagonal matrix that represents the sensitivity map of the $i$th coil, $\mathcal{F}$ is a multi-dimensional Fourier transform, and $z_{i}$ denotes the measurement noise corrupting the observations obtained from coil $i$. We use $\pmb{k}=(\pmb{k}_{1},...,\pmb{k}_{N})$ as a shorthand for the concatenation of individual coil measurements and ${\pmb x}=({\pmb x}_{1},...,{\pmb x}_{N})$ as the corresponding image domain representation after inverse Fourier transformation.

其中 $\pmb{x}^{*}\in\mathbb{C}^{n}$ 是目标患者解剖结构的真实值,$S_{i}$ 是表示第 $i$ 个线圈灵敏度图的对角矩阵,$\mathcal{F}$ 是多维傅里叶变换,$z_{i}$ 表示第 $i$ 个线圈观测数据中的测量噪声。我们用 $\pmb{k}=(\pmb{k}_{1},...,\pmb{k}_{N})$ 简记各线圈测量数据的拼接结果,用 ${\pmb x}=({\pmb x}_{1},...,{\pmb x}_{N})$ 表示经过逆傅里叶变换后对应的图像域表示。
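
The multicoil forward model above is easy to sketch numerically. The snippet below is a minimal numpy simulation under our own toy assumptions (a $64\times64$ image, 4 coils, random sensitivity maps and a plain FFT standing in for $\mathcal{F}$), not the paper's actual acquisition setup:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, N = 64, 64, 4  # toy image size and coil count (illustrative)

# Ground-truth complex anatomy x* and per-coil sensitivity maps S_i
x_star = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
S = rng.standard_normal((N, h, w)) + 1j * rng.standard_normal((N, h, w))

# k_i = F S_i x* + z_i : Fourier transform of each coil image plus noise
z = 0.01 * (rng.standard_normal((N, h, w)) + 1j * rng.standard_normal((N, h, w)))
k = np.fft.fft2(S * x_star) + z  # fft2 acts on the last two axes, i.e. per coil

# Image-domain representation x = (x_1, ..., x_N) after inverse Fourier transform
x = np.fft.ifft2(k)
```

Elementwise multiplication by `S` plays the role of the diagonal matrices $S_i$, with broadcasting applying every coil map at once.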

Since MR data acquisition time is proportional to the portion of k-space being scanned, obtaining fully-sampled data is time-consuming. Therefore, in accelerated MRI scan times are reduced by undersampling in the k-space domain. The undersampled $\mathbf{k}$-space measurements from coil $i$ take the form

由于MR数据采集时间与扫描的k空间部分成正比,获取全采样数据非常耗时。因此,在加速MRI中,通过在k空间域进行欠采样来减少扫描时间。来自线圈 $i$ 的欠采样 $\mathbf{k}$ 空间测量形式为

$$
\tilde{k}_{i}=M k_{i},\quad i=1,\dots,N,
$$


where $M$ is a diagonal matrix representing the binary undersampling mask, which has zero entries for all frequency components that have not been sampled during accelerated acquisition.

其中 $M$ 是一个表示二元欠采样掩码的对角矩阵,其值为0的位置对应加速采集中未采样的缺失频率分量。
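
A toy illustration of applying such a mask (the specific pattern below, equispaced columns plus a fully sampled center region, is our own illustrative choice, not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
h, w, N = 64, 64, 4
k = rng.standard_normal((N, h, w)) + 1j * rng.standard_normal((N, h, w))

# Binary mask M: keep every 4th k-space column plus fully sampled center (ACS) lines
mask = np.zeros((h, w))
mask[:, ::4] = 1
mask[:, w // 2 - 4 : w // 2 + 4] = 1

k_tilde = mask * k                     # M k_i, broadcast over the coil axis
acceleration = mask.size / mask.sum()  # inverse of the sampled fraction of k-space
```

Sampled frequencies pass through unchanged, while every unsampled entry of `k_tilde` is exactly zero.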

The forward model that maps the underlying anatomy to coil measurements can be written concisely as $\tilde{k}=\mathcal{A}\left(x^{*}\right)$, where ${\mathcal{A}}\left(\cdot\right)$ is the linear forward mapping and $\tilde{k}$ is the stacked vector of all undersampled coil measurements. Our target is to reconstruct the ground truth object $\pmb{x}^{*}$ from the noisy, undersampled measurements $\tilde{k}$. Since we have fewer observations than variables to recover, perfect reconstruction in general is not possible. In order to make the problem solvable, prior knowledge on the underlying object is typically incorporated in the form of sparsity in some transform domain. This formulation, known as compressed sensing [Candes et al., 2006, Donoho, 2006], provides a classical framework for accelerated MRI reconstruction [Lustig et al., 2008]. In particular, the above recovery problem can be formulated as a regularized inverse problem

将底层解剖结构映射到线圈测量的正向模型可以简洁地表示为 $\tilde{k}=\mathcal{A}\left(x^{*}\right)$,其中 ${\mathcal{A}}\left(\cdot\right)$ 是线性正向映射,$\tilde{k}$ 是所有欠采样线圈测量值的堆叠向量。我们的目标是从噪声干扰的欠采样测量值 $\tilde{k}$ 中重建真实对象 $\pmb{x}^{*}$。由于观测值少于待恢复变量,通常无法实现完美重建。为了使问题可解,通常以某些变换域中的稀疏性形式引入关于底层对象的先验知识。这种称为压缩感知 [Candes et al., 2006, Donoho, 2006] 的方法,为加速MRI重建提供了经典框架 [Lustig et al., 2008]。具体而言,上述恢复问题可表述为带正则化的逆问题

$$
\hat{\pmb{x}}=\underset{\pmb{x}}{\arg\operatorname*{min}}\left\|\mathcal{A}\left(\pmb{x}\right)-\tilde{\pmb{k}}\right\|^{2}+\mathcal{R}(\pmb{x}),
$$


where $\mathcal{R}(\cdot)$ is a regularizer that encapsulates prior knowledge on the object, such as sparsity in some wavelet domain.

其中 $\mathcal{R}(\cdot)$ 是一个正则化项 (regularizer) ,它封装了关于对象的先验知识,例如在某些小波域中的稀疏性。
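
A classical solver for this regularized problem is proximal gradient descent (ISTA). The sketch below is a deliberately simplified, single-coil instance: an $\ell_1$ prior directly on the image stands in for wavelet-domain sparsity, $\mathcal{A}$ is a masked unitary FFT, and all sizes and hyperparameters are illustrative:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: proximal operator of t * ||.||_1 (complex-valued)."""
    mag = np.abs(v)
    return np.where(mag > t, (1 - t / np.maximum(mag, 1e-12)) * v, 0)

def ista(k_tilde, mask, lam=0.005, iters=200):
    """Approximately minimize ||M F x - k~||^2 + lam * ||x||_1 (unitary FFT)."""
    x = np.fft.ifft2(k_tilde, norm="ortho")  # zero-filled initialization
    for _ in range(iters):
        # gradient step on the data-fidelity term: F^H (M F x - k~)
        grad = np.fft.ifft2(mask * np.fft.fft2(x, norm="ortho") - k_tilde,
                            norm="ortho")
        x = soft(x - grad, lam)              # proximal step on the l1 prior
    return x

# Toy experiment: recover a sparse image from 40% random Fourier samples
rng = np.random.default_rng(0)
h = w = 64
x_true = np.zeros((h, w), dtype=complex)
x_true[rng.integers(0, h, 10), rng.integers(0, w, 10)] = 1.0
mask = (rng.random((h, w)) < 0.4).astype(float)
k_tilde = mask * np.fft.fft2(x_true, norm="ortho")

x_zf = np.fft.ifft2(k_tilde, norm="ortho")   # zero-filled baseline
x_rec = ista(k_tilde, mask)
```

The unrolled networks discussed later in the paper can be read as learned generalizations of exactly this fixed iteration.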

2.2 Deep Learning-based Accelerated MRI Reconstruction

2.2 基于深度学习 (Deep Learning) 的加速 MRI 重建

More recently, data-driven deep learning-based algorithms tailored to the accelerated MRI reconstruction problem have surpassed the classical compressed sensing baselines. Convolutional neural networks trained on large datasets have established new state of the art in many medical imaging tasks. The highly popular U-Net [Ronneberger et al., 2015] and other similar encoder-decoder architectures have proven to be successful in a range of medical image reconstruction [Hyun et al., 2018, Han and Ye, 2018] and segmentation [Çiçek et al., 2016, Zhou et al., 2018] problems. In the encoder path, the network learns to extract a set of deep, low-dimensional features from images via a series of convolutional and downsampling operations. These concise feature representations are then gradually upsampled and filtered in the decoder to the original image dimensions. Thus the network learns a hierarchical representation over the input image distribution.

最近,针对加速MRI重建问题量身定制的数据驱动深度学习算法已超越传统压缩感知基线。在大规模数据集上训练的卷积神经网络已在众多医学影像任务中确立了新的技术标杆。广受欢迎的U-Net [Ronneberger et al., 2015] 及其他类似编码器-解码器架构,已被证实在医学图像重建 [Hyun et al., 2018, Han and Ye, 2018] 与分割 [Çiçek et al., 2016, Zhou et al., 2018] 领域具有卓越表现。在编码路径中,网络通过一系列卷积和下采样操作学习从图像中提取深层低维特征,这些精炼的特征表示随后在解码器中逐步上采样并滤波还原至原始图像尺寸,从而实现对输入图像分布的层次化表征学习。

Unrolled networks constitute another line of work that has been inspired by popular optimization algorithms used to solve compressed sensing reconstruction problems. These deep learning models consist of a series of sub-networks, also known as cascades, where each sub-network corresponds to an unrolled iteration of popular algorithms such as gradient descent [Zhang and Ghanem, 2018] or ADMM [Sun et al., 2016]. In the context of MRI reconstruction, one can view network unrolling as solving a sequence of smaller denoising problems, instead of the complete recovery problem in one step. Various convolutional neural networks have been employed in the unrolling framework achieving excellent performance in accelerated MRI reconstruction [Putzky et al., 2019, Hammernik et al., 2018, 2019]. E2E-VarNet [Sriram et al., 2020] is the current state-of-the-art convolutional model on the fastMRI dataset. E2E-VarNet transforms the optimization problem in (2.1) to the $\mathbf{k}$ -space domain and unrolls the gradient descent iterations into $T$ cascades, where the $t$ th cascade represents the computation

展开网络 (unrolled networks) 是另一类受压缩感知重建问题常用优化算法启发的工作。这些深度学习模型由一系列子网络(也称为级联)组成,每个子网络对应梯度下降 [Zhang and Ghanem, 2018] 或 ADMM [Sun et al., 2016] 等流行算法的展开迭代。在 MRI 重建场景中,可将网络展开视为分步求解一系列较小的去噪问题,而非一步完成完整重建。多种卷积神经网络已被应用于展开框架,在加速 MRI 重建中取得优异性能 [Putzky et al., 2019, Hammernik et al., 2018, 2019]。E2E-VarNet [Sriram et al., 2020] 是目前 fastMRI 数据集上最先进的卷积模型,它将式 (2.1) 的优化问题转换到 $\mathbf{k}$ 空间域,并将梯度下降迭代展开为 $T$ 个级联,其中第 $t$ 个级联表示计算

$$
\hat{k}^{t+1}=\hat{k}^{t}-\mu^{t}M(\hat{k}^{t}-\tilde{k})+\mathcal{G}(\hat{k}^{t}),
$$


where $\hat{k}^{t}$ is the estimated reconstruction in the $\mathbf{k}$-space domain at cascade $t$, $\mu^{t}$ is a learnable step size parameter and $\mathcal{G}(\cdot)$ is a learned mapping representing the gradient of the regularization term in (2.1). The first term is also known as the data consistency (DC) term as it enforces the consistency of the estimate with the available measurements.

其中 $\hat{k}^{t}$ 是在级联 $t$ 时 $\mathbf{k}$ 空间域的估计重建结果,$\mu^{t}$ 是可学习步长参数,$\mathcal{G}(\cdot)$ 表示 (2.1) 中正则化项梯度的学习映射。第一项也称为数据一致性 (DC) 项,因为它强制估计结果与可用测量值保持一致。
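
One cascade of this unrolled scheme takes only a few lines. In the sketch below, $\mathcal{G}$ is left as a pluggable callable (a learned network in E2E-VarNet; a zero map is substituted here purely to check the data-consistency behavior, and all sizes are toy values):

```python
import numpy as np

def cascade(k_hat, k_tilde, mask, mu, G):
    """One unrolled iteration: k^{t+1} = k^t - mu * M (k^t - k~) + G(k^t)."""
    return k_hat - mu * mask * (k_hat - k_tilde) + G(k_hat)

rng = np.random.default_rng(0)
h, w = 32, 32
mask = (rng.random((h, w)) < 0.3).astype(float)
k_true = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
k_tilde = mask * k_true  # available undersampled measurements

# With G = 0 and mu = 1, sampled entries are reset exactly to the measurements,
# while unsampled entries are left for the learned term to fill in.
k0 = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
k1 = cascade(k0, k_tilde, mask, mu=1.0, G=lambda k: np.zeros_like(k))
```

Stacking $T$ such calls, each with its own learned $\mu^{t}$ and $\mathcal{G}$, gives the unrolled network.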

3 Related Work

3 相关工作

Transformers in Vision – Vision Transformer (ViT) [Dosovitskiy et al., 2020], a fully non-convolutional vision architecture, has demonstrated state-of-the-art performance on image classification problems when pre-trained on large-scale image datasets. The key idea of ViT is to split the input image into non-overlapping patches, embed each patch via a learned linear mapping and process the resulting tokens via stacked self-attention and multi-layer perceptron (MLP) blocks. For more details we refer the reader to Appendix G and [Dosovitskiy et al., 2020]. The benefit of Transformers over convolutional architectures in vision lies in their ability to capture long-range dependencies in images via the self-attention mechanism.

视觉领域的Transformer——Vision Transformer (ViT) [Dosovitskiy et al., 2020] 是一种完全非卷积的视觉架构,当在大规模图像数据集上进行预训练时,已在图像分类问题上展现出最先进的性能。ViT的核心思想是将输入图像分割为不重叠的图块,通过学习的线性映射嵌入每个图块,并通过堆叠的自注意力(self-attention)和多层感知机(MLP)块处理生成的token。更多细节请参阅附录G和[Dosovitskiy et al., 2020]。Transformer在视觉领域相比卷积架构的优势在于其能够通过自注意力机制捕捉图像中的长程依赖关系。
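
The ViT tokenization step described above amounts to a reshape plus one matrix multiply; a numpy sketch with toy sizes (the embedding matrix is random here, whereas ViT learns it):

```python
import numpy as np

H = W = 32; P = 4; C = 3; D = 16  # image size, patch size, channels, embed dim (toy)
img = np.random.default_rng(0).standard_normal((H, W, C))

# Split into non-overlapping PxP patches and flatten each one
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * C))

E = np.random.default_rng(1).standard_normal((P * P * C, D))  # learned in ViT
tokens = patches @ E  # (H/P * W/P, D): one D-dimensional token per patch
```

The resulting token sequence is what the stacked self-attention and MLP blocks consume.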

Since the introduction of ViT, similar attention-based architectures have been proposed for many other vision tasks such as object detection [Carion et al., 2020], image segmentation [Wang et al., 2021b] and restoration [Cao et al., 2021b, Chen et al., 2021, Liang et al., 2021, Zamir et al., 2021, Wang et al., 2021c]. A key challenge for Transformers in low-level vision problems is the quadratic compute complexity of the global self-attention with respect to the input dimension. In some works, this issue has been mitigated by splitting the input image into fixed size patches and processing the patches independently [Chen et al., 2021]. Others focus on designing hierarchical Transformer architectures [Heo et al., 2021, Wang et al., 2021b] similar to popular ResNets [He et al., 2015]. Authors in Zamir et al. [2021] propose applying self-attention channel-wise rather than across the spatial dimension thus reducing the compute overhead to linear complexity. Another successful architecture, the Swin Transformer [Liu et al., 2021], tackles the quadratic scaling issue by computing self-attention on smaller local windows. To encourage cross-window interaction, windows in subsequent layers are shifted relative to each other.

自ViT问世以来,基于注意力机制的类似架构已被广泛应用于目标检测 [Carion et al., 2020]、图像分割 [Wang et al., 2021b] 和图像修复 [Cao et al., 2021b, Chen et al., 2021, Liang et al., 2021, Zamir et al., 2021, Wang et al., 2021c] 等视觉任务。Transformer在底层视觉任务中的主要挑战在于全局自注意力机制随输入维度呈二次方增长的计算复杂度。部分研究通过将输入图像分割为固定大小的区块并独立处理 [Chen et al., 2021] 来缓解这一问题,另一些则借鉴ResNet [He et al., 2015] 设计分层Transformer架构 [Heo et al., 2021, Wang et al., 2021b]。Zamir等人 [2021] 提出在通道维度而非空间维度应用自注意力,将计算开销降至线性复杂度。Swin Transformer [Liu et al., 2021] 通过局部窗口计算自注意力解决二次方复杂度问题,并通过逐层移动窗口位置促进跨窗口交互。
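
The quadratic-cost argument behind windowed attention can be made concrete with a back-of-the-envelope count (the resolution, embedding dimension and window size below are illustrative):

```python
# Self-attention over T tokens of dimension d costs on the order of T^2 * d.
h = w = 320  # a typical MRI slice resolution
d = 48       # embedding dimension (illustrative)
win = 8      # Swin-style local window size (illustrative)

T = h * w                                          # tokens with 1x1 patches
global_cost = T * T * d                            # one global self-attention layer
num_windows = (h // win) * (w // win)
window_cost = num_windows * (win * win) ** 2 * d   # windowed self-attention

ratio = global_cost // window_cost                 # simplifies to T / win**2
```

Windowed attention is cheaper by a factor of $T/\text{win}^2$ (1600 for these sizes), which is what makes pixel-level tokens tractable at MRI resolutions.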


Figure 1: Overview of the HUMUS-Block architecture. First, we extract high-resolution features $F_{H}$ from the input noisy image through a convolution layer $f_{H}$ . Then, we apply a convolutional feature extractor $f_{L}$ to obtain lowresolution features and process them using a Transformer-convolutional hybrid multi-scale feature extractor. The shallow, high-resolution and deep, low-resolution features are then synthesized into the final high-resolution denoised image.

图 1: HUMUS-Block架构概览。首先,我们通过卷积层$f_{H}$从输入噪声图像中提取高分辨率特征$F_{H}$。接着,应用卷积特征提取器$f_{L}$获取低分辨率特征,并使用Transformer-卷积混合多尺度特征提取器进行处理。最终将浅层高分辨率特征与深层低分辨率特征合成为去噪后的高分辨率图像。

Transformers in Medical Imaging – Transformer architectures have been proposed recently to tackle various medical imaging problems. Authors in Cao et al. [2021a] design a U-Net-like architecture for medical image segmentation where the traditional convolutional layers are replaced by Swin Transformer blocks. They report strong results on multi-organ and cardiac image segmentation. In Zhou et al. [2021] a hybrid convolutional and Transformer-based U-Net architecture is proposed, tailored to volumetric medical image segmentation, with excellent results on benchmark datasets. Similar encoder-decoder architectures for various medical segmentation tasks have been investigated in other works [Huang et al., 2021, Wu et al., 2022]. However, these networks are tailored for image segmentation, a task less sensitive to fine details in the input, and thus larger patch sizes are often used (for instance, a patch size of 4 in Cao et al. [2021a]). This allows the network to process larger input images, as the number of token embeddings is greatly reduced, but as we demonstrate in Section 5.2, embedding individual pixels as $1\times1$ patches is crucial for MRI reconstruction. Thus, compute and memory barriers stemming from large input resolutions are more severe in the MRI reconstruction task and therefore novel approaches are needed.

医学影像中的Transformer架构——近期提出的Transformer架构正被用于解决各类医学影像问题。Cao等人[2021a]设计了一种类似U-Net的医学图像分割架构,用Swin Transformer模块替代了传统卷积层,在多器官和心脏图像分割中取得了优异效果。Zhou等人[2021]则提出了一种结合卷积与Transformer的混合U-Net架构,专门针对三维医学图像分割任务,在基准数据集上表现突出。其他研究[Huang等人,2021;Wu等人,2022]也探索了适用于不同医学分割任务的类似编码器-解码器架构。不过这些网络专为图像分割任务设计(该任务对输入细节的敏感度较低),因此常采用较大分块尺寸(如Cao等人[2021a]中使用的4×4分块)。这种方式通过大幅减少token嵌入数量,使网络能处理更大尺寸的输入图像。但如第5.2节所示,在MRI重建任务中,将单个像素嵌入为$1\times1$分块至关重要。因此,在MRI重建任务中,由高分辨率输入导致的计算与内存压力更为严峻,需要开发新的解决方案。

Promising results have been reported employing Transformers in medical image denoising problems, such as low-dose CT denoising [Wang et al., 2021a, Luthra et al., 2021] and low-count PET/MRI denoising [Zhang et al., 2021]. However, these studies fail to address the challenge of poor scaling to large input resolutions, and only work on small images via either downsampling the original dataset [Luthra et al., 2021] or by slicing the large input images into smaller patches [Wang et al., 2021a]. In contrast, our proposed architecture works directly on the large-resolution images that often arise in MRI reconstruction. Even though some work exists on Transformer-based architectures for supervised accelerated MRI reconstruction [Huang et al., 2022, Lin and Heckel, 2022, Feng et al., 2021], and for unsupervised pre-trained reconstruction [Korkmaz et al., 2022], to the best of our knowledge ours is the first work to demonstrate state-of-the-art results on large-scale MRI datasets such as the fastMRI dataset.

在医学图像去噪问题中,Transformer已展现出优异效果,例如低剂量CT去噪 [Wang et al., 2021a, Luthra et al., 2021] 和低计数PET/MRI去噪 [Zhang et al., 2021]。然而这些研究未能解决大输入分辨率下的扩展性难题,仅能通过下采样原始数据集 [Luthra et al., 2021] 或将大图像切割为小块 [Wang et al., 2021a] 来处理小尺寸图像。相比之下,我们提出的架构可直接处理MRI重建中常见的高分辨率图像。尽管已有研究将Transformer架构用于监督式加速MRI重建 [Huang et al., 2022, Lin and Heckel, 2022, Feng et al., 2021] 和无监督预训练重建 [Korkmaz et al., 2022],但据我们所知,本研究首次在fastMRI等大规模MRI数据集上实现了最先进的性能。

4 Method

4 方法

HUMUS-Net combines the efficiency and beneficial implicit bias of convolutional networks with the powerful general representations of Transformers and their capability to capture long-range pixel dependencies. The resulting hybrid network processes information both in image representation (via convolutions) and in patch-embedded token representation (via Transformer blocks). Our proposed architecture consists of a sequence of sub-networks, also called cascades. Each cascade represents an unrolled iteration of an underlying optimization algorithm in k-space (see (2.2)), with an image-domain denoiser, the HUMUS-Block. First, we describe the architecture of the HUMUS-Block, the core component in the sub-network. Then, we specify the high-level, k-space unrolling architecture of HUMUS-Net in Section 4.3.

HUMUS-Net结合了卷积网络的高效性和有益隐式偏置,以及Transformer的强大通用表征能力及其捕捉长距离像素依赖关系的特性。这种混合网络同时在图像表示(通过卷积)和分块嵌入的token表示(通过Transformer模块)中处理信息。我们提出的架构由一系列子网络(也称为级联)组成,每个级联代表k空间(见(2.2))中基础优化算法的展开迭代,并包含一个图像域去噪器——HUMUS-Block。首先,我们描述子网络核心组件HUMUS-Block的架构;然后在第4.3节具体说明HUMUS-Net的高层次k空间展开架构。

4.1 HUMUS-Block Architecture

4.1 HUMUS-Block架构

The HUMUS-Block acts as an image-space denoiser that receives an intermediate reconstruction from the previous cascade and performs a single step of denoising to produce an improved reconstruction for the next cascade. It extracts high-resolution, shallow features and low-resolution, deep features through a novel multi-scale transformer-based block, and synthesizes high-resolution features from those. The high-level overview of the HUMUS-Block is depicted in Fig. 1.

HUMUS-Block作为图像空间降噪器,接收来自前一级联的中间重建结果,执行单步降噪以生成供下一级联使用的改进重建。它通过一种新颖的基于多尺度Transformer的模块提取高分辨率浅层特征和低分辨率深层特征,并从中合成高分辨率特征。图1展示了HUMUS-Block的总体架构。

High-resolution Feature Extraction– The input to our network is a noisy complex-valued image $\pmb{x}_{in}\in\mathbb{R}^{h\times w\times c_{in}}$ derived from undersampled $\mathbf{k}$-space data, where the real and imaginary parts of the image are concatenated along the channel dimension. First, we extract high-resolution features $F_{H}\in\mathbb{R}^{h\times w\times d_{H}}$ from the input noisy image through a convolution layer $f_{H}$, written as $F_{H}=f_{H}(x_{in})$. This initial $3\times3$ convolution layer provides early visual processing at a relatively low cost and maps the input to a higher, $d_{H}$-dimensional feature space. It is important to note that the resolution of the extracted features is the same as the spatial resolution of the input image.

高分辨率特征提取——我们的网络输入是一个含噪声的复值图像 $\pmb{x}_{in}\in\mathbb{R}^{h\times w\times c_{in}}$,该图像源自欠采样的 $\mathbf{k}$ 空间数据,其实部和虚部沿通道维度拼接。首先,我们通过卷积层 $f_{H}$ 从输入噪声图像中提取高分辨率特征 $F_{H}\in\mathbb{R}^{h\times w\times d_{H}}$,记为 $F_{H}=f_{H}(x_{in})$。这个初始的 $3\times3$ 卷积层以较低成本实现早期视觉处理,并将输入映射到更高的 $d_{H}$ 维特征空间。需注意的是,提取特征的分辨率与输入图像的空间分辨率保持一致。

Low-resolution Feature Extraction– In case of MRI reconstruction, the input resolution is typically significantly higher than in commonly used image datasets $(32\times32-256\times256)$, posing a significant challenge to contemporary Transformer-based models. Thus we apply a convolutional feature extractor $f_{L}$ to obtain low-resolution features $F_{L}=f_{L}(F_{H})$, with $F_{L}\in\mathbb{R}^{h_{L}\times w_{L}\times d_{L}}$, where $f_{L}$ consists of a sequence of convolutional blocks and spatial downsampling operations. The specific architecture is depicted in Figure 4. The purpose of this module is to perform deeper visual processing and to provide a manageable input size to the subsequent computation- and memory-heavy hybrid processing module. In this work, we choose $h_{L}=\frac{h}{2}$ and $w_{L}=\frac{w}{2}$, which strikes a balance between preserving spatial information and resource demands. Furthermore, in order to compensate for the reduced resolution we increase the feature dimension to $d_{L}=2\cdot d_{H}:=d$.

低分辨率特征提取——在MRI重建任务中,输入分辨率通常显著高于常用图像数据集 $(32\times32-256\times256)$,这对当前基于Transformer的模型构成重大挑战。为此,我们采用卷积特征提取器 $f_{L}$ 来获取低分辨率特征 $F_{L}=f_{L}(F_{H})$,其中 $F_{L}\in\mathbb{R}^{h_{L}\times w_{L}\times d_{L}}$。该提取器由一系列卷积块和空间下采样操作构成,具体架构如图4所示。该模块旨在进行更深层次的视觉处理,并为后续计算密集、内存占用量大的混合处理模块提供可管理的输入尺寸。本工作中,我们选择 $h_{L}=\frac{h}{2}$ 和 $w_{L}=\frac{w}{2}$,以在保留空间信息与资源需求之间取得平衡。此外,为补偿降低的分辨率,我们将特征维度提升至 $d_{L}=2\cdot d_{H}:=d$。
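
A shape-only stand-in for $f_L$ can make the $h\times w\times d_{H}\to\frac{h}{2}\times\frac{w}{2}\times 2d_{H}$ contract explicit. Average pooling replaces the learned convolutional blocks and channel duplication replaces a learned projection; both substitutions, and $d_H=32$, are purely illustrative:

```python
import numpy as np

def f_L_stub(F):
    """Halve the spatial resolution (2x2 average pooling) and double the
    channel dimension (by duplication); the real f_L uses learned conv blocks."""
    h, w, d = F.shape
    pooled = F.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
    return np.concatenate([pooled, pooled], axis=-1)

F_H = np.ones((320, 320, 32))  # high-resolution features, d_H = 32 (illustrative)
F_L = f_L_stub(F_H)            # shape (160, 160, 64) = h/2 x w/2 x 2*d_H
```

Only the shape contract matters here; it is what lets the subsequent Transformer module run on a quarter of the spatial positions.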

Deep Feature Extraction– The most important part of our model is MUST, a MUlti-scale residual Swin Transformer network. MUST is a multi-scale hybrid feature extractor that takes the low-resolution image representations $F_{L}$ and performs hierarchical Transformer-convolutional hybrid processing in an encoder-decoder fashion, producing deep features $F_{D}=f_{D}(F_{L})$ , where the specific architecture behind $f_{D}$ is detailed in Subsection 4.2.

深度特征提取——我们模型中最重要的部分是MUST,一种多尺度残差Swin Transformer网络。MUST是一个多尺度混合特征提取器,它接收低分辨率图像表示$F_{L}$,并以编码器-解码器的方式执行分层Transformer-卷积混合处理,生成深度特征$F_{D}=f_{D}(F_{L})$,其中$f_{D}$背后的具体架构详见第4.2小节。

High-resolution Image Reconstruction– Finally, we combine information from shallow, high-resolution features $F_{H}$ and deep, low-resolution features $F_{D}$ to reconstruct the high-resolution residual image via a convolutional reconstruction module $f_{R}$. The residual learning paradigm allows us to learn the difference between noisy and clean images and helps information flow within the network [He et al., 2015]. Thus the final denoised image $\pmb{x}_{out}\in\mathbb{R}^{h\times w\times c_{in}}$ is obtained as $\pmb{x}_{out}=\pmb{x}_{in}+f_{R}(F_{H},F_{D})$. The specific architecture of the reconstruction network is depicted in Figure 4.

高分辨率图像重建——最后,我们结合来自浅层高分辨率特征 $F_{H}$ 和深层低分辨率特征 $F_{D}$ 的信息,通过卷积重建模块 $f_{R}$ 重建高分辨率残差图像。残差学习范式使我们能够学习噪声图像与干净图像之间的差异,并促进信息在网络中的流动 [He et al., 2015]。因此,最终的去噪图像 $\pmb{x}_{out}\in\mathbb{R}^{h\times w\times c_{in}}$ 通过 $\pmb{x}_{out}=\pmb{x}_{in}+f_{R}(F_{H},F_{D})$ 获得。重建网络的具体架构如图 4 所示。

4.2 Multi-scale Hybrid Feature Extraction via MUST

4.2 基于MUST的多尺度混合特征提取

The key component to our architecture is MUST, a multi-scale hybrid encoder-decoder architecture that performs deep feature extraction in both image and token representation (Figure 1, bottom). First, individual pixels of the input representation of shape $\textstyle{\frac{h}{2}}\times{\frac{w}{2}}\times d$ are flattened and passed through a learned linear mapping to yield $ \frac{h}{2}\cdot\frac{w}{2}$ tokens of dimension $d$ . Tokens corresponding to different image patches are subsequently merged in the encoder path, resulting in a concise latent representation. This highly descriptive representation is passed through a bottleneck block and progressively expanded by combining tokens from the encoder path via skip connections. The final output is rearranged to match the exact shape of the input low-resolution features $F_{L}$ , yielding a deep feature representation $F_{D}$ .

我们架构的核心组件是MUST,这是一种多尺度混合编码器-解码器架构,可在图像和token表示中进行深度特征提取 (图1,底部) 。首先,形状为 $\textstyle{\frac{h}{2}}\times{\frac{w}{2}}\times d$ 的输入表示的各个像素会被展平,并通过学习的线性映射生成 $ \frac{h}{2}\cdot\frac{w}{2}$ 个维度为 $d$ 的token。随后,编码器路径中会合并对应于不同图像块的token,从而生成简洁的潜在表示。这一高度描述性的表示会通过瓶颈块,并通过跳跃连接逐步结合编码器路径中的token进行扩展。最终输出会重新排列以匹配输入低分辨率特征 $F_{L}$ 的精确形状,生成深度特征表示 $F_{D}$ 。
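
The bookkeeping at the boundary of MUST is just a reshape: every pixel of the $\frac{h}{2}\times\frac{w}{2}\times d$ input becomes one $d$-dimensional token, and the decoder output is rearranged back. A toy-sized sketch with identity processing standing in for the encoder-decoder:

```python
import numpy as np

h2, w2, d = 8, 8, 6                # toy low-resolution feature shape (h/2, w/2, d)
F_L = np.arange(h2 * w2 * d, dtype=float).reshape(h2, w2, d)

tokens = F_L.reshape(h2 * w2, d)   # one token per pixel (1x1 patches)
# ... encoder / bottleneck / decoder would process `tokens` here ...
F_D = tokens.reshape(h2, w2, d)    # rearranged to match F_L's shape exactly
```

Because the token grid is pixel-aligned, no information is discarded at this boundary; all downsampling inside MUST happens through the learned merge operations.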

Our design is inspired by the success of Residual Swin Transformer Blocks (RSTB) in image denoising and superresolution [Liang et al., 2021]. RSTB features a stack of Swin Transformer layers (STL) that operate on tokens via a windowed self-attention mechanism [Liu et al., 2021], followed by convolution in image representation. However,

我们的设计灵感来源于残差Swin Transformer块 (RSTB) 在图像去噪和超分辨率任务中的成功 [Liang et al., 2021]。RSTB采用堆叠的Swin Transformer层 (STL) 结构,通过窗口自注意力机制 [Liu et al., 2021] 处理token,最后在图像表示层进行卷积操作。然而,


Figure 2: Depiction of different RSTB modules used in the HUMUS-Block.
Figure 3: Patch merge and expand operations used in our multi-scale feature extractor.

图 2: HUMUS-Block 中使用的不同 RSTB 模块示意图
图 3: 我们多尺度特征提取器中使用的补丁合并与扩展操作


Figure 4: Architecture of convolutional blocks for feature extraction and reconstruction.

图 4: 用于特征提取和重建的卷积块架构。

RSTB blocks operate on a single scale, and therefore they cannot be readily applied in a hierarchical encoder-decoder architecture. Thus, we design three variations of RSTB to facilitate multi-scale processing, as depicted in Figure 2.

RSTB模块在单一尺度上运行,因此无法直接应用于分层编码器-解码器架构。为此,我们设计了三种RSTB变体以实现多尺度处理,如图2所示。

RSTB-B is the bottleneck block responsible for processing the encoded latent representation while maintaining feature dimensions. Thus, we keep the default RSTB architecture for our bottleneck block, which already operates on a single scale.

RSTB-B是负责处理编码潜在表示同时保持特征维度的瓶颈块。因此,我们为瓶颈块保留了默认的RSTB架构,该架构已在单一尺度上运行。

RSTB-D has a similar function to convolutional downsampling blocks in U-Nets, but it operates on embedded tokens. Given an input of size $h_{i}\cdot w_{i}\times d$, we pass it through an RSTB-B block and apply the PatchMerge operation. PatchMerge linearly combines tokens corresponding to $2\times2$ non-overlapping image patches, while simultaneously increasing the embedding dimension (see Figure 3, top and Figure 10 in the appendix for more details), resulting in an output of size $\frac{h_{i}}{2}\cdot\frac{w_{i}}{2}\times2\cdot d$. Furthermore, RSTB-D outputs the higher-dimensional representation before patch merging to be used in the decoder path via skip connection.

RSTB-D的功能类似于U-Net中的卷积下采样块,但它在嵌入的token上进行操作。给定尺寸为 $h_{i}\cdot w_{i}\times d$ 的输入,我们将其通过RSTB-B块并应用PatchMerge操作。PatchMerge线性组合对应于 $2\times2$ 非重叠图像块的token,同时增加嵌入维度(详见图3顶部及附录中的图10),输出尺寸为 $\frac{h_{i}}{2}\cdot\frac{w_{i}}{2}\times2\cdot d$。此外,RSTB-D还通过跳跃连接输出块合并前的高维表示,供解码路径使用。

RSTB-U used in the decoder path is analogous to convolutional upsampling blocks. An input with size $h_{i}\cdot w_{i}\times d$ is first expanded into a larger number of lower dimensional tokens through a linear mapping via Patch Expand (see Figure 3, bottom and Figure 10 in the appendix for more details). Patch Expand reverses the effect of PatchMerge on feature size, thus resulting in $2h_{i}\cdot2w_{i}$ tokens of dimension $\frac{d}{2}$ . Next, we mix information from the obtained expanded tokens with skip embeddings from higher scales via TokenMix. This operation linearly combines tokens from both paths and normalizes the resulting vectors. Finally, the mixed tokens are processed by an RSTB-B block.

解码路径中使用的RSTB-U类似于卷积上采样块。首先通过Patch Expand的线性映射将尺寸为$h_{i}\cdot w_{i}\times d$的输入扩展为更多低维token(详见附录图10及图3底部)。Patch Expand会逆转PatchMerge对特征尺寸的影响,从而生成$2h_{i}\cdot2w_{i}$个维度为$\frac{d}{2}$的token。接着通过TokenMix将扩展后的token与来自更高尺度的跳跃嵌入(skip embeddings)信息进行混合。该操作会线性组合两条路径的token并对结果向量进行归一化。最后,混合后的token会经过RSTB-B块处理。
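Patch Expand 与 TokenMix 同样可以用简短的草图勾勒(示意性实现:类名与具体线性映射均为假设,仅保证尺寸变换与正文描述一致):

```python
import torch
import torch.nn as nn

class PatchExpand(nn.Module):
    """PatchMerge 的逆操作: (B, H*W, d) -> (B, 2H*2W, d/2)。"""
    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)  # d -> 2d = 4 * (d/2)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x, h, w):
        b, n, d = x.shape
        # 每个 token 经线性映射后拆分为 2x2 个 d/2 维 token
        x = self.expand(x).view(b, h, w, 2, 2, d // 2)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, 2 * h * 2 * w, d // 2)
        return self.norm(x)

class TokenMix(nn.Module):
    """线性组合扩展后的 token 与跳跃连接的嵌入,并做归一化。"""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim, bias=False)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, skip):
        return self.norm(self.mix(torch.cat([x, skip], dim=-1)))
```

按此草图,维度为 $d$ 的 $h_{i}\cdot w_{i}$ 个 token 会被扩展为 $2h_{i}\cdot2w_{i}$ 个维度为 $\frac{d}{2}$ 的 token,随后与同尺寸的跳跃嵌入混合。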

4.3 Iterative Unrolling

4.3 迭代展开

Architectures derived from unrolling the iterations of various optimization algorithms have proven to be successful in tackling a range of inverse problems including MRI reconstruction. These architectures can be interpreted as a cascade of simpler denoisers, each of which progressively refines the estimate from the preceding unrolled iteration (see more details in Appendix F).

源自展开各种优化算法迭代过程的架构已被证明能有效解决包括MRI重建在内的多种逆问题。这类架构可视为由多个简单去噪器组成的级联系统,每个去噪器都会逐步优化前一次展开迭代的估计结果(详见附录F)。

Following Sriram et al. [2020], we unroll the gradient descent iterations of the inverse problem in (2.1) in the $\mathbf{k}$-space domain, yielding the iterative update scheme in (2.2). We apply regularization in the image domain via our proposed HUMUS-Block, that is we have $\mathcal{G}(k)=\mathcal{F}\big(\mathcal{E}\big(\mathbf{D}\big(\mathcal{R}\big(\mathcal{F}^{-1}(k)\big)\big)\big)\big)$, where $\mathbf{D}$ denotes the HUMUS-Block, $\mathcal{R}(\pmb{x}_{1},...,\pmb{x}_{N})=\sum_{i=1}^{N}S_{i}^{*}x_{i}$ is the reduce operator that combines coil images via the corresponding sensitivity maps and $\mathcal{E}(\pmb{x})=(S_{1}\pmb{x},...,S_{N}\pmb{x})$ is the expand operator that maps the combined image back to individual coil images. The sensitivity maps can be estimated a priori using methods such as ESPIRiT [Uecker et al., 2014] or learned in an end-to-end fashion via the Sensitivity Map Estimator (SME) network proposed in Sriram et al. [2020]. In this work we aspire to design an end-to-end approach and thus we use the latter method and estimate the sensitivity maps from the low-frequency (ACS) region of the undersampled input measurements during training using a standard U-Net network. A visual overview of our unrolled architecture is shown in Figure 9 in the appendix.

遵循Sriram等人[2020]的方法,我们在$\mathbf{k}$空间域展开逆问题(2.1)的梯度下降迭代,得到迭代更新方案(2.2)。我们通过提出的HUMUS-Block在图像域进行正则化,即$\mathcal{G}(k)=\mathcal{F}\big(\mathcal{E}\big(\mathbf{D}\big(\mathcal{R}\big(\mathcal{F}^{-1}(k)\big)\big)\big)\big)$,其中$\mathbf{D}$表示HUMUS-Block,$\mathcal{R}(\pmb{x}_{1},...,\pmb{x}_{N})=\sum_{i=1}^{N}S_{i}^{*}x_{i}$是通过相应灵敏度图组合线圈图像的归约算子,$\mathcal{E}(\pmb{x})=(S_{1}\pmb{x},...,S_{N}\pmb{x})$是将组合图像映射回单个线圈图像的扩展算子。灵敏度图可以使用ESPIRiT [Uecker等人, 2014]等方法先验估计,或通过Sriram等人[2020]提出的灵敏度图估计器(SME)网络以端到端方式学习。在本工作中,我们致力于设计端到端方法,因此使用后一种方法,在训练期间使用标准U-Net网络从欠采样输入测量的低频(ACS)区域估计灵敏度图。我们展开架构的视觉概览见附录中的图9。
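上述一次展开迭代(软数据一致性项加图像域正则化 $\mathcal{G}(k)$)可以用如下草图表达。函数名、步长 `eta` 以及数据一致性项的具体形式均为按 E2E-VarNet 风格做出的假设,去噪器 $\mathbf{D}$ 以占位函数代替:

```python
import torch

def fft2c(x):
    """居中二维 FFT(正交归一化;shift 约定可能与具体实现不同)。"""
    return torch.fft.fftshift(
        torch.fft.fft2(torch.fft.ifftshift(x, dim=(-2, -1)), norm="ortho"),
        dim=(-2, -1))

def ifft2c(k):
    return torch.fft.fftshift(
        torch.fft.ifft2(torch.fft.ifftshift(k, dim=(-2, -1)), norm="ortho"),
        dim=(-2, -1))

def reduce_op(x, smaps):
    """归约算子 R: 用共轭灵敏度图组合各线圈图像 -> (B, H, W)。"""
    return (smaps.conj() * x).sum(dim=1)

def expand_op(x, smaps):
    """扩展算子 E: 将组合图像映射回各线圈图像 -> (B, C, H, W)。"""
    return smaps * x.unsqueeze(1)

def unrolled_iteration(k, k0, mask, smaps, denoiser, eta):
    """一次 k 空间展开更新:数据一致性 + 图像域正则化 G(k)。"""
    dc = mask * (k - k0)                           # 软数据一致性项
    img = reduce_op(ifft2c(k), smaps)              # R(F^{-1}(k))
    reg = fft2c(expand_op(denoiser(img), smaps))   # F(E(D(.)))
    return k - eta * dc - reg
```

当去噪器输出为零且 $k=k_0$ 时,该更新退化为恒等映射,可作为快速自检。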

4.4 Adjacent Slice Reconstruction (ASR)

4.4 相邻切片重建 (ASR)

We observe improvements in reconstruction quality when, instead of processing the undersampled data slice-by-slice, we jointly reconstruct a set of adjacent slices via HUMUS-Net (Figure 5). That is, if we have a volume of undersampled data $\tilde{\pmb{k}}^{vol}=(\tilde{\pmb{k}}^{1},...,\tilde{\pmb{k}}^{K})$ with $K$ slices, when reconstructing slice $c$, we instead reconstruct the volume $(\tilde{k}^{c-a},...,\tilde{k}^{c-1},\tilde{k}^{c},\tilde{k}^{c+1},...,\tilde{k}^{c+a})$ by concatenating the slices along the coil channel dimension, where $a$ denotes the number of adjacent slices added on each side. However, we only calculate and backpropagate the loss on the center slice $c$ of the reconstructed volume. The benefit of ASR is that the network can remove artifacts corrupting individual slices, as it sees a larger context of the slice by observing its neighbors. Even though ASR increases compute cost, it is important to note that it does not impact the number of token embeddings (the spatial resolution is unchanged) and thus can be combined favorably with Transformer-based methods.

我们观察到,相较于逐片处理欠采样数据,通过HUMUS-Net联合重建一组相邻切片时重建质量有所提升 (图5)。具体而言,给定一个包含$K$个切片的欠采样数据体$\tilde{\pmb{k}}^{v o l}=(\tilde{\pmb{k}}^{1},...,\tilde{\pmb{k}}^{K})$,在重建第$c$个切片时,我们改为沿线圈通道维度拼接相邻切片$(\tilde{k}^{c-a},...,\tilde{k}^{c-1},\tilde{k}^{c},\tilde{k}^{c+1},...,\tilde{k}^{c+a})$进行整体重建,其中$a$表示每侧添加的相邻切片数量。但仅计算并反向传播重建体中心切片$c$的损失。相邻切片重建(ASR)的优势在于,网络通过观察相邻切片获得更大上下文信息,从而能消除单个切片的伪影。虽然ASR会增加计算成本,但需注意它不会改变token嵌入数量(空间分辨率保持不变),因此可与基于Transformer的方法良好结合。
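相邻切片沿线圈通道维度的拼接、以及仅对中心切片计算损失,可以概括为如下草图(边界切片的处理方式论文未具体说明,这里假设采用索引钳制;函数名均为假设):

```python
import torch

def adjacent_slice_batch(k_vol, c, a):
    """将切片 c-a..c+a 沿线圈维度拼接(体数据边界处做索引钳制)。
    k_vol: (K, coils, H, W) -> 返回 (coils*(2a+1), H, W*...) 形状的联合输入。
    """
    K = k_vol.shape[0]
    idx = [min(max(i, 0), K - 1) for i in range(c - a, c + a + 1)]
    return torch.cat([k_vol[i] for i in idx], dim=0)   # (coils*(2a+1), H, W)

def asr_loss(recon_vol, target_vol, a, loss_fn):
    """仅在重建体的中心切片(索引 a)上计算并反向传播损失。"""
    return loss_fn(recon_vol[a], target_vol[a])
```

拼接发生在通道维度,因此 token 嵌入的数量(由空间分辨率决定)保持不变,这正是 ASR 与 Transformer 结合时计算上有利的原因。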


Figure 5: Adjacent slice reconstruction (depicted in image domain for visual clarity): HUMUS-Net takes a volume of adjacent slices $(\tilde{\pmb x}^{c-a},...,\tilde{\pmb x}^{c},...,\tilde{\pmb x}^{c+a})$ and jointly reconstructs a volume $(\hat{\pmb x}^{c-a},...,\hat{\pmb x}^{c},...,\hat{\pmb x}^{c+a})$ . The reconstruction loss $\mathcal{L}$ is calculated only on the center slice $\hat{\pmb x}^{c}$ .

图 5: 相邻切片重建 (为清晰展示采用图像域表示): HUMUS-Net接收相邻切片组成的体数据 $(\tilde{\pmb x}^{c-a},...,\tilde{\pmb x}^{c},...,\tilde{\pmb x}^{c+a})$ 并联合重建出体数据 $(\hat{\pmb x}^{c-a},...,\hat{\pmb x}^{c},...,\hat{\pmb x}^{c+a})$。重建损失 $\mathcal{L}$ 仅针对中心切片 $\hat{\pmb x}^{c}$ 计算。

5 Experiments

5 实验

In this section we provide experimental results on our proposed architecture, HUMUS-Net. First, we demonstrate the reconstruction performance of our model on various datasets, including the large-scale fastMRI dataset. Then, we justify our design choices through a set of ablation studies.

在本节中,我们展示了所提出的HUMUS-Net架构的实验结果。首先,我们在包括大规模fastMRI数据集在内的多种数据集上验证了模型的重建性能。随后,通过一系列消融实验验证了设计选择的合理性。

5.1 Benchmark Experiments

5.1 基准实验

We investigate the performance of HUMUS-Net on three different datasets. We use the structural similarity index measure (SSIM) [Wang et al., 2003] as the basis of our evaluation, which is the most common evaluation metric in medical image reconstruction. In all of our experiments, we follow the setup of the fastMRI multi-coil knee track with $8\times$ acceleration in order to provide comparison with state-of-the-art networks on the public leaderboard. That is, we perform retrospective undersampling of the fully-sampled k-space data by randomly subsampling $12.5\%$ of whole k-space lines in the phase encoding direction, keeping $4\%$ of the lowest frequency adjacent k-space lines. Experiments on other acceleration ratios can be found in Appendix C. During training, we generate random masks following the above method, whereas for the validation dataset we keep the masks fixed for each $\mathbf{k}$-space volume. For HUMUS-Net, we center crop and pad inputs to $384\times384$. We compare the reconstruction quality of our proposed model with the current best performing network, E2E-VarNet. For details on HUMUS-Net hyperparameters and training, we refer the reader to Appendix A. For E2E-VarNet, we use the hyperparameters specified in Sriram et al. [2020].

我们研究了HUMUS-Net在三个不同数据集上的性能表现。采用结构相似性指数(SSIM) [Wang et al., 2003]作为评估基准,这是医学图像重建领域最常用的评估指标。所有实验均遵循fastMRI多线圈膝关节数据集的8倍加速设置,以便与公开排行榜 (public leaderboard) 上的前沿网络进行对比。具体而言,我们对全采样k空间数据进行回顾性欠采样:在相位编码方向随机抽取12.5%的k空间线,同时保留4%的最低频相邻k空间线。其他加速比的实验结果详见附录C。训练时按上述方法生成随机掩膜,验证集则为每个k空间体积固定掩膜。对于HUMUS-Net,我们将输入中心裁剪并填充至384×384尺寸。将所提模型的重建质量与当前最佳性能网络E2E-VarNet进行对比。HUMUS-Net的超参数设置和训练细节见附录A,E2E-VarNet则采用Sriram等人[2020]论文中指定的超参数。
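上述随机欠采样掩膜(保留约4%的最低频相邻线,总采样率约为12.5%,即8倍加速)可以按 fastMRI 风格的随机掩膜函数勾勒如下(仅为示意,函数名、参数名与取整方式均为假设):

```python
import numpy as np

def random_column_mask(num_cols, accel=8, center_frac=0.04, seed=None):
    """相位编码方向的列掩膜:完整保留最低频的 center_frac 比例的线(ACS),
    其余线按概率采样,使总采样线数约为 num_cols/accel。
    """
    rng = np.random.default_rng(seed)
    num_center = int(round(num_cols * center_frac))
    # 其余线的采样概率,使总数约为 num_cols / accel
    prob = (num_cols / accel - num_center) / (num_cols - num_center)
    mask = rng.uniform(size=num_cols) < prob
    pad = (num_cols - num_center) // 2
    mask[pad:pad + num_center] = True              # 全采样的中心(ACS)区域
    return mask
```

训练时每次调用生成新的随机掩膜,验证时传入固定的 `seed` 即可为每个体积固定掩膜,与正文设置一致。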

fastMRI – The fastMRI dataset [Zbontar et al., 2019] is the largest publicly available MRI dataset with competitive baseline models and a public leaderboard, and thus provides an opportunity to directly compare different algorithms. Specifically, we run experiments on the multi-coil knee dataset, consisting of close to $35k$ slices in 973 volumes. We use the default HUMUS-Net model defined above with 3 adjacent slices as input. Furthermore, we design a large variant of our model, HUMUS-Net-L, which has an increased embedding dimension compared to the default model (see details in Appendix A). We train models both only on the training split, and also on the training and validation splits combined (an additional $\approx 20\%$ data) for the leaderboard. Table 1 shows our results compared to the best published models from the fastMRI leaderboard evaluated on the test dataset. Our model establishes a new state of the art in terms of SSIM on this dataset by a large margin, and achieves comparable or better performance than other methods in terms of PSNR and NMSE. Moreover, as seen in the second column of Table 2, we evaluated our model on the fastMRI validation dataset as well and compared our results to E2E-VarNet, the best performing model from the leaderboard. We observe improvements in the reconstruction SSIM metric similar to those on the test dataset. Visual inspection of reconstructions shows that HUMUS-Net recovers very fine details in images that may be missed by other state-of-the-art reconstruction algorithms (see Figure 11 in the appendix). Comparisons based on further image quality metrics can be found in Appendix H. We point out that even though our model has more parameters than E2E-VarNet, our proposed image-domain denoiser is more efficient than the U-Net deployed in E2E-VarNet, even when their numbers of parameters are matched, as discussed in Section 5.3. Overall, larger model size does not necessarily correlate with better performance, as seen for other competitive models in Table 1. Finally, we note that the additional training data (in the form of the validation split) provides a consistent small boost to model performance. We refer the reader to Klug and Heckel [2022] for an overview of scaling properties of reconstruction models.

fastMRI – fastMRI数据集 [Zbontar等人,2019] 是最大的公开可用MRI数据集,包含具有竞争力的基线模型和公开排行榜,为直接比较不同算法提供了机会。具体而言,我们在多线圈膝关节数据集上进行实验,该数据集包含973个扫描中近 $35k$ 个切片。我们使用上述定义的默认HUMUS-Net模型,以3个相邻切片作为输入。此外,我们设计了一个大型变体模型HUMUS-Net-L,与默认模型相比增加了嵌入维度(详见附录A)。我们分别在仅训练集、以及训练集与验证集组合(额外约20%的数据)上训练模型以参与排行榜。表1展示了我们的结果与fastMRI排行榜中最佳已发表模型在测试集上的对比。我们的模型以显著优势在该数据集的SSIM指标上创造了新纪录,并在PSNR和NMSE指标上达到或优于其他方法。此外,如表2第二列所示,我们还在fastMRI验证集上评估了模型,并将结果与排行榜表现最佳的E2E-VarNet模型进行对比,观察到重建SSIM指标有与测试集类似的提升。重建结果的可视化检查表明,HUMUS-Net能恢复其他先进重建算法可能遗漏的精细图像细节(参见附录图11)。更多图像质量指标的对比见附录H。需要指出的是,尽管我们的模型参数多于E2E-VarNet,但如第5.3节所述,即使参数数量匹配时,我们提出的图像域去噪器仍比E2E-VarNet中部署的U-Net更高效。总体而言,如表1中其他竞争模型所示,更大的模型规模未必对应更好的性能。最后我们注意到,额外训练数据(验证集部分)对模型性能有持续的小幅提升。关于重建模型缩放特性的概述,建议读者参阅Klug和Heckel [2022]。

Table 1: Performance of state-of-the-art accelerated MRI reconstruction techniques on the fastMRI knee test dataset. Most models are trained only on the fastMRI training dataset; where available, we show results of models trained on the combined fastMRI training and validation dataset, denoted by $(\dagger)$.

表 1: 当前最先进的加速 MRI 重建技术在 fastMRI 膝关节测试数据集上的性能表现。大多数模型仅在 fastMRI 训练数据集上进行训练,若可用,我们会展示在 fastMRI 训练和验证联合数据集上训练的模型结果,用 $(\dagger)$ 表示。

| 方法 | 参数量(约) | SSIM(↑) | PSNR(↑) | NMSE(↓) |
| --- | --- | --- | --- | --- |
| E2E-VarNet [Sriram et al., 2020] | 30M | 0.8900 | 36.9 | 0.0089 |
| E2E-VarNet† [Sriram et al., 2020] | 30M | 0.8920 | 37.1 | 0.0085 |
| XPDNet [Ramzi et al., 2020] | 155M | 0.8893 | 37.2 | 0.0083 |
| Σ-Net [Hammernik et al., 2019] | 676M | 0.8877 | 36.7 | 0.0091 |
| i-RIM [Putzky et al., 2019] | 300M | 0.8875 | 36.7 | 0.0091 |
| U-Net [Zbontar et al., 2019] | 214M | 0.8640 | 34.7 | 0.0132 |
| HUMUS-Net (ours) | 109M | 0.8936 | 37.0 | 0.0086 |
| HUMUS-Net† (ours) | 109M | 0.8945 | 37.3 | 0.0083 |
| HUMUS-Net-L (ours) | 228M | 0.8944 | 37.3 | 0.0081 |
| HUMUS-Net-L† (ours) | 228M | 0.8951 | 37.4 | 0.0080 |

Table 2: Validation SSIM of HUMUS-Net on various datasets. For datasets with multiple train-validation split runs we show the mean and standard error of the runs.

表 2: HUMUS-Net 在不同数据集上的验证 SSIM。对于具有多次训练-验证分割运行的数据集,我们展示了运行的平均值和标准误差。

| 方法 | fastMRI knee | Stanford 2D | Stanford 3D |
| --- | --- | --- | --- |
| E2E-VarNet [Sriram et al., 2020] | 0.8908 | 0.8928±0.0168 | 0.9432±0.0063 |
| HUMUS-Net (ours) | 0.8934 | 0.8954±0.0136 | 0.9453±0.0065 |

Table 3: Results of ablation studies on HUMUS-Net, evaluated on the Stanford 3D MRI dataset.

表 3: HUMUS-Net在斯坦福3D MRI数据集上的消融实验结果。

| Method | Unrolled? | Multi-scale? | Low-res features? | Patch size | Embed. dim. | SSIM |
| --- | --- | --- | --- | --- | --- | --- |
| Un-SS | ✓ | ✗ | ✗ | 1 | 12 | 0.9319±0.0080 |
| Un-MS | ✓ | ✓ | ✗ | 1 | 12 | 0.9357±0.0038 |
| Un-MS-Patch2 | ✓ | ✓ | ✗ | 2 | 36 | 0.9171±0.0075 |
| HUMUS-Net | ✓ | ✓ | ✓ | 1 | 36 | 0.9449±0.0064 |
| SwinIR | ✗ | ✗ | ✗ | 1 | – | 0.9336±0.0069 |
| E2E-VarNet | ✓ | – | – | – | – | 0.9432±0.0063 |

Stanford 2D – Next, we run experiments on the Stanford2D FSE [Cheng] dataset, a publicly available MRI dataset consisting of scans from various anatomies (pelvis, lower extremity and more) in 89 fully-sampled volumes. We randomly sample $80\%$ of volumes as train data and use the rest for validation. We randomly generate 3 different train-validation splits this way to reduce variations in the presented metrics. As slices in this dataset have widely varying shapes across volumes, we center crop the target images to keep spatial resolution within $384\times384$. We use the default HUMUS-Net defined above with single slices as input. Our results comparing the best performing MRI reconstruction model with HUMUS-Net are shown in the third column of Table 2. We present the mean SSIM of all runs along with the standard error. We achieve improvements of similar magnitude as on the fastMRI dataset. These results demonstrate the effectiveness of HUMUS-Net on a more diverse dataset featuring multiple anatomies.

Stanford 2D – 接下来,我们在Stanford2D FSE [Cheng]数据集上进行实验。该公开MRI数据集包含89个全采样体积的扫描数据,涵盖骨盆、下肢等多种解剖结构。我们随机选取80%体积作为训练数据,其余用于验证。通过这种方式随机生成3种不同的训练-验证分割,以减少指标呈现的波动性。由于该数据集中各体积的切片形状差异较大,我们对目标图像进行中心裁剪以保持空间分辨率在384×384范围内。使用上述定义的默认HUMUS-Net模型,以单切片作为输入。表2第三列展示了性能最佳MRI重建模型与HUMUS-Net的对比结果,报告了所有实验的平均SSIM及标准误差。我们取得的改进幅度与fastMRI数据集相当,这些结果证明了HUMUS-Net在包含多种解剖结构的多样化数据集上的有效性。

Stanford 3D – Finally, we evaluate our model on the Stanford Fully sampled 3D FSE Knees dataset [Sawyer et al., 2013], a public MRI dataset including 20 volumes of knee MRI scans. We generate train-validation splits using the method described for Stanford 2D and perform 3 runs. We use the default HUMUS-Net network with single slices as input. The last column of Table 2 compares our results to E2E-VarNet, showing improvements of similar scales as on other datasets we have investigated in this work. This experiment demonstrates that HUMUS-Net performs well not only on large-scale MRI datasets, but also on smaller problems.

斯坦福3D数据集 - 最后,我们在斯坦福全采样3D FSE膝关节数据集 [Sawyer et al., 2013] 上评估模型性能。该公开MRI数据集包含20组膝关节MRI扫描数据。我们采用与斯坦福2D相同的划分方法生成训练-验证集,并进行3次实验运行。使用默认配置的HUMUS-Net网络,以单切片作为输入。表2最后一列显示,相比E2E-VarNet,我们的改进幅度与本文研究的其他数据集相当。该实验证明HUMUS-Net不仅在大规模MRI数据集上表现优异,在小规模任务中同样具有竞争力。

5.2 Ablation Studies

5.2 消融实验

In this section, we motivate our design choices through a set of ablation studies. We start from SwinIR, a general image reconstruction network, and highlight its weaknesses for MRI. Then, we demonstrate step-by-step how we addressed these shortcomings and arrived at the HUMUS-Net architecture. We train the models on the Stanford 3D dataset. More details can be found in Appendix B. The results of our ablation studies are summarized in Table 3.

在本节中,我们通过一系列消融实验说明设计决策的依据。我们从通用图像重建网络 SwinIR 出发,重点分析其在 MRI (Magnetic Resonance Imaging) 重建中的不足,随后逐步展示如何解决这些缺陷并最终形成 HUMUS-Net 架构。所有模型均在 Stanford 3D 数据集上训练,更多细节见附录 B。消融实验结果汇总于表 3。

First, we investigate SwinIR, a state-of-the-art image denoiser and super-resolution model that features a hybrid Transformer-convolutional architecture. In order to handle the $10\times$ larger input sizes $(320\times320)$ in our MRI dataset compared to the input images this network has been designed for $(128\times128)$ , we reduce the embedding dimension of SwinIR to fit into GPU memory (16 GB). We find that compared to models designed for MRI reconstruction, such as E2E-VarNet (last row in Table 3) SwinIR performs poorly. This is not only due to the reduced network size, but also due to the fact that SwinIR is not specifically designed to take the MRI forward model into consideration.

首先,我们研究了SwinIR,这是一种采用Transformer与卷积混合架构的先进图像去噪和超分辨率模型。为处理MRI数据集中比该网络设计输入尺寸$(128\times128)$大$10\times$的$(320\times320)$输入,我们降低了SwinIR的嵌入维度以适应GPU内存(16 GB)。研究发现,与专为MRI重建设计的模型(如表3最后一行E2E-VarNet)相比,SwinIR表现欠佳。这既源于网络规模的缩减,也因SwinIR未专门考虑MRI前向模型所致。

Next, we unroll SwinIR and add a sensitivity map estimator. We refer to this model as Un-SS. Due to unrolling, we have to further reduce the embedding dimension of the denoiser and also decrease the depth of the network in order to fit into GPU memory. Un-SS, due to its small size, performs slightly worse than vanilla SwinIR and significantly lags behind the E2E-VarNet architectures. We note that SwinIR operates over a single, full-resolution scale, whereas state-of-the-art MRI reconstruction models typically incorporate multi-scale processing in the form of U-Net-like architectures.

接下来,我们对SwinIR进行展开并加入灵敏度图估计器。将该模型称为$Un$-SS。由于展开操作,我们不得不进一步降低去噪器的嵌入维度,并缩减网络深度以适应GPU内存。Un-SS因其较小规模,性能略逊于原始SwinIR,且显著落后于E2E-VarNet架构。值得注意的是,SwinIR仅在单一全分辨率尺度上运行,而当前最先进的MRI重建模型通常采用类似U-Net架构的多尺度处理方式。

Thus, we replace SwinIR by MUST, our proposed multi-scale hybrid processing unit, but keep the embedding dimension in the largest-resolution scale fixed. The obtained network, which we call Un-MS, has overall lower computational cost when compared with Un-SS; however, as Table 3 shows, MRI reconstruction performance has significantly improved compared to both Un-SS and vanilla SwinIR, which highlights the efficiency of our proposed multi-scale feature extractor. Reconstruction performance is limited by the low dimension of patch embeddings, which we are unable to increase further due to our compute and memory constraints originating in the high-resolution inputs.

因此,我们用提出的多尺度混合处理单元MUST替代SwinIR,但保持最大分辨率尺度下的嵌入维度固定。所得网络(称为Un-MS)与Un-SS相比总体计算成本更低,但如表3所示,其MRI重建性能较Un-SS和原始SwinIR均有显著提升,这凸显了我们提出的多尺度特征提取器的高效性。重建性能受限于小块嵌入的低维度,由于高分辨率输入带来的计算和内存限制,我们无法进一步增加该维度。

The most straightforward approach to tackle the challenge of high input resolution is to increase the patch size. To test this idea, we take Un-MS and embed $2\times2$ patches of the inputs, thus reducing the number of tokens processed by the network by a factor of 4. We refer to this model as Un-MS-Patch2. This reduction in compute and memory load allows us to increase network capacity by increasing the embedding dimension 3-fold (to fill GPU memory again). However, Un-MS-Patch2 performs much worse than previous models using a patch size of 1. For classification problems, where the general image context is more important than small details, patches of $16\times16$ or $8\times8$ are considered typical [Dosovitskiy et al., 2020]. Even for more dense prediction tasks such as medical image segmentation, a patch size of $4\times4$ has been used successfully [Cao et al., 2021a]. However, our experiments suggest that in low-level vision tasks such as MRI reconstruction, using patches larger than $1\times1$ may be detrimental due to loss of crucial high-frequency detail information.

应对高输入分辨率挑战最直接的方法是增大分块(patch)尺寸。为验证这一思路,我们在Un-MS模型中将输入图像嵌入为$2×2$分块,使网络处理的token数量减少为原来的1/4,将该模型称为Un-MS-Patch2。计算量和内存占用的降低使我们能将嵌入维度提升3倍(再次占满GPU显存)。但实验表明,Un-MS-Patch2的性能远逊于采用$1×1$分块的先前模型。在分类任务中,由于全局图像上下文比细节更重要,通常采用$16×16$或$8×8$分块[Dosovitskiy et al., 2020];即便是医学图像分割等密集预测任务,$4×4$分块也取得了成功[Cao et al., 2021a]。但我们的实验表明,在MRI重建这类底层视觉任务中,使用大于$1×1$的分块可能因丢失关键高频细节信息而适得其反。

Our approach to address the heavy computational load of Transformers for large input resolutions where increasing the patch size is not an option is to process lower resolution features extracted via convolutions. That is we replace MUST in Un-MS by a HUMUS-Block, resulting in our proposed HUMUS-Net architecture. We train a smaller version of HUMUS-Net with the same embedding dimension as Un-MS. As seen in Table 3, our model achieves the best performance across all other proposed solutions, even surpassing E2E-VarNet. This series of incremental studies highlights the importance of each architectural design choice leading to our proposed HUMUS-Net architecture. Further ablation studies on the effect of the number of unrolled iterations and adjacent slice reconstruction can be found in Appendix D and Appendix E respectively.

我们针对Transformer在大输入分辨率下计算负载过重且无法增大分块(patch)尺寸的问题,提出通过卷积提取低分辨率特征进行处理的方法。具体而言,我们在Un-MS架构中用HUMUS模块取代MUST模块,由此构建出HUMUS-Net架构。如表3所示,即使采用与Un-MS相同的嵌入维度训练缩小版HUMUS-Net,我们的模型在所有对比方案中仍取得最佳性能,甚至超越E2E-VarNet。这一系列渐进式研究揭示了每个架构设计决策对最终HUMUS-Net方案的重要性。关于展开迭代次数和相邻切片重建影响的进一步消融实验分别详见附录D和附录E。

5.3 Direct comparison of image-domain denoisers

5.3 图像域降噪器的直接对比

In order to further demonstrate the advantage of HUMUS-Net over E2E-VarNet, we provide a direct comparison between the image-domain denoisers used in the above methods. E2E-VarNet unrolls a fully convolutional U-Net, whereas we deploy our hybrid HUMUS-Block architecture as a denoiser (Fig. 1). We scale down the HUMUS-Block used in HUMUS-Net to match the size of the U-Net in E2E-VarNet in terms of number of model parameters. To this end, we reduce the embedding dimension from 66 to 30. We train both networks on magnitude images from the Stanford 3D dataset. The results are summarized in Table 4. We observe that given a fixed parameter budget, our proposed denoiser outperforms the widely used convolutional U-Net architecture in MRI reconstruction, further demonstrating the efficiency of HUMUS-Net. This experiment suggests that our HUMUS-Block could serve as an excellent denoiser, replacing convolutional U-Nets, in a broad range of image restoration applications outside of MRI, which we leave for future work.

为了进一步展示HUMUS-Net相对于E2E-VarNet的优势,我们对上述方法中使用的图像域降噪器进行了直接比较。E2E-VarNet展开了一个全卷积U-Net,而我们则采用混合HUMUS-Block架构作为降噪器 (图1)。我们按模型参数量级将HUMUS-Net中的HUMUS-Block缩小至与E2E-VarNet的U-Net相当规模,具体将嵌入维度从66降至30。两个网络均在Stanford 3D数据集的幅值图像上进行训练,结果汇总于表4。实验表明,在固定参数预算下,我们提出的降噪器在MRI重建任务中优于广泛使用的卷积U-Net架构,进一步证明了HUMUS-Net的高效性。该实验还提示,我们的HUMUS-Block可作为优质降噪模块替代卷积U-Net,拓展至MRI之外更广泛的图像修复应用领域,这部分工作将留待未来研究。

Table 4: Direct comparison of denoisers on the Stanford 3D dataset. Mean and standard error of 3 random trainingvalidation splits is shown.

| 模型 | 参数量 | SSIM(↑) | PSNR(↑) | NMSE(↓) |
| --- | --- | --- | --- | --- |
| U-Net | 2.5M | 0.9348±0.0072 | 39.0±0.6 | 0.0257±0.0007 |
| HUMUS-Block | 2.4M | 0.9378±0.0065 | 39.2±0.5 | 0.0246±0.0004 |

表 4: 斯坦福3D数据集上降噪模型的直接对比。结果显示为3次随机训练验证分割的均值及标准误差。

6 Conclusion

6 结论

In this paper, we introduce HUMUS-Net, an unrolled, Transformer-convolutional hybrid network for accelerated MRI reconstruction. HUMUS-Net achieves state-of-the-art performance on the fastMRI dataset and greatly outperforms all previously published and reproducible methods. We demonstrate the performance of our proposed method on two other MRI datasets and perform fine-grained ablation studies to motivate our design choices and emphasize the compute and memory challenges of Transformer-based architectures on low-level and dense computer vision tasks such as MRI reconstruction. A limitation of our current architecture is that it requires fixed-size inputs, which we intend to address with a more flexible design in the future. This work opens the door for the adoption of a multitude of promising techniques introduced recently in the literature for Transformers, which we leave for future work.

本文介绍了HUMUS-Net,一种用于加速MRI重建的展开式Transformer-卷积混合网络。HUMUS-Net在fastMRI数据集上实现了最先进的性能,大幅超越了所有先前发表且可复现的方法。我们在另外两个MRI数据集上验证了所提方法的性能,并通过细粒度消融实验论证了设计选择的合理性,同时强调了基于Transformer的架构在MRI重建等低层次密集计算机视觉任务中面临的计算与内存挑战。当前架构的局限性在于需要固定尺寸的输入,我们计划在未来通过更灵活的设计来解决这一问题。这项工作为采用文献中近期提出的多种Transformer相关技术打开了大门,这些探索将留待未来研究。
