[Paper Translation] A Boundary-aware Segmentation Diffusion Model for Skin Lesion Delineation


Original paper: https://arxiv.org/pdf/2308.02959v1


DermoSegDiff: A Boundary-aware Segmentation Diffusion Model for Skin Lesion Delineation

Abstract. Skin lesion segmentation plays a critical role in the early detection and accurate diagnosis of dermatological conditions. Denoising Diffusion Probabilistic Models (DDPMs) have recently gained attention for their exceptional image-generation capabilities. Building on these advancements, we propose DermoSegDiff, a novel framework for skin lesion segmentation that incorporates boundary information during the learning process. Our approach introduces a novel loss function that prioritizes the boundaries during training, gradually reducing the significance of other regions. We also introduce a novel U-Net-based denoising network that proficiently integrates noise and semantic information inside the network. Experimental results on multiple skin segmentation datasets demonstrate the superiority of DermoSegDiff over existing CNN, transformer, and diffusion-based approaches, showcasing its effectiveness and generalization in various scenarios. The implementation is publicly accessible on GitHub.

Keywords: Deep learning · Diffusion models · Skin · Segmentation.

1 Introduction

In medical image analysis, skin lesion segmentation aims at identifying skin abnormalities or lesions from dermatological images. Dermatologists traditionally rely on visual examination and manual delineation to diagnose skin lesions, including melanoma, basal cell carcinoma, squamous cell carcinoma, and other benign or malignant growths. However, the accurate and rapid segmentation of these lesions plays a crucial role in early detection, treatment planning, and monitoring of disease progression. Automated medical image segmentation methods have garnered significant attention in recent years due to their potential to enhance the accuracy and reliability of diagnostic results. The success of these models in medical image segmentation tasks can be attributed to the advancements in deep learning techniques, including convolutional neural networks (CNNs) [2,23,13], implicit neural representations [21] and vision transformers [29,4].

Lately, Denoising Diffusion Probabilistic Models (DDPMs) [11] have gained considerable interest due to their remarkable performance in the field of image generation. This newfound recognition has led to a surge in interest and exploration of DDPMs, propelled by their exceptional capabilities in generating high-quality and diverse samples. Building on this momentum, researchers have successfully proposed new medical image segmentation methods that leverage diffusion models to tackle this challenging task [14]. EnsDiff [30] utilizes ground truth segmentation as training data and input images as priors to generate segmentation distributions, enabling the creation of uncertainty maps and an implicit ensemble of segmentations. Kim et al. [16] propose a novel framework for self-supervised vessel segmentation. MedSegDiff [31] introduces DPM-based medical image segmentation with dynamic conditional encoding and FF-Parser to mitigate high-frequency noise effects. MedSegDiff-V2 [32] enhances it with a conditional U-Net for improved noise-semantic feature interaction.

Boundary information has proven crucial in the segmentation of skin images, particularly when it comes to accurately localizing and distinguishing skin lesions from the surrounding healthy tissue [19,29,15]. Boundary information provides spatial relationships between different regions within the skin and holds greater significance compared to other areas. By emphasizing these regions during the training phase, we can achieve more accurate results by encouraging the model to focus on intensifying boundary regions while reducing the impact of other areas. However, most diffusion-based segmentation methods overlook this importance and assign equal importance to all regions. Another critical consideration is the choice of a denoising architecture, which directly impacts the model's capacity to learn complex data relationships. Most methods have followed a baseline approach [11,22], overlooking the potential of incorporating the interaction between semantic and noise information more effectively within the network.

To address these shortcomings, we propose a novel and straightforward framework called DermoSegDiff. Our approach tackles the above-mentioned issues by considering the importance of boundary information during training and presenting a novel denoising network that facilitates a more effective understanding of the relationship between noise and semantic information. Specifically, we propose a novel loss function to prioritize the distinguishing boundaries in the segmentation. By incorporating a dynamic parameter into the loss function, we increase the emphasis on boundary regions while gradually diminishing the significance of other regions as we move further away from the boundaries. Moreover, we present a novel U-Net-based denoising network structure that enhances the integration of guidance throughout the denoising process by incorporating a carefully designed dual-path encoder. This encoder effectively combines noise and semantic information, extracting complementary and discriminative features. Our model also has a unique bottleneck incorporating linear attention [26] and original self-attention [10] in parallel. Finally, the decoder receives the output, combined with the two outputs transferred from the encoder, and utilizes this information to estimate the amount of noise. Our experimental results demonstrate the superiority of our proposed method compared to CNN, transformer, and diffusion-based state-of-the-art (SOTA) approaches on ISIC 2018 [9], $\mathrm{PH^{2}}$ [20], and HAM10000 [27] skin segmentation datasets, showcasing the effectiveness and generalization of our method in various scenarios. Contributions of this paper are as follows:
(1) We highlight the importance of incorporating boundary information in skin lesion segmentation by introducing a novel loss function that encourages the model to prioritize boundary areas.
(2) We present a novel denoising network that significantly improves noise reduction and enhances semantic interaction, demonstrating faster convergence compared to the baseline model on the different skin lesion datasets.
(3) Our approach surpasses state-of-the-art methods, including CNNs, transformers, and diffusion-based techniques, across four diverse skin segmentation datasets.


Fig. 1: (a) illustrates the architecture of the baseline, and (b) presents our proposed DermoSegDiff framework.

2 Method

Figure 1 provides an overview of our baseline DDPM model and presents our proposed DermoSegDiff framework for skin lesion segmentation. While traditional diffusion-based medical image segmentation methods focus on denoising the noisy segmentation mask conditioned on the input image, we propose that incorporating boundary information during the learning process can significantly improve performance. By leveraging edge information to distinguish overlapped objects, we aim to address the challenges posed by fuzzy boundaries in difficult cases and cases where lesions and backgrounds have similar colors. We begin by presenting our baseline method. Subsequently, we delve into how the inclusion of boundary information can enhance skin lesion segmentation and propose a novel approach to incorporate this information into the learning process. Finally, we introduce our network structure, which integrates guidance into the denoising process more effectively.

2.1 Baseline

The core architecture employed in this paper is based on DDPMs [11,30] (see Figure 1a). Diffusion models primarily utilize $T$ timesteps to learn the underlying distribution of the training data, denoted as $q(x_{0})$, by performing variational inference on a Markovian process. The framework consists of two processes: forward and reverse. During the forward process, the model starts with the ground truth segmentation mask ($x_{0}\in\mathbb{R}^{H\times W\times 1}$) and adds Gaussian noise in successive steps, gradually transforming it into a noisy mask:

$$
q\left(x_{t}\mid x_{t-1}\right)=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\cdot x_{t-1},\beta_{t}\cdot\mathbf{I}\right),\quad\forall t\in\{1,\ldots,T\},
$$

in which $\beta_{1},\dots,\beta_{T-1},\beta_{T}$ represent the variance schedule across diffusion steps. We can then simply sample an arbitrary step of the noisy mask conditioned on the ground truth segmentation as follows:

$$
q\left(x_{t}\mid x_{0}\right)=\mathcal{N}\left(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},\left(1-\bar{\alpha}_{t}\right)\mathbf{I}\right)
$$

$$
x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,
$$

where $\alpha_{t}:=1-\beta_{t}$, $\bar{\alpha}_{t}:=\prod_{j=1}^{t}\alpha_{j}$ and $\epsilon\sim\mathcal{N}(0,\mathbf{I})$. In the reverse process, the objective is to reconstruct the original structure of the mask perturbed during the diffusion process, given the input image as guidance ($g\in\mathbb{R}^{H\times W\times 3}$), by leveraging a neural network to learn the underlying process. To achieve this, we concatenate $x_{t}$ and $g$, and denote the concatenated output as $I_{t}:=x_{t}\parallel g$, where $I_{t}\in\mathbb{R}^{H\times W\times(3+1)}$. Hence, the reverse process is defined as

$$
p_{\theta}\left(x_{t-1}\mid x_{t}\right)=\mathcal{N}\left(x_{t-1};\mu_{\theta}\left(I_{t},t\right),\Sigma_{\theta}\left(I_{t},t\right)\right),
$$

where Ho et al. [11] conclude that instead of directly predicting $\mu_{\theta}$ using the neural network, we can train a model to predict the added noise, $\epsilon_{\theta}$, leading to a simplified objective $\mathcal{L}_{b}=\left\|\epsilon-\epsilon_{\theta}\left(I_{t},t\right)\right\|^{2}$.
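The closed-form forward sampling and the simplified objective above can be illustrated with a minimal NumPy sketch. The linear schedule, its endpoints, the value of `T`, the toy 8×8 mask, and the zero "prediction" standing in for the denoising network are all assumptions made for illustration; none of them is taken from the paper.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule beta_1..beta_T (an assumption)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def simple_loss(eps, eps_pred):
    """L_b = ||eps - eps_theta||^2, averaged over pixels in this sketch."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
T = 250                                  # assumed number of timesteps
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t = prod_j alpha_j

x0 = np.zeros((8, 8, 1))                 # toy ground-truth mask, H x W x 1
x0[2:6, 2:6] = 1.0
g = rng.random((8, 8, 3))                # toy guidance image, H x W x 3

x_t, eps = q_sample(x0, t=100, alpha_bar=alpha_bar, rng=rng)
I_t = np.concatenate([x_t, g], axis=-1)  # I_t = x_t || g, H x W x (3+1)

eps_pred = np.zeros_like(eps)            # placeholder for eps_theta(I_t, t)
loss = simple_loss(eps, eps_pred)
```

In a real training loop, `eps_pred` would come from the denoising network evaluated on `I_t` and `t`, and `t` would be drawn at random for each sample.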

2.2 Boundary-Aware Importance

While diffusion models have shown promising results in medical image segmentation, there is a notable limitation in how we treat all pixels of a segmentation mask equally during training. This approach can lead to saturated results, undermining the model's performance. In the case of segmentation tasks like skin lesion segmentation, it becomes evident that boundary regions carry significantly more importance than other areas. This is because the boundaries delineate the edges and contours of objects, providing crucial spatial information that aids in distinguishing between the two classes. To address this issue, we present DermoSegDiff, which effectively incorporates boundary information into the learning process and encourages the model to prioritize capturing and preserving boundary details, leading to a faster convergence rate compared to the baseline method. Our approach follows a straightforward yet highly effective strategy for controlling the learning of the denoising process. Using a novel loss function, it intensifies the significance of boundaries while gradually reducing this emphasis as we move away from the boundary region. As depicted in Figure 1, our forward process aligns with our baseline, and both denoising networks produce output $\epsilon_{\theta}$. However, the divergence between the two becomes apparent when computing the loss function. We define our loss function as follows:

$$
\mathcal{L}_{w}=\left(1+\alpha W_{\Theta}\right)\left\|\epsilon-\epsilon_{\theta}\left(x_{t},g,t\right)\right\|^{2}
$$

where $W_{\Theta}\in\mathbb{R}^{H\times W\times1}$ is a dynamic parameter intended to increase the weight of noise prediction in boundary areas while decreasing its weight as we move away from the boundaries (see Figure 5). $W_{\Theta}$ is obtained through a two-step process involving the calculation of a distance map and subsequent computation of boundary attention. Additionally, $W_{\Theta}$ is dynamically parameterized, depending on the point of time ($t$) at which the distance map is calculated. This means it functions as a variable that dynamically adjusts according to the specific characteristics of each image at time step $t$.
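As a rough illustration of how the weighting term enters the objective, the sketch below applies a per-pixel weight map to the squared noise-prediction error. It assumes the weight map is already available, averages over pixels (the reduction is not specified in the text), and uses hypothetical toy arrays; `alpha` here is the scalar weighting factor of the loss, not the schedule term $\alpha_{t}$.

```python
import numpy as np

def boundary_weighted_loss(eps, eps_pred, w, alpha=1.0):
    """L_w = (1 + alpha * W) * |eps - eps_theta|^2, averaged over pixels.

    eps, eps_pred, w: arrays of identical shape (H, W, 1).
    alpha: scalar weighting factor (its value is an assumption here).
    """
    per_pixel = (eps - eps_pred) ** 2
    return float(np.mean((1.0 + alpha * w) * per_pixel))

rng = np.random.default_rng(0)
eps = rng.standard_normal((8, 8, 1))   # true noise
eps_pred = np.zeros_like(eps)          # placeholder network output

w_zero = np.zeros((8, 8, 1))           # no boundary emphasis
w_ring = np.zeros((8, 8, 1))           # toy weight map: ring around a 4x4 lesion
w_ring[2, 2:6] = w_ring[5, 2:6] = 1.0
w_ring[2:6, 2] = w_ring[2:6, 5] = 1.0

baseline = boundary_weighted_loss(eps, eps_pred, w_zero)
weighted = boundary_weighted_loss(eps, eps_pred, w_ring, alpha=2.0)
```

With an all-zero weight map the expression reduces to the baseline objective, so the weighting only ever adds emphasis on top of the standard noise-prediction loss.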

Our distance map function operates by taking the ground truth segmentation mask as input. Initially, it identifies the border pixels by assigning a value of one to them while setting all other pixels to zero. To enhance the resolution of the resulting distance map, we extend the border points horizontally from both the left and right sides by $\lceil H\%\rceil$ pixels (e.g., for a $256\times256$ image, $\lceil 2.56\rceil=3$ pixels are added on each side, so each row would have $3+1+3=7$ boundary pixels). To obtain the distance map, we employ the distance transform function [17], which is a commonly used image processing technique for binary images. This function calculates the Euclidean distance between each non-zero (foreground) pixel in the image and the nearest zero (background) pixel. The result is a gray-level image where the intensities of points within foreground regions are modified to represent the distances to the closest boundaries from each individual point. To normalize the intensity levels of the distance map and improve its suitability as a dynamic weighting matrix $W_{\Theta}$, we employ the technique of gamma correction from image processing to calculate the boundary attention. By adjusting the gamma value, we gain control over the overall intensity of the distance map, resulting in a smoother representation that enhances its effectiveness in the loss function.
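The steps described above (border extraction, horizontal widening, distance transform, gamma correction) can be sketched as follows. This is one plausible reading of the text, not the authors' implementation: the 4-neighbour border test, the max-normalisation, the gamma value, and the brute-force distance transform (used here only to keep the example dependency-free; a library routine would normally be preferred) are all assumptions, and the dependence of the weight on the time step $t$ is not modelled.

```python
import math
import numpy as np

def half_width(H):
    """Pixels added on each side of a border pixel: ceil(H%) = ceil(H / 100)."""
    return math.ceil(H / 100)

def border_pixels(mask):
    """Foreground pixels with at least one 4-neighbour in the background."""
    m = mask.astype(bool)
    p = np.pad(m, 1, mode="constant", constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m & ~interior

def widen_horizontally(border, k):
    """Extend every border pixel by k pixels to the left and to the right."""
    out = border.copy()
    for s in range(1, k + 1):
        out[:, s:] |= border[:, :-s]   # shift right by s
        out[:, :-s] |= border[:, s:]   # shift left by s
    return out

def distance_to_background(binary):
    """Brute-force Euclidean distance from each non-zero pixel to the nearest zero pixel."""
    b = binary.astype(bool)
    fg, bg = np.argwhere(b), np.argwhere(~b)
    dist = np.zeros(b.shape, dtype=float)
    if len(fg) == 0 or len(bg) == 0:
        return dist
    d = np.sqrt(((fg[:, None, :] - bg[None, :, :]) ** 2).sum(axis=-1))
    dist[fg[:, 0], fg[:, 1]] = d.min(axis=1)
    return dist

def boundary_weight(mask, gamma=0.5):
    """Normalised, gamma-corrected distance map of the widened border band."""
    band = widen_horizontally(border_pixels(mask), half_width(mask.shape[0]))
    dist = distance_to_background(band)
    if dist.max() > 0:
        dist = dist / dist.max()       # scale to [0, 1] before gamma correction
    return dist ** gamma

mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0                 # toy square lesion
w = boundary_weight(mask)              # shape (32, 32); add a channel axis for H x W x 1
```

Under this reading the weight peaks on the boundary itself and falls to zero outside the widened band, and `half_width(256)` gives 3, matching the seven boundary pixels per row mentioned in the text.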

2.3 Network Architecture

Encoder: The overall architecture of our proposed denoising network is depicted in Figure 2. We propose a modification to the U-Net network architecture for predicting the noise $\epsilon_{\theta}$ added to a noisy segmentation mask $x_{i-1}^{enc}$, guided by the guidance image $g_{i-1}$ and time embedding $t$, where $i$ refers to the $i$-th encoder. The encoder consists of a series of stacked Encoder Modules (EM), which are subsequently followed by a convolution layer to achieve a four-by-four tensor at the output of the encoder. Instead of simply concatenating $x_{i-1}^{enc}$ and $g_{i-1}$ and feeding them into the network [30], our approach enhances the conditioning process by employing a two-path feature extraction strategy in each EM, focusing on the mutual effect that the noisy segmentation mask and the guidance image can have on each other. Each path includes two ResNet blocks (RB) and is followed by a Linear Attention (L-Att) [26], which is computationally efficient and generates non-redundant feature representation. To incorporate temporal information, time embedding is introduced into each RB. The time embedding is obtained by passing $t$ through a sinusoidal positional embedding, followed by a linear layer, a GeLU activation function, and another linear layer. We use two time embeddings, one for $g_{i-1}$ ($t_{g}$) and another for $x_{i-1}^{enc}$ ($t_{x}$), to capture the temporal aspects specific to each input. Furthermore, we leverage the knowledge captured by $RB_{1}^{x}$ by transferring and concatenating it with the guidance branch, resulting in $h_{i}$. By incorporating two paths, we capture specific representations that provide a comprehensive view of the data. The left path extracts noise-related features, while the right path focuses on semantic information. This combination enables the model to incorporate complementary and discriminative features.

After applying $RB_{2}^{g}$, we introduce a feedback mechanism that takes a convolution of the $RB_{2}^{g}$ output and connects to the $RB_{2}^{x}$ input. This feedback allows the resultant features, which contain overall information about both the guidance and noise, to be shared with the noise path. By doing so and multiplying the feature maps, we emphasize important features while attenuating less significant ones. This multiplication operation acts as a form of attention mechanism, where the shared features guide the noise path to focus on relevant and informative regions. After the linear attention of the left path and before the right path, we provide another feature concatenation of these two paths, referred to as $b_{i}$. At the end of each EM block, we obtain four outputs: $h_{i}$ and $b_{i}$, which are used for skip connections from the encoder to the decoder, and the enriched $x_{i}^{enc}$ and $g_{i}$, which are fed into the next EM block to continue the feature extraction process.
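The time-embedding pipeline described above (sinusoidal positional embedding, linear layer, GeLU, linear layer) can be sketched in NumPy as below. The embedding width, hidden size, random initialisation, base frequency of 10000, and the tanh approximation of GELU are assumptions for illustration; `TimeMLP` is a hypothetical helper name, and the rest of the Encoder Module is not modelled.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Map timesteps of shape (B,) to (B, dim) sin/cos features; dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class TimeMLP:
    """Sinusoidal embedding -> linear -> GELU -> linear (weights are random placeholders)."""

    def __init__(self, dim, hidden, rng):
        self.dim = dim
        self.w1 = rng.standard_normal((dim, hidden)) * 0.02
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal((hidden, hidden)) * 0.02
        self.b2 = np.zeros(hidden)

    def __call__(self, t):
        e = sinusoidal_embedding(t, self.dim)
        return gelu(e @ self.w1 + self.b1) @ self.w2 + self.b2

rng = np.random.default_rng(0)
t = np.array([1, 50, 250])             # a batch of timesteps

# Two separate embeddings: one for the noise path, one for the guidance path.
time_mlp_x = TimeMLP(dim=32, hidden=128, rng=rng)
time_mlp_g = TimeMLP(dim=32, hidden=128, rng=rng)
t_x = time_mlp_x(t)                    # fed to the ResNet blocks of the mask path
t_g = time_mlp_g(t)                    # fed to the ResNet blocks of the guidance path
```

Keeping two independently parameterised embeddings lets each path learn its own dependence on the timestep, which is the motivation the text gives for using $t_{x}$ and $t_{g}$.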

![](https://miner.umaxing.com/miner/v2/analysis/pdf_img?as_attachment=False&user_id=931&pdf=3eb57740cc38502a9eec0e5973dbd6860cc3eda4f0e8699c51b3133de0f132051746675125_2308.02959v1.pdf&filename=4