DermoSegDiff: A Boundary-aware Segmentation Diffusion Model for Skin Lesion Delineation
Abstract. Skin lesion segmentation plays a critical role in the early detection and accurate diagnosis of dermatological conditions. Denoising Diffusion Probabilistic Models (DDPMs) have recently gained attention for their exceptional image-generation capabilities. Building on these advancements, we propose DermoSegDiff, a novel framework for skin lesion segmentation that incorporates boundary information during the learning process. Our approach introduces a novel loss function that prioritizes the boundaries during training, gradually reducing the significance of other regions. We also introduce a novel U-Net-based denoising network that proficiently integrates noise and semantic information inside the network. Experimental results on multiple skin segmentation datasets demonstrate the superiority of DermoSegDiff over existing CNN, transformer, and diffusion-based approaches, showcasing its effectiveness and generalization in various scenarios. The implementation is publicly accessible on GitHub.
Keywords: Deep learning · Diffusion models · Skin · Segmentation.
1 Introduction
In medical image analysis, skin lesion segmentation aims at identifying skin abnormalities or lesions from dermatological images. Dermatologists traditionally rely on visual examination and manual delineation to diagnose skin lesions, including melanoma, basal cell carcinoma, squamous cell carcinoma, and other benign or malignant growths. However, accurate and rapid segmentation of these lesions plays a crucial role in early detection, treatment planning, and monitoring of disease progression. Automated medical image segmentation methods have garnered significant attention in recent years due to their potential to improve the accuracy and reliability of diagnosis. The success of these models in medical image segmentation tasks can be attributed to advances in deep learning techniques, including convolutional neural networks (CNNs) [2,23,13], implicit neural representations [21], and vision transformers [29,4].
Lately, Denoising Diffusion Probabilistic Models (DDPMs) [11] have gained considerable interest due to their remarkable performance in image generation, in particular their ability to generate high-quality and diverse samples. Building on this momentum, researchers have proposed new medical image segmentation methods that leverage diffusion models to tackle this challenging task [14]. EnsDiff [30] utilizes ground truth segmentation as training data and input images as priors to generate segmentation distributions, enabling the creation of uncertainty maps and an implicit ensemble of segmentations. Kim et al. [16] propose a novel framework for self-supervised vessel segmentation. MedSegDiff [31] introduces DPM-based medical image segmentation with dynamic conditional encoding and FF-Parser to mitigate high-frequency noise effects. MedSegDiff-V2 [32] enhances it with a conditional U-Net for improved noise-semantic feature interaction.
Boundary information has proven crucial in the segmentation of skin images, particularly when it comes to accurately localizing and distinguishing skin lesions from the surrounding healthy tissue [19,29,15]. Boundary information provides spatial relationships between different regions within the skin and holds greater significance compared to other areas. By emphasizing these regions during the training phase, we can achieve more accurate results by encouraging the model to focus on boundary regions while reducing the impact of other areas. However, most diffusion-based segmentation methods overlook this and assign equal importance to all regions. Another critical consideration is the choice of a denoising architecture, which directly impacts the model’s capacity to learn complex data relationships. Most methods have followed a baseline approach [11,22], neglecting the fact that semantic and noise interaction can be incorporated within the network more effectively.
To address these shortcomings, we propose a novel and straightforward framework called DermoSegDiff. Our approach tackles the above-mentioned issues by considering the importance of boundary information during training and presenting a novel denoising network that facilitates a more effective understanding of the relationship between noise and semantic information. Specifically, we propose a novel loss function to prioritize the distinguishing boundaries in the segmentation. By incorporating a dynamic parameter into the loss function, we increase the emphasis on boundary regions while gradually diminishing the significance of other regions as we move further away from the boundaries. Moreover, we present a novel U-Net-based denoising network structure that enhances the integration of guidance throughout the denoising process by incorporating a carefully designed dual-path encoder. This encoder effectively combines noise and semantic information, extracting complementary and discriminative features. Our model also has a unique bottleneck incorporating linear attention [26] and original self-attention [10] in parallel. Finally, the decoder receives the bottleneck output, combined with the two outputs transferred from the encoder, and utilizes this information to estimate the amount of noise. Our experimental results demonstrate the superiority of our proposed method compared to CNN, transformer, and diffusion-based state-of-the-art (SOTA) approaches on the ISIC 2018 [9], $\mathrm{PH^{2}}$ [20], and HAM10000 [27] skin segmentation datasets, showcasing the effectiveness and generalization of our method in various scenarios. Contributions of this paper are as follows:

1. We highlight the importance of incorporating boundary information in skin lesion segmentation by introducing a novel loss function that encourages the model to prioritize boundary areas.
2. We present a novel denoising network that significantly improves noise reduction and enhances semantic interaction, demonstrating faster convergence compared to the baseline model on the different skin lesion datasets.
3. Our approach surpasses state-of-the-art methods, including CNNs, transformers, and diffusion-based techniques, across three diverse skin segmentation datasets.

Fig. 1: (a) illustrates the architecture of the baseline, and (b) presents our proposed DermoSegDiff framework.
2 Method
Figure 1 provides an overview of our baseline DDPM model and presents our proposed DermoSegDiff framework for skin lesion segmentation. While traditional diffusion-based medical image segmentation methods focus on denoising the noisy segmentation mask conditioned on the input image, we propose that incorporating boundary information during the learning process can significantly improve performance. By leveraging edge information to distinguish overlapped objects, we aim to address the challenges posed by fuzzy boundaries in difficult cases and cases where lesions and backgrounds have similar colors. We begin by presenting our baseline method. Subsequently, we delve into how the inclusion of boundary information can enhance skin lesion segmentation and propose a novel approach to incorporate this information into the learning process. Finally, we introduce our network structure, which integrates guidance into the denoising process more effectively.
2.1 Baseline
The core architecture employed in this paper is based on DDPMs [11,30] (see Figure 1a). Diffusion models primarily utilize $T$ timesteps to learn the underlying distribution of the training data, denoted as $q(x_{0})$, by performing variational inference on a Markovian process. The framework consists of two processes: forward and reverse. During the forward process, the model starts with the ground truth segmentation mask ($x_{0}\in\mathbb{R}^{H\times W\times1}$) and adds Gaussian noise in successive steps, gradually transforming it into a noisy mask:
$$
q\left(x_{t}\mid x_{t-1}\right)=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbf{I}\right),\quad\forall t\in\{1,\ldots,T\},
$$
in which $\beta_{1},\dots,\beta_{T}$ represent the variance schedule across diffusion steps. We can then sample the noisy mask at an arbitrary step directly, conditioned on the ground truth segmentation, as follows:
$$
q\left(x_{t}\mid x_{0}\right)=\mathcal{N}\left(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},\left(1-\bar{\alpha}_{t}\right)\mathbf{I}\right)
$$
$$
x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,
$$
where $\alpha_{t}:=1-\beta_{t}$, $\bar{\alpha}_{t}:=\prod_{j=1}^{t}\alpha_{j}$, and $\epsilon\sim\mathcal{N}(0,\mathbf{I})$. In the reverse process, the objective is to reconstruct the original structure of the mask perturbed during the diffusion process, given the input image as guidance ($g\in\mathbb{R}^{H\times W\times3}$), by leveraging a neural network to learn the underlying process. To achieve this, we concatenate $x_{t}$ and $g$, and denote the concatenated output as $I_{t}:=x_{t}\parallel g$, where $I_{t}\in\mathbb{R}^{H\times W\times(3+1)}$. Hence, the reverse process is defined as
$$
p_{\theta}\left(x_{t-1}\mid x_{t}\right)=\mathcal{N}\left(x_{t-1};\mu_{\theta}\left(I_{t},t\right),\varSigma_{\theta}\left(I_{t},t\right)\right),
$$
where, following Ho et al. [11], instead of directly predicting $\mu_{\theta}$ with the neural network, we train a model to predict the added noise, $\epsilon_{\theta}$, leading to the simplified objective $\mathcal{L}_{b}=\left\|\epsilon-\epsilon_{\theta}\left(I_{t},t\right)\right\|^{2}$.
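The forward process and the simplified objective can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the schedule values are taken from the experimental setup in Section 3, the mask is assumed to be scaled to $\{-1, 1\}$, and the squared error is averaged over pixels here.

```python
import numpy as np

def linear_beta_schedule(T=250, beta_start=0.0004, beta_end=0.08):
    """Linearly increasing variances beta_1..beta_T (values from Section 3)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def simple_loss(eps, eps_pred):
    """L_b: squared error between true and predicted noise (mean over pixels)."""
    return float(np.mean((eps - eps_pred) ** 2))

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)          # \bar{alpha}_t = prod_j alpha_j

rng = np.random.default_rng(0)
# Assumption: a random binary mask rescaled to {-1, 1} stands in for x_0.
x0 = rng.integers(0, 2, size=(128, 128, 1)).astype(np.float64) * 2.0 - 1.0
g = rng.random((128, 128, 3))                # stand-in guidance image
x_t, eps = q_sample(x0, t=100, alpha_bar=alpha_bar, rng=rng)
I_t = np.concatenate([x_t, g], axis=-1)      # I_t = x_t || g, shape (H, W, 3 + 1)
```

In a training loop, a denoising network would receive `I_t` and `t`, and its output would be compared against `eps` with `simple_loss`.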
2.2 Boundary-Aware Importance
While diffusion models have shown promising results in medical image segmentation, treating all pixels of a segmentation mask equally during training is a notable limitation. This approach can lead to saturated results, undermining the model’s performance. In segmentation tasks such as skin lesion segmentation, boundary regions carry significantly more importance than other areas, because the boundaries delineate the edges and contours of objects, providing crucial spatial information that aids in distinguishing between the two classes. To address this issue, we present DermoSegDiff, which effectively incorporates boundary information into the learning process and encourages the model to prioritize capturing and preserving boundary details, leading to a faster convergence rate compared to the baseline method. Our approach follows a straightforward yet highly effective strategy for controlling the learning denoising process. Using a novel loss function, it intensifies the significance of boundaries while gradually reducing this emphasis as we move away from the boundary region. As depicted in Figure 1, our forward process aligns with our baseline, and both denoising networks produce the output $\epsilon_{\theta}$. However, the divergence between the two becomes apparent when computing the loss function. We define our loss function as follows:
$$
\mathcal{L}_{w}=\left(1+\alpha W_{\Theta}\right)\left\|\epsilon-\epsilon_{\theta}\left(x_{t},g,t\right)\right\|^{2}
$$
where $W_{\Theta}\in\mathbb{R}^{H\times W\times1}$ is a dynamic parameter intended to increase the weight of noise prediction in boundary areas while decreasing its weight as we move away from the boundaries (see Figure 5). $W_{\Theta}$ is obtained through a two-step process involving the calculation of a distance map and subsequent computation of boundary attention. Additionally, $W_{\Theta}$ is dynamically parameterized, depending on the time step $t$ at which the distance map is calculated. That is, it functions as a variable that adjusts to the specific characteristics of each image at time step $t$.
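As a rough illustration of the weighted objective, the sketch below applies a per-pixel weight map to the squared noise error. The function name is hypothetical, the default $\alpha=0.2$ is the value reported in Section 3, and averaging over pixels is an assumption rather than something the text specifies.

```python
import numpy as np

def boundary_weighted_loss(eps, eps_pred, w, alpha=0.2):
    """L_w = (1 + alpha * W) * |eps - eps_pred|^2, averaged over pixels.

    eps, eps_pred, w: arrays of identical shape (H, W, 1).
    """
    per_pixel = (eps - eps_pred) ** 2
    return float(np.mean((1.0 + alpha * w) * per_pixel))

rng = np.random.default_rng(0)
eps = rng.standard_normal((128, 128, 1))
eps_pred = rng.standard_normal((128, 128, 1))   # stand-in network output

plain = float(np.mean((eps - eps_pred) ** 2))
no_weight = boundary_weighted_loss(eps, eps_pred, np.zeros_like(eps))
full_weight = boundary_weighted_loss(eps, eps_pred, np.ones_like(eps))
```

With $W_{\Theta}=0$ everywhere the expression reduces to the baseline loss, and with $W_{\Theta}=1$ everywhere every pixel is scaled by $1+\alpha$; a boundary-focused map sits between these two extremes.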
Our distance map function takes the ground truth segmentation mask as input. Initially, it identifies the border pixels by assigning a value of one to them while setting all other pixels to zero. To enhance the resolution of the resulting distance map, we extend the border points horizontally from both the left and right sides by $\lceil H\%\rceil$ pixels, that is, one percent of the image height rounded up (e.g., for a $256\times256$ image this is three pixels per side, so each border point spans seven boundary pixels in its row). To obtain the distance map, we employ the distance transform function [17], which is a commonly used image processing technique for binary images. This function calculates the Euclidean distance between each non-zero (foreground) pixel in the image and the nearest zero (background) pixel. The result is a gray-level image where the intensities of points within foreground regions are modified to represent the distances to the closest boundaries from each individual point. To normalize the intensity levels of the distance map and improve its suitability as a dynamic weighting matrix $W_{\Theta}$, we employ the technique of gamma correction from image processing to calculate the boundary attention. By adjusting the gamma value, we gain control over the overall intensity of the distance map, resulting in a smoother representation that enhances its effectiveness in the loss function.
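The steps above can be sketched in NumPy as follows. This is a simplified illustration under several assumptions, not the authors' code: all function names are hypothetical, the distance to the nearest band pixel is computed by brute force (a library distance transform such as `scipy.ndimage.distance_transform_edt` applied to the inverted band would give the same map), the map is inverted so that the boundary receives weight 1, gamma correction is taken to be a simple power law, and the dependence of $W_{\Theta}$ on the time step $t$ is left out.

```python
import math
import numpy as np

def border_pixels(mask):
    """Foreground pixels with at least one 4-connected background neighbour.
    Pixels outside the image are treated as background."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    all_fg = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m & ~all_fg

def extend_horizontally(border, k):
    """Widen every border pixel by k pixels to the left and to the right."""
    border = border.astype(bool)
    w = border.shape[1]
    p = np.pad(border, ((0, 0), (k, k)), constant_values=False)
    out = np.zeros_like(border)
    for s in range(2 * k + 1):
        out |= p[:, s:s + w]
    return out

def boundary_attention(mask, gamma=1.0):
    """Weight map in [0, 1]: 1 on the boundary band, decaying with distance."""
    h, w = mask.shape
    k = math.ceil(h / 100)                       # ceil(H%) pixels per side
    band = extend_horizontally(border_pixels(mask), k)
    ys, xs = np.nonzero(band)
    if ys.size == 0:                             # no lesion boundary present
        return np.zeros((h, w))
    yy, xx = np.mgrid[0:h, 0:w]
    d2 = (yy[..., None] - ys) ** 2 + (xx[..., None] - xs) ** 2
    dist = np.sqrt(d2.min(axis=-1))              # distance to nearest band pixel
    closeness = 1.0 - dist / dist.max() if dist.max() > 0 else np.ones((h, w))
    return closeness ** gamma                    # assumed power-law gamma correction

mask = np.zeros((32, 32), dtype=np.uint8)
mask[10:22, 10:22] = 1                           # toy square "lesion"
w_map = boundary_attention(mask, gamma=2.0)
```

Larger `gamma` values concentrate the weight more tightly around the boundary band, while values below 1 spread it further into the interior and background.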
2.3 Network Architecture
Encoder: The overall architecture of our proposed denoising network is depicted in Figure 2. We propose a modification to the U-Net architecture for predicting the noise $\epsilon_{\theta}$ added to a noisy segmentation mask $x_{i-1}^{enc}$, guided by the guidance image $g_{i-1}$ and time embedding $t$, where $i$ refers to the $i$-th encoder module. The encoder consists of a series of stacked Encoder Modules (EM), followed by a convolution layer that produces a four-by-four tensor at the output of the encoder. Instead of simply concatenating $x_{i-1}^{enc}$ and $g_{i-1}$ and feeding them into the network [30], our approach enhances the conditioning process by employing a two-path feature extraction strategy in each EM, focusing on the mutual effect that the noisy segmentation mask and the guidance image can have on each other. Each path includes two ResNet blocks (RB) and is followed by a Linear Attention (L-Att) [26], which is computationally efficient and generates non-redundant feature representations. To incorporate temporal information, a time embedding is introduced into each RB. The time embedding is obtained by passing $t$ through a sinusoidal positional embedding, followed by a linear layer, a GeLU activation function, and another linear layer. We use two time embeddings, one for $g_{i-1}$ ($t_{g}$) and another for $x_{i-1}^{enc}$ ($t_{x}$), to capture the temporal aspects specific to each input. Furthermore, we leverage the knowledge captured by $RB_{1}^{x}$ by transferring and concatenating it with the guidance branch, resulting in $h_{i}$. By incorporating two paths, we capture specific representations that provide a comprehensive view of the data. The left path extracts noise-related features, while the right path focuses on semantic information. This combination enables the model to incorporate complementary and discriminative features.
After applying $RB_{2}^{g}$, we introduce a feedback mechanism that takes a convolution of the $RB_{2}^{g}$ output and connects it to the $RB_{2}^{x}$ input. This feedback allows the resultant features, which contain overall information about both the guidance and noise, to be shared with the noise path. By doing so and multiplying the feature maps, we emphasize important features while attenuating less significant ones. This multiplication operation acts as a form of attention mechanism, where the shared features guide the noise path to focus on relevant and informative regions. After the linear attention of the left path and before the right path, we provide another feature concatenation of these two paths, referred to as $b_{i}$. At the end of each EM block, we obtain four outputs: $h_{i}$ and $b_{i}$, which are used for skip connections from the encoder to the decoder, and the enriched $x_{i}^{enc}$ and $g_{i}$, which are fed into the next EM block to continue the feature extraction process.
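The time-embedding pipeline described above (sinusoidal embedding, linear layer, GeLU, linear layer) can be sketched in NumPy. The embedding width, hidden width, frequency base of 10000, random initialization, and the tanh approximation of GeLU are all assumptions for illustration; the class and function names are hypothetical.

```python
import math
import numpy as np

def sinusoidal_embedding(t, dim):
    """Map a scalar time step to [sin(t * f_0..), cos(t * f_0..)] of length dim."""
    half = dim // 2
    freqs = np.exp(-math.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def gelu(x):
    """tanh approximation of the GeLU activation."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

class TimeMLP:
    """Sinusoidal embedding -> linear -> GeLU -> linear."""
    def __init__(self, dim, hidden, rng):
        self.dim = dim
        self.w1 = rng.standard_normal((dim, hidden)) / math.sqrt(dim)
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal((hidden, hidden)) / math.sqrt(hidden)
        self.b2 = np.zeros(hidden)

    def __call__(self, t):
        e = sinusoidal_embedding(t, self.dim)
        return gelu(e @ self.w1 + self.b1) @ self.w2 + self.b2

rng = np.random.default_rng(0)
embed_x = TimeMLP(dim=64, hidden=256, rng=rng)   # produces t_x for the noise path
embed_g = TimeMLP(dim=64, hidden=256, rng=rng)   # produces t_g for the guidance path
t_x, t_g = embed_x(100), embed_g(100)
```

Keeping two separately parameterized embedders means the same time step `t` yields different vectors for the two paths, which is one way to read "temporal aspects specific to each input".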

Fig. 2: The overview of the proposed denoising network architecture. The notation L-Att, RB, EM, DM, LS-Att, and S-Att correspond to the Linear Attention, ResNet Block, Encoder Modules, Decoder Modules, Linear Self-Attention, and Self-Attention modules, respectively.
Bottleneck: Next, we concatenate the outputs, $x_{L}^{enc}$ and $g_{L}$, from the last EM block and pass them alongside the time embedding $t_{x}$ through a Bottleneck Module (BM), which contains a ResNet block, a Linear Self-Attention (LS-Att), and another ResNet block. LS-Att is a dual attention module that combines original Self-Attention (S-Att) for spatial relationships and L-Att for capturing semantic context in parallel, enhancing the overall feature representation. The output of BM is then fed into the decoder.
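A parallel combination of the two attention types might look like the NumPy sketch below. It is only an illustration: the linear branch uses one common linear-complexity formulation (softmax-normalized queries and keys, with the key-value context computed first), the two branches share projections and are fused by summation, and all names are hypothetical. The text does not specify these details.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Scaled dot-product attention; cost grows quadratically with tokens n."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def linear_attention(q, k, v):
    """Builds a (d x d_v) context first, so cost grows linearly with n."""
    q = softmax(q, axis=-1)      # normalize each query over features
    k = softmax(k, axis=0)       # normalize each feature over tokens
    context = k.T @ v
    return q @ context

def ls_attention(x, wq, wk, wv):
    """Assumed fusion: run both branches in parallel and sum their outputs."""
    q, k, v = x @ wq, x @ wk, x @ wv
    return self_attention(q, k, v) + linear_attention(q, k, v)

rng = np.random.default_rng(0)
n, d = 16, 8                     # 16 tokens, e.g. a flattened 4 x 4 feature map
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = ls_attention(x, wq, wk, wv)
```

At the bottleneck the token count is small, so the quadratic branch is cheap there, which is consistent with using full self-attention only at this stage.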
Decoder: The decoder consists of stacked Decoder Modules (DM) followed by a convolutional block that outputs $\epsilon_{\theta}$. The number of stacked DMs is the same as the number of EMs in the encoder. Unlike the EM blocks, which are dual-path modules, each DM block is a single-path module. It includes two consecutive RB blocks and one L-Att module. $b_{i}$ and $h_{i}$ from the encoder are concatenated with the feature map before and after applying $RB_{1}^{d}$, respectively. By incorporating these features, the decoder gains access to refined information from the encoder, thereby aiding in better estimating the amount of noise added during the forward process and recovering missing information during the learning process. In addition, to preserve the impact of noise during the decoding process, we implement an additional skip connection from $x$ to the final layer of the decoder. This involves concatenating the resulting feature map of $DM_{1}$ with $x$ and passing them together through the last convolutional block to output the estimated noise $\epsilon_{\theta}$.
3 Results
The proposed method is implemented using the PyTorch library (version 1.14.0) and trained on a single NVIDIA A100 GPU (80 GB VRAM). Training uses a batch size of 32 and the Adam optimizer with a base learning rate of 0.0001. The learning rate is halved if the loss does not improve for ten epochs. In all experiments, we set $T$ to 250 and fix the forward process variances to constants that increase linearly from $\beta_{start}=0.0004$ to $\beta_{end}=0.08$. During training, data augmentation is applied using Albumentations [5], including spatial augmentations such as Affine and Flip transforms and Coarse Dropout, as well as pixel augmentations such as GaussNoise and RGBShift transforms. For each dataset, the network is trained for 40000 iterations. We set $\alpha$ empirically to 0.2. Training takes approximately 1.35 seconds per sample. In our evaluation, we sample nine different segmentation masks for each image in the test set. To obtain the final segmentation result, we average these masks and apply a threshold of 0. The reported performance metrics are based on this ensemble strategy.
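The ensemble step at the end of the evaluation procedure can be sketched as follows. The function name is hypothetical, and the sampled masks are replaced by noisy stand-ins; the sketch assumes masks are represented in a range centered on zero (for example $[-1, 1]$), which is what makes 0 a natural threshold.

```python
import numpy as np

def ensemble_mask(samples, threshold=0.0):
    """Average the sampled masks pixel-wise and binarize the mean."""
    mean = np.mean(np.stack(samples, axis=0), axis=0)
    return (mean > threshold).astype(np.uint8)

rng = np.random.default_rng(0)
# Stand-ins for nine reverse-process samples: a clean mask in {-1, 1}
# perturbed by small Gaussian noise (an assumption for illustration only).
clean = -np.ones((128, 128))
clean[32:96, 32:96] = 1.0
samples = [clean + 0.3 * rng.standard_normal(clean.shape) for _ in range(9)]
final = ensemble_mask(samples)
```

The pixel-wise spread across the nine samples could also serve as a simple uncertainty map, in the spirit of the ensemble-based approaches cited in the introduction.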
Table 1: Performance comparison of the proposed method against the SOTA approaches on skin lesion segmentation benchmarks. Blue indicates the best result, and red displays the second-best.
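| Methods | ISIC 2018 DSC | ISIC 2018 SE | ISIC 2018 SP | ISIC 2018 ACC | $\mathrm{PH^{2}}$ DSC | $\mathrm{PH^{2}}$ SE | $\mathrm{PH^{2}}$ SP | $\mathrm{PH^{2}}$ ACC | HAM10000 DSC | HAM10000 SE | HAM10000 SP | HAM10000 ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [23] | 0.8545 | 0.8800 | 0.9697 | 0.9404 | 0.8936 | 0.9125 | 0.9588 | 0.9233 | 0.8807 | 0.9072 | 0.9588 | 0.9324 |
| DAGAN [18] | 0.9201 | 0.8320 | 0.9640 | 0.9425 | 0.8499 | 0.8578 | 0.9653 | 0.9452 | 0.8840 | 0.9063 | 0.9427 | 0.9200 |
| TransUNet [7] | 0.8946 | 0.9056 | 0.9798 | 0.9645 | 0.9449 | 0.9410 | 0.9564 | 0.9678 | 0.8820 | 0.8560 | 0.9770 | 0.9510 |
| Swin-Unet [6] | 0.9202 | 0.8818 | 0.9832 | 0.9503 | - | - | - | - | - | - | - | - |
| DeepLabv3+ [8] | - | - | - | - | - | - | - | - | - | - | - | - |
| Att-UNet [24] | - | - | - | - | - | - | - | - | - | - | - | - |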

Fig. 3: Visual comparisons of different methods on the ISIC 2018 skin lesion dataset. Ground truth boundaries are shown in green, and predicted boundaries are shown in blue.
3.1 Datasets
To evaluate the proposed methodology, three publicly available skin lesion segmentation datasets, ISIC 2018 [9], $\mathrm{PH^{2}}$ [20], and HAM10000 [27], are utilized. The same pre-processing criteria described in [3] are used to train and evaluate these datasets. The HAM10000 dataset is also a subset of the ISIC archive containing 10015 dermoscopy images along with their corresponding segmentation masks. 7200 images are used for training, 1800 for validation, and 1015 for testing. Each sample of all datasets is downsized to $128\times128$ pixels using the same pre-processing as [1].
3.2 Quantitative and qualitative results
Table 1 presents the performance analysis of our proposed DermoSegDiff on all three skin lesion segmentation datasets. The evaluation incorporates several metrics, including Dice Score (DSC), Sensitivity (SE), Specificity (SP), and Accuracy (ACC), to establish comprehensive evaluation criteria. In our notation, the model with the baseline loss function is referred to as DermoSegDiff-A, while the model with the proposed loss function is denoted as DermoSegDiff-B. The results demonstrate that DermoSegDiff-B surpasses both CNN and Transformer-based approaches, showcasing its superior performance and generalization capabilities across diverse datasets. Specifically, our main approach demonstrates superior performance compared to pure transformer-based methods such as Swin-Unet [6], CNN-based methods like DeepLabv3+ [8], and hybrid methods like UCTransNet [28]. Moreover, DermoSegDiff-B exhibits enhanced performance compared to the baseline model (EnsDiff) [30], achieving an increase of $+2.18\%$, $+3.83\%$, and $+1.65\%$ in DSC score on the ISIC 2018, $\mathrm{PH^{2}}$, and HAM10000 datasets, respectively. Furthermore, in Figure 3, we visually compare the outcomes generated by various skin lesion segmentation models. The results clearly illustrate that our proposed approach excels in capturing intricate structures and producing more accurate boundaries compared to its counterparts. This visual evidence underscores the superior performance achieved by carefully integrating boundary information into the learning process.
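For reference, the four reported metrics can be computed from a binary prediction and ground truth mask using their standard confusion-matrix definitions. The sketch below is a generic illustration with a hypothetical function name; returning 0 when a denominator is empty is a convention chosen here, not something the text specifies.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Return DSC, SE, SP and ACC for two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(pred & target))      # lesion predicted as lesion
    tn = int(np.sum(~pred & ~target))    # background predicted as background
    fp = int(np.sum(pred & ~target))     # background predicted as lesion
    fn = int(np.sum(~pred & target))     # lesion predicted as background

    def ratio(num, den):
        return num / den if den > 0 else 0.0

    return {
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),
        "SE": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
    }

pred = np.array([[1, 1, 0, 0]])
target = np.array([[1, 0, 1, 0]])
scores = segmentation_metrics(pred, target)
```

In this toy case there is exactly one true positive, one true negative, one false positive, and one false negative, so all four metrics evaluate to 0.5.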

Fig. 4: An illustration of how our proposed loss function concentrates on the segmentation boundary in contrast to the conventional $\mathcal{L}_{b}$ loss in DermoSegDiff-A. The heatmaps are obtained from $EM_{3}$ using GradCAM [25]. Notably, DSD is an abbreviation of DermoSegDiff.
4 Ablation studies
Figure 4 illustrates the effects of our proposed loss function. The heatmaps are produced using GradCAM [25], which visually represents the gradients of the output originating from $EM_{3}$. Incorporating the novel loss function shifts the emphasis towards the boundary region, leading to a $0.51\%$ improvement over DermoSegDiff-A in the overall DSC score on the ISIC 2018 dataset. The analysis reveals a distinct behavior within our model. In the noise path, the model primarily emphasizes local boundary information, while in the guidance branch, it aims to capture more global information. This knowledge is then transferred through feedback to the noise branch, providing complementary information. This combination of local and global information allows our model to effectively leverage both aspects and achieve improved results.
Figure 5 depicts the evolution of $W_{\Theta}$ with respect to $T$. In the initial stages of the denoising process, when the effect of noise is significant, the changes in the boundary area are relatively smooth. During this phase, the model focuses on capturing more global information about the image. As the denoising process progresses and it becomes easier to distinguish between the foreground and background in the resulting image, the weight shifts, placing increased emphasis on the boundary region while disregarding the regions that are further away from it. Additionally, as we approach $x_{0}$, the emphasis on the boundary information becomes more pronounced. These observations highlight the adaptive nature of $W_{\Theta}$ and its role in effectively preserving boundary details during the denoising process.

[Figure 5 panels: Guidance Image, Ground Truth, Distance Map, $W_{\Theta}$ at 0.25T, $W_{\Theta}$ at 0.50T, $W_{\Theta}$ at 0.75T]


Fig. 5: An illustration of how the $W_{\Theta}$ variable varies depending on the network’s current diffusion time step.

Fig. 6: (a) illustrates the limitation imposed by annotation of the dataset, and (b) presents some of the limitations of our proposed model. Ground truth boundaries are shown in green, and predicted boundaries are shown in blue.
5 Limitations
Despite these promising results, there are also some limitations. For example, some annotations within the datasets may not be entirely precise. Figure 6a portrays certain inconsistencies in the annotations of data. However, despite these annotation challenges, our proposed method demonstrates superior precision in the segmentation of skin lesions in comparison to the annotators. The results indicate that with more meticulous annotation of the masks, our proposed approach could have achieved even higher scores across all evaluation metrics. It is worth noting that there were instances where our model deviated from the accurate annotation and erroneously partitioned the area. Figure 6b depicts instances where our proposed methodology fails to segment the skin lesion accurately. The difficulty in accurately demarcating the boundary between the foreground and background in skin images arises from the high similarity between these regions, which we aim to address in future work.
6 Conclusion
This paper introduced the DermoSegDiff diffusion network for skin lesion segmentation. Our approach introduced a novel loss function that emphasizes the importance of the segmentation’s boundary region and assigns it higher weight during training. Further, we proposed a denoising network that effectively models the noise-semantic information and results in performance improvement.
