Estimating Uncertainty in Neural Networks for Cardiac MRI Segmentation: A Benchmark Study

Introduction

Cardiac magnetic resonance imaging (MRI) is the gold standard for evaluating cardiac function due to its excellent soft tissue contrast, high spatial and temporal resolution, and non-ionizing radiation. Segmentation of cardiac structures such as the left ventricle cavity, left ventricle myocardium, and right ventricle cavity is required as a first step to quantify clinically relevant imaging biomarkers such as the left ventricle ejection fraction and myocardial mass. Recently, convolutional neural networks (CNNs) have been shown to perform well for automatic cardiac MR image segmentation. However, when using a CNN in an automated image analysis pipeline, it is important to know which segmentation results are problematic and require further manual inspection. This may improve workflow efficiency by focusing on problematic segmentations, avoiding the review of all images and reducing errors in downstream analysis.

This problem is closely related to the task of anomaly detection or out-of-distribution detection. One approach to solve it is to use predictive uncertainty estimates of a segmentation model. The key idea is that segmentation outputs with low uncertainty are likely correct while outputs with high uncertainty are likely problematic. While several studies have attempted to estimate uncertainty in CNNs for medical image segmentation, most of these studies used Monte Carlo (MC) Dropout or Deep Ensembles. However, there are some limitations associated with these methods, which motivates exploration of other algorithms for estimating uncertainty. For example, when using a fixed dropout rate in MC Dropout, the model uncertainty does not decrease as more training data is used. This is potentially problematic since model uncertainty should approach zero in the limit of infinite data. As a result, the dropout rate needs to be tuned depending on the model size and amount of training data. For Deep Ensembles, it is not clear why this method generates well-calibrated uncertainty estimates. In addition, in previous studies, evaluation of these algorithms was mostly limited to correlations between the predictive uncertainty and segmentation accuracy, or to metrics measuring how uncertainty can be used to improve segmentation.

In this work, we compared Bayesian and non-Bayesian approaches and performed a systematic evaluation of epistemic uncertainty for these methods. In particular, we evaluated Bayes by Backprop (BBB), MC Dropout, and Deep Ensembles based on segmentation accuracy, probability calibration, and uncertainty on out-of-distribution datasets, and finally demonstrated the utility of these methods for segmentation quality control.

Uncertainty in Neural Networks

Uncertainty in neural networks can be estimated using Bayesian and non-Bayesian methods. Bayesian neural networks (BNNs) provide a theoretical framework for generating well-calibrated uncertainty estimates. In BNNs, we are interested in learning the posterior distribution of the neural network weights instead of a maximum likelihood or maximum-a-posteriori estimate. A challenge in learning BNNs is that integrating over the posterior is intractable in high dimensional space. As such, approximate inference techniques such as stochastic variational inference are commonly used. Examples include variational dropout, MC Dropout, Bayes by Backprop, multiplicative normalizing flows, and Flipout. Non-Bayesian methods can also be used to estimate uncertainty in neural networks. Examples include bootstrapping, Deep Ensembles, Temperature Scaling, and Resampling Uncertainty Estimation. Note that in Bayesian methods, uncertainty is learned during training and is tightly coupled to the model structure. In non-Bayesian methods, uncertainty is learned during training or estimated after training.

In this work, we focused on modelling epistemic uncertainty using Bayesian and non-Bayesian methods. Epistemic or model-based uncertainty is uncertainty related to model parameters due to the use of a finite amount of training data. For humans, epistemic uncertainty corresponds to uncertainty from inexperienced or non-expert observers. In contrast, aleatoric or data-dependent uncertainty is related to the data itself and corresponds to inter- and intra-observer variability. Epistemic uncertainty is more applicable for segmentation quality control since segmentation outputs with high aleatoric uncertainty may be considered acceptable from a quality control perspective as long as the output matches the expectations from one or more experienced observers.

Bayesian neural networks have been used for medical image segmentation tasks; however, most work used MC Dropout to approximate the posterior distribution of the weights, and investigations of ways to evaluate the quality of uncertainty have been limited. Several studies used a CNN with MC Dropout for brain structure, brain tumour, and brain tumour cavity segmentation. These studies observed positive correlations between segmentation accuracy and uncertainty measures. Another study compared different uncertainty measures in brain lesion segmentation and showed that uncertainty measures can be used to improve lesion detection accuracy. MC Dropout has also been applied to a CNN for cardiac MRI segmentation, where training using a Brier loss or cross-entropy loss produced well-calibrated pixel-wise uncertainties, and correcting uncertain pixels consistently improved segmentation results. In skin lesion segmentation, MC Dropout and non-Bayesian methods were used to generate segmentations and segmentation uncertainty maps, which were then entered into another neural network to predict the Jaccard index of the segmentation. For aortic MRI segmentation, segmentation quality estimates based on an ensemble of neural networks were proposed, and improved segmentation accuracy was demonstrated with the use of these estimates.

More recently, MC Dropout, Deep Ensembles, and auxiliary networks were compared for predicting pixel-wise segmentation errors in two medical image segmentation tasks. In addition to segmentation probability calibration, that study examined the overlap between segmentation uncertainty and errors, and the fraction of images which would benefit from uncertainty-guided segmentation correction. A follow-up work compared different aggregation methods for uncertainty measures and their performance for segmentation failure detection. Similarly, MC Dropout and Deep Ensembles were compared for CNNs trained with different loss functions in terms of probability calibration and correlation between segmentation accuracy and uncertainty measures.

However, these studies did not evaluate other Bayesian methods such as Bayes by Backprop for uncertainty estimation, and the performance of these methods on out-of-distribution datasets is unknown. In this work, we evaluated and compared several commonly used uncertainty estimation methods on datasets with various degrees of distortion, which mimic clinical scenarios. We analyzed the strengths and limitations of these methods on these datasets and, finally, demonstrated the utility of these algorithms for segmentation quality control.

Contributions

While MC Dropout and Deep Ensembles have been used to estimate uncertainty in medical image segmentation, limited studies have investigated other Bayesian methods and a comparison of these methods is currently lacking. To this end, we performed a systematic study of Bayesian and non-Bayesian neural networks for estimating uncertainty in the context of cardiac MRI segmentation. Our contributions are listed as follows: We hope this work will serve as a benchmark for evaluating uncertainty in cardiac MRI segmentation and inspire further work on uncertainty estimation in medical image segmentation.

Methods

Bayesian Neural Networks (BNN)

Given a dataset of $N$ images $X = \{x_i\}_{i=1\ldots N}$ and the corresponding segmentations $Y = \{y_i\}_{i=1\ldots N}$ with $C$ classes, we fit a neural network parameterized by weights $w$ to perform segmentation. In BNNs, we are interested in learning the posterior distribution of the weights, $p(w | X, Y)$, instead of a maximum likelihood or maximum-a-posteriori estimate. This posterior distribution represents uncertainty in the weights, which can be propagated to calculate uncertainty in the predictions. In addition, BNNs have been shown to improve the generalizability of neural networks. A challenge in learning BNNs is that calculating the posterior is intractable due to its high dimensionality. Variational inference is a scalable technique that aims to learn an approximate posterior distribution of the weights $q(w)$ by minimizing the Kullback-Leibler (KL) divergence between the approximate and true posterior. This is equivalent to maximizing the evidence lower bound (ELBO) as follows:

$$q(w) = \operatorname*{arg\,max}_{q(w)} \left\{ \mathbb{E}_{q(w)}[\log p(Y | X, w)] - \lambda \cdot \textrm{KL}[q(w) \,\|\, p(w)] \right\},$$

where $\mathbb{E}_{q(w)}[\cdot]$ denotes expectation over the approximate posterior $q(w)$, $\log p(Y | X, w)$ is the log-likelihood of the training data given weights $w$, $p(w)$ represents the prior distribution of $w$, and $\textrm{KL}[\cdot]$ is the Kullback-Leibler divergence between two probability distributions, weighted by a hyperparameter $\lambda > 0$ to achieve better performance. State-of-the-art segmentation neural networks such as the U-net formulate image segmentation as pixelwise classification. For each pixel $x_{i,j}$ in image $x_i$, $i = 1, \ldots, N$, $j \in \Omega$, the neural network generates a prediction $\hat{y}_{i,j}$ with probability $p(\hat{y}_{i,j} = c)$, $c = 0, 1, \dots, C-1$, through softmax activation of the features in the last layer.
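
As a rough illustration of the objective above, the following sketch evaluates the negative ELBO for a handful of weights. It assumes a mean-field Gaussian approximate posterior and a zero-mean Gaussian prior, for which the KL term has a closed form; the function names, the toy values, and the single Monte Carlo estimate of the log-likelihood are illustrative assumptions rather than the configuration used in this study.

```python
import math


def kl_gaussian(mu_q, sigma_q, sigma_prior):
    """Closed-form KL[N(mu_q, sigma_q^2) || N(0, sigma_prior^2)] for one weight.

    Assumption: factorized Gaussian posterior and zero-mean Gaussian prior.
    """
    return (math.log(sigma_prior / sigma_q)
            + (sigma_q ** 2 + mu_q ** 2) / (2.0 * sigma_prior ** 2)
            - 0.5)


def negative_elbo(log_likelihood, mus, sigmas, sigma_prior, lam):
    """Loss to minimize: -E_q[log p(Y|X,w)] + lam * KL[q(w) || p(w)].

    log_likelihood is a Monte Carlo estimate of E_q[log p(Y|X,w)],
    e.g. obtained from one sampled set of weights (an assumption here).
    """
    kl = sum(kl_gaussian(m, s, sigma_prior) for m, s in zip(mus, sigmas))
    return -log_likelihood + lam * kl


if __name__ == "__main__":
    # Hypothetical toy values: three weights, prior std 0.1, lambda = 10.
    loss = negative_elbo(log_likelihood=-1.5,
                         mus=[0.0, 0.05, -0.02],
                         sigmas=[0.1, 0.08, 0.12],
                         sigma_prior=0.1, lam=10.0)
    print(loss)
```

Minimizing this quantity is equivalent to maximizing the ELBO; a larger $\lambda$ pulls the approximate posterior more strongly toward the prior.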

[…] sample weights from $q(w)$. In this method, the dropout rate is a hyperparameter selected using the validation dataset. The dropout rate defines the amount of uncertainty in the weights and is fixed throughout network training and testing.

Deep Ensembles

In addition to BNNs, we characterized and evaluated uncertainty estimates using an ensemble of neural networks. This is also known as Deep Ensembles. Deep Ensembles consist of multiple neural networks trained using the same data (or different subsets of the same data) with different random initializations. Combining the models in an ensemble has been shown to produce well-calibrated probabilities in computer vision tasks, and the variability among the models can be used to calculate predictive uncertainty. This non-Bayesian approach was inspired by the idea of bootstrapping, where stochasticity in the sampling of the training data and in the training algorithm defines model uncertainty. This approach differs from Bayesian methods since it does not require approximation of the posterior distribution of the weights.

Algorithm Evaluation

We evaluated the uncertainty estimation algorithms based on three aspects: (1) segmentation accuracy and probability calibration, (2) uncertainty on out-of-distribution images, and (3) application of uncertainty estimates for segmentation quality control. The purposes of these evaluations are as follows:

(1) We show that Bayesian neural networks can provide segmentation accuracies that are similar to or higher than those of plain or point estimate neural networks. In addition, predicted segmentation probabilities should be well-calibrated, i.e., a pixel with a predicted probability of 60% of belonging to the myocardium should truly have a 60% chance of being myocardium. From a frequentist perspective, this means that out of all predictions with 60% probability, 60% of the predictions are correct.
(2) We measure segmentation uncertainty on out-of-distribution images to validate that the uncertainty measures perform as expected, i.e., uncertainty should increase when test images are substantially different from training datasets.

(3) We expect uncertainty measures to be useful in identifying problematic segmentations that require manual editing.

We first discuss metrics for segmentation accuracy and probability calibration and then present methods to quantify predictive uncertainty using uncertainty measures.

Segmentation Accuracy

Let $R_a$ and $R_m$ be the automated and manual segmentation regions, respectively. We calculated the algorithm segmentation accuracy using Dice similarity coefficient, average symmetric surface distance (ASSD), and Hausdorff distance (HD). Dice measures the overlap between two segmentations and is given by:

$$\textrm{Dice} = \frac{2 \left| R_a \cap R_m \right|}{\left| R_a \right| + \left| R_m \right|}.$$

ASSD quantifies the average distance between the contours of two segmentation regions and is given by:

$$\textrm{ASSD} = \frac{1}{2} \left( \frac{1}{\left|\partial R_a\right|} \sum_{p \in \partial R_a} d(p, \partial R_m) + \frac{1}{\left|\partial R_m\right|} \sum_{p \in \partial R_m} d(p, \partial R_a) \right),$$

where $\left|\partial R_a\right|$ represents the number of points on contour $\partial R_a$ and $d(p, \partial R_m)$ represents the shortest Euclidean distance from point $p$ to contour $\partial R_m$; $\left|\partial R_m\right|$ and $d(p, \partial R_a)$ are defined in the same manner.
HD is the maximum distance between two contours and is calculated as:

$$\textrm{HD} = \max \left\{ \max_{p \in \partial R_a} d(p, \partial R_m), \; \max_{p \in \partial R_m} d(p, \partial R_a) \right\}.$$

Probability Calibration

These metrics measure how closely the neural network segmentation probabilities match the ground truth probabilities generated from manual segmentation on a per-pixel basis. Following the notation in Sec. bnn-intro, we let $\hat{y}_j$ and $y_j$ denote the prediction and ground truth label of pixel $j$ in a given image, $j \in \Omega$, respectively. Negative log-likelihood (NLL) measures how well the learned model fits the observed (testing) data. Note that NLL is sensitive to tail probabilities; that is, a model that generates low probability for the correct class is heavily penalized. The Brier score (BS) measures the mean squared error between the predicted and ground truth probabilities.

Predictive Uncertainty Measures

Predictive uncertainty measures can be calculated from the neural network predictions to indicate the degree of certainty of the output. These can be calculated per pixel or per structure/class. Pixelwise uncertainty measures are motivated by information theory. These values are calculated per pixel and averaged across all pixels in the image if an image-level measure is required.
In this work, we used the following pixelwise uncertainty measures:

- Predictive Entropy measures the spread of probabilities across all the classes in the mean prediction and is given by:

$$\sum_{j \in \Omega} \sum_{c=0}^{C-1} \left[ -p \left( \hat{y}_j = c \right) \log p \left( \hat{y}_j = c \right) \right]$$

- Mutual Information (MI) measures how different each sample is from the mean prediction and is calculated as:

$$\mathbb{E}_{q(w)} \left[ \sum_{j \in \Omega} \sum_{c=0}^{C-1} p \left( \hat{y}_j = c | w \right) \log p \left( \hat{y}_j = c | w \right) - \sum_{j \in \Omega} \sum_{c=0}^{C-1} p \left( \hat{y}_j = c \right) \log p \left( \hat{y}_j = c \right) \right]$$
where $p \left( \hat{y}_j = c | w \right)$ denotes the prediction given a set of weights $w$. MI is high if there are samples with both high and low confidence, and is low if all samples have low confidence or high confidence.

Structural uncertainty measures were proposed specifically for image segmentation. Here, we define two structural uncertainty measures, which quantify how different each structure is among the prediction samples in terms of Dice and ASSD. We expect structural uncertainty measures to better align with common segmentation accuracy metrics because of their focus on global image-level uncertainty.

- $\textrm{Dice}_{\textrm{WithinSamples}} = \frac{1}{T} \sum_{i=1}^{T} \textrm{Dice}(\bar{S}, S_i)$
- $\textrm{ASSD}_{\textrm{WithinSamples}} = \frac{1}{T} \sum_{i=1}^{T} \textrm{ASSD}(\bar{S}, S_i)$

where $\bar{S}$ is the mean predicted segmentation and $S_i$, $i \in \{1, \ldots, T\}$, are individual prediction samples from the neural network.
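
The uncertainty measures above can be sketched in a few lines of plain Python. The sketch below assumes that the $T$ prediction samples are available as nested lists `samples[t][j][c]` of softmax probabilities and that binary masks are flat lists of 0/1 values; these data layouts, the function names, and the small `eps` added inside the logarithm are illustrative assumptions, not details taken from this study.

```python
import math


def mean_prediction(samples):
    """Average class probabilities over T samples; samples[t][j][c]."""
    n_samples = len(samples)
    n_pixels, n_classes = len(samples[0]), len(samples[0][0])
    return [[sum(samples[t][j][c] for t in range(n_samples)) / n_samples
             for c in range(n_classes)] for j in range(n_pixels)]


def entropy(probs, eps=1e-12):
    """Sum over pixels and classes of -p * log(p); eps avoids log(0)."""
    return -sum(p * math.log(p + eps) for pixel in probs for p in pixel)


def predictive_entropy(samples):
    """Entropy of the mean prediction."""
    return entropy(mean_prediction(samples))


def mutual_information(samples):
    """Entropy of the mean prediction minus the mean per-sample entropy."""
    expected_entropy = sum(entropy(s) for s in samples) / len(samples)
    return predictive_entropy(samples) - expected_entropy


def dice(mask_a, mask_b):
    """Dice overlap of two binary masks given as flat lists of 0/1."""
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    # Convention (assumption): two empty masks count as perfect agreement.
    return 2.0 * intersection / total if total else 1.0


def dice_within_samples(mean_mask, sample_masks):
    """Average Dice between the mean segmentation and each sample."""
    return sum(dice(mean_mask, s) for s in sample_masks) / len(sample_masks)


if __name__ == "__main__":
    # Two hypothetical samples for one pixel and two classes that disagree.
    disagreeing = [[[1.0, 0.0]], [[0.0, 1.0]]]
    print(predictive_entropy(disagreeing), mutual_information(disagreeing))
```

When all samples are identical, the mutual information is zero even if the predictive entropy is high, which is why the two measures are reported separately.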

Datasets

UK BioBank (UKBB)

The UKBB dataset consists of images from 4845 healthy volunteers. For each subject, 2D cine cardiac MR images were acquired on a 1.5T Siemens scanner using a bSSFP sequence under breath-hold conditions with ECG-gating (pixel size = 1.8-2.3 mm, slice thickness = 8 mm, number of slices = $\sim$ 10, number of phases = $\sim$ 50). Manual segmentation of the left ventricle blood pool (LV), left ventricle myocardium (Myo), and right ventricle (RV) was performed on the end-diastolic (ED) and end-systolic (ES) phases by one of eight observers followed by random checks by an expert to ensure segmentation quality and consistency. The dataset was randomly split into 4173, 103, and 569 subjects for training, validation, and testing, respectively. These numbers were chosen for convenience.

Automated Cardiac Diagnosis Challenge (ACDC)

The ACDC dataset consists of 100 patients with one of five conditions: normal, myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle. 2D Cine MR images were acquired using a bSSFP sequence on a 1.5T/3T Siemens scanner (pixel size = 0.7-1.9 mm, slice thickness = 5-10 mm, number of slices = 6-18, number of phases = 28-40). Manual segmentation was performed at the ED and ES phases with approval by two experts. This dataset was used for testing only.

Training Details

We used a plain 2D U-net for MC Dropout, BBB, and Deep Ensembles. The plain U-net consisted of 10 layers with $3\times3$ filters and 2 layers with $1\times1$ convolutions followed by a softmax layer. The number of filters ranged from 32 to 512 from the top to the bottom layers.

For MC Dropout (MCD), we added dropout on all layers or only on the middle layers of the U-net with different dropout rates: 0.5, 0.3, 0.1. These settings effectively tuned the amount of uncertainty in the model. For BBB, we experimented with different standard deviations of the prior distributions: $\sigma_{prior}=$ 0.1 or 1.0 and different weights for the KL term: $\lambda=$ 0.1, 1.0, 10, 30. These are commonly used hyperparameters in the literature. For both methods, the final prediction was obtained by averaging the softmax probabilities of 50 samples. For Deep Ensembles, we trained 10 plain U-net models separately using all of the training data with different random initializations and averaged the softmax probabilities of the 10 models. For each method, we saved models with the lowest NLL on the validation dataset since NLL is directly related to segmentation accuracy and probability calibration.

For preprocessing, the input images were cropped by taking a region of 160x160 pixels from the center of the original images. Then, image intensities were normalized by subtracting the mean and dividing by the variance of the entire training dataset. Data augmentation was performed, including random rotation (-60 to 60 degrees), translation (-60 to 60 pixels), and scaling (0.7 to 1.3 times).

These models were trained using the Adam optimizer for 50 epochs. The initial learning rate was set to 1e-4, which was decayed to 1e-5 after 30 epochs. Experiments were repeated 5 times (except for Deep Ensembles) and the results were averaged.
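
The preprocessing and learning rate settings described above could be sketched as follows. This is a minimal illustration under several assumptions: images are 2D lists of rows at least 160 pixels in each dimension, the learning rate changes in a single step at epoch 30 (the text does not state the shape of the decay), and epochs are counted from zero. All function names are hypothetical.

```python
def center_crop(image, size=160):
    """Take a size x size region from the center of a 2D image (list of rows).

    Assumption: the image is at least `size` pixels in both dimensions.
    """
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def normalize(image, mean, variance):
    """Subtract the training-set mean and divide by the training-set
    variance, as described in the text."""
    return [[(value - mean) / variance for value in row] for row in image]


def learning_rate(epoch, initial=1e-4, decayed=1e-5, decay_epoch=30):
    """Step schedule (assumed): `initial` for the first 30 epochs,
    `decayed` afterwards. Epochs are zero-indexed."""
    return initial if epoch < decay_epoch else decayed


if __name__ == "__main__":
    # Hypothetical 200 x 180 image with distinct values per pixel.
    image = [[float(r * 1000 + c) for c in range(180)] for r in range(200)]
    cropped = center_crop(image)
    print(len(cropped), len(cropped[0]), learning_rate(0), learning_rate(40))
```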

Experiments and Results

For BBB, $\lambda=10$ and $\sigma_{prior}=0.1$ achieved the best validation negative log likelihood. For MC Dropout, adding dropout on the middle layers with a dropout rate of 0.1 (MCD-0.1) performed the best. We also reported results for MC Dropout with a dropout rate of 0.5 on the middle layers (MCD-0.5) for comparison since this setting is commonly used in the literature.

Segmentation Accuracy and Probability Calibration

The segmentation accuracy and probability calibration of the U-net trained with various methods are shown in Table segmentation_performance. Deep Ensembles performed the best in terms of segmentation accuracy and probability calibration. This was followed by BBB and MCD-0.1, which were comparable to the plain U-net. MCD-0.5 performed slightly worse than other models in terms of segmentation accuracy and probability calibration. These results indicate that Bayesian approaches or Deep Ensembles can yield similar, if not better, segmentation results compared to a plain U-net while providing uncertainty estimates at the same time.

Uncertainty on Distorted Images

In order to validate uncertainty measures as indicators of out-of-distribution datasets, we applied the trained models to carefully generated images with various magnitudes of distortions, including:

- adding Rician noise, as found in MR images, with magnitudes ranging from 0.05 to 0.10 (image intensities ranged from 0 to 1),
- Gaussian blurring with a standard deviation of 1.0 to 4.0 pixels, and
- deforming or stretching around the cardiac structures.

Note that these distortions were not applied as part of data augmentation during training and these images were not seen by the neural network. Figure distortions_example shows examples of the distorted images. We expected decreased segmentation accuracy and increased predictive uncertainty in images with greater magnitudes of distortions.
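
The noise and blurring distortions listed above can be approximated with the sketch below. It assumes that the reported noise magnitude is the standard deviation of the Gaussian noise in the real and imaginary channels, that blurring is a separable Gaussian filter truncated at three standard deviations, and that border pixels are replicated; these choices and all function names are assumptions, since the text does not specify the implementation. The stretching distortion is not covered.

```python
import math
import random


def add_rician_noise(image, sigma, seed=0):
    """Rician noise: magnitude of the signal after adding independent
    Gaussian noise to the real and imaginary channels."""
    rng = random.Random(seed)
    return [[math.hypot(value + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
             for value in row] for row in image]


def gaussian_kernel(sigma, truncate=3.0):
    """Normalized 1D Gaussian kernel (truncation radius is an assumption)."""
    radius = max(1, int(truncate * sigma + 0.5))
    weights = [math.exp(-0.5 * (k / sigma) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for row in image]


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter rows, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


if __name__ == "__main__":
    # Hypothetical 8 x 8 image with intensities in [0, 1].
    image = [[(r + c) / 14.0 for c in range(8)] for r in range(8)]
    noisy = add_rician_noise(image, sigma=0.05)
    blurred = gaussian_blur(image, sigma=1.0)
    print(noisy[0][0], blurred[4][4])
```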

Figure all_distortion_boxplots and Supplementary Table 1 show that segmentation accuracy decreased as the magnitude of the distortions (noise, Gaussian blur, stretch) was increased. This is expected since these types of distortions were not seen during training and increasing the magnitude of the distortions resulted in greater differences from the original training datasets. In addition, we observed that the predictive uncertainty increased (i.e., higher predictive entropy, mutual information, ASSD, and lower Dice) with increasing magnitude of distortions but decreased after a certain threshold, as shown in Figure all_distortion_boxplots. This was the case for Deep Ensembles, BBB, and MC Dropout on images with blurring and noise distortions but not with stretching. For example, for BBB, the median predictive entropy for images with slight, moderate, and large additional noise was $1.66 \times 10^{-2}$, $1.95 \times 10^{-2}$, and $1.23 \times 10^{-2}$, respectively. Similarly, the median ASSD for the BBB model was 0.40, 3.40, and 0.71 mm for images with slight, moderate, and large amounts of blurring, respectively (see Supplementary Table 3 for more details). Figure noise_distortion_image shows examples of segmentation results and the corresponding predictive entropy and mutual information.

While the increasing predictive uncertainty associated with increasing magnitude of distortions was expected, the decrease in predictive uncertainty after a threshold in cases of noise and blurring distortions was not expected. Specifically, for images that were highly distorted, all pixels were classified as background with low uncertainty (Figure all_distortion_boxplots, bottom row). Although this seems correct when only given the labeling choice of background, LV, Myo, and RV, we argue that the distorted pixels are markedly different from the background pixels in the training images and should therefore have high uncertainty nonetheless. This is a limitation of all the uncertainty models tested and may be improved using more expressive posteriors.

Another observation is that the uncertainty measures began to fail or decrease when dramatic segmentation errors occurred (Figure all_distortion_boxplots). This suggests that other heuristics or algorithms, such as those presented in previous work, can be used to complement the uncertainty measures when trying to detect inaccurate segmentation. For example, segmentation with a non-circular LV blood pool or a blood volume $<$ 50 mL is highly problematic and may indicate poor segmentation.

Comparison between Deep Ensemble, BBB and MC Dropout

In terms of segmentation accuracy and probability calibration, BBB was more robust to noise distortions compared to the other methods. Specifically, BBB showed higher LV Dice, lower LV ASSD, lower NLL, and lower BS on images with greater noise distortions (Figure all_distortion_boxplots, noise = 0.07, 0.08, 0.10 and Supplementary Tables 1 and 2, degree of distortion = 2, 3, 4). Deep Ensembles was more robust to the blurring and stretching distortions than the other methods, showing higher segmentation accuracy in cases with greater blurring and stretching distortions (Figure all_distortion_boxplots and Supplementary Table 1, degree of distortion = 3, 4).

In terms of predictive uncertainty measures, BBB was more robust to the noise distortion while both Deep Ensembles and BBB were more resistant to the blurring compared to the other methods. For example, the mutual information, Dice, and ASSD uncertainty measures peaked at a noise distortion with degree = 3 for BBB. In contrast, the uncertainty measures peaked at a noise distortion with degree = 2 for Deep Ensembles, MCD-0.1, and MCD-0.5. Comparing the uncertainty on images with heavy noise distortions (degree of distortion = 3, 4), BBB showed the highest predictive uncertainty. This was followed by MCD-0.5, while Deep Ensembles and MCD-0.1 showed the lowest predictive uncertainties. For images that were heavily blurred, Deep Ensembles and BBB showed the highest predictive uncertainty. In cases of images with stretching distortions, all methods showed fairly similar predictive uncertainty across all degrees of stretching.

Having high predictive uncertainties on heavily distorted images is desirable since they have lower segmentation accuracy. Overall, BBB showed good predictive uncertainties on all of the tested distortions while Deep Ensembles showed good predictive uncertainties only in cases of blurring and stretching distortions. MCD-0.1 and MCD-0.5 showed results comparable to BBB and Deep Ensembles only on the stretched datasets.

Uncertainty on Dataset Shift

To further validate the uncertainty measures, we applied the models trained on the UKBB dataset to a different dataset - the ACDC dataset. The ACDC dataset consists of images with pathologies and image contrast distinctly different from the UKBB training dataset. Because of the dataset shift, we expect decreased segmentation accuracy and increased predictive uncertainty on the ACDC dataset compared to the UKBB test dataset. Figure metrics_correlations shows the segmentation accuracy and uncertainty metrics of the U-net models tested on the ACDC dataset. As expected, we observed decreased segmentation accuracy and increased predictive uncertainty compared to the UKBB test dataset. In terms of segmentation accuracy and probability calibration, the methods from the best to the worst are: Deep Ensemble, BBB, MCD-0.1, Plain U-net, and MCD-0.5, as shown in Figure metrics_correlations and Supplementary Table 4. The methods with the highest to the lowest uncertainties are: MCD-0.5, Deep Ensemble, BBB, and MCD-0.1, as shown in Figure metrics_correlations and Supplementary Table 5. Having high uncertainty on a shifted dataset is desirable, especially when the segmentation performance is poor. Overall, both Deep Ensembles and BBB showed good segmentation accuracy and predictive uncertainty in cases of shifted dataset.

Correlations between Uncertainty and Segmentation Accuracy

To demonstrate the potential utility of uncertainty measures, we evaluated the Spearman rank correlation between uncertainty measures and segmentation accuracy. We used the rank correlation instead of linear correlation to reduce effects of a potential non-linear relationship between the two quantities. Figure metrics_correlations shows the Spearman rank correlation between the uncertainty measures and the ASSD on the UKBB and ACDC test datasets. In the case of training and testing on the UKBB dataset (UKBB $\rightarrow$ UKBB), the uncertainty measure that correlated the most with ASSD was $\textrm{ASSD}_{\textrm{WithinSamples}}$ (Spearman correlation $=\left[0.58, 0.69\right]$). This is not surprising since the calculation of $\textrm{ASSD}_{\textrm{WithinSamples}}$ is similar to that of ASSD. Similar observations were obtained for training on the UKBB dataset and testing on the ACDC dataset (UKBB $\rightarrow$ ACDC) although the correlations were slightly lower (Spearman correlation $=\left[0.44, 0.66\right]$). These correlations suggest that the $\textrm{ASSD}_{\textrm{WithinSamples}}$ uncertainty measure is useful in predicting the segmentation quality when ground truth is not available.

Figure uncertainty_measures shows representative segmentation results, posterior prediction samples, and the structural uncertainty measures for RV. The images with poor segmentation (toward the right side) had greater uncertainty as measured by $\textrm{Dice}_{\textrm{WithinSamples}}$ and $\textrm{ASSD}_{\textrm{WithinSamples}}$. Note that posterior prediction samples are not meant to reflect human inter- or intra-observer variability but rather show what the network has learned from the given data.
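
For reference, the Spearman rank correlation used above is the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration with average ranks for ties; the function names and the toy values are hypothetical and are not data from this study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-image uncertainty and segmentation error (mm).
    uncertainty = [0.2, 0.5, 0.9, 1.4, 3.0]
    error_mm = [0.4, 0.6, 0.5, 1.9, 8.0]
    print(spearman(uncertainty, error_mm))
```

Because only ranks are used, a monotonic but non-linear relationship between uncertainty and segmentation error still yields a correlation of 1.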

Uncertainty for Segmentation Quality Control

In this section, we explored the use of predictive uncertainty estimates to flag potentially problematic segmentations that require manual review. We view this task as a classification problem and we aim to use uncertainty measures to classify segmentations as either good or poor. Having experts manually determine whether an automated segmentation is good or not for a large dataset is time consuming and adds observer noise. Instead, we used thresholds on segmentation accuracies to determine good or poor automated segmentation. Based on our experience and discussions with our clinical collaborators, we think that contour or surface distance is more indicative of inaccurate segmentations. As such, for each method, the predicted segmentation was considered poor when the ASSD between the prediction and ground truth was greater than the ASSD of inter-observer manual segmentations. We used inter-observer ASSD of 1.17 mm for LV, 1.19 mm for Myo, and 1.88 mm for RV, based on a recent relatively large-scale study.

We then evaluated how well the ASSD uncertainty measure could identify poor segmentations. To utilize the ASSD uncertainty measure, a threshold can be set such that any segmentation with uncertainty above the threshold is flagged for manual review. This would hopefully result in a decreased number of poor segmentations in the dataset. Figure precision_plots_assd shows the fraction of images with poor segmentation remaining in the dataset and the fraction of images flagged for manual correction when the uncertainty thresholds were varied, i.e., (positives - true positives) vs. (true positives + false positives), where positive represents poor segmentation.

As we decreased the uncertainty threshold (top left to bottom right in Figure precision_plots_assd), we flagged more images for manual correction and the number of images with poor segmentation decreased. The first point on the curves corresponds to a threshold where none of the images are reviewed, while the last point indicates that all of the images are reviewed. This is similar to a receiver operating characteristic (ROC) curve, with the consideration that the total number of positives, i.e., images with poor predicted segmentation, is different for each method. This allows for comparison between all the uncertainty estimation methods. Specifically, a curve closer to the bottom left corner or a curve with smaller area under the curve ($\mathrm{AUC}$) indicates that the algorithm provides better initial segmentation and/or that its uncertainty is a good indicator of segmentation accuracy. Figure precision_plots_assd shows that all the methods performed similarly for detecting poor LV, Myo, and RV segmentation, with Deep Ensembles having slightly lower $\mathrm{AUC}$ than the other methods.
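The curve described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `qc_curve` and `area_under_curve`, the toy ASSD and uncertainty values, and the use of trapezoidal integration are all hypothetical choices, and flagged images are assumed to be corrected on review. Only the 1.17 mm LV threshold is taken from the text.

```python
def qc_curve(uncertainty, is_poor):
    """Return (fraction flagged, fraction poor remaining) points.

    Images are flagged one at a time in order of decreasing uncertainty,
    which is equivalent to lowering the uncertainty threshold step by step.
    Tied uncertainty values are flagged in arbitrary order.
    """
    n = len(uncertainty)
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    poor_remaining = sum(is_poor)
    points = [(0.0, poor_remaining / n)]  # first point: nothing reviewed
    for flagged, idx in enumerate(order, start=1):
        if is_poor[idx]:
            poor_remaining -= 1  # true positive: a poor segmentation is caught
        points.append((flagged / n, poor_remaining / n))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical per-image values (not data from this study).
assd_mm = [0.6, 0.9, 2.4, 1.0, 3.0, 0.7, 1.5, 0.8]
uncertainty = [0.3, 0.5, 2.0, 1.3, 2.8, 0.4, 0.9, 0.6]

# "Poor" means the ASSD exceeds the inter-observer ASSD (1.17 mm for LV).
is_poor = [a > 1.17 for a in assd_mm]

points = qc_curve(uncertainty, is_poor)
auc = area_under_curve(points)
```

A smaller area indicates that poor segmentations are caught after reviewing fewer images, either because there are fewer of them to begin with or because the uncertainty ranks them early.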

The thresholds of the uncertainty measures for flagging images for manual review can be adjusted based on the requirements of the application. This approach provides a way to identify images to review and may result in substantial time savings. For example, using the ASSD uncertainty measure for the Deep Ensembles method, 48%, 38%, and 31% of the images required manual review in order to reduce the number of images with poor LV, Myo, or RV segmentation to 5% of the test dataset, respectively (Figure precision_plots_assd). In contrast, without using uncertainty measures and assuming no other information about the images is used, approximately 75% of the images need to be reviewed to reduce the proportion of poor segmentation to 5%. Furthermore, using the ASSD uncertainty measure resulted in more time savings compared to a naive approach of reviewing images based on slice position. Specifically, since the segmentation is usually worse on basal and apical slices, a naive approach for segmentation quality control is to first review all of the most basal and most apical slices, followed by the second-most basal and second-most apical slices, and so on. We refer to this approach as using the slice position as a heuristic for segmentation quality control. As shown in Figure precision_plots_assd, using uncertainty measures is more advantageous than this naive approach, as evidenced by a lower $\mathrm{AUC}$ and a smaller number of images to review to leave 5% poor segmentation remaining. Assuming a dataset of 10,000 subjects each with 10 slices and manual segmentation of 30 seconds per structure per slice, using the ASSD uncertainty measure results in $\sim$ 940 hours of time savings compared to reviewing the images randomly, and $\sim$ 580 hours of time savings compared to using the slice position heuristic to achieve 5% poor segmentation remaining.
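The time-savings estimate above can be approximated with a short calculation. This is a rough sketch under stated assumptions: savings are summed over the three structures, the random-review fraction is taken as 75% for every structure, and the helper name `review_hours` is hypothetical. With these rounded inputs the result is about 900 hours, the same order as the reported $\sim$ 940 hours; the exact per-structure fractions behind the reported figure are not given in this section.

```python
SECONDS_PER_HOUR = 3600


def review_hours(fraction_reviewed, n_images, seconds_per_image):
    """Hours needed to manually review a given fraction of the images."""
    return fraction_reviewed * n_images * seconds_per_image / SECONDS_PER_HOUR


n_images = 10_000 * 10       # subjects x slices per subject (from the text)
seconds_per_structure = 30   # manual segmentation time per structure per slice

# Fraction of images reviewed to leave 5% poor segmentations remaining.
random_fraction = 0.75       # approximate value; assumed equal for all structures
uncertainty_fraction = {"LV": 0.48, "Myo": 0.38, "RV": 0.31}  # Deep Ensembles

# Assumption: total savings are the sum of the per-structure savings.
hours_saved = sum(
    review_hours(random_fraction - fraction, n_images, seconds_per_structure)
    for fraction in uncertainty_fraction.values()
)
```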

Training Time

All experiments were performed on an Nvidia P100 GPU with 12 GB of memory. MC Dropout and BBB required $\sim$ 35 hours and $\sim$ 47 hours for training, respectively. For these Bayesian methods, generating 50 prediction samples during testing required 1.5 seconds for each 2D image. The Deep Ensembles method consisted of 10 copies of the plain U-net, which were trained separately in parallel. Each plain U-net required $\sim$ 12 hours for training and $\sim$ 0.1 seconds for prediction of a 2D image. Calculation of the predictive entropy, mutual information, Dice, and ASSD uncertainty measures required approximately 0.05, 0.15, 0.15, and 0.4 seconds per image, respectively.

Discussion

BBB vs. MC Dropout vs. Deep Ensembles

In this work, we evaluated and compared different Bayesian and non-Bayesian approaches for estimating uncertainty in neural networks for cardiac MRI segmentation. Here, we discuss similarities and differences in how uncertainty is learned in these methods and what was learned after training, and relate these differences to the quality of the predictive uncertainties on out-of-distribution images. While uncertainty in neural network parameters is learned automatically in BBB, in MC Dropout it is tuned by changing the dropout rate. For the UKBB dataset, a small dropout rate of 0.1 in the middle layers performed better than other MC Dropout models in terms of segmentation accuracy and probability calibration. This is different from other studies, which commonly used a dropout rate of 0.5, and may be because of the large amount of relatively uniform training data (i.e., images acquired following the same MR protocol and labelled following the same guidelines). The dropout rate hyperparameters for MC Dropout obtained through grid search correspond to some extent to the weight uncertainties learned by BBB. Figure weight_stats shows that the standard deviation of the weights learned by BBB is lower in the early layers and higher in the middle layers of the U-net. This is similar to MC Dropout with no dropout in the early layers and some dropout in the middle layers. Using a greater dropout rate in the middle layers is common in other studies.

The effect of the dropout rate in MC Dropout is shown in the experiments with dataset shift (Figure metrics_correlations, left). A dropout rate of 0.1 yielded higher segmentation accuracy but lower predictive uncertainty, whereas a dropout rate of 0.5 resulted in lower segmentation accuracy but higher predictive uncertainty. BBB was able to mitigate this issue by learning a mean and standard deviation for each weight, resulting in comparable or higher segmentation accuracy with moderate predictive uncertainties between those of MCD-0.1 and MCD-0.5 (Figure metrics_correlations, left). In Deep Ensembles, the uncertainty in the weights is not learned or tuned but instead stems from the random initialization of the weights and the stochasticity of the training procedure. Each model in the ensemble then converges to a local minimum, and the models are combined to form a prediction and an uncertainty estimate. This was found to outperform BBB in terms of segmentation accuracy and probability calibration in most cases. This may be because the approximate posterior learned by BBB covered only one (or a few) mode(s) of the true posterior and a small neighbourhood around each mode, as opposed to multiple local modes in Deep Ensembles. Based on our observations, the histograms of the weights learned by the members of the Deep Ensembles were very similar to one another. Techniques such as canonical correlation analysis can be used to compare the representations (i.e., feature maps) between different members of the ensemble and to better understand model uncertainty. While Deep Ensembles performed similarly to or better than BBB on images with blurring or stretching distortions, BBB outperformed Deep Ensembles on images with noise distortions. A difference between these distortions is that the noise was added independently per pixel, while blurring and stretching were applied to each structure as a whole. This suggests that BBB is more suitable for capturing uncertainty in low-level pixelwise variations, while Deep Ensembles are better for modelling uncertainty in structural variations.

Bayesian approaches and Deep Ensembles can be used to improve segmentation accuracy on datasets which are slightly shifted compared to the training dataset. While the segmentation accuracies for testing on the ACDC dataset using the models trained on the UKBB dataset are lower than those obtained using models both trained and tested on the ACDC dataset, the algorithms employed in this work may be combined with other techniques that are specifically designed to solve this problem, e.g., style augmentation or domain adversarial training.

Pixelwise and Structural Uncertainty Measures

We introduced pixelwise and structural uncertainty measures to quantify the predictive uncertainty, and demonstrated the utility of these metrics for segmentation quality control. Both pixelwise and structural uncertainty measures can be used depending on the application. As the segmentation problem was formulated as pixel classification, pixelwise uncertainty measures are straightforward to obtain. These allow users to visualize which pixels and which areas are potentially problematic (Figure noise_distortion_image). However, segmentation is often performed at the image-level and slice by slice. Therefore, in developing uncertainty measures for determining problematic segmentation, an image-level uncertainty measure is required. Accordingly, we showed that structural uncertainty measures are correlated with segmentation accuracy metrics such as ASSD. It is important to note that the predictive uncertainty measures reflect the neural network uncertainty, which is different from human uncertainty. An example is the image with heavy noise in the last row in Figure noise_distortion_image. It is expected that human observers can manually segment this image with low observer variability; however, since the image is very different from the training data, the neural networks were not able to generate a reasonable segmentation and showed high predictive entropy and mutual information for the entire cardiac structure.
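As a minimal sketch of the pixelwise measures discussed above, the snippet below computes predictive entropy and mutual information for a single pixel from a set of sampled class-probability vectors (for example, from MC Dropout passes, BBB weight samples, or ensemble members). It uses the standard definitions, which are assumed here to match those used in this work; the function names and the toy inputs are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pixel_uncertainty(samples):
    """Return (predictive entropy, mutual information) for one pixel.

    samples: list of T class-probability vectors, one per prediction sample.
    """
    n_samples = len(samples)
    n_classes = len(samples[0])
    # Average the sampled probabilities to get the predictive distribution.
    mean_probs = [sum(s[c] for s in samples) / n_samples
                  for c in range(n_classes)]
    predictive_entropy = entropy(mean_probs)
    # Mutual information: total entropy minus the mean per-sample entropy.
    expected_entropy = sum(entropy(s) for s in samples) / n_samples
    return predictive_entropy, predictive_entropy - expected_entropy


# Toy two-class examples (hypothetical values).
agree_uncertain = pixel_uncertainty([[0.5, 0.5], [0.5, 0.5]])
disagree = pixel_uncertainty([[1.0, 0.0], [0.0, 1.0]])
agree_confident = pixel_uncertainty([[1.0, 0.0], [1.0, 0.0]])
```

In the first case the samples agree that the pixel is ambiguous, so the predictive entropy is high while the mutual information is zero. In the second case the samples are individually confident but disagree, so both measures are high. In the third case both are zero.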

Segmentation Quality Control

We showed that the uncertainty measures have moderate to good correlations with segmentation accuracy. These correlations could have been negatively affected by noisy ground truth segmentation. Framing segmentation quality control as a binary classification problem, instead of evaluating the correlations or predicting the segmentation accuracy metrics, alleviates the problem of noise in ground truth segmentation. In this regard, we defined poor segmentation using a threshold on the ASSD between the predicted and ground truth segmentation. This definition of a poor segmentation was adopted based on discussions with clinical collaborators; however, it can be modified depending on the application. For example, other segmentation accuracy metrics such as Hausdorff distance or misclassification area can be used, and the framework for evaluating uncertainty measures described in this work can be applied directly. Other studies in segmentation quality control include directly predicting segmentation accuracy or comparing the predicted segmentation to a reference database. One study trained a random forest classifier to predict a binary label of correct or incorrect segmentation. The examples of correct segmentation were generated using the ground truth segmentations, while the incorrect segmentations were generated by deforming or translating the ground truth segmentations. Another study used a 3D residual network to directly predict the Dice score of a segmentation from an image-segmentation pair. The network was trained and tested on a dataset created using a random forest segmentation algorithm. These classifiers are agnostic to how the segmentation was generated.

However, these approaches depend on training with some expected segmentation failures, which may be challenging to incorporate during training. In contrast, our approach used uncertainty measures to detect poor segmentation and is explainable. Instead of using a learning algorithm to determine segmentation quality, we used model uncertainty, which emerges intrinsically when training the segmentation algorithm. A slight limitation of our approach is that sampling during the testing phase required up to 50x more computation time compared to a single prediction, but this may be accelerated through parallelization. Another approach for segmentation quality control is to use a modified version of Reverse Classification Accuracy to predict the segmentation quality of an image-segmentation pair. This approach requires a reference database of images with ground truth segmentation. Each reference image is registered to the test image and the associated reference segmentations are warped accordingly. The warped reference segmentations are considered as potential segmentations of the test image. The quality of the test image segmentation is estimated by comparing the potential segmentations with the test segmentation. A limitation of this approach is that it requires a long time to generate the segmentation quality prediction, mainly due to the registration steps.

Conclusions

In this work, we compared Bayesian and non-Bayesian methods, namely BBB, MC Dropout, and Deep Ensembles, for cardiac MRI segmentation and showed the utility of these algorithms for segmentation quality control. We found that Deep Ensembles performed better in terms of segmentation accuracy and probability calibration on in-distribution and out-of-distribution datasets. However, BBB outperformed Deep Ensembles on images with noise distortions. Additionally, we identified a limitation of current methods in terms of predictive uncertainty on images with heavy distortions. Furthermore, we showed that the ASSD uncertainty measure correlated well with the ASSD segmentation accuracy on both in-distribution and out-of-distribution datasets. For the task of segmentation quality control, the ASSD uncertainty measures from all methods were similarly useful in predicting problematic segmentations. Using uncertainty measures can result in substantial time savings by reducing the number of images that need to be reviewed. Please see the supplementary material for detailed tables of results.

Acknowledgements

We acknowledge the use of the facilities of Compute Canada. This work was funded by Canadian Institutes of Health Research (CIHR) MOP: #93531, Ontario Research Fund and GE Healthcare. FG is supported by a Banting postdoctoral fellowship. This work was partly funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825903 (euCanSHare project). SEP acts as a paid consultant to Circle Cardiovascular Imaging Inc., Calgary, Canada and Servier. SEP acknowledges support from the National Institute for Health Research (NIHR) Biomedical Research Centre at Barts, from the SmartHeart EPSRC programme grant (EP/P001009/1) and the London Medical Imaging and AI Centre for Value-Based Healthcare. SEP acknowledges support from the CAP-AI programme, London’s first AI enabling programme focused on stimulating growth in the capital’s AI sector. SEP, SN and SKP acknowledge the British Heart Foundation for funding the manual analysis to create a cardiovascular magnetic resonance imaging reference standard for the UK Biobank imaging resource in 5000 CMR scans (PG/14/89/31194). This project was enabled through access to the Medical Research Council eMedLab Medical Bioinformatics infrastructure, supported by the Medical Research Council (MR/L016311/1).

This research was conducted using the UK Biobank resource under Application 2964. UK Biobank will make the data available to all bona fide researchers for all types of health-related research that is in the public interest, without preferential or exclusive access for any person. All researchers will be subject to the same application process and approval criteria as specified by UK Biobank. For the detailed access procedure see http://www.ukbiobank.ac.uk/register-apply/.