

Original paper: https://arxiv.org/pdf/2311.13717v5


Feature Extraction for Generative Medical Imaging Evaluation: New Evidence Against an Evolving Trend


Abstract. Fréchet Inception Distance (FID) is a widely used metric for assessing synthetic image quality. It relies on an ImageNet-based feature extractor, making its applicability to medical imaging unclear. A recent trend is to adapt FID to medical imaging through feature extractors trained on medical images. Our study challenges this practice by demonstrating that ImageNet-based extractors are more consistent and aligned with human judgment than their RadImageNet counterparts. We evaluated sixteen StyleGAN2 networks across four medical imaging modalities and four data augmentation techniques with Fréchet distances (FDs) computed using eleven ImageNet- or RadImageNet-trained feature extractors. Comparison with human judgment via visual Turing tests revealed that ImageNet-based extractors produced rankings consistent with human judgment, with the FD derived from the ImageNet-trained SwAV extractor significantly correlating with expert evaluations. In contrast, RadImageNet-based rankings were volatile and inconsistent with human judgment. Our findings challenge prevailing assumptions, providing novel evidence that medical image-trained feature extractors do not inherently improve FDs and can even compromise their reliability. Our code is available at https://github.com/mckellwoodland/fid-med-eval.


Keywords: Generative Modeling · Fréchet Inception Distance.


1 Introduction


Fréchet Inception Distance (FID) is the most commonly used metric for evaluating synthetic image quality [1]. It quantifies the Fréchet distance (FD) between two Gaussian distributions fitted to embeddings of real and generated images. These embeddings are typically extracted from the penultimate layer of an InceptionV3 network trained on ImageNet. FID’s utility has been demonstrated through its correlation with human judgment [2], sensitivity to distortions [1], capability to detect overfitting [3], and relative sample efficiency [3]. Nonetheless, the metric has faced criticism, including that the InceptionV3 network may only embed information relevant to ImageNet class discrimination [4,5].


Three approaches exist for adapting FID to medical imaging. The first involves using an InceptionV3 extractor trained on a large, publicly available medical dataset, such as RadImageNet, a database containing 1.35 million annotated computed tomography (CT), magnetic resonance imaging (MRI), and ultrasonography exams [6,7]. While a RadImageNet-based FD considers medically relevant features, its efficacy remains largely unexplored. One potential bias is that networks trained for disease detection may focus too heavily on small, localized regions [8] to effectively evaluate an entire image’s quality. Additionally, RadImageNet-based FDs may not generalize to new medical modalities [7] or patient populations. Our novel comparison of RadImageNet-based FDs to human judgment revealed discrepancies, even on in-domain abdominal CT data.


The second approach utilizes self-supervised networks for feature extraction [9]. These networks are encouraging as they create transferable and robust representations [10], including on medical images [4]. Despite their promise, the lack of publicly available, self-supervised models trained on extensive medical imaging datasets has hindered their application. Our study is the first to employ self-supervised extractors for synthetic medical image evaluation. We find a significant correlation between an FD derived from an ImageNet-trained SwAV network (FSD) and medical experts’ appraisal of image realism, highlighting the potential of self-supervision for advancing generative medical imaging evaluation.


The third approach employs a feature extractor trained on the dataset used to train the generative imaging model [11,12,13]. While advantageous for domain coherence, the algorithm designer essentially creates the metric used to evaluate their algorithm, resulting in unquantified bias [2]. Moreover, the private and varied nature of these feature extractors poses challenges for reproducibility and benchmarking. Given these limitations, our study focused on publicly available feature extractors.


Our study offers a novel comparison of generative model rankings created by ImageNet- and RadImageNet-trained feature extractors with expert judgment. Our key contributions are:


  1. Introducing a novel method for evaluating visual Turing tests (VTTs) via hypothesis testing, providing an unbiased measure of participant perception of synthetic image realism.

2 Methods


2.1 Generative Modeling


Four medical imaging datasets were used for generative modeling: the Segmentation of the Liver Competition 2007 (SLIVER07) dataset with 20 liver CT studies [14], the ChestX-ray14 dataset with 112,100 chest X-rays [15], the brain tumor dataset from the Medical Segmentation Decathlon (MSD) with 750 brain MRI studies [16,17], and the Automated Cardiac Diagnosis Challenge (ACDC) dataset with 150 cardiac cine-MRIs [18]. Multi-dimensional images were converted to two dimensions by extracting axial slices and excluding the slices with less than 15% nonzero pixels.
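A minimal sketch of this slice-selection step (the 15% nonzero threshold is from the text; the `(slice, height, width)` axis order and the function name are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def extract_axial_slices(volume, min_nonzero_frac=0.15):
    """Return 2-D axial slices whose fraction of nonzero pixels is at least
    `min_nonzero_frac`. Assumes `volume` is indexed (slice, height, width)."""
    kept = []
    for axial_slice in volume:
        frac = np.count_nonzero(axial_slice) / axial_slice.size
        if frac >= min_nonzero_frac:
            kept.append(axial_slice)
    return kept
```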


To enable a comparison of synthetic quality, four StyleGAN2 [19] models were trained per dataset, using either adaptive discriminator augmentation (ADA) [20], differentiable augmentation (DiffAugment) [21], adaptive pseudo augmentation (APA) [22], or no augmentation. While all of the data augmentation techniques were created to improve the performance of generative models on limited data domains, such as medical imaging, we are the first to benchmark the techniques on medical images. Each model was evaluated using the weights obtained at the end of 25,000 kimg (a kimg represents a thousand real images being shown to the discriminator), except for the MSD experiments, which were limited to 5,000 kimg due to training instability. Our code and trained model weights are available at https://github.com/mckellwoodland/fid-med-eval.


2.2 Human Evaluation


Human perception of model quality was assessed with one VTT per model. Each test comprised 20 randomly selected images with an equal number of real and generated images. Participants were asked to identify whether each image was real or generated and rate its realism on a Likert scale from 1 to 3 (1: “Not at all realistic,” 2: “Somewhat realistic,” and 3: “Very realistic”). The tests were administered to five specialists with medical degrees. In addition to the VTTs, three radiologists were shown 35 synthetic radiographs per ChestX-ray14 model and were asked to rank and provide a qualitative assessment of the models.


False positive rate (FPR) and false negative rate (FNR) were used to evaluate the VTTs. The FPRs represent the proportion of generated images that participants considered to be real. FPRs near 50% indicate random guessing.
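For concreteness, a small sketch of how these rates can be computed from VTT responses (names and the 0/1 coding are illustrative; 1 marks a generated image in the ground truth and a "generated" guess in the responses):

```python
def vtt_rates(is_generated, guessed_generated):
    """FPR: fraction of generated images judged real.
    FNR: fraction of real images judged generated."""
    pairs = list(zip(is_generated, guessed_generated))
    n_gen = sum(1 for g, _ in pairs if g)
    n_real = len(pairs) - n_gen
    false_pos = sum(1 for g, gg in pairs if g and not gg)   # generated judged real
    false_neg = sum(1 for g, gg in pairs if not g and gg)   # real judged generated
    return false_pos / n_gen, false_neg / n_real
```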


One-sided paired $t$ tests were performed on the FPRs with $\alpha{=}.05$ to benchmark the data augmentation techniques. For each VTT, the average Likert ratings of real and generated images were computed per participant. The difference between these average ratings was then computed to compare the perceived realism of real and generated images. Two-sample Kolmogorov-Smirnov (KS) tests were conducted on the Likert ratings of the real and generated images with $\alpha{=}.10$ to determine whether the ratings came from the same distribution, indicating that the participants viewed the realism of the generated images to be equivalent to that of the real images. We are the first to use the difference in average Likert ratings and the KS test for generative modeling evaluation.
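The Likert analysis above can be sketched with SciPy's two-sample KS test (a sketch only; the function shape and the returned equivalence flag at α=.10 are illustrative assumptions):

```python
from scipy import stats

def likert_analysis(real_ratings, gen_ratings, alpha=0.10):
    """Compare Likert ratings of real vs. generated images.

    Returns the difference in mean ratings (real minus generated) and the
    two-sample KS-test p-value; failing to reject (p >= alpha) suggests the
    participants rated the two sets of images as equally realistic."""
    mean_diff = sum(real_ratings) / len(real_ratings) \
        - sum(gen_ratings) / len(gen_ratings)
    _, p_value = stats.ks_2samp(real_ratings, gen_ratings)
    return mean_diff, p_value, p_value >= alpha
```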


When taking a VTT, participants may be more likely to select either “real” or “generated” when uncertain. This bias causes the average FPR to not fully encapsulate whether participants could differentiate between real and generated images. To address this challenge, we propose a novel method for evaluating VTTs via hypothesis testing. The method aims to demonstrate that the likelihood of a participant selecting “real” is the same for both real and generated images. For each participant $p$, we define the null hypothesis $\mathrm{P}(p \text{ guesses real} \mid \mathrm{G}) = \mathrm{P}(p \text{ guesses real} \mid \mathrm{R})$, where G represents the event that the image is generated and R represents the event that the image is real. We evaluate this hypothesis using a two-sample $t$ test with $\alpha{=}.10$, where the first sample is the participant’s binary predictions for generated images, and the second is their predictions for real images. To evaluate VTTs for multiple participants $P$, we define the null hypothesis $\mathrm{P}(\text{random } p \in P \text{ guesses real} \mid \mathrm{G}) = \mathrm{P}(\text{random } p \in P \text{ guesses real} \mid \mathrm{R})$. We evaluate this hypothesis via a two-sample $t$ test with $\alpha{=}.10$, where the first sample is the FPR and the second is the true positive rate of each participant.
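The participant-level and group-level tests can be sketched as follows (a sketch under stated assumptions: `scipy.stats.ttest_ind` as the two-sample t test and a 0/1 coding of "guessed real" are implementation details not specified in the text):

```python
from scipy import stats

def vtt_test_single(guessed_real_on_generated, guessed_real_on_real, alpha=0.10):
    """Test H0: P(p guesses real | G) = P(p guesses real | R) for one
    participant, given 0/1 'guessed real' indicators for each image set."""
    _, p_value = stats.ttest_ind(guessed_real_on_generated, guessed_real_on_real)
    return p_value, p_value < alpha  # True -> participant distinguishes the sets

def vtt_test_group(fprs, tprs, alpha=0.10):
    """Group-level test on per-participant FPRs (generated images judged real)
    and TPRs (real images judged real)."""
    _, p_value = stats.ttest_ind(fprs, tprs)
    return p_value, p_value < alpha
```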


2.3 Fréchet Distances


Quantitative evaluation of synthetic image quality was performed by calculating the FD $d(\Sigma_{1},\Sigma_{2},\mu_{1},\mu_{2})^{2}=\|\mu_{1}-\mu_{2}\|^{2}+\mathrm{tr}(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}\Sigma_{2})^{\frac{1}{2}})$ [23] between two multivariate Gaussians $(\Sigma_{R},\mu_{R})$ and $(\Sigma_{G},\mu_{G})$ fitted to real and generated features extracted from the penultimate layer of eleven backbone networks: InceptionV3 [24], ResNet50 [25], InceptionResNetV2 [26], and DenseNet121 [27], each trained separately on both ImageNet [28] and RadImageNet [6], along with SwAV [29], DINO [30], and a Swin Transformer [31] trained on ImageNet. The first four networks were included to compare all publicly available RadImageNet models to their ImageNet equivalents. SwAV and DINO were included to evaluate the impact of self-supervision, as self-supervised representations have demonstrated superior transferability to new domains [10] and richer embeddings on medical images [4]. Finally, a Swin Transformer [31] was included as transformers have been shown to create transferable and robust representations [32]. We are the first to use self-supervised and transformer architectures with FD for generative medical imaging evaluation. As the scale of FDs varies substantially by feature extractor, relative FDs $\mathrm{rFD} = \frac{d(\Sigma_{R},\Sigma_{G},\mu_{R},\mu_{G})^{2}}{d(\Sigma_{R_{1}},\Sigma_{R_{2}},\mu_{R_{1}},\mu_{R_{2}})^{2}}$ were computed with a random split of the real features into two Gaussian distributions $(\Sigma_{R_{1}},\mu_{R_{1}})$ and $(\Sigma_{R_{2}},\mu_{R_{2}})$. Paired $t$ tests with $\alpha{=}0.05$ were conducted on the FDs to benchmark the data augmentation techniques.
The Pearson correlation coefficient with $\alpha{=}0.05$ was used to quantify the correspondence between the FDs and human judgment and the correspondence between individual FDs. We are the first to consider whether medical image-based FDs are correlated with human judgment.
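The FD and rFD computations can be sketched with NumPy/SciPy (a minimal sketch, not the authors' released implementation; feature extraction from a pretrained backbone is assumed to have already produced the `(n_samples, n_features)` arrays):

```python
import numpy as np
from scipy import linalg

def frechet_distance_sq(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets:
    |mu1 - mu2|^2 + tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

def relative_fd(real_feats, gen_feats, seed=0):
    """rFD: FD(real, generated) normalized by the FD between two random
    halves of the real features."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(real_feats))
    half = len(idx) // 2
    baseline = frechet_distance_sq(real_feats[idx[:half]], real_feats[idx[half:]])
    return frechet_distance_sq(real_feats, gen_feats) / baseline
```

Dividing by the real-vs-real baseline puts extractors whose raw FDs differ by orders of magnitude (see Tables 4 and 5) on a comparable scale.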


3 Results


Table 1 summarizes the overall results of the VTTs, with detailed individual participant outcomes available at https://github.com/mckellwoodland/fid-med-eval. The rFDs based on ImageNet and RadImageNet are outlined in Tables 2 and 3, while the FDs can be found in Tables 4 and 5 in the Appendix. Model rankings based on individual metrics are illustrated in Figure 1. Our analysis revealed


Table 1. VTT results. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3 and 4 showcase the average FPRs and FNRs. FPRs near 50% imply random guessing. Column 5 provides $t$ test p-values, whose null hypothesis is that the probability of a random participant selecting “real” is the same for real and generated images. Column 6 displays the average difference between mean Likert ratings for real and generated images (Diff); a negative value indicates that the generated images were perceived to be more realistic than the actual images. Column 7 presents KS test p-values, whose null hypothesis is that the Likert ratings for real and generated images were drawn from the same distribution. $\uparrow$ and $\downarrow$ denote preferable higher or lower values. The underlined boldface type represents the best performance per dataset. Gray boxes indicate failure to reject the null hypothesis, suggesting that participants viewed real and generated images to be equivalent. $\dagger$ indicates decreased performance compared to no augmentation.


Dataset Aug FPR [%]↑ FNR [%]↑ t test Diff↓ KS test
ChestX-ray14 None 48 58 p=.497 0.12 p=.869
ADA 32† 47† p=.340 0.28† p=.549
APA 34† 56 p=.082 0.24 p=.717
DiffAug 48 58 p=.616 0.16 p=.967
SLIVER07 None 20 34 p=.424 0.68 p<.001
ADA 24 30† p=.748 0.52 p=.001
APA 10† 28 p=.232 0.82 p<.001
DiffAug 34 30† p=.825 0.22 p=.717
MSD None 58 48 p=.543 0.08 p>.999
ADA 66 48 p=.217 -0.04 p>.999
APA 46† 38 p=.587 0.04 p>.999
DiffAug 50† 54 p=.812 0.08 p>.999
ACDC None 34 22 p=.470 0.52 p=.022
ADA 38 30 p=.653 0.38 p=.112
APA 28† 22 p=.707 0.46 p=.003
DiffAug 44 16† p=.015 0.28 p=.112

consistent rankings among all ImageNet-based FDs, aligning closely with human judgment. In contrast, RadImageNet-based FDs exhibited volatility and diverged from human assessment. DiffAugment was the best-performing form of augmentation, generating hyper-realistic images on two datasets.


ImageNet extractors aligned with human judgment. ImageNet-based FDs were consistent with one another in ranking generative models, except for on the MSD dataset, where human rankings were also inconsistent (see Figure 1). This consistency was reinforced by strong correlations between the FDs derived from InceptionV3 and all other ImageNet-based FDs (p<.001). Furthermore, the ImageNet-based FDs aligned with expert judgment (see Figure 1). On the ChestX-ray14 dataset, ImageNet-based FDs ranked generative models in the same order as the radiologists: DiffAugment, ADA, no augmentation, and APA. Particularly promising was the SwAV-based FD, which significantly correlated with human perception across all models (Pearson coefficient of .475 with the difference in average Likert ratings, p=.064).


Table 2. ImageNet-based rFDs. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3-9 display the rFDs computed using seven ImageNet-trained feature extractors: InceptionV3 (Incept), ResNet50 (Res), InceptionResNetV2 (IRV2), DenseNet121 (Dense), SwAV, DINO, and Swin Transformer (Swin). $\downarrow$ indicates that a lower value is preferable. The underlined boldface type represents the best performance per dataset. $^\dagger$ denotes decreased performance compared to no augmentation.


Dataset Aug Incept Res IRV2 Dense SwAV DINO Swin
ChestX-ray14 None 12.53 279.00 701.00 20.80 53.50 60.43 34.00
ADA 8.90 237.00 576.00 15.55 33.00 37.81 26.36
APA 17.58† 334.00 1004.50 39.85 66.00 82.23 54.21
DiffAug 7.68 146.00 441.00 13.25 25.00 34.51 22.79
SLIVER07 None 1.48 7.90 12.98 2.59 8.28 6.12 6.07
ADA 1.24 7.35 11.71 1.95 6.86 4.57 6.22
APA 1.37 7.33 11.96 2.36 7.79 5.59 5.43
DiffAug 0.78 3.25 5.99 1.24 5.26 3.07 4.77
MSD None 37.32 63.13 61.18 170.38 142.50 108.39 504.47
ADA 36.84 62.50 58.88 141.63 305.00 121.90† 308.59
APA 43.63 70.00† 81.76 145.13 122.50 126.47 196.65
DiffAug 46.32 125.50† 79.88 170.38 825.00 138.11† 175.12
ACDC None 49.67 86.48 121.14 87.46 118.00 140.15 111.07
ADA 20.99 31.66 49.94 35.95 76.40 65.52 61.49
APA 31.15 54.35 76.47 56.68 90.60 87.69 72.10
DiffAug 15.87 23.58 40.60 27.20 71.00 50.47 47.23

RadImageNet extractors were volatile. RadImageNet-based FDs produced inconsistent rankings that were contrary to expert judgment. Notably, on the SLIVER07 dataset, RadImageNet-based FDs ranked DiffAugment as one of the poorest-performing models. However, all measures of human judgment identified DiffAugment as the best-performing model (see Figure 1). This discrepancy is especially concerning considering RadImageNet’s inclusion of approximately 300,000 CT scans. On the ChestX-ray14 dataset, the FD derived from a RadImageNet-trained InceptionV3 network ranked the model without augmentation as the best performing. In contrast, a thoracic radiologist observed that both the APA and no augmentation models generated multiple radiographs with obviously distorted anatomy. Conversely, the weaknesses of the DiffAugment and ADA models were more subtle, with mistakes in support devices and central lines.


Table 3. RadImageNet-based rFDs. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3-6 display the rFDs computed using four RadImageNet-trained feature extractors: InceptionV3, ResNet50, InceptionResNetV2 (IRV2), and DenseNet121. $\downarrow$ indicates that a lower value is preferable. The underlined boldface type represents the best performance per dataset. $^\dagger$ denotes decreased performance compared to no augmentation.


Dataset Aug InceptionV3 ResNet50 IRV2 DenseNet121
ChestX-ray14 None 140.00 75.00 80.00 40.00
ADA 660.00 135.00† 190.00 80.00†
APA 280.00† 65.00 80.00 80.00†
DiffAug 280.00† 50.00 90.00 30.00
SLIVER07 None 3.67 3.14 6.00 4.33
ADA 1.89 1.86 3.75 2.33
APA 2.22 1.86 3.00 2.67
DiffAug 4.67 3.29 5.50 4.67†
MSD None 53.00 32.50 32.50 40.00
ADA 36.00 27.50 37.50† 60.00†
APA 54.00† 32.50 40.00 40.00
DiffAug 1551.00† 1105.00† 350.00† 615.00†
ACDC None 26.64 19.00 20.33 32.50
ADA 10.18 9.25 9.67 13.00
APA 14.09 8.75 11.67 17.50
DiffAug 12.09 15.25 29.60 10.50

APA and ADA demonstrated varied performance. Although APA was designed to enhance image quality in limited data domains such as medical imaging, it unexpectedly reduced the perceptual quality of the generated images (p=.012), leading to an 18% reduction in the FPR on average. While ADA outperformed APA (p=.050), it did not significantly affect participants’ ability to differentiate real from generated images (p>.999). Despite both techniques underperforming in the VTTs, they improved the rFDs for the SLIVER07 (p=.025 ADA, p=.016 APA) and ACDC (p=.003 ADA, p=.004 APA) datasets.



Fig. 1. Model rankings listed in descending order of performance. FDs are split by dataset and architecture: InceptionV3 (Incept), ResNet50 (Res), InceptionResNetV2 (IRV2), DenseNet121 (Dense), SwAV, DINO, and Swin Transformer (Swin). Human rankings are KS test p-values (KS), average difference in mean Likert scores (Diff), and FPRs.


DiffAugment created hyper-realistic images. DiffAugment outperformed the other augmentation techniques across all FDs (p=.092 ADA, p=.059 APA). This result held for each dataset except MSD, where model training had diverged. DiffAugment was the only form of augmentation to significantly enhance perceptual quality (p=.001), resulting in an 81% reduction in the average difference between mean Likert ratings. Participants rated synthetic images from the DiffAugment-based models as more realistic than the real images on both the ChestX-ray14 and MSD datasets. Additionally, Likert ratings for real and generated images from all DiffAugment-based models did not differ significantly (p=.793), suggesting that participants perceived them as equivalent.


4 Conclusion


Our study challenges prevailing assumptions by providing novel evidence that medical image-trained feature extractors do not inherently improve FDs for synthetic medical imaging evaluation; instead, they may compromise metric consistency and alignment with human judgment, even on in-domain data. The emerging practice of employing privately trained, medical image-based feature extractors to benchmark new generative algorithms is concerning, as it allows algorithm designers to shape evaluation metrics, potentially introducing biases. Additionally, the efficacy of these FDs often remains inadequately evaluated and unverifiable. We advocate for the comprehensive evaluation and public release of all FDs used in benchmarking generative medical imaging models.


Acknowledgments Research reported in this publication was supported in part by resources of the Image Guided Cancer Therapy Research Program at The University of Texas MD Anderson Cancer Center, by a generous gift from the Apache Corporation, by the National Institutes of Health/NCI under award number P30 CA016672, and by the Tumor Measurement Initiative through the MD Anderson Strategic Initiative Development Program (STRIDE). We thank the NIH Clinical Center for the ChestX-ray14 dataset, the StudioGAN authors [33] for their FD implementations, Vikram Haheshri and Oleg Igoshin for the discussion that led to the hypothesis testing contribution, Erica Goodoff - Senior Scientific Editor in the Research Medical Library at The University of Texas MD Anderson Cancer Center - for editing this article, and Xinyue Zhang and Caleb O’Connor for their comments when reviewing the manuscript. GPT4 was used in the proofreading stage of this manuscript.


References


Appendix


Table 4. ImageNet-based FDs. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3-9 display the FDs computed using seven ImageNet-trained feature extractors: InceptionV3 (Incept), ResNet50 (Res), InceptionResNetV2 (IRV2), DenseNet121 (Dense), SwAV, DINO, and Swin Transformer (Swin). $\downarrow$ indicates that a lower value is preferable. The underlined boldface type represents the best performance per dataset. $^\dagger$ denotes decreased performance compared to no augmentation.


Dataset Aug Incept Res IRV2 Dense SwAV DINO Swin
ChestX-ray14 None 5.01 4.16 2.79 14.02 1.07 299.13 4.76
ADA 3.56 3.11 2.37 11.52 0.66 187.16 3.69
APA 7.03† 7.97 3.34† 20.09 1.32† 407.05 7.49†
DiffAug 3.07 2.65 1.46 8.82 0.50 170.84 3.19
SLIVER07 None 8.72 9.04 4.74 14.02 2.40 640.32 30.37
ADA 7.34 6.79 4.41 12.65 1.99 478.61 31.09†
APA 8.07 8.22 4.40 12.92 2.26 585.60 27.17
DiffAug 4.62 4.33 1.95 6.47 1.53 321.32 23.83
MSD None 7.09 5.05 10.40 9.43 0.57 422.74 85.76
ADA 7.00 5.00 10.01 11.33 1.22 475.41 52.46
APA 8.29 5.60† 13.90† 11.61 0.49 493.23 33.43
DiffAug 8.80† 10.04† 13.58 13.63 3.30† 538.61† 29.77
ACDC None 67.05 51.60 73.51 87.22 5.90 2888.53 127.74
ADA 28.34 21.21 26.91 35.96 3.82 1350.34 70.71
APA 42.05 33.44 46.20 55.06 4.53 1807.38 82.91
DiffAug 21.42 16.05 20.04 29.23 3.55 1040.24 54.31

Table 5. RadImageNet-based FDs. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3-6 display the FDs computed using four RadImageNet-trained feature extractors: InceptionV3, ResNet50, InceptionResNetV2 (IRV2), and DenseNet121. $\downarrow$ indicates that a lower value is preferable. The underlined boldface type represents the best performance per dataset. $^\dagger$ denotes decreased performance compared to no augmentation.


Dataset Aug InceptionV3 ResNet50 IRV2 DenseNet121
ChestX-ray14 None 0.03 0.15 0.08 0.04
ADA 0.13† 0.27† 0.19† 0.08†
APA 0.06† 0.13 0.08 0.08†
DiffAug 0.06† 0.10 0.09† 0.03
SLIVER07 None 0.07 0.22 0.24 0.13
ADA 0.03 0.13 0.15 0.07
APA 0.04 0.13 0.12 0.08
DiffAug 0.08† 0.23 0.22 0.14
MSD None 0.05 0.13 0.13 0.08
ADA 0.04 0.11 0.15† 0.13†
APA 0.05 0.13 0.16† 0.08
DiffAug 1.55† 4.42† 1.40† 1.23†
ACDC None 0.29 0.76 0.61 0.65
ADA 0.11 0.37 0.29 0.26
APA 0.15 0.35 0.35 0.35
DiffAug 0.13 0.61 0.29 0.21