Feature Extraction for Generative Medical Imaging Evaluation: New Evidence Against an Evolving Trend
Abstract. Fréchet Inception Distance (FID) is a widely used metric for assessing synthetic image quality. It relies on an ImageNet-based feature extractor, making its applicability to medical imaging unclear. A recent trend is to adapt FID to medical imaging through feature extractors trained on medical images. Our study challenges this practice by demonstrating that ImageNet-based extractors are more consistent and aligned with human judgment than their RadImageNet counterparts. We evaluated sixteen StyleGAN2 networks across four medical imaging modalities and four data augmentation techniques with Fréchet distances (FDs) computed using eleven ImageNet- or RadImageNet-trained feature extractors. Comparison with human judgment via visual Turing tests revealed that ImageNet-based extractors produced rankings consistent with human judgment, with the FD derived from the ImageNet-trained SwAV extractor significantly correlating with expert evaluations. In contrast, RadImageNet-based rankings were volatile and inconsistent with human judgment. Our findings challenge prevailing assumptions, providing novel evidence that medical image-trained feature extractors do not inherently improve FDs and can even compromise their reliability. Our code is available at https://github.com/mckellwoodland/fid-med-eval.
Keywords: Generative Modeling · Fréchet Inception Distance.
1 Introduction
Fréchet Inception Distance (FID) is the most commonly used metric for evaluating synthetic image quality [1]. It quantifies the Fréchet distance (FD) between two Gaussian distributions fitted to embeddings of real and generated images. These embeddings are typically extracted from the penultimate layer of an InceptionV3 network trained on ImageNet. FID's utility has been demonstrated through its correlation with human judgment [2], sensitivity to distortions [1], capability to detect overfitting [3], and relative sample efficiency [3]. Nonetheless, the metric has faced criticism, including that the InceptionV3 network may only embed information relevant to ImageNet class discrimination [4,5].
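For concreteness, the following is a minimal sketch of this feature-extraction step, assuming a recent torchvision InceptionV3 whose final classification layer is replaced by an identity mapping so that the 2048-dimensional pooled (penultimate-layer) features are returned. Preprocessing details vary across FID implementations; this is illustrative only, not the reference implementation.

```python
import torch
from torchvision import models, transforms

# Obtain penultimate-layer ("pool") embeddings from an ImageNet-trained InceptionV3
# by dropping the 1000-way classifier. Assumes torchvision >= 0.13.
extractor = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
extractor.fc = torch.nn.Identity()  # keep the 2048-d pooled features
extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),  # InceptionV3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(batch: torch.Tensor) -> torch.Tensor:
    """Map a preprocessed batch (N, 3, 299, 299) to (N, 2048) feature vectors."""
    return extractor(batch)
```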
Three approaches exist for adapting FID to medical imaging. The first involves using an InceptionV3 extractor trained on a large, publicly available medical dataset, such as RadImageNet, a database containing 1.35 million annotated computed tomography (CT), magnetic resonance imaging (MRI), and ultrasonography exams [6,7]. While a RadImageNet-based FD considers medically relevant features, its efficacy remains largely unexplored. One potential bias is that networks trained for disease detection may focus too heavily on small, localized regions [8] to effectively evaluate an entire image's quality. Additionally, RadImageNet-based FDs may not generalize to new medical modalities [7] or patient populations. Our novel comparison of RadImageNet-based FDs to human judgment revealed discrepancies, even on in-domain abdominal CT data.
The second approach utilizes self-supervised networks for feature extraction [9]. These networks are encouraging as they create transferable and robust representations [10], including on medical images [4]. Despite their promise, the lack of publicly available, self-supervised models trained on extensive medical imaging datasets has hindered their application. Our study is the first to employ self-supervised extractors for synthetic medical image evaluation. We find a significant correlation between an FD derived from an ImageNet-trained SwAV network (FSD) and medical experts' appraisal of image realism, highlighting the potential of self-supervision for advancing generative medical imaging evaluation.
The third approach employs a feature extractor trained on the dataset used to train the generative imaging model [11,12,13]. While advantageous for domain coherence, the algorithm designer essentially creates the metric used to evaluate their algorithm, resulting in unquantified bias [2]. Moreover, the private and varied nature of these feature extractors poses challenges for reproducibility and benchmarking. Given these limitations, our study focused on publicly available feature extractors.
Our study offers a novel comparison of generative model rankings created by ImageNet and RadImageNet-trained feature extractors with expert judgment. Our key contributions are:
- Introducing a novel method for evaluating visual Turing tests (VTTs) via hypothesis testing, providing an unbiased measure of participant perception of synthetic image realism.
2 Methods
2.1 Generative Modeling
Four medical imaging datasets were used for generative modeling: the Segmentation of the Liver Competition 2007 (SLIVER07) dataset with 20 liver CT studies [14], the ChestX-ray14 dataset with 112,120 chest X-rays [15], the brain tumor dataset from the Medical Segmentation Decathlon (MSD) with 750 brain MRI studies [16,17], and the Automated Cardiac Diagnosis Challenge (ACDC) dataset with 150 cardiac cine-MRIs [18]. Multi-dimensional images were converted to two dimensions by extracting axial slices and excluding slices with less than 15% nonzero pixels.
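As an illustration, a minimal sketch of this slice-filtering step is given below, assuming volumes are provided as NumPy arrays with the axial dimension last; the function name and axis convention are assumptions made for illustration.

```python
import numpy as np

def extract_axial_slices(volume: np.ndarray, min_nonzero_frac: float = 0.15) -> list:
    """Split a 3D volume (H, W, D) into axial slices, keeping only slices whose
    fraction of nonzero pixels is at least `min_nonzero_frac` (15%, as described above)."""
    kept = []
    for k in range(volume.shape[-1]):
        axial = volume[..., k]
        nonzero_frac = np.count_nonzero(axial) / axial.size
        if nonzero_frac >= min_nonzero_frac:
            kept.append(axial)
    return kept
```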
To enable a comparison of synthetic quality, four StyleGAN2 [19] models were trained per dataset, using either adaptive discriminator augmentation (ADA) [20], differentiable augmentation (DiffAugment) [21], adaptive pseudo augmentation (APA) [22], or no augmentation. While all of these data augmentation techniques were created to improve the performance of generative models in limited-data domains, such as medical imaging, we are the first to benchmark them on medical images. Each model was evaluated using the weights obtained at the end of 25,000 kimg (one kimg represents a thousand real images being shown to the discriminator), except for the MSD experiments, which were limited to 5,000 kimg due to training instability. Our code and trained model weights are available at https://github.com/mckellwoodland/fid-med-eval.
2.2 Human Evaluation
Human perception of model quality was assessed with one VTT per model. Each test comprised 20 randomly selected images with an equal number of real and generated images. Participants were asked to identify whether each image was real or generated and rate its realism on a Likert scale from 1 to 3 (1: “Not at all realistic,” 2: “Somewhat realistic,” and 3: “Very realistic”). The tests were administered to five specialists with medical degrees. In addition to the VTTs, three radiologists were shown 35 synthetic radiographs per ChestX-ray14 model and were asked to rank and provide a qualitative assessment of the models.
False positive rate (FPR) and false negative rate (FNR) were used to evaluate the VTTs. The FPRs represent the proportion of generated images that participants considered to be real; the FNRs represent the proportion of real images that participants considered to be generated. FPRs near 50% indicate random guessing.
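A minimal sketch of how these error rates can be computed from one participant's responses is shown below, assuming labels and guesses are encoded as booleans with True meaning "real"; the function and variable names are hypothetical.

```python
import numpy as np

def vtt_rates(is_real: np.ndarray, guessed_real: np.ndarray) -> tuple:
    """Compute (FPR, FNR) for one visual Turing test.

    FPR: fraction of generated images the participant called real.
    FNR: fraction of real images the participant called generated.
    """
    is_real = np.asarray(is_real, dtype=bool)
    guessed_real = np.asarray(guessed_real, dtype=bool)
    fpr = guessed_real[~is_real].mean()    # generated images judged real
    fnr = (~guessed_real)[is_real].mean()  # real images judged generated
    return float(fpr), float(fnr)
```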
One-sided paired $t$ tests were performed on the FPRs with $\alpha{=}.05$ to benchmark the data augmentation techniques. For each VTT, the average Likert ratings of real and generated images were computed per participant. The difference between these average ratings was then computed to compare the perceived realism of real and generated images. Two-sample Kolmogorov-Smirnov (KS) tests were conducted on the Likert ratings of the real and generated images with $\alpha{=}.10$ to determine whether the ratings came from the same distribution; failure to reject the null hypothesis suggests that participants viewed the realism of the generated images as equivalent to that of the real images. We are the first to use the difference in average Likert ratings and the KS test for generative modeling evaluation.
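The following is a minimal sketch of these statistics under the stated significance levels, assuming scipy.stats.ks_2samp and scipy.stats.ttest_rel; the per-participant FPR values shown are hypothetical.

```python
import numpy as np
from scipy import stats

def likert_diff_and_ks(real_ratings, gen_ratings):
    """Per-participant difference in mean Likert ratings (real minus generated),
    plus a two-sample KS test on the two rating samples (compare p to alpha = 0.10)."""
    diff = np.mean(real_ratings) - np.mean(gen_ratings)
    _, ks_p = stats.ks_2samp(real_ratings, gen_ratings)
    return diff, ks_p

# One-sided paired t-test on per-participant FPRs (alpha = 0.05), asking whether an
# augmentation technique yields higher FPRs than no augmentation (values hypothetical).
fpr_aug = np.array([0.48, 0.58, 0.66, 0.50, 0.44])
fpr_none = np.array([0.48, 0.20, 0.58, 0.34, 0.34])
_, p_value = stats.ttest_rel(fpr_aug, fpr_none, alternative="greater")
```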
When taking a VTT, participants may be more likely to select either “real” or “generated” when uncertain. This bias means that the average FPR does not fully capture whether participants could differentiate between real and generated images. To address this challenge, we propose a novel method for evaluating VTTs via hypothesis testing. The method aims to demonstrate that the likelihood of a participant selecting “real” is the same for both real and generated images. For each participant $p$, we define the null hypothesis $\mathbb{P}(p \text{ guesses real} \mid G) = \mathbb{P}(p \text{ guesses real} \mid R)$, where $G$ represents the event that the image is generated and $R$ represents the event that the image is real. We evaluate this hypothesis using a two-sample $t$ test with $\alpha{=}.10$, where the first sample is the participant’s binary predictions for generated images, and the second is their predictions for real images. To evaluate VTTs for multiple participants $P$, we define the null hypothesis $\mathbb{P}(\text{random } p \in P \text{ guesses real} \mid G) = \mathbb{P}(\text{random } p \in P \text{ guesses real} \mid R)$. We evaluate this hypothesis via a two-sample $t$ test with $\alpha{=}.10$, where the first sample is the FPR and the second is the true positive rate of each participant.
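A minimal sketch of the proposed VTT hypothesis tests is given below, assuming guesses are encoded as 1 for “real” and 0 for “generated”; scipy.stats.ttest_ind is assumed and the function names are hypothetical.

```python
import numpy as np
from scipy import stats

ALPHA = 0.10

def single_participant_test(guesses_on_generated, guesses_on_real):
    """Two-sample t-test comparing a participant's binary 'real' guesses (1/0) on
    generated vs. real images. Failing to reject the null suggests the participant
    could not distinguish the two sets."""
    _, p = stats.ttest_ind(guesses_on_generated, guesses_on_real)
    return p, p >= ALPHA  # (p-value, null hypothesis not rejected)

def group_test(fprs, tprs):
    """Two-sample t-test comparing participants' FPRs (generated images called real)
    with their true positive rates (real images called real)."""
    _, p = stats.ttest_ind(fprs, tprs)
    return p, p >= ALPHA
```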
2.3 Fréchet Distances
Quantitative evaluation of synthetic image quality was performed by calculating the FD $d(\Sigma_{1},\Sigma_{2},\mu_{1},\mu_{2})^{2}=\left\lVert\mu_{1}-\mu_{2}\right\rVert^{2}+\mathrm{tr}(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}\Sigma_{2})^{\frac{1}{2}})$ [23] between two multivariate Gaussians $(\Sigma_{R},\mu_{R})$ and $(\Sigma_{G},\mu_{G})$ fitted to real and generated features extracted from the penultimate layer of eleven backbone networks: InceptionV3 [24], ResNet50 [25], InceptionResNetV2 [26], and DenseNet121 [27], each trained separately on both ImageNet [28] and RadImageNet [6], along with SwAV [29], DINO [30], and a Swin Transformer [31] trained on ImageNet. The first four networks were included to compare all publicly available RadImageNet models to their ImageNet equivalents. SwAV and DINO were included to evaluate the impact of self-supervision, as self-supervised representations have demonstrated superior transferability to new domains [10] and richer embeddings on medical images [4]. Finally, a Swin Transformer was included, as transformers have been shown to create transferable and robust representations [32]. We are the first to use self-supervised and transformer architectures with FD for generative medical imaging evaluation. As the scale of FDs varies substantially by feature extractor, relative FDs (rFDs), $\mathrm{rFD}=\frac{d(\Sigma_{R},\Sigma_{G},\mu_{R},\mu_{G})^{2}}{d(\Sigma_{R_{1}},\Sigma_{R_{2}},\mu_{R_{1}},\mu_{R_{2}})^{2}}$, were computed with a random split of the real features into two Gaussian distributions $(\Sigma_{R_{1}},\mu_{R_{1}})$ and $(\Sigma_{R_{2}},\mu_{R_{2}})$. Paired $t$ tests with $\alpha{=}0.05$ were conducted on the FDs to benchmark the data augmentation techniques. The Pearson correlation coefficient with $\alpha{=}0.05$ was used to quantify the correspondence between the FDs and human judgment and the correspondence between individual FDs. We are the first to consider whether medical image-based FDs are correlated with human judgment.
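For reference, a minimal sketch of the FD and rFD computations from (N, D) feature matrices is shown below, assuming NumPy and scipy.linalg.sqrtm; this is illustrative and not the released implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Squared Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(cov_sqrt):  # drop tiny imaginary parts from numerical error
        cov_sqrt = cov_sqrt.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))

def relative_fd(real_feats: np.ndarray, gen_feats: np.ndarray, seed: int = 0) -> float:
    """rFD: FD(real, generated) normalized by the FD between two random halves of the real features."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(real_feats))
    half = len(real_feats) // 2
    baseline = frechet_distance(real_feats[perm[:half]], real_feats[perm[half:]])
    return frechet_distance(real_feats, gen_feats) / baseline
```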
3 Results
Table 1 summarizes the overall results of the VTTs, with detailed individual participant outcomes available at https://github.com/mckellwoodland/fid-med-eval. The rFDs based on ImageNet and RadImageNet are outlined in Tables 2 and 3, while the FDs can be found in Tables 4 and 5 in the Appendix. Model rankings based on individual metrics are illustrated in Figure 1.
Table 1. VTT results. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3 and 4 showcase the average FPRs and FNRs. FPRs near 50% imply random guessing. Column 5 provides $t$ test p-values, whose null hypothesis is that the probability of a random participant selecting “real” is the same for real and generated images. Column 6 displays the average difference between mean Likert ratings for real and generated images (Diff); a negative value indicates that the generated images were perceived to be more realistic than the actual images. Column 7 presents KS test p-values, whose null hypothesis is that the Likert ratings for real and generated images were drawn from the same distribution. $\uparrow$ and $\downarrow$ denote preferable higher or lower values. The underlined boldface type represents the best performance per dataset. Gray boxes indicate failure to reject the null hypothesis, suggesting that participants viewed real and generated images to be equivalent. $\dagger$ indicates decreased performance compared to no augmentation.
| Dataset | Aug | FPR [%] ↑ | FNR [%] ↑ | t test | Diff ↓ | KS test |
|---|---|---|---|---|---|---|
| ChestX-ray14 | None | 48 | 58 | p=.497 | 0.12 | p=.869 |
| ChestX-ray14 | ADA | 32† | 47† | p=.340 | 0.28† | p=.549 |
| ChestX-ray14 | APA | 34† | 56 | p=.082 | 0.24 | p=.717 |
| ChestX-ray14 | DiffAug | 48 | 58 | p=.616 | 0.16 | p=.967 |
| SLIVER07 | None | 20 | 34 | p=.424 | 0.68 | p<.001 |
| SLIVER07 | ADA | 24 | 30† | p=.748 | 0.52 | p=.001 |
| SLIVER07 | APA | 10† | 28 | p=.232 | 0.82 | p<.001 |
| SLIVER07 | DiffAug | 34 | 30† | p=.825 | 0.22 | p=.717 |
| MSD | None | 58 | 48 | p=.543 | 0.08 | p>.999 |
| MSD | ADA | 66 | 48 | p=.217 | -0.04 | p>.999 |
| MSD | APA | 46† | 38 | p=.587 | 0.04 | p>.999 |
| MSD | DiffAug | 50† | 54 | p=.812 | 0.08 | p>.999 |
| ACDC | None | 34 | 22 | p=.470 | 0.52 | p=.022 |
| ACDC | ADA | 38 | 30 | p=.653 | 0.38 | p=.112 |
| ACDC | APA | 28† | 22 | p=.707 | 0.46 | p=.003 |
| ACDC | DiffAug | 44 | 16† | p=.015 | 0.28 | p=.112 |
Our analysis revealed consistent rankings among all ImageNet-based FDs, aligning closely with human judgment. In contrast, RadImageNet-based FDs exhibited volatility and diverged from human assessment. DiffAugment was the best-performing form of augmentation, generating hyper-realistic images on two datasets.
ImageNet extractors aligned with human judgment. ImageNet-based FDs were consistent with one another in ranking generative models, except on the MSD dataset, where human rankings were also inconsistent (see Figure 1). This consistency was reinforced by strong correlations between the FD derived from InceptionV3 and all other ImageNet-based FDs (p<.001). Furthermore, the ImageNet-based FDs aligned with expert judgment (see Figure 1). On the ChestX-ray14 dataset, ImageNet-based FDs ranked generative models in the same order as the radiologists: DiffAugment, ADA, no augmentation, and APA. Particularly promising was the SwAV-based FD, which significantly correlated with human perception across all models (Pearson coefficient of .475 with the difference in average Likert ratings, p=.064).
Table 2. ImageNet-based rFDs. Column 1 lists each tested dataset, while Column 2 specifies the augmentation technique (Aug) utilized during model training: no augmentation (None), ADA, APA, and DiffAugment (DiffAug). Columns 3-9 display the rFDs computed using seven ImageNet-trained feature extractors: InceptionV3 (Incept), ResNet50 (Res), InceptionResNetV2 (IRV2), DenseNet121 (Dense), SwAV, DINO, and Swin Transformer (Swin). $\downarrow$ indicates that a lower value is preferable. The underlined boldface type represents the best performance per dataset. $^\dagger$ denotes decreased performance compared to no augmentation.