[Paper Translation] Evaluating the Performance of StyleGAN2-ADA on Medical Images


Original paper: https://arxiv.org/pdf/2210.03786v1


Evaluating the Performance of StyleGAN2-ADA on Medical Images

Abstract. Although generative adversarial networks (GANs) have shown promise in medical imaging, they have four main limitations that impede their utility: computational cost, data requirements, reliable evaluation measures, and training complexity. Our work investigates each of these obstacles in a novel application of StyleGAN2-ADA to high-resolution medical imaging datasets. Our dataset comprises liver-containing axial slices from non-contrast and contrast-enhanced computed tomography (CT) scans. Additionally, we utilized four public datasets composed of various imaging modalities. We trained a StyleGAN2 network with transfer learning (from the Flickr-Faces-HQ dataset) and data augmentation (horizontal flipping and adaptive discriminator augmentation). The network's generative quality was measured quantitatively with the Fréchet Inception Distance (FID) and qualitatively with a visual Turing test given to seven radiologists and radiation oncologists.

The StyleGAN2-ADA network achieved a FID of 5.22 ( $\pm$ 0.17) on our liver CT dataset. It also set new record FIDs of 10.78, 3.52, 21.17, and 5.39 on the publicly available SLIVER07, ChestX-ray14, ACDC, and Medical Segmentation Decathlon (brain tumors) datasets. In the visual Turing test, the clinicians rated generated images as real 42% of the time, approaching random guessing. Our computational ablation study revealed that transfer learning and data augmentation stabilize training and improve the perceptual quality of the generated images. We observed the FID to be consistent with human perceptual evaluation of medical images. Finally, our work found that StyleGAN2-ADA consistently produces high-quality results without hyperparameter searches or retraining.

Keywords: StyleGAN2-ADA · Fréchet Inception Distance · Visual Turing Test · Data Augmentation · Transfer Learning

1 Introduction

Recently, generative adversarial networks (GANs) have shown promise in many medical imaging tasks, including data augmentation in computer-aided diagnosis [21], image segmentation [29], image reconstruction [17], treatment planning [1], image translation [10], and anomaly detection [23]. Despite their potential in medical imaging, GANs have several drawbacks that impede both their capabilities and utilization in the medical field. These obstacles include computational cost, data requirements, flawed measures of assessment, and training complexity.

GANs are computationally expensive. The original StyleGAN2 project took 51.06 GPU years to create, 0.23 of which were used for training the Flickr-Faces-HQ (FFHQ) weights used in our paper [15]. Despite being the state-of-the-art generative model for high-resolution images, StyleGAN2 is often not used in medical imaging literature due to its expense [24]. If it is used, images are brought to lower resolutions to offset the cost [20,22]. While StyleGAN [14] (the predecessor to StyleGAN2) has been applied to high-resolution medical images [7], we believe our paper is the first rigorous evaluation of StyleGAN2 on multiple high-resolution medical imaging datasets.

At high resolutions, GANs require hundreds of thousands of images to train effectively, a requirement that is extremely challenging to satisfy in the medical field. With limited data, the GAN's discriminator overfits on the training examples, obstructing the GAN's ability to converge. Adaptive discriminator augmentation (ADA) was designed to reduce discriminator overfitting through a wide range of data augmentations that do not "leak" to the generated distribution. When applied to a histopathology dataset, ADA improved the FID by 84% [12]. In our paper, we perform a computational ablation study that examines how ADA and transfer learning affect performance on medical images.

One of the greatest challenges in GANs is constructing robust quantitative evaluation measures [18]. The Fréchet Inception Distance (FID) [9] is the standard for state-of-the-art evaluation of generative modeling in natural imaging. Its calculation relies on an Inception network trained on ImageNet, a dataset that contains no medical images [6]. As such, a common assumption in related literature is that the FID is not applicable to medical images. We revisit this assumption by testing the correlation between the FID and human perceptual evaluation on medical images.

GANs are notoriously challenging to train. They have numerous hyperparameters and suffer from training instability. In a large empirical evaluation of various GANs, Lucic et al. [18] found that GAN training is extremely sensitive to hyperparameter settings. A separate study illustrated this sensitivity by performing 1,500 hyperparameter searches on three unique medical imaging datasets with various GAN architectures. The authors found that few models produced meaningful images; even fewer achieved reasonable metric evaluations [26]. Neither of these studies examined StyleGAN2. Our work is unique in that we test the stability of StyleGAN2, along with its ability to generate quality images without a hyperparameter search.

The main contributions of our research are as follows:

2 Methods

2.1 Data

We used the 97 non-contrast and 108 contrast-enhanced abdominal computed tomography (CT) scans presented in [2]. To accentuate the liver, the data was windowed to a level of 50 and a width of 350, consistent with the preset values for viewing the liver in a commercial treatment planning system (RayStation v10, RaySearch Laboratories, Stockholm, Sweden). All axial slices that contained no liver information were discarded. Voxel values were mapped to the range [0, 255], and each axial slice was converted to a PNG image. In all, our training dataset contained 10,600 512x512 images. Three randomly sampled images from our training dataset are shown in the first row of Figure 1. We used an additional 143,345 512x512 images for one experiment in our ablation study. These images were obtained by applying the aforementioned preprocessing steps to 3,029 abdominal CT scans (301 patients) that were retrospectively acquired under an IRB-approved protocol.

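The windowing and intensity mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration; the function name and the 8-bit rounding convention are our own choices, not taken from the paper:

```python
import numpy as np

def window_ct(hu, level=50.0, width=350.0):
    """Clip Hounsfield units to a display window (level 50, width 350,
    matching the liver preset described above) and map the result
    linearly to the 8-bit range [0, 255]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    clipped = np.clip(hu.astype(np.float64), lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)

# Each windowed axial slice could then be saved as a PNG, e.g. with
# Pillow: Image.fromarray(window_ct(slice_hu)).save("slice.png")
```

With level 50 and width 350, everything below -125 HU maps to 0 and everything above 225 HU maps to 255, which is what accentuates the soft-tissue contrast of the liver.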
Separately, our methods were applied to several publicly available datasets. For the “Segmentation of the Liver Competition 2007” (SLIVER07) dataset $^5$ [8], we used the 20 scans available in the training dataset and converted each slice to a PNG image without any further preprocessing. In total, this dataset consisted of 4,159 512x512 images. To our knowledge, the previous best FID (29.06) on this dataset was achieved by Skandarani et al. [26] using the StyleGAN network.

The ChestX-ray14 dataset $^6$ [28] consists of 112,120 1024x1024 Chest X-ray images in PNG format. The previous best FID on the ChestX-ray14 dataset of 8.02 was achieved using a Progressive Growing GAN [24]. No preprocessing on this dataset was performed. The Automated Cardiac Diagnosis Challenge (ACDC) dataset [3] consists of 150 cardiac cine-magnetic resonance imaging (MRI) exams. We used the training dataset, which consists of 100 exams. The images were rescaled to the range [0, 255] using SimpleITK [16] and padded with zeros. Each slice was then converted to a 2D PNG image. In total, this dataset consisted of 1,902 512x512 images. The previous best FID on the ACDC training dataset (24.74) was achieved with StyleGAN [26].

Additionally, we applied StyleGAN2-ADA to a dataset whose FID had not been previously evaluated: the brain tumor data from the Medical Segmentation Decathlon $^8$ [25], which contains 750 4D MRI volumes. The gadolinium-enhanced T1-weighted 3D images were extracted and windowed to the range [0, 255] using SimpleITK. Slices were converted to 2D PNG images. This dataset consists of 103,030 256x256 PNG images.

2.2 Generative Modeling

Due to its state-of-the-art performance on high-resolution images, we used a StyleGAN2 network as our generative model [15]. For our experiments, we utilized the StyleGAN2 configuration of the official StyleGAN3 repository $^{9}$ [13]. We used the default parameters provided by the implementation, with the exception of changing $\beta_{0}$ to 0.9 in the Adam optimizer and disabling mixed precision. We did not perform a hyperparameter search. We explored the effects of transfer learning and data augmentation in an ablation study with the following experimental designs:

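For orientation, a run with this configuration could be launched roughly as follows. This is a sketch, not the authors' exact command: the flag names follow the official StyleGAN3 repository's train.py, while the output directory, dataset archive, and resume pickle are placeholders.

```shell
# Hypothetical StyleGAN2-ADA launch via the StyleGAN3 repo's train.py.
# --cfg=stylegan2 selects the StyleGAN2 configuration; --mirror=1
# enables horizontal flips; --aug=ada enables adaptive discriminator
# augmentation; --resume points at pretrained FFHQ weights.
python train.py --outdir=training-runs --cfg=stylegan2 \
    --data=datasets/liver-ct-512x512.zip --gpus=8 \
    --mirror=1 --aug=ada --resume=ffhq-512x512.pkl
```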
Each of these experiments was performed on our liver CT training dataset. A variation of Experiment 1 was also performed where 143,345 liver images were added to the training dataset. Furthermore, Experiment 4 was performed on the four public datasets. Each experiment was performed on a DGX with eight 40GB A100 GPUs. DGXs were accessed using the XNAT platform [19]. Experiments ran for 6,250 ticks with metrics calculated and weights saved every 50 ticks. Each experiment took approximately 1.5, 4, and 7 days to complete for 256x256, 512x512, and 1024x1024 sized datasets, respectively. We repeated each experiment five times to test algorithm stability.

2.3 Evaluation Measures

Fréchet Inception Distance The FID is the standard for state-of-the-art GAN evaluation in natural imaging. It is the Fréchet distance between two multivariate Gaussians constructed from representations extracted from the coding layer of an Inception network that was pretrained on ImageNet [9]. Several advantages of the FID include its ability to distinguish generated from real samples, agreement with human perceptual judgements, sensitivity to distortions, and computational and sample efficiency [4,9]. As such, we used the FID as our quantitative metric. For each run, we reported the best FID achieved during training. We used the model weights associated with each best FID for further qualitative analysis. For statistical testing, we used permutation tests with $\alpha=0.05$.

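Given feature vectors for real and generated images (the actual FID extracts these from the coding layer of an ImageNet-pretrained Inception network), the Fréchet distance between the two fitted Gaussians can be computed with NumPy alone. The helpers below are our own sketch of that formula, not the paper's code:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}), evaluated via
    the similar symmetric form (S1^{1/2} S2 S1^{1/2})^{1/2}."""
    diff = mu1 - mu2
    s1h = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1h @ sigma2 @ s1h)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def fid_from_features(feats_real, feats_gen):
    """Fit a Gaussian (mean, covariance) to each feature matrix
    (n_samples x n_features) and return the Fréchet distance."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu1, s1, mu2, s2)
```

Because $S_1 S_2$ is similar to the symmetric matrix $S_1^{1/2} S_2 S_1^{1/2}$, the two share eigenvalues, so the trace of the matrix square root can be computed from the symmetric form with an ordinary eigendecomposition.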
Because ImageNet does not contain medical images, prior publications have argued that the FID is not applicable to medical imaging [5,11,27]. As such, they substitute the Inception network with their own encoding networks. This trend has several limitations. First, the FID is only consistent as a metric inasmuch as the same encoding model is used. By using a new model, the reported distance can no longer be considered in the context of prior work that utilizes the FID. Second, the algorithm designer is formulating their own evaluation metric, which will likely introduce unquantified bias into the presented results. Due to these limitations, we use the original definition of the FID for our calculations.

Visual Turing Tests Because the applicability of the FID to medical imaging is not well understood, our first visual Turing test evaluated the correlation between the FID and human perception on medical images. The test was administered in a Google Form with four sections (created in random order), one per experiment. Each section contained 40 randomly shuffled images, 20 real and 20 generated. All images were randomly selected and only appeared once in the test. The test was given to five participants with a medical physics background who were not familiar with the images. We evaluated the test with the false positive rate (FPR) and false negative rate (FNR).

The purpose of the second visual Turing test was to rigorously validate the perceptual quality of the images generated by the pretrained StyleGAN2-ADA model on our dataset. This test consisted of 50 real and 50 generated images randomly sampled and shuffled. Each section contained one image, a question asking the participant if the image was real or fake, and a Likert scale assessing how realistic the image was. The Likert scale was between 1 (fake) and 5 (real). The test was given to seven radiologists or radiation oncologists with an average of 10 years of radiological experience. The results of the Turing test were evaluated with precision, recall, accuracy, FPR, and FNR metrics. Additionally, we computed the average Likert values for both real and generated images. For statistical testing, we used permutation tests with $\alpha=0.10$.

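The evaluation measures for these tests can be computed from the per-image responses as follows. This is a small sketch of our own: "real image" is treated as the positive class, so a generated image rated real counts as a false positive, mirroring the paper's use of the FPR:

```python
def turing_metrics(is_generated, rated_real):
    """Precision, recall, accuracy, FPR, and FNR for a visual Turing
    test, with 'real image' as the positive class."""
    tp = fp = tn = fn = 0
    for gen, real in zip(is_generated, rated_real):
        if real and not gen:
            tp += 1  # real image rated real
        elif real:
            fp += 1  # generated image rated real
        elif gen:
            tn += 1  # generated image rated fake
        else:
            fn += 1  # real image rated fake
    return {
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),
        "accuracy":  (tp + tn) / (tp + fp + tn + fn),
        "fpr":       fp / (fp + tn),  # generated images mistaken for real
        "fnr":       fn / (fn + tp),  # real images mistaken for fake
    }
```

Under this convention, a 50% FPR means the reader can no longer distinguish generated images from real ones better than chance.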
3 Results

On our dataset, the average ( $\pm$ SD) FIDs, n=5, achieved were 10.70 ( $\pm$ 0.72), 7.62 ( $\pm$ 0.35), 7.51 ( $\pm$ 0.89), and 5.22 ( $\pm$ 0.17) for Experiments 1-4, respectively. Both transfer learning and data augmentation were effective tools in mitigating overfitting on limited medical data. Individually, they improved upon the baseline FID by about 30% (95% confidence). An even greater improvement (a 50% decrease in FID) was achieved when transfer learning and augmentations were used in tandem (95% confidence). Data augmentation significantly decreased the generator’s loss and stabilized training, as shown in Figure 3 in the Appendix (95% confidence). Our results show that transfer learning does not need to be performed from a medical imaging dataset to be effective. When Experiment 1 was repeated with the additional 143,345 images, the average ( $\pm$ SD) FID, n=5, attained was 8.45 ( $\pm$ 0.20). This demonstrates that transfer learning and data augmentation, both in conjunction and independently, outperformed a fifteenfold increase in the dataset size.


Fig. 1. The first row contains images from our training dataset. The second and third rows contain images generated by the baseline StyleGAN2 model. The fourth and fifth rows contain images generated by the pretrained StyleGAN2-ADA model with augmentations. All images were randomly selected. The images generated by the pretrained StyleGAN2-ADA model demonstrate reduced noise artifacts, enhanced detail, and superior anatomical accuracy

On the SLIVER07, ChestX-ray14, and ACDC datasets, we lowered the record FIDs from 29.06 to 10.78 (mean $11.99\pm1.57$ ), 8.02 to 3.52 (mean $3.63\pm0.07$ ), and 24.74 to 21.17 (mean $21.43\pm0.32$ ), respectively. For the Medical Segmentation Decathlon (brain tumors) data, we set a new record FID of 5.39 (mean $5.53\pm0.01$ ). These state-of-the-art results indicate that StyleGAN2 has stable performance and can generate quality medical images without a hyperparameter search.

Table 1 shows the results of the multi-model visual Turing test. This table provides empirical evidence that the FID is consistent with human perceptual judgement on medical images: the lower the FID, the higher the average FPR (Pearson correlation of -0.91, 90% confidence). This suggests that as the FID decreases, it becomes increasingly difficult for humans to distinguish between real and generated images. In addition, the FPRs demonstrate that augmentations improved the perceptual quality of the generated images (90% confidence). When data augmentation was combined with transfer learning, the average participant was more likely to say a generated image was real than fake (55% FPR).

Table 1. Average ( $\pm$ SD) results, n=5, of the multi-model visual Turing test. FIDs are associated with the model used to generate the images in the Turing tests.

Experiment                     FID    FPR [%]    FNR [%]
1. Baseline                    10.43  29 (±27)   32 (±21)
2. Pretrained                  7.78   34 (±19)   32 (±18)
3. Augmentation                7.15   49 (±11)   34 (±18)
4. Pretrained + augmentation   5.06   55 (±9)    41 (±11)

Figure 1 displays randomly selected real and generated images from the baseline StyleGAN2 (10.43 FID) and the pretrained StyleGAN2-ADA (5.06 FID) models. Many of the images generated by the baseline StyleGAN2 model contain noise artifacts, especially in the liver. Images generated by the pretrained StyleGAN2-ADA model show reduced noise, enhanced detail, and superior anatomical accuracy. This perceptual improvement substantiates the claim that the FID is applicable to medical images. The Appendix contains auxiliary pretrained StyleGAN2-ADA generated images (Figure 4) and a larger image demonstrating noise artifacts in the baseline StyleGAN2 model (Figure 2).

The results of the Turing test given to clinicians, shown in Table 2, further confirm the high-quality nature of the generated images. Overall, the clinicians classified generated images as real 42% of the time, approaching the equivalent of random guessing. Those who had low FPRs typically had higher FNRs and vice versa (Pearson correlation of -0.71, 90% confidence), indicating a tendency of clinicians to favor either “real” or “fake” when they were unsure. This tendency was likely a factor in the high interobserver variability among the FPRs. Another likely factor was the experience of the clinicians. For the Likert scale, we found that real images achieved an average score of 3.99 ( $\pm$ 1.00) and generated images a score of 3.23 ( $\pm$ 1.21). The overlapping 95% confidence intervals further demonstrate both the challenging nature of the task and the high-quality nature of the generated images.

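The permutation tests behind correlations like the -0.71 above can be sketched as follows. This is our own minimal implementation; the permutation count and seed are arbitrary choices, not values from the paper:

```python
import numpy as np

def pearson_perm_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test of a Pearson correlation: shuffle y
    relative to x and count how often the permuted |r| reaches the
    observed |r|. Returns (r_observed, p_value)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    r_obs = np.corrcoef(x, y)[0, 1]
    hits = sum(
        abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= abs(r_obs)
        for _ in range(n_perm)
    )
    return r_obs, (hits + 1) / (n_perm + 1)  # add-one smoothing
```

Shuffling one variable destroys any pairing between the two, so the permuted correlations approximate the null distribution against which the observed correlation is judged.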
Table 2. Results of the visual Turing test given to clinicians.

Clinician    Precision [%]  Recall [%]  Accuracy [%]  FPR [%]   FNR [%]
1            80             86          82            22        14
2            76             44          65            14        56
3            56             80          58            64        20
4            79             62          73            16        38
5            58             98          64            70        2
6            54             88          56            76        12
7            59             48          57            34        52
Mean (±SD)   66 (±12)       72 (±21)    65 (±10)      42 (±27)  28 (±21)

4 Conclusion

We applied StyleGAN2 to multiple high-resolution medical image datasets. Combined with transfer learning and data augmentation, the architecture achieved state-of-the-art results consistently, without any hyperparameter searches or retraining. The generated images were of sufficient quality that an expert’s ability to tell whether or not an image was generated approached random guessing. Additionally, we found that the “realness” score, based on a 5-point Likert scale, differed between the generated and real images by less than the standard deviation between clinicians. Across a variety of medical imaging modalities, we were able to set new record FID scores on four publicly-available datasets.

Furthermore, our research provided empirical evidence that the FID is consistent with human perceptual judgement on medical images. A multi-model visual Turing test revealed that as the FID improved, the participants perceived artificially generated images as real more frequently. Qualitatively, we saw an appreciable improvement in the fidelity of the generated images as the FID improved from 10.43 to 5.06. From these results, we concluded that the FID is indeed an appropriate metric for medical images.

Acknowledgements. This work was supported by the Tumor Measurement Initiative through the MD Anderson Strategic Initiative Development Program (STRIDE). We thank the NIH Clinical Center for the ChestX-ray14 dataset.


Fig. 2. This image was generated by the baseline StyleGAN2 model (10.43 FID). It was chosen to demonstrate the noise artifacts contained in many of the images generated by the model.


Fig. 3. The average generator loss (with standard deviation bars) across training. We see that augmentation not only significantly decreases the loss, but also leads to more stable convergence.


Fig. 4. Randomly selected images generated by the pretrained StyleGAN2-ADA model (5.06 FID).
