[Paper Translation] Evaluating the Performance of StyleGAN2-ADA on Medical Images


Original paper: https://arxiv.org/pdf/2210.03786v1


Evaluating the Performance of StyleGAN2-ADA on Medical Images

Abstract. Although generative adversarial networks (GANs) have shown promise in medical imaging, they have four main limitations that impede their utility: computational cost, data requirements, reliable evaluation measures, and training complexity. Our work investigates each of these obstacles in a novel application of StyleGAN2-ADA to high-resolution medical imaging datasets. Our dataset comprises liver-containing axial slices from non-contrast and contrast-enhanced computed tomography (CT) scans. Additionally, we utilized four public datasets composed of various imaging modalities. We trained a StyleGAN2 network with transfer learning (from the Flickr-Faces-HQ dataset) and data augmentation (horizontal flipping and adaptive discriminator augmentation). The network's generative quality was measured quantitatively with the Fréchet Inception Distance (FID) and qualitatively with a visual Turing test given to seven radiologists and radiation oncologists.

The StyleGAN2-ADA network achieved a FID of 5.22 ($\pm$ 0.17) on our liver CT dataset. It also set new record FIDs of 10.78, 3.52, 21.17, and 5.39 on the publicly available SLIVER07, ChestX-ray14, ACDC, and Medical Segmentation Decathlon (brain tumors) datasets. In the visual Turing test, the clinicians rated generated images as real 42% of the time, approaching random guessing. Our computational ablation study revealed that transfer learning and data augmentation stabilize training and improve the perceptual quality of the generated images. We observed the FID to be consistent with human perceptual evaluation of medical images. Finally, our work found that StyleGAN2-ADA consistently produces high-quality results without hyperparameter searches or retraining.

Keywords: StyleGAN2-ADA · Fréchet Inception Distance · Visual Turing Test · Data Augmentation · Transfer Learning

1 Introduction

Recently, generative adversarial networks (GANs) have shown promise in many medical imaging tasks, including data augmentation in computer-aided diagnosis [21], image segmentation [29], image reconstruction [17], treatment planning [1], image translation [10], and anomaly detection [23]. Despite their potential in medical imaging, GANs have several drawbacks that impede both their capabilities and utilization in the medical field. These obstacles include computational cost, data requirements, flawed measures of assessment, and training complexity.

GANs are computationally expensive. The original StyleGAN2 project took 51.06 GPU years to create, 0.23 of which were used to train the Flickr-Faces-HQ (FFHQ) weights used in our paper [15]. Despite being the state-of-the-art generative model for high-resolution images, StyleGAN2 is often not used in the medical imaging literature due to its expense [24]. When it is used, images are downsampled to lower resolutions to offset the cost [20,22]. While StyleGAN [14] (the predecessor to StyleGAN2) has been applied to high-resolution medical images [7], we believe our paper is the first rigorous evaluation of StyleGAN2 on multiple high-resolution medical imaging datasets.

At high resolutions, GANs require hundreds of thousands of images to train effectively, a requirement that is extremely challenging to satisfy in the medical field. With limited data, the GAN's discriminator overfits on the training examples, obstructing the GAN's ability to converge. Adaptive discriminator augmentation (ADA) was designed to reduce discriminator overfitting through a wide range of data augmentations that do not "leak" to the generated distribution. When applied to a histopathology dataset, ADA improved the FID by 84% [12]. In our paper, we perform a computational ablation study that examines how ADA and transfer learning affect performance on medical images.

One of the greatest challenges in GANs is constructing robust quantitative evaluation measures [18]. The Fréchet Inception Distance (FID) [9] is the standard for state-of-the-art evaluation of generative modeling in natural imaging. Its calculation relies on an Inception network that was trained on ImageNet, which does not contain medical images [6]. As such, a common assumption in the related literature is that the FID is not applicable to medical images. We revisit this assumption by testing the correlation between the FID and human perceptual evaluation on medical images.

GANs are notoriously challenging to train. They have numerous hyperparameters and suffer from training instability. In a large empirical evaluation of various GANs, Lucic et al. [18] found that GAN training is extremely sensitive to hyperparameter settings. A separate study illustrated this sensitivity by performing 1,500 hyperparameter searches on three unique medical imaging datasets with various GAN architectures. The authors found that few models produced meaningful images; even fewer achieved reasonable metric evaluations [26]. Neither of these studies examined StyleGAN2. Our work is unique in that we test the stability of StyleGAN2, along with its ability to generate quality images without a hyperparameter search.

The main contributions of our research are as follows:

2 Methods

2.1 Data

We used the 97 non-contrast and 108 contrast-enhanced abdominal computed tomography (CT) scans presented in [2]. To accentuate the liver, the data were windowed to a level of 50 and a width of 350, consistent with the preset values for viewing the liver in a commercial treatment planning system (RayStation v10, RaySearch Laboratories, Stockholm, Sweden). All axial slices that contained no liver information were discarded. Voxel values were mapped to the range [0, 255], and each axial slice was converted to a PNG image. In all, our training dataset contained 10,600 512x512 images. Three randomly sampled images from our training dataset are shown in the first row of Figure 1. We used an additional 143,345 512x512 images for one experiment in our ablation study. These images were obtained by applying the above-mentioned preprocessing steps to 3,029 abdominal CT scans (301 patients) that were retrospectively acquired under an IRB-approved protocol.

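As a concrete illustration, the liver-window step described above can be sketched as follows. This is a minimal sketch assuming NIfTI-format input volumes; the file names are placeholders, and the exact implementation used in the paper is not published.

```python
import numpy as np
import SimpleITK as sitk

# Liver window: level 50, width 350 -> clip Hounsfield units to [-125, 225].
WINDOW_LEVEL, WINDOW_WIDTH = 50, 350
lo = WINDOW_LEVEL - WINDOW_WIDTH / 2
hi = WINDOW_LEVEL + WINDOW_WIDTH / 2

scan = sitk.GetArrayFromImage(sitk.ReadImage("abdominal_ct.nii.gz"))  # (slices, H, W)
windowed = np.clip(scan, lo, hi)
scaled = ((windowed - lo) / (hi - lo) * 255.0).astype(np.uint8)  # map to [0, 255]

# Export each axial slice as a PNG (liver-free slices would be filtered beforehand).
for i, axial_slice in enumerate(scaled):
    sitk.WriteImage(sitk.GetImageFromArray(axial_slice), f"slice_{i:04d}.png")
```
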
Separately, our methods were applied to several publicly available datasets. For the “Segmentation of the Liver Competition 2007” (SLIVER07) dataset $^5$ [8], we used the 20 scans available in the training dataset and converted each slice to a PNG image without any further preprocessing. In total, this dataset consisted of 4,159 512x512 images. To our knowledge, the previous best FID (29.06) on this dataset was achieved by Skandarani et al. using the StyleGAN network.

The ChestX-ray14 dataset $^6$ [28] consists of 112,120 1024x1024 chest X-ray images in PNG format. The previous best FID on ChestX-ray14 (8.02) was achieved using a Progressive Growing GAN [24]. No preprocessing was performed on this dataset. The Automated Cardiac Diagnosis Challenge (ACDC) dataset [3] consists of 150 cardiac cine-magnetic resonance imaging (MRI) exams. We used the training dataset, which consists of 100 exams. The images were rescaled to the range [0, 255] using SimpleITK [16] and padded with zeros. Each slice was then converted to a 2D PNG image. In total, this dataset consisted of 1,902 512x512 images. The previous best FID on the ACDC training dataset (24.74) was achieved with StyleGAN [26].

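For the ACDC data, a hedged sketch of the rescale-and-pad step is given below; the centered padding layout is our assumption, since the paper only states that slices were zero-padded, and the path is a placeholder.

```python
import SimpleITK as sitk

img = sitk.ReadImage("acdc_exam.nii.gz")                             # placeholder path
img = sitk.Cast(sitk.RescaleIntensity(img, 0, 255), sitk.sitkUInt8)  # rescale to [0, 255]

for i, sl in enumerate(sitk.GetArrayFromImage(img)):                 # (slices, H, W)
    h, w = sl.shape
    pad_l = [(512 - w) // 2, (512 - h) // 2]                         # lower (x, y) pad
    pad_u = [512 - w - pad_l[0], 512 - h - pad_l[1]]                 # upper (x, y) pad
    padded = sitk.ConstantPad(sitk.GetImageFromArray(sl), pad_l, pad_u, 0)
    sitk.WriteImage(padded, f"acdc_slice_{i:04d}.png")
```
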
Additionally, we applied StyleGAN2-ADA to a dataset whose FID had not been previously evaluated: the brain tumor data from the Medical Segmentation Decathlon $^8$ [25], which contains 750 4D MRI volumes. The gadolinium-enhanced T1-weighted 3D images were extracted and windowed to the range [0, 255] using SimpleITK. Slices were converted to 2D PNG images. This dataset consists of 103,030 256x256 PNG images.

2.2 Generative Modeling

Due to its state-of-the-art performance on high-resolution images, we used a StyleGAN2 network as our generative model [15]. For our experiments, we utilized the StyleGAN2 configuration of the official StyleGAN3 repository $^{9}$ [13]. We used the default parameters provided by the implementation, with the exception of changing $\beta_{0}$ to 0.9 in the Adam optimizer and disabling mixed precision. We did not perform a hyperparameter search. We explored the effects of transfer learning and data augmentation in an ablation study with the following experimental designs:

Each of these experiments was performed on our liver CT training dataset. A variation of Experiment 1 was also performed where 143,345 liver images were added to the training dataset. Furthermore, Experiment 4 was performed on the four public datasets. Each experiment was performed on a DGX with eight 40GB A100 GPUs. DGXs were accessed using the XNAT platform [19]. Experiments ran for 6,250 ticks with metrics calculated and weights saved every 50 ticks. Each experiment took approximately 1.5, 4, and 7 days to complete for 256x256, 512x512, and 1024x1024 sized datasets, respectively. We repeated each experiment five times to test algorithm stability.

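For reference, a training launch against the official repository's train.py might look like the sketch below. The flag names follow the public StyleGAN3 codebase, but the dataset and resume paths, batch size, and R1 gamma are illustrative placeholders; the Adam $\beta$ change is made by editing the optimizer settings inside the repository's training configuration rather than via a command-line flag.

```python
import subprocess

# Hedged sketch of one ablation run (transfer learning + ADA + mirroring).
subprocess.run([
    "python", "train.py",
    "--outdir=runs",
    "--cfg=stylegan2",           # StyleGAN2 configuration of the StyleGAN3 repo
    "--data=liver_ct_512.zip",   # 512x512 PNGs packed with the repo's dataset_tool.py
    "--gpus=8",
    "--batch=32",                # illustrative; not reported in this section
    "--gamma=8.2",               # illustrative R1 regularization weight
    "--mirror=1",                # horizontal flipping
    "--aug=ada",                 # adaptive discriminator augmentation
    "--resume=ffhq512.pkl",      # FFHQ transfer-learning weights (placeholder path)
    "--fp32=1",                  # disable mixed precision
    "--metrics=fid50k_full",
], check=True)
```
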
2.3 Evaluation Measures

Fréchet Inception Distance. The FID is the standard for state-of-the-art GAN evaluation in natural imaging. It is the Fréchet distance between two multivariate Gaussians constructed from representations extracted from the coding layer of an Inception network pretrained on ImageNet [9]. Several advantages of the FID include its ability to distinguish generated from real samples, agreement with human perceptual judgements, sensitivity to distortions, and computational and sample efficiency [4,9]. As such, we used the FID as our quantitative metric. For each run, we reported the best FID achieved during training. We used the model weights associated with each best FID for further qualitative analysis. For statistical testing, we used permutation tests with $\alpha=0.05$.

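Concretely, for Gaussians $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ fitted to real and generated Inception features, $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$. A minimal sketch of this closed form follows; extracting the 2048-dimensional features with the pretrained Inception network is assumed to happen upstream.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two Inception feature sets of shape (N, 2048)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```
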
Because ImageNet does not contain medical images, prior publications have argued that the FID is not applicable to medical imaging [5,11,27]. As such, they substitute the Inception network with their own encoding networks. This trend has several limitations. First, the FID is only consistent as a metric inasmuch as the same encoding model is used. By using a new model, the reported distance can no longer be considered in the context of prior work that utilizes the FID. Second, the algorithm designer is formulating their own evaluation metric, which will likely introduce unquantified bias into the presented results. Due to these limitations, we use the original definition of the FID for our calculations.

Visual Turing Tests. Because the applicability of the FID to medical imaging is not well understood, our first visual Turing test evaluated the correlation between the FID and human perception on medical images. The test was administered in a Google Form with four sections (created in random order), one per experiment. Each section contained 40 randomly shuffled images, 20 real and 20 generated. All images were randomly selected and only appeared once in the test. The test was given to five participants with a medical physics background who were not familiar with the images. We evaluated the test with the false positive rate (FPR) and false negative rate (FNR).

The purpose of the second visual Turing test was to rigorously validate the perceptual quality of the images generated by the pretrained StyleGAN2-ADA model on our dataset. This test consisted of 50 real and 50 generated images, randomly sampled and shuffled. Each section contained one image, a question asking the participant whether the image was real or fake, and a Likert scale assessing how realistic the image was. The Likert scale ranged from 1 (fake) to 5 (real). The test was given to seven radiologists or radiation oncologists with an average of 10 years of radiological experience. The results of the Turing test were evaluated with precision, recall, accuracy, FPR, and FNR metrics. Additionally, we computed the average Likert values for both real and generated images. For statistical testing, we used permutation tests with $\alpha=0.10$.

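The reader-study metrics and the permutation test can be computed as in the sketch below; treating "generated" as the positive class is our labeling assumption, and the function names are illustrative.

```python
import numpy as np

def turing_test_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Reader-study metrics with labels 1 = generated ("fake") and 0 = real."""
    tp = int(((y_true == 1) & (y_pred == 1)).sum())  # generated image called fake
    tn = int(((y_true == 0) & (y_pred == 0)).sum())  # real image called real
    fp = int(((y_true == 0) & (y_pred == 1)).sum())  # real image called fake
    fn = int(((y_true == 1) & (y_pred == 0)).sum())  # generated image called real
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / len(y_true),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

def permutation_test(a: np.ndarray, b: np.ndarray, n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test for a difference in means (e.g. Likert scores)."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[: len(a)].mean() - pooled[len(a):].mean()) >= observed
    return hits / n_perm
```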