Exploring the Boundaries of GPT-4 in Radiology
Qianchu Liu1, Stephanie L. Hyland1, Shruthi Bannur1, Kenza Bouzid1, Daniel C. Castro1, Maria Teodora Wetscherek1, Robert Tinn1, Harshita Sharma1, Fernando Pérez-García1, Anton Schwaighofer1, Pranav Rajpurkar2, Sameer Tajdin Khanna2, Hoifung Poon1, Naoto Usuyama1, Anja Thieme1, Aditya Nori1, Matthew P. Lungren1, Ozan Oktay1, Javier Alvarez-Valle1∗ 1 Microsoft Health Futures 2 Harvard University
Abstract
The recent success of general-domain large language models (LLMs) has significantly changed the natural language processing paradigm towards a unified foundation model across domains and applications. In this paper, we focus on assessing the performance of GPT-4, the most capable LLM so far, on the text-based applications for radiology reports, comparing against state-of-the-art (SOTA) radiology-specific models. Exploring various prompting strategies, we evaluated GPT-4 on a diverse range of common radiology tasks and we found GPT-4 either outperforms or is on par with current SOTA radiology models. With zero-shot prompting, GPT-4 already obtains substantial gains ($\approx$10% absolute improvement) over radiology models in temporal sentence similarity classification (accuracy) and natural language inference ($F_1$). For tasks that require learning dataset-specific style or schema (e.g. findings summarisation), GPT-4 improves with example-based prompting and matches supervised SOTA. Our extensive error analysis with a board-certified radiologist shows GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex contexts that require nuanced domain knowledge. For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
1 Introduction
Recently, the emergence of large language models (LLMs) has pushed forward AI performance in many domains, with GPT-4 (OpenAI, 2023) powered applications achieving and even surpassing human performance in many tasks (Bubeck et al., 2023; Nori et al., 2023). There is a shift in paradigm towards using a unified general-domain foundation LLM to replace domain- and task-specific models. General-domain LLMs enable a wider range of customised tasks without the need to extensively collect human labels or to perform specialised domain training. Also, with off-the-shelf prompting, applying LLMs is easier than the traditional training pipeline for supervised models.
While contemporary studies (Nori et al., 2023; Ranjit et al., 2023; Bhayana et al., 2023a) have started to explore the use of GPT-4 in the clinical domain, the readiness of GPT-4 in the radiology workflow remains to be rigorously and systematically tested. In this study, we set out the following research questions: (1) How can we evaluate GPT-4 on its ability to process and understand radiology reports? (2) How can we apply common prompting strategies for GPT-4 across different radiology tasks? (3) How does GPT-4 compare against SOTA radiology-specific models?
To answer these questions, we established a rigorous evaluation framework to evaluate GPT-4 on a diverse range of common radiology tasks including both language understanding and generation. The evaluation covers sentence-level semantics (natural language inference, sentence similarity classification), structured information extraction (including entity extraction, disease classification and disease progression classification), and a direct application of findings summarisation. We explored various prompting strategies including zero-shot, few-shot, chain-of-thought (CoT) (Wei et al., 2022), example selection (Liu et al., 2022), and iterative refinement (Ma et al., 2023), and we further experimented with adding self-consistency (Wang et al., 2023) and asking GPT-4 to defer handling uncertain cases to improve the reliability of GPT-4. For each task, we benchmarked GPT-4 with prior GPT-3.5 models (text-davinci-003 and ChatGPT) and the respective state-of-the-art (SOTA) radiology models. Apart from reporting metric scores, we performed extensive qualitative analysis with a board-certified radiologist to understand the model errors by categorising them as ambiguous, label noise, or genuine model mistakes. We highlight the particular importance of qualitative analysis for open-ended generation tasks such as findings summarisation where GPT-4 may provide alternative solutions.
To sum up, our key contributions and findings (in italics) are:
- Evaluation Framework: We proposed an evaluation and error analysis framework to benchmark GPT-4 in radiology. Collaborating with a board-certified radiologist, we pinpointed the limitations of GPT-4 and the current task paradigms, directing future evaluation pursuits to tackle more intricate and challenging real-world cases and to move beyond mere metric scores.
GPT-4 shows a significant level of radiology knowledge. The majority of detected errors are either ambiguous or label noise, with a few model mistakes requiring nuanced domain knowledge. For findings summarisation, GPT-4 outputs are often comparable to existing manually-written impressions.
- Prompting Strategies: We explored and established good practices for prompting GPT-4 across different radiology tasks.
GPT-4 requires minimal prompting (zero-shot) for tasks with clear instructions (e.g. sentence similarity). However, for tasks needing comprehension of dataset-specific schema or style (e.g. findings summarisation), which are challenging to articulate in instructions, GPT-4 demands advanced example-based prompting.
- GPT-4 vs. SOTA: We compared GPT-4 performance with task-specific SOTA radiology models for understanding and validating the paradigm shift towards a unified foundation model in the specialised domains.
GPT-4 outperforms or matches performance of task-specific radiology SOTA.
2 Related Work
There have been extensive efforts to benchmark and analyse LLMs in the general domain. Liang et al. (2023) benchmarks LLMs across broad NLP scenarios with diverse metrics. Hendrycks et al. (2021) measures LLMs' multitask accuracy across disciplines. Zheng et al. (2023) explores using LLMs as judges for open-ended questions. Bubeck et al.
(2023) further tests GPT-4's capabilities beyond language processing towards artificial general intelligence (AGI), exploring tasks such as mathematical problem solving and game playing. Many other studies focus on testing specific capabilities such as reasoning from LLMs (Liu et al., 2023b; Espejel et al., 2023).
The evaluation of GPT-4 has also begun to garner interest in the medical field. For example, Lee et al. (2023) discusses the potential advantages and drawbacks of using GPT-4 as an AI chatbot in the medical field. Cheng et al. (2023) investigates possible applications of GPT-4 in biomedical engineering. Nori et al. (2023) evaluates GPT-4 for medical competency examinations and shows GPT-4 performance is well above the passing score. There have also been a few recent studies that evaluate GPT-4 in the radiology domain: Bhayana et al. (2023a,b) show that GPT-4 significantly outperforms GPT-3.5 and exceeds the passing scores on radiology board exams. Other studies have shown great potential from GPT-4 in various radiology applications such as simplifying clinical reports for clinical education (Lyu et al., 2023), extracting structures from radiology reports (Adams et al., 2023), natural language inference (NLI) (Wu et al., 2023b), and generating reports (Ranjit et al., 2023). While most of these studies focus on a specific application, our study aims for an extensive evaluation to compare GPT-4 against SOTA radiology models, covering diverse tasks and various prompting techniques.
Beyond prompting GPT-4, continued efforts are being made to adapt LLMs to the medical domain via fine-tuning. Med-PaLM and Med-PaLM-2 (Singhal et al., 2022, 2023) improve over PaLM (Chowdhery et al., 2022) and PaLM-2 (Anil et al., 2023) with medical-domain fine-tuning. Yunxiang et al. (2023) and Wu et al. (2023a) further fine-tune the open-source LLaMA model (Touvron et al., 2023) with medical-domain data. Van Veen et al. (2023) adapts LLMs to radiology data with parameter-efficient fine-tuning. While these models offer lightweight alternatives, our study focuses on GPT-4 as it is still by far the best-performing model across many domains and represents the frontier of artificial intelligence (Bubeck et al., 2023).
3 Evaluation Framework
3.1 Task selection
We benchmark GPT-4 on seven common text-only radiology tasks (Table 1) covering both understanding and generation tasks. The two sentence similarity classification tasks and NLI both require the understanding of sentence-level semantics in a radiology context, with NLI additionally requiring reasoning and logical inference. Structured information extraction tasks (disease classification, disease progression classification, and entity extraction) require both superficial entity extraction and inference from cues with radiology knowledge (e.g. 'enlarged heart' implies 'cardiomegaly'). For entity extraction, the model must further follow the schema-specific categorisation of entities. Finally, we evaluate GPT-4 on an important part of the radiology workflow: findings summarisation, i.e. condensing detailed descriptions of findings into a clinically actionable impression. These tasks cover different levels of text granularity (sentence-level, word-level, and paragraph-level) and different aspects of report processing, and hence give us a holistic view of how GPT-4 performs in processing radiology reports.
3.2 Prompting strategies
Alongside GPT-4 (gpt-4-32k), we evaluated two earlier GPT-3.5 models: text-davinci-003 and ChatGPT (gpt-35-turbo). Model and API details are in Appendix A. For each task, we started with zero-shot prompting and progressively increased prompt complexity to include random few-shot (a fixed set of random examples), and then similarity-based example selection (Liu et al., 2022). For example selection, we use OpenAI's general-domain text-embedding-ada-002 model to encode the training examples as the candidate pool and select the $n$ nearest neighbours for each test instance, as sketched below. For NLI, we also explored CoT, as it was shown to benefit reasoning tasks (Wei et al., 2022). For findings summarisation, we replicated ImpressionGPT (Ma et al., 2023), which adopts dynamic example selection and iterative refinement.
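To make the example-selection step concrete, here is a minimal sketch of similarity-based in-context example selection. It assumes the test input and the candidate pool have already been embedded (e.g. with text-embedding-ada-002) into NumPy arrays; the function and variable names are ours, not part of the original pipeline.

```python
import numpy as np

def select_examples(test_emb: np.ndarray,
                    train_embs: np.ndarray,
                    train_examples: list,
                    n: int = 10) -> list:
    """Return the n training examples closest to the test instance.

    test_emb:       (d,) embedding of the test input.
    train_embs:     (N, d) embeddings of the candidate pool.
    train_examples: the N candidate (input, label) pairs, aligned with train_embs.
    """
    # Cosine similarity between the test embedding and every candidate.
    train_norm = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    test_norm = test_emb / np.linalg.norm(test_emb)
    sims = train_norm @ test_norm
    # Indices of the n most similar candidates, most similar first.
    top = np.argsort(-sims)[:n]
    return [train_examples[i] for i in top]
```

The selected pairs are then formatted as in-context examples ahead of the test instance in the prompt; the exact templates are given in Appendix C.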
To test the stability of GPT-4 output, we applied self-consistency (Wang et al., 2023) for sentence similarity, NLI, and disease classification. We report mean and standard deviation across five runs of
GPT-4 with temperature zero, and self-consistency results with majority voting (indicated by 'SC'). All prompts are presented in Appendix C.
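A minimal sketch of the majority-voting step used for self-consistency is shown below; the label strings are placeholders for whichever answers a given task produces.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over answers from multiple runs of the same prompt.

    Ties are broken by whichever answer was counted first, which is one
    arbitrary but deterministic choice.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Example: five runs on a temporal sentence-similarity pair.
runs = ["same", "same", "different", "same", "same"]
print(self_consistency(runs))  # -> "same"
```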
3.3 Error analysis with radiologist
The authors did a first pass of the error cases to review easy instances requiring only general syntactic and linguistic knowledge (e.g. 'increased pleural effusion' versus 'decreased pleural effusion'). We then surfaced the cases where radiology expertise is required to a board-certified radiologist for a second-round review and feedback. For interpretability, we prompted GPT-4 to give an explanation after its answer. Reviewing both the model answer and reasoning, we categorise each error as: ambiguous, label noise, or genuine mistake.
4 Experiments
4.1 Sentence similarity classification
Task and model setup In this task, the model receives a sentence pair as input and must classify the sentences as having the same or different meanings. We evaluate the models on two sub-tasks: temporal sentence similarity classification (MS-CXR-T (Bannur et al., 2023b)) and RadNLI-derived sentence similarity classification. Temporal sentence similarity focuses on temporal changes of diseases. For RadNLI, we follow Bannur et al. (2023a) to use the subset of bidirectional 'entailment' and 'contradiction' pairs and discard the 'neutral' pairs to convert RadNLI (Miura et al., 2021) into a binary classification task.
The radiology SOTA for this task is BioViL-T (Bannur et al., 2023a) (a radiology-specific vision-language model trained with temporal multi-modal contrastive learning). The GPT performance is obtained from zero-shot prompting.
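For illustration, a zero-shot prompt for this binary decision could be assembled as in the following sketch. The wording is a placeholder of our own; the exact prompts used in the paper are listed in Appendix C.

```python
def build_similarity_prompt(sentence_1: str, sentence_2: str) -> str:
    """Illustrative zero-shot prompt for the sentence similarity task.

    The phrasing below is a placeholder, not the paper's actual prompt.
    """
    return (
        "You are a radiologist. Decide whether the two sentences below "
        "have the same meaning or different meanings. "
        "Answer with exactly one word: 'same' or 'different'.\n"
        f"Sentence 1: {sentence_1}\n"
        f"Sentence 2: {sentence_2}\n"
        "Answer:"
    )

print(build_similarity_prompt(
    "There is a small left pleural effusion.",
    "Small left pleural effusion is present."))
```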
Results As shown in Table 2, all the GPT models outperform BioViL-T, achieving new SOTA. In particular, GPT-4 significantly outperforms both text-davinci-003 and ChatGPT on MS-CXR-T, indicating an advanced understanding of disease progression. Error analysis revealed the majority of the GPT-4 (SC) errors are either ambiguous or label noise with only 1 model mistake in RadNLI (see Appendix B.1), indicating GPT-4 is achieving near-ceiling performance in these tasks.
Table 1: Results overview. GPT-4 either outperforms or is on par with previous SOTA. New SOTA is established by GPT-4 on sentence similarity and NLI (absolute improvements in accuracy and $F_1$ are reported). GPT-4 achieves near-ceiling performance in many tasks with a <1% mistake rate (shaded). ImpressionGPT (Ma et al., 2023) requires example selection and iterative example refinement.
| Task | Test samples | Prompting GPT-4 | GPT-4 performance | Mistake rate |
|---|---|---|---|---|
| Temporal sentence similarity | 361 | Zero-shot | New SOTA (↑10% acc.) | 0.0% |
| Sentence similarity (RadNLI) | 145 | Zero-shot | New SOTA (↑3% acc.) | 0.7% |
| Natural language inference (RadNLI) | 480 | Zero-shot + CoT | New SOTA (↑10% $F_1$) | 5.8% |
| Disease progression | 1326 | Zero-shot | On par with SOTA | 0.4% |
| Disease classification | 1955 | 10-shot* | On par with SOTA | 0.3% |
| Entity extraction | 100 | 200-shot* | On par with SOTA | |
| Findings summarisation | 1606 / 576† | ImpressionGPT | On par with SOTA | |

$n$-shot*: similarity-based example selection with $n$ examples; Mistake rate = [# genuine mistakes] / [# test samples]; †: [MIMIC] / [Open-i]
Table 2: Zero-shot GPT-4 and GPT-3.5 achieve new SOTA (accuracy) on sentence similarity tasks. To test the consistency of GPT-4, we report mean and std. across five runs, and the self-consistency results (‘SC’).
| Model | MS-CXR-T | RadNLI |
|---|---|---|
| text-davinci-003 | 90.3 | 91.0 |
| ChatGPT | 92.0 | 95.2 |
| GPT-4 | 97.3 ± 0.2 | 94.1 ± 0.4 |
| GPT-4 (SC) | 97.2 | 93.8 |
| BioViL-T (Bannur et al., 2023a) | 87.8 | 90.5 |
4.2 Natural language inference (NLI)
Task and model setup We assess GPT on the original RadNLI classification dataset (Miura et al., 2021). The model receives input ‘premise’ and ‘hypothesis’ sentences, and determines their relation: one of ‘entailment’, ‘contradiction’, or ‘neutral’.
We present GPT performance with zero-shot prompting and CoT. We compare GPT models against the current SOTA, a radiology-adapted T5 model (DoT5) which was trained on radiology text and general-domain NLI data (Liu et al., 2023a).
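As an illustration of the CoT variant, a zero-shot chain-of-thought prompt for RadNLI might be assembled as below; again, the wording is a placeholder of ours and the paper's actual prompts are in Appendix C.

```python
def build_nli_cot_prompt(premise: str, hypothesis: str) -> str:
    """Illustrative zero-shot chain-of-thought prompt for RadNLI.

    Placeholder wording, not the prompt used in the paper.
    """
    return (
        "Given a premise and a hypothesis from radiology reports, decide "
        "whether the hypothesis is 'entailment', 'contradiction', or "
        "'neutral' with respect to the premise.\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        # The chain-of-thought instruction: reason first, then answer.
        "Explain the relevant radiology findings step by step, then give "
        "the final label on a new line as 'Label: <answer>'."
    )
```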
Results Table 3 shows that GPT-4 with CoT achieves a new SOTA on RadNLI, outperforming DoT5 by 10% in macro $F_1$. Whereas NLI has traditionally been a challenging task for earlier GPT models, GPT-4 displays a striking improvement. We also observe that CoT greatly helps in this task, especially for GPT-3.5.
We further investigate how GPT-4 performs in cases that require different levels of radiology expertise, and we show that GPT-4 reaches the best performance in both generic and radiology-specific logical inference. CoT seems to help GPT models particularly to understand the radiology-specific cases. This is because CoT pushes the model to elaborate more on the radiology knowledge relevant to the input sentences, therefore giving sufficient context for a correct reasoning assessment (see Table B.4). Finally, we highlight that, even for GPT-4, there is still a gap in performance: the cases that specifically require radiology knowledge are more challenging than the other cases.
Table 3: GPT performance (macro $F_1$) on RadNLI with domain analysis. GPT-4 + CoT achieves new SOTA. Mean, std., and self-consistency ('SC') results are reported for GPT-4 + CoT across five runs.
| Model | All | Needs domain expertise: Yes | Needs domain expertise: No |
|---|---|---|---|
| text-davinci-003 | 55.9 | 42.8 | 60.7 |
| + CoT | 64.9 | 54.1 | 68.4 |
| ChatGPT | | 31.5 | 52.3 |
| + CoT | | 65.6 | 70.2 |
| GPT-4 | | 74.0 | 93.1 |
| + CoT | 89.3 ± 0.4 | 78.9 ± 1.4 | 93.5 ± 0.4 |
| + CoT (SC) | | 78.8 | 93.6 |
| DoT5 (Liu et al., 2023a) | | 70.1 | 86.4 |
4.3 Disease classification
Task and model setup The evaluation dataset is extracted from Chest ImaGenome (Wu et al., 2021) gold attributes on the sentence level. To fairly compare with the SOTA CheXbert (Smit et al., 2020) model, we focus on pleural effusion, atelectasis, pneumonia, and pneumothorax, which are common pathology names between CheXbert findings and Chest ImaGenome attributes. The output labels are 'presence' and 'absence' (binary classification) for each pathology. A detailed description of the label mapping is in Appendix D.
Besides the CheXbert baseline, we also include the silver annotations from Chest ImaGenome, produced by an ontology-based NLP tool with filtering rules (the Chest ImaGenome gold datasets are in fact human-verified silver annotations). To prompt GPT models, we started with zero-shot prompting, and then added 10 in-context examples with both random selection and similarity-based example selection. The example candidates are from the Chest ImaGenome silver data.
Results As shown in Table 4, there is progressive improvement from text-davinci-003 to ChatGPT and then to GPT-4. All the GPT models' zero-shot results outperform CheXbert. We are able to improve GPT-4 zero-shot performance with 10-shot random in-context examples. We achieve a further slight improvement with similarity-based example selection, approaching the performance of the silver annotations.
We manually analysed the errors from the GPT-4 (*10) experiment and found that most (20 out of 30) are ambiguous, with the pathology cast as potentially present, rather than being easily labelled as present or not. This is particularly the case for pneumonia, whose presence is typically only suggested by findings in the chest X-ray (see examples of such uncertain cases in Table B.6). The rest of the model errors are 5 cases of label noise and 5 model mistakes. With a <1% mistake rate, GPT-4 is approaching ceiling performance in this task.
Defer from uncertain cases Given the large number of uncertain and ambiguous cases in the dataset, we experimented with asking the model to output 'uncertain' alongside the presence and absence labels, and to defer from these uncertain cases. Table 5 shows that GPT-4 achieves very strong performance on those cases for which it is not uncertain. Note that pneumonia classification is dramatically improved and many positive cases of pneumonia are deferred. This aligns with our observation from the dataset that pneumonia is often reported as a possibility rather than a certain presence. We further test the robustness of GPT-4 in this setup and report mean, standard deviation and majority vote results in Table E.1.
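A minimal sketch of how such deferred evaluation can be scored is shown below. The label strings and the use of accuracy (rather than the per-pathology F1 reported in Table 5) are simplifications of our own.

```python
def evaluate_with_deferral(predictions: list[str], gold: list[str]):
    """Score only the cases the model did not defer on.

    predictions: per-sample labels in {'present', 'absent', 'uncertain'}.
    gold:        per-sample binary labels in {'present', 'absent'}.
    Returns (accuracy_on_kept_cases, deferral_rate).  The label names are
    illustrative; the actual output format follows the prompt in Appendix C.
    """
    kept = [(p, g) for p, g in zip(predictions, gold) if p != "uncertain"]
    deferral_rate = 1 - len(kept) / len(predictions)
    correct = sum(p == g for p, g in kept)
    accuracy = correct / len(kept) if kept else 0.0
    return accuracy, deferral_rate
```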
Table 4: GPT performance on Chest ImaGenome disease classification.
| Model | Micro F1 | Macro F1 |
|---|---|---|
| text-davinci-003 | 79.2 | 79.9 |
| ChatGPT | 89.7 | 85.0 |
| GPT-4 | 93.0 | 91.5 |
| GPT-4 (10) | 96.6 | 96.6 |
| GPT-4 (*10) | 97.9 | 97.5 |
| CheXbert | 73.6 | 73.1 |
| Silver | 97.8 | 98.9 |

(n): number of random shots; *: similarity-based example selection; Silver: Chest ImaGenome silver annotations.
Table 5: Zero-shot GPT-4 performance after deferring from uncertain cases on Chest ImaGenome dataset: GPT-4 (defer). Its performance is significantly improved from zero-shot GPT-4 (with binary output).
| | GPT-4 (defer) | GPT-4 |
|---|---|---|
| Macro F1 | 97.4 | 91.5 |
| Micro F1 | 98.6 | 93.0 |
| Pleural effusion | 98.5 [103] | 95.3 [176] |
| Atelectasis | 99.0 [154] | 97.8 [233] |
| Pneumonia | 92.3 [16] | 75.7 [111] |
| Pneumothorax | 100.0 [17] | 97.3 [18] |

[n]: number of positive instances for each pathology.
4.4 RadGraph entity extraction
Task and model setup This task requires a model to extract observation and anatomy entities from radiology reports and determine their presence (present, absent, or uncertain) following the RadGraph schema (Jain et al., 2021). To evaluate the extraction, we report the micro $F_1$ score, counting a true positive when both the extracted entity text and the label are correct. RadGraph provides two datasets: MIMIC (Johnson et al., 2019) with both train and test data, and CheXpert (Irvin et al., 2019) (with only test data).
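A minimal sketch of this micro $F_1$ computation is shown below, treating each report's extractions as a set of (entity text, label) pairs; the label strings in the example are illustrative stand-ins for the RadGraph schema labels, and duplicate entities within a report are ignored for brevity.

```python
def entity_micro_f1(pred: list, gold: list) -> float:
    """Micro F1 for entity extraction.

    pred/gold: one set per report of (entity_text, label) tuples, e.g.
    {('effusion', 'observation: definitely present'), ...}.  A prediction
    counts as a true positive only if both the text span and the label match.
    """
    tp = fp = fn = 0
    for p, g in zip(pred, gold):
        tp += len(p & g)   # correct text AND label
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```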
We compare with the SOTA RadGraph Benchmark model reported in Jain et al. (2021), which is based on DyGIE++ (Wadden et al., 2019) with PubMedBERT initializations (Gu et al., 2021). Regarding prompting strategy, we started with a randomly selected 1-shot example, and then increased the number of random shots to 10. To push the performance, we leveraged the maximum context window of GPT-4, incorporating 200-shot examples with both random selection and similarity-based selection. Additionally, we found it helpful to perform GPT inference on individual sentences before combining them for the report-level output; a sketch of this pipeline is shown below. The in-context examples are also on the sentence level (200-shot sentences roughly correspond to 40 reports) from the train set.
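A simplified sketch of this sentence-level pipeline follows; the naive sentence splitter and the `extract_sentence` callable (standing in for one GPT-4 call with its in-context examples) are placeholders of ours, not the paper's implementation.

```python
import re

def extract_report_entities(report: str, extract_sentence) -> list:
    """Run extraction sentence by sentence, then merge for report-level output.

    extract_sentence: a function that takes one sentence and returns a list
    of (entity_text, label) tuples, standing in for a single GPT-4 call.
    """
    # Naive sentence splitting on terminal punctuation; a placeholder only.
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", report) if s.strip()]
    entities = []
    for sentence in sentences:
        entities.extend(extract_sentence(sentence))
    return entities
```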
Table 6: GPT performance (micro $F_1$) on RadGraph entity extraction.
| Model | MIMIC | CheXpert |
|---|---|---|
| text-davinci-003 (1) | 56.2 | 49.2 |
| text-davinci-003 (10) | 83.2 | 79.5 |
| ChatGPT (1) | 47.1 | 42.2 |
| ChatGPT (10) | 70.6 | 67.5 |
| GPT-4 (1) | 36.6 | 25.3 |
| GPT-4 (10) | 88.3 | 84.7 |
| GPT-4 (200) | 91.5 | 88.4 |
| GPT-4 (*200) | 92.8 | 90.0 |
| RadGraph Benchmark | 94.3 | 89.5 |

(n): number of random shots; *: similarity-based example selection
Results As shown in Table 6, examples are crucial for GPT to learn this task. We observe a massive jump in performance when increasing the number of examples in the context. GPT-4 with 200 selected examples achieves overall on-par performance with the RadGraph Benchmark: while GPT-4 (*200) underperforms the RadGraph model on the in-domain MIMIC test set, GPT-4 surpasses RadGraph Benchmark on the out-of-domain CheXpert dataset. This indicates GPT-4 could be a more robust choice to generalise to out-of-domain datasets. Our error analysis reveals the errors are mostly due to GPT-4 failing to learn the schema specifics (Appendix B.5). For example, GPT-4 may extract the whole compound word ('mild-to-moderate') as the observation term, while the gold annotations break the word down ('mild' and 'moderate').
4.5 Disease progression classification
Task and model setup We evaluate on the temporal classification task from MS-CXR-T (Bannur et al., 2023b), which provides progression labels for five pathologies (consolidation, edema, pleural effusion, pneumonia, and pneumothorax) across three progression classes ('improving', 'stable', and 'worsening'). In this experiment, the input is the radiology report and the outputs are disease progression labels. We report macro accuracy for each pathology due to class imbalance. As MS-CXR-T labels were originally extracted from Chest ImaGenome, we can also use Chest ImaGenome silver annotations as our baseline. We report GPT performance with zero-shot prompting.
Results Table 7 shows that there is again a large jump in performance from GPT-4 compared with the earlier GPT-3.5 models. Zero-shot GPT-4 achieves >95% across all pathologies and is comparable with the Chest ImaGenome silver annotations. Our error analysis reveals that the majority of model errors are either label noise or ambiguous, and the small mistake rate (0.4%) reflects that the task is nearly solved.
Table 7: GPT performance on MS-CXR-T disease progression (macro accuracy).
| Model | Pl. eff. | Cons. | PNA | PTX | Edema |
|---|---|---|---|---|---|
| text-davinci-003 | 92.1 | 91.8 | 90.0 | 96.1 | 93.6 |
| ChatGPT | 91.0 | 84.8 | 84.5 | 93.0 | 89.8 |
| GPT-4 | 98.7 | 95.7 | 96.4 | 99.4 | 96.8 |
| Silver | 98.1 | 91.8 | 96.6 | 100.0 | 97.6 |

PNA: pneumonia; PTX: pneumothorax; Pl. eff.: pleural effusion; Cons.: consolidation; Silver: Chest ImaGenome silver annotations.
4.6 Findings summarisation
Task and model setup The findings summarisation task requires the model to summarise the input findings into a concise and clinically actionable impression section. We evaluate on the MIMIC (Johnson et al., 2019) and Open-i (Demner-Fushman et al., 2016) datasets and follow Ma et al. (2023) to report results on the official MIMIC test set and a random split (2400:576 for train:test) for Open-i. For metrics, we report RougeL (Lin, 2004) and the CheXbert score (Smit et al., 2020) (a radiology-specific factuality metric). We further conduct a qualitative comparison study on GPT-4 outputs.
For prompting strategies, we started with zero-shot and increased the number of random in-context examples to 10-shot. For GPT-4, we tried adding 100 examples with random selection and similarity-based selection. Examples are drawn from the respective train set for each dataset. We also replicated ImpressionGPT (Ma et al., 2023) with ChatGPT and GPT-4. ImpressionGPT performs dynamic example selection based on CheXbert labels and iteratively selects good and bad examples as in-context examples (implementation details are in Appendix G); a simplified sketch of this loop is shown below.
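The sketch below gives a highly simplified view of such an ImpressionGPT-style loop as described above; all function names and the selection heuristic are placeholders of our own, not the original implementation by Ma et al. (2023).

```python
def impressiongpt_style_summarise(findings, corpus, select, generate, score, n_iter=3):
    """Highly simplified sketch of an ImpressionGPT-style loop (Ma et al., 2023).

    All names are placeholders, not the original implementation:
      select(findings, corpus)      -> similar (findings, impression) pairs,
                                       e.g. matched on CheXbert labels;
      generate(findings, good, bad) -> one GPT call producing a draft impression;
      score(draft, references)      -> a quality score for the draft.
    """
    references = select(findings, corpus)        # dynamic example selection
    good, bad = list(references), []
    best_draft, best_score = None, float("-inf")
    for _ in range(n_iter):                      # iterative refinement
        draft = generate(findings, good, bad)
        s = score(draft, [imp for _, imp in references])
        if s > best_score:
            best_draft, best_score = draft, s
            good.append((findings, draft))       # keep good drafts as positives
        else:
            bad.append((findings, draft))        # show bad drafts to avoid
    return best_draft
```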
We compare with the previous supervised SOTA for this task (Hu et al., 2022) (which adopts a graph encoder to model entity relations from findings), as well as with DoT5 (Liu et al., 2023a), a strong zero-shot summarisation baseline.
Results While zero-shot GPT models all outperform DoT5, we observe that providing examples is crucial for this task: there is consistent and substantial improvement when increasing the number of in-context examples for all GPT models. A further boost can be achieved when we enable example selection for GPT-4 (*100). The more advanced ImpressionGPT brings the best performance out of GPT-4 and achieves performance comparable with the supervised SOTA.
Qualitative comparison To understand the differences between GPT-4 output and the manually-written impressions, we chose a random sample of reports and asked a radiologist to compare existing manually-written impressions with GPT-4 (ImpressionGPT) output. Table 9 demonstrates that for the majority of the cases (≈70%), GPT-4 output is either preferred or comparable with the manually-written impression. Tables B.8 and B.9 show examples where GPT-4 outputs are more faithful to the findings than the manually-written impressions.
Table 8: GPT performance on findings summarisation. ImpressionGPT iteratively refines good and bad examples as in-context examples.
| Model | MIMIC R. | MIMIC CB. | Open-i R. | Open-i CB. |
|---|---|---|---|---|
| text-davinci-003 | 22.9 | 41.8 | 14.5 | 41.9 |
| text-davinci-003 (10) | 29.1 | 43.0 | 40.5 | 42.0 |
| ChatGPT | 20.0 | 40.5 | 14.8 | 39.6 |
| ChatGPT (10) | 31.0 | 42.5 | 40.6 | 41.0 |
| GPT-4 | 22.5 | 39.2 | 18.0 | 39.3 |
| GPT-4 (10) | 28.5 | 44.2 | 42.5 | 44.9 |
| GPT-4 (100) | 30.9 | 44.7 | 44.2 | 45.0 |
| GPT-4 (*100) | 38.4 | 47.4 | 59.8 | 47.3 |
| ChatGPT (ImpressionGPT) | 44.7 | 63.9 | 58.8 | 44.8 |
| GPT-4 (ImpressionGPT) | 46.0 | 64.9 | 64.6 | 46.5 |
| Hu et al. (2022) | 47.1 | 54.5 | 64.5 | |
| DoT5 (Liu et al., 2023a) | 11.7 | 25.8 | | |

(n): number of random shots; *: similarity-based example selection; R.: RougeL; CB.: CheXbert.
Table 9: Percentage (%) with which the GPT-4 (ImpressionGPT) generated impression is equivalent or preferred compared with an existing manually-written one, according to a radiologist.
| Sample (n) | Manual Imp. preferred | Equiv. | GPT-4 preferred | Ambig. |
|---|---|---|---|---|
| Open-i (80) | 28.8 | 43.8 | 26.3 | 1.3 |
| MIMIC (40) | 25.0 | 10.0 | 57.5 | 7.5 |

Equiv.: equivalent; Ambig.: ambiguous; Manual Imp.: existing manual impression
5 Discussion
5.1 Error analysis and GPT-4 consistency
Moving beyond quantitative scores, we manually reviewed all GPT-4 errors in all the tasks (a detailed analysis is shown in Appendix B). We further analysed the consistency of the errors for a selection of tasks and report the error breakdown in Table 10. We found the majority of the errors are either ambiguous or label noise. As an example of ambiguity, GPT-4 is extremely strict in identifying paraphrases and argues that one sentence contains minor additional information or a slightly different emphasis. In fact, for the sentence similarity, disease progression, and disease classification tasks, the model mistakes are <1% of the test set (Table 1). We believe GPT-4 is achieving near-ceiling performance on these tasks. For entity extraction and findings summarisation, we found that the GPT-4 output for many of the error cases is not necessarily wrong, but is offering an alternative to the schema or style in the dataset. This is verified by our qualitative analysis in Appendix B.5 and Section 4.6.
It is important to note that GPT-4 in our current study still makes occasional mistakes. Some mistakes are unstable across runs and can be corrected by self-consistency. Table 10 shows that GPT-4 is mostly consistent, and, for the few cases of inconsistent output, self-consistency can correct most of the model mistakes that occur in minority runs. Another helpful strategy is to ask GPT-4 to defer when it is uncertain, as demonstrated by the disease classification experiments (Appendix B.3).
The remaining model mistakes are mostly cases where nuanced domain knowledge is required. For example, GPT-4 mistakenly equates 'lungs are hyperinflated but clear' with 'lungs are well-expanded and clear' in MS-CXR-T. The former indicates an abnormality while the latter is describing normal lungs. We should point out that this mistake does not mean GPT-4 is fundamentally lacking the knowledge. In fact, when asked explicitly about it in isolation (e.g., the difference between 'hyperinflated' and 'well-expanded' lungs), or when we reduce the complexity of the two sentences to 'lungs are hyperinflated' and 'lungs are well-expanded', GPT-4 is able to differentiate the two terms (Table B.3). We interpret it as nuanced radiology knowledge not being guaranteed to always surface for all contexts with all various prompts. While future prompting strategies might help with these cases, we must acknowledge that potential model mistakes cannot be fully ruled out. Therefore, a human in the loop is still required for safety-critical applications.
Table 10: Self-consistency error analysis for GPT-4. Errors are categorised by whether they are consistent, occurring in minority runs (SC correct) or occurring in majority runs (SC incorrect). We further categorise errors into model mistakes and others (ambiguous or label noise). We observe the majority of the errors are consistent and many errors are not model mistakes. Within the cases of inconsistent output, self-consistency can correct most of the model mistakes. GPT-4 zero-shot performance is reported in this table (disease classification results are after we defer from the uncertain cases). Error breakdowns for other single-run experiments are in Table F.1.
| Task | Consistent: Mistake | Consistent: Other | SC correct: Corrected mistake | SC correct: Other | SC incorrect: Mistake | SC incorrect: Other | Total |
|---|---|---|---|---|---|---|---|
| Temporal sentence similarity | 0% | 72% | 10% | 0% | 0% | 18% | 11 |
| Sentence similarity (RadNLI) | 11% | 78% | 0% | 0% | 0% | 11% | 9 |
| RadNLI | 55% | 31% | 6% | 0% | 2% | 6% | 49 |
| Disease classification | 22% | 67% | 11% | 0% | 0% | 0% | 9 |
| All | 38% | 46% | 6% | 0% | 1% | 8% | 78 |
5.2 GPT-4 vs SOTA radiology models
Throughout the experiments, we first observed a significant jump in performance of GPT-4 compared with the prior GPT-3.5 models (text-davinci-003 and ChatGPT), confirming the findings from previous studies (Nori et al., 2023). We then summarised the overall GPT-4 performance compared with radiology SOTA in Table 1. The key finding is that GPT-4 outperforms or is on par with SOTA radiology models in the broad range of tasks considered. We further notice that different tasks require different prompting efforts and strategies. For tasks such as sentence similarity, RadNLI, and disease progression, the task requirements can be clearly defined in the instruction. (For example, there is a clear logical definition for 'entailment', 'neutral', and 'contradiction' in NLI.) For such 'learn-by-instruction' tasks, a simple zero-shot prompting strategy for GPT-4 can yield significant gains over task-specific baselines or nearly ceiling performance. Disease classification does not fall into this category due to the ambiguity in how to assign labels for the uncertain cases. Here, GPT-4 requires 10 examples to achieve comparable near-ceiling performance with previous SOTA. We show that zero-shot GPT-4 can also achieve near-ceiling performance if we defer from uncertain cases (Table 5) in this task. Another key point to note is that GPT-4 is a better choice than the previous SOTA Chest ImaGenome silver annotations for disease and disease progression classification, as the silver annotations are from rule-based systems that are not available for re-use on other datasets.
Different from the above-mentioned tasks, it is not straightforward to articulate requirements in the instruction for entity extraction and findings summarisation. For entity extraction, the exact definition of observation and anatomy is schema-specific and in many cases can only be inferred from training examples. For findings summarisation, while there are general rule-of-thumb principles for writing a good impression, it is not possible to write down detailed instructions regarding the exact phrasing and style of the impressions in a particular dataset. We call these 'learn-by-example' tasks. Task-specific supervised models perform competitively on such tasks, as they can explicitly learn an in-domain distribution from all training examples. We found significant improvement of GPT models with an increased number of examples compared with zero-shot, and GPT-4 with example selection can match supervised baselines. Future research can explore ways to combine GPT-4 and supervised models (e.g. treating the latter as plugins; Shen et al. 2023; Xu et al. 2023).
6 Conclusion
This study evaluates GPT-4 on a diverse range of common radiology text-based tasks. We found GPT-4 either outperforms or is on par with task-specific radiology models. GPT-4 requires the least prompting effort for the 'learn-by-instruction' tasks whose requirements can be clearly defined in the instruction. Our extensive error analysis shows that although it occasionally fails to surface domain knowledge, GPT-4 has substantial capability in the processing and analysis of radiology text, achieving near-ceiling performance in many tasks.

7 Limitations
In this paper, we focused on GPT-4 as it is the most capable and best-performing LLM across many domains at present, and we would like to establish the best we can do with LLMs in radiology. We leave it for future research to test and compare GPT-4 performance with other LLMs. In addition, as GPT-4 with the current prompting strategies in the study already achieves near-ceiling performance in many tasks, we leave an exhaustive experimentation of all existing prompting strategies for future research. For example, we have not explored the more recently proposed advanced prompting techniques including tree of thought (Yao et al., 2023) and self-critique (Shinn et al., 2023), and we encourage future research to apply such techniques to help improve the reliability of GPT-4. Also, due to resource constraints, we did not perform self-consistency exhaustively for all tasks and for all GPT models. That being said, we believe the findings from this paper should already represent what an average user can get out of using GPT models on these tasks. The insights and learnings will be useful for designing future prompting strategies for radiology tasks, where particular tasks or error cases will require more prompting effort.
Our error analysis shows that many of the existing radiology tasks contain intrinsic ambiguities and label noise, and we call for more quality control when creating evaluation benchmarks in the future. Finally, our qualitative evaluation of the findings summarisation task is limited to a single radiologist. This is a subjective assessment that will be influenced by the radiologist's own style and preference. The ideal scenario would be to ask radiologists who participated in the creation of the MIMIC or Open-i dataset to perform the assessment so that they have the same styling preference as the dataset. We are also planning to conduct more nuanced qualitative evaluation addressing different aspects of the summary in the future.
8 Ethical Considerations
We would like to assure the readers that the experiments in this study were conducted using Azure OpenAI services, which have all the compliance requirements of any other Azure service. Azure OpenAI is HIPAA compliant and preserves data privacy and compliance of the medical data (e.g., the data are not available to OpenAI). More details can be found at
https://azure.microsoft.com/en-gb/resources/microsoft-azure-compliance-offerings, https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy, and https://learn.microsoft.com/en-us/answers/questions/1245418/hipaa-compliance. All the public datasets used in this paper were also reviewed by MSR (Microsoft Research) IRB (OHRP parent organization number IORG #0008066, IRB #IRB00009672) under reference numbers RCT4053 and ERP10284. IRB Decision: approved – Not Human Subjects Research (per 45 §46.102(e)(1)(ii), 45 §46.102(e)(5)).
Acknowledgments
We would like to thank the anonymous reviewers and area chairs for their helpful suggestions. We would also like to thank Hannah Richardson, Harsha Nori, Maximilian Ilse and Melissa Bristow for their valuable feedback.
References
Lisa C Adams, Daniel Truhn, Felix Busch, Avan Kader, Stefan M Niehues, Marcus R Makowski, and Keno K Bressem. 2023. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology, page 230725.
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Pérez-García, Maximilian Ilse, Daniel C. Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, and Ozan Oktay. 2023a. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15016–15027.
Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Pérez-García, Max Ilse, Daniel Coelho de Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anton Schwaighofer, Maria Teodora Wetscherek, Hannah Richardson, Tristan Naumann, Javier Alvarez Valle, and Ozan Oktay. 2023b. MS-CXR-T: Learning to exploit temporal structure for biomedical vision-language processing. PhysioNet.
Rajesh Bhayana, Robert R Bleakney, and Satheesh Krishna. 2023a. GPT-4 in radiology: Improvements in advanced reasoning. Radiology, page 230987.
Rajesh Bhayana, Satheesh Krishna, and Robert R Bleakney. 2023b. Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology, page 230582.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Sébastien Bubeck、Varun Chandrasekaran、Ronen Eldan、Johannes Gehrke、Eric Horvitz、Ece Kamar、Peter Lee、Yin Tat Lee、Yuanzhi Li、Scott Lundberg 等。2023. 通用人工智能(AGI)的火花: GPT-4早期实验。arXiv预印本 arXiv:2303.12712。
Kunming Cheng, Qiang Guo, Yongbin He, Yanqiu Lu, Shuqin Gu, and Haiyang Wu. 2023. Exploring the potential of GPT-4 in biomedical engineering: the dawn of a new era. Annals of Biomedical Engineering, pages 1–9.
Kunming Cheng、Qiang Guo、Yongbin He、Yanqiu Lu、Shuqin Gu 和 Haiyang Wu。2023。探索GPT-4在生物医学工程中的潜力:新时代的曙光。《生物医学工程年鉴》,第1-9页。
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Aakanksha Chowdhery、Sharan Narang、Jacob Devlin、Maarten Bosma、Gaurav Mishra、Adam Roberts、Paul Barham、Hyung Won Chung、Charles Sutton、Sebastian Gehrmann 等. 2022. PaLM: 基于Pathways扩展的语言建模. arXiv预印本 arXiv:2204.02311.
Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. 2016. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310.
Dina Demner-Fushman、Marc D Kohli、Marc B Rosenman、Sonya E Shooshan、Laritza Rodriguez、Sameer Antani、George R Thoma 和 Clement J McDonald。2016。准备用于分发和检索的放射学检查数据集。《美国医学信息学会杂志》23(2):304–310。
Jessica López Espejel, El Hassane Ettifouri, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham, and Walid Dahhane. 2023. GPT-3.5 vs GPT-4: Evaluating ChatGPT’s reasoning performance in zero-shot learning. arXiv preprint arXiv:2305.12477.
Jessica López Espejel、El Hassane Ettifouri、Mahaman Sanoussi Yahaya Alassan、El Mehdi Chouham 和 Walid Dahhane。2023。GPT-3.5 与 GPT-4:评估 ChatGPT 在零样本 (Zero-shot) 学习中的推理性能。arXiv 预印本 arXiv:2305.12477。
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23.
Yu Gu、Robert Tinn、Hao Cheng、Michael Lucas、Naoto Usuyama、Xiaodong Liu、Tristan Naumann、Jianfeng Gao 和 Hoifung Poon。2021。面向生物医学自然语言处理的领域专用语言模型预训练。ACM Transactions on Computing for Healthcare (HEALTH),3(1):1–23。
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.
Dan Hendrycks、Collin Burns、Steven Basart、Andy Zou、Mantas Mazeika、Dawn Song 和 Jacob Steinhardt。2021。大规模多任务语言理解评估。发表于国际学习表征会议。
Jinpeng Hu, Zhuo Li, Zhihong Chen, Zhuguo Li, Xiang Wan, and Tsung-Hui Chang. 2022. Graph enhanced contrastive learning for radiology findings summarization. In Annual Meeting of the Association for Computational Linguistics.
Jinpeng Hu、Zhuo Li、Zhihong Chen、Zhuguo Li、Xiang Wan 和 Tsung-Hui Chang。2022。放射学发现总结的图增强对比学习。发表于计算语言学协会年会。
Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, pages 590–597.
Jeremy Irvin、Pranav Rajpurkar、Michael Ko、Yifan Yu、Silviana Ciurea-Ilcus、Chris Chute、Henrik Marklund、Behzad Haghgoo、Robyn Ball、Katie Shpanskaya 等。2019。CheXpert:一个带有不确定性标注及专家比对的大规模胸部X光数据集。见《第三十三届AAAI人工智能会议暨第三十一届人工智能创新应用会议暨第九届AAAI人工智能教育进展研讨会论文集》,第590–597页。
Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. 2021. RadGraph: Extracting clinical entities and relations from radiology reports. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
Saahil Jain、Ashwin Agrawal、Adriel Saporta、Steven Truong、Tan Bui、Pierre Chambon、Yuhao Zhang、Matthew P Lungren、Andrew Y Ng、Curtis Langlotz 等。2021。RadGraph:从放射学报告中提取临床实体和关系。载于《第三十五届神经信息处理系统大会数据集与基准测试赛道(第一轮)》。
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chihying Deng, Roger G Mark, and Steven Horng. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):1–8.
Alistair EW Johnson、Tom J Pollard、Seth J Berkowitz、Nathaniel R Greenbaum、Matthew P Lungren、Chihying Deng、Roger G Mark 和 Steven Horng。2019。MIMIC-CXR:一个经过去标识化处理且公开可用的胸部X光片及自由文本报告数据库。《科学数据》,6(1):1-8。
Peter Lee, Sebastien Bubeck, and Joseph Petro. 2023. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239.
Peter Lee、Sebastien Bubeck 和 Joseph Petro. 2023. GPT-4作为医学AI聊天机器人的优势、局限与风险. 新英格兰医学杂志, 388(13):1233–1239.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research (TMLR).
Percy Liang、Rishi Bommasani、Tony Lee、Dimitris Tsipras、Dilara Soylu、Michihiro Yasunaga、Yian Zhang、Deepak Narayanan、Yuhuai Wu、Ananya Kumar 等. 2023. 语言模型的整体评估. 机器学习研究汇刊 (TMLR).
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: 自动摘要评估工具包。In Text Summarization Branches Out, pages 74-81, Barcelona, Spain. Association for Computational Linguistics.
Fangyu Liu, Qianchu Liu, Shruthi Bannur, Fernando Pérez-García, Naoto Usuyama, Sheng Zhang, Tristan Naumann, Aditya Nori, Hoifung Poon, Javier Alvarez-Valle, Ozan Oktay, and Stephanie L. Hyland. 2023a. Compositional zero-shot domain transfer with text-to-text models.
Fangyu Liu、Qianchu Liu、Shruthi Bannur、Fernando Pérez-García、Naoto Usuyama、Sheng Zhang、Tristan Naumann、Aditya Nori、Hoifung Poon、Javier Alvarez-Valle、Ozan Oktay 和 Stephanie L. Hyland。2023a。基于文本到文本模型的组合式零样本领域迁移。
Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023b. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv preprint arXiv:2304.03439.
Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023b. 评估ChatGPT和GPT-4的逻辑推理能力。arXiv预印本 arXiv:2304.03439。
Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. 2022. 什么构成了GPT-3的良好上下文示例? 见《深度学习内外会议论文集》(DeeLIO 2022): 第三届深度学习架构知识提取与集成研讨会, 第100-114页。
Qing Lyu, Josh Tan, Michael E Zapadka, Janardhana Ponnatapura, Chuang Niu, Kyle J Myers, Ge Wang, and Christopher T Whitlow. 2023. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Visual Computing for Industry, Biomedicine, and Art, 6(1):9.
Qing Lyu、Josh Tan、Michael E Zapadka、Janardhana Ponnatapura、Chuang Niu、Kyle J Myers、Ge Wang和Christopher T Whitlow。2023. 基于提示学习的ChatGPT与GPT-4将放射学报告转化为通俗语言:结果、局限性与潜力。Visual Computing for Industry, Biomedicine, and Art, 6(1):9.
Chong Ma, Zihao Wu, Jiaqi Wang, Shaochen Xu, Yaonai Wei, Zhengliang Liu, Lei Guo, Xiaoyan Cai, Shu Zhang, Tuo Zhang, et al. 2023. ImpressionGPT: An iterative optimizing framework for radiology report summarization with ChatGPT. arXiv preprint arXiv:2304.08448.
Chong Ma、Zihao Wu、Jiaqi Wang、Shaochen Xu、Yaonai Wei、Zhengliang Liu、Lei Guo、Xiaoyan Cai、Shu Zhang、Tuo Zhang 等. 2023. ImpressionGPT: 基于 ChatGPT 的放射学报告摘要迭代优化框架. arXiv 预印本 arXiv:2304.08448.
Yasuhide Miura, Yuhao Zhang, Emily Tsai, Curtis Langlotz, and Dan Jurafsky. 2021. Improving factual completeness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5288–5304.
Yasuhide Miura、Yuhao Zhang、Emily Tsai、Curtis Langlotz 和 Dan Jurafsky。2021。提升影像到文本放射学报告生成的事实完整性与一致性。载于《2021年北美计算语言学协会人类语言技术会议论文集》,第5288–5304页。
Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
Harsha Nori、Nicholas King、Scott Mayer McKinney、Dean Carignan 和 Eric Horvitz。2023. GPT-4在医学挑战问题上的能力。arXiv预印本 arXiv:2303.13375。
OpenAI. 2023. GPT-4 technical report.
OpenAI. 2023. GPT-4技术报告。
Mercy Ranjit, Gopinath Ganapathy, Ranjit Manuel, and Tanuja Ganu. 2023. Retrieval augmented chest X-ray report generation using OpenAI GPT models.
Mercy Ranjit、Gopinath Ganapathy、Ranjit Manuel 和 Tanuja Ganu。2023。基于 OpenAI GPT 模型的检索增强型胸部 X 光报告生成。
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: 利用ChatGPT及其Hugging Face生态伙伴解决AI任务. arXiv预印本 arXiv:2303.17580.
Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.
Noah Shinn、Beck Labash 和 Ashwin Gopinath。2023. Reflexion: 具备动态记忆与自我反思能力的自主智能体。arXiv 预印本 arXiv:2303.11366。
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2022. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
Karan Singhal、Shekoofeh Azizi、Tao Tu、S Sara Mahdavi、Jason Wei、Hyung Won Chung、Nathan Scales、Ajay Tanwani、Heather Cole-Lewis、Stephen Pfohl等。2022. 大语言模型编码临床知识。arXiv预印本 arXiv:2212.13138。
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. 2023. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.
Karan Singhal、Tao Tu、Juraj Gottweis、Rory Sayres、Ellery Wulczyn、Le Hou、Kevin Clark、Stephen Pfohl、Heather Cole-Lewis、Darlene Neal 等. 2023. 基于大语言模型的专家级医学问答研究. arXiv预印本 arXiv:2305.09617.
Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Ng, and Matthew Lungren. 2020. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1500–1519, Online. Association for Computational Linguistics.
Akshay Smit、Saahil Jain、Pranav Rajpurkar、Anuj Pareek、Andrew Ng 和 Matthew Lungren。2020。使用 BERT 结合自动标注工具与专家标注实现精准放射学报告标注。载于《2020年自然语言处理实证方法会议论文集》(EMNLP),第1500–1519页,线上会议。计算语言学协会。
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron、Thibaut Lavril、Gautier Izacard、Xavier Martinet、Marie-Anne Lachaux、Timothée Lacroix、Baptiste Rozière、Naman Goyal、Eric Hambro、Faisal Azhar、Aurelien Rodriguez、Armand Joulin、Edouard Grave 和 Guillaume Lample。2023。LLaMA: 开放高效的基础语言模型。arXiv预印本 arXiv:2302.13971。
Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Zambrano Chaves, Curtis Langlotz, Akshay Chaudhari, and John Pauly. 2023. RadAdapt: Radiology report summarization via lightweight domain adaptation of large language models. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 449–460, Toronto, Canada. Association for Computational Linguistics.
Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Zambrano Chaves, Curtis Langlotz, Akshay Chaudhari, and John Pauly. 2023. RadAdapt: 基于大语言模型轻量级领域自适应的放射学报告摘要生成。载于《第22届生物医学自然语言处理及BioNLP共享任务研讨会论文集》,第449-460页,加拿大多伦多。计算语言学协会。
David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5784–5789, Hong Kong, China. Association for Computational Linguistics.
David Wadden、Ulme Wennberg、Yi Luan 和 Hannaneh Hajishirzi。2019。基于上下文跨度表示的实体、关系和事件抽取。载于《2019年自然语言处理实证方法会议暨第九届自然语言处理国际联合会议(EMNLP-IJCNLP)论文集》,第5784–5789页,中国香港。计算语言学协会。
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasonin