Radiology-GPT: A Large Language Model for Radiology
Radiology-GPT: 面向放射学的大语言模型
Zhengliang Liu $^{1}$ , Aoxiao Zhong $^2$ , Yiwei Li $^1$ , Longtao Yang $^3$ , Chao Ju $^3$ , Zihao Wu $^1$ , Chong Ma $^4$ , Peng Shu $^{1}$ , Cheng Chen $^5$ , Sekeun Kim $^5$ , Haixing Dai $^{1}$ , Lin Zhao $^{1}$ , Lichao Sun $^6$ , Dajiang Zhu $7$ , Jun Liu $^3$ , Wei Liu $^8$ , Dinggang Shen $^{9,10,11}$ , Xiang Li $^5$ , Quanzheng Li $^5$ , and Tianming Liu1
郑亮 刘 $^{1}$ , 钟傲霄 $^2$ , 李一苇 $^1$ , 杨龙涛 $^3$ , 鞠超 $^3$ , 吴子豪 $^1$ , 马冲 $^4$ , 舒鹏 $^{1}$ , 陈诚 $^5$ , Kim Sekeun $^5$ , 戴海星 $^{1}$ , 赵琳 $^{1}$ , 孙立超 $^6$ , 朱大江 $7$ , 刘军 $^3$ , 刘伟 $^8$ , 沈定刚 $^{9,10,11}$ , 李翔 $^5$ , 李全正 $^5$ , 刘天明 $1$
1School of Computing, University of Georgia $^2$ Department of Electrical Engineering, Harvard University $^{3}$ Department of Radiology, Second Xiangya Hospital $^4$ School of Automation, Northwestern Polytechnic al University $^5$ Department of Radiology, Massachusetts General Hospital and Harvard Medical School $^6$ Department of Computer Science and Engineering, Lehigh University 7 Department of Computer Science and Engineering, University of Texas at Arlington $^{8}$ Department of Radiation Oncology, Mayo Clinic $^{9}$ School of Biomedical Engineering, Shanghai Tech University $^{10}$ Shanghai United Imaging Intelligence Co., Ltd. $_ {11}$ Shanghai Clinical Research and Trial Center
1 佐治亚大学计算学院
2 哈佛大学电气工程系
3 中南大学湘雅二医院放射科
4 西北工业大学自动化学院
5 麻省总医院及哈佛医学院放射科
6 利哈伊大学计算机科学与工程系
7 德克萨斯大学阿灵顿分校计算机科学与工程系
8 梅奥诊所放射肿瘤科
9 上海科技大学生物医学工程学院
10 上海联影智能医疗科技有限公司
11 上海市临床研究伦理中心
Abstract
摘要
We introduce Radiology-GPT, a large language model for radiology. Using an instruction tuning approach on an extensive dataset of radiology domain knowledge, Radiology-GPT demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA. It exhibits significant versatility in radiological diagnosis, research, and communication. This work serves as a catalyst for future developments in clinical NLP. The successful implementation of Radiology-GPT is indicative of the potential of localizing generative large language models, specifically tailored for distinctive medical specialties, while ensuring adherence to privacy standards such as HIPAA. The prospect of developing individualized, large-scale language models that cater to specific needs of various hospitals presents a promising direction. The fusion of conversational competence and domain-specific knowledge in these models is set to foster future development in healthcare AI. A demo of Radiology-GPT is available at https://hugging face.co/spaces/allen-eric/radiology-gpt.
我们推出Radiology-GPT,这是一款专为放射学设计的大语言模型。通过在大量放射学领域知识数据集上采用指令微调方法,Radiology-GPT展现出相较于StableLM、Dolly和LLaMA等通用语言模型的卓越性能。该模型在放射学诊断、研究和交流方面表现出显著的多功能性。这项工作为临床自然语言处理的未来发展提供了催化剂。Radiology-GPT的成功实施表明,针对特定医学专科定制生成式大语言模型并确保符合HIPAA等隐私标准具有巨大潜力。开发满足不同医院特定需求的个性化大规模语言模型,展现出一个充满前景的方向。这些模型将对话能力与领域专业知识相融合,必将推动医疗AI的未来发展。Radiology-GPT演示版可在https://huggingface.co/spaces/allen-eric/radiology-gpt访问。
1 Introduction
1 引言
The recent rise of large language models (LLMs) such as ChatGPT [1] and GPT-4 [2] has brought about a significant transformation in the field of natural language processing (NLP) [3, 1]. These models demonstrate unprecedented abilities and versatility, a clear progression from preceding models such as BERT [4], and have engendered broad advancements in diverse domains [1, 5, 6].
近期,以ChatGPT [1]和GPT-4 [2]为代表的大语言模型(LLM)的崛起,为自然语言处理(NLP)领域带来了重大变革[3, 1]。这些模型展现出前所未有的能力与通用性,相较BERT [4]等前代模型实现了显著突破,并在多个领域推动了广泛进展[1, 5, 6]。
Among the fields that can significantly benefit from these advancements is radiology. Radiology, by its very nature, generates a vast amount of textual data, including radiology reports, clinical notes, annotations associated with medical imaging, and more [7]. Examples of such texts include radiographic findings, annotations for Computerized Tomography (CT) scans, and Magnetic Resonance
能够显著受益于这些进步的领域之一是放射学。放射学本质上会产生大量文本数据,包括放射学报告、临床记录、与医学影像相关的注释等[7]。这类文本的例子包括放射学检查结果、计算机断层扫描(CT)的注释和磁共振
Imaging (MRI) reports, all of which require sophisticated understanding and interpretation.
影像学 (MRI) 报告,均需要复杂的理解和解读。
Despite the transformative potential, the application of LLMs in the radiology domain has been limited [8, 7, 9]. Large commercial models such as GPT-4 and PaLM-2 [10], while powerful, are not readily applicable in clinical settings. HIPAA regulations, privacy concerns, and the necessity for IRB approvals pose substantial barriers [11], primarily because these models necessitate the uploading of patient data to externally hosted platforms.
尽管具有变革潜力,大语言模型(LLM)在放射学领域的应用仍然有限[8,7,9]。GPT-4和PaLM-2[10]等大型商业模型虽然强大,却难以直接应用于临床环境。HIPAA法规、隐私问题以及IRB审批要求构成了重大障碍[11],这主要是因为这些模型需要将患者数据上传至外部托管平台。
This situation underscores the pressing need for a localized foundational model, specifically designed for radiology, that can work effectively within the regulatory boundaries while capitalizing on the potential of LLMs. We address this gap through the development of Radiology-GPT, an LLM specifically designed for the radiology domain.
这一现状凸显出对专为放射学设计的本地化基础模型的迫切需求,该模型需在监管范围内有效运作,同时充分发挥大语言模型的潜力。我们通过开发专为放射学领域设计的大语言模型Radiology-GPT来解决这一空白。
A crucial advantage of generative large language models such as Radiology-GPT, particularly in comparison to domain-specific BERT-based models such as RadBERT [12] (for radiology) or Clinical Radio BERT [13] (for radiation oncology), lies in their flexibility and dynamism. Unlike their BERT-based counterparts, generative LLMs are not rigidly dependent on a specific input structure, allowing for a more diverse range of inputs. Additionally, they are capable of generating diverse outputs, making them suitable for tasks that were previously considered impractical, such as reasoning [14, 15]. This further extends their versatility.
Radiology-GPT等生成式大语言模型 (Generative Large Language Model) 的关键优势在于其灵活性和动态性,这尤其体现在与RadBERT [12] (针对放射学) 或Clinical Radio BERT [13] (针对放射肿瘤学) 等基于BERT的领域专用模型对比时。与基于BERT的模型不同,生成式大语言模型不严格依赖特定输入结构,可接受更广泛的输入类型。此外,它们能生成多样化输出,使其适用于以往被认为不切实际的任务 (如推理 [14, 15]),从而进一步扩展了多功能性。
This inherent conversational capability of generative LLMs [1] positions them as invaluable aids to medical professionals, including radiologists. They can provide contextual insights and responses in a conversational manner, mimicking a human-like interaction, and hence enhancing the usability of these models in a clinical setting.
生成式大语言模型 [1] 这种与生俱来的对话能力,使其成为包括放射科医生在内的医疗专业人员的宝贵助手。它们能以对话方式提供情境化见解和响应,模拟类人交互,从而提升这些模型在临床环境中的实用性。
Furthermore, these models eliminate the need for complex fine-tuning processes and the associated labor-intensive manual annotation procedures. This attribute reduces both the development time and cost, making the adoption of such models more feasible and appealing in a clinical context.
此外,这些模型省去了复杂的微调过程和相应的高强度人工标注流程。这一特性既缩短了开发时间又降低了成本,使得此类模型在临床环境中更具可行性和吸引力。
Radiology-GPT outperforms other instruction-tuned models not specially trained for radiology, such as Satbility AI’s Stable LM [16] and Databrick’s Dolly [17]. The model is trained on the MIMIC-CXR dataset [18], affording it a deep understanding of radiology-specific language and content.
放射学GPT的表现优于其他未针对放射学专门训练的指令调优模型,例如Stability AI的Stable LM [16] 和Databrick的Dolly [17]。该模型基于MIMIC-CXR数据集 [18] 进行训练,使其对放射学专用语言和内容具备深刻理解。
Key contributions of our work include:
我们工作的主要贡献包括:
With these advancements, we believe Radiology-GPT holds immense potential to revolutionize the interpretation of radiology reports and other radiology-associated texts, making it a promising tool for clinicians and researchers alike.
随着这些进步,我们相信 Radiology-GPT 在革新放射学报告及其他放射学相关文本解读方面具有巨大潜力,使其成为临床医生和研究人员的有力工具。
2 Related work
2 相关工作
2.1 Large language models
2.1 大语言模型
In recent years, the field of Natural Language Processing (NLP) has undergone a transformative shift with the advent of large language models (LLMs). These models, such as GPT-3 [19], GPT-4 [2], and PaLM [20], PaLM-2 [10], offer enhanced language understanding capabilities, transforming the way language is processed, analyzed, and generated [1, 3]. A notable shift in this evolution has been the transition from the pre-training and fine-tuning paradigm of BERT [4], GPT [21], GPT-2 [22] and their variants [23, 24, 5]. In contrast, LLMs have introduced impressive few-shot and zero-shot learning capabilities achieved through in-context learning, offering a leap in both the performance and application scope of language models.
近年来,随着大语言模型(LLM)的出现,自然语言处理(NLP)领域发生了革命性转变。以GPT-3[19]、GPT-4[2]、PaLM[20]和PaLM-2[10]为代表的模型显著提升了语言理解能力,彻底改变了语言处理、分析和生成的方式[1,3]。这一演进过程中的显著变化是从BERT[4]、GPT[21]、GPT-2[22]及其变体[23,24,5]的预训练-微调范式,转向通过上下文学习实现惊艳的少样本和零样本学习能力,使语言模型的性能和应用范围都实现了飞跃。
Alongside the growth of these powerful, commercially developed LLMs, the field has also seen the emergence of open-source alternatives. Models such as LLaMA [25] and Bloom [26] have extended the reach of LLMs, fostering democratization and in clu siv it y by offering the research community accessible and replicable models.
随着这些由商业开发的大语言模型 (LLM) 不断发展壮大,开源替代方案也在该领域崭露头角。LLaMA [25] 和 Bloom [26] 等模型通过为研究社区提供可获取且可复现的模型,扩大了大语言模型的影响力,促进了技术民主化与包容性。
Building on open-source LLMs, the research community has begun to explore and develop instructiontuned language models that can effectively interpret and execute complex instructions within the input text. Exemplifying this trend are models such as Alpaca [27], StableLM [16], and Dolly [17]. Alpaca, fine-tuned from Meta’s LLaMA 7B model, has shown that even smaller models can achieve behaviors comparable to larger, closed-source models like OpenAI’s text-davinci-003, while being more cost-effective and easily reproducible.
基于开源大语言模型,研究社区已开始探索开发能够有效解析和执行输入文本中复杂指令的指令调优语言模型。这一趋势的代表性模型包括Alpaca [27]、StableLM [16]和Dolly [17]。以Meta的LLaMA 7B模型微调而来的Alpaca证明,即使较小模型也能实现与OpenAI text-davinci-003等闭源大模型相当的表现,同时更具成本效益且易于复现。
2.2 Domain-specific language models
2.2 领域专用大语言模型
Domain-Specific Language Models (DSLM) are models that are specifically trained or fine-tuned on text data pertaining to a particular domain or sector [28, 29]. By focusing on domain-relevant tasks and language patterns, these models seek to offer superior performance for tasks within their specialized domain.
领域专用语言模型 (Domain-Specific Language Models, DSLM) 指针对特定领域或行业的文本数据进行专门训练或微调的模型 [28, 29]。通过聚焦领域相关任务和语言模式,这类模型旨在其专业领域内提供更优的任务性能。
For instance, AgriBERT [30], a domain-specific variant of BERT, is trained on an extensive corpus of text data related to agriculture and food sciences. It effectively captures specific jargon and language patterns typical of this sector, making it a valuable tool for tasks such as crop disease classification and food quality assessment.
例如,AgriBERT [30] 作为 BERT 的领域专用变体,经过大量农业与食品科学相关文本数据的训练。它能有效捕捉该领域特有的术语和语言模式,成为作物病害分类和食品质量评估等任务中的实用工具。
In the field of education, SciEdBERT [8] exemplifies the potential of DSLMs. SciEdBERT is pretrained on student response data to middle school chemistry and physics questions. It is designed to understand and evaluate the language and reasoning of students in these domains, enabling accurate evaluation of student responses and providing insights into student understanding and misconceptions.
在教育领域,SciEdBERT [8] 展现了领域专用大语言模型 (DSLM) 的潜力。该模型基于中学生化学和物理答题数据进行预训练,旨在理解并评估学生在这些学科中的语言表达与推理能力,从而实现对答题内容的精准评估,并揭示学生的知识掌握情况与认知误区。
Similarly, in healthcare, Clinical Radio BERT [13] has been developed for radiation oncology, training on clinical notes and radiotherapy literature to understand and generate radiation oncology-related text. These examples illustrate the diversity and potential of DSLMs in various sectors, enhancing performance on domain-specific tasks by leveraging tailored language patterns and specific knowledge.
同样,在医疗健康领域,临床放射BERT (Clinical Radio BERT) [13] 被开发用于放射肿瘤学,通过训练临床记录和放射治疗文献来理解和生成与放射肿瘤学相关的文本。这些例子展示了领域特定语言模型 (DSLM) 在各行业的多样性和潜力,通过利用定制化的语言模式和特定知识来提升领域特定任务的性能。
While LLMs and DSLMs each have their strengths, there is a growing need for models that leverage the strengths of both, which is particularly pronounced in areas with specialized jargon and extensive bodies of text data, such as radiology.
虽然大语言模型和领域专用语言模型各有优势,但在放射学等专业术语繁多且文本数据量庞大的领域,越来越需要能结合两者优势的模型。

Figure 1: The overall framework of Radiology-GPT.
图 1: Radiology-GPT 的整体框架。
3 Methodology
3 方法论
Our approach to building Radiology-GPT involves a two-step process: preprocessing of the dataset and fine-tuning the model with instruction following. The pipeline of our method can be seen in Figure 1.
我们构建Radiology-GPT的方法分为两个步骤:数据集的预处理和遵循指令的模型微调。方法流程如图1所示。
3.1 Dataset and preprocessing
3.1 数据集与预处理
We base our work on the publicly available MIMIC-CXR dataset [18], a large, publicly available dataset of chest X-rays (CXRs). MIMIC-CXR contains de-identified medical data from over 60,000 patients who were admitted to the Beth Israel Deaconess Medical Center between 2001 and 2012.
我们的工作基于公开可用的MIMIC-CXR数据集[18],这是一个大型公开胸部X光(CXR)数据集。MIMIC-CXR包含2001年至2012年间贝斯以色列女执事医疗中心超过60,000名患者的去标识化医疗数据。
We focus on the radiology reports available in this dataset, as they provide rich textual information about patients’ imaging findings and the radiologists’ interpretations. The reports typically contain two sections that correspond to each other: "Findings" and "Impression". The "Findings" section includes detailed observations from the radiology images, whereas the "Impression" section includes the summarized interpretations drawn from those observations.
我们重点关注该数据集中可用的放射学报告,因为它们提供了关于患者影像学表现和放射科医生解读的丰富文本信息。报告通常包含两个相互对应的部分:"发现"和"印象"。"发现"部分包含从放射学图像中观察到的详细结果,而"印象"部分则包含根据这些观察结果总结出的解读。
To prepare the data for training, we pre processed these reports to extract the "Findings" and "Impression" sections from each report and organized them into pairs. The preprocessing involved removing irrelevant sections, standardizing terminologies, and handling missing or incomplete sections. Specifically, we excluded ineligible reports through the following operations: (1) remove reports without finding or impression sections, (2) remove reports whose finding section contained less than 10 words, and (3) remove reports whose impression section contained less than 2 words. We apply the official split published by [18] and finally obtain 122,014/957/1,606 reports for train/val/test set.
为准备训练数据,我们对这些报告进行了预处理,从每份报告中提取"发现"和"印象"部分并组织成配对。预处理包括移除无关章节、标准化术语处理以及处理缺失或不完整的章节。具体而言,我们通过以下操作排除了不符合要求的报告:(1) 移除没有发现或印象章节的报告,(2) 移除发现章节少于10个单词的报告,(3) 移除印象章节少于2个单词的报告。我们采用[18]发布的官方划分方式,最终获得122,014/957/1,606份报告分别用于训练集/验证集/测试集。

Figure 2: The instruction-tuning process of Radiology-GPT.
图 2: Radiology-GPT的指令微调流程。
In addition, we also pre processed the OpenI dataset [31] based on the above exclusion operations to function as an independent external test dataset. It is crucial to validate our model’s performance and general iz ability across different data sources. Since the official split is not provided, we follow [32] to randomly divide the dataset into train/val/test sets by 2400:292:576 (total: 3268 reports). The independent nature of the OpenI dataset allowed us to robustly assess our model’s capabilities and understand its practical applicability.
此外,我们还基于上述排除操作对OpenI数据集[31]进行了预处理,将其作为独立的外部测试数据集。验证模型在不同数据源上的性能和泛化能力至关重要。由于未提供官方划分标准,我们参照[32]将该数据集随机划分为训练集/验证集/测试集(2400:292:576,总计3268份报告)。OpenI数据集的独立性使我们能够稳健评估模型能力并理解其实际适用性。
3.2 Experimental setting
3.2 实验设置
In this study, we trained the current version of Radiology-GPT based on Alpaca-7B. The training was conducted on a server with 4 Nvidia A100 80GB GPUs. We utilized LoRA [33] to facilitate training. More importantly, we followed the LoRA approach because the small size and portable nature of LoRA weights eases model sharing and deployment.
在本研究中,我们基于Alpaca-7B训练了当前版本的Radiology-GPT。训练在一台配备4块Nvidia A100 80GB GPU的服务器上进行。我们采用LoRA [33]技术来辅助训练。更重要的是,选择LoRA方法是因为其权重文件体积小、便于移植,能简化模型共享与部署流程。
The batch size was set to 128 and the learning rate was fixed at 3e-4. For LoRA, we set lora_ r, the rank of the low-rank factorization, to 8, and lora_ alpha, the scaling factor for the rank, to 16. This was accompanied by a dropout rate of 0.05. The target modules for LoRA were set to "q_ proj" and "v_ proj". These modules refer to the query and value matrices in the self-attention mechanism of the transformer architecture [34, 33].
批量大小设置为128,学习率固定为3e-4。对于LoRA (Low-Rank Adaptation) ,我们将低秩分解的秩lora_ r设为8,秩的缩放因子lora_ alpha设为16,并配合0.05的丢弃率。LoRA的目标模块设置为"q_ proj"和"v_ proj",这些模块对应Transformer架构自注意力机制中的查询矩阵和值矩阵 [34, 33]。
A very similar training process will be applied to larger versions of Radiology-GPT based on larger base models.
将对基于更大基础模型的Radiology-GPT更大版本应用非常相似的训练过程。
3.3 Instruction tuning
3.3 指令微调 (Instruction tuning)
The next step involves instruction tuning our LLM on this radiology dataset. Our methodology is based on the principle of instruction-tuning [35, 27, 36]. The aim is to tune the model such that it can generate an "Impression" text given a "Findings" text as an instruction. The underlying language model learns the relationship between "Findings" and "Impression" from the dataset and hence, starts generating impressions in a similar manner when presented with new findings. Currently, we use Alpaca as the base model for this process. We will release versions of Radiology-GPT in various sizes in the near future. The model also learns domain-specific knowledge relevant to radiology during this training process.
下一步是在这个放射学数据集上对我们的LLM进行指令微调。我们的方法基于指令微调原则[35, 27, 36],目的是调整模型,使其能够在给定"发现"文本作为指令时生成"印象"文本。基础语言模型从数据集中学习"发现"和"印象"之间的关系,从而在面对新发现时以类似方式生成印象。目前,我们使用Alpaca作为此过程的基础模型。我们将在不久的将来发布不同规模的Radiology-GPT版本。在此训练过程中,模型还会学习与放射学相关的领域特定知识。

Figure 3: Comparisons of the LLMs based on Understand ability, Coherence, Relevance, Conciseness, and Clinical Utility.
图 3: 基于理解能力、连贯性、相关性、简洁性和临床实用性对大语言模型的比较。
To ensure our model learns effectively, we use a specific format for the instructions during training. We provide a short instruction to the "Findings" text: "Derive the impression from findings in the radiology report". The "Impression" text from the same report serves as the target output. This approach promotes learning by aligning the model with the task, thereby creating an instructionfollowing language model fine-tuned for radiology reports. Examples can be seen in Figure 2.
为确保模型有效学习,我们在训练过程中对指令采用特定格式。我们为"检查发现"文本提供简短指令:"从放射学报告中推导印象"。同一报告中的"印象"文本作为目标输出。这种方法通过将模型与任务对齐来促进学习,从而创建出针对放射学报告微调的指令遵循语言模型。示例如图2所示。
Through this methodology, Radiology-GPT is designed to capture the specific language patterns, terminologies, and logical reasoning required to interpret radiology reports effectively, thereby making it an efficient and reliable tool for aiding radiologists.
通过这种方法论,Radiology-GPT旨在捕捉解读放射学报告所需的特定语言模式、术语和逻辑推理能力,从而成为辅助放射科医师的高效可靠工具。
It might be valuable to use diverse instruction pairs beyond "Findings — $_ {\scriptscriptstyle{>}}$ Impression" in radiology. Currently, "Findings — $\scriptscriptstyle>$ Impression" is the most natural and clinically meaningful task to conduct instruction-tuning. We are actively engaging with radiologists to construct a variety of clinically meaningful instruction tuning pairs to further enhance Radiology-GPT.
在放射学领域,除了"检查所见 (Findings) -> 影像印象 (Impression)"指令对外,采用多样化的指令对可能具有重要价值。目前,"检查所见 -> 影像印象"仍是最自然且最具临床意义的指令微调任务。我们正积极与放射科医师合作,构建多种具有临床意义的指令微调对,以进一步提升Radiology-GPT的性能。
4 Evaluation
4 评估
One of the challenging aspects of using large language models (LLMs) in the medical field, particularly in radiology, is the determination of their effectiveness and reliability. Given the consequences of errors, it’s crucial to employ appropriate methods to evaluate and compare their outputs. To quantitatively evaluate the effectiveness of Radiology-GPT and other language models in radiology, we implemented a strategy to measure the understand ability, coherence, relevance, conciseness, clinical utility of generated responses.
在医疗领域,尤其是放射学中使用大语言模型(LLM)的一个挑战在于确定其有效性和可靠性。鉴于错误的后果,采用适当方法来评估和比较其输出至关重要。为定量评估Radiology-GPT及其他语言模型在放射学中的表现,我们实施了衡量生成回答的可理解性、连贯性、相关性、简洁性和临床实用性的策略。
It should be noted that even trained radiologists have different writing styles [37, 38]. There could be significant variations in how different radiologists interpret the same set of radiology findings [39, 40, 41]. Therefore, we believe it is not appropriate to use string-matching [42], BLEU [43], ROUGE [44], or other n-gram based methods to evaluate the generated radiology impressions. Instead, we develop a set of metrics that are directly relevant to clinical radiology practices.
需要注意的是,即使是训练有素的放射科医生也存在书写风格差异 [37, 38]。不同放射科医师对同一组影像学表现的解读可能存在显著差异 [39, 40, 41]。因此,我们认为使用字符串匹配 [42]、BLEU [43]、ROUGE [44] 或其他基于 n-gram 的方法来评估生成的放射学印象并不合适。相反,我们开发了一套与临床放射实践直接相关的评估指标。
We describe the five metrics below:
我们描述以下五个指标:
The experimental results yield insightful results regarding the performance of Radiology-GPT in comparison with other existing models. Figure 3 shows the results of the expert evaluation of LLMs. A panel of two expert radiologists assessed the capacity of these LLMs in generating appropriate impressions based on given findings, considering five key parameters: understand ability, coherence, relevance, conciseness, and clinical utility. Each metric was scored on a scale from 1 to 5. The radiologists independently assessed the quality of the LLMs’ responses using a set of 10 radiology reports randomly selected from the test set of the MIMIC-CXR radiology reports and the test set of the OpenI dataset pre processed by our in-house team. The assessments of the two radiologists were conducted independently, and the final scores for each metric were computed by averaging the scores from both radiologists, providing a balanced and comprehensive evaluation of each model’s performance.
实验结果表明,Radiology-GPT与其他现有模型相比具有显著性能优势。图3展示了大语言模型的专家评估结果。由两名放射科专家组成的评审小组基于五项关键参数(理解能力、连贯性、相关性、简洁性和临床实用性),评估了这些大语言模型根据给定检查结果生成恰当印象报告的能力,各项指标按1-5分制评分。专家们使用从MIMIC-CXR放射报告测试集和我们内部团队预处理的OpenI数据集测试集中随机选取的10份放射报告,独立评估了大语言模型回答的质量。两位放射科医生的评估独立进行,最终各项指标得分取两人评分的平均值,从而对每个模型的性能做出平衡全面的评价。
The performance of Radiology-GPT was found to be comparable to that of ChatGPT in terms of understand ability and slightly better in coherence. However, it lagged slightly behind in relevance, not due to a propensity to omit critical findings but rather, because it tended to produce shorter responses compared to ChatGPT. ChatGPT often generated lengthier responses that addressed nearly every point detailed in the findings (which sometimes can provide more context). Figure 4 shows the results of ChatGPT. This discrepancy could be attributed to the contrasting objectives of these models, with Radiology-GPT being designed to deliver succinct and focused outputs that
研究发现,Radiology-GPT在理解能力方面与ChatGPT表现相当,在连贯性上略胜一筹。但在相关性方面稍显不足,这并非由于遗漏关键发现,而是因其生成的回复通常比ChatGPT更为简短。ChatGPT往往会生成更详尽的回复,几乎涵盖报告中列出的所有细节(有时能提供更多背景信息)。图4展示了ChatGPT的结果。这种差异可能源于两款模型的设计目标不同:Radiology-GPT旨在提供简洁聚焦的输出,而...
ChatGPT
ChatGPT
Prompt: “Derive the impression from findings in the radiology report" Findings: PA and lateral views of the chest. The previously seen pericardia l and pleural effusions have resolved. There is no pneumothorax. There is no consolidation. The cardiac, media st in al, and hilar contours are normal.
提示:“根据放射学报告中的发现得出印象”
发现:胸部正位和侧位片。先前观察到的心包和胸腔积液已消退。无气胸。无实变。心脏、纵隔及肺门轮廓正常。
Based on the findings in the radiology report, the following impression can be derived:
根据放射学报告中的发现,可得出以下印象:
- Resolution of pericardia l and pleural eff us in s:Thereport statesthat the previously observed pericardia l and ple rale ff us in s haveresoed.Pericardia l effusion r fest the accumulation of fluid around the heart, and pleural effusion refers to the accumulation of fluid in the space between the lungs and the chest wall. The resolution of these effusions suggests that the fluid has been reabsorbed or drained.
- 心包积液和胸腔积液的消退:报告指出,先前观察到的心包积液和胸腔积液已消退。心包积液 (pericardial effusion) 是指心脏周围液体积聚,胸腔积液 (pleural effusion) 则指肺与胸壁之间空隙中的液体积聚。这些积液的消退表明液体已被重新吸收或排出。
- No pneumothorax: The report indicates that there is no evidence of pneumothorax, which refers to a collapsed lung caused by air accumulation in the chest cavity. The absence of pneumothorax suggests that the lungs are properly inflated without any air leakage into the chest cavity.
- 无气胸:报告显示无气胸迹象。气胸指胸腔内积气导致肺塌陷的情况。未出现气胸表明肺部充气正常,无气体漏入胸腔。
- Normal cardiac, media st in al, and hilar contours: he radiology report mentions that the cardiac, media st in al and hilar contours are normal. This indicates that there are no significant abnormalities or distortions in the structures within the chest, including the heart, central compartment (media st in um), and the hilar region where the blood vessels and bronchi enter the lungs.
- 正常的心脏、纵隔和肺门轮廓:放射学报告指出心脏、纵隔和肺门轮廓均正常。这表明胸部结构(包括心脏、中央区域(纵隔)以及血管和支气管进入肺部的肺门区)无明显异常或变形。
Based on these findings, the overall impression is that there is resolution of the pericardia l and pleural effusions, no evidence of pneumothorax, and normal cardiac media st in al,and hilar contours.These findings indicate a positive response to treatment or natural resolution of the previously identified abnormalities.However,it's important to consult with a healthcare professional for a comprehensive evaluation and interpretation of these findings in the context of the patient's clinical condition
根据这些发现,总体印象是心包积液和胸腔积液已消退,未见气胸证据,心脏纵隔和肺门轮廓正常。这些结果表明对治疗有积极反应或先前发现的异常已自然消退。然而,必须咨询医疗专业人员,结合患者临床情况对这些发现进行全面评估和解读。
Impression: Resolution of previously seen pericardia l and pleural effusions. No evidence of pneumothorax or consolidation. Normal cardiac, media st in al, and hilar contours.
印象:之前观察到的心包积液和胸腔积液已消退。未见气胸或实变征象。心脏、纵隔及肺门轮廓正常。

Figure 4: Results from ChatGPT.
图 4: ChatGPT 生成结果。
Figure 5: LLM comparison.
图 5: 大语言模型 (LLM) 对比。
Instruction: Given the following radiological findings, generate a corresponding impression
指令:根据以下放射学检查结果,生成相应印象
Input: PA and lateral views of the chest. The previously seen pericardia l and pleural effusions have resolved. There is no pneumothorax. There is no consolidation. The cardiac, media st in al, and hilar contours are normal.
胸部正位及侧位片。此前观察到的心包积液和胸腔积液已消退。未见气胸。未见实变。心脏、纵隔及肺门轮廓正常。
Output: The radiology report indicates that the pericardia l and pleural effusions have resolved, there is no pneumothorax, and the cardiac, media st in al, and hilar contours are normal.
放射报告显示心包和胸腔积液已消退,未见气胸,心脏、纵隔及肺门轮廓正常。

Figure 6: An example of deriving diagnostic impression from radiology findings.
图 6: 从放射学检查结果推导诊断印象的示例。
capture the essential aspects of radiological impressions, a critical quality appreciated in the medical field. Consequently, this led to Radiology-GPT scoring higher in both conciseness and clinical utility.
捕捉放射学报告的核心要点,这是医学领域高度重视的关键质量。因此,Radiology-GPT在简洁性和临床实用性方面均获得了更高评分。
In contrast, the other tested models, including StableLM-7B, Dolly-12B, and LLaMA-7B, were all outperformed by both Radiology-GPT and ChatGPT. Figure 5 shows the results of these models. Despite Dolly-12B possessing a larger model size than Radiology-GPT 7B, it could not match the performance of our domain-tuned model. The lack of instruction tuning on domain-specific data and tasks within the field of radiology severely affected the performance of both StableLM-7B and Dolly-12B.
相比之下,其他测试模型(包括StableLM-7B、Dolly-12B和LLaMA-7B)的表现均逊色于Radiology-GPT和ChatGPT。图5展示了这些模型的结果。尽管Dolly-12B的模型规模大于Radiology-GPT 7B,但其性能仍无法与我们针对特定领域调优的模型相匹敌。由于缺乏针对放射学领域特定数据和任务的指令调优,StableLM-7B和Dolly-12B的性能受到了严重影响。
LLaMA-7B, not having been instruction-tuned at all, struggled the most. It struggled to comprehend the given instructions and possesses insufficient domain-specific knowledge, leading to a markedly lower performance than other models. These findings underline the significant value that domainspecific tuning and instruction comprehension bring to the capabilities of LLMs in healthcare applications.
LLaMA-7B由于完全未经过指令调优,表现最为吃力。该模型难以理解给定指令,且缺乏足够的领域专业知识,导致其性能显著低于其他模型。这些发现凸显了领域特定调优和指令理解能力对大语言模型在医疗应用中的关键价值。
5 Discussion
5 讨论
The exploration of large language models (LLMs), notably a locally deployable version specialized for radiology—hereafter referred to as Radiology-GPT—in the domains of medicine, and more pointedly in radiology, presents an intriguing trajectory with numerous potential future courses. In this section, we discuss future directions that are relevant to applying LLMs to radiology.
探索大语言模型(LLM),特别是专用于放射学的本地可部署版本(下文称为Radiology-GPT)在医学领域,尤其是放射学中的应用,展现了一条引人入胜且充满多种未来可能性的发展路径。本节我们将讨论与大语言模型在放射学中应用相关的未来方向。
Question:
问题:
Answer the following questions: 1. What is tumor "Mass Effect"? 2. What is pneumothorax?
回答以下问题:
- 什么是肿瘤的"占位效应 (Mass Effect)"?
- 什么是气胸 (pneumothorax)?
8Answer
8Answer
- Tumor Mass Effect is a medical condition in which a tumor grows in the chest and puts pressure on the lungs, heart, and other organs in the area. 2. Pneumothorax is a medical condition in which air gets trapped in the space between the lung and the chest wall, causing the lung to collapse.
- 肿瘤占位效应 (Tumor Mass Effect) 是一种医学病症,指肿瘤在胸腔内生长并对肺部、心脏及该区域其他器官产生压迫。
- 气胸 (Pneumothorax) 是一种医学病症,指空气进入肺与胸壁之间的腔隙,导致肺塌陷。

Figure 7: A conversational example.
图 7: 对话示例。
5.1 Clinical Decision Support
5.1 临床决策支持
Radiology-GPT can provide advice on various tasks, such as determining the best imaging modalities for specific clinical situations or assisting in the preparation of radiology reports. Figure 6 gives a perfect example. It’s important to emphasize that the current generation of LLMs, including Radiology-GPT, ChatGPT, and GPT-4, are designed to augment professional judgment, not replace the roles of medical experts.
Radiology-GPT可针对多项任务提供建议,例如确定特定临床场景的最佳成像方式,或协助准备放射学报告。图6: 展示了一个完美示例。需要重点强调的是,当前一代大语言模型(包括Radiology-GPT、ChatGPT和GPT-4)旨在增强专业判断力,而非取代医学专家的角色。
Future development should focus on amplifying LLMs’ proficiency in deciphering intricate and multifaceted medical terminology, thus amplifying their relevance and reliability in clinical decisionmaking. The integration of Radiology-GPT with other AI modalities, such as computer vision models proficient in image interpretation, may well represent a significant progression.
未来发展的重点应放在提升大语言模型解析复杂多元医学术语的能力上,从而增强其在临床决策中的相关性和可靠性。将Radiology-GPT与计算机视觉模型等擅长图像解析的其他AI技术相结合,很可能代表着一项重大进展。
5.2 Augmentation of Patient Communication
5.2 患者沟通的增强
Another potential course is employing Radiology-GPT for the augmentation of patient communication. By transmuting complex medical vernacular and generating intelligible reports, this model could promote a higher level of understanding of medical information among patients. Figure 7 shows a conversational example. Nonetheless, its current capacity to produce precise and trustworthy patient communications without stringent supervision remains restricted.
另一个潜在方向是运用Radiology-GPT增强医患沟通。该模型通过转化复杂医学术语并生成易懂报告,可提升患者对医疗信息的理解水平。图7展示了对话示例。然而,当前在缺乏严格监督的情况下,其生成精准可靠患者沟通内容的能力仍有限制。
Future endeavours should aim to refine Radiology-GPT to reduce inconsistencies and inaccuracies in its outputs and generate higher quality responses that are grounded in domain knowledge. This might be achievable through the fusion of enhanced domain-specific data and superior training strategies. Concurrently, it is crucial to engineer methods for effective validation and quality control of the generated communications prior to their delivery to patients.
未来的工作应致力于完善Radiology-GPT,以减少输出中的不一致性和错误,并生成基于领域知识的更高质量响应。这或许可以通过融合增强的领域特定数据和更优的训练策略来实现。同时,在将生成的沟通内容交付给患者之前,设计有效的验证和质量控制方法至关重要。
5.3 AGI Expert Panel
5.3 通用人工智能 (AGI) 专家小组
In a vision of future healthcare scenarios, we envisage the deployment of localized domain-specific LLMs across different medical specialties that would coalesce into a virtual expert panel. Similar to interdisciplinary board panels of physicians who convene to discuss complex cases, such as those in oncology, these AI models could interactively aid in comprehensive decision-making.
在未来医疗场景的愿景中,我们设想在不同医学专科领域部署本地化的专用大语言模型(LLM),这些模型将组成虚拟专家会诊小组。类似于肿瘤学等多学科医师组成的病例讨论委员会,这些AI模型能够通过交互方式协助完成综合诊疗决策。
In the context of oncology, for instance, a patient’s case could be deliberated upon by a RadiologyGPT, a Pathology-GPT, a Oncology-GPT, and a Surgery-GPT, each providing insights from their respective domains. The Radiology-GPT might provide insights about the tumor size, its location, and spread based on the analysis of radiological scans. The Pathology-GPT could provide details on the cancer subtype and the cellular characteristics based on the biopsy report. The Oncology-GPT and Surgery-GPT might provide insights about potential treatment protocols and surgical options. All of these, combined together, could help in the creation of a comprehensive and personalized care plan.
例如在肿瘤学领域,患者的病例可以由RadiologyGPT(放射科GPT)、Pathology-GPT(病理学GPT)、Oncology-GPT(肿瘤学GPT)和Surgery-GPT(外科GPT)共同讨论,各自提供专业领域的见解。Radiology-GPT可能通过分析放射扫描结果,提供关于肿瘤大小、位置和扩散情况的判断;Pathology-GPT能根据活检报告详细说明癌症亚型和细胞特征;Oncology-GPT与Surgery-GPT则可分别提出潜在治疗方案和手术选择建议。这些专业意见的综合将有助于制定全面且个性化的诊疗计划。
While this concept presents significant potential to enhance the thoroughness of clinical decisionmaking, it also poses several challenges and questions. For instance, the coordination and interaction between these specialized LLMs would require a sophisticated communication framework. Moreover, there are potential risks associated with the aggregation of these insights, especially in cases of conflicting recommendations. Consequently, it would be critical to ensure human oversight to validate and reconcile the conclusions drawn by this AI panel.
尽管这一概念为提升临床决策的全面性展现出巨大潜力,但也带来若干挑战与问题。例如,这些专业化大语言模型间的协调互动需要构建复杂的通信框架。此外,在出现矛盾建议时,这些洞察的汇总可能存在潜在风险。因此,必须确保人工监督机制来验证和协调该AI专家组的结论。
Future research should focus on developing protocols for AI interaction, determining the most effective methods of integrating their insights, and establishing rigorous checks and balances to ensure the validity and reliability of the collective outputs. Equally significant, the development of such an integrated AI panel system should be guided by robust ethical frameworks to ensure that the autonomy and privacy of the patient are not compromised.
未来研究应聚焦于制定AI交互协议,确定整合其见解的最有效方法,并建立严格的制衡机制以确保集体输出的有效性和可靠性。同样重要的是,此类集成AI面板系统的开发应以健全的伦理框架为指导,确保患者的自主权和隐私不受侵害。
Such an AI expert panel system represents a groundbreaking direction in the evolution of LLMs in healthcare, promising to harness the specialized knowledge of various medical fields to provide an unprecedented level of support in complex medical decision-making.
这种AI专家小组系统代表了大语言模型在医疗领域发展的突破性方向,有望整合各医学专科知识,为复杂医疗决策提供前所未有的支持。
In summary, it is crucial to embrace the future of Radiology-GPT in medicine with both a degree of optimism and an attitude of circumspection. Active engagement with large language models, rigorous validation of their outputs, and ethical considerations, particularly relating to patient privacy and data usage, should guide their future evolution and deployment. This approach will ensure that the utilization of Radiology-GPT in healthcare will contribute positively to the progression of patient care, academic research, and the advancement of the medical profession as a whole.
总之,在医学领域接纳Radiology-GPT的未来时,保持适度乐观与审慎态度至关重要。应通过积极运用大语言模型、严格验证其输出结果、以及重视患者隐私和数据使用等伦理考量,来指导其未来的发展与部署。这种策略将确保Radiology-GPT在医疗保健领域的应用,能够为患者护理的进步、学术研究及整个医学行业的发展作出积极贡献。
5.4 Privacy and ethics
5.4 隐私与伦理
Privacy and ethics are significant considerations when integrating AI technologies in health care. Radiology-GPT, given its nature as a locally trained and locally deployed model, upholds patient privacy. This is an area where local models show their superiority over commercially developed LLMs such as ChatGPT or PaLM-2 [10]. By operating within the confines of the hosting hospital’s servers, Radiology-GPT mitigates the risk of patient Protected Health Information (PHI) leaks, making the approach compliant with the Health Insurance Portability and Accountability Act (HIPAA) regulations [11].
在医疗领域整合AI技术时,隐私与伦理是需要重点考量的因素。作为一款本地训练、本地部署的模型,Radiology-GPT能有效维护患者隐私——这正是本地模型相较于ChatGPT或PaLM-2[10]等商业化大语言模型的优势所在。通过在医院自有服务器上运行,Radiology-GPT降低了受保护健康信息(PHI)泄露的风险,使其符合《健康保险流通与责任法案》(HIPAA)[11]的监管要求。
However, it is crucial to acknowledge the potential ethical risks that come with the deployment of such models. One of these risks is the possibility of dispensing inaccurate or ungrounded medical advice [45]. The application of LLMs in a medical setting could inadvertently misinform or misdirect care, especially in the absence of appropriate regulation and control mechanisms. Thus, it is essential to ensure rigorous oversight and regular auditing of these models’ performance.
然而,必须认识到部署此类模型可能带来的伦理风险。其中一个风险是可能提供不准确或无依据的医疗建议 [45]。在缺乏适当监管和控制机制的情况下,大语言模型在医疗环境中的应用可能会无意中误导信息或诊疗方向。因此,必须确保对这些模型的性能进行严格监督和定期审计。
Additionally, bias in training data is a persistent concern in AI, with repercussions potentially amplified in the healthcare context [46]. If the data used to train Radiology-GPT reflects systematic bias in patient treatment or diagnosis, the model could propagate and even exacerbate these biases. Consequently, fairness, accountability, and transparency in model training are paramount to avoid such pitfalls.
此外,训练数据中的偏差是人工智能领域长期存在的问题,在医疗健康领域可能造成更严重的后果[46]。如果用于训练Radiology-GPT的数据反映了患者治疗或诊断中的系统性偏差,该模型可能会传播甚至放大这些偏差。因此,在模型训练中确保公平性、可问责性和透明度至关重要,以避免此类问题。
It is also necessary to consider the risk of exploitation of these models to generate counterfeit content or misinformation [47]. The impressive language generation capabilities of LLMs can be manipulated to create misleading or entirely false health information, with potential consequences for public health.
还需考虑利用这些模型生成伪造内容或虚假信息的风险 [47]。大语言模型强大的语言生成能力可能被操纵来制造误导性或完全虚假的健康信息,对公共健康造成潜在影响。
5.5 Expanding Instruction Tuning Data in Radiology
5.5 扩展放射学领域的指令调优数据
In our current work, we utilized "Findings — $\cdot>$ Impression" pairs as our primary instruction tuning data. While this forms a crucial part of clinical radiology, the potential utility of Radiology-GPT can be greatly enhanced by constructing and incorporating more diverse and clinically meaningful instruction pairs.
在我们当前的工作中,我们采用了"检查结果 — $\cdot>$ 影像报告"配对作为主要的指令微调数据。虽然这是临床放射学的关键组成部分,但通过构建和整合更多样化且具有临床意义的指令配对,可以显著提升Radiology-GPT的潜在效用。
In our work, "Findings — > Impression" serves as a natural starting point, mimicking the real-life task of radiologists interpreting findings to provide an impression. However, the field of radiology is rich with a multitude of other tasks and workflows that could potentially be translated into instruction pairs for Radiology-GPT.
在我们的工作中,"发现 → 印象"作为自然起点,模拟了放射科医师解读检查结果并给出印象的真实工作流程。然而放射学领域还存在大量其他潜在可转化为Radiology-GPT指令对的任务和工作流。
For instance, instructions could be formulated to generate more patient-friendly explanations of radiology reports or to write up recommendations for further diagnostic tests or treatment based on specific findings. Another potential use case could be to instruct Radiology-GPT to generate a list of differential diagnoses based on presented findings, or to summarize the latest research on specific imaging findings or diseases.
例如,可以制定指令来生成对放射学报告更友好的患者解释,或根据特定发现撰写进一步诊断测试或治疗的建议。另一个潜在用例可能是指导 Radiology-GPT 根据呈现的发现生成鉴别诊断列表,或总结关于特定影像学表现或疾病的最新研究。
Moreover, as radiology practices involve a considerable amount of protocol decision-making [48, 49], Radiology-GPT could also be tuned to suggest optimal imaging protocols based on a given clinical scenario.
此外,由于放射科实践涉及大量检查方案决策 [48, 49],Radiology-GPT也可被调整为根据特定临床场景建议最优成像方案。
Expanding the range of instruction pairs can potentially enhance Radiology-GPT’s versatility, making it a more holistic tool that can address different aspects of the radiology workflow.
扩展指令对的范围有可能提升Radiology-GPT的多功能性,使其成为能应对放射工作流程不同环节的更全面工具。
To achieve this, active engagement with practicing radiologists is crucial. By collaborating with experts in the field, we can identify a broad range of real-world tasks that Radiology-GPT can be trained to perform, ultimately ensuring that the model’s capabilities are aligned with the needs of clinical practice.
为实现这一目标,与执业放射科医师的积极互动至关重要。通过与领域专家合作,我们可以确定一系列现实任务来训练 Radiology-GPT,最终确保模型能力与临床实践需求相匹配。
6 Conclusion
6 结论
In conclusion, we have developed Radiology-GPT, a domain-specific large language model that addresses the critical need for a locally deployable AI solution within the field of radiology. By leveraging the strength of large language models and tailoring it to radiological contexts, RadiologyGPT offers a promising leap forward, displaying superior performance over existing baselines in our evaluations. The evaluation metrics we proposed, which encapsulate a combination of qualitative and quantitative measures, offer a robust framework for assessing the effectiveness of this and similar models within healthcare settings.
总之,我们开发了Radiology-GPT这一领域专用大语言模型,解决了放射学领域对本地可部署AI解决方案的迫切需求。通过发挥大语言模型优势并针对放射学场景定制,RadiologyGPT展现出显著进步,在我们的评估中表现优于现有基线。我们提出的评估指标结合了定性与定量方法,为医疗场景中此类模型的效果评估提供了可靠框架。
Moreover, Radiology-GPT opens up exciting avenues for future applications. It provides a foundation for the incorporation of multimodal data, including radiological images, further enhancing its potential contributions to the field. Its localized nature also paves the way for wider applications in other medical specialties, stimulating advancements in healthcare AI that respect privacy regulations. Our study stands as a testament to the potential of specialized AI in medicine, offering both immediate benefits and laying the groundwork for future innovation.
此外,Radiology-GPT为未来应用开辟了令人兴奋的途径。它为整合包括放射影像在内的多模态数据奠定了基础,进一步提升了其在该领域的潜在贡献。其本地化特性也为其他医学专科的广泛应用铺平了道路,推动了符合隐私法规的医疗AI发展。我们的研究证明了专业化AI在医学中的潜力,既提供了即时效益,又为未来创新奠定了基础。
