[Paper Translation] Radiology-Llama2: Best-in-Class Large Language Model for Radiology


原文地址:https://arxiv.org/pdf/2309.06419


Radiology-Llama2: Best-in-Class Large Language Model for Radiology


Zhengliang Liu $^{*1}$ , Yiwei Li $^{*1}$ , Peng Shu $^{*1}$ , Aoxiao Zhong $^2$ , Longtao Yang $^{3}$ , Chao Ju $^{3}$ , Zihao Wu $^{1}$ , Chong Ma $^{4}$ , Jie Luo $^5$ , Cheng Chen $^5$ , Sekeun Kim $^{5}$ , Jiang Hu $^5$ , Haixing Dai $^{1}$ , Lin Zhao $^{1}$ , Dajiang Zhu $^6$ , Jun Liu $^{3}$ , Wei Liu $^{7}$ , Dinggang Shen $^{8,9,10}$ , Tianming Liu $^{1}$ , Quanzheng Li $^5$ , and Xiang Li $^5$


Abstract


This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 architecture and further trained on a large dataset of radiology reports to generate coherent and clinically useful impressions from radiological findings. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance compared to other generative language models, with a ROUGE-1 score of 0.4834 on MIMIC-CXR and 0.4185 on OpenI. Additional assessments by radiology experts highlight the model’s strengths in understandability, coherence, relevance, conciseness, and clinical utility. The work illustrates the potential of localized language models designed and tuned for specialized domains like radiology. When properly evaluated and deployed, such models can transform fields like radiology by automating rote tasks and enhancing human expertise.

1 Introduction


Transformer-based large language models (LLMs) such as ChatGPT and GPT-4 have shown impressive capabilities in natural language processing [1, 2, 3, 4]. Developments in transformer-based NLP models have also spurred advances in developing and applying transformer-based models in computer vision [5, 6, 7, 8] and other modalities [9, 10, 11, 12, 13, 14, 15, 12, 16, 17, 18, 5, 19]. Since November 2022, inspired by the versatile capabilities and wide popularity of ChatGPT, LLMs have been applied in clinical studies [20], pharmacy [21], radiology [22, 23, 24, 25], Alzheimer’s disease [26, 27], agriculture [28] and brain science research [29].


Figure 1: The overall framework of Radiology-Llama2.


However, their application in specialized domains like healthcare has been limited.


First, localized large language models are a must for real-world healthcare, since hospitals cannot share data with or upload data to commercial models such as ChatGPT or GPT-4 due to privacy regulations [13, 30, 3].

In addition, LLMs trained on the general domain, such as ChatGPT [31], GPT-4 [32] and PaLM 2 [33], lack medical knowledge in specialized domains such as radiology, so it is necessary to design a model that is properly trained on clinically meaningful domain data.

Moreover, our Radiology-Llama2 closely imitates the style of radiologists, whereas models like ChatGPT generate comprehensive but Wikipedia-like responses, unlike the concise and simple language style of real radiologists that facilitates quick information exchange. Finally, this work opens the door to personalized radiological assistants tailored to the style of individual physicians [34].

This work addresses this gap through Radiology-Llama2, an LLM tailored for radiology through instruction tuning to generate radiology impressions from findings. Evaluations show it surpasses general LLMs in coherence, conciseness and clinical utility of generated impressions.


• State-of-the-Art Performance: Outperforms all other language models in deriving clinical impressions [2], setting a new benchmark on the MIMIC-CXR and OpenI datasets.

• Flexibility and Dynamism: Unlike its BERT-based counterparts [35, 36], Radiology-Llama2 is not tied to a specific input structure, allowing for a broader range of inputs and adaptability to different tasks within radiology, including complex reasoning.

• Clinical Usability with Conversational Capabilities: Generative LLMs offer inherent conversational functionality [7], enabling them to provide contextual insights and responses in a human-like manner. This makes Radiology-Llama2 particularly useful for medical professionals in a clinical setting, enhancing both diagnosis and reporting.

2 Related work


2.1 Large Language Models (LLMs)


Recent developments in NLP are marked by the emergence of LLMs such as GPT-3 [37], GPT-4 [32], PaLM [38], and PaLM-2 [33]. Contrasting the earlier pre-training and fine-tuning approach observed in BERT [35], GPT [39], GPT-2 [40], and their variants [41, 15, 36], these new LLMs exhibit few-shot and zero-shot learning capabilities using in-context learning. Furthermore, open-source models like LLaMA [42] and Bloom [43] have entered the scene, promoting broader accessibility.


There’s also an increasing interest in instruction-tuned models such as Alpaca [44], StableLM [45], and Dolly [46].


2.2 Domain-Specific Language Models (DSLMs)


DSLMs, such as AgriBERT [47], are tailored to specific domains, aiming at optimal performance in related tasks. Specifically, AgriBERT is trained on agricultural texts, making it suitable for tasks in agriculture. SciEdBERT [17] is designed for the educational sector and focuses on middle school chemistry and physics, offering insights into evaluating students’ responses. ClinicalRadioBERT [48], in the healthcare sector, specializes in radiation oncology and was trained on clinical notes and related literature. These DSLMs highlight the potential and adaptability of specialized models across various sectors [17, 49, 50, 15].

3 Methodology


Radiology-Llama2 is trained on a large radiology dataset using instruction tuning to generate radiology impressions from findings. Evaluations by radiologists show it surpasses general LLMs in coherence, conciseness and clinical utility of generated impressions.


3.1 Datasets


3.1.1 MIMIC-CXR Dataset


MIMIC-CXR is a large chest radiograph dataset consisting of 227,835 imaging studies of 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016 [51]. There are 377,110 available images in the dataset, where each imaging study contains one or more images (typically a frontal view and a lateral view). The dataset also includes the corresponding free-text radiology reports and has been de-identified to satisfy the Safe Harbor requirements of the US Health Insurance Portability and Accountability Act of 1996 (HIPAA). The MIMIC-CXR dataset has been widely used in computer vision, natural language processing, decision support, and other areas.

Prompt: “Derive the impression from findings in the radiology report”


Findings: The lungs are hyperexpanded. Heart size normal. No mass or focal opacities seen. Stable degenerative changes of the thoracic spine.

Radiology-LLAMA2


Impression: No evidence of pneumonia.


LLAMA2


Impression: Based on the findings in the radiology report, the impression is likely that the patient has a respiratory condition, such as chronic obstructive pulmonary disease (COPD) or pneumonia, which has caused the hyperexpansion of the lungs. The normal size of the heart and the absence of any mass or focal opacities suggest that there are no significant cardiovascular or pulmonary abnormalities. The degenerative changes in the thoracic spine are likely related to aging or wear and tear on the spine, rather than any underlying respiratory or cardiovascular condition.

Radiology-GPT


Impression: The radiology report indicates that the lungs are enlarged and the heart is of normal size, with no masses or focal opacities seen. There are also stable degenerative changes in the thoracic spine.


ChatGPT


Impression: Based on the findings in the radiology report, the following impression can be derived: 1. Lungs hyperexpanded: The report indicates that the lungs are hyperexpanded. This means that the lungs are inflated more than normal, which can be a result of various conditions such as chronic obstructive pulmonary disease (COPD) or emphysema. Hyperexpansion of the lungs can lead to reduced lung function…

StableLM


Impression: Based on the given medical record, it appears as if there has been some degree of lung congestion present with increased expiratory efforts recorded during physical examination as well as reported upon through imaging studies such as chest X-rays taken at this timepoint. Further testing may include performing additional diagnostic tests including spirometry for further evaluation of lung function impairment versus any potential underlying conditions which have not yet been identified. It's important to note though that due care should always be exercised when evaluating patients' respiratory status and seeking treatment accordingly.


Dolly


Impression: The lungs appear to be hyperinflated which indicates fluid accumulation within the alveolar spaces. There does not appear to be any significant consolidation or opacity observed within the chest x-ray, and the heart sizes look normal. Based on these observations, it seems likely that the cause of respiratory distress noted by the patient's physician was caused by some other condition rather than acute pneumonia.

LLAMA


Impression: The lungs are hyperexpanded. Heart size normal. No mass or focal opacities seen. This page was last edited 14:28, 24 August 2016 by wikidoc user WikiBot. Based on work by Alexandra Almonacid and Kalsang Dolma and wikidoc user GeraldChi.

Alpaca


Impression: The findings in the radiology report suggest that the lungs are enlarged and the heart is of normal size. There are no masses or focal opacities seen, and the degenerative changes of the thoracic spine are stable.


Figure 2: Performance of different LLMs on the Radiology task example.


3.1.2 OpenI Dataset


The OpenI dataset is a publicly available dataset aimed at making clinical documents available for secondary use in research and education [52]. The dataset collects 8,121 images from hospitals’ picture archiving systems, accompanied by 3,996 corresponding radiology reports from the Indiana Network. Manual coding has been added to the radiology reports in order to increase the relevancy of the retrieved clinical documents. Similar to the MIMIC-CXR dataset, OpenI applies manual verification after an automated method to achieve de-identification.

3.2 Instruction Tuning


3.2.1 Principle and Objective


Instruction tuning [53, 54] is a foundational component of the Radiology-Llama2 framework. Instruction tuning addresses the fundamental disconnect between the traditional training objectives of LLMs and the user-specific goals of instruction following. This technique involves additional training using pairs of human-specified instructions and corresponding desired outputs. It serves to align the model with task-specific user objectives, enhance model controllability, and allow for rapid domain-specific adaptation, all while maintaining computational efficiency.

To bolster learning, instructions are formatted specifically. For instance, the "Findings" text is supplied with a succinct instruction, such as "Derive the impression from findings in the radiology report", while the "Impression" text from the same report serves as the target output. This approach calibrates the model to the desired task, thus yielding an instruction-following language model optimized for radiology reports.
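As a minimal sketch of the formatting described above (not the authors' actual preprocessing code; the Alpaca-style field names "instruction", "input", and "output" are an assumption), each report can be turned into an instruction-tuning pair like this:

```python
def build_instruction_pair(findings: str, impression: str) -> dict:
    """Format one radiology report as an instruction-tuning example.

    The instruction text mirrors the prompt quoted in the paper; the
    dictionary field names follow the common Alpaca-style convention
    and are assumptions here, not the authors' confirmed schema.
    """
    return {
        "instruction": "Derive the impression from findings in the radiology report",
        "input": findings,    # the "Findings" section of the report
        "output": impression, # the "Impression" section, used as the target
    }

# Example pair built from the illustrative report shown later in the paper.
pair = build_instruction_pair(
    "The lungs are hyperexpanded. Heart size normal.",
    "No evidence of pneumonia.",
)
```

During fine-tuning, each such pair is rendered into a single prompt/response sequence and the model is trained to produce the "output" text given the instruction and the findings.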

3.2.2 Domain-Specific Knowledge Acquisition


Through training on domain-specific data, the model is also adept at assimilating domain-specific knowledge that is quintessential to radiology. Consequently, Radiology-Llama2 is proficient in capturing language patterns, terminologies, and logical reasoning essential for interpreting radiology reports.


The initial instruction tuning is centered on the "Findings" to "Impression" conversion, which holds significant clinical value. Recognizing the potential of diverse instruction pairs in radiology, ongoing engagements with radiologists are aimed at formulating a diverse set of clinically pertinent instruction tuning pairs to augment the capabilities of Radiology-Llama2.


3.3 Experimental Setting


The Radiology-Llama2 model employs an advanced training regimen. For reference, when training Radiology-GPT with the Llama2-7b-chat [42] base model, the training was facilitated by Low-Rank Adaptation (LoRA) [55]. The choice of LoRA was motivated by its compact size and portability, which are conducive to model sharing and deployment.

The experimental setting for the training comprised the following configurations:


• Batch size: 128
• Learning rate: $3\times10^{-4}$
• Weight decay: 0.01

LoRA parameters:


The target modules for LoRA were set to "q_proj" and "v_proj", corresponding to the query and value matrices in the self-attention mechanism of the transformer architecture. The training was conducted on a server equipped with 4 Nvidia A100 80GB GPUs.
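To illustrate the mechanism (a toy numerical sketch with made-up dimensions, not the actual PEFT training code): LoRA freezes the targeted projection weight $W$ and learns a low-rank update $AB$, so the adapted projection computes $x\,(W + \frac{\alpha}{r} AB)$, and only the small factors are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16              # hidden size, LoRA rank, scaling (toy values)

W = rng.normal(size=(d, d))         # frozen base weight (e.g. q_proj or v_proj)
A = rng.normal(size=(d, r)) * 0.01  # trainable down-projection, random init
B = np.zeros((r, d))                # trainable up-projection, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Adapted projection: x @ W plus the low-rank update, scaled by alpha/r."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(3, d))
# With B initialized to zero, the adapter starts as a no-op: output equals x @ W.
assert np.allclose(lora_forward(x), x @ W)
```

Only A and B (here $2 \cdot d \cdot r = 32$ numbers) receive gradient updates, while W's $d^2 = 64$ entries stay frozen; in Radiology-Llama2 this update is applied to the "q_proj" and "v_proj" matrices of each attention layer.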

4 Results


The present study evaluated the performance of various large language models on two key radiology datasets: MIMIC-CXR and OpenI. The assessment employed ROUGE-1, ROUGE-2, and ROUGE-L as the primary metrics, given their widespread acceptance for evaluating the quality of generated text.
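For intuition, ROUGE-N measures n-gram overlap and ROUGE-L the longest common subsequence between a generated and a reference impression. A bare-bones F1 sketch (the paper almost certainly used a standard library implementation, so treat this as illustrative only):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N as an F1 score over n-gram overlap."""
    c, r = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((c & r).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def lcs_len(a, b):
    """Longest common subsequence length (basis of ROUGE-L), via classic DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

cand = "no evidence of pneumonia".split()
ref = "no evidence of acute pneumonia".split()
score = rouge_n(cand, ref, n=1)  # precision 4/4, recall 4/5 -> F1 = 8/9
```

Higher-order variants (ROUGE-2) and ROUGE-L penalize word-order differences that unigram overlap alone would miss, which is why all three are reported below.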

4.1 Quantitative Assessment


Radiology-Llama2 significantly outperforms all comparison models across all ROUGE [56] metrics (ROUGE-1, ROUGE-2, and ROUGE-L) on both the MIMIC-CXR and OpenI datasets. Results can be seen in Table 1 and Table 2.

For the MIMIC-CXR dataset, Radiology-Llama2 achieves scores of 0.4834 in ROUGE-1, 0.324 in ROUGE-2, and 0.4427 in ROUGE-L. These scores are markedly higher than those of the second-best


Prompt: “Derive the impression from findings in the radiology report” Findings: The nasogastric tube is in adequate position and there is a resolution of the gastric distention. There is still mild bibasilar atelectasis. There is no pneumothorax and no pleural effusion. The cardiac and mediastinal contours are unchanged.

Radiology-LLAMA2


Impression: No evidence of pneumonia. The nasogastric tube is in adequate position and there is a resolution of the gastric distention.

LLAMA2


Impression: Based on the findings in the radiology report, the impression is likely that the patient's condition has improved since the last evaluation. The resolution of gastric distention and the absence of any new pleural effusion or pneumothorax suggest that the patient's respiratory status has stabilized. However, the persistence of mild bibasilar atelectasis suggests that the patient may still be experiencing some respiratory compromise. Overall, the impression is likely that the patient's condition has improved but still requires close monitoring and further evaluation…

Radiology-GPT


Impression: The nasogastric tube is in the correct position and the patient's stomach has returned to its normal size, however there is still mild lung collapse in the back of the patient's lungs. There is no sign of pneumothorax or pleural effusion and the cardiac and mediastinal contours appear unchanged.

ChatGPT


Impression: …Based on these findings, the overall impression is that the nasogastric tube is appropriately positioned and has relieved gastric distention. There is mild bibasilar atelectasis, but no evidence of pneumothorax, pleural effusion, or significant changes in the cardiac and mediastinal contour. It's important to consult with a healthcare professional for a comprehensive evaluation and interpretation of these findings in the context of the patient's clinical condition.

StableLM


Impression: Based on these findings alone it would seem that your patient has a reasonably good prognosis with minimal complications related to their current condition. It might be beneficial for you as well to conduct further investigations such as reviewing imaging studies or performing additional tests if recommended by your healthcare provider.


Dolly


Impression: The nasogastric tube is in adequate position resulting in improved feeding tolerance. The patient does continue to have bibasilar atelectasis, but this appears to be resolving. No pneumothorax or pleural effusions were present. Lastly, the heart and mediastinum appear normal.

LLAMA


Impression: Taking the impression from the x-ray findings, evaluate the patient for rib fracture, fracture of the rib cage, and fracture of the sternum. If there is a fracture of the sternum, perform a cardiothoracic consultation and ensure that a chest tube has been placed…

Alpaca


Impression: The nasogastric tube is in adequate position and there is a resolution of the gastric distention. The cardiac and mediastinal contours are unchanged.

Figure 3: Performance of different LLMs on the Radiology task example.


performing model, Anthropic Claude2, which manages 0.3177 in ROUGE-1 and 0.153 in ROUGE-2. This demonstrates that Radiology-Llama2 not only captures a higher proportion of overlapping unigrams between the generated and reference summaries but also maintains this superiority in capturing bigrams and in sustaining longer sequences of content overlap.

Similarly, on the OpenI dataset, Radiology-Llama2 sustains its exemplary performance, recording scores of 0.4185 in ROUGE-1, 0.2569 in ROUGE-2, and 0.4087 in ROUGE-L. In comparison, the second-best model, Anthropic Claude2, scores 0.2372 in ROUGE-1 and 0.1259 in ROUGE-2. The substantial gap between the two models across all metrics underscores Radiology-Llama2’s robustness and generalizability across datasets.

On the lower end of the performance spectrum, Baichuan-7B exhibits exceptionally low scores on both datasets. For instance, its ROUGE-2 score is a meager 0.0057 on the MIMIC-CXR dataset, and its ROUGE-L score is only 0.0029. This emphasizes the limitations of such models in capturing even the basic elements of content overlap. It is also noteworthy that Radiology-Llama2 maintains its superiority not only in terms of single-term overlap but also in capturing a longer chain of content, as reflected in its ROUGE-L scores.


This paper provides two specific examples, shown in Figure 2 and Figure 3. In the first example, the finding is “The lungs are hyperexpanded. Heart size normal. No mass or focal opacities seen. Stable degenerative changes of the thoracic spine.” Each LLM is asked to derive the impression from the findings in the radiology report. LLAMA2, ChatGPT and Alpaca all understand the content of the radiology report; however, their answers are too verbose and fail to capture the most important points, which makes their impressions hard to follow and difficult to use clinically. StableLM, Dolly and LLAMA basically understand the intention of the text, but their answers are unsatisfactory: they are not only redundant but also contain a lot of irrelevant information. Radiology-GPT, which was fine-tuned on a domain-specific dataset, gives a markedly more capable and concise answer, but the gap from the impression given by a radiologist is still not small. In comparison, the excellent performance of Radiology-LLAMA2 is clear: its answer is not only accurate and concise but also closest in report style to that of a radiologist. This intuitively demonstrates the performance of Radiology-LLAMA2 and its strong potential for clinical application.


Figure 4: Radiologists’ evaluation on the diagnostic results of different LLMs based on radiology reports.


In summary, Radiology-Llama2 consistently demonstrates superior performance in generating clinically relevant and coherent radiology reports, as evidenced by its high ROUGE scores across multiple datasets.


          Llama2   Radiology-Llama2   Radiology-GPT   ChatGPT   StableLM   Dolly    Llama    Alpaca
rouge-1   0.201    0.4726             0.3123          0.0774    0.0734     0.1866   0.0873   0.3464
rouge-2   0.0969   0.2948             0.1758          0.0432    0.0085     0.0717   0.0426   0.2071
rouge-L   0.1578   0.3757             0.2518          0.0702    0.0516     0.1572   0.0776   0.2935

Table 1: Rouge scores of the radiologists’ evaluations on 10 MIMIC-CXR and OpenI examples.

4.2 Expert Evaluation


To supplement the quantitative evaluations, we conducted an expert-based assessment of the models. We randomly selected 10 records each from the MIMIC-CXR and OpenI datasets and had two experienced radiologists manually evaluate the generated radiology impressions based on five key criteria: Understandability, Coherence, Relevance, Conciseness, and Clinical Utility.

                          MIMIC-CXR                      OpenI
                          Rouge-1   Rouge-2   Rouge-L    Rouge-1   Rouge-2   Rouge-L
Radiology-Llama2          0.4834    0.324     0.4427     0.4185    0.2569    0.4087
Llama2 [42]               0.1726    0.0711    0.1322     0.0848    0.0205    0.0712
Ziya-LLaMA-13B-v1 [57]    0.2121    0.0968    0.17       0.1101    0.0316    0.0926
ChatGPT [31]              0.2048    0.0996    0.1702     0.1203    0.037     0.1052
GPT-4 [32]                0.1995    0.0858    0.1575     0.1171    0.0343    0.0975
ChatGLM2-6B [58]          0.2042    0.0964    0.1605     0.1094    0.0331    0.0909
MOSS-MOON-003-sft [59]    0.2203    0.0914    0.1757     0.1402    0.0341    0.1241
PaLM2 [33]                0.2749    0.1442    0.2281     0.1386    0.0477    0.1194
Sensenova [60]            0.1136    0.0304    0.0809     0.0634    0.0106    0.051
Anthropic Claude2 [61]    0.3177    0.153     0.256      0.2372    0.1259    0.2193
BayLing-7B [62]           0.2149    0.107     0.1747     0.1252    0.0389    0.1044
XrayGLM [63]              0.1104    0.0468    0.1211     0.0622    0.0183    0.0599
Bianque v2 [64]           0.0294    0.0072    0.0281     0.0227    0.0036    0.0222
XrayPULSE [65]            0.0592    0.0061    0.0398     0.0293    0.0011    0.0239
DoctorGLM [66]            0.1853    0.0916    0.153      0.0996    0.0329    0.0861
BenTsao [67]              0.1319    0.0618    0.1126     0.0804    0.0368    0.0711
Baichuan-7B [68]          0.2379    0.0057    0.0029     0.003     0.0009    0.0028

Table 2: Rouge scores on MIMIC-CXR and OpenI datasets.


In addition, we evaluated the Rouge scores of a few select models on these 10 samples to cross-verify with the experts’ evaluation. Please see Table 1 for more details.

Overall, Radiology-Llama2 consistently demonstrated superior performance across all five criteria, affirming its status as a highly effective tool for generating radiology impressions. While other models like Radiology-GPT and ChatGPT showed competence in certain areas, they were unable to match the all-around excellence of Radiology-Llama2. Models like Llama and StableLM displayed significant limitations, performing poorly across multiple criteria.


5 Discussion


5.1 Application in Clinical Radiology Diagnosis


The high Rouge scores attained by Radiology-LLAMA2 suggest that this language model has the potential for significant impact in clinical radiology diagnosis. Its ability to swiftly generate coherent and clinically relevant reports could be transformative, particularly in busy radiology departments where timely and accurate reporting is of the essence. By automating certain aspects of the reporting process, the model serves as a valuable assistive tool for radiologists, allowing them to focus more on complex cases that require nuanced human expertise.


5.2 Need for Diverse Training Data


While Radiology-LLAMA2 performs impressively on generating findings and impressions, its utility could be further broadened by incorporating more diverse forms of training data. For instance, it could be trained on instructional text for medical procedures, summaries of patient histories, or physician’s notes on differential diagnoses. By diversifying the data sources, the model would be better equipped to assist in various facets of radiological practice, such as recommending further diagnostic tests or suggesting possible treatment paths.


5.3 Multimodality: Adding Image Capabilities

The addition of image analysis capabilities would elevate Radiology-LLAMA2 from a text-generation model to a truly multimodal diagnostic tool. Future iterations could integrate machine learning algorithms for image recognition, enabling the model to make direct observations from X-rays, MRIs, or CT scans. This would potentially create a more holistic diagnostic process where textual and visual data are analyzed in tandem for more accurate and comprehensive diagnoses.


5.4 Conversational Assistant to Radiologists


Beyond report generation, Radiology-LLAMA2 could be developed into a conversational assistant that helps radiologists in real-time. This would enable a more dynamic interaction, where the model could assist in tasks ranging from quick data retrieval to offering second opinions on diagnoses. Such a system could act as a "second pair of eyes," providing immediate feedback and thus serving as a valuable safeguard against diagnostic errors.


6 Conclusion


In conclusion, Radiology-Llama2 represents an important advance in applying LLMs to healthcare. With continued research into model design and evaluation, such specialized LLMs can enable breakthroughs in medical AI.


References


[38] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.


for disentangling neural auditory encoding. In Findings of the Association for Computational

