[Paper Translation] Radiology-Llama2: Best-in-Class Large Language Model for Radiology


原文地址:https://arxiv.org/pdf/2309.06419


Radiology-Llama2: Best-in-Class Large Language Model for Radiology


Zhengliang Liu $^{*1}$ , Yiwei Li $^{*1}$ , Peng Shu $^{*1}$ , Aoxiao Zhong $^2$ , Longtao Yang $^{3}$ , Chao Ju $^{3}$ , Zihao Wu $^{1}$ , Chong Ma $^{4}$ , Jie Luo $^5$ , Cheng Chen $^5$ , Sekeun Kim $^{5}$ , Jiang Hu $^5$ , Haixing Dai $^{1}$ , Lin Zhao $^{1}$ , Dajiang Zhu $^6$ , Jun Liu $^{3}$ , Wei Liu $^{7}$ , Dinggang Shen $^{8,9,10}$ , Tianming Liu $^{1}$ , Quanzheng Li $^5$ , and Xiang Li $^5$


Abstract


This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 architecture and further trained on a large dataset of radiology reports to generate coherent and clinically useful impressions from radiological findings. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance compared to other generative language models, with a ROUGE-1 score of 0.4834 on MIMIC-CXR and 0.4185 on OpenI. Additional assessments by radiology experts highlight the model’s strengths in understandability, coherence, relevance, conciseness, and clinical utility. The work illustrates the potential of localized language models designed and tuned for specialized domains like radiology. When properly evaluated and deployed, such models can transform fields like radiology by automating rote tasks and enhancing human expertise.

1 Introduction


Transformer-based large language models (LLMs) such as ChatGPT and GPT-4 have shown impressive capabilities in natural language processing [1, 2, 3, 4]. Developments in transformer-based NLP models have also spurred advances in developing and applying transformer-based models in computer vision [5, 6, 7, 8] and other modalities [9, 10, 11, 12, 13, 14, 15, 12, 16, 17, 18, 5, 19]. Since November 2022, inspired by the versatile capabilities and wide popularity of ChatGPT, LLMs have been applied in clinical studies [20], pharmacy [21], radiology [22, 23, 24, 25], Alzheimer’s disease [26, 27], agriculture [28] and brain science research [29].


Figure 1: The overall framework of Radiology-Llama2.


However, their application in specialized domains like healthcare has been limited.


First, localized large language models are a must for real-world healthcare, since hospitals cannot share data with or upload data to commercial models such as ChatGPT or GPT-4 due to privacy regulations [13, 30, 3].

In addition, LLMs trained on the general domain, such as ChatGPT [31], GPT-4 [32] and PaLM 2 [33], lack medical knowledge in specialized domains such as radiology, so it is necessary to design a model that is properly trained on clinically meaningful domain data.

Moreover, our Radiology-Llama2 closely imitates the style of radiologists, whereas models like ChatGPT generate comprehensive but Wikipedia-like responses, unlike the concise and simple language style of real radiologists that facilitates quick information exchange. Finally, this work opens the door to personalized radiological assistants tailored to the style of individual physicians [34].

This work addresses this gap through Radiology-Llama2, an LLM tailored for radiology through instruction tuning to generate radiology impressions from findings. Evaluations show it surpasses general LLMs in coherence, conciseness and clinical utility of generated impressions.


• State-of-the-Art Performance: Outperforms all other language models in deriving clinical impressions [2], setting a new benchmark on the MIMIC-CXR and OpenI datasets.

• Flexibility and Dynamism: Unlike its BERT-based counterparts [35, 36], Radiology-Llama2 is not tied to a specific input structure, allowing for a broader range of inputs and adaptability to different tasks within radiology, including complex reasoning.

• Clinical Usability with Conversational Capabilities: Generative LLMs offer inherent conversational functionality [7], enabling them to provide contextual insights and responses in a human-like manner. This makes Radiology-Llama2 particularly useful for medical professionals in a clinical setting, enhancing both diagnosis and reporting.

2 Related work


2.1 Large Language Models (LLMs)


Recent developments in NLP are marked by the emergence of LLMs such as GPT-3 [37], GPT-4 [32], PaLM [38], and PaLM-2 [33]. Contrasting the earlier pre-training and fine-tuning approach observed in BERT [35], GPT [39], GPT-2 [40], and their variants [41, 15, 36], these new LLMs exhibit few-shot and zero-shot learning capabilities using in-context learning. Furthermore, open-source models like LLaMA [42] and Bloom [43] have entered the scene, promoting broader accessibility.


There’s also an increasing interest in instruction-tuned models such as Alpaca [44], StableLM [45], and Dolly [46].


2.2 Domain-Specific Language Models (DSLMs)


DSLMs, such as AgriBERT [47], are tailored to specific domains, aiming at optimal performance in related tasks. Specifically, AgriBERT is trained on agricultural texts, making it suitable for tasks in agriculture. SciEdBERT [17] is designed for the educational sector and focuses on middle school chemistry and physics, offering insights into evaluating students’ responses. ClinicalRadioBERT [48], in the healthcare sector, specializes in radiation oncology and was trained on clinical notes and related literature. These DSLMs highlight the potential and adaptability of specialized models across various sectors [17, 49, 50, 15].

3 Methodology


Radiology-Llama2 is trained on a large radiology dataset using instruction tuning to generate radiology impressions from findings. Evaluations by radiologists show it surpasses general LLMs in coherence, conciseness and clinical utility of generated impressions.


3.1 Datasets


3.1.1 MIMIC-CXR Dataset


MIMIC-CXR is a large chest radiograph dataset consisting of 227,835 imaging studies of 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016 [51]. There are 377,110 available images in the dataset, where each imaging study contains one or more images (typically a frontal view and a lateral view). The dataset also includes the corresponding free-text radiology reports and has been de-identified to satisfy the Safe Harbor requirements of the US Health Insurance Portability and Accountability Act of 1996 (HIPAA). The MIMIC-CXR dataset has been widely used in computer vision, natural language processing, decision support, and other areas.

Prompt: “Derive the impression from findings in the radiology report”


Findings: The lungs are hyperexpanded. Heart size normal. No mass or focal opacities seen. Stable degenerative changes of the thoracic spine.

Radiology-LLAMA2


Impression: No evidence of pneumonia.


LLAMA2


Impression: Based on the findings in the radiology report, the impression is likely that the patient has a respiratory condition, such as chronic obstructive pulmonary disease (COPD) or pneumonia, which has caused the hyperexpansion of the lungs. The normal size of the heart and the absence of any mass or focal opacities suggest that there are no significant cardiovascular or pulmonary abnormalities. The degenerative changes in the thoracic spine are likely related to aging or wear and tear on the spine, rather than any underlying respiratory or cardiovascular condition.

Radiology-GPT


Impression: The radiology report indicates that the lungs are enlarged and the heart is of normal size, with no masses or focal opacities seen. There are also stable degenerative changes in the thoracic spine.


ChatGPT


Impression: Based on the findings in the radiology report, the following impression can be derived: 1. Lungs hyperexpanded: The report indicates that the lungs are hyperexpanded. This means that the lungs are inflated more than normal, which can be a result of various conditions such as chronic obstructive pulmonary disease (COPD) or emphysema. Hyperexpansion of the lungs can lead to reduced lung function…

StableLM


Impression: Based on the given medical record, it appears as if there has been some degree of lung congestion present with increased expiratory efforts recorded during physical examination as well as reported upon through imaging studies such as chest X-rays taken at this timepoint. Further testing may include performing additional diagnostic tests including spirometry for further evaluation of lung function impairment versus any potential underlying conditions which have not yet been identified. It's important to note though that due care should always be exercised when evaluating patients' respiratory status and seeking treatment accordingly.


Dolly


Impression: The lungs appear to be hyperinflated which indicates fluid accumulation within the alveolar spaces. There does not appear to be any significant consolidation or opacity observed within the chest x-ray, and the heart sizes look normal. Based on these observations, it seems likely that the cause of respiratory distress noted by the patient's physician was caused by some other condition rather than acute pneumonia.

LLAMA


Impression: The lungs are hyperexpanded. Heart size normal. No mass or focal opacities seen. This page was last edited 14:28, 24 August 2016 by wikidoc user WikiBot. Based on work by Alexandra Almonacid and Kalsang Dolma and wikidoc user GeraldChi.

Alpaca


Impression: The findings in the radiology report suggest that the lungs are enlarged and the heart is of normal size. There are no masses or focal opacities seen, and the degenerative changes of the thoracic spine are stable.


Figure 2: Performance of different LLMs on the Radiology task example.


3.1.2 OpenI Dataset


The OpenI dataset is a publicly available dataset aimed at making clinical documents available for secondary use in research and education [52]. The dataset collects 8,121 images from hospitals’ picture archiving systems, accompanied by 3,996 corresponding radiology reports from the Indiana Network. Manual coding has been added to the radiology reports in order to increase the relevancy of the retrieved clinical documents. Similar to the MIMIC-CXR dataset, OpenI applies manual verification after an automated method to achieve de-identification.

3.2 Instruction Tuning


3.2.1 Principle and Objective


Instruction tuning [53, 54] is a foundational component of the Radiology-Llama2 framework. Instruction tuning addresses the fundamental disconnect between the traditional training objectives of LLMs and the user-specific goals of instruction following. This technique involves additional training using pairs of human-specified instructions and corresponding desired outputs. It serves to align the model with task-specific user objectives, enhance model controllability, and allow for rapid domain-specific adaptation, all while maintaining computational efficiency.

To bolster learning, instructions are formatted specifically. For instance, the "Findings" text is supplied with a succinct instruction, such as "Derive the impression from findings in the radiology report", while the "Impression" text from the same report serves as the target output. This approach calibrates the model to the desired task, thus yielding an instruction-following language model optimized for radiology reports.
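As a minimal sketch of the formatting described above (not the authors' actual preprocessing code; the Alpaca-style field names "instruction", "input", and "output" are an assumption), each report can be turned into an instruction-tuning pair like this:

```python
def build_instruction_pair(findings: str, impression: str) -> dict:
    """Format one radiology report as an instruction-tuning example.

    The instruction text mirrors the prompt quoted in the paper; the
    dictionary field names follow the common Alpaca-style convention
    and are assumptions here, not the authors' confirmed schema.
    """
    return {
        "instruction": "Derive the impression from findings in the radiology report",
        "input": findings,    # the "Findings" section of the report
        "output": impression, # the "Impression" section, used as the target
    }

# Example pair built from the illustrative report shown later in the paper.
pair = build_instruction_pair(
    "The lungs are hyperexpanded. Heart size normal.",
    "No evidence of pneumonia.",
)
```

During fine-tuning, each such pair is rendered into a single prompt/response sequence and the model is trained to produce the "output" text given the instruction and the findings.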

3.2.2 Domain-Specific Knowledge Acquisition


Through training on domain-specific data, the model is also adept at assimilating domain-specific knowledge that is quintessential to radiology. Consequently, Radiology-Llama2 is proficient in capturing language patterns, terminologies, and logical reasoning essential for interpreting radiology reports.


The initial instruction tuning is centered on the "Findings" to "Impression" conversion, which holds significant clinical value. Recognizing the potential of diverse instruction pairs in radiology, ongoing engagements with radiologists are aimed at formulating a diverse set of clinically pertinent instruction tuning pairs to augment the capabilities of Radiology-Llama2.


3.3 Experimental Setting


The Radiology-Llama2 model employs an advanced training regimen. For reference, when training Radiology-GPT with the Llama2-7b-chat [42] base model, the training was facilitated by Low-Rank Adaptation (LoRA) [55]. The choice of LoRA was motivated by its compact size and portability, which are conducive to model sharing and deployment.

The experimental setting for the training comprised the following configurations:


• Batch size: 128
• Learning rate: $3\times10^{-4}$
• Weight decay: 0.01

LoRA parameters:


The target modules for LoRA were set to "q_proj" and "v_proj", corresponding to the query and value matrices in the self-attention mechanism of the transformer architecture. The training was conducted on a server equipped with 4 Nvidia A100 80GB GPUs.
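To illustrate the mechanism (a toy numerical sketch with made-up dimensions, not the actual PEFT training code): LoRA freezes the targeted projection weight $W$ and learns a low-rank update $AB$, so the adapted projection computes $x\,(W + \frac{\alpha}{r} AB)$, and only the small factors are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16              # hidden size, LoRA rank, scaling (toy values)

W = rng.normal(size=(d, d))         # frozen base weight (e.g. q_proj or v_proj)
A = rng.normal(size=(d, r)) * 0.01  # trainable down-projection, random init
B = np.zeros((r, d))                # trainable up-projection, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Adapted projection: x @ W plus the low-rank update, scaled by alpha/r."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(3, d))
# With B initialized to zero, the adapter starts as a no-op: output equals x @ W.
assert np.allclose(lora_forward(x), x @ W)
```

Only A and B (here $2 \cdot d \cdot r = 32$ numbers) receive gradient updates, while W's $d^2 = 64$ entries stay frozen; in Radiology-Llama2 this update is applied to the "q_proj" and "v_proj" matrices of each attention layer.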

4 Results


The present study evaluated the performance of various large language models on two key radiology datasets: MIMIC-CXR and OpenI. The assessment employed ROUGE-1, ROUGE-2, and ROUGE-L as the primary metrics, given their widespread acceptance for evaluating the quality of generated text.
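For intuition, ROUGE-N measures n-gram overlap and ROUGE-L the longest common subsequence between a generated and a reference impression. A bare-bones F1 sketch (the paper almost certainly used a standard library implementation, so treat this as illustrative only):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N as an F1 score over n-gram overlap."""
    c, r = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((c & r).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def lcs_len(a, b):
    """Longest common subsequence length (basis of ROUGE-L), via classic DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

cand = "no evidence of pneumonia".split()
ref = "no evidence of acute pneumonia".split()
score = rouge_n(cand, ref, n=1)  # precision 4/4, recall 4/5 -> F1 = 8/9
```

Higher-order variants (ROUGE-2) and ROUGE-L penalize word-order differences that unigram overlap alone would miss, which is why all three are reported below.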

4.1 Quantitative Assessment


Radiology-Llama2 significantly outperforms all comparison models across all ROUGE [56] metrics (ROUGE-1, ROUGE-2, and ROUGE-L) on both the MIMIC-CXR and OpenI datasets. Results can be seen in Table 1 and Table 2.

For the MIMIC-CXR dataset, Radiology-Llama2 achieves scores of 0.4834 in ROUGE-1, 0.324 in ROUGE-2, and 0.4427 in ROUGE-L. These scores are markedly higher than those of the second-best


Prompt: “Derive the impression from findings in the radiology report” Findings: The nasogastric tube is in adequate position and there is a resolution of the gastric distention. There is still mild bibasilar atelectasis. There is no pneumothorax and no pleural effusion. The cardiac and mediastinal contours are unchanged.

Radiology-LLAMA2


Impression: No evidence of pneumonia. The nasogastric tube is in adequate position and there is a resolution of the gastric distention.

LLAMA2


Impression: Based on the findings in the radiology report, the impression is likely that the patient's condition has improved since the last evaluation. The resolution of gastric distention and the absence of any new pleural effusion or pneumothorax suggest that the patient's respiratory status has stabilized. However, the persistence of mild bibasilar atelectasis suggests that the patient may still be experiencing some respiratory compromise. Overall, the impression is likely that the patient's condition has improved but still requires close monitoring and further evaluation…

Radiology-GPT


Impression: The nasogastric tube is in the correct position and the patient's stomach has returned to its normal size, however there is still mild lung collapse in the back of the patient's lungs. There is no sign of pneumothorax or pleural effusion and the cardiac and mediastinal contours appear unchanged.

ChatGPT


Impression: …Based on these findings, the overall impression is that the nasogastric tube is appropriately positioned and has relieved gastric distention. There is mild bibasilar atelectasis, but no evidence of pneumothorax, pleural effusion, or significant changes in the cardiac and mediastinal contour. It's important to consult with a healthcare professional for a comprehensive evaluation and interpretation of these findings in the context of the patient's clinical condition.

StableLM


Impression: Based on these findings alone it would seem that your patient has a reasonably good prognosis with minimal complications related to their current condition. It might be beneficial for you as well to conduct further investigations such as reviewing imaging studies or performing additional tests if recommended by your healthcare provider.


Dolly


Impression: The nasogastric tube is in adequate position resulting in improved feeding tolerance. The patient does continue to have bibasilar atelectasis, but this appears to be resolving. No pneumothorax or pleural effusions were present. Lastly, the heart and mediastinum appear normal.

LLAMA


Impression: Taking the impression from the x-ray findings, evaluate the patient for rib fracture, fracture of the rib cage, and fracture of the sternum. If there is a fracture of the sternum, perform a cardiothoracic consultation and ensure that a chest tube has been placed…

Alpaca


Impression: The nasogastric tube is in adequate position and there is a resolution of the gastric distention. The cardiac and mediastinal contours are unchanged.

Figure 3: Performance of different LLMs on the Radiology task example.


performing model, Anthropic Claude2, which manages 0.3177 in ROUGE-1 and 0.153 in ROUGE-2. This demonstrates that Radiology-Llama2 not only captures a higher proportion of overlapping unigrams between the generated and reference summaries but also maintains this superiority in capturing bigrams and in sustaining longer sequences of content overlap.

Similarly, on the OpenI dataset, Radiology-Llama2 sustains its exemplary performance, recording scores of 0.4185 in ROUGE-1, 0.2569 in ROUGE-2, and 0.4087 in ROUGE-L. In comparison, the second-best model, Anthropic Claude2, scores 0.2372 in ROUGE-1 and 0.1259 in ROUGE-2. The substantial gap between the two models across all metrics underscores Radiology-Llama2’s robustness and generalizability across datasets.

On the lower end of the performance spectrum, Baichuan-7B exhibits exceptionally low scores on both datasets. For instance, its ROUGE-2 score is a meager 0.0057 on the MIMIC-CXR dataset, and its ROUGE-L score is only 0.0029. This emphasizes the limitations of such models in capturing even the basic elements of content overlap. It is also noteworthy that Radiology-Llama2 maintains its superiority not only in terms of single-term overlap but also in capturing a longer chain of content, as reflected in its ROUGE-L scores.


This paper provides two specific examples, shown in Figure 2 and Figure 3. In the first example, the finding is “The lungs are hyperexpanded. Heart size normal. No mass or focal opacities seen. Stable degenerative changes of the thoracic spine.” Each LLM is asked to derive the impression from the findings in the radiology report. LLAMA2, ChatGPT and Alpaca all understand the content of the radiology report; however, their answers are too verbose and fail to capture the most important points, which makes their impressions hard to follow and difficult to use clinically. StableLM, Dolly and LLAMA basically understand the intention of the text, but their answers are unsatisfactory: they are not only redundant but also contain a lot of irrelevant information. Radiology-GPT, which was fine-tuned on a domain-specific dataset, gives a markedly more capable and concise answer, but the gap from the impression given by a radiologist is still not small. In comparison, the excellent performance of Radiology-LLAMA2 is clear: its answer is not only accurate and concise but also closest in report style to that of a radiologist. This intuitively demonstrates the performance of Radiology-LLAMA2 and its strong potential for clinical application.


Figure 4: Radiologists’ evaluation on the diagnostic results of different LLMs based on radiology reports.


In summary, Radiology-Llama2 consistently demonstrates superior performance in generating clinically relevant and coherent radiology reports, as evidenced by its high ROUGE scores across multiple datasets.


          Llama2   Radiology-Llama2   Radiology-GPT   ChatGPT   StableLM   Dolly    Llama    Alpaca
rouge-1   0.201    0.4726             0.3123          0.0774    0.0734     0.1866   0.0873   0.3464
rouge-2   0.0969   0.2948             0.1758          0.0432    0.0085     0.0717   0.0426   0.2071
rouge-L   0.1578   0.3757             0.2518          0.0702    0.0516     0.1572   0.0776   0.2935

Table 1: Rouge scores of the radiologists’ evaluations on 10 MIMIC-CXR and OpenI examples.

4.2 Expert Evaluation


To supplement the quantitative evaluations, we conducted an expert-based assessment of the models. We randomly selected 10 records each from the MIMIC-CXR and OpenI datasets and had two experienced radiologists manually evaluate the generated radiology impressions based on five key criteria: Understandability, Coherence, Relevance, Conciseness, and Clinical Utility.

                          MIMIC-CXR                      OpenI
                          Rouge-1   Rouge-2   Rouge-L    Rouge-1   Rouge-2   Rouge-L
Radiology-Llama2          0.4834    0.324     0.4427     0.4185    0.2569    0.4087
Llama2 [42]               0.1726    0.0711    0.1322     0.0848    0.0205    0.0712
Ziya-LLaMA-13B-v1 [57]    0.2121    0.0968    0.17       0.1101    0.0316    0.0926
ChatGPT [31]              0.2048    0.0996    0.1702     0.1203    0.037     0.1052
GPT-4 [32]                0.1995    0.0858    0.1575     0.1171    0.0343    0.0975
ChatGLM2-6B [58]          0.2042    0.0964    0.1605     0.1094    0.0331    0.0909
MOSS-MOON-003-sft [59]    0.2203    0.0914    0.1757     0.1402    0.0341    0.1241
PaLM2 [33]                0.2749    0.1442    0.2281     0.1386    0.0477    0.1194
Sensenova [60]            0.1136    0.0304    0.0809     0.0634    0.0106    0.051
Anthropic Claude2 [61]    0.3177    0.153     0.256      0.2372    0.1259    0.2193
BayLing-7B [62]           0.2149    0.107     0.1747     0.1252    0.0389    0.1044
XrayGLM [63]              0.1104    0.0468    0.1211     0.0622    0.0183    0.0599
Bianque v2 [64]           0.0294    0.0072    0.0281     0.0227    0.0036    0.0222
XrayPULSE [65]            0.0592    0.0061    0.0398     0.0293    0.0011    0.0239
DoctorGLM [66]            0.1853    0.0916    0.153      0.0996    0.0329    0.0861
BenTsao [67]              0.1319    0.0618    0.1126     0.0804    0.0368    0.0711
Baichuan-7B [68]          0.2379    0.0057    0.0029     0.003     0.0009    0.0028

Table 2: Rouge scores on MIMIC-CXR and OpenI datasets.


In addition, we evaluated the Rouge scores of a few select models on these 10 samples to cross-verify with the experts’ evaluation. Please see Table 1 for more details.

Overall, Radiology-Llama2 consistently demonstrated superior performance across all five criteria, affirming its status as a highly effective tool for generating radiology impressions. While other models like Radiology-GPT and ChatGPT showed competence in certain areas, they were unable to match the all-around excellence of Radiology-Llama2. Models like Llama and StableLM displayed significant limitations, performing poorly across multiple criteria.


5 Discussion


5.1 Application in Clinical Radiology Diagnosis


The high Rouge scores attained by Radiology-LLAMA2 suggest that this language model has the potential for significant impact in clinical radiology diagnosis. Its ability to swiftly generate coherent and clinically relevant reports could be transformative, particularly in busy radiology departments where timely and accurate reporting is of the essence. By automating certain aspects of the reporting process, the model serves as a valuable assistive tool for radiologists, allowing them to focus more on complex cases that require nuanced human expertise.


5.2 Need for Diverse Training Data


While Radiology-LLAMA2 performs impressively on generating findings and impressions, its utility could be further broadened by incorporating more diverse forms of training data. For instance, it could be trained on instructional text for medical procedures, summaries of patient histories, or physician’s notes on differential diagnoses. By diversifying the data sources, the model would be better equipped to assist in various facets of radiological practice, such as recommending further diagnostic tests or suggesting possible treatment paths.


5.3 Multimodality: Adding Image Capabilities

The addition of image analysis capabilities would elevate Radiology-LLAMA2 from a text-generation model to a truly multimodal diagnostic tool. Future iterations could integrate machine learning algorithms for image recognition, enabling the model to make direct observations from X-rays, MRIs, or CT scans. This would potentially create a more holistic diagnostic process where textual and visual data are analyzed in tandem for more accurate and comprehensive diagnoses.


5.4 Conversational Assistant to Radiologists


Beyond report generation, Radiology-LLAMA2 could be developed into a conversational assistant that helps radiologists in real-time. This would enable a more dynamic interaction, where the model could assist in tasks ranging from quick data retrieval to offering second opinions on diagnoses. Such a system could act as a "second pair of eyes," providing immediate feedback and thus serving as a valuable safeguard against diagnostic errors.


6 Conclusion


In conclusion, Radiology-Llama2 represents an important advance in applying LLMs to healthcare. With continued research into model design and evaluation, such specialized LLMs can enable breakthroughs in medical AI.


References


[38] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.


for disentangling neural auditory encoding. In Findings of the Association for Computational

