A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics
Kai He a, Rui Mao b, Qika Lin a, Yucheng Ruan a, Xiang Lan a, Mengling Feng a,∗, Erik Cambria b
a National University of Singapore, 119077, Singapore b Nanyang Technological University, 639798, Singapore
ARTICLE INFO
ABSTRACT
The utilization of large language models (LLMs) for Healthcare has generated both excitement and concern due to their ability to effectively respond to free-text queries with certain professional knowledge. This survey outlines the capabilities of currently developed Healthcare LLMs and explicates their development process, providing an overview of the road map from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications, highlighting both their strengths and limitations. Secondly, we compare the previous PLMs with the latest LLMs, and summarize related Healthcare training data, learning methods, and usage. Finally, we investigate the unique concerns associated with deploying LLMs, particularly regarding fairness, accountability, transparency, and ethics. In addition, we support researchers by compiling a collection of open-source resources. In summary, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a move from model-centered methodologies to data-centered methodologies. We conclude that the biggest obstacles to using LLMs in Healthcare are fairness, accountability, transparency, and ethics.
1. Introduction
Recently, Large Language Models (LLMs) have emerged as a driving force in AI due to their impressive abilities in understanding, generating, and reasoning. The integration of LLMs into Healthcare represents a significant advancement in the application of AI toward improving clinical outcomes, conserving resources, and enhancing patient care. Healthcare researchers face persistent challenges such as diagnosing rare diseases, interpreting complex patient narratives, and planning personalized treatments. The advanced language processing capabilities of LLMs directly address these needs, offering more precise diagnostics and tailored treatment options. For example, Med-PaLM 2 [1] demonstrates expert-level accuracy on the US Medical Licensing Examination (USMLE). In addition, more general models such as GPT-4, GPT-4o, and the Llama series also demonstrate superior performance in a variety of healthcare-related tasks. These advancements expand LLM applications in healthcare while improving patient outcomes through greater accuracy and efficiency.
Initially, Pretrained Language Models (PLMs) such as BERT [2] and RoBERTa [3] were developed for general NLP tasks and later adapted for healthcare applications. For simpler, less complex cases, PLMs offer advantages over LLMs in terms of simplicity and efficiency. However, their use in healthcare was limited because they typically operated as single-task systems, lacking the capability to interact dynamically with complex medical data [4].
Then, the development of LLMs like GPT-3 represents a transformative evolution from PLMs to LLMs, as illustrated in Fig. 1. With over 100 billion parameters, GPT-3 demonstrates exceptional understanding and generation capabilities, which significantly enhance its functionality across various applications, including Healthcare [6]. These capabilities allow LLMs to process and analyze a broader array of data types, such as patient records, clinical notes, and research papers, to identify patterns and suggest potential diagnoses that might be overlooked by human clinicians [7]. Additionally, the integration of LLMs into Healthcare is further supported by their enhanced explainability and adaptability compared to PLMs. The introduction of Chain-of-Thought (CoT) processing in newer LLMs contributes to a more transparent AI decision-making process. This transparency is crucial in Healthcare settings, where understanding the rationale behind AI-generated decisions can foster greater trust and reliability among medical professionals employing AI-powered tools [6].

Fig. 1. The development road map from PLMs to LLMs. GPT-3 [5] marks a significant milestone in the transition from PLMs to LLMs, signaling the beginning of a new era in both the general and Healthcare fields.
Besides the aforementioned general abilities, many studies have tailored LLMs to address specific healthcare application tasks, marking a significant trend in this field. Understanding this trend is crucial for further advancing and diversifying healthcare applications. For instance, given that the healthcare field inherently involves multimodal data, some studies [8–10] have explored LLMs’ capabilities to understand and analyze diverse medical images. Additionally, models like HuatuoGPT [11] demonstrate active inquiry capabilities, allowing for the extraction of more potential medical information. Other disease-specific LLMs, such as OphGLM [12] for ophthalmology and SoulChat [13] for mental health, highlight the versatility of LLMs in addressing targeted medical needs. Beyond these examples, the potential of LLMs in healthcare remains vast and largely untapped. Investing in the development of effective, ethical, and accountable LLMs is not only essential but also holds immense promise for practical and transformative benefits in healthcare.
This paper aims to inform readers about the latest developments in the field and offer comprehensive insights to those interested in using or developing healthcare LLMs. It covers various healthcare applications and provides a detailed summary of the underlying technology. We aim to provide insights into how different technologies affect different Healthcare-related tasks. Furthermore, as the capabilities of LLMs continue to improve, we contend that the performance-related challenges of applying AI in healthcare are diminishing. Consequently, issues of fairness, accountability, transparency, and ethics are becoming more significant impediments to practical implementation. For this reason, we discuss these four critical issues in the context of employing LLMs and emphasize their importance.
Several surveys [7,14–16] have specifically examined the applications of large language models (LLMs) in medical and healthcare domains, emphasizing their potential benefits and limitations. However, these works lack in-depth technological analysis and fail to address critical issues such as accountability and ethics. Other surveys [17,18] include discussions of technological aspects but primarily focus on general LLM developments and evaluations, offering limited insights into their adaptation and application in healthcare settings. Some studies have a narrower focus: for instance, the study [19] concentrates solely on testing healthcare-specific LLMs, while [20] is limited to their applications in psychotherapy. In addition, an earlier survey [21] focused on Healthcare PLMs rather than LLMs. In contrast, we provide a brief introduction to Healthcare PLMs as background and then delve into the details of Healthcare LLMs. Our comprehensive analysis is anticipated to guide medical researchers in making informed choices when selecting LLMs suitable for their specific needs. The organizational framework of this paper is shown in Fig. 2. Our contributions can be summarized as follows:
• We propose a comprehensive survey of LLMs in Healthcare, outlining an evolution road map from PLMs to LLMs and updating readers on the latest advancements in the field.
• We compile a detailed list of publicly available data, training costs, and task performances for Healthcare LLMs, which is useful for developers and users of private Healthcare LLMs.
• We explore key non-technical aspects of LLMs in Healthcare, such as fairness, accountability, transparency, and ethics, which are vital for advancing Healthcare AI applications.
2. What can LLMs do for healthcare? From fundamental tasks to advanced applications
Numerous endeavors have been made to apply PLMs or LLMs to Healthcare. In the early stages, studies primarily focused on fundamental tasks, due to the challenges of accessing diverse medical datasets, the complexity of the medical domain, and the limitations of model capabilities. Based on LLMs, the concept of Artificial General Intelligence (AGI) for Healthcare has been proposed, which has led to more practical and advanced applications in various aspects of the Healthcare field, as shown in Fig. 3. In this section, we analyze in detail what LLMs can do for Healthcare, and compare the strengths and weaknesses of LLMs and PLMs on different tasks to highlight the development from PLMs to LLMs.
2.1. NER and RE for healthcare
The initial step toward unlocking valuable information in unstructured Healthcare text data mainly involves Named Entity Recognition (NER) and Relation Extraction (RE). These two tasks are the main routes to Information Extraction (IE), which provides fundamental information for a range of other Healthcare applications, such as medical entity normalization and coreference [22], medical knowledge base and knowledge graph construction [23], and entity-enhanced dialogue [24]. For example, by employing NER and RE, the Healthcare knowledge databases DrugBank [25] and the Unified Medical Language System (UMLS) were constructed, which facilitate various applications in intelligent Healthcare.
In the early stages of research on NER with PLMs, a significant portion of studies focused on sequence labeling tasks. To accomplish this, PLM-based approaches were employed to generate contextualized representations for individual tokens. For RE tasks, the extracted entity pairs' representations were typically fed into a classifier to determine the existence of relations between the given entities. In the era of LLMs, NER and RE have been extended to more complex conditions and more convenient usage. One example is LLM-NERRE [26], which combines NER and RE to handle hierarchical information in scientific text. This approach has demonstrated the ability to effectively extract intricate scientific knowledge in tasks that require LLMs, involving complexities that typical PLMs cannot handle effectively. Meanwhile, LLMs can perform medical NER and RE well even without further training. The study [27] employed InstructGPT [28] to perform zero-/few-shot IE from clinical text, despite the model not being trained specifically for the clinical domain. The results illustrated that InstructGPT performs very well on biomedical evidence extraction, medication status extraction, and medication attribute extraction. This observation supports the notion that LLMs can be applied with flexibility and efficiency.
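The sequence-labeling formulation can be sketched concretely: a fine-tuned PLM tagger emits one BIO label per token, and a small decoding step groups those labels into entity spans. The tokens, tags, and entity types below are illustrative, not drawn from any specific clinical dataset or model.

```python
def decode_bio(tokens, tags):
    """Group token-level BIO tags (e.g. from a fine-tuned PLM tagger)
    into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes the open span
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

# Hypothetical tagger output for a clinical sentence:
tokens = ["Patient", "denies", "chest", "pain", "after", "taking", "aspirin"]
tags   = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O", "B-DRUG"]
print(decode_bio(tokens, tags))
# [('chest pain', 'PROBLEM'), ('aspirin', 'DRUG')]
```

An LLM, by contrast, can be prompted to emit the same spans directly as text, which is what makes the zero-/few-shot usage described above possible.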

Fig. 2. The organizational framework of the content. Sections 3 and 4 cover technical details, while Sections 2 and 5 are of greater value to Healthcare professionals.

Fig. 3. LLMs for healthcare: from fundamental tasks to advanced applications.
Despite these capabilities, LLMs still only perform comparably to specially trained state-of-the-art (SOTA) PLMs, particularly in domains involving professional terms and symbols. LLMs are trained on unlabeled data, with most of their knowledge derived from vast amounts of textual information. For domain-specific knowledge, however, such as specific types of named entities, LLMs' pragmatic understanding capabilities are likely to be less effective than those of PLMs fine-tuned on labeled data. Overall, we argue that both PLMs and LLMs have distinct advantages in IE tasks.
2.2. Text classification for healthcare
Text Classification (TC) aims to assign labels to texts of different lengths, such as phrases, sentences, paragraphs, or documents. In Healthcare research, a large amount of patient data is collected in electronic format, including disease status, medication history, and treatment outcomes. However, these data can only be used with appropriate labels, and TC is one of the most commonly used technologies for assigning them. For example, one study [29] proposed several methods based on hybrid Long Short-Term Memory (LSTM) and bidirectional gated recurrent units (Bi-GRU) to achieve medical TC. The study [30] used TC to identify prescription medications mentioned in tweets and achieved good results using PLMs. In addition, some studies employ TC-based Sentiment Analysis (SA) to understand patient emotions or mental healthcare, aiming to provide more humanized treatments [31].
However, PLM-based TC usually cannot satisfy the explainability and reliability requirements of the Healthcare field, while LLM-based TC mitigates these issues to some extent. For example, CARP [32] takes advantage of LLMs by introducing Clue And Reasoning Prompting to achieve better TC. This study adopts a progressive reasoning strategy tailored to the complex linguistic phenomena involved in TC. AMuLaP [33] is another example, which proposed Automatic Multi-Label Prompting for few-shot TC. By exploring automatic label selection, this method surpasses the GPT-3-style in-context learning method, showing significant improvements over previous PLM-based results.
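The clue-then-reason pattern behind CARP can be sketched as a prompt builder. The template below illustrates the general two-step idea only; it is not CARP's exact prompt, and the clinical note and label set are hypothetical.

```python
def build_clue_reasoning_prompt(text, labels, demonstrations=()):
    """Assemble a clue-and-reasoning style classification prompt:
    the model is asked to list clues first, then reason to a label.
    (Illustrative of the CARP pattern, not its published template.)"""
    lines = [f"Classify the clinical note into one of: {', '.join(labels)}."]
    for demo_text, demo_clues, demo_label in demonstrations:
        lines += [
            f"Note: {demo_text}",
            f"Clues: {demo_clues}",
            f"Reasoning and label: ... Therefore the label is {demo_label}.",
        ]
    lines += [
        f"Note: {text}",
        "Clues: first list the key terms and phrases that hint at the label.",
        "Reasoning and label: then reason step by step before answering.",
    ]
    return "\n".join(lines)

prompt = build_clue_reasoning_prompt(
    "Pt reports SOB and orthopnea, BNP elevated.",
    labels=["cardiac", "respiratory", "other"],
)
print(prompt)
```

The prompt string would then be sent to an LLM; forcing the clue-listing step before the label is what gives the prediction its explainable intermediate evidence.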
Unlike in general domains, where LLMs and SOTA PLMs exhibit similar performance, LLMs demonstrate a clear advantage in Healthcare TC, primarily due to the inherent complexity of specialized data. Healthcare texts are laden with specialized language, including technical terms, abbreviations, and jargon unique to the field. Moreover, the context in which these terms are used can significantly alter their meanings. For instance, the abbreviation ‘‘MI’’ might mean ‘‘mitral insufficiency’’ or ‘‘myocardial infarction’’, depending on the surrounding context. Given these conditions, Healthcare TC tasks require the integration of various types of data and an understanding of their interplay. This necessitates models that not only summarize information but also reason contextually. LLMs are well-suited for these tasks due to their deeper contextual understanding and ability to handle complex interactions within text, making them more effective for healthcare applications than PLMs.
2.3. Semantic textual similarity for healthcare
Semantic Textual Similarity (STS) measures the degree to which two sentences or two documents share the same meaning. In Healthcare, STS is often used to combine information from different sources, especially Electronic Health Records (EHRs). The 2018 BioCreative/Open Health NLP (OHNLP) challenge [34] and the National NLP Clinical Challenges (n2c2) 2019 Track 1 show that STS can help reduce the mistakes and disorganization in EHRs caused by copying and pasting or using templates. This means that STS can be used to check the quality of medical notes and make them more useful for other NLP tasks. The study [35] proposed a new method using ClinicalBERT, a fine-tuned BERT-based model. Its iterative multitask learning technique helps the model learn from related datasets and select the best ones for fine-tuning. Besides, STS can be used for Healthcare information retrieval. For example, if a patient asks a question like ‘‘I was diagnosed with non-clear cell renal cell carcinoma, what are the chances of recurrence after cure? Give me evidence from relevant scientific literature’’, an AI system may need to retrieve related databases to find papers containing semantically similar sentences. For doctors facing patients who are difficult to diagnose, this technology can identify similar patients for reference.
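A minimal STS sketch: production systems would compare dense sentence embeddings from a model such as ClinicalBERT, but a bag-of-words cosine similarity is enough to illustrate how near-duplicate EHR sentences score high while unrelated ones score low. The sentences are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(sent_a, sent_b):
    """Bag-of-words cosine similarity between two sentences.
    (A stand-in for embedding-based STS; real systems would compare
    dense sentence vectors from a model such as ClinicalBERT.)"""
    va, vb = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

a = "patient diagnosed with renal cell carcinoma"
b = "renal cell carcinoma diagnosed in patient"
c = "patient reports mild seasonal allergies"
print(round(cosine_similarity(a, b), 2))  # 0.83: near-duplicate notes
print(round(cosine_similarity(a, c), 2))  # 0.18: unrelated content
```

Ranking candidate notes or papers by this score against a query sentence is the retrieval use case described above; only the similarity function changes when moving to learned embeddings.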
When comparing PLMs and LLMs, we need to break the discussion down by task type. For short-text semantic classification, SOTA PLMs and LLMs are comparable. This is primarily because such tasks contain less contextual information, meaning the advantages of LLMs in managing large context windows and understanding complex narrative structures are less pronounced. In such cases, the fundamental ability of both PLMs and LLMs to understand and interpret language at a basic level plays a more significant role, leading to similar levels of performance. On the other hand, for tasks like information retrieval, LLMs tend to be overly complex and resource-intensive for the role of a simple retriever. Typically, LLMs excel in directly generating responses or completing texts based on given inputs. In contrast, PLMs, which are generally more lightweight, are better suited to retrieving external knowledge. This distinction makes PLMs more practical for applications where quick, efficient retrieval of information is required without the additional overhead of generating new text content.
2.4. Question answering for healthcare
Traditionally, QA is a separate task that involves generating or retrieving answers for given questions. In Healthcare, QA can be very beneficial for medical professionals seeking necessary information in clinical notes or literature, as well as for providing basic Healthcare knowledge to patients. According to a report by the Pew Research Center [36], over one-third of American adults have searched online for medical conditions they may have. A strong Healthcare QA system can significantly fulfill the consultation needs of patients. Many studies [21] explored how to adapt general PLMs to answer Healthcare questions, including designing specialized pretraining tasks, fine-tuning on Healthcare data, and introducing external Healthcare knowledge bases. However, due to their limited language understanding and generation abilities [37], PLM-based QA systems struggle to play a significant role in real-world Healthcare scenarios.
With the advent of powerful LLMs, prompt-based methods have been introduced to solve various tasks by formulating them as QA tasks, including NER, RE, and SA. Moreover, LLMs have significantly improved typical QA tasks in Healthcare. For instance, Med-PaLM 2 [1] approached or exceeded state-of-the-art performance on the MedMCQA [38], PubMedQA [39], and MMLU [40] clinical-topic QA datasets. The study [41] investigated the use of ChatGPT, Google Bard, and Claude for patient-specific QA from clinical notes. Another study [42] proposed a retrieval-based medical QA system that combines LLMs with knowledge graphs to address the challenge.
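The idea of grounding an LLM in a knowledge graph, as in the retrieval-based system above, can be sketched roughly as follows. The toy triples, keyword retriever, and prompt wording are all illustrative assumptions; a real system would query a curated resource such as UMLS and use a learned retriever.

```python
# A toy knowledge graph of (subject, relation, object) triples; a real system
# would query a curated resource such as UMLS instead of an in-memory list.
KG = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "has_side_effect", "gastrointestinal upset"),
    ("lisinopril", "treats", "hypertension"),
]

def retrieve_facts(question, kg=KG):
    """Keyword retrieval: keep triples whose subject appears in the question."""
    q = question.lower()
    return [t for t in kg if t[0] in q]

def build_qa_prompt(question):
    """Ground the LLM prompt in retrieved facts (retrieval-augmented QA)."""
    facts = retrieve_facts(question)
    context = "\n".join(f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts)
    return (f"Known facts:\n{context}\n"
            f"Question: {question}\n"
            "Answer using only the facts above.")

print(build_qa_prompt("What are the side effects of metformin?"))
```

Restricting the answer to retrieved facts is what lets such systems cite evidence and reduce unsupported claims, at the cost of depending on retrieval quality.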
Visual Question Answering (VQA) has recently garnered significant attention in the Healthcare field for its potential to meet the diverse needs of both patients and healthcare professionals. By facilitating the interpretation of medical images through question answering, VQA holds great promise for aiding diagnostics and enhancing patient understanding through educational tools. One of the key challenges in this domain is the precise identification and comprehension of critical regions in medical images, such as masses, anomalies, and lesions. Equally vital is ensuring that the semantic representation of these regions aligns with the specific demands articulated in textual queries, enabling the generation of contextually relevant and medically accurate responses. For example, the study [43] introduces a novel multiple meta-model quantification method for medical VQA tasks. This method effectively learns meta-annotations and extracts meaningful features; it is designed to enhance metadata through auto-annotation, handle noisy labels, and generate meta-models that produce more robust and reliable features. Besides, MISS [44] presents an efficient multi-task self-supervised learning framework, which unifies the text and multimodal encoders to effectively enhance the alignment of image-text features. Moreover, MISS introduces a novel Transfer-and-Caption method, leveraging LLMs to expand the feature space of single-modal image datasets.

Fig. 4. A comparison between PLM-based and LLM-based dialogue systems.

QA is one of LLMs' most outstanding abilities, and they are clearly superior to PLMs on such tasks. LLMs are increasingly being utilized to boost various real-world Healthcare applications, especially considering that only LLMs can support VQA tasks.
2.5. Dialogue system for healthcare
Chatbots have demonstrated promising potential to assist both patients and health professionals. The implementation of Healthcare dialogue systems can decrease the administrative workload of medical personnel and mitigate the negative consequences of physician shortages. Apart from the QA component, dialogue systems are generally classified into two categories: task-oriented and open-domain dialogue systems. The former is designed to address specific Healthcare issues, such as hospital guidance or medication consultations. In contrast, open-domain dialogue systems prioritize conversing with patients without any specific task. These systems are usually used as chatbots to provide emotional support or mental health-related applications [45]. For example, the study [46] shows that patients who participated in a telehealth project had lower scores for depression, anxiety, and stress, and experienced 38% fewer hospital admissions. In the early stages, the study [47] proposed an ontology-based dialogue system supporting electronic referrals for breast cancer, which can handle informative user responses based on a medical domain ontology. Another study, KR-DS [48], is an end-to-end knowledge-routed relational dialogue system that seamlessly incorporates a rich medical knowledge graph into topic transitions in dialogue management. One of the most notable features of PLM-based dialogue systems is that they often comprise multiple sub-modules, including dialogue management, natural language understanding, or knowledge introduction modules. Each individual sub-module within the overall system can become a bottleneck, thereby restricting the system's practical applications.
In the case of LLM-based dialogue systems, the original pipeline system can be transformed into an end-to-end system leveraging a powerful LLM [17], as shown in Fig. 4. By utilizing an LLM, the remaining work involves aligning the system with human preferences and fine-tuning it for specific fields, without the need for many extra sub-modules, while achieving advanced abilities that PLMs can hardly match. For example, a new approach [49] was proposed to detect depression through an interpretable and interactive system based on LLMs. The proposed system not only provides a diagnosis, but also offers diagnostic evidence grounded in established diagnostic criteria. Additionally, users can engage in dialogue with the system, which allows for a more personalized understanding of their mental state based on their social media content. ChatDoctor [50] is a specialized LLM designed to overcome limitations in medical knowledge, which can utilize real-time information from online sources to engage in conversations with patients.
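The pipeline-versus-end-to-end contrast can be sketched in code: a pipeline system chains separate understanding, policy, and generation modules, any of which can bottleneck the whole system, while an end-to-end system maps the dialogue history to a reply in one model call. Every function below is a simplified stand-in, and `call_llm` is a placeholder rather than a real model.

```python
# Pipeline-style system (PLM era): each stage is a separate module and a
# potential bottleneck. All components here are toy stand-ins.
def nlu(utterance):
    """Natural language understanding: map the utterance to a dialogue state."""
    if "aspirin" in utterance.lower():
        return {"intent": "ask_medication", "entity": "aspirin"}
    return {"intent": "other", "entity": None}

def dialogue_manager(state):
    """Dialogue policy: choose the next system action from the state."""
    return "lookup_drug" if state["intent"] == "ask_medication" else "fallback"

def nlg(action, state):
    """Surface realization: turn the chosen action into a reply."""
    if action == "lookup_drug":
        return f"Information on {state['entity']}: please consult dosing guidance."
    return "Could you rephrase your question?"

def pipeline_reply(utterance):
    state = nlu(utterance)
    return nlg(dialogue_manager(state), state)

# End-to-end system (LLM era): one aligned model maps the dialogue history
# to a reply directly. `call_llm` is a placeholder for a fine-tuned medical LLM.
def call_llm(prompt):
    return "(placeholder LLM output conditioned on: " + prompt.splitlines()[-1] + ")"

def end_to_end_reply(history):
    return call_llm("You are a medical assistant.\n" + "\n".join(history))

print(pipeline_reply("What should I know about aspirin?"))
print(end_to_end_reply(["User: What should I know about aspirin?"]))
```

The structural point is that the second design replaces three hand-built interfaces with one model boundary, so alignment and domain fine-tuning become the remaining engineering tasks.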
Table 1. Summarization of the strengths and weaknesses of PLMs and LLMs by task.
| Task | PLM features | LLM features | Comparison |
|---|---|---|---|
| Information extraction | Need labeled data | Zero-/few-shot | Have their own unique strengths |
| Text classification | Easy to adapt | Explainable and reliable | LLMs have a slight advantage |
| Semantic textual similarity | Skilled at short contexts and fundamental tasks | Skilled at long contexts and complex tasks | Depends on text length |
| Question answering | Limited language understanding and generation abilities | Better inherent professional knowledge | LLMs have a significant advantage |
| Dialogue system | Consist of multiple components | End-to-end system | LLMs have a significant advantage |
| Report generation | Limited generation abilities and only single modality | Multimodal LLMs | LLMs have a significant advantage |
2.6. Generation of medical reports from images
Medical reports are of significant clinical value to specialists, but writing them can be tedious, time-consuming, and error-prone for inexperienced practitioners. Therefore, the automatic generation of medical reports has emerged as a promising research direction in Healthcare. This capability can assist specialists in clinical decision-making and reduce the burden of report writing by automatically drafting reports that describe both abnormalities and relevant normal findings. Additionally, related models are expected to assist clinicians by pairing text reports with interactive visualizations, such as highlighting the region described by each phrase.
医疗报告对相关专科医生具有重要临床价值,但撰写过程对经验不足者而言往往繁琐耗时且易出错。因此,医疗报告自动生成已成为医疗健康领域极具前景的研究方向。该技术能通过自动生成同时描述异常和相关正常发现的报告,辅助临床决策并减轻报告撰写负担。此外,相关模型有望通过将文本报告与交互式可视化(如高亮显示每个短语描述的对应区域)相结合来辅助临床医生。
In an early stage, the study [51] proposed a data-driven method that combines a CNN to predict medical tags and generate a single-sentence report. However, a single-sentence report is too limited for real medical scenes. To generate multi-sentence reports, the study [52] proposed a multi-level recurrent generation model, which fused multiple image modalities by focusing on the frontal and lateral views. Most recently proposed models for automated report generation rely on multimodal technology implemented by LLMs, which can support more advanced applications. For example, VisualGPT [53] utilizes linguistic knowledge from LLMs and adapts it to new domains of image captioning in an efficient manner, even with small amounts of multimodal data. ChatCAD [54] introduced LLMs into medical-image Computer Aided Diagnosis (CAD) networks. Their proposed framework leverages the capabilities of LLMs to enhance the output of multiple CAD networks, including diagnosis networks, lesion segmentation networks, and report generation networks. Their results show that ChatCAD achieved significant improvements under various measures compared with the other two report-generation methods (R2GenCMN [55] and CvT2DistilGPT2 [56]). ChatCAD+ [57] is a multimodal system that addresses the writing style mismatch between radiologists and LLMs. The system is designed to be universal and reliable, capable of handling medical images from diverse domains and providing trustworthy medical advice by leveraging up-to-date information from reputable medical websites. For such a complex task, LLMs clearly outperform PLMs by a wide margin.
早期研究中,[51]提出了一种结合CNN的数据驱动方法,用于预测医学标签并生成单句报告。然而单句报告在实际医疗场景中存在局限性。为生成多语句报告,研究[52]提出了一种多级循环生成模型,通过聚焦前后视图融合多模态图像。近期多数自动化报告生成模型依赖于大语言模型实现的多模态技术,可支持更高级的应用。例如VisualGPT [53]利用大语言模型的语言知识,高效适配到图像描述新领域,即使仅使用少量多模态数据。ChatCAD [54]将大语言模型引入医学图像计算机辅助诊断(CAD)网络,其框架通过大语言模型增强多个CAD网络的输出,包括诊断网络、病灶分割网络和报告生成网络。实验表明,相较于另外两种报告生成方法(R2GenCMN [55]和CvT2DistilGPT2 [56]),ChatCAD在各指标上均取得显著提升。ChatCAD$^+$[57]是多模态系统,解决了放射科医生与大语言模型间的写作风格失配问题。该系统设计为通用可靠架构,能处理跨领域医学图像,并通过权威医学网站的最新信息提供可信医疗建议。对于此类复杂任务,大语言模型明显大幅优于PLM。
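The key idea behind ChatCAD-style systems is to translate the numeric outputs of CAD networks into text that an LLM can then polish into a fluent report. The following is a minimal sketch of that bridging step; the function name, threshold, and finding labels are illustrative assumptions, not details taken from [54]:

```python
def cad_outputs_to_prompt(findings, threshold=0.5):
    """Turn a CAD network's per-finding probabilities into a natural-language
    prompt that an LLM can rewrite into a report (illustrative only)."""
    positive = [name for name, p in findings.items() if p >= threshold]
    negative = [name for name, p in findings.items() if p < threshold]
    parts = []
    if positive:
        parts.append("Findings suggestive of: " + ", ".join(positive) + ".")
    if negative:
        parts.append("No evidence of: " + ", ".join(negative) + ".")
    parts.append("Please draft a concise radiology report consistent with these findings.")
    return " ".join(parts)

# Hypothetical probabilities from a chest X-ray diagnosis network.
prompt = cad_outputs_to_prompt({"cardiomegaly": 0.87, "pneumothorax": 0.08})
```

The resulting prompt, rather than raw probabilities, is what the LLM consumes, which is why the same language model can sit on top of heterogeneous diagnosis, segmentation, and report-generation networks.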
2.7. Summary
2.7. 总结
Based on the information provided, we summarize the strengths and weaknesses of PLMs and LLMs by different tasks in Table 1 and conclude the following points. For simpler fundamental tasks, the distinct advantages of LLMs are less apparent. However, as the complexity of advanced tasks increases, particularly those involving complex data conditions, requiring advanced semantic understanding, and demanding comprehensive generative capabilities, LLMs begin to demonstrate their strengths. Besides, given sufficient further training, LLMs play an integral role in specific sub-fields of Healthcare, and increasing emphasis is placed on their multimodal capability, since Healthcare data inherently consists of text, images, and time-series data. By leveraging the strengths of LLMs, researchers and Healthcare professionals can harness the power of multiple modalities to improve diagnostic accuracy and patient care.
根据所提供的信息,我们在表1中总结了不同任务下预训练语言模型(PLM)和大语言模型(LLM)的优缺点,并得出以下结论。对于较简单的基础任务,大语言模型的显著优势并不明显。但随着高级任务复杂度的提升,特别是涉及复杂数据条件、需要高级语义理解和综合生成能力的任务时,大语言模型开始展现其优势。此外,经过充分训练后,大语言模型在医疗健康特定子领域发挥着重要作用,并凸显其多模态能力。医疗健康数据天然包含文本、图像和时间序列数据,通过利用大语言模型的优势,研究人员和医疗专业人员可以整合多模态数据来提高诊断准确性和患者护理水平。
Beyond the accomplishments already discussed, several significant challenges remain for healthcare. A major obstacle is the complexity inherent in medical decision-making, which requires the incorporation of comprehensive patient information, including medical, psychological, and social aspects. While AI is proficient in analyzing data, it struggles with understanding complex human emotions and cultural nuances. This deficit is particularly evident in situations needing emotional support, such as during prolonged cancer care, where the empathetic engagement of healthcare professionals cannot be replicated by AI due to its inability to resonate emotionally.
除了已经讨论的成就外,医疗保健领域仍面临若干重大挑战。主要障碍在于医疗决策固有的复杂性,这需要整合包括医学、心理和社会因素在内的全面患者信息。虽然AI擅长数据分析,但在理解复杂人类情感和文化差异方面存在不足。这种缺陷在需要情感支持的情境中尤为明显,例如长期癌症护理期间,由于AI无法产生情感共鸣,医疗专业人员的同理心互动是其无法复现的。
Additionally, as AI becomes more embedded in healthcare, ethical and privacy issues intensify. Concerns about the handling of patient data, preserving privacy, and securing sensitive information are critical. Moreover, determining accountability in instances of diagnostic errors necessitates well-defined legal and ethical frameworks. Another concern is the unequal global distribution of technology, leading to a ‘‘digital divide’’. This divide risks leaving behind developing countries and economically disadvantaged areas, potentially worsening health disparities. AI also struggles with diseases characterized by ambiguous causes or intricate pathological processes. The effectiveness of AI is contingent on the extent of existing medical knowledge, and remains limited in fields that are not thoroughly understood. These challenges highlight the urgent need for collaborative efforts among professionals in healthcare, technology, law, and ethics globally to ensure that technological advancements are equitable, respectful of, and protective toward individual rights. Further discussion on these topics is available in Section 5.
此外,随着AI在医疗领域的深入应用,伦理与隐私问题日益凸显。患者数据处理、隐私保护及敏感信息安全成为核心关切。当出现诊断错误时,责任认定问题也亟需明确的法律和伦理框架。另一重挑战在于技术资源的全球分配不均,由此形成的"数字鸿沟"可能导致发展中国家和经济欠发达地区被边缘化,加剧健康不平等现象。AI对病因不明或病理机制复杂的疾病也表现乏力——其效能高度依赖现有医学认知水平,在尚未被充分理解的领域仍存在局限。这些挑战迫切要求全球医疗、技术、法律及伦理领域的专业人士协同合作,确保技术进步兼具公平性,并能尊重和保护个体权利。更多相关讨论详见第5节。
3. From PLMs to LLMs for healthcare
3. 从 PLMs 到医疗领域的大语言模型
Apart from the increasing model sizes, two significant developments from PLMs to LLMs are the transition from Discriminative AI to Generative AI and from model-centered to data-centered approaches. During the PLMs period, published PLMs were primarily evaluated on Natural Language Understanding (NLU) tasks, such as the mentioned NER, RE, and TC. These studies are grouped as discriminative AI, which concentrates on classification or regression tasks instead of generation tasks. In contrast, generative AI generates new content, often requiring the model to understand existing data (e.g., textual instructions) before generating new content. The evaluation tasks of generative AI are usually QA and conversation tasks.
除了模型规模的不断扩大,从预训练语言模型(PLM)到大语言模型(LLM)的两大重要演进是从判别式AI(Discriminative AI)向生成式AI(Generative AI)的转变,以及从以模型为中心到以数据为中心的方法转型。在预训练语言模型时期,已发表的模型主要针对自然语言理解(NLU)任务进行评估,例如前文提到的命名实体识别(NER)、关系抽取(RE)和文本分类(TC)。这类研究被归类为判别式AI,其核心聚焦于分类或回归任务而非生成任务。相比之下,生成式AI能够创造新内容,通常需要模型先理解现有数据(如文本指令)再进行内容生成。生成式AI的评估任务通常采用问答(QA)和对话任务。
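The contrast can be made concrete with a toy sketch: a discriminative model scores a fixed label set, whereas a generative model emits new tokens step by step over the whole vocabulary. The label set, vocabulary, and logit values below are invented purely for illustration:

```python
def argmax(xs):
    """Index of the largest value in a list."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Discriminative AI: map an encoded input to one of a fixed set of labels,
# e.g. relation types in a biomedical RE task.
LABELS = ["treats", "causes", "no_relation"]

def discriminative_predict(logits):
    """Classification head: return the highest-scoring label."""
    return LABELS[argmax(logits)]

# Generative AI: produce new content token by token.
VOCAB = ["aspirin", "reduces", "fever", "<eos>"]

def generative_decode(step_logits):
    """Greedy decoding: at each step pick the most likely next token."""
    tokens = []
    for logits in step_logits:
        token = VOCAB[argmax(logits)]
        if token == "<eos>":  # stop when the model emits the end marker
            break
        tokens.append(token)
    return tokens

label = discriminative_predict([2.0, 0.1, -1.0])
text = generative_decode([
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
])
```

The discriminative path can only ever emit one of its predefined labels, while the generative path composes unbounded new text, which is why QA and conversation became the natural evaluation tasks for LLMs.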
The second perspective is the change from model-centered to data-centered. Before the rise of LLMs, previous research focused on improving neural architecture to enhance the encoding abilities of proposed models. As neural models became increasingly larger, the overparameterization strategy demonstrated promising abilities in learning potential patterns reserved in annotated datasets. Under such conditions, high-quality data played a more significant role in further enhancing various Healthcare applications. On the other hand, recent related developments present a multimodal trend, providing significant support to the data of EHRs, medical images, and medical sequence signals. Based on powerful LLMs, more promising research and applications for Healthcare can be explored. Addressing the challenge of systematically collecting matched multimodal data holds significant importance. For such a reason, we list detailed data usages and access links of each LLM in Section 3.2.
第二个视角是从以模型为中心转向以数据为中心。在大语言模型兴起之前,先前的研究主要集中于改进神经架构以增强所提出模型的编码能力。随着神经模型规模不断扩大,过参数化策略在从标注数据集中学习潜在模式方面展现出优异性能。在此背景下,高质量数据对进一步提升各类医疗健康应用发挥着更为关键的作用。另一方面,近期相关发展呈现多模态趋势,为电子健康记录(EHR)、医学影像和医学序列信号数据提供了重要支持。基于强大的大语言模型,可以探索更多现有及潜在的医疗健康研究和应用。系统性地收集匹配的多模态数据这一挑战具有重要意义。为此,我们在3.2节详细列出了每个大语言模型的数据使用情况和获取链接。
3.1. PLMs for healthcare
3.1. 医疗领域的预训练语言模型 (PLMs)
While our survey primarily concentrates on LLMs for Healthcare, it is important to acknowledge that previous studies on PLMs have played a foundational role in the development of LLMs. In this section, we sum up the key research points for Healthcare PLMs, namely (1) enhancing neural architectures, and (2) utilizing more efficient pre-training tasks. These two points will be compared with the distinct study focus of LLMs in Section 3.2, to further support the transition from discriminative AI to generative AI and from model-centered to data-centered.
虽然我们的调查主要聚焦于医疗领域的大语言模型(LLM),但必须承认先前关于预训练语言模型(PLM)的研究为大语言模型的发展奠定了基础。本节我们将总结医疗PLM的两大核心研究方向:(1) 改进神经架构,(2) 采用更高效的预训练任务。这两点将与3.2节中大语言模型的独特研究重点进行对比,以进一步佐证从判别式AI向生成式AI(Generative AI)、从以模型为中心向以数据为中心的范式转变。
Table 2 Summarization of training data and evaluation tasks for existing PLMs for Healthcare. The different training methods are delineated with a solid line and the training data are further delineated with a dashed line.
| Model name | Base | Para. (B) | Training data | Eval task | Date | Link |
|---|---|---|---|---|---|---|
| BEHRT [58] | Transformer | 一 | CPRD,HES | Disease Prediction | 04/20 | Link |
| BioMegatron [59] | Megatron | 1.2 | PubMed | biomedical NER,RE,QA | 10/20 | Link |
| PubMedBERT [60] | BERT | 0.11 | PubMed | BLURB | 01/21 | Link |
| Bio-ELECTRA-small[61] | ELECTRA | 0.03 | PubMed | Biomedical NER | 03/20 | |
| BioELECTRA [62] | ELECTRA | 0.03 | PubMed, PMC | BLURB,BLUE | 06/21 | Link |
| AraBERT [63] | BERT | 0.11 | Arabic Wikipedia, OSIAN | Arabic SA, NER, QA | 03/21 | Link |
| FS-/RAD-/GER-BERT [64] | BERT | 0.11 | Unstructured radiology reports | Chest Radiograph Reports Classification | 07/20 | Link |
| VPP [65] | BART | 0.14 | PubMed | Biomedical NER | 03/23 | Link |
| BioBART [66] | BART | 0.14 | PubMed | Biomedical EL,NER,QA,Dialogue, Summarization | 04/22 | Link |
| BioLinkBERT [67] | BERT | 0.34 | PubMed | BLURB, USMLE | 03/22 | Link |
| ELECTRAMed [68] | ELECTRA | 0.11 | PubMed | Biomedical NER, RE, and QA | 04/21 | Link |
| KeBioLM [69] | PubMedBERT | 0.11 | PubMed | BLURB | 04/21 | Link |
| BioFLAIR [70] | BERT | 0.34 | PubMed | Bio NER | 08/19 | Link |
| ouBioBERT [71] | BERT | 0.11 | PubMed, Wikipedia | BLUE | 02/21 | Link |
| SCIFIVE [72] | T5 | 0.77 | PubMed, PMC | Biomedical NER,RE,NIL, QA | 05/21 | Link |
| BioBERT [73] | BERT | 0.11 | PubMed, PMC | Biomedical NER, RE,QA | 05/19 | Link |
| BioALBERT-ner [74] | ALBERT | 0.18 | PubMed, PMC | Biomedical NER | 09/20 | Link |
| GreenCovidSQuADBERT[75] | BERT | 0.34 | PubMed,PMC,CORD19 | NER, QA | 04/20 | Link |
| Bio-LM [76] | RoBERTa | 0.34 | PubMed, PMC, MIMIC-III | 18 Biomedical NLP Tasks | 11/20 | Link |
| BioALBERT [77] | ALBERT | 0.03 | PubMed,PMC, MIMIC-III | 6 BioNLP Tasks | 04/22 | Link |
| BlueBert [78] | BERT | 0.34 | PubMed, MIMIC-III | BLUE | 06/19 | Link |
| ClinicalBert [79] | BERT | 0.11 | MIMIC-III | Hospital Readmission Prediction | 11/20 | Link |
| Clinical XLNet [80] | XLNet | 0.11 | MIMIC-III | PMV, Mortality | 11/20 | Link |
| MIMIC-BERT [81] | BERT | 0.34 | MIMIC-III | Biomedical NER | 08/19 | |
| UmlsBERT [82] | BERT | 0.11 | MIMIC-III | MedNLI,i2b2 2006,2010,2012,2014 | 06/21 | Link |
| CharacterBERT [81] | BERT | 0.11 | MIMIC-III, OpenWebText, PMC | Medical NER, NLI, RE, SS | 10/20 | Link |
| Clinical KB-ALBERT [82] | ALBERT | 0.03 | MIMIC-III, UMLS | MedNLI, i2b2 2010, 2012 | 12/20 | Link |
| MedGPT [81] | GPT-2 | 1.5 | MIMIC-III, private EHRs | Disorder Prediction | 07/21 | |
| KAD [83] | BERT | 0.11 | MIMIC-CXR | PadChest, ChestXray14, CheXpert and ChestX-Det10 | 03/23 | Link |
| Japanese-BERT [84] | BERT | 一 | Japanese EHR | Symptoms Classification | 07/20 | |
| MC-BERT [85] | BERT | 0.11 | Chinese EHR | Chinese Biomedical Evaluation benchmark | 08/20 | Link |
| BERT-EHR [86] | BERT | 一 | General EHR | Myocardial Infarction, Breast Cancer, Liver Cirrhosis | 03/21 | Link |
| Med-BERT [87] | BERT | 0.11 | General EHR | Disease prediction | 05/21 | Link |
| SAPBERT [88] | BERT | 0.11 | UMLS | MEL | 10/22 | Link |
| CODER [89] | mBERT | 0.34 | UMLS | MCSM, Medical RE | 02/22 | Link |
| AlphaBERT [90] | BERT | 0.11 | Discharge diagnoses | Extractive Summarization Task | 04/20 | Link |
| BioMed-RoBERTa [91] | RoBERTa | 一 | BIOMED | CHEMPROT, RCT | 05/20 | Link |
| RadBERT [92] | BERT | 0.11 | Radiology Report Corpus | Report Coding, Summarization | 05/20 | Link |
| BioBERTpt [93] | BERT | 一 | 一 | SemClinBr | 一 | |
| RoBERTa-MIMIC [94] | RoBERTa | 0.11 | Private clinical notes, WMT16 | i2b2 2010, 2012, n2c2 2018 | 11/20 | Link |
| CHMBERT [95] | BERT | 0.11 | Medical text data | Disease Prediction | 12/20 | Link |
| Galén [96] | RoBERTa | 0.11 | Private clinical cases | CodiEsp-D, CodiEsp-P, Cantemist-Coding tasks | 05/21 | Link |
| Spanish-bert [97] | BERT | 一 | Spanish data | Spanish Clinical Case Corpus | 04/20 | |
| French-BERT [98] | BERT | 0.11 | French clinical documents | DEFT challenge | 06/20 | 一 |
| ABioNER [99] | BERT | 0.11 | Arabic scientific literature | Arabic NER | 03/21 | 一 |
| SINA-BERT [100] | BERT | 0.11 | Online Persian source | Persian QA, SA | 04/21 | 一 |
| CT-BERT [101] | BERT | 0.11 | Tweet | COVID-19 Text Classification | 05/20 | Link |
| MentalBERT [45] | BERT | 0.11 | Reddit | Depression, Stress, Suicide Detection | 10/21 | Link |
PMV means prolonged mechanical ventilation prediction. NER means Named Entity Recognition, NLI means Natural Language Inference, RE means Relation Extraction, SS means Sentence Similarity. MCSM means medical conceptual similarity measure [102]. MEL means medical entity linking. EL means Entity Linking. For clarity, we only list parts of representative evaluation tasks. For the column of Para. (B), only the largest size is listed.
表 2: 现有医疗领域预训练语言模型的训练数据和评估任务汇总。不同训练方法用实线分隔,训练数据进一步用虚线分隔。
| 模型名称 | 基础架构 | 参数量(B) | 训练数据 | 评估任务 | 日期 | 链接 |
|---|---|---|---|---|---|---|
| BEHRT [58] | Transformer | - | CPRD,HES | 疾病预测 | 04/20 | Link |
| BioMegatron [59] | Megatron | 1.2 | PubMed | 生物医学NER,RE,QA | 10/20 | Link |
| PubMedBERT [60] | BERT | 0.11 | PubMed | BLURB | 01/21 | Link |
| Bio-ELECTRA-small[61] | ELECTRA | 0.03 | PubMed | 生物医学NER | 03/20 | |
| BioELECTRA [62] | ELECTRA | 0.03 | PubMed, PMC | BLURB,BLUE | 06/21 | Link |
| AraBERT [63] | BERT | 0.11 | 阿拉伯语维基百科,OSIAN | 阿拉伯语SA,NER,QA | 03/21 | Link |
| FS-/RAD-/GER-BERT [64] | BERT | 0.11 | 非结构化放射学报告 | 胸部X光报告分类 | 07/20 | Link |
| VPP [65] | BART | 0.14 | PubMed | 生物医学NER | 03/23 | Link |
| BioBART [66] | BART | 0.14 | PubMed | 生物医学EL,NER,QA,对话,摘要 | 04/22 | Link |
| BioLinkBERT [67] | BERT | 0.34 | PubMed | BLURB,USMLE | 03/22 | Link |
| ELECTRAMed [68] | ELECTRA | 0.11 | PubMed | 生物医学NER,RE,QA | 04/21 | Link |
| KeBioLM [69] | PubMedBERT | 0.11 | PubMed | BLURB | 04/21 | Link |
| BioFLAIR [70] | BERT | 0.34 | PubMed | 生物医学NER | 08/19 | Link |
| ouBioBERT [71] | BERT | 0.11 | PubMed,维基百科 | BLUE | 02/21 | Link |
| SCIFIVE [72] | T5 | 0.77 | PubMed,PMC | 生物医学NER,RE,NIL,QA | 05/21 | Link |
| BioBERT [73] | BERT | 0.11 | PubMed,PMC | 生物医学NER,RE,QA | 05/19 | Link |
| BioALBERT-ner [74] | ALBERT | 0.18 | PubMed,PMC | 生物医学NER | 09/20 | Link |
| GreenCovidSQuADBERT[75] | BERT | 0.34 | PubMed,PMC,CORD19 | NER,QA | 04/20 | Link |
| Bio-LM [76] | RoBERTa | 0.34 | PubMed,PMC,MIMIC-III | 18项生物医学NLP任务 | 11/20 | Link |
| BioALBERT [77] | ALBERT | 0.03 | PubMed,PMC,MIMIC-III | 6项生物医学NLP任务 | 04/22 | Link |
| BlueBert [78] | BERT | 0.34 | PubMed,MIMIC-III | BLUE | 06/19 | Link |
| ClinicalBert [79] | BERT | 0.11 | MIMIC-III | 医院再入院预测 | 11/20 | Link |
| Clinical XLNet [80] | XLNet | 0.11 | MIMIC-III | PMV,死亡率预测 | 11/20 | Link |
| MIMIC-BERT [81] | BERT | 0.34 | MIMIC-III | 生物医学NER | 08/19 | |
| UmlsBERT [82] | BERT | 0.11 | MIMIC-III | MedNLI,i2b2 2006,2010,2012,2014 | 06/21 | Link |
| CharacterBERT [81] | BERT | 0.11 | MIMIC-III,OpenWebText,PMC | 医学NER,NLI,RE,SS | 10/20 | Link |
| Clinical KB-ALBERT [82] | ALBERT | 0.03 | MIMIC-III,UMLS | MedNLI,i2b2 2010,2012 | 12/20 | Link |
| MedGPT [81] | GPT-2 | 1.5 | MIMIC-III,私有EHR | 疾病预测 | 07/21 | |
| KAD [83] | BERT | 0.11 | MIMIC-CXR | PadChest,ChestXray14,CheXpert,ChestX-Det10 | 03/23 | Link |
| Japanese-BERT [84] | BERT | - | 日语EHR | 症状分类 | 07/20 | |
| MC-BERT [85] | BERT | 0.11 | 中文EHR | 中文生物医学评估基准 | 08/20 | Link |
| BERT-EHR [86] | BERT | - | 通用EHR | 心肌梗塞,乳腺癌,肝硬化预测 | 03/21 | Link |
| Med-BERT [87] | BERT | 0.11 | 通用EHR | 疾病预测 | 05/21 | Link |
| SAPBERT [88] | BERT | 0.11 | UMLS | MEL | 10/22 | Link |
| CODER [89] | mBERT | 0.34 | UMLS | MCSM,医学RE | 02/22 | Link |
| AlphaBERT [90] | BERT | 0.11 | 出院诊断 | 抽取式摘要任务 | 04/20 | Link |
| BioMed-RoBERTa [91] | RoBERTa | - | BIOMED | CHEMPROT,RCT | 05/20 | Link |
| RadBERT [92] | BERT | 0.11 | 放射学报告语料库 | 报告编码,摘要 | 05/20 | Link |
| BioBERTpt [93] | BERT | - | - | SemClinBr | - | |
| RoBERTa-MIMIC [94] | RoBERTa | 0.11 | 私有临床笔记,WMT16 | i2b2 2010,2012,n2c2 2018 | 11/20 | Link |
| CHMBERT [95] | BERT | 0.11 | 医学文本数据 | 疾病预测 | 12/20 | Link |
| Galén [96] | RoBERTa | 0.11 | 私有临床病例 | CodiEsp-D,CodiEsp-P,Cantemist编码任务 | 05/21 | Link |
| Spanish-bert [97] | BERT | - | 西班牙语数据 | 西班牙临床病例语料库 | 04/20 | |
| French-BERT [98] | BERT | 0.11 | 法语临床文档 | DEFT挑战赛 | 06/20 | - |
| ABioNER [99] | BERT | 0.11 | 阿拉伯科学文献 | 阿拉伯语NER | 03/21 | - |
| SINA-BERT [100] | BERT | 0.11 | 波斯语在线资源 | 波斯语QA,SA | 04/21 | - |
| CT-BERT [101] | BERT | 0.11 | 推特 | COVID-19文本分类 | 05/20 | Link |
| MentalBERT [45] | BERT | 0.11 | Reddit | 抑郁,压力,自杀检测 | 10/21 | Link |
PMV表示长期机械通气预测。NER表示命名实体识别,NLI表示自然语言推理,RE表示关系抽取,SS表示句子相似度。MCSM表示医学概念相似度测量[102]。MEL表示医学实体链接。EL表示实体链接。为清晰起见,我们仅列出部分代表性评估任务。参数量(B)列仅列出最大规模。
• Public Knowledge Bases. There exist many Healthcare-related knowledge bases, such as UMLS [103], CMeKG [104], BioModels [105], and DrugBank [106]. Among them, UMLS is one of the most popular, which is a repository of biomedical vocabularies developed by the US National Library of Medicine. The UMLS has over 2 million names for 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts. Based on this structured data, USMLE is organized and usually employed to test Healthcare LLMs. CMeKG [104] is a Chinese medical knowledge graph that has been constructed by referring to authoritative international medical standards and a wide range of sources, including clinical guidelines, industry standards, and medical textbooks. This knowledge graph serves as a comprehensive resource for medical information. Building upon the CMeKG, HuaTuo [107] utilizes diverse instructional data for its instruction tuning process.
• 公共知识库。存在许多与医疗相关的知识库,例如UMLS [103]、CMeKG [104]、BioModels [105]和DrugBank [106]。其中,UMLS是最受欢迎的之一,它是由美国国家医学图书馆开发的生物医学词汇库。UMLS包含来自60多个生物医学词汇家族的90万个概念的200多万个名称,以及这些概念之间的1200万种关系。基于这些结构化数据,USMLE被组织起来,通常用于测试医疗领域的大语言模型。CMeKG [104]是一个中文医学知识图谱,其构建参考了国际权威医学标准和广泛的来源,包括临床指南、行业标准和医学教材。该知识图谱是医学信息的综合资源。基于CMeKG,华佗 [107] 在其指令微调过程中利用了多样化的指令数据。
• Data for Instruction Fine-Tuning. The aforementioned data typically consists of general text that is commonly used for pre-training PLMs or LLMs. However, when transitioning from PLMs to LLMs, instruction data becomes crucial to equip LLMs with the capability of following instructions effectively. Unlike PLMs, which primarily focus on next-word prediction, LLMs place greater emphasis on responding to specific instructions. By leveraging a sufficient amount of instruction data for fine-tuning, an LLM can appropriately generate the desired output. This emphasizes the importance of instruction-based training for LLMs to achieve accurate and contextually relevant responses.
• 指令微调数据。上述数据通常由用于预训练PLM或大语言模型的通用文本组成。然而,从PLM过渡到大语言模型时,指令数据变得至关重要,它使大语言模型具备有效遵循指令的能力。与主要关注下一个词预测的PLM不同,大语言模型更强调对特定指令的响应。通过利用足够数量的指令数据进行微调,大语言模型能够恰当地生成所需输出。这凸显了基于指令的训练对于大语言模型获得准确且上下文相关响应的重要性。
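As an illustration, a single instruction record and the way it is concatenated for supervised fine-tuning can be sketched as below. The field names follow the common Alpaca-style (instruction, input, output) convention; the medical content is a made-up example, not drawn from any dataset cited here:

```python
# One hypothetical instruction-tuning record.
record = {
    "instruction": "List the possible drug interactions for the medication below.",
    "input": "Warfarin",
    "output": "Warfarin may interact with aspirin, which increases bleeding risk.",
}

def build_training_example(rec):
    """Concatenate the record into a single string; during SFT the model is
    trained to continue the prompt part with the reference response."""
    prompt = (
        f"### Instruction:\n{rec['instruction']}\n\n"
        f"### Input:\n{rec['input']}\n\n"
        f"### Response:\n"
    )
    return prompt + rec["output"]

example = build_training_example(record)
```

Only the tokens after "### Response:" are typically treated as supervision targets, which is what shifts the model from plain next-word prediction toward instruction following.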
For Healthcare PLMs, as shown in Table 2, a majority of the models utilize the discriminative approach, predominantly built upon the BERT architecture. The rationale behind this architectural choice is evident: many typical Healthcare applications are classification tasks. These tasks range from NER in the biomedical domain to more specific challenges such as disease prediction and relation extraction. In addition, the methodology of fine-tuning (FT) stands out as the prevalent training methodology. This trend suggests a broader implication: while general pretrained models offer a foundational grasp of language, they require refinement through domain-specific data to excel in the applications of Healthcare. The choice of training datasets provides further support to the models’ intent of achieving a holistic understanding of the medical domain.
如表 2 所示,医疗领域预训练语言模型 (Healthcare PLMs) 大多采用判别式方法 (discriminative approach),主要基于 BERT 架构。这种架构选择的理由显而易见:医疗领域的典型应用多为分类任务,涵盖生物医学命名实体识别 (NER)、疾病预测和关系抽取等具体场景。此外,微调 (fine-tuning/FT) 成为主流的训练方法,这一趋势揭示了一个深层现象:虽然通用预训练模型提供了语言理解的基础能力,但需要通过领域数据精调才能在医疗应用中表现出色。训练数据集的选择进一步佐证了这些模型旨在实现医疗领域的全面理解。
Table 3 Summarization of training data and evaluation tasks for existing LLMs for Healthcare. The different training methods are delineated with a solid line and the training data are further delineated with a dashed line. The color names represent popular evaluation datasets. More detailed performance comparisons are shown in Table 4.
| Model name | Method | Training data | Evaluate datasets or tasks | Date | Link |
|---|---|---|---|---|---|
| GatorTron[108] | PT | Clinical notes | CNER,MRE,MQA | 06/22 | Link |
| GatorTronGPT [109] | PT | Clinical and general text | PubMedQA, USMLE, MedMCQA, DDI, BC5CDR | 05/23 | Link |
| Galactica [110] | PT+SFT | DNA,AA sequence | MedMCQA,PubMedQA,Medical Genetics | 11/22 | Link |
| Me LLaMA [111] | PT+SFT | PubMed, MIMIC-II, MIMIC-IV, MIMIC-CXR | MIBE benchmark [111] | 04/24 | Link |
| MedChatZH [112] | PT+SFT | Text Books,medical and general instructions | WebMedQA | 09/23 | Link |
| BioMistral [113] | PT+SFT | PubMed central data | MMLU,USMLE, MedMCQA,PubMedQA | 02/24 | Link |
| Visual Med-Alpaca [114] | PT+SFT | Medical QA | 一 | 04/23 | Link |
| Apollo [115] | PT+SFT | Books, clinical guidelines, encyclopedias. | XMedBench | 03/24 | Link |
| CancerLLM [116] | PT+SFT | Clinical notes, Pathology report | Cancer Diagnosis Generation, Cancer Phenotype Extraction | 06/24 | 一 |
| MedAlpaca [117] | SFT | Medical QA and dialogues | USMLE,Medical Meadow | 04/23 | Link |
| BenTsao [107] | SFT | Medical QA, Medical knowledge graph | Customed medical QA | 04/23 | Link |
| BianQue [118] | SFT | Medical QA | 一 | 04/23 | Link |
| Med-PaLM 2 [1] | SFT | Medical QA | MultiMedQA,Long-form QA | 05/23 | 一 |
| SoulChat [13] | SFT | Empathetic dialogue, Long text | 一 | 06/23 | Link |
| ChatDoctor[50] | SFT | Patient-doctor dialogues | iCliniq | 03/23 | Link |
| DoctorGLM [119] | SFT | Chinese medical dialogues | 一 | 04/23 | Link |
| OncoGPT [120] | SFT | Oncology conversations | Oncology Question Answering | 02/24 | Link |
| HuatuoGPT [11] | SFT | Conversation data and instruction | CmedQA, webmedQA, and Huatuo-26M | 05/23 | Link |
| Med-PaLM [121] | SFT | Medical data | MultiMedQA,HealthSearchQA | 12/22 | 一 |
| PMC-LLaMA [122] | SFT | Biomedical academic papers | PubMedQA,MedMCQA,USMLE | 04/23 | Link |
| HealAI [123] | SFT | Medical note data, instruction data | Medical Note Writing | 03/24 | 一 |
| BiMediX [124] | SFT | 1.3 million English-Arabic dataset | An Arabic-English benchmark | 02/24 | Link |
| Medical mT5 [125] | SFT | Multilingual medical corpus | SequenceLabeling,QA | 04/24 | Link |
| EpiSemoGPT [126] | SFT | Related publications | Predicting epileptogenic zones | 05/24 | 一 |
| MedAGI [10] | SFT | Public medical datasets and images | SkinGPT-4, XrayChat, PathologyChat | 06/23 | Link |
| Med-Flamingo [8] | SFT | Image-caption/tokens pairs | VQA-RAD, Path-VQA, Visual USMLE | 07/23 | Link |
| LLaVA-Med [9] | SFT | Multimodal biomedical instruction | VQA-RAD, SLAKE, PathVQA | 06/23 | Link |
| OphGLM [12] | SFT | Fundus image, knowledge graphs | Fundus diagnosis pipeline tasks [12] | 06/23 | Link |
| LLM-CXR [127] | SFT | MIMIC-CXR | Report generation, VQA, CXR generation | 05/23 | Link |
| JMLR [128] | SFT | MIMIC-IV dataset, medical textbooks, PubMed | USMLE, Amboss, MedMCQA, and MMLU-Medical | 02/24 | Link |
| ClinicalGPT [129] | SFT+RLHF | Medical dialogues and QA, EHR | MedDialog, MEDQA-MCMLE, MD-EHR, cMedQA2 | 06/23 | Link |
| Polaris [130] | SFT+RLHF | Proprietary healthcare data | Healthcare conversational tasks | 03/24 | 一 |
| Zhongjing [131] | PT+SFT+RLHF | Medical books, health records, clinical reports | CMtMedQA, Huatuo-26M | 08/23 | Link |
| Qilin-Med [132] | PT+SFT+DPO | Medical QA, plain texts, knowledge graphs | CMExam, CEval, Huatuo-26M | 04/24 | 一 |
| Aloe-Alpha [133] | PT+SFT+DPO | Medical QA, CoT, synthetic data | MultiMedQA, MedMCQA, USMLE, PubMedQA, etc. | 05/24 | 一 |
\* means the study focuses on evaluating the Healthcare LLM, rather than proposing a new LLM. PT means pre-training, ICL means In-context-learning (no parameters updated), SFT means supervised fine-tuning, RLHF means reinforcement learning from human feedback, and DPO means Direct Preference Optimization.
表 3: 现有医疗领域大语言模型的训练数据与评估任务汇总。实线区分不同训练方法,虚线进一步区分训练数据。颜色名称代表常用评估数据集,详细性能对比见表4。
| 模型名称 | 方法 | 训练数据 | 评估数据集/任务 | 日期 | 链接 |
|---|---|---|---|---|---|
| GatorTron[108] | PT | 临床记录 | CNER,MRE,MQA | 06/22 | Link |
| GatorTronGPT [109] | PT | 临床与通用文本 | PubMedQA, USMLE, MedMCQA, DDI, BC5CDR | 05/23 | Link |
| Galactica [110] | PT+SFT | DNA,氨基酸序列 | MedMCQA,PubMedQA,Medical Genetics | 11/22 | Link |
| Me LLaMA [111] | PT+SFT | PubMed, MIMIC-II, MIMIC-IV, MIMIC-CXR | MIBE benchmark [111] | 04/24 | Link |
| MedChatZH [112] | PT+SFT | 教科书、医疗与通用指令 | WebMedQA | 09/23 | Link |
| BioMistral [113] | PT+SFT | PubMed Central数据 | MMLU,USMLE, MedMCQA,PubMedQA | 02/24 | Link |
| Visual Med-Alpaca [114] | PT+SFT | 医疗问答 | - | 04/23 | Link |
| Apollo [115] | PT+SFT | 书籍、临床指南、百科全书 | XMedBench | 03/24 | Link |
| CancerLLM [116] | PT+SFT | 临床记录、病理报告 | 癌症诊断生成、癌症表型提取 | 06/24 | - |
| MedAlpaca [117] | SFT | 医疗问答与对话 | USMLE,Medical Meadow | 04/23 | Link |
| BenTsao [107] | SFT | 医疗问答、医学知识图谱 | 定制医疗问答 | 04/23 | Link |
| BianQue [118] | SFT | 医疗问答 | - | 04/23 | Link |
| Med-PaLM 2 [1] | SFT | 医疗问答 | MultiMedQA,长文本问答 | 05/23 | - |
| SoulChat [13] | SFT | 共情对话、长文本 | - | 06/23 | Link |
| ChatDoctor[50] | SFT | 医患对话 | iCliniq | 03/23 | Link |
| DoctorGLM [119] | SFT | 中文医疗对话 | - | 04/23 | Link |
| OncoGPT [120] | SFT | 肿瘤学对话 | 肿瘤学问答 | 02/24 | Link |
| HuatuoGPT [11] | SFT | 对话数据与指令 | CmedQA, webmedQA, Huatuo-26M | 05/23 | Link |
| Med-PaLM [121] | SFT | 医疗数据 | MultiMedQA,HealthSearchQA | 12/22 | - |
| PMC-LLaMA [122] | SFT | 生物医学学术论文 | PubMedQA,MedMCQA,USMLE | 04/23 | Link |
| HealAI [123] | SFT | 医疗记录数据、指令数据 | 医疗记录撰写 | 03/24 | - |
| BiMediX [124] | SFT | 130万英阿数据集 | 阿拉伯语-英语基准 | 02/24 | Link |
| Medical mT5 [125] | SFT | 多语言医疗语料 | 序列标注,问答 | 04/24 | Link |
| EpiSemoGPT [126] | SFT | 相关出版物 | 预测致痫区 | 05/24 | - |
| MedAGI [10] | SFT | 公共医疗数据集与图像 | SkinGPT-4, XrayChat, PathologyChat | 06/23 | Link |
| Med-Flamingo [8] | SFT | 图像-标题/token对 | VQA-RAD, Path-VQA, Visual USMLE | 07/23 | Link |
| LLaVA-Med [9] | SFT | 多模态生物医学指令 | VQA-RAD,SLAKE,PathVQA | 06/23 | Link |
| OphGLM [12] | SFT | 眼底图像、知识图谱 | 眼底诊断流程任务 [12] | 06/23 | Link |
| LLM-CXR [127] | SFT | MIMIC-CXR | 报告生成,VQA,CXR生成 | 05/23 | Link |
| JMLR [128] | SFT | MIMIC-IV数据集、医学教科书、PubMed | USMLE, Amboss, MedMCQA, MMLU-Medical | 02/24 | Link |
| ClinicalGPT [129] | SFT+RLHF | 医疗对话与问答、电子健康记录 | MedDialog, MEDQA-MCMLE, MD-EHR, cMedQA2 | 06/23 | Link |
| Polaris [130] | SFT+RLHF | 专有医疗数据 | 医疗对话任务 | 03/24 | - |
| Zhongjing [131] | PT+SFT+RLHF | 医学书籍、健康档案、临床报告 | CMtMedQA,Huatuo-26M | 08/23 | Link |
| Qilin-Med [132] | PT+SFT+DPO | 医疗问答、纯文本、知识图谱 | CMExam, CEval, Huatuo-26M | 04/24 | - |
| Aloe-Alpha [133] | PT+SFT+DPO | 医疗问答、思维链、合成数据 | MultiMedQA, MedMCQA, USMLE, PubMedQA等 | 05/24 | - |
\* 表示研究侧重评估医疗大语言模型而非提出新模型。PT指预训练(pre-training),ICL指上下文学习(In-context-learning,不更新参数),SFT指监督微调(supervised fine-tuning),RLHF指人类反馈强化学习(reinforcement learning from human feedback),DPO指直接偏好优化(Direct Preference Optimization)。
Unlike PLMs, LLMs have the advantage of eliminating the need for FT and can directly perform inference on various downstream tasks. Moreover, the core research focus no longer primarily revolves around improving neural architectures and developing more efficient pre-training tasks for Healthcare. Consequently, research on LLMs is garnering increased attention.
与PLM不同,大语言模型(LLM)具有无需微调(FT)即可直接在下游任务进行推理的优势。此外,其核心研究方向不再主要围绕改进神经网络架构或开发更高效的医疗领域预训练任务。因此,大语言模型研究正获得越来越多的关注。
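This "no fine-tuning" usage pattern reduces to expressing each downstream task as a textual instruction to one frozen model. A minimal sketch, where the task wordings and clinical snippets are invented for illustration:

```python
def zero_shot_prompt(task_description, text):
    """With an LLM, a downstream task is expressed as a textual instruction
    instead of a fine-tuned task-specific head; no parameters are updated."""
    return f"{task_description}\n\nText: {text}\nAnswer:"

# The same frozen model serves different tasks just by changing the prompt.
ner_prompt = zero_shot_prompt(
    "Extract all disease mentions from the clinical text.",
    "Patient denies chest pain but reports type 2 diabetes.",
)
cls_prompt = zero_shot_prompt(
    "Classify the note as 'radiology' or 'discharge summary'.",
    "CHEST X-RAY: No acute cardiopulmonary process.",
)
```

Contrast this with the PLM workflow in Table 2, where each evaluation task required its own labeled dataset and a separate fine-tuned checkpoint.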
3.2. LLMs for healthcare
3.2. 医疗领域的大语言模型
With the surge in general LLM studies, there has also been a notable development of LLMs specifically tailored for Healthcare. In contrast to the emphasis on neural architecture designs and pre-training tasks in previous PLMs research, the studies on LLMs for Healthcare place greater emphasis on collecting diverse, precise, and professional Healthcare data, as well as on data security and privacy protection. In the following sections, we present an overview and analysis of published Healthcare LLMs. For the sake of convenience, we have compiled the pertinent information in Tables 3 and 5. We categorize current LLMs based on their training methods, training data, evaluation, and distinct features, and offer detailed comparisons. Table 4 presents a summary of the performance on the three most popular datasets used to evaluate Healthcare LLMs, aimed at enabling more straightforward comparisons, and also offering a clear perspective on the current capabilities of excellent Healthcare LLMs.
随着通用大语言模型研究的激增,专门针对医疗健康领域定制的大语言模型也取得了显著进展。与以往预训练语言模型(PLM)研究侧重于神经架构设计和预训练任务不同,医疗健康领域的大语言模型研究更强调多样化、精准且专业的医疗数据收集,以及数据安全与隐私保护。以下章节我们将对已发布的医疗健康大语言模型进行概述与分析。为便于查阅,我们已将相关信息整理至表3和表5。我们根据训练方法、训练数据、评估指标和特色功能对现有大语言模型进行分类,并提供详细对比。表4汇总了评估医疗健康大语言模型最常用的三个数据集的性能表现,旨在提供更直观的对比基准,同时清晰展现当前优秀医疗健康大语言模型的实际能力。
Table 4 The performance summarization for different Healthcare LLMs on three popular datasets.
| (%) | USMLE | MedMCQA | PubMedQA |
|---|---|---|---|
| FT BERT | 44.62 [67] | 43.03 [60] | 72.20 [67] |
| Galactica | 44.60 | 52.90 | 77.60 |
| PMC-LLaMA | 44.70 | 50.54 | 69.50 |
| GatorTronGPT | 42.90 | 45.10 | 77.60 |
| DoctorGLM | 67.60 | 一 | 一 |
| MedAlpaca | 60.20 | 一 | 一 |
| Codex | 60.20 | 62.70 | 78.20 |
| Med-PaLM | 67.60 | 57.60 | 79.00 |
| Aloe-Alpha | 71.01 | 64.47 | 80.20 |
| Med-PaLM 2 | 86.50 | 72.30 | 81.80 |
| GPT-4 | 86.70 | 73.66 | 80.40 |
| Human | 87.00 | 90.00 | 78.00 |
表 4 不同医疗大语言模型在三个流行数据集上的性能汇总 (%)
| (%) | USMLE | MedMCQA | PubMedQA |
|---|---|---|---|
| FT BERT | 44.62 [67] | 43.03 [60] | 72.20 [67] |
| Galactica | 44.60 | 52.90 | 77.60 |
| PMC-LLaMA | 44.70 | 50.54 | 69.50 |
| GatorTronGPT | 42.90 | 45.10 | 77.60 |
| DoctorGLM | 67.60 | — | — |
| MedAlpaca | 60.20 | — | — |
| Codex | 60.20 | 62.70 | 78.20 |
| Med-PaLM | 67.60 | 57.60 | 79.00 |
| Aloe-Alpha | 71.01 | 64.47 | 80.20 |
| Med-PaLM 2 | 86.50 | 72.30 | 81.80 |
| GPT-4 | 86.70 | 73.66 | 80.40 |
| Human | 87.00 | 90.00 | 78.00 |
• Different Training Methods. Unlike PLMs, the strategy of training LLMs from scratch is not popular for Healthcare LLMs. GatorTron [108] and GatorTronGPT [109] are the only two Healthcare LLMs trained from scratch with pre-training (PT) alone. One reason is that acquiring and properly anonymizing medical data for training involves navigating complex legal and ethical issues. Additionally, due to the specialized nature of medical data and the high demands for accuracy, training a model from scratch requires substantial computational resources and an extremely large healthcare text corpus, which is more expensive than for general LLMs. Compared with PLMs, which require fewer parameters and less training data, the significance of the PT method is in decline.
• 不同训练方法。与预训练语言模型(PLM)不同,从头开始训练大语言模型的策略在医疗领域并不常见。GatorTron [108]和GatorTronGPT [109]是仅有的两个仅通过预训练(PT)从头训练的医疗大语言模型。原因之一在于获取并妥善匿名化医疗训练数据需要处理复杂的法律和伦理问题。此外,由于医疗数据的专业性和对准确性的高要求,从头训练模型需要大量计算资源和海量医疗文本,其成本远高于通用大语言模型。相比参数更少、训练数据量要求更低的预训练语言模型,预训练方法的重要性正在下降。
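Among the later training stages listed in Table 3 (SFT, RLHF, DPO), DPO is the easiest to state compactly: it optimizes the policy directly on preference pairs, without a separate reward model. Below is a minimal sketch of the per-pair DPO objective; β and the log-probabilities are illustrative numbers, not values from Qilin-Med [132] or Aloe-Alpha [133]:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    Inputs are summed token log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy prefers the chosen answer more than the reference does,
# the loss falls below -log(0.5) = log(2); at equal preference it equals log(2).
better = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
neutral = dpo_loss(logp_w=-6.0, logp_l=-6.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
```

Minimizing this loss pushes the policy to widen the gap between preferred and rejected medical answers relative to the reference model, which is why PT+SFT+DPO pipelines can encode clinician preferences without an explicit reward model.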
Table 5 Brief summarization of existing LLMs for Healthcare, sorted in chronological order of publication.
| Model name | Size (B) | Features |
|---|---|---|
| GatorTron [108] | 8.9 | Training from scratch |
| Galactica [110] | 120 | Reasoning, Multidisciplinary |
| Med-PaLM [121] | 540 | CoT, Self-consistency |
| ChatDoctor [50] | 7 | Retrieve online, External knowledge |
| DoctorGLM [119] | 6 | Extra prompt designer |
| MedAlpaca [117] | 13 | Adapt to Medicine |
| BenTsao [107] | 7 | Knowledge graph |
| PMC-LLaMA [122] | 7 | Adapt to Medicine |
| Visual Med-Alpaca [114] | 7 | Multimodal generative model, Self-Instruct |
| BianQue [118] | 6 | Chain of Questioning |
| Med-PaLM 2 [1] | 340 | Ensemble refinement, CoT, Self-consistency |
| GatorTronGPT [109] | 20 | Training from scratch for medicine |
| LLM-CXR [127] | 3 | Multimodal, Chest X-rays |
| HuatuoGPT [11] | 7 | Reinforcement learning from AI feedback |
| ClinicalGPT [129] | 7 | Multi-round dialogue consultations |
| MedAGI [10] | — | Multimodal |
| LLaVA-Med [9] | 13 | Multimodal, Self-instruct, Curriculum learning |
| OphGLM [12] | 6 | Multimodal, Ophthalmology LLM |
| SoulChat [13] | 6 | Mental Healthcare |
| Med-Flamingo [8] | 80 | Multimodal, Few-shot medical VQA |
| Zhongjing [131] | 13 | Multi-turn Chinese medical dialogue |
| MedChatZH [112] | 7 | Traditional Chinese Medicine, Bilingual |
| JMLR [128] | 13 | RAG, LLM-Rank loss |
| BioMistral [113] | 7 | Multilingual, Model merging emphasis |
| BiMediX [124] | 47 | English and Arabic language |
| OncoGPT [120] | 7 | Real-world doctor-patient oncology dialogue |
| Polaris [130] | — | Several specialized support agents |
| HealAI [123] | 540 | RAG, Interactive Editing |
| Apollo [115] | 7 | Multilingual, Lightweight, Proxy tuning |
| Medical mT5 [125] | 3 | Multilingual |
| Qilin-Med [132] | 7 | Domain-specific pre-training, RAG |
| Me LLaMA [111] | 70 | Catastrophic Forgetting |
| EpiSemoGPT [126] | 7 | Predicting epileptogenic zones |
| Aloe-Alpha [133] | 8 | Synthetic CoT |
| CancerLLM [116] | 7 | Specifically for cancer |
表 5: 现有医疗领域大语言模型的简要总结。按发布时间排序。
| 模型名称 | 参数量 (B) | 主要特性 |
|---|---|---|
| GatorTron [108] | 8.9 | 从头训练 |
| Galactica [110] | 120 | 推理能力,多学科 |
| Med-PaLM [121] | 540 | 思维链(CoT),自洽性 |
| ChatDoctor [50] | 7 | 在线检索,外部知识 |
| DoctorGLM [119] | 6 | 额外提示设计器 |
| MedAlpaca [117] | 13 | 医学领域适配 |
| BenTsao [107] | 7 | 知识图谱 |
| PMC-LLaMA [122] | 7 | 医学领域适配 |
| Visual Med-Alpaca [114] | 7 | 多模态生成模型,自指令 |
| BianQue [118] | 6 | 问题链 |
| Med-PaLM 2 [1] | 340 | 集成优化,思维链,自洽性 |
| GatorTronGPT [109] | 20 | 医学领域从头训练 |
| LLM-CXR [127] | 3 | 多模态,胸部X光 |
| HuatuoGPT [11] | 7 | AI反馈强化学习 |
| ClinicalGPT [129] | 7 | 多轮问诊对话 |
| MedAGI [10] | - | 多模态 |
| LLaVA-Med [9] | 13 | 多模态,自指令,课程学习 |
| OphGLM [12] | 6 | 多模态,眼科大模型 |
| SoulChat [13] | 6 | 心理健康护理 |
| Med-Flamingo [8] | 80 | 多模态,少样本医疗问答 |
| Zhongjing [131] | 13 | 中文多轮医疗对话 |
| MedChatZH [112] | 7 | 中医,双语支持 |
| JMLR [128] | 13 | 检索增强生成(RAG),排序损失 |
| BioMistral [113] | 7 | 多语言,模型融合优化 |
| BiMediX [124] | 47 | 英语和阿拉伯语 |
| OncoGPT [120] | 7 | 真实医患肿瘤对话 |
| Polaris [130] | - | 多个专业支持智能体 |
| HealAI [123] | 540 | 检索增强生成,交互式编辑 |
| Apollo [115] | 7 | 多语言,轻量化,代理调优 |
| Medical mT5 [125] | 3 | 多语言 |
| Qilin-Med [132] | 7 | 领域预训练,检索增强生成 |
| Me LLaMA [111] | 70 | 灾难性遗忘 |
| EpiSemoGPT [126] | 7 | 癫痫灶预测 |
| Aloe-Alpha [133] | 8 | 合成思维链 |
| CancerLLM [116] | 7 | 癌症专项 |
Besides PT, the prevalent method for adapting a general LLM into a Healthcare LLM is SFT. As shown in Table 3, 21 LLM studies use only SFT to tune their models. In addition, Galactica, Me LLaMA, MedChatZH, BioMistral, Visual Med-Alpaca, and Apollo employ a two-step training process, namely PT first and then SFT. Among the above models, Galactica [110] is an early-stage study that demonstrated the effectiveness of SFT. This LLM is designed to handle the information overload in the scientific domain, including Healthcare. JMLR [128] introduces a method that enhances medical reasoning and question answering by integrating the SFT training method with information retrieval systems during the fine-tuning phase. This approach not only improves the model’s ability to utilize medical knowledge effectively but also significantly cuts down on computational resources; remarkably, JMLR required only 148 GPU hours for training. MedAlpaca [117] addresses privacy concerns in healthcare by adopting an open-source policy for on-site deployment and uses LoRA [148] for task-specific weight updates.
除了PT之外,将通用大语言模型适配为医疗大语言模型的常用方法还包括SFT。如表3所示,有21项大语言模型研究仅使用SFT进行模型调优。此外,Galactica、Me LLaMA、MedChatZH、BioMistral、Visual Med-Alpaca和Apollo采用了两阶段训练流程,即先进行PT再进行SFT。其中Galactica [110]作为早期研究验证了SFT的有效性,该模型专为应对科学领域(包括医疗健康)的信息过载问题而设计。JMLR [128]提出了一种在微调阶段结合SFT训练方法与信息检索系统的方案,显著提升了医学推理和问答能力,不仅优化了模型对医学知识的运用效率,还大幅降低了计算资源消耗——其训练仅需148个GPU小时。MedAlpaca [117]通过采用开源策略实现本地化部署以解决医疗隐私问题,并利用LoRA [148]进行任务特定的权重更新。
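The LoRA mechanism that MedAlpaca uses for task-specific weight updates can be sketched in a few lines. This is a minimal, NumPy-only illustration of the low-rank idea, not MedAlpaca's actual implementation; the class name and dimensions are hypothetical.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA-adapted linear layer: the frozen base weight W is
    augmented with a trainable low-rank update (alpha / r) * B @ A,
    so fine-tuning only touches the small matrices A and B."""

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.01      # trainable down-projection
        self.B = np.zeros((d_out, r))                       # trainable up-projection, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus scaled low-rank adapter path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=16, d_out=8)
x = np.ones((2, 16))
y = layer(x)
# With B zero-initialized, the adapter starts as a no-op over the base layer.
assert np.allclose(y, x @ layer.W.T)
```

Because only A and B are updated during SFT, the trainable fraction of a 7B-parameter model is tiny, which is what makes on-site fine-tuning affordable.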
Further, the studies [129–132] combine multiple advanced training techniques. Among them, Zhongjing [131] is a groundbreaking Chinese medical LLM that integrates PT, SFT, and RLHF to enhance the handling of multi-turn medical dialogues, particularly in Chinese medicine. Qilin-Med [132] is also a Chinese medical LLM, enhanced through a multi-stage training methodology that includes domain-specific PT, SFT, DPO, and Retrieval-Augmented Generation (RAG).
此外,研究[129–132]采用了多种先进的训练技术。其中,Zhongjing[131]是一款开创性的中医大语言模型,整合了PT(预训练)、SFT(监督微调)和RLHF(人类反馈强化学习)技术,显著提升了多轮中医对话的处理能力。Qilin-Med[132]同样是通过多阶段训练方法增强的中医大语言模型,其训练流程包含领域特异性PT、SFT、DPO(直接偏好优化)以及检索增强生成(RAG)技术。
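Among the techniques in Qilin-Med's pipeline, DPO is compact enough to sketch directly. The function below is a per-pair version of the standard DPO objective, a generic sketch rather than Qilin-Med's code; `logp_*` are policy log-probabilities of the preferred (w) and rejected (l) answers, and `ref_logp_*` the same quantities under a frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair Direct Preference Optimization loss:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # The loss shrinks as the policy prefers the chosen answer more
    # strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that widens the preference margin gets a lower loss:
assert dpo_loss(-1.0, -5.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0)
```

Unlike RLHF, this needs no separate reward model or sampling loop, which is one reason DPO is gaining ground in medical fine-tuning pipelines.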
• Different Training Data. Diverse and high-quality data is one of the core components of Healthcare LLMs. In the PLM era, plain text dominated the training corpus, with language models pre-trained on the next-word prediction task. When it comes to Healthcare LLMs, QA pairs and dialogues become more important data types, as shown in Lines 12 to 20 of Table 3. This is because LLMs already have strong linguistic skills, as well as some degree of extra knowledge about the specifics of each domain, which attenuates the need for specialized domain data in next-word prediction tasks. More importantly, by using QA pairs and dialogues to construct instruction data, SFT can inject domain knowledge while enhancing the model’s instruction compliance. Besides, multimodal data (Lines 27 to 30) and structured Electronic Health Record (EHR) databases (Lines 31 to 32) are also commonly used for SFT, constituting other important training data types. We can see a trend of synchronization between the different training methods and the training data. More details about training data can be found in Section 4.2.
• 不同的训练数据。多样化和高质量的数据是医疗大语言模型的核心部分之一。在预训练语言模型(PLM)时代,纯文本主导了以"下一个词预测"任务为主的训练语料。而对于医疗大语言模型,问答对(QA pairs)和对话数据(如表3第12-20行所示)成为更重要的数据类型。这是因为大语言模型已具备强大的语言能力,并对各领域专业知识有一定程度的掌握,从而降低了对专业领域数据进行"下一个词预测"任务的需求。更具竞争力的是,通过使用问答对和对话构建指令数据,监督微调(SFT)可以在增强模型指令遵循能力的同时注入领域知识。此外,一些多模态数据(第27-30行)和结构化电子健康记录(EHR)数据库(第31-32行)也常被用于SFT,这些同样是重要的训练数据。我们可以看到不同训练方法与训练数据之间存在同步发展趋势。更多训练数据细节详见第4.2节。
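Constructing instruction data from QA pairs, as described above, is largely a formatting step. The chat-style schema below is one common convention, hypothetical here and not the exact format of any model in Table 3:

```python
def qa_to_instruction(question, answer, system="You are a helpful medical assistant."):
    """Wrap a raw QA pair into a chat-style SFT training record."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

record = qa_to_instruction(
    "What does hypertension mean?",
    "Persistently elevated arterial blood pressure.",
)
assert [m["role"] for m in record["messages"]] == ["system", "user", "assistant"]
```

The same wrapper applied to doctor-patient dialogues (one turn per message) yields multi-turn instruction data of the kind used by the dialogue-oriented models in Table 3.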
• Different Evaluation. Firstly, we investigate work that focuses on evaluating general LLMs on Healthcare tasks, and categorize it into four groups: medical examination, medical question answering, medical generation, and medical comprehensive evaluation, as summarized in Table 6. Medical examination involves verifying model performance through standard medical tests or examinations. In contrast, medical question answering involves assessments based on questions posed or collected by human experts. Medical generation focuses on generating new medical descriptions or knowledge based on a given input. Studies on medical comprehensive evaluation aim to provide assessments across various application scenarios rather than focusing on a single aspect. From the conclusions of these studies, we generally find that performance on specific tasks is satisfactory, while more concerns are raised about non-technological aspects, such as robustness, bias, and ethics. We discuss these aspects further in Section 5.
• 差异化评估。首先,我们调研了聚焦于医疗任务的大语言模型 (LLM) 评估工作,并将其归纳为四类:医疗考试评估、医疗问答评估、医疗生成评估及医疗综合评估(详见表6)。医疗考试评估通过标准化医学测试验证模型性能;医疗问答评估采用专家提出或收集的问题进行评估;医疗生成评估侧重基于给定输入生成新的医疗描述或知识;医疗综合评估研究旨在提供跨场景的综合评估而非单一维度。这些研究普遍显示:特定任务性能表现良好,但更多担忧集中在非技术层面(如鲁棒性、偏见和伦理问题),我们将在第5节进一步探讨。
Secondly, we summarize the evaluation parts of studies that propose Healthcare LLMs. For example, in Healthcare-related assessments, Galactica notably surpassed previous benchmarks with 77.6% on PubMedQA and achieved 52.9% on MedMCQA. JMLR achieves 72.8% accuracy on the MMLU-Medical dataset and 65.5% on the MedMCQA dataset, surpassing Meditron-70B and Llama2-13B with RAG, which scored 68.9% and 54.9%, respectively.
其次,我们汇总了提出医疗大语言模型 (Healthcare LLM) 的研究中的评估部分。例如,在医疗相关评估中,Galactica 以 PubMedQA 77.6% 和 MedMCQA 52.9% 的成绩显著超越先前基准。JMLR 在 MMLU-Medical 数据集上达到 72.8% 准确率,在 MedMCQA 数据集上达到 65.5%,超越了采用 RAG 的 Meditron-70B (68.9%) 和 Llama2-13B (54.9%)。
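The benchmark numbers above (and in Table 4) are exact-match accuracies over multiple-choice answers. A minimal scorer for USMLE/MedMCQA-style evaluation might look like this (illustrative only):

```python
def mcq_accuracy(predictions, gold):
    """Percentage of multiple-choice predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# 3 of 4 answers correct -> 75.0%
assert mcq_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]) == 75.0
```

In practice the harder part is parsing a free-text generation down to a single option letter before scoring, which different papers handle with different extraction heuristics.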
Zhongjing [131] was evaluated using the CMtMedQA-test for multi-turn dialogues and huatuo-26M for single-turn dialogues, focusing on three main dimensions: safety, professionalism, and fluency. Results show that Zhongjing excels in complex dialogue interactions, surpassing existing models like HuatuoGPT in these aspects by leveraging its diverse training approach. Qilin-Med achieved accuracies of 38.4% and 40.0% after the PT and SFT phases, respectively, on the CMExam test set. Integrating the RAG approach further enhanced its accuracy to 42.8% on CMExam. These advancements highlight Qilin-Med’s capability in generating precise and contextually accurate responses, setting new benchmarks for medical LLMs, particularly in Chinese medical applications.
Zhongjing [131] 在 CMtMedQA-test 上评估了多轮对话能力,在 huatuo-26M 上评估了单轮对话能力,重点关注安全性、专业性和流畅性三个维度。结果表明,得益于多样化的训练方法,Zhongjing 在复杂对话交互中表现优异,在这些方面超越了 HuatuoGPT 等现有模型。Qilin-Med 在 CMExam 测试集上的 PT 和 SFT 阶段分别达到了 38.4% 和 40.0% 的准确率。结合 RAG 方法后,其在 CMExam 上的准确率进一步提升至 42.8%。这些进展凸显了 Qilin-Med 在生成精准且符合语境的响应方面的能力,为医疗大语言模型(尤其是中文医疗应用)树立了新标杆。
In summary, by integrating the various training methods detailed in Table 3, we identify several overarching trends regarding the impact of different techniques on performance: (1) PT alone does not ensure high performance in LLMs; (2) SFT proves to be more crucial, with RLHF and DPO becoming increasingly important; (3) techniques that reduce model size tend to result in some loss of performance.
总结来说,通过整合表3中详述的各种训练方法,我们发现了关于不同技术对性能影响的几个总体趋势:(1) 仅靠预训练(PT)并不能确保大语言模型的高性能;(2) 监督微调(SFT)被证明更为关键,而基于人类反馈的强化学习(RLHF)和直接偏好优化(DPO)正变得越来越重要;(3) 减小模型规模的技术往往会导致性能的某些损失。
• Different Features. Further, we discuss LLMs in terms of model size, language, and modality. Model size is a crucial measure because it directly impacts the model’s representation capabilities and generalization capacity, as well as the computational resources and training time required. We divide LLMs into three groups: extremely large (>70B), very large (13B–70B), and large (1B–12B). In this paper, 7/36 Healthcare LLMs are extremely large, 7/36 are very large, and 19/36 are large. Med-PaLM [121] and HealAI [123] are the two largest Healthcare LLMs, each with 540B parameters. Med-PaLM utilizes instruction prompt tuning for adapting LLMs to new domains with a few exemplars. This approach employs a shared soft prompt across multiple datasets, followed by a task-specific human-engineered prompt. Built at this extremely large size, Med-PaLM is evaluated on a 12-aspect benchmark and achieves satisfactory results; for example, Med-PaLM’s answers and clinicians’ answers aligned with scientific consensus 92.6% and 92.9% of the time, respectively. Further, HealAI is based on Med-PaLM; however, few details about its development are available. Med-PaLM 2 [1] is the second-largest Healthcare LLM, with 340B parameters. Despite its smaller size compared to the original PaLM’s 540B parameters, Med-PaLM 2 outperforms its predecessor [1]. Long-form answers from Med-PaLM 2 are evaluated against various quality criteria and are often preferred over those from physicians and the original Med-PaLM model. Med-PaLM 2 also introduces ensemble refinement in its prompting strategy, enhancing answer accuracy by generating multiple reasoning paths to refine the final response. Besides Med-PaLM 2, Galactica (120B) and Me LLaMA (up to 70B) [111] are also among the largest Healthcare models. It should be noted that some smaller LLMs already outperform larger ones in general domains. This trend has not yet extended to Healthcare, but we anticipate that in the near future, smaller Healthcare LLMs will surpass the performance of older, larger models.
• 不同特征。此外,我们从模型规模、语言和多模态特征角度讨论大语言模型。模型规模是关键指标,直接影响模型的表征能力、泛化能力以及所需的计算资源和训练时间。我们将大语言模型分为三组:超大型(>70B)、特大型(13B–70B)和大型(1B–12B)。本文中,7/36的医疗大语言模型属于超大型,7/36为特大型,19/36为大型。Med-PaLM [121]和HealAI [123]是参数规模最大的两个医疗大语言模型,均达到540B参数。Med-PaLM采用指令提示调优技术,通过少量示例使大语言模型适应新领域。该方法在多个数据集上使用共享软提示,再结合特定任务的人工设计提示。基于其超大规模,Med-PaLM在12项基准测试中取得满意结果,例如Med-PaLM与临床医生的回答符合科学共识的比例分别达到92.6%和92.9%。HealAI基于Med-PaLM构建,但未披露更多开发细节。Med-PaLM 2 [1]是第二大医疗大语言模型,参数为340B。尽管相比原始PaLM的540B参数规模更小,但其性能优于前代[1]。Med-PaLM 2的长篇回答在多项质量评估中常优于医师和原始Med-PaLM的输出。该模型还在提示策略中引入集成优化方法,通过生成多重推理路径来提升最终回答的准确性。除Med-PaLM 2外,Galactica(120B)与Me LLaMA(最高70B)[111]也属于规模最大的医疗模型之列。值得注意的是,在通用领域已有较小模型超越较大模型的现象,这一趋势尚未延伸至医疗领域,但我们预计未来较小的医疗大语言模型将超越早期大型模型的性能。
Table 6 The Healthcare evaluation of LLMs.
| Categories | Studies | Models | Scenarios | #Num | Conclusions |
|---|---|---|---|---|---|
| Medical Ex. | [134] | ChatGPT | Primary care | 674 | Average performance of ChatGPT is below the mean passing mark in the last 2 years. |
| | [135] | ChatGPT | Medical licensure | 220 | ChatGPT performs at the level of a third-year medical student. |
| | [136] | ChatGPT | Medical licensure | 376 | ChatGPT performs at or near the passing threshold. |
| Medical Q&A | [137] | ChatGPT | Physician queries | 284 | ChatGPT generates largely accurate information to diverse medical queries. |
| | [138] | ChatGPT, GPT-4, Bard, BLOOMZ | Radiation oncology | 100 | Each LLM generally outperforms the non-expert humans, while only GPT-4 outperforms the medical physicists. |
| | [41] | ChatGPT, Claude | Patient-specific EHR | — | Both models are able to provide accurate, relevant, and comprehensive answers. |
| | [139] | ChatGPT | Bariatric surgery | 151 | ChatGPT usually provides accurate and reproducible responses to common questions related to bariatric surgery. |
| | [140] | ChatGPT | Genetics questions | 85 | ChatGPT does not perform significantly differently than human respondents. |
| | [141] | ChatGPT | Fertility counseling | 17 | ChatGPT could produce relevant, meaningful responses to fertility-related clinical queries. |
| | [142] | GPT-3.5, GPT-4 | General surgery | 280 | GPT-3.5 and, in particular, GPT-4 exhibit a remarkable ability to understand complex surgical clinical information. |
| | [143] | GPT-3.5, GPT-4 | Dementia diagnosis | 981 | GPT-3.5 and GPT-4 cannot outperform traditional AI tools in dementia diagnosis and prediction tasks. |
| Medical Gen. | [144] | ChatGPT | Gastroenterology | 20 | ChatGPT would generate relevant and clear research questions, but not original ones. |
| | [145] | ChatGPT, GPT-4 | Radiology report | 138 | ChatGPT performs well and GPT-4 can significantly improve the quality. |
| | [146] | ChatGPT | Benchmark tasks | 34.4K | Zero-shot ChatGPT outperforms the state-of-the-art fine-tuned models in datasets that have smaller training sets. |
| Medical Ce. | [147] | ChatGPT | Clinical and research | — | ChatGPT could potentially exhibit biases or be susceptible to misuse. |
The Healthcare evaluation of LLMs includes medical examination (Ex.), medical question answering (Q&A), medical generation (Gen.), and medical comprehensive evaluation (Ce.).
表 6: 大语言模型在医疗领域的评估
| 类别 | 研究 | 模型 | 场景 | 数量 | 结论 |
|---|---|---|---|---|---|
| 医学考试 | [134] | ChatGPT | 初级护理 | 674 | ChatGPT的平均表现低于近两年的平均及格线 |
| 医学考试 | [135] | ChatGPT | 医疗执照考试 | 220 | ChatGPT达到医学院三年级学生水平 |
| [136] | ChatGPT | 医疗执照考试 | 376 | ChatGPT表现接近及格线 | |
| [137] | ChatGPT | 医师咨询 | 284 | ChatGPT能为多样化医疗问题生成基本准确的信息 | |
| [138] | ChatGPT,GPT-4,Bard,BLOOMZ | 放射肿瘤学 | 100 | 各模型普遍优于非专业人士,仅GPT-4超越医学物理师 | |
| [41] | ChatGPT, Claude | 患者特定电子健康档案 | - | 两个模型都能提供准确、相关且全面的答案 | |
| [139] | ChatGPT | 减肥手术 | 151 | 通常能对减肥手术相关问题给出准确且可复现的回答 | |
| [140] | ChatGPT | 遗传学问题 | 85 | 表现与人类受访者无显著差异 | |
| [141] | ChatGPT | 生育咨询 | 17 | 能对生育相关临床问题给出有意义的相关回答 | |
| [142] | GPT-3.5, GPT-4 | 普通外科 | 280 | 展现出理解复杂外科临床信息的卓越能力 | |
| [143] | GPT-3.5,GPT-4 | 痴呆症诊断 | 981 | 在痴呆诊断预测任务中无法超越传统AI工具 | |
| 医学生成 | [144] | ChatGPT | 胃肠病学 | 20 | 能生成相关清晰的研究问题,但缺乏原创性 |
| [145] | ChatGPT,GPT-4 | 放射学报告 | 138 | ChatGPT表现良好,GPT-4能显著提升质量 | |
| [146] | ChatGPT | 基准测试 | 34.4K | 零样本ChatGPT在小型训练集数据上超越微调模型 | |
| 医学综合评估 | [147] | ChatGPT | 临床研究 | - | 可能表现出偏见或易被滥用 |
大语言模型医疗评估涵盖医学考试(Ex.)、医疗问答(Q&A)、医疗文本生成(Gen.)和医疗综合评估(Ce.)四大领域。
In the realm of language, English LLMs are predominantly mainstream. Following English, the second largest group of LLMs is designed for Chinese. BianQue, HuatuoGPT, BenTsao, SoulChat, DoctorGLM, MedChatZH, Zhongjing, and Qilin-Med are Chinese Healthcare LLMs. Among them, DoctorGLM is a pioneering Chinese LLM focusing on cost-effective medical applications. DoctorGLM’s training utilized the ChatDoctor dataset, translating its medical dialogues using the ChatGPT API. Besides the above LLMs, there are also multilingual models, such as Apollo and Medical mT5.
在语言领域,英语大语言模型 (LLM) 占据主流地位。紧随其后的是中文大语言模型,包括 BianQue、华佗GPT (HuatuoGPT)、本草 (BenTsao)、灵心对话 (SoulChat)、DoctorGLM、MedChatZH、仲景 (Zhongjing) 和麒麟-医疗 (Qilin-Med) 等中文医疗大语言模型。其中,DoctorGLM 是首个专注于高性价比医疗应用的中文大语言模型,其训练使用了 ChatDoctor 数据集,并通过 ChatGPT API 翻译医疗对话内容。除上述模型外,还存在多语言模型,例如 Apollo 和 Medical mT5。
Besides the above features, multimodal ability is another important development branch, as medical data inherently consists of diverse modalities such as patient medical records, radiographic images, and physiological signals. By integrating varied data types, multimodal models can enhance the understanding of complex medical conditions from multiple dimensions, enabling more accurate interpretations and diagnoses. For example, Visual Med-Alpaca [114] is a LLaMA-7B-based open-source biomedical model that handles multimodal tasks by integrating medical ‘‘visual experts’’. It was trained using a collaboratively curated instruction set from GPT-3.5-Turbo and human experts, incorporating visual modules and instruction-tuning for tasks like radiological image interpretation and complex clinical inquiries. OphGLM [12] is a multimodal model tailored for ophthalmic applications, integrating visual capabilities alongside language processing. It was developed starting from fundus images, creating a pipeline for disease assessment, diagnosis, and lesion segmentation.
除了上述特性外,多模态能力是另一个重要的发展方向,因为医疗数据本质上包含多种模态,如患者病历、放射影像和生理信号等。通过整合不同类型的数据,多模态模型可以从多个维度增强对复杂医疗状况的理解,从而实现更准确的解读和诊断。例如,Visual Med-Alpaca [114] 是一个基于LLaMa-7B的开源生物医学模型,通过整合医学"视觉专家"来处理多模态任务。该模型使用GPT-3.5-Turbo和人类专家协作整理的指令集进行训练,包含视觉模块和指令微调,用于放射影像解读和复杂临床查询等任务。OphGLM [12] 是一款专为眼科应用定制的多模态模型,将视觉能力与语言处理相结合。该模型从眼底图像开发起步,构建了疾病评估、诊断和病灶分割的流程。
3.3. Summary
3.3. 总结
In this section, we present an overview of existing PLMs and LLMs in the Healthcare domain, highlighting their respective research focuses. Furthermore, we provide a comprehensive analysis of the performance of Healthcare LLMs on benchmark datasets such as USMLE, MedMCQA, and PubMedQA, as shown in Table 4. The intention behind this analysis is to showcase the progress in Healthcare QA development and offer a clear comparison between different Healthcare LLMs. In conclusion, two of the most robust LLMs identified in this analysis are Med-PaLM 2 and GPT-4. It is important to note that while GPT-4 is a general-purpose LLM, Med-PaLM 2 is specifically designed for Healthcare applications. Additionally, it is worth highlighting that the performance gap between LLMs and humans has significantly narrowed.
在本节中,我们概述了医疗保健领域现有的预训练语言模型(PLM)和大语言模型(LLM),重点介绍了它们各自的研究方向。此外,我们全面分析了医疗大语言模型在USMLE、MedMCQA和PubMedQA等基准数据集上的性能表现,如表4所示。该分析旨在展示医疗问答系统的发展进展,并对不同医疗大语言模型进行清晰比较。分析结果表明,Med-PaLM 2和GPT-4是当前最强大的两个大语言模型。需要特别说明的是,GPT-4是通用型大语言模型,而Med-PaLM 2是专为医疗应用设计的。值得注意的是,大语言模型与人类专家之间的性能差距已显著缩小。
As mentioned earlier, one notable difference between PLMs and LLMs is that PLMs are typically discriminative AI models, while LLMs are generative AI models. Although auto-regressive PLMs like GPT-1 and GPT-2 were also evaluated on classification tasks, auto-encoder PLMs were more prominent during the PLM period. As for LLMs, with their powerful capabilities, they have successfully unified various Healthcare tasks as QA or dialogue tasks in a generative way.
如前所述,预训练语言模型(PLM)与大语言模型(LLM)的一个显著区别在于:PLM通常是判别式AI模型,而LLM属于生成式AI (Generative AI)模型。虽然也存在像GPT-1和GPT-2这类通过分类任务评估的自回归PLM,但在PLM发展时期,自编码器PLM占据了更主导地位。至于大语言模型,凭借其强大能力,已成功以生成式方法将各类医疗健康任务统一为问答或对话任务。
From a technological perspective, most PLM studies focus on improving neural architectures and designing more efficient pre-training tasks. On the other hand, LLM studies primarily emphasize data collection, recognizing the importance of data quality and diversity due to the over-parameterization strategy employed in LLM development. This aspect becomes even more crucial when LLMs undergo SFT to align with human preferences. A study [1] reveals that the selection of mixing ratios of different training data significantly impacts the performance of LLMs. However, these mixing ratios for PT and SFT, often referred to as a ‘‘special recipe’’ by different strong LLM developers, are rarely publicized. Therefore, apart from SFT, we anticipate the emergence of more exciting and innovative methods for training LLMs, particularly those designed to handle the unique features of Healthcare data.
从技术角度看,多数PLM研究聚焦于改进神经架构和设计更高效的预训练任务。而大语言模型(LLM)研究则更强调数据收集,由于LLM开发采用的过参数化策略,数据质量和多样性的重要性尤为凸显。当LLM通过监督微调(SFT)与人类需求对齐时,这一方面变得更为关键。研究[1]表明,不同训练数据的混合比例选择会显著影响LLM性能。然而这些预训练(PT)与SFT的混合比例——通常被不同头部LLM开发者称为"秘方"——很少公开。因此除SFT外,我们期待出现更多创新的大语言模型训练方法,特别是针对医疗健康数据独特特性设计的方法。
Among the investigated Healthcare LLMs, most are derived from general LLMs. For these models, the SFT approach is the most commonly employed training technique. RLHF is less frequently utilized, with only MedAlpaca and HuatuoGPT adopting this method. The limited application of RLHF can be attributed to its high costs and stability challenges. RLHF relies on a reward model to guide training based on human feedback, but in the medical domain, obtaining expert input is significantly more expensive than in general fields. Additionally, inconsistent or noisy feedback can introduce reward variance, destabilizing the learning process. This issue is particularly pronounced in specialized areas like medicine, where expert opinions may diverge. Moreover, during RLHF, models risk catastrophic forgetting—losing previously learned information when new feedback contradicts prior knowledge. In medical applications, this can lead to the loss of critical information, compromising the model’s reliability. Looking ahead, the development of more resource-efficient and stable RLHF algorithms is expected to enhance the performance and applicability of Healthcare LLMs.
在调研的医疗大语言模型中,多数模型源自通用大语言模型。对于这些模型,监督微调(SFT)是最常用的训练技术。基于人类反馈的强化学习(RLHF)使用频率较低,仅MedAlpaca和HuatuoGPT采用了该方法。RLHF应用受限主要源于其高昂成本和稳定性挑战:该方法依赖奖励模型根据人类反馈指导训练,但在医疗领域获取专家反馈的成本远高于通用领域;不一致或有噪声的反馈可能引发奖励方差,导致学习过程失稳——这在专家意见易分歧的医学等专业领域尤为突出。此外,RLHF过程中模型可能遭遇灾难性遗忘,即新反馈与既有知识冲突时丢失已学信息,这在医疗应用中可能导致关键信息缺失,损害模型可靠性。展望未来,开发资源效率更高、更稳定的RLHF算法有望提升医疗大语言模型的性能与适用性。
Further, we have identified two emerging trends. Firstly, there is a growing exploration of multimodal approaches, including LLaVA-Med, MedAGI, OphGLM, Visual Med-Alpaca, and Med-Flamingo. Secondly, Chinese Healthcare LLMs are rapidly developing, with examples such as DoctorGLM, ClinicalGPT, SoulChat, BenTsao, BianQue, and HuatuoGPT. Finally, it is worth noting that many Healthcare LLM papers provide details about the prompts they used. This observation demonstrates prompt brittleness, as different prompts can have a significant impact on the model’s performance. Modifications in prompt syntax, sometimes in ways that are not intuitive to humans, can lead to significant changes in the model’s output. This instability matters more for Healthcare than for other general applications.
此外,我们发现了两个新兴趋势。首先,多模态方法的研究日益增多,包括LLaVAMed、MedAGI、OphGLM、Visual Med-Alpaca和Med-Flamingo。其次,中文医疗大语言模型发展迅速,例如DoctorGLM、Clinical GPT、SoulChat、BenTsao、BianQue和HuatuoGPT。最后值得注意的是,许多医疗大语言模型的论文都详细提供了所使用的提示词(prompt)信息。这一现象揭示了提示词的脆弱性——不同提示词会对模型性能产生显著影响。修改提示词语法(有时以人类难以直观理解的方式)可能导致模型输出的重大变化。相比其他通用应用场景,这种不稳定性在医疗领域更为关键。
4. Usage and data for Healthcare LLMs
4. 医疗健康领域大语言模型 (LLM) 的使用与数据
4.1. Usage
4.1. 使用方法
• From Fine-tuning to In-context Learning. In-context learning (ICL) offers promising benefits in healthcare by allowing LLMs to generate responses that mirror examples given by users. This method combines example demonstrations with test inputs to enhance the model’s ability to utilize specific knowledge from these examples without needing to update parameters for specific healthcare data. ICL can be particularly effective in healthcare as it helps tailor these models to meet the precise requirements and expectations of medical professionals. Moreover, using examples can simplify interactions, as direct examples are often clearer and easier to understand than complex medical queries, which might not always capture the true intent of the user.
• 从微调(Fine-tuning)到上下文学习(In-context Learning)。上下文学习(ICL)通过让大语言模型生成与用户提供示例相匹配的响应,为医疗健康领域带来显著优势。该方法将示例演示与测试输入相结合,无需针对特定医疗数据更新参数,即可增强模型利用这些示例中特定知识的能力。ICL在医疗领域尤为有效,因为它能帮助定制模型以满足医疗专业人员的精确需求和期望。此外,使用示例可以简化交互过程,因为直接示例通常比复杂的医疗查询更清晰易懂,后者可能无法准确反映用户的真实意图。
Nevertheless, the success of ICL in healthcare depends on various detailed factors like the similarity of inputs, the relevance of the labels, the format of the demonstrations, and how well the inputs and labels are paired. For example, it is vital that both the examples shown in training and the actual inputs used are from comparable medical situations. Also, the training labels must accurately reflect the labels used in real healthcare settings. The way the examples are presented must be carefully structured to ensure the model learns effectively from them. The study [149] investigates these aspects. While the precision of input-label mapping is less critical when label spaces are correctly aligned, inconsistencies in any of these areas can diminish the utility of ICL in real-world healthcare applications, as shown in Fig. 5. Therefore, meticulous attention to these parameters is essential to harness the full potential of ICL in enhancing diagnostic accuracy and efficiency in healthcare settings. However, Healthcare professionals are often not aware of these technology issues, resulting in LLMs not performing at their full potential.
然而,ICL在医疗领域的成功取决于多种细节因素,如输入的相似性、标签的相关性、演示示例的格式以及输入与标签的匹配程度。例如,训练中展示的示例与实际使用的输入必须来自相似的医疗场景。此外,训练标签必须准确反映真实医疗环境中的标注标准。示例的呈现方式需精心设计,以确保模型能有效学习。研究[149]探讨了这些方面。虽然当标签空间正确对齐时,输入-标签映射的精确性要求会降低,但如图5所示,任何环节的不一致都可能削弱ICL在实际医疗应用中的效用。因此,必须细致关注这些参数,才能充分发挥ICL在提升医疗诊断准确性和效率方面的潜力。然而,医疗从业者通常不了解这些技术细节,导致大语言模型未能发挥其全部效能。
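The factors above (comparable situations, consistent labels, and a carefully structured demonstration format) all surface in how the few-shot prompt is assembled. A minimal sketch, with a hypothetical triage-label format:

```python
def build_icl_prompt(demonstrations, query):
    """Assemble a few-shot prompt: each demonstration uses the same
    'Report / Triage level' format, and the query reuses that format
    with the label left blank for the model to fill in."""
    parts = [f"Report: {text}\nTriage level: {label}" for text, label in demonstrations]
    parts.append(f"Report: {query}\nTriage level:")
    return "\n\n".join(parts)

demos = [
    ("Mild seasonal allergies, no fever.", "routine"),
    ("Crushing chest pain radiating to the left arm.", "emergency"),
]
prompt = build_icl_prompt(demos, "Persistent cough for two weeks.")
assert prompt.endswith("Triage level:")
```

Breaking the shared format between demonstrations and query, or drawing demonstrations from a dissimilar label space, is exactly the kind of inconsistency that [149] shows degrades ICL performance.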
• From System 1 To System 2 – Chain-of-Thought. According to the report [150], two distinct categories of Deep Learning systems exist, namely System 1 and System 2. System 1 encompasses the current applications of deep learning, including image recognition, machine translation, speech recognition, and autonomous driving. On the other hand, System 2 represents the future potential of deep learning, involving tasks such as reasoning, planning, and other logic-based and reasoning-oriented activities.
• 从系统1到系统2——思维链 (Chain-of-Thought) 。根据报告[150],深度学习系统存在两个截然不同的类别,即系统1和系统2。系统1涵盖了深度学习的当前应用,包括图像识别、机器翻译、语音识别和自动驾驶。另一方面,系统2代表了深度学习的未来潜力,涉及推理、规划以及其他基于逻辑和以推理为导向的任务。
System-1 tasks in the field of NLP have been largely resolved, demonstrating significant progress. However, progress in System-2 tasks has been limited until recently when the emergence of advanced LLMs triggered a significant shift. The study [6] proposed the CoT prompting, which found it can significantly improve the reasoning and planning performance of LLM by adding a series of intermediate steps. Furthermore, the study [151] found that by just adding a sentence ‘‘Let’s think step by step’’, the reasoning ability of LLMs can be significantly boosted. Later, there are many CoT studies [11,13,118] aiming to enhance the logical reasoning ability of LLM in various Healthcare applications.
自然语言处理(NLP)领域的系统1任务已基本得到解决,显示出显著进展。然而系统2任务的进展一直有限,直到近期先进大语言模型的出现才引发重大转变。研究[6]提出了思维链(CoT)提示方法,发现通过添加一系列中间步骤能显著提升大语言模型的推理与规划能力。进一步地,研究[151]发现仅需添加"让我们逐步思考"这句话,就能大幅提升大语言模型的推理能力。随后出现了许多CoT相关研究[11,13,118],旨在增强大语言模型在各类医疗应用中的逻辑推理能力。
The integration of CoT reasoning in Healthcare LLMs offers notable benefits for improving interpretability, particularly in complex decision-making processes such as clinical decision support systems. CoT enables models to break down decisions into explicit, step-by-step reasoning, making outputs more transparent and interpretable for healthcare professionals. However, these benefits come with trade-offs. The use of CoT can increase computational complexity and latency due to the need to generate detailed reasoning paths. In time-sensitive scenarios, such as healthcare emergencies, this added delay may limit the practicality of deploying CoT-enabled LLMs. To address this challenge, it is crucial to strike a balance by optimizing CoT reasoning to enhance transparency without sacrificing system responsiveness. Research on inference acceleration presents a promising approach to mitigating this issue, enabling faster processing while maintaining the interpretability advantages of CoT.
将思维链推理(CoT)整合到医疗大语言模型中,为提升可解释性带来显著优势,尤其在临床决策支持系统等复杂决策场景中。CoT使模型能够将决策分解为显式的分步推理,让输出对医疗专业人员更具透明度和可解释性。但这些优势伴随着权衡:由于需要生成详细推理路径,CoT会增加计算复杂性和延迟。在医疗急救等时效敏感场景中,这种额外延迟可能限制支持CoT的大语言模型的部署可行性。为解决这一挑战,关键在于通过优化CoT推理来平衡透明度提升与系统响应速度。关于推理加速的研究为缓解该问题提供了可行路径,既能保持CoT的可解释性优势,又能实现更快的处理速度。
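The two prompting ideas discussed here, the zero-shot CoT trigger of [151] and voting over multiple sampled reasoning paths (the self-consistency feature listed for Med-PaLM in Table 5), reduce to very little code. A sketch with a hypothetical question:

```python
from collections import Counter

def zero_shot_cot(question):
    """Append the trigger phrase from [151] to elicit step-by-step reasoning."""
    return f"Q: {question}\nA: Let's think step by step."

def self_consistency(final_answers):
    """Majority vote over final answers parsed from several sampled CoT paths."""
    return Counter(final_answers).most_common(1)[0][0]

prompt = zero_shot_cot("Which drug class is first-line for lowering LDL cholesterol?")
assert prompt.endswith("Let's think step by step.")
# Three sampled reasoning paths, two agreeing on the final answer:
assert self_consistency(["statins", "statins", "fibrates"]) == "statins"
```

Note that self-consistency multiplies inference cost by the number of sampled paths, which is precisely the latency trade-off discussed above.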
• AI Agents. The core idea behind recent AI agents is to build autonomous agent systems that utilize LLMs as their central controllers. These systems consist of several components, including Planning, Memory, Tool Use, and Action [152]. The planning component plays a crucial role in breaking down complex tasks into smaller and manageable sub-goals. This enables the agent to handle large tasks more efficiently by tackling them step by step. The Memory component provides the agent with the ability to store and retrieve information over extended periods. It typically utilizes an external vector store and fast retrieval mechanisms, allowing the agent to retain relevant knowledge and recall it as needed. With the Planning and Memory components in place, AI agents can take actions and interact with external tools. AutoGPT [153] is an example of such an autonomous agent system, which leverages GPT-4 to autonomously develop and manage operations. When provided with a topic, AutoGPT can think independently and generate steps to implement the given topic, along with implementation details. This shows the agent’s autonomous ability to plan, utilize its memory, and take appropriate actions.
• AI智能体 (AI Agents)。近期AI智能体的核心理念是构建以大语言模型 (LLM) 为核心控制器的自主智能体系统。这些系统由多个组件构成,包括规划 (Planning)、记忆 (Memory)、工具使用 (Tool Use) 和行动 (Action) [152]。规划组件在将复杂任务分解为更小且可管理的子目标方面起着关键作用,使智能体能够通过逐步处理来更高效地应对大型任务。记忆组件为智能体提供了长期存储和检索信息的能力,通常利用外部向量存储和快速检索机制,使智能体能够保留相关知识并在需要时调用。借助规划和记忆组件,AI智能体可以采取行动并与外部工具交互。AutoGPT [153] 就是此类自主智能体系统的示例,它利用GPT-4自主开发和管理操作。当给定一个主题时,AutoGPT能够独立思考并生成实现该主题的步骤及具体细节,这展示了智能体在规划、利用记忆和采取适当行动方面的自主能力。
To our best knowledge, AI agents have not been widely adopted in the Healthcare field. However, we anticipate the development of more capable AI agent systems in this domain. For instance, it is possible to train specialized models for different medical processes, such as hospital guidance, auxiliary diagnosis, drug recommendation, and prognostic follow-up. These relatively small models can be integrated into a comprehensive AI medical system, where an LLM serves as the central controller. Additionally, specialized disease systems can be established for each department within the Healthcare system. The LLM can play a crucial role in determining which specialized disease systems should be involved in a particular case, resulting in effectively allocating resources and providing specialized care. Overall, the vision is to leverage AI agents and LLMs to create comprehensive and specialized AI systems in Healthcare, covering various medical processes and enabling efficient decision-making and patient care.
据我们所知,AI智能体在医疗健康领域尚未得到广泛应用。但我们预见该领域将发展出更强大的AI智能体系统。例如,可以针对不同医疗流程(如医院导诊、辅助诊断、药物推荐和预后随访)训练专用模型。这些相对轻量的模型可集成到由大语言模型作为中央控制器的综合医疗AI系统中。此外,医疗体系内每个科室都可建立专科疾病系统。大语言模型能发挥关键作用,判断具体病例需要调用哪些专科系统,从而实现资源高效配置和专科化诊疗。总体愿景是通过AI智能体与大语言模型,构建覆盖全流程、兼具综合性与专科化的医疗AI系统,提升决策效率与患者照护水平。
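The controller-plus-specialists design envisioned above can be illustrated in a few lines. The department names, keyword cues, and the `controller_route`/`run_case` helpers below are hypothetical stand-ins: a real system would replace the keyword stub with an actual LLM call that decides which specialized disease systems to involve.

```python
# Sketch of an LLM-as-central-controller routing layer for a hospital AI
# system. A keyword stub stands in for the LLM so the control flow is
# runnable; all names are illustrative assumptions.

SPECIALIST_SYSTEMS = {
    "cardiology": ["chest pain", "palpitations", "hypertension"],
    "endocrinology": ["thirst", "glucose", "thyroid"],
    "pulmonology": ["cough", "dyspnea", "wheezing"],
}

def controller_route(case_text: str) -> list[str]:
    """Stand-in for the central LLM: decide which specialized disease
    systems should be involved in this case."""
    text = case_text.lower()
    involved = [dept for dept, cues in SPECIALIST_SYSTEMS.items()
                if any(cue in text for cue in cues)]
    return involved or ["general_medicine"]  # fallback department

def run_case(case_text: str) -> dict:
    """Planning step: break the case into one sub-goal per department,
    then collect each specialist model's (stubbed) assessment."""
    plan = controller_route(case_text)
    return {dept: f"<assessment from {dept} model>" for dept in plan}

if __name__ == "__main__":
    result = run_case("65-year-old with chest pain and persistent cough")
    print(sorted(result))  # departments selected by the controller
```

In a production system the memory component would persist each department's findings (e.g., in a vector store) so later steps can retrieve them, and the tool-use component would invoke external systems such as order entry or scheduling.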

Fig. 5. What makes in-context learning work? $\star$ The data in the figures come from the study [149]; we rearranged the layout for discussion purposes $\star$. Only the classification tasks (x-axis) are listed here, and sub-figure (d) shows part of the original results for clarity. (c) Impact of the demonstration format: the results show that the demonstration format has significant effects on the final performance. (d) Impact of the input-label mapping: the results show that the input-label mapping has only slight effects on the final performance.

图 5: 上下文学习为何有效? $\star$ 图中数据来源于研究[149]。我们对讨论内容进行了适当整理和排版 $\star$。此处仅列出分类任务(x轴),子图(d)为清晰展示截取了部分原始结果。(c)示范格式的影响。结果表明示范格式对最终性能有显著影响。(d)输入-标签映射的影响。结果表明输入-标签映射仅对最终性能有轻微影响。
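The two ablations in sub-figures (c) and (d) amount to changing how the k-shot prompt is assembled. Below is a minimal sketch of that assembly; the sentiment template and the `with_labels`/`shuffle_labels` switches are illustrative stand-ins for the demonstration-format and input-label-mapping ablations studied in [149], not the authors' actual code.

```python
import random

def build_icl_prompt(demos, query, with_labels=True, shuffle_labels=False,
                     template="Review: {x}\nSentiment: {y}"):
    """Assemble a k-shot in-context learning prompt.

    with_labels=False drops the label portion (format ablation);
    shuffle_labels=True replaces gold labels with random ones drawn
    from the label space (input-label mapping ablation)."""
    label_space = sorted({y for _, y in demos})
    lines = []
    for x, y in demos:
        if shuffle_labels:
            y = random.choice(label_space)
        if with_labels:
            lines.append(template.format(x=x, y=y))
        else:
            lines.append(x)  # demonstration with the label format removed
    # the query reuses the template with an empty label slot
    lines.append(template.format(x=query, y="").rstrip())
    return "\n\n".join(lines)

demos = [("great visit, very helpful staff", "positive"),
         ("long wait and rude reception", "negative")]
prompt = build_icl_prompt(demos, "the doctor explained everything clearly")
```

The finding in Fig. 5 is then that degrading the format (`with_labels=False`) hurts performance substantially, while corrupting the mapping (`shuffle_labels=True`) hurts it only slightly.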
4.2. Healthcare training data
4.2. 医疗健康训练数据
As mentioned earlier, the transition from PLMs to LLMs brings a significant shift from a model-centered approach to a data-centered approach. Increasing the volume of pre-training data has become a key factor in enhancing the general capabilities of LLMs. In line with this, we have gathered and organized various datasets for training Healthcare LLMs, as shown in Table 7. Additional descriptions are listed below.
如前所述,从PLM到LLM的转变标志着从以模型为中心的方法转向以数据为中心的方法。增加预训练数据量已成为提升大语言模型通用能力的关键因素。为此,我们收集整理了适用于医疗大语言模型训练的多类数据集(见表7),具体说明如下:
• EHR. The Medical Information Mart for Intensive Care III dataset (MIMIC-III) is widely recognized as one of the most frequently used EHR datasets. It encompasses a comprehensive collection of data from 58,976 unique hospital admissions involving 38,597 patients who were treated in the intensive care unit at the Beth Israel Deaconess Medical Center between 2001 and 2012. Furthermore, the dataset includes 2,083,180 de-identified notes associated with these admissions. MIMIC-III provides valuable and extensive information, facilitating the development of many PLMs and LLMs, such as MIMIC-BERT [81], GatorTron [108], and MedAGI [10].
• EHR。医疗信息集市重症监护III数据集 (MIMIC III) 被公认为使用最广泛的电子健康记录数据集之一。它包含2001至2012年间在贝斯以色列女执事医疗中心重症监护室接受治疗的38,597名患者共58,976次住院记录的全面数据集合。此外,该数据集还包含与这些住院记录相关的2,083,180条去标识化临床笔记。MIMIC III提供了宝贵且广泛的信息,促进了众多预训练语言模型 (PLM) 和大语言模型的发展,例如MIMIC-BERT [81]、GatorTron [108] 和MedAGI [10]。
• Scientific Literature. PubMed is a freely accessible search engine that provides access to the MEDLINE database, which contains references and abstracts related to life sciences and biomedical topics, with over 32 million citations for biomedical literature. The PubMed abstracts alone contain approximately 4.5 billion words, while the full-text articles available on PubMed Central (PMC) contribute around 13.5 billion words. These datasets consist of high-quality academic and professional text, making them particularly suitable for training Healthcare LLMs. Various PLMs and LLMs, such as BioBERT [73], BioELECTRA [182], GatorTron [108], and MedAlpaca [117], have been trained using PubMed data.
• 科学文献。PubMed是一个免费开放的搜索引擎,提供对MEDLINE数据库的访问,该数据库包含与生命科学和生物医学主题相关的参考文献和摘要,拥有超过3200万条生物医学文献引用。仅PubMed摘要就包含约45亿词,而PubMed Central (PMC)上的全文文章贡献了约135亿词。这些数据集由高质量的学术和专业文本组成,使其特别适合训练医疗保健领域的大语言模型。多种PLM和大语言模型,如BioBERT [73]、BioELECTRA [182]、GatorTron [108]和MedAlpaca [117],都使用了PubMed数据进行训练。
• Web Data. Web data includes any text that can be obtained from the Internet. Social media is one of the most commonly used data types. Reddit is a popular online platform that combines social news aggregation, content rating, and discussion features. The platform is organized into user-created boards called ‘‘communities’’ or ‘‘subreddits’’, covering a broad range of topics. The study [183] crawled health-themed forums on Reddit to build the COMETA corpus as LLM training data. Tweets are also commonly collected, and COVID-twitterBERT [101], Twitter BERT [184], and TwHIN-BERT [185] are trained on such data.
• 网络数据。网络数据包括我们能从互联网上获取的任何文本。社交媒体是最常用的数据类型之一。Reddit 是一个流行的在线平台,集社交新闻聚合、内容评分和讨论功能于一体。该平台由用户创建的板块组成,称为「社区」或「子版块」,涵盖广泛的主题。研究 [183] 爬取了 Reddit 上以健康为主题的论坛,形成 COMETA 语料库作为大语言模型的训练数据。推文也常被用于数据收集,COVID-twitterBERT [101]、Twitter BERT [184] 和 TwHIN-BERT [185] 均基于这些数据训练而成。
In general, the most common sources of data for Healthcare LLMs include EHRs, scientific literature, web data, and public knowledge bases. In terms of data structure, QA and dialogue are the most frequently encountered formats. Additionally, it is crucial to acknowledge the significance of multimodal data: given that the Healthcare domain inherently involves text, images, and time-series data, multimodal LLMs offer a promising direction for further research.
医疗领域大语言模型最常见的数据来源通常包括电子健康记录(EHR)、科学文献、网络数据和公共知识库。从数据结构来看,问答(QA)和对话是最常见的形式。此外,必须重视多模态数据的重要性。鉴于医疗领域天然包含文本、图像和时间序列数据,多模态大语言模型为后续研究提供了重要方向。
Besides, we have summarized the relevant computation costs from existing studies in Table 8, which aims to provide a clear assessment of computation requirements.
此外,我们在表8中总结了现有研究的相关计算成本,旨在提供清晰的计算需求评估。
4.3. Summary
4.3. 总结
In this section, we first summarize usage paradigms for Healthcare LLMs, including ICL, CoT, and AI agents. These technologies can further boost the capabilities of Healthcare LLMs without any expensive training process; such non-parametric methods are also promising directions for further exploration toward complete Healthcare AI systems. We also present a comprehensive overview of the data used for training LLMs, whose volume often surpasses the capacity of human teams to manually perform quality checks. Consequently, data collection processes rely heavily on heuristic rules for selecting data sources and applying filters. In the context of LLM training, various data challenges remain to be addressed, including the high cost of Healthcare data, contamination in benchmark data, personally identifiable information, and the mixture of domains during pre-training and fine-tuning tasks.
本节首先总结医疗大语言模型(LLM)的应用场景,包括上下文学习(ICL)、思维链(CoT)和AI智能体。这些技术无需昂贵训练过程即可显著增强医疗大语言模型的能力。此类非参数化方法也是构建完整医疗AI系统的重点探索方向。此外,我们系统梳理了用于训练大语言模型的数据现状:其规模通常超出人工质检的能力范围,因此数据收集过程高度依赖启发式规则来选择数据源并应用过滤机制。在训练大语言模型时,需要应对多重数据挑战,包括医疗数据的高成本、基准数据污染、个人身份信息泄露,以及预训练与微调任务中的多领域混杂问题。
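The heuristic filtering mentioned above can be sketched as a small pipeline. The thresholds and rules below are illustrative assumptions, not taken from any cited data-collection pipeline.

```python
import hashlib

def heuristic_filter(docs, min_words=50, max_symbol_ratio=0.3):
    """Toy pre-training data pipeline: a length filter, a symbol-ratio
    filter, and exact-duplicate removal via content hashing.
    Thresholds are illustrative, not from any cited pipeline."""
    seen, kept = set(), []
    for doc in docs:
        if len(doc.split()) < min_words:
            continue  # too short to be informative
        symbols = sum(not c.isalnum() and not c.isspace() for c in doc)
        if symbols / max(len(doc), 1) > max_symbol_ratio:
            continue  # likely markup or encoding debris
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate (also a crude contamination guard)
        seen.add(digest)
        kept.append(doc)
    return kept
```

Real pipelines add near-duplicate detection, language identification, and PHI scrubbing on top of such rules; the point here is only that each step is a cheap heuristic standing in for manual quality checks.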
5. Improving fairness, accountability, transparency, and ethics
5. 提升公平性、问责性、透明度和伦理
Fairness, accountability, transparency, and ethics are four important concerns in the AI domain. According to the study [186], Fairness holds paramount significance in guaranteeing that AI does not perpetuate or exacerbate established societal disparities; Accountability plays an important role in ensuring that individuals responsible for the conception and execution of AI can be held answerable for their decisions; Transparency assumes a critical role in ensuring that AI remains open to scrutiny and amenable to audits for possible biases or inaccuracies; Ethics, similarly, assumes a pivotal role in guaranteeing that AI is constructed and utilized in manners that align with prevailing social values and norms.
公平性、问责制、透明度和伦理道德是AI领域的四个重要议题。根据研究[186],公平性对于确保AI不会延续或加剧现有社会差异具有首要意义;问责制在确保AI设计和执行者需对其决策负责方面发挥重要作用;透明度对于保持AI系统可审查性及可审计潜在偏见或错误至关重要;伦理道德同样关键,它确保AI的开发和使用方式符合主流社会价值观与规范。
In the Healthcare domain, we believe that these four aspects are even more critical because the primary focus is on patient well-being and safety. In this context, the utmost importance lies in ensuring patients receive optimal care marked by equitable access to medical services. Additionally, the transparent and trustworthy nature of Healthcare decisions, the accountability in delivering accurate medical diagnoses and treatments, the safeguarding of patient confidentiality, and the adherence to elevated ethical standards emerge as distinct and noteworthy considerations, setting Healthcare apart from AI applications in other domains and more.
在医疗健康领域,我们认为这四个方面更为关键,因为该领域的核心关注点是患者的健康与安全。在此背景下,最重要的是确保患者获得以公平获取医疗服务为标志的最佳护理。此外,医疗决策的透明度和可信度、提供准确医疗诊断和治疗的责任、保护患者隐私以及遵守更高的道德标准,都成为独特而值得注意的考量因素,这使得医疗健康领域的人工智能应用与其他领域相比更具特殊性。
Table 7 Healthcare data that can be used to train LLMs.
| Data | Type | Size | Link |
|---|---|---|---|
| MIMIC-III | EHR | 58,976 hospital admissions | Link |
| MIMIC-IV | EHR | 11 years of hospital admissions | Link |
| CPRD [154] | EHR | over 2000 primary care practices | Link |
| PubMed | SL | 35M biomedical literature | Link |
| PMC | SL | 8 million articles | Link |
| RCT [155] | SL | 4528 abstracts | Link |
| MS^2 [156] | SL | 470,402 abstracts | Link |
| CDSR [157] | SL | 7805 abstracts | Link |
| SumPubMed [158] | SL | 33,772 abstracts | Link |
| The Pile | SL | 825 GB English text | Link |
| S2ORC [159] | SL | 63,709 abstracts | Link |
| CORD-19 [160] | SL | 1M papers | Link |
| MeQSum [161] | MS | 1000 instances | Link |
| CHQ-Sum [162] | MS | 1507 instances | Link |
| UMLS | KB | 2M entities for 900K concepts | Link |
| MedDialog [163] | Dial. | 3.66 million conversations | Link |
| CovidDialog [164] | Dial. | 603 consultations | Link |
| Flashcards [117] | Dial. | 33,955 instances | Link |
| Wikidoc [117] | Dial. | 67,704 instances | Link |
| Wikidoc PI [117] | Dial. | 5942 instances | Link |
| MEDIQA [165] | Dial. | 2208 instances | Link |
| CORD-19 [160] | Dial. | 1,056,660 instances | Link |
| MMMLU [160] | Dial. | 3787 instances | Link |
| PubMed Causal [166] | Dial. | 2446 instances | Link |
| ChatDoctor [167] | Dial. | 215,000 instances | Link |
| Alpaca-EN-AN [168] | Inst. | 52K instructions | Link |
| Alpaca-CH-AN [168] | Inst. | 52K instructions | Link |
| ShareGPT | Dial. | 61,653 long conversations | Link |
| COMETA [169] | Web | 800K Reddit posts | Link |
| WebText | Web | 40 GB of text | Link |
| OpenWebText | Web | 38 GB of text | Link |
| Colossal Corpus | Web | 806 GB of text | Link |
| OpenI | EHR | 3.7 million images | Link |
| U-Xray [170] | MM | 3955 reports and 7470 images | Link |
| ROCO [171] | MM | 81,000 radiology images and captions | Link |
| MedICaT [172] | MM | 17,000 images with captions | Link |
| PMC-OA [173] | MM | 1.6M image-caption pairs | Link |
| CheXpert [174] | MM | 224,316 chest radiographs with reports | Link |
| PadChest [175] | MM | 160,000 images with related text | Link |
| MIMIC-CXR | MM | 227,835 images for 64,588 patients | Link |
| PMC-15M [176] | MM | 15 million figure-caption pairs | Link |
| OpenPath [177] | MM | 208,414 pathology images and text | Link |
| Medtrinity [178] | MM | 25 million images and text | Link |
| MedPix 2.0 [179] | MM | 12,000 patient case scenarios | Link |
| MultiMed [180] | MM | 2.56 million samples with 10 modalities | Link |
| WorldMedQA-V [181] | MM | 568 QAs with medical images | Link |
✰ Although there are datasets available for Instruction Fine-Tuning, such as MultiMedQA and the USMLE test, we have opted not to include them in this list. These datasets are typically employed for evaluation purposes rather than serving as primary resources for training. SL, MS, MM, and KB mean Scientific Literature, Medical Question Summarization, Multimodal, and Knowledge Base, respectively. Dial. and Inst. mean Dialogue and Instruction.
表 7 医疗健康数据可用于训练大语言模型
| 数据 | 类型 | 规模 | 链接 |
|---|---|---|---|
| MIMIC-III | 电子健康记录(EHR) | 58,976例住院记录 | 链接 |
| MIMIC-IV | 电子健康记录(EHR) | 11年住院记录 | 链接 |
| CPRD [154] | 电子健康记录(EHR) | 超过2000家基层医疗机构 | 链接 |
| PubMed | 科学文献(SL) | 3500万篇生物医学文献 | 链接 |
| PMC | 科学文献(SL) | 800万篇文章 | 链接 |
| RCT [155] | 科学文献(SL) | 4528篇摘要 | 链接 |
| MS 2 [156] | 科学文献(SL) | 470,402篇摘要 | 链接 |
| CDSR [157] | 科学文献(SL) | 7805篇摘要 | 链接 |
| SumPubMed [158] | 科学文献(SL) | 33,772篇摘要 | 链接 |
| The Pile | 科学文献(SL) | 825GB英文文本 | 链接 |
| S2ORC[159] | 科学文献(SL) | 63,709篇摘要 | 链接 |
| CORD-19 [160] | 科学文献(SL) | 100万篇论文 | 链接 |
| MeQSum [161] | 医学问题摘要(MS) | 1000条实例 | 链接 |
| CHQ-Sum [162] | 医学问题摘要(MS) | 1507条实例 | 链接 |
| UMLS | 知识库(KB) | 90万个概念的200万个实体 | 链接 |
| MedDialog [163] | 对话(Dial.) | 366万次对话 | 链接 |
| CovidDialog [164] | 对话(Dial.) | 603次咨询 | 链接 |
| Flashcards [117] | 对话(Dial.) | 33955条实例 | 链接 |
| Wikidoc [117] | 对话(Dial.) | 67704条实例 | 链接 |
| Wikidoc PI [117] | 对话(Dial.) | 5942条实例 | 链接 |
| MEDIQA [165] | 对话(Dial.) | 2208条实例 | 链接 |
| CORD-19 [160] | 对话(Dial.) | 1056660条实例 | 链接 |
| MMMLU [160] | 对话(Dial.) | 3787条实例 | 链接 |
| Pubmed Causal [166] | 对话(Dial.) | 2446条实例 | 链接 |
| ChatDoctor [167] | 对话(Dial.) | 215000条实例 | 链接 |
| Alpaca-EN-AN [168] | 指令(Inst.) | 52K条指令 | 链接 |
| Alpaca-CH-AN [168] | 指令(Inst.) | 52K条指令 | 链接 |
| ShareGPT | 对话(Dial.) | 61653次长对话 | 链接 |
| COMETA [169] | 网络(Web) | 80万条Reddit帖子 | 链接 |
| WebText | 网络(Web) | 40GB文本 | 链接 |
| OpenWebText | 网络(Web) | 38GB文本 | 链接 |
| Colossal Corpus | 网络(Web) | 806GB文本 | 链接 |
| OpenI | 电子健康记录(EHR) | 370万张图像 | 链接 |
| U-Xray [170] | 多模态(MM) | 3955份报告和7470张图像 | 链接 |
| ROCO [171] | 多模态(MM) | 81,000张放射图像及说明 | 链接 |
| MedICaT [172] | 多模态(MM) | 17,000张带说明的图像 | 链接 |
| PMC-OA [173] | 多模态(MM) | 160万张图像-说明对 | 链接 |
| CheXpert [174] | 多模态(MM) | 224,316份胸片及报告 | 链接 |
| PadChest [175] | 多模态(MM) | 16万张带相关文本的图像 | 链接 |
| MIMIC-CXR | 多模态(MM) | 64,588名患者的227,835张影像 | 链接 |
| PMC-15M [176] | 多模态(MM) | 1500万张图表-说明对 | 链接 |
| OpenPath [177] | 多模态(MM) | 208,414张病理图像及文本 | 链接 |
| Medtrinity [178] | 多模态(MM) | 2500万张图像及文本 | 链接 |
| MedPix 2.0 [179] | 多模态(MM) | 12,000个患者案例 | 链接 |
| MultiMed [180] | 多模态(MM) | 256万个10种模态的样本 | 链接 |
| WorldMedQA-V [181] | 多模态(MM) | 568个带医学图像的问答对 | 链接 |
✰ 虽然存在可用于指令微调的数据集(如MultiMedQA和美国医师执照考试试题),但我们选择不将其列入本表。这些数据集通常用于评估而非作为主要训练资源。SL、MS、MM和KB分别代表科学文献、医学问题摘要、多模态和知识库。Dial.和Inst.分别代表对话和指令。
Table 8 Statistics of computation costs for existing Healthcare LLMs.
| Model Name | Total data size | GPU type | GPU no. | GPU time |
|---|---|---|---|---|
| VisualMed-Alpaca | 54k data points | A100-80G | 4 | 2.51 h |
| GatorTron | >90 billion words | A100 | 992 | 6 days |
| Galactica | - | A100-80G | 128 | - |
| ChatDoctor | 100k conversations | A100 | 6 | 3 h |
| DoctorGLM | 3.5G | A100-80G | 1 | 8 h |
| PMC-LLaMA | 75B tokens | A100 | 8 | 7 days |
| VisualMed-Alpaca | 44.8MB* (without images) | A100-80G | 4 | 2.51 h |
| BianQue 1.0 | 9 million samples | RTX 4090 | 8 | 16 days |
| GatorTronGPT | 277B tokens | A100-80G | 560 | 26 days |
| HuatuoGPT | 226,042 instances | A100 | 8 | - |
| LLaVA-Med | 15M image-caption pairs | A100 | 8 | 15 h |
| Med-Flamingo | 1.3M image-caption pairs | A100-80G | 8 | 6.75 days |
表 8 现有医疗领域大语言模型的计算成本统计
| 模型名称 | 总数据量 | GPU类型 | GPU数量 | GPU耗时 |
|---|---|---|---|---|
| VisualMed-Alpaca | 54k数据点 | A100-80G | 4 | 2.51小时 |
| GatorTron | >900亿词 | A100 | 992 | 6天 |
| Galactica | - | A100-80G | 128 | - |
| ChatDoctor | 100k对话 | A100 | 6 | 3小时 |
| DoctorGLM | 3.5G | A100-80G | 1 | 8小时 |
| PMC-LLaMA | 750亿token | A100 | 8 | 7天 |
| VisualMed-Alpaca | 44.8MB* (无图像) | A100-80G | 4 | 2.51小时 |
| 扁鹊1.0 | 900万样本 | RTX 4090 | 8 | 16天 |
| GatorTronGPT | 2770亿token | A100-80G | 560 | 26天 |
| 华佗GPT | 226,042实例 | A100 | 8 | - |
| LLaVA-Med | 1500万图像-标题对 | A100 | 8 | 15小时 |
| Med-Flamingo | 130万图像-标题对 | A100-80G | 8 | 6.75天 |
5.1. Fairness
5.1. 公平性
Fairness within the context of LLMs refers to the principle of equitably treating all users and preventing any form of unjust discrimination. This essential concept revolves around the mitigation of biases, aiming to guarantee that the outcomes produced by an AI system do not provide undue advantages or disadvantages to specific individuals or groups. These determinations should not be influenced by factors such as race, gender, socioeconomic status, or any other related attributes, e.g., different input languages and processing tasks, striving for an impartial and balanced treatment of all users. This fundamental tenet aligns with the broader objective of promoting equality and inclusivity for Healthcare LLMs.
大语言模型(LLM)背景下的公平性,指的是平等对待所有用户并防止任何形式不公正歧视的原则。这一核心概念围绕偏见缓解展开,旨在确保AI系统产生的结果不会对特定个人或群体造成不当优势或劣势。这些判断不应受种族、性别、社会经济地位或其他相关属性(例如不同输入语言和处理任务)的影响,力求对所有用户保持公正平衡的对待。这一基本原则与促进医疗领域大语言模型平等性和包容性的总体目标相一致。
The biases from LLMs can be attributed to the uneven distribution of demographic attributes in pre-training corpora. Such an argument also holds for the Healthcare sector [187]. As an example, neural models trained on publicly accessible chest X-ray datasets tend to exhibit underdiagnosis tendencies in marginalized communities, including female patients, Black patients, Hispanic patients, and those covered by Medicaid insurance [188]. These specific patient groups often experience systemic underrepresentation within the datasets, resulting in biased algorithms that may be susceptible to shifts in population demographics and disease prevalence. Furthermore, several global disease classification systems display limited intra-observer consensus, implying that an algorithm trained and assessed in one country may undergo evaluation under a dissimilar labeling framework in another country [189].
大语言模型(LLM)的偏见可归因于预训练语料中人口统计属性的不均衡分布。这一论点在医疗保健领域同样成立 [187]。例如,基于公开胸透X光数据集训练的神经模型往往对边缘化群体(如女性患者、黑人患者、拉丁裔患者及医疗补助保险覆盖人群)表现出诊断不足倾向 [188]。这些特定患者群体在数据集中常面临系统性代表不足,导致算法产生偏见,可能无法适应人口结构和疾病流行率的变化。此外,多项全球疾病分类系统存在观察者内部共识度不足的问题,这意味着在一国训练评估的算法可能在另一国面临不同标注框架的检验 [189]。
Current common practices to improve AI fairness in the Healthcare domain focus on pre-processing, in-processing, and post-processing [187]. Importance weighting is a pre-processing technique that adjusts the significance of less frequent samples from protected subgroups. Similarly, resampling endeavors to rectify sample-selection bias by acquiring more equitable subsets of the initial training dataset and can naturally be employed to address the underrepresentation of specific subgroups.
当前在医疗领域提升AI公平性的常见实践主要聚焦于预处理 (pre-processing)、处理中 (in-processing) 和后处理 (post-processing) [187]。重要性加权 (importance weighting) 是一种预处理技术,通过调整受保护子群中低频样本的权重来实现。类似的,重采样 (resampling) 通过从初始训练数据集中获取更均衡的子集来修正样本选择偏差,这种方法可自然应用于解决特定子群代表性不足的问题。
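Importance weighting as described above can be sketched with inverse-frequency weights per protected subgroup; the normalization choice (mean weight 1.0) is our illustrative assumption rather than a fixed convention.

```python
from collections import Counter

def inverse_frequency_weights(subgroups):
    """Assign each sample a weight inversely proportional to the
    frequency of its protected subgroup, so underrepresented groups
    contribute equally to the training loss. Weights are normalized
    so that their mean is 1.0 (an illustrative convention)."""
    counts = Counter(subgroups)
    raw = [1.0 / counts[g] for g in subgroups]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]
```

With subgroup labels `["A", "A", "A", "B"]`, the single "B" sample receives three times the weight of each "A" sample, so both groups carry equal total weight in the loss; resampling achieves the same rebalancing by duplicating or subsampling records instead of reweighting them.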
For LLMs, bias mitigation methods are frequently studied in the context of instruction fine-tuning and prompt engineering. The representative technique for instruction fine-tuning is RLHF. In the case of InstructGPT, GPT-3 is refined through a process involving RLHF, specifically aimed at adhering to human instructions. The procedure involves three sequential steps: firstly, gathering human-authored demonstration data to guide GPT-3’s learning; secondly, assembling comparative data consisting of model-generated outputs assessed by annotators to construct a reward model that predicts outputs preferred by humans; and lastly, fine-tuning policies based on this reward model. The aforementioned process offers a valuable chance to rebalance the data and incorporate additional security measures to prevent biased behavior in the model. However, it is important to note that obtaining demographic information can sometimes be challenging due to privacy and ethical concerns in medical practices. This creates an obstacle when we aim to ensure fairness while also protecting privacy.
对于大语言模型(LLM),偏差缓解方法常在指令微调(instruction fine-tuning)和提示工程(prompt engineering)的背景下被研究。指令微调的代表性技术是RLHF。以InstructGPT为例,GPT-3通过包含RLHF的流程进行优化,该流程专门旨在遵循人类指令。该过程包含三个连续步骤:首先收集人类编写的示范数据来指导GPT-3学习;其次整合由标注员评估的模型输出对比数据,构建用于预测人类偏好输出的奖励模型;最后基于该奖励模型进行策略微调。上述流程为重新平衡数据并整合额外安全措施以防止模型出现偏差行为提供了宝贵机会。但需注意的是,由于医疗实践中的隐私和伦理问题,获取人口统计信息有时颇具挑战性。这在我们力求确保公平性的同时保护隐私时构成了障碍。
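The second step above trains the reward model on annotator preference pairs. A common formulation for that objective (assumed here as an illustration, not quoted from the InstructGPT paper) is the pairwise Bradley-Terry loss: the preferred output should score higher than the rejected one.

```python
import math

def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for reward-model training: minimize
    -log(sigmoid(r_chosen - r_rejected)), which pushes the reward of
    the annotator-preferred output above that of the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two rewards are equal the loss is log 2, and it shrinks toward zero as the margin grows; step three then fine-tunes the policy to maximize this learned reward, which is where rebalanced preference data can steer the model away from biased behavior.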
5.2. Accountability
5.2. 问责制
LLMs are prone to amplifying the inherent social biases present in their training data, and they may produce hallucinatory or counterfactual outputs. This issue is compounded by their lack of robustness, making them vulnerable to perturbations and deviations from expected performance, especially when faced with diverse inputs or scenarios. In the healthcare sector, these problems can have grave implications because the outputs of LLMs can directly impact people’s health and even their lives. Consequently, ensuring accountability becomes a crucial concern when deploying LLMs in healthcare settings.
大语言模型容易放大训练数据中固有的社会偏见,并可能产生幻觉或与事实相悖的输出。由于缺乏鲁棒性,这些问题会进一步恶化,使模型容易受到扰动和性能偏差的影响,特别是在面对多样化输入或场景时。在医疗健康领域,这些问题可能造成严重后果,因为大语言模型的输出会直接影响人们的健康甚至生命。因此,在医疗场景中部署大语言模型时,确保问责制成为关键问题。
Effective accountability acts as a vital safeguard, ensuring that LLMs can be reliably integrated into the Healthcare field. Specifically, accountability entails that when healthcare LLMs err or yield undesirable outcomes, clear attribution of responsibility enables swift identification of the responsible parties. This facilitates prompt remedial actions and appropriate compensation for affected patients. Addressing these issues not only resolves specific problems but also helps prevent similar issues in the future, thereby enhancing both patient and public trust in healthcare LLM applications.
有效的问责机制作为关键保障,确保大语言模型(LLM)能可靠地融入医疗健康领域。具体而言,问责性要求当医疗大语言模型出现错误或产生不良后果时,明确的责任归属能快速锁定责任方。这有助于及时采取补救措施,并为受影响患者提供合理补偿。解决这些问题不仅能处理具体个案,更能预防未来类似问题,从而提升患者和公众对医疗大语言模型应用的信任度。
The hallucinations problem presents a main obstacle to accountable AI. In the evaluation conducted by the study [190], ChatGPT was evaluated using fact-based question-answering datasets, revealing that its performance did not exhibit enhancements in comparison to earlier versions. Consequently, the reliability of ChatGPT in tasks necessitating faithfulness is called into question. For instance, its potential fabrication of references in the context of scientific article composition [191] and the invention of fictitious legal cases within the legal domain [192] accentuate the potential risks associated with its use in critical domains.
幻觉问题对可信AI构成了主要障碍。研究[190]使用基于事实的问答数据集对ChatGPT进行评估时发现,其性能相比早期版本并未表现出提升。因此,ChatGPT在需要保真度的任务中的可靠性受到质疑。例如,其在科学论文撰写场景中可能伪造参考文献[191],以及在法律领域编造虚构案例[192],凸显了在关键领域使用该技术时存在的潜在风险。
Further, McKenna et al. [193] and Li et al. [194] investigate the root cause of hallucinations: LLMs tend to memorize training data, especially in relation to word frequencies. This fundamental cause indicates that completely resolving the hallucination issue is challenging; consequently, even the most advanced LLMs may still produce incorrect information. For this reason, effective accountability mechanisms must be in place before Healthcare LLMs are applied in real medical scenarios.
此外,McKenna等人[193]和李等人[194]研究了幻觉的根本原因。这些研究指出了幻觉问题的根源:大语言模型倾向于记忆训练数据,尤其是与词频相关的部分。这一根本原因表明,完全解决幻觉问题具有挑战性。因此,即使是最先进的大语言模型仍可能产生错误信息。基于此,在医疗场景中应用医疗大语言模型前,我们必须建立有效的问责机制。
Actually, accountability in AI is not just about correcting errors but also about implementing preventative measures that maintain trust and safety, particularly when AI decisions impact human lives. A direct preventive measure is to facilitate user participation in modeling decisions. The study [195] contended that enabling users to access human-generated source references is crucial for enhancing the reliability of the model’s responses. The study [196] advocated for the involvement of both AI developers and system safety engineers in evaluating the moral accountability concerning patient harm. Additionally, they recommend a transition from a static assurance model to a dynamic one, recognizing that ensuring safety is an ongoing process and cannot be entirely resolved during the initial design phase of the AI system before its deployment.
实际上,AI领域的问责制不仅关乎纠错,更涉及实施预防性措施以维护信任与安全,尤其是当AI决策影响人类生活时。一项直接的预防措施是促进用户参与建模决策。研究[195]指出,允许用户访问人工生成的来源参考对于提升模型响应的可靠性至关重要。研究[196]主张让AI开发者和系统安全工程师共同参与评估涉及患者伤害的道德责任。此外,他们建议从静态保障模式转向动态模式,因为确保安全是一个持续过程,无法在AI系统部署前的初始设计阶段完全解决。
The study [197] proposed a solution to tackle the issue of accountability, advocating for the education and training of prospective AI users to discern the appropriateness of relying on AI recommendations. However, imparting this knowledge to practitioners demands a considerable investment of effort. Healthcare professionals frequently grapple with overwhelming workloads and burnout, making comprehensive training on AI a significant challenge. Moreover, not all Healthcare practitioners possess adequate statistical training to comprehend the underlying mechanics of AI algorithms. In addition to education, the study [197] recommended the establishment of policies and mechanisms to ensure the protection of both clinicians and AI within the Healthcare domain.
研究[197]提出了一种解决问责问题的方案,主张通过教育培训使潜在AI用户能够判断依赖AI建议的合理性。然而,向从业者传授这类知识需要大量精力投入。医疗专业人员长期面临超负荷工作与职业倦怠,这使得全面的AI培训成为重大挑战。此外,并非所有医疗从业者都具备足够的统计学训练来理解AI算法的底层机制。除教育措施外,该研究[197]还建议制定政策与机制,以确保临床医生和AI在医疗领域都获得保护。
5.3. Transparency
5.3. 透明度
The limited transparency of neural networks has been widely criticized, presenting significant obstacles to their application in the Healthcare domain. LLMs and PLMs are complex neural network models, which further exacerbates the challenges associated with interpretability. In recent years, there have been efforts to understand the inner workings of PLMs in Healthcare contexts. Probing PLMs has been extensively employed to uncover the underlying factors contributing to their performance. For example, the study [198] examined PLMs’ disease knowledge, while the study [199] conducted in-depth analyses of attention in protein Transformer models, yielding valuable insights into their mechanisms. In the general machine learning domain, a transparent model is typically characterized by decision-making processes akin to those of white-box models, e.g., decision tree-based models or linear regression models. It often encompasses post hoc explanations [200], model-specific explanations [201], or model-agnostic explanations [202]. Sometimes, the explanation insights are derived from feature maps [203], generated natural language [204], factual and counterfactual examples [205], or decision-making evidence [206].
神经网络有限的透明度一直广受批评,这为其在医疗健康领域的应用带来了重大障碍。大语言模型(LLM)和预训练语言模型(PLM)作为复杂的神经网络模型,进一步加剧了可解释性方面的挑战。近年来,学界开始尝试理解PLM在医疗场景中的内部工作机制。探测方法被广泛用于揭示影响模型性能的潜在因素,例如研究[198]考察了PLM的疾病知识,而研究[199]则对蛋白质Transformer模型中的注意力机制进行了深入分析,为理解其机理提供了宝贵洞见。
在通用机器学习领域,透明模型通常具有类似白盒模型(如基于决策树的模型或线性回归模型)的决策过程特征。这类解释通常包括事后解释[200]、模型特定解释[201]或模型无关解释[202]。有时,解释性见解会从特征图[203]、生成的自然语言[204]、事实与反事实示例[205]或决策证据[206]中提取。
For PLMs, the study [200] introduced an innovative method accompanied by quantitative metrics aimed at mitigating the limitations observed in existing post hoc explanation approaches. These drawbacks include reliance on human judgment, the necessity for retraining, and issues related to data distribution shifts. The method allows for a quantitative assessment of interpretability methods without the need for retraining and effectively addresses distribution shifts between training and evaluation sets. In the era of LLMs, CoT prompting [6] has emerged as a potential method for providing a certain level of interpretability by generating reasoning steps. The technique empowers LLMs to break down complex, multi-step problems into more manageable intermediate steps. Moreover, it offers a transparent view of the LLM’s behavior, shedding light on its potential process of arriving at a specific answer and offering insights for identifying and rectifying errors in the reasoning path. However, this approach faces two primary challenges: the high cost of annotations required for CoT and the evaluation of interpretability. Acquiring demonstrations with annotated reasoning steps is an expensive task, particularly in professional fields such as Healthcare. Additionally, evaluating the generated reasoning results as explainable justifications and ensuring their usability pose significant challenges.
对于PLMs,研究[200]提出了一种创新方法并配套量化指标,旨在解决现有事后解释方法存在的局限性。这些缺陷包括对人类判断的依赖、需要重新训练以及与数据分布偏移相关的问题。该方法无需重新训练即可对可解释性方法进行量化评估,并有效解决了训练集与评估集之间的分布偏移问题。在大语言模型时代,思维链提示(CoT prompting)[6]通过生成推理步骤,成为提供一定程度可解释性的潜在方法。该技术使大语言模型能够将复杂的多步骤问题分解为更易处理的中间步骤。此外,它为大语言模型的行为提供了透明视图,揭示了其得出特定答案的潜在过程,并为识别和纠正推理路径中的错误提供了洞察。然而,该方法面临两个主要挑战:思维链所需标注的高成本以及可解释性评估。获取带有标注推理步骤的演示样本是一项昂贵的工作,尤其在医疗保健等专业领域。此外,将生成的推理结果评估为可解释的论证并确保其可用性也构成重大挑战。
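CoT prompting as described above boils down to prepending demonstrations whose answers include annotated reasoning steps, plus a trigger phrase for the new question. A minimal sketch, with an illustrative medication-arithmetic demonstration (not drawn from any cited dataset):

```python
def cot_prompt(question, demos):
    """Assemble a chain-of-thought prompt: each demonstration pairs a
    question with annotated reasoning steps and a final answer, and
    the trigger phrase elicits step-by-step reasoning for the new
    question. Demonstrations here are illustrative."""
    blocks = [f"Q: {q}\nA: {steps} The answer is {a}."
              for q, steps, a in demos]
    blocks.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(blocks)

demos = [("A patient takes 2 tablets twice daily. How many per week?",
          "2 tablets x 2 times = 4 per day; 4 x 7 = 28.", "28")]
print(cot_prompt("A course is 3 capsules daily for 10 days. Total?", demos))
```

The annotation cost noted above is exactly the cost of writing the `steps` field for each demonstration, which in Healthcare requires clinical expertise.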
5.4. Ethics
5.4. 伦理
The ethical concerns about using LLMs for Healthcare have been widely discussed. Healthcare LLMs are typically trained on a wide range of patient characteristics, including clinical measurements, molecular signatures, demographic information, and even behavioral and sensory tracking data. It is crucial to acknowledge that these models are prone to memorizing training data and simply reproducing it for users, thereby compromising user privacy.
使用大语言模型(LLM)进行医疗保健的伦理问题已被广泛讨论。医疗保健领域的大语言模型通常掌握大量患者特征数据,包括临床测量指标、分子特征、人口统计信息,甚至行为与感官追踪数据。必须认识到这些模型容易记住训练数据并直接向用户复现,从而导致用户隐私泄露。
As mentioned in Section 4.2, EHRs serve as important training data, alongside public scientific literature and web data. However, it is worth noting that EHRs may contain sensitive information such as patient visits and medical history, and exposing such data could lead to physical and mental harm to patients. It is important to recognize that deidentification techniques employed in EHR may not always guarantee complete safety. Recent studies have shown that there can be instances of data leakage from PLMs, allowing for the recovery of personal health information from models trained on such data sources [207]. Additionally, approaches such as KART [208] have been proposed to assess the vulnerability of sensitive information in biomedical PLMs using various attack strategies.
如第4.2节所述,电子健康记录(EHR)与公开科学文献及网络数据同为重要训练数据。但需注意,EHR可能包含患者就诊记录和病史等敏感信息,此类数据泄露可能导致患者身心伤害。需要认识到,EHR采用的去标识化技术未必能始终保证完全安全。最新研究表明,预训练语言模型(PLM)可能存在数据泄露情况,使得从这类数据源训练的模型中恢复个人健康信息成为可能[207]。此外,KART[208]等方法已被提出,用于通过多种攻击策略评估生物医学PLM中敏感信息的脆弱性。
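As a toy illustration of why rule-based de-identification may not guarantee complete safety, consider a minimal redaction pass. The two patterns below are illustrative assumptions and would miss most real PHI (names, addresses, free-text dates), which is precisely the residual-risk concern raised above.

```python
import re

# Toy de-identification pass over clinical text. Real pipelines combine
# far richer rules with learned models; these two patterns are
# illustrative only.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),       # US SSN-like IDs
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),  # numeric dates
]

def redact(note: str) -> str:
    for pattern, token in PATTERNS:
        note = pattern.sub(token, note)
    return note
```

A note like "Patient Mary Smith, seen March 14th" passes through untouched, showing how memorized residual identifiers can survive into training data even after such a pass.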
Medical applications inherently involve sensitive data privacy concerns that surpass those of other NLP tasks. Consequently, safeguarding privacy during the evaluation process becomes even more important. One potential solution to address this challenge is the adoption of Federated Learning (FL) [209], which enables the implementation of large-scale evaluation systems while preserving privacy. By allowing the model to be trained directly on the devices where the data originates, FL keeps sensitive patient information localized, reducing the risk of data breaches. Moreover, it can help create more generalized and unbiased models by learning from a diverse array of decentralized data sources, thus covering a broader spectrum of patient conditions.
医学应用天然涉及超越其他自然语言处理(NLP)任务的敏感数据隐私问题。因此,在评估过程中保护隐私变得更为重要。应对这一挑战的潜在解决方案是采用联邦学习(Federated Learning, FL) [209],该技术能在保护隐私的同时实现大规模评估系统。通过让模型直接在数据源设备上进行训练,联邦学习使敏感患者信息保持本地化,从而降低数据泄露风险。此外,通过从多样化的去中心化数据源中学习,它有助于创建更具普适性和无偏见的模型,从而覆盖更广泛的患者状况谱系。
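One FL aggregation round (FedAvg, taken here as the canonical instance of the idea) can be sketched as a size-weighted average of client parameters; only parameters, never raw patient records, reach the server. Parameters are flat lists of floats for brevity.

```python
def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation round: the server averages client model
    parameters weighted by local dataset size. Raw patient data stays
    on each client; only the parameter vectors are shared."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]
```

For example, two hospitals holding 3,000 and 1,000 records would contribute to the global model in a 3:1 ratio, after which the averaged parameters are broadcast back for the next local training round.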
Summarily, it is imperative for stakeholders to engage in ethical reviews and updates of the guidelines governing the use of LLMs. This includes regular assessments of the models for biases, implementing rigorous privacy safeguards, and ensuring transparent and explainable AI systems. Moreover, active collaboration between ethicists, technologists, clinicians, and patients is necessary to harness the benefits of healthcare LLMs while minimizing their risks.
总之,利益相关方必须对大语言模型(LLM)的使用准则进行伦理审查和更新。这包括定期评估模型的偏见、实施严格的隐私保护措施,以及确保人工智能系统的透明性和可解释性。此外,伦理学家、技术人员、临床医生和患者之间需要积极合作,以充分利用医疗领域大语言模型的优势,同时将其风险降至最低。

Fig. 6. The illustration depicts NLP technologies and their related healthcare applications. A quarter circle indicates that the technology is just beginning to be explored in these applications. Two quarters signify that the technology has been studied for several years. Three quarters suggest that the technology is mature and ready for implementation in real-world scenarios. A full circle indicates that the technology is actively being utilized in real scenarios.
图 6: 该示意图展示了自然语言处理(NLP)技术及其相关医疗健康应用。四分之一圆表示该技术在这些应用中刚开始被探索;半圆表示该技术已研究数年;四分之三圆表明该技术已成熟并准备投入实际场景应用;满圆则表示该技术已在真实场景中积极投入使用。
6. Discussion
6. 讨论
6.1. Healthcare core issues
6.1. 医疗健康核心问题
As illustrated in Fig. 6, we identify six core issues in healthcare that are critical for improving healthcare outcomes. We then discuss how these core issues are supported by various LLM-related technologies introduced in Section 2. Generally, foundational technologies such as NER, RE, and TC are widely used in real-world scenarios. Furthermore, the generative capabilities of QA and dialogue systems play increasingly important roles in enhancing healthcare outcomes. The creation and management of clinical documentation are time-consuming, leading to inefficiencies and increased error risks. NER and RE automate the extraction of key information from medical notes, allowing medical professionals to focus more on patient care while reducing paperwork burdens. Also, LLMs can generate structured medical reports and ensure compliance with regulatory requirements, ultimately enhancing the quality of services and optimizing the healthcare system.
如图6所示,我们确定了医疗保健领域中六个对改善医疗结果至关重要的核心问题。接着我们探讨了第2节介绍的各种大语言模型相关技术如何支持这些核心问题。通常而言,NER(命名实体识别)、RE(关系抽取)和TC(文本分类)等基础技术在现实场景中应用广泛。此外,问答系统和对话系统的生成能力在提升医疗效果方面发挥着日益重要的作用。临床文档的创建和管理耗时费力,导致效率低下且增加错误风险。NER和RE技术能自动从医疗记录中提取关键信息,让医疗专业人员更专注于患者护理,同时减轻文书工作负担。此外,大语言模型可生成结构化医疗报告并确保符合监管要求,最终提升服务质量并优化医疗体系。
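To make the role of NER and RE in clinical documentation concrete, the following is a toy, dictionary-based sketch. The term lists, labels, and example note are purely illustrative assumptions of ours; real systems use trained clinical NER/RE models rather than string lookups.

```python
# Toy dictionaries standing in for a trained clinical NER model.
DRUGS = {"metformin", "lisinopril"}
CONDITIONS = {"type 2 diabetes", "hypertension"}

def extract_entities(note):
    """Return (entity_text, label) pairs found by dictionary lookup."""
    lowered = note.lower()
    hits = [(t, "CONDITION") for t in sorted(CONDITIONS) if t in lowered]
    hits += [(t, "DRUG") for t in sorted(DRUGS) if t in lowered]
    return hits

def extract_relations(entities):
    """Naive RE: propose a candidate TREATS? link for every drug-condition pair."""
    drugs = [e for e, label in entities if label == "DRUG"]
    conditions = [e for e, label in entities if label == "CONDITION"]
    return [(d, "TREATS?", c) for d in drugs for c in conditions]

note = "Patient with type 2 diabetes, started on metformin 500 mg daily."
entities = extract_entities(note)
relations = extract_relations(entities)
print(entities)   # → [('type 2 diabetes', 'CONDITION'), ('metformin', 'DRUG')]
print(relations)  # → [('metformin', 'TREATS?', 'type 2 diabetes')]
```

The extracted tuples could then populate a structured report template, which is the paperwork-reduction use case described above.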
Clinical Decision Support Systems (CDSS) are vital in healthcare, assisting physicians with precise medical decisions through timely data analysis. Basic CDSS use NER and RE to extract key patient features, while STS retrieves similar patients to support outcome prediction. Advanced CDSS leverage LLMs for flexible decision support by addressing user-posed health queries, which can significantly enhance medical decision-making, although such systems remain rare in practice. As the demand for healthcare services grows, traditional patient–doctor interactions face challenges, particularly in providing continuous care outside working hours. QA and dialogue systems enable the creation of virtual health assistants that provide round-the-clock health consultations and medication management. These AI assistants can address common health issues, such as drug interactions and appointment management, although advanced features like emotional support are still under exploration.
临床决策支持系统(CDSS)在医疗保健领域至关重要,通过及时的数据分析协助医生做出精准医疗决策。基础型CDSS利用命名实体识别(NER)和关系抽取(RE)提取患者关键特征,而语义文本相似度(STS)通过分析相似患者案例来预测诊疗结果。进阶型CDSS则运用大语言模型(LLM)处理用户提出的健康问题,提供灵活决策支持,尽管目前实际应用较少,但显著提升了医疗决策流程。随着医疗服务需求增长,传统医患互动模式面临挑战,尤其在工作时间外的持续照护场景。问答系统(QA)与对话技术可构建虚拟健康助手,提供全天候健康咨询与用药管理服务。这类AI智能体能够处理药物相互作用、预约管理等常见健康问题,但情感支持等高级功能仍在探索中。
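The "STS retrieves similar patients" step can be sketched with a minimal bag-of-words cosine similarity. The patient summaries below are hypothetical; a real CDSS would compare embeddings from a clinical language model rather than raw word counts.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity over bag-of-words counts — a stand-in for
    embedding-based semantic textual similarity (STS)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical query patient and cohort notes (illustrative only).
query = "65 year old male with chest pain and shortness of breath"
cohort = [
    "70 year old male presenting with chest pain",
    "30 year old female with ankle sprain",
]
# Rank cohort notes by similarity to the query patient.
ranked = sorted(cohort, key=lambda note: bow_cosine(query, note), reverse=True)
print(ranked[0])  # → "70 year old male presenting with chest pain"
```

The top-ranked similar patients would then feed the outcome-prediction step of the basic CDSS pipeline.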
Early diagnosis is vital for improving treatment outcomes, particularly for diseases such as cancer, cardiovascular conditions, and neurological disorders [210–212]. By using LLMs to analyze extensive historical health data, we can identify early disease signals and predict individual risks: NER and RE process structured data, TC handles unstructured medical records, and QA and dialogue systems further improve the accuracy of disease prediction.
早期诊断对于改善治疗效果至关重要,尤其是针对癌症、心血管疾病和脑部疾病等病症 [210–212]。通过利用大语言模型(LLM)分析大量历史健康数据,我们可以识别早期疾病信号并预测个体风险,同时运用命名实体识别/关系抽取(NER/RE)技术处理结构化数据,采用文本分类(TC)处理非结构化医疗记录,而问答(QA)和对话系统则能提升疾病预测的准确性。
In medical research, LLMs streamline the analysis of vast literature, using NER/RE to identify keywords and STS to find similar studies, significantly reducing literature review time. This acceleration facilitates access to the latest research findings, allowing researchers to directly query LLMs for multiple potential answers, which inspires further exploration and advancement in the field. The uneven distribution of healthcare resources limits access to medical services in remote areas, making it difficult for patients to receive timely care. QA and dialogue technologies enable chatbots to address common issues and recommend human experts for complex cases.
在医学研究中,大语言模型(LLM)通过NER/RE技术识别关键词,并利用STS技术查找相似研究,大幅缩短文献综述时间。这种加速机制帮助研究者快速获取最新成果,使其能直接向大语言模型查询多种潜在答案,从而激发该领域的进一步探索与突破。医疗资源分布不均导致偏远地区难以获得医疗服务,患者无法及时接受诊治。问答(QA)和对话技术使聊天机器人能处理常见问题,并为复杂病例推荐人类专家。
6.2. Multimodal healthcare LLMs
6.2. 多模态医疗大语言模型
The healthcare domain inherently involves diverse multimodal data, making multimodal Healthcare LLMs one of the most promising and essential research directions [213]. By integrating textual data with medical images, time-series data, and other modalities, these models have the potential to deliver more comprehensive and insightful analyses.
医疗领域本质上涉及多样化的多模态数据,这使得多模态医疗大语言模型成为最具前景和必要性的研究方向之一 [213]。通过整合文本数据与医学影像、时间序列数据及其他模态,这些模型有望提供更全面、更具洞察力的分析。
On one hand, multimodal Healthcare LLMs, which can integrate and learn from heterogeneous data, offer the potential to unlock a profound and nuanced understanding of complex medical phenomena. By capturing complementary semantic information and the intricate relationships across various modalities, these models enable clinicians to gain a holistic view of patients’ conditions. This capability supports more proactive monitoring, precise diagnoses, and highly personalized treatment plans. On the other hand, multimodal learning significantly broadens the application scope in the healthcare field. For instance, a patient’s abdomen may develop a hard, lump-like protrusion, which ordinary patients might find difficult to describe accurately. In such cases, if an LLM could directly analyze the patient’s photo to make a determination, its overall efficiency, capability, and practicality would be significantly enhanced.
一方面,能够整合和学习异构数据的多模态医疗大语言模型,有望实现对复杂医学现象的深入细致理解。通过捕捉跨模态的互补语义信息和错综复杂的关联关系,这些模型能让临床医生全面掌握患者病情。这一能力支持更主动的监测、更精准的诊断和高度个性化的治疗方案。另一方面,多模态学习极大拓展了医疗领域的应用场景。例如患者腹部可能出现硬块状隆起,普通患者往往难以准确描述。此时若大语言模型能直接分析患者照片做出判断,其整体效率、能力和实用性都将显著提升。
Nevertheless, challenges such as data heterogeneity, integration complexity, and the need for large-scale, high-quality datasets persist [214]. Overcoming these challenges through continued research and innovation is vital to fully harness the transformative potential of multimodal data in healthcare LLMs.
然而,数据异构性、集成复杂性以及大规模高质量数据集的需求等挑战依然存在 [214]。通过持续研究和创新克服这些挑战,对于充分发挥多模态数据在医疗大语言模型中的变革潜力至关重要。
6.2.1. Integration with healthcare process
6.2.1. 与医疗流程的整合
Is the application of artificial intelligence in the medical field just an "old myth", or can it really change the status quo? Current AI solutions remain fragmented and largely experimental, without widespread adoption; based on an existing study [215], we attribute this to three main reasons. First, integration with existing hospital information technology (IT) systems is difficult. AI solutions require large amounts of data for training, and most of these data are currently stored in hospitals' own information systems. Retrieving and integrating them requires upgrades and modifications to existing systems, which affects hospitals' daily operations. In addition, different hospitals use different data formats and standards, lack standardized interfaces, and have relatively complex workflows in the healthcare domain; AI systems struggle to adapt to these heterogeneous interfaces, which further increases the difficulty of integration. Second, hospital consolidations fragment IT systems. As hospital mergers and acquisitions increase, the merging hospitals may run entirely different IT systems. After consolidation, their respective clinical and management systems must be unified, which requires huge investment and a long transition period, and introducing new AI systems during this process faces great technical challenges. Third, regulations are unclear and challenging. Laws and regulations for AI medical applications remain incomplete: key issues such as information security, privacy protection, and liability attribution lack clear provisions, and regulations differ across countries and regions. These gaps create uncertainty for the development and deployment of AI systems. At the same time, applying AI in the medical industry raises complex ethical issues that are also difficult to resolve.
人工智能在医疗领域的应用究竟是"老生常谈",还是真能改变现状?显然,尽管当前AI解决方案呈现碎片化且多为实验性质,尚未广泛普及,但存在这些问题是因为根据现有研究[215],我们认为主要由以下三个原因导致。首先,难以与现有医院信息技术(IT)系统整合。AI解决方案需要大量数据进行训练,而这些数据目前大多存储在医院自有信息系统中。检索和整合这些数据需要对现有系统进行升级改造,这将影响医院日常运营。此外,不同医院采用不同数据格式与标准,缺乏标准化接口,且医疗领域工作流程相对复杂。AI系统难以适配不同接口,这也增加了整合难度。其次,医院合并导致的IT系统碎片化。随着医院并购增加,原有医院可能使用完全不同的IT系统。合并后需要统一各自的临床与管理系统,这需要巨大投入和漫长过渡期。在此过程中引入新AI系统将面临极大技术挑战。第三,法规不明确且具有挑战性。目前AI医疗应用法律法规尚不完善,信息安全、隐私保护、责任归属等关键问题缺乏明确规定。此外,各国各地区监管要求存在差异。这些都将为AI系统的开发应用带来不确定性。同时,AI在医疗行业的应用涉及复杂的伦理问题,这些难题也难以解决。
6.3. Global collaboration and regulatory differences
6.3. 全球协作与监管差异
In addition to general concerns about fairness, accountability, transparency, and ethics, differences between countries pose significant challenges to applying Healthcare LLMs, particularly in the context of global collaborations. One major barrier is the disparity in levels of digital development, often referred to as the "digital divide" [216]. Bridging this divide requires strategies to make LLM technologies more accessible and equitable, especially in under-resourced settings. This can be achieved by developing user-friendly interfaces, supporting multiple languages, and training models on diverse datasets that reflect the needs of various populations. Such inclusivity enhances the relevance and applicability of LLMs in global healthcare contexts.
除了对公平性、问责制、透明度和伦理的普遍关切外,国家间的差异也给医疗健康领域大语言模型(LLM)的应用带来了重大挑战,特别是在全球合作背景下。一个主要障碍是数字化发展水平的差距,通常被称为"数字鸿沟"[216]。弥合这一鸿沟需要制定策略,使LLM技术更具可及性和公平性,特别是在资源匮乏的环境中。这可以通过开发用户友好的界面、支持多种语言,以及在不同人群需求多样化的数据集上训练模型来实现。这种包容性增强了LLM在全球医疗健康背景下的相关性和适用性。
Another critical challenge is addressing differences in global regulatory frameworks. Adapting LLMs to comply with diverse legal and ethical standards across regions requires a comprehensive understanding of each jurisdiction’s regulations and cultural nuances. This adaptation not only ensures compliance with local legal frameworks but also fosters trust by respecting regional ethical considerations. Cross-national collaboration is pivotal in overcoming these challenges. Establishing shared governance models and standardized protocols can facilitate the seamless integration of LLMs across borders. Additionally, leveraging privacy-preserving technologies, such as federated learning, enables secure data sharing and collaborative model training while safeguarding patient confidentiality. These collaborative efforts can drive the development of robust, globally applicable healthcare solutions that are sensitive to regional differences and capable of addressing disparities in healthcare access and quality.
另一个关键挑战是应对全球监管框架的差异。要让大语言模型(LLM)适应不同地区的法律和伦理标准,需要全面理解每个司法管辖区的法规和文化差异。这种调整不仅能确保符合当地法律框架,还能通过尊重区域伦理考量来建立信任。
跨国合作是克服这些挑战的关键。建立共享治理模式和标准化协议可以促进大语言模型的无缝跨境整合。此外,利用联邦学习(federated learning)等隐私保护技术,能够在保障患者机密的同时实现安全的数据共享和协作式模型训练。
这些协作努力可以推动开发出强大且全球适用的医疗解决方案,这些方案既能敏锐感知区域差异,又能解决医疗可及性和质量方面的不平等问题。
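The federated learning idea mentioned above can be sketched with the weighted model-averaging step at its core (FedAvg-style aggregation): each site trains locally and shares only model weights, never patient records. The weight vectors and sample counts below are illustrative assumptions, not from any real deployment.

```python
def fed_avg(site_weights, site_sizes):
    """Average per-site weight vectors, weighted by local sample count.
    Only these weights cross institutional boundaries; raw patient
    data stays at each site."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]

hospital_a = [0.25, 0.75]  # weights learned from 100 local records
hospital_b = [0.75, 0.25]  # weights learned from 300 local records
global_model = fed_avg([hospital_a, hospital_b], [100, 300])
print(global_model)  # → [0.625, 0.375]
```

In practice the aggregation runs over millions of parameters per round and is combined with secure aggregation or differential privacy, but the data-never-leaves-the-hospital principle is the same.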
7. Conclusion
7. 结论
In this study, we provided a comprehensive survey specifically focusing on Healthcare LLMs. Our survey encompassed an extensive examination of data, technologies, applications, fairness, accountability, transparency, and ethics associated with Healthcare LLMs. A noteworthy transformation has been observed from Discriminative AI to Generative AI, as well as from model-centered to data-centered approaches, marking a significant shift from PLMs to LLMs. This transition has enabled Healthcare LLMs to support more advanced applications beyond conventional NLP-based fundamental tasks.
在本研究中,我们针对医疗健康领域的大语言模型 (Healthcare LLMs) 进行了全面综述。我们的调查涵盖了对医疗健康大语言模型相关数据、技术、应用、公平性、问责制、透明度及伦理的广泛研究。研究观察到从判别式 AI (Discriminative AI) 到生成式 AI (Generative AI) 的显著转变,以及从以模型为中心到以数据为中心的方法演进,标志着从预训练语言模型 (PLMs) 到大语言模型的重要转型。这一转变使得医疗健康大语言模型能够支持超越传统基于自然语言处理的基础任务的更高级应用。
However, despite the opportunities presented by Healthcare LLMs, several significant challenges persist. Issues pertaining to interpretability, privacy protection, medical knowledge enhancement, integration with healthcare processes, and effective interaction with patients and doctors pose substantial obstacles. These challenges hinder the translation of innovative LLMs into practical adoption within the healthcare field. Consequently, physicians and other healthcare professionals must carefully consider the potential benefits and limitations associated with LLMs as they navigate the selection and integration of these models into their medical practice.
然而,尽管医疗大语言模型(Healthcare LLMs)带来了诸多机遇,仍存在若干重大挑战。可解释性、隐私保护、医学知识增强、与医疗流程的整合以及与患者和医生的有效互动等问题构成了实质性障碍。这些挑战阻碍了创新大语言模型在医疗领域的实际应用转化。因此,医生及其他医疗从业者在选择并将这些模型整合到医疗实践中时,必须审慎权衡其潜在优势与局限性。
CRediT authorship contribution statement
CRediT作者贡献声明
Kai He: Writing – review & editing, Writing – original draft, Investigation, Formal analysis, Conceptualization. Rui Mao: Writing – review & editing, Writing – original draft, Methodology, Conceptualization. Qika Lin: Writing – review & editing, Writing – original draft, Methodology, Conceptualization. Yucheng Ruan: Writing – review & editing, Writing – original draft. Xiang Lan: Writing – review & editing, Writing – original draft, Conceptualization. Mengling Feng: Writing – review & editing, Supervision, Funding acquisition, Conceptualization. Erik Cambria: Writing – review & editing, Supervision, Conceptualization.
Kai He: 写作 – 审阅与编辑, 写作 – 初稿, 调研, 形式分析, 概念化。
Rui Mao: 写作 – 审阅与编辑, 写作 – 初稿, 方法论, 概念化。
Qika Lin: 写作 – 审阅与编辑, 写作 – 初稿, 方法论, 概念化。
Yucheng Ruan: 写作 – 审阅与编辑, 写作 – 初稿。
Xiang Lan: 写作 – 审阅与编辑, 写作 – 初稿, 概念化。
Mengling Feng: 写作 – 审阅与编辑, 监督, 资金获取, 概念化。
Erik Cambria: 写作 – 审阅与编辑, 监督, 概念化。
Declaration of competing interest
利益冲突声明
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
作者声明,他们不存在任何已知的竞争性经济利益或个人关系,这些利益或关系可能会影响本文所报告的研究工作。
Acknowledgments
致谢
This work has been supported by the National Research Foundation Singapore under AI Singapore Programme (Award Number: AISG-GC2019-001-2A and AISG2-TC-2022-004); The RIE2025 Industry Alignment Fund (I2101E0002 – Cisco-NUS Accelerated Digital Economy Corporate Laboratory).
本工作获得新加坡国家研究基金会人工智能新加坡计划 (Award Number: AISG-GC2019-001-2A 和 AISG2-TC-2022-004) 以及 RIE2025 产业联盟基金 (I2101E0002 – 思科-新国大加速数字经济企业实验室) 的资助。
Data availability
数据可用性
No data was used for the research described in the article.
本文所述研究未使用任何数据。
