A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics
Kai He a, Rui Mao b, Qika Lin a, Yucheng Ruan a, Xiang Lan a, Mengling Feng a,∗, Erik Cambria b
a National University of Singapore, 119077, Singapore
b Nanyang Technological University, 639798, Singapore
Abstract
The utilization of large language models (LLMs) for Healthcare has generated both excitement and concern due to their ability to effectively respond to free-text queries with certain professional knowledge. This survey outlines the capabilities of currently developed Healthcare LLMs and explicates their development process, providing an overview of the development road map from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications, highlighting both their strengths and limitations. Secondly, we compare previous PLMs with the latest LLMs, and summarize related Healthcare training data, learning methods, and usage. Finally, we investigate the unique concerns associated with deploying LLMs, particularly regarding fairness, accountability, transparency, and ethics. We also support researchers by compiling a collection of open-source resources. In summary, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a move from model-centered methodologies to data-centered methodologies. We conclude that the biggest obstacles to using LLMs in Healthcare are fairness, accountability, transparency, and ethics.
1. Introduction
Recently, Large Language Models (LLMs) have emerged as a driving force in AI due to their impressive abilities in understanding, generating, and reasoning. The integration of LLMs into Healthcare represents a significant advancement in the application of AI toward improving clinical outcomes, conserving resources, and enhancing patient care. Healthcare researchers face persistent challenges such as diagnosing rare diseases, interpreting complex patient narratives, and planning personalized treatments. The advanced language processing capabilities of LLMs directly address these needs, offering more precise diagnostics and tailored treatment options. For example, Med-PaLM 2 [1] demonstrates expert-level accuracy on the US Medical Licensing Examination (USMLE). Besides, more general models such as GPT-4, GPT-4o, and the Llama series also demonstrate superior performance in a variety of healthcare-related tasks. These advancements expand LLM applications in healthcare while improving patient outcomes through greater accuracy and efficiency.
Initially, Pretrained Language Models (PLMs) such as BERT [2] and RoBERTa [3] were developed for general NLP tasks and later adapted for healthcare applications. For simpler, less complex cases, PLMs retain advantages over LLMs in terms of simplicity and efficiency. However, their use in healthcare was limited because they typically operated as single-task systems, lacking the capability to interact dynamically with complex medical data [4].
Then, the development of LLMs like GPT-3 represents a transformative evolution from PLMs to LLMs, as illustrated in Fig. 1. With over 100 billion parameters, GPT-3 demonstrates exceptional understanding and generation capabilities, which significantly enhance its functionality across various applications, including Healthcare [6]. These capabilities allow LLMs to process and analyze a broader array of data types, such as patient records, clinical notes, and research papers, to identify patterns and suggest potential diagnoses that might be overlooked by human clinicians [7]. Additionally, the integration of LLMs into Healthcare is further supported by their enhanced explainability and adaptability compared to PLMs. The introduction of Chain-of-Thought (CoT) processing in newer LLMs contributes to a more transparent AI decision-making process. This transparency is crucial in Healthcare settings, where understanding the rationale behind AI-generated decisions can foster greater trust and reliability among medical professionals in employing AI-powered tools [6].
Fig. 1. The development road map from PLMs to LLMs. GPT-3 [5] marks a significant milestone in the transition from PLMs to LLMs, signaling the beginning of a new era in both the general and Healthcare fields.
Besides the aforementioned general abilities, many studies have tailored LLMs to address specific healthcare application tasks, marking a significant trend in this field. Understanding this trend is crucial for further advancing and diversifying healthcare applications. For instance, given that the healthcare field inherently involves multimodal data, some studies [8–10] have explored LLMs’ capabilities to understand and analyze diverse medical images. Additionally, models like HuatuoGPT [11] demonstrate active inquiry capabilities, allowing for the extraction of more potential medical information. Other disease-specific LLMs, such as OphGLM [12] for ophthalmology and SoulChat [13] for mental health, highlight the versatility of LLMs in addressing targeted medical needs. Beyond these examples, the potential of LLMs in healthcare remains vast and largely untapped. Investing in the development of effective, ethical, and accountable LLMs is not only essential but also holds immense promise for practical and transformative benefits in healthcare.
This paper aims to inform readers about the latest developments in the field and offer comprehensive insights to those interested in using or developing healthcare LLMs. It covers various healthcare applications and provides a detailed summary of the underlying technology. We aim to provide insights into how different technologies affect different Healthcare-related tasks. Furthermore, as the capabilities of LLMs continue to improve, we contend that the challenges of applying AI in healthcare due to performance limitations are diminishing. Consequently, issues of fairness, accountability, transparency, and ethics are becoming more significant impediments to practical implementation. For this reason, we discuss these four critical issues in the context of employing LLMs and emphasize their importance.
Several surveys [7,14–16] have specifically examined the applications of large language models (LLMs) in medical and healthcare domains, emphasizing their potential benefits and limitations. However, these works lack in-depth technological analysis and fail to address critical issues such as accountability and ethics. Other surveys [17,18] include discussions on technological aspects but primarily focus on general LLM developments and evaluations, offering limited insights into their adaptation and application in healthcare settings. Some studies have a narrower focus. For instance, the study [19] concentrates solely on testing healthcare-specific LLMs, while [20] is limited to their applications in psychotherapy. Additionally, an earlier survey [21] focused on Healthcare PLMs rather than LLMs. In contrast, we provide a brief introduction to Healthcare PLMs as background information and then delve into the details of Healthcare LLMs. Our comprehensive analysis is anticipated to guide medical researchers in making informed choices when selecting LLMs suitable for their specific needs. The organizational framework of this paper is shown in Fig. 2. Our contributions can be summarized as:
• We propose a comprehensive survey of LLMs in Healthcare, outlining an evolution road map from PLMs to LLMs and updating readers on the latest advancements in the field.
• We compile a detailed list of publicly available data, training costs, and task performances for Healthcare LLMs, which is useful for developers and users of private Healthcare LLMs.
• We explore key non-technical aspects of LLMs in Healthcare, such as fairness, accountability, transparency, and ethics, which are vital for advancing Healthcare AI applications.
2. What can LLMs do for healthcare? From fundamental tasks to advanced applications
Numerous endeavors have been made to apply PLMs or LLMs to Healthcare. In the early stages, studies primarily focused on fundamental tasks, due to the challenges of accessing diverse medical datasets, the complexity of the medical domain, and the limitations of model capabilities. Based on LLMs, the concept of Artificial General Intelligence (AGI) for Healthcare has been proposed, which has led to more practical and advanced applications in various aspects of the Healthcare field, as shown in Fig. 3. In this section, we analyze what LLMs can do for Healthcare in detail, and mainly compare the strengths and weaknesses of LLMs and PLMs on different tasks to highlight the development from PLMs to LLMs.
2.1. NER and RE for healthcare
The initial step toward unlocking valuable information in unstructured Healthcare text data mainly involves Named Entity Recognition (NER) and Relation Extraction (RE). These two tasks are the main tasks for achieving Information Extraction (IE), which provides fundamental information for a range of other Healthcare applications, such as medical entity normalization and coreference [22], medical knowledge base and knowledge graph construction [23], and entity-enhanced dialogue [24]. For example, by employing NER and RE, the Healthcare knowledge databases DrugBank [25] and the Unified Medical Language System (UMLS) were constructed, which facilitate various applications in intelligent Healthcare.
In the early stages of research on NER with PLMs, a significant portion of studies focused on sequence labeling tasks. To accomplish this, PLMs-based approaches were employed to generate contextualized representations for individual tokens. In the case of RE tasks, the extracted entity pairs' representations were typically fed into a classifier to determine the existence of relations between the given entities. In the era of LLMs, NER and RE have been improved to work under more complex conditions and with more convenient usage. One example is LLM-NERRE [26], which combines NER and RE to handle hierarchical information in scientific text. This approach has demonstrated the ability to effectively extract intricate scientific knowledge for tasks that require the use of LLMs. These tasks often involve complexities that cannot be effectively handled by typical PLMs. Meanwhile, LLMs can perform medical NER and RE well even without further training. The study [27] employed InstructGPT [28] to perform zero-/few-shot IE from clinical text, despite the model not being trained specifically for the clinical domain. The results illustrated that InstructGPT can perform very well on biomedical evidence extraction, medication status extraction, and medication attribute extraction. This observation supports the notion that LLMs can be applied with flexibility and efficiency.
Fig. 2. The organizational framework for the content. Sections 3 and 4 cover technical details, while Sections 2 and 5 are of greater value to Healthcare professionals.
Fig. 3. LLMs for healthcare: from fundamental tasks to advanced applications.
Despite these capabilities, LLMs still only perform comparably to specially trained state-of-the-art (SOTA) PLMs, particularly in domains that involve professional terms and symbols. LLMs were trained on unlabeled data, with most of their knowledge derived from a vast amount of textual information. However, for domain-specific knowledge, such as specific types of named entities, LLMs' pragmatic understanding capabilities are likely to be less effective than those of PLMs fine-tuned on labeled data. Overall, we argue that both PLMs and LLMs have distinct advantages in IE tasks.
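To make the zero-/few-shot IE pattern above concrete, the following is a minimal sketch in Python. The prompt format, the demonstration, and the clinical note are synthetic, and `call_llm` is a hypothetical stand-in for whichever instruction-following LLM API is available; it is not code from the cited studies.

```python
import json

# Few-shot prompt for medication extraction from a clinical note, in the
# spirit of the zero-/few-shot IE study [27]. Demonstration and note are synthetic.
PROMPT_TEMPLATE = """Extract all medications with their dose and status
(active/discontinued) from the clinical note. Answer as a JSON list.

Note: "Pt continues lisinopril 10 mg daily; aspirin was stopped last week."
Answer: [{{"drug": "lisinopril", "dose": "10 mg daily", "status": "active"}},
         {{"drug": "aspirin", "dose": null, "status": "discontinued"}}]

Note: "{note}"
Answer:"""


def extract_medications(note: str, call_llm) -> list:
    """Build the prompt, query an LLM, and parse its JSON answer.

    `call_llm` is a hypothetical callable (str -> str) wrapping any
    instruction-following model; no specific vendor API is assumed.
    """
    raw = call_llm(PROMPT_TEMPLATE.format(note=note))
    return json.loads(raw)  # naive parse; production code should validate
```

Note that no gradient updates are involved: adapting such an extractor to a new entity type only requires editing the prompt, which is exactly the flexibility contrasted above with fine-tuned PLM taggers.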
2.2. Text classification for healthcare
Text Classification (TC) aims to assign labels to text of different lengths, such as phrases, sentences, paragraphs, or documents. In Healthcare research, a large amount of patient data is collected in electronic format, including disease status, medication history, and treatment outcomes. However, these data can only be used with appropriate labels, and TC is one of the most commonly used technologies for producing them. For example, a research study [29] proposed several methods based on hybrid Long Short-Term Memory (LSTM) and bidirectional gated recurrent units (Bi-GRU) to achieve medical TC. The study [30] used TC to identify prescription medications mentioned in tweets and achieved good results by using PLMs. Also, some studies employ TC-based Sentiment Analysis (SA) to understand patient emotion or mental healthcare, aiming to provide more humanized treatments [31].
However, PLMs-based TC usually cannot satisfy the explainability and reliability requirements of the Healthcare field, while LLMs-based TC mitigates these issues to some extent. For example, CARP [32] takes advantage of LLMs by introducing Clue And Reasoning Prompting to achieve better TC. This study adopts a progressive reasoning strategy tailored to address the complex linguistic phenomena involved in TC. AMuLaP [33] is another example, which proposed Automatic Multi-Label Prompting for few-shot TC. By exploring automatic label selection, their method surpasses the GPT-3-style in-context learning method, showing significant improvements compared with previous PLMs-based results.
Unlike in general domains, where LLMs and SOTA PLMs exhibit similar performance, LLMs demonstrate a clear advantage in Healthcare TC, primarily due to the inherent complexity of specialized data. Healthcare texts are laden with specialized language, including technical terms, abbreviations, and jargon that are unique to the field. Moreover, the context in which these terms are used can significantly alter their meanings. For instance, the abbreviation ‘‘MI’’ might mean ‘‘mitral insufficiency’’ or ‘‘myocardial infarction’’, depending on the surrounding context. Given these conditions, Healthcare TC tasks require the integration of various types of data and an understanding of their interplay. This necessitates models that not only summarize information but also reason contextually. LLMs are well-suited for these tasks due to their deeper contextual understanding and ability to handle complex interactions within the text, making them more effective for healthcare applications than PLMs.
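As an illustration of such contextual reasoning, the sketch below builds a clue-and-reasoning style prompt in the spirit of CARP [32] for the ‘‘MI’’ example; the label set, the scaffold wording, and the input sentence are illustrative rather than taken from the original paper.

```python
LABELS = ["mitral insufficiency", "myocardial infarction"]

# Clue-and-reasoning style prompt (cf. CARP [32]): the model is asked to
# surface contextual clues before committing to a label, which also yields
# a human-readable justification for the classification.
def build_disambiguation_prompt(sentence: str) -> str:
    return (
        f'Sentence: "{sentence}"\n'
        f'Task: decide whether "MI" here means {LABELS[0]} or {LABELS[1]}.\n'
        "Step 1: list the clinical clues in the sentence.\n"
        "Step 2: reason about which expansion the clues support.\n"
        "Step 3: answer with exactly one of the two labels.\n"
    )

# Synthetic example: clues such as "echo" and "flail posterior leaflet"
# point to the valve, so a capable LLM should answer "mitral insufficiency".
print(build_disambiguation_prompt(
    "Echo shows severe MI with a flail posterior leaflet."))
```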
2.3. Semantic textual similarity for healthcare
Semantic Textual Similarity (STS) is a way to measure how much two sentences mean the same thing or how similar two documents are. In Healthcare, STS is often used to combine information from different sources, especially in Electronic Health Records (EHR). The 2018 BioCreative/Open Health NLP (OHNLP) challenge [34] and the National NLP Clinical Challenges (n2c2) 2019 Track 1 show that STS can help reduce the mistakes and disorganization in EHRs caused by copying and pasting or using templates. This means that STS can be used to check the quality of medical notes and make them more efficient for other NLP tasks. The study [35] proposed a new method using ClinicalBERT, a fine-tuned BERT-based method. The proposed iterative multitask learning technique helps the model learn from related datasets and select the best ones for fine-tuning. Besides, STS can be used for Healthcare information retrieval. For example, if a patient asks a question like ‘‘I was diagnosed with non-clear cell renal cell carcinoma, what are the chances of recurrence after cure? Give me evidence from relevant scientific literature’’, an AI system may need to retrieve related databases to find papers that contain semantically similar sentences. For doctors facing patients who are difficult to diagnose, this technology can identify similar patients for reference.
When comparing PLMs and LLMs, we need to break the discussion down by task. For short-text semantic classification, SOTA PLMs and LLMs are comparable. This is primarily because such tasks contain less contextual information, meaning the advantages of LLMs in managing large context windows and understanding complex narrative structures are less pronounced. In such cases, the fundamental ability of both PLMs and LLMs to understand and interpret language at a basic level plays a more significant role, leading to similar levels of performance. On the other hand, for tasks like information retrieval, LLMs tend to be overly complex and resource-intensive for the role of a simple retriever. Typically, LLMs excel at directly generating responses or completing texts based on given inputs. In contrast, PLMs, which are generally more lightweight, are better suited for retrieving external knowledge. This distinction makes PLMs more practical for applications where quick, efficient retrieval of information is required without the additional overhead of generating new text content.
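A minimal sketch of this embed-and-rank retrieval pattern is shown below, assuming the sentence-transformers library; the general-domain checkpoint named here is a stand-in (a clinically fine-tuned encoder such as a ClinicalBERT variant would normally be swapped in), and the corpus sentences are synthetic.

```python
from sentence_transformers import SentenceTransformer, util

# General-domain encoder as a stand-in for a clinically tuned one.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Patient diagnosed with non-clear cell renal cell carcinoma.",
    "History of type 2 diabetes, well controlled on metformin.",
    "Papillary RCC resected; surveillance imaging is recommended.",
]
query = "Recurrence risk of non-clear cell renal cell carcinoma after cure?"

# Embed once, then rank corpus sentences by cosine similarity to the query.
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0].tolist()

for score, sent in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {sent}")
```

Because the encoder runs offline over the corpus and only the query is embedded at request time, this PLM-style retriever stays far cheaper than generating with an LLM, which is the efficiency argument made above.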
2.4. Question answering for healthcare
Traditionally, QA is a separate task that involves generating or retrieving answers for given questions. In Healthcare, QA can be very beneficial for medical professionals seeking necessary information in clinical notes or literature, as well as for providing basic Healthcare knowledge to patients. According to a report by the Pew Research Center [36], over one-third of American adults have searched online for medical conditions they may have. A strong QA system for Healthcare can significantly fulfill the consultation needs of patients. Many studies [21] explored how to adapt general PLMs to answer Healthcare questions, including designing specialized pretraining tasks, fine-tuning on Healthcare data, and introducing external Healthcare knowledge bases. However, due to their limited language understanding and generation abilities [37], PLMs-based QA systems struggle to play a significant role in real-world Healthcare scenarios.
With the advent of powerful LLMs, prompt-based methods have been introduced to solve various tasks by formulating them as QA tasks, including NER, RE, and SA. Besides, LLMs have significantly improved typical QA tasks in Healthcare fields. For instance, Med-PaLM 2 [1] approached or exceeded state-of-the-art performance on the MedMCQA [38], PubMedQA [39], and MMLU [40] clinical topics QA datasets. The study [41] investigated the use of ChatGPT, Google Bard, and Claude for patient-specific QA from clinical notes. Another study [42] proposed a retrieval-based medical QA system that uses LLMs in combination with knowledge graphs to address the challenge.
Visual Question Answering (VQA) has recently garnered significant attention in the Healthcare field for its potential to meet the diverse needs of both patients and healthcare professionals. By facilitating the interpretation of medical images through question answering, VQA holds great promise for aiding diagnostics and enhancing patient understanding through educational tools. One of the key challenges in this domain is the precise identification and comprehension of critical regions in medical images, such as masses, anomalies, and lesions. Equally vital is ensuring that the semantic representation of these regions aligns with the specific demands articulated in textual queries, enabling the generation of contextually relevant and medically accurate responses. For example, the study [43] introduces a novel multiple meta-model quantification method for medical VQA tasks. This method effectively learns meta-annotations and extracts meaningful features. It is designed to enhance metadata through auto-annotation, handle noisy labels, and generate meta-models that produce more robust and reliable features. Besides, MISS [44] presents an efficient multi-task self-supervised learning framework, which unifies the text and multimodal encoders to enhance the alignment of image-text features effectively. Moreover, MISS introduces a novel Transfer-and-Caption method, leveraging LLMs to expand the feature space of single-modal image datasets.
Fig. 4. A comparison between PLMs-based and LLMs-based dialogue systems.
As QA is one of their most outstanding abilities, LLMs are obviously superior to PLMs on QA tasks. LLMs are increasingly being utilized to boost various real-world Healthcare applications, especially considering that only LLMs can support VQA tasks.
2.5. Dialogue system for healthcare
Chatbots have demonstrated promising potential to assist both patients and health professionals. The implementation of Healthcare dialogue systems can decrease the administrative workload of medical personnel and mitigate the negative consequences resulting from a shortage of physicians. Apart from the QA component, dialogue systems are generally classified into two categories: task-oriented and open-domain dialogue systems. The former is designed to address specific issues in Healthcare, such as hospital guides or medication consultations. In contrast, open-domain dialogue systems prioritize conversing with patients without any specific tasks. These systems are usually used as chatbots to provide emotional support or mental health-related applications [45]. For example, the study [46] shows that patients who participated in a telehealth project had lower scores for depression, anxiety, and stress, and experienced 38% fewer hospital admissions. In the early stages, the study [47] proposed an ontology-based dialogue system that supports electronic referrals for breast cancer, which can handle the informative responses of users based on the medical domain ontology. Another study, KR-DS [48], is an end-to-end knowledge-routed relational dialogue system that seamlessly incorporates a rich medical knowledge graph into topic transitions in dialogue management. A notable feature is that PLMs-based dialogue systems often comprise multiple sub-modules, including dialogue management, natural language understanding, or knowledge introduction modules. Each individual sub-module within the overall system has the potential to become a bottleneck, thereby restricting the system's practical applications.
In the case of LLM-based dialogue systems, the original pipeline system can be transformed into an end-to-end system leveraging a powerful LLM [17], as shown in Fig. 4. By utilizing an LLM, the remaining task involves aligning the system with human preferences and fine-tuning it for specific fields, without the need for many extra sub-modules, while achieving some advanced abilities that PLMs can hardly offer. For example, a new approach [49] was proposed to detect depression, which involves an interpretable and interactive system based on LLMs. The proposed system not only provides a diagnosis, but also offers diagnostic evidence grounded in established diagnostic criteria. Additionally, users can engage in dialogue with the system, which allows for a more personalized understanding of their mental state based on their social media content. ChatDoctor [50] is a specialized LLM designed to overcome the limitations observed in medical knowledge, which can utilize real-time information from online sources to engage in conversations with patients.
Table 1. Summarization of the strengths and weaknesses of PLMs and LLMs by task.
| Task | PLMs features | LLMs features | Comparison |
|---|---|---|---|
| Information extraction | Need labeled data | Zero-/few-shot | Have their own unique strengths |
| Text classification | Easy to adapt | Explainable and reliable | LLMs have a slight advantage |
| Semantic textual similarity | Skilled at short contexts and fundamental tasks | Skilled at long contexts and complex tasks | Depends on text length |
| Question answering | Limited language understanding and generation abilities | Better inherent professional knowledge | LLMs have a significant advantage |
| Dialogue system | Consist of multiple components | End-to-end system | LLMs have a significant advantage |
| Report generation | Limited generation abilities and only single modality | Multimodal LLMs | LLMs have a significant advantage |
2.6. Generation of medical reports from images
Medical reports are of significant clinical value to related specialists, but the process of writing them can be tedious, time-consuming, and error-prone for inexperienced specialists. Therefore, the automatic generation of medical reports has emerged as a promising research direction in the field of Healthcare. This capability can assist specialists in clinical decision-making and reduce the burden of report writing by automatically drafting reports that describe both abnormalities and relevant normal findings. Additionally, related models are expected to assist clinicians by pairing text reports with interactive visualizations, such as highlighting the region described by each phrase.
At an early stage, the study [51] proposed a data-driven method that combines a CNN to predict medical tags and generate a single-sentence report. However, a single-sentence report is of limited use in real medical scenarios. To generate multi-sentence reports, the study [52] proposed a multi-level recurrent generation model, which fused multiple image modalities by focusing on the frontal and lateral views. Most recently proposed models for automated report generation rely on multimodal technology implemented with LLMs, which can support more advanced applications. For example, VisualGPT [53] utilizes linguistic knowledge from LLMs and adapts it to new domains of image captioning in an efficient manner, even with small amounts of multimodal data. ChatCAD [54] introduced LLMs into medical-image Computer Aided Diagnosis (CAD) networks. Their proposed framework leverages the capabilities of LLMs to enhance the output of multiple CAD networks, including diagnosis networks, lesion segmentation networks, and report generation networks. Their results show that ChatCAD achieved significant improvements under various measures compared with two other report-generation methods (R2GenCMN [55] and CvT2DistilGPT2 [56]). ChatCAD+ [57] is a multimodal system that addresses the writing-style mismatch between radiologists and LLMs. The system is designed to be universal and reliable, capable of handling medical images from diverse domains and providing trustworthy medical advice by leveraging up-to-date information from reputable medical websites. For such a complex task, LLMs clearly outperform PLMs by a wide margin.
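As a rough sketch of the image-to-text generation pattern underlying these systems (not the method of any cited paper), a general-domain vision-language captioner from Hugging Face can be conditioned on a text prefix; the checkpoint, the ‘‘findings:’’ prefix, and the file path are illustrative, and real report generators are trained on paired image-report data such as MIMIC-CXR.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# General-domain captioner as a stand-in for a medical report generator.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

image = Image.open("chest_xray.png").convert("RGB")  # hypothetical local file

# Conditional generation: the text prefix steers decoding toward a
# findings-style sentence rather than a generic web caption.
inputs = processor(image, "findings:", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True))
```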
2.7. Summary
Based on the information provided, we summarize the strengths and weaknesses of PLMs and LLMs by task in Table 1 and conclude the following points. For simpler fundamental tasks, the distinct advantages of LLMs are less apparent. However, as the complexity of advanced tasks increases, particularly those involving complex data conditions and requiring advanced semantic understanding and comprehensive generative capabilities, LLMs begin to demonstrate their strengths. Besides, LLMs play an integral role in specific sub-fields of Healthcare given enough further training, and attention is turning to the multimodal capabilities of LLMs, since Healthcare data inherently consist of text, images, and time-series data. By leveraging the strengths of LLMs, researchers and Healthcare professionals can harness the power of multiple modalities to improve diagnostic accuracy and patient care.
Beyond the accomplishments already discussed, several significant challenges remain for healthcare. A major obstacle is the complexity inherent in medical decision-making, which requires the incorporation of comprehensive patient information, including medical, psychological, and social aspects. While AI is proficient in analyzing data, it struggles with understanding complex human emotions and cultural nuances. This deficit is particularly evident in situations needing emotional support, such as during prolonged cancer care, where the empathetic engagement of healthcare professionals cannot be replicated by AI due to its inability to resonate emotionally.
Additionally, as AI becomes more embedded in healthcare, ethical and privacy issues intensify. Concerns about the handling of patient data, preserving privacy, and securing sensitive information are critical. Moreover, determining accountability in instances of diagnostic errors necessitates well-defined legal and ethical frameworks. Another concern is the unequal global distribution of technology, leading to a ‘‘digital divide’’. This divide risks leaving behind developing countries and economically disadvantaged areas, potentially worsening health disparities. AI also struggles with diseases characterized by ambiguous causes or intricate pathological processes. The effectiveness of AI is contingent on the extent of existing medical knowledge, and remains limited in fields that are not thoroughly understood. These challenges highlight the urgent need for collaborative efforts among professionals in healthcare, technology, law, and ethics globally to ensure that technological advancements are equitable, respectful of, and protective toward individual rights. Further discussion on these topics is available in Section 5.
3. From PLMs to LLMs for healthcare
Apart from the increasing model sizes, two significant developments from PLMs to LLMs are the transition from discriminative AI to generative AI and from model-centered to data-centered approaches. During the PLMs period, published PLMs were primarily evaluated on Natural Language Understanding (NLU) tasks, such as the aforementioned NER, RE, and TC. These studies are grouped as discriminative AI, which concentrates on classification or regression tasks instead of generation tasks. In contrast, generative AI generates new content, often requiring the model to understand existing data (e.g., textual instructions) before generating new content. The evaluation tasks of generative AI are usually QA and conversation tasks.
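The contrast can be made concrete with the Hugging Face transformers API; this is a minimal sketch, the checkpoints below are generic stand-ins, and the classification head is randomly initialized (it would still need fine-tuning on labeled data, as discussed in Section 2).

```python
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

# Discriminative use (PLM era): encode text, score a fixed label set.
clf_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # head is untrained until fine-tuned
logits = clf(**clf_tok("Chest pain radiating to the left arm.",
                       return_tensors="pt")).logits  # shape: (1, 2)

# Generative use (LLM era): condition on a textual instruction, generate text.
gen_tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in for an LLM
gen = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gen_tok("Question: What does 'MI' stand for in cardiology?\nAnswer:",
              return_tensors="pt").input_ids
out = gen.generate(ids, max_new_tokens=20, pad_token_id=gen_tok.eos_token_id)
print(gen_tok.decode(out[0], skip_special_tokens=True))
```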
The second perspective is the change from model-centered to data-centered. Before the rise of LLMs, previous research focused on improving neural architectures to enhance the encoding abilities of proposed models. As neural models became increasingly larger, the overparameterization strategy demonstrated promising abilities in learning the potential patterns contained in annotated datasets. Under such conditions, high-quality data play a more significant role in further enhancing various Healthcare applications. On the other hand, recent related developments present a multimodal trend, providing significant support for EHR data, medical images, and medical sequence signals. Based on powerful LLMs, more promising research and applications for Healthcare can be explored. Addressing the challenge of systematically collecting matched multimodal data therefore holds significant importance. For this reason, we list the detailed data usage and access links of each LLM in Section 3.2.
3.1. PLMs for healthcare
While our survey primarily concentrates on LLMs for Healthcare, it is important to acknowledge that previous studies on PLMs have played a foundational role in the development of LLMs. In this section, we sum up the key research points for Healthcare PLMs, namely (1) enhancing neural architectures, and (2) utilizing more efficient pre-training tasks. These two points will be compared with the distinct study focus of LLMs in Section 3.2, to further support the transition from discriminative AI to generative AI and from model-centered to data-centered.
Table 2. Summarization of training data and evaluation tasks for existing Healthcare PLMs. The different training methods are delineated with a solid line and the training data are further delineated with a dashed line.
| Model name | Base | Para. (B) | Training data | Eval task | Date | Link |
|---|---|---|---|---|---|---|
| BEHRT [58] | Transformer | — | CPRD, HES | Disease Prediction | 04/20 | Link |
| BioMegatron [59] | Megatron | 1.2 | PubMed | Biomedical NER, RE, QA | 10/20 | Link |
| PubMedBERT [60] | BERT | 0.11 | PubMed | BLURB | 01/21 | Link |
| Bio-ELECTRA-small [61] | ELECTRA | 0.03 | PubMed | Biomedical NER | 03/20 | — |
| BioELECTRA [62] | ELECTRA | 0.03 | PubMed, PMC | BLURB, BLUE | 06/21 | Link |
| AraBERT [63] | BERT | 0.11 | Arabic Wikipedia, OSIAN | Arabic SA, NER, QA | 03/21 | Link |
| FS-/RAD-/GER-BERT [64] | BERT | 0.11 | Unstructured radiology reports | Chest Radiograph Reports Classification | 07/20 | Link |
| VPP [65] | BART | 0.14 | PubMed | Biomedical NER | 03/23 | Link |
| BioBART [66] | BART | 0.14 | PubMed | Biomedical EL, NER, QA, Dialogue, Summarization | 04/22 | Link |
| BioLinkBERT [67] | BERT | 0.34 | PubMed | BLURB, USMLE | 03/22 | Link |
| ELECTRAMed [68] | ELECTRA | 0.11 | PubMed | Biomedical NER, RE, QA | 04/21 | Link |
| KeBioLM [69] | PubMedBERT | 0.11 | PubMed | BLURB | 04/21 | Link |
| BioFLAIR [70] | BERT | 0.34 | PubMed | Biomedical NER | 08/19 | Link |
| ouBioBERT [71] | BERT | 0.11 | PubMed, Wikipedia | BLUE | 02/21 | Link |
| SciFive [72] | T5 | 0.77 | PubMed, PMC | Biomedical NER, RE, NLI, QA | 05/21 | Link |
| BioBERT [73] | BERT | 0.11 | PubMed, PMC | Biomedical NER, RE, QA | 05/19 | Link |
| BioALBERT-ner [74] | ALBERT | 0.18 | PubMed, PMC | Biomedical NER | 09/20 | Link |
| GreenCovidSQuADBERT [75] | BERT | 0.34 | PubMed, PMC, CORD19 | NER, QA | 04/20 | Link |
| Bio-LM [76] | RoBERTa | 0.34 | PubMed, PMC, MIMIC-III | 18 biomedical NLP tasks | 11/20 | Link |
| BioALBERT [77] | ALBERT | 0.03 | PubMed, PMC, MIMIC-III | 6 BioNLP tasks | 04/22 | Link |
| BlueBERT [78] | BERT | 0.34 | PubMed, MIMIC-III | BLUE | 06/19 | Link |
| ClinicalBERT [79] | BERT | 0.11 | MIMIC-III | Hospital Readmission Prediction | 11/20 | Link |
| Clinical XLNet [80] | XLNet | 0.11 | MIMIC-III | PMV, Mortality | 11/20 | Link |
| MIMIC-BERT [81] | BERT | 0.34 | MIMIC-III | Biomedical NER | 08/19 | — |
| UmlsBERT [82] | BERT | 0.11 | MIMIC-III | MedNLI, i2b2 2006, 2010, 2012, 2014 | 06/21 | Link |
| CharacterBERT [81] | BERT | 0.11 | MIMIC-III, OpenWebText, PMC | Medical NER, NLI, RE, SS | 10/20 | Link |
| Clinical KB-ALBERT [82] | ALBERT | 0.03 | MIMIC-III, UMLS | MedNLI, i2b2 2010, 2012 | 12/20 | Link |
| MedGPT [81] | GPT-2 | 1.5 | MIMIC-III, private EHRs | Disorder Prediction | 07/21 | — |
| KAD [83] | BERT | — | MIMIC-CXR | PadChest, ChestXray14, CheXpert, ChestX-Det10 | 03/23 | Link |
| Japanese-BERT [84] | BERT | 0.11 | Japanese EHR | Symptoms Classification | 07/20 | — |
| MC-BERT [85] | BERT | 0.11 | Chinese EHR | Chinese Biomedical Evaluation benchmark | 08/20 | Link |
| BERT-EHR [86] | BERT | — | General EHR | Myocardial Infarction, Breast Cancer, Liver Cirrhosis | 03/21 | Link |
| Med-BERT [87] | BERT | 0.11 | General EHR | Disease prediction | 05/21 | Link |
| SAPBERT [88] | BERT | 0.11 | UMLS | MEL | 10/22 | Link |
| CODER [89] | mBERT | 0.34 | UMLS | MCSM, Medical RE | 02/22 | Link |
| AlphaBERT [90] | BERT | 0.11 | Discharge diagnoses | Extractive Summarization Task | 04/20 | Link |
| BioMed-RoBERTa [91] | RoBERTa | 0.11 | BIOMED | CHEMPROT, RCT | 05/20 | Link |
| RadBERT [92] | BERT | 0.11 | Radiology Report Corpus | Report Coding, Summarization | 05/20 | Link |
| BioBERTpt [93] | BERT | — | Private clinical notes, WMT16 | SemClinBr | 11/20 | Link |
| RoBERTa-MIMIC [94] | RoBERTa | 0.11 | MIMIC-III | i2b2 2010, 2012, n2c2 2018 | 12/20 | Link |
| CHMBERT [95] | BERT | 0.11 | Medical text data | Disease Prediction | 01/21 | Link |
| Galén [96] | RoBERTa | 0.11 | Private clinical cases | CodiEsp-D, CodiEsp-P, Cantemist-Coding tasks | 05/21 | Link |
| Spanish-bert [97] | BERT | — | Spanish data | Spanish Clinical Case Corpus | 04/20 | — |
| French-BERT [98] | BERT | 0.11 | French clinical documents | DEFT challenge | 06/20 | — |
| ABioNER [99] | BERT | 0.11 | Arabic scientific literature | Arabic NER | 03/21 | — |
| SINA-BERT [100] | BERT | 0.11 | Online Persian sources | Persian QA, SA | 04/21 | — |
| CT-BERT [101] | BERT | 0.11 | Tweets | COVID-19 Text Classification | 05/20 | Link |
| MentalBERT [45] | BERT | 0.11 | Reddit | Depression, Stress, Suicide Detection | 10/21 | Link |
Note: PMV means prolonged mechanical ventilation prediction. NER means Named Entity Recognition, NLI means Natural Language Inference, RE means Relation Extraction, SS means Sentence Similarity. MCSM means medical conceptual similarity measure [102]. MEL means medical entity linking. EL means Entity Linking. For clarity, we only list representative evaluation tasks. In the Para. (B) column, only the largest size is listed.
• Public Knowledge Bases. There exist many Healthcare-related knowledge bases, such as UMLS [103], CMeKG [104], BioModels [105], and DrugBank [106]. Among them, UMLS is one of the most popular, which is a repository of biomedical vocabularies developed by the US National Library of Medicine. The UMLS has over 2 million names for 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts. Based on this structured data, USMLE is organized and usually employed to test Healthcare LLMs. CMeKG [104] is a Chinese medical knowledge graph that has been constructed by referring to authoritative international medical standards and a wide range of sources, including clinical guidelines, industry standards, and medical textbooks. This knowledge graph serves as a comprehensive resource for medical information. Building upon the CMeKG, HuaTuo [107] utilizes diverse instructional data for its instruction tuning process.
• Data for Instruction Fine-Tuning. The aforementioned data typically consist of general text that is commonly used for pretraining PLMs or LLMs. However, when transitioning from PLMs to LLMs, instruction data becomes crucial to equip LLMs with the capability of following instructions effectively. Unlike PLMs, which primarily focus on next-word prediction, LLMs place greater emphasis on responding to specific instructions. By leveraging a sufficient amount of instruction data for fine-tuning, an LLM can appropriately generate the desired output. This emphasizes the importance of instruction-based training for LLMs to achieve accurate and contextually relevant responses (a minimal, synthetic example of such an instruction record is sketched after this list).
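The sketch below shows one synthetic instruction record; the Alpaca-style instruction/input/output schema and the rendering template are common but illustrative choices, and the exact format varies across the models in Table 3.

```python
import json

# One synthetic instruction-tuning record (Alpaca-style schema). Real medical
# SFT corpora contain many thousands of such records, often derived from
# QA pairs, dialogues, or knowledge graphs.
record = {
    "instruction": "Answer the patient's question briefly and state when "
                   "professional care is needed.",
    "input": "I have had a dry cough and a low fever for three days. "
             "Should I be worried?",
    "output": "A short-lived dry cough with low fever is usually viral. "
              "Rest and fluids are reasonable, but seek medical care if you "
              "develop breathing difficulty, chest pain, or a high fever "
              "lasting more than three days.",
}

# During SFT, each record is rendered into a single training string; the
# template below is one common, illustrative choice.
prompt = (f"### Instruction:\n{record['instruction']}\n\n"
          f"### Input:\n{record['input']}\n\n### Response:\n")
print(prompt + record["output"])
print(json.dumps(record)[:60], "...")
```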
For Healthcare PLMs, as shown in Table 2, a majority of the models utilize the discriminative approach, predominantly built upon the BERT architecture. The rationale behind this architectural choice is evident: many typical Healthcare applications are classification tasks. These tasks range from NER in the biomedical domain to more specific challenges such as disease prediction and relation extraction. In addition, fine-tuning (FT) stands out as the prevalent training methodology. This trend suggests a broader implication: while general pretrained models offer a foundational grasp of language, they require refinement through domain-specific data to excel in Healthcare applications. The choice of training datasets provides further support for the models' intent of achieving a holistic understanding of the medical domain.
Table 3. Summarization of training data and evaluation tasks for existing Healthcare LLMs. The different training methods are delineated with a solid line and the training data are further delineated with a dashed line. The colored names represent popular evaluation datasets. More detailed performance comparisons are shown in Table 4.
| Model name | Method | Training data | Evaluation datasets or tasks | Date | Link |
|---|---|---|---|---|---|
| GatorTron [108] | PT | Clinical notes | CNER, MRE, MQA | 06/22 | Link |
| GatorTronGPT [109] | PT | Clinical and general text | PubMedQA, USMLE, MedMCQA, DDI, BC5CDR | 05/23 | Link |
| Galactica [110] | PT+SFT | DNA, AA sequences | MedMCQA, PubMedQA, Medical Genetics | 11/22 | Link |
| Me LLaMA [111] | PT+SFT | PubMed, MIMIC-III, MIMIC-IV, MIMIC-CXR | MIBE benchmark [111] | 04/24 | Link |
| MedChatZH [112] | PT+SFT | Textbooks, medical and general instructions | WebMedQA | 09/23 | Link |
| BioMistral [113] | PT+SFT | PubMed Central data | MMLU, USMLE, MedMCQA, PubMedQA | 02/24 | Link |
| Visual Med-Alpaca [114] | PT+SFT | Medical QA | — | 04/23 | Link |
| Apollo [115] | PT+SFT | Books, clinical guidelines, encyclopedias | XMedBench | 03/24 | Link |
| CancerLLM [116] | PT+SFT | Clinical notes, pathology reports | Cancer Diagnosis Generation, Cancer Phenotype Extraction | 06/24 | — |
| MedAlpaca [117] | SFT | Medical QA and dialogues | USMLE, Medical Meadow | 04/23 | Link |
| BenTsao [107] | SFT | Medical QA, medical knowledge graph | Customized medical QA | 04/23 | Link |
| BianQue [118] | SFT | Medical QA | — | 04/23 | Link |
| Med-PaLM 2 [1] | SFT | Medical QA | MultiMedQA, Long-form QA | 05/23 | — |
| SoulChat [13] | SFT | Empathetic dialogue, long text | — | 06/23 | Link |
| ChatDoctor [50] | SFT | Patient-doctor dialogues | iCliniq | 03/23 | Link |
| DoctorGLM [119] | SFT | Chinese medical dialogues | — | 04/23 | Link |
| OncoGPT [120] | SFT | Oncology conversations | Oncology Question Answering | 02/24 | Link |
| HuatuoGPT [11] | SFT | Conversation data and instructions | cMedQA, webMedQA, and Huatuo-26M | 05/23 | Link |
| Med-PaLM [121] | SFT | Medical data | MultiMedQA, HealthSearchQA | 12/22 | — |
| PMC-LLaMA [122] | SFT | Biomedical academic papers | PubMedQA, MedMCQA, USMLE | 04/23 | Link |
| HealAI [123] | SFT | Medical note data, instruction data | Medical Note Writing | 03/24 | — |
| BiMediX [124] | SFT | 1.3 million English-Arabic dataset | An Arabic-English benchmark | 02/24 | Link |
| Medical mT5 [125] | SFT | Multilingual medical corpus | Sequence Labeling, QA | 04/24 | Link |
| EpiSemoGPT [126] | SFT | Related publications | Predicting epileptogenic zones | 05/24 | — |
| MedAGI [10] | SFT | Public medical datasets and images | SkinGPT-4, XrayChat, PathologyChat | 06/23 | Link |
| Med-Flamingo [8] | SFT | Image-caption/token pairs | VQA-RAD, Path-VQA, Visual USMLE | 07/23 | Link |
| LLaVA-Med [9] | SFT | Multimodal biomedical instructions | VQA-RAD, SLAKE, PathVQA | 06/23 | Link |
| OphGLM [12] | SFT | Fundus images, knowledge graphs | Fundus diagnosis pipeline tasks [12] | 06/23 | Link |
| LLM-CXR [127] | SFT | MIMIC-CXR | Report generation, VQA, CXR generation | 05/23 | Link |
| JMLR [128] | SFT | MIMIC-IV dataset, medical textbooks, PubMed | USMLE, Amboss, MedMCQA, MMLU-Medical | 02/24 | Link |
| ClinicalGPT [129] | SFT+RLHF | Medical dialogues and QA, EHR | MedDialog, MEDQA-MCMLE, MD-EHR, cMedQA2 | 06/23 | — |
| Polaris [130] | SFT+RLHF | Proprietary healthcare data | Healthcare conversational tasks | 03/24 | — |
| Zhongjing [131] | PT+SFT+RLHF | Medical books, health records, clinical reports | CMtMedQA, Huatuo-26M | 08/23 | Link |
| Qilin-Med [132] | PT+SFT+DPO | Medical QA, plain texts, knowledge graphs | CMExam, CEval, Huatuo-26M | 04/24 | — |
| Aloe-Alpha [133] | PT+SFT+DPO | Medical QA, CoT, synthetic data | MultiMedQA, MedMCQA, USMLE, PubMedQA, etc. | 05/24 | — |
* means the study focuses on evaluating the Healthcare LLM, rather than proposing a new LLM. PT means pre-training, ICL means in-context learning (no parameters updated), SFT means supervised fine-tuning, RLHF means reinforcement learning from human feedback, and DPO means Direct Preference Optimization.
表 3: 现有医疗领域大语言模型的训练数据与评估任务汇总。实线区分不同训练方法,虚线进一步区分训练数据。颜色名称代表常用评估数据集,详细性能对比见表4。
模型名称 | 方法 | 训练数据 | 评估数据集/任务 | 日期 | 链接 |
---|---|---|---|---|---|
GatorTron[108] | PT | 临床记录 | CNER,MRE,MQA | 06/22 | Link |
GatorTronGPT [109] | PT | 临床与通用文本 | PubMedQA, USMLE, MedMCQA, DDI, BC5CDR | 05/23 | Link |
Galactica [110] | PT+SFT | DNA,氨基酸序列 | MedMCQA,PubMedQA,Medical Genetics | 11/22 | Link |
Me-LLaMA [111] | PT+SFT | PubMed, MIMIC-III, MIMIC-IV, MIMIC-CXR | MIBE benchmark [111] | 04/24 | Link |
MedChatZH [112] | PT+SFT | 教科书、医疗与通用指令 | WebMedQA | 09/23 | Link |
BioMistral [113] | PT+SFT | PubMed Central数据 | MMLU,USMLE, MedMCQA,PubMedQA | 02/24 | Link |
Visual Med-Alpaca [114] | PT+SFT | 医疗问答 | - | 04/23 | Link |
Apollo [115] | PT+SFT | 书籍、临床指南、百科全书 | XMedBench | 03/24 | Link |
CancerLLM [116] | PT+SFT | 临床记录、病理报告 | 癌症诊断生成、癌症表型提取 | 06/24 | - |
MedAlpaca [117] | SFT | 医疗问答与对话 | USMLE,Medical Meadow | 04/23 | Link |
BenTsao [107] | SFT | 医疗问答、医学知识图谱 | 定制医疗问答 | 04/23 | Link |
BianQue [118] | SFT | 医疗问答 | - | 04/23 | Link |
Med-PaLM 2 [1] | SFT | 医疗问答 | MultiMedQA,长文本问答 | 05/23 | - |
SoulChat [13] | SFT | 共情对话、长文本 | - | 06/23 | Link |
ChatDoctor[50] | SFT | 医患对话 | iCliniq | 03/23 | Link |
DoctorGLM [119] | SFT | 中文医疗对话 | - | 04/23 | Link |
OncoGPT [120] | SFT | 肿瘤学对话 | 肿瘤学问答 | 02/24 | Link |
HuatuoGPT [11] | SFT | 对话数据与指令 | CmedQA, webmedQA, Huatuo-26M | 05/23 | Link |
Med-PaLM [121] | SFT | 医疗数据 | MultiMedQA,HealthSearchQA | 12/22 | - |
PMC-LLaMA [122] | SFT | 生物医学学术论文 | PubMedQA,MedMCQA,USMLE | 04/23 | Link |
HealAI [123] | SFT | 医疗记录数据、指令数据 | 医疗记录撰写 | 03/24 | - |
BiMediX [124] | SFT | 130万英阿数据集 | 阿拉伯语-英语基准 | 02/24 | Link |
Medical mT5 [125] | SFT | 多语言医疗语料 | 序列标注,问答 | 04/24 | Link |
EpiSemoGPT [126] | SFT | 相关出版物 | 预测致痫区 | 05/24 | - |
MedAGI [10] | SFT | 公共医疗数据集与图像 | SkinGPT-4, XrayChat, PathologyChat | 06/23 | Link |
Med-Flamingo [8] | SFT | 图像-标题/token对 | VQA-RAD, Path-VQA, Visual USMLE | 07/23 | Link |
LLaVA-Med [9] | SFT | 多模态生物医学指令 | VQA-RAD,SLAKE,PathVQA | 06/23 | Link |
OphGLM [12] | SFT | 眼底图像、知识图谱 | 眼底诊断流程任务 [12] | 06/23 | Link |
LLM-CXR [127] | SFT | MIMIC-CXR | 报告生成,VQA,CXR生成 | 05/23 | Link |
JMLR [128] | SFT | MIMIC-IV数据集、医学教科书、PubMed | USMLE, Amboss, MedMCQA, MMLU-Medical | 02/24 | Link |
ClinicalGPT [129] | SFT+RLHF | 医疗对话与问答、电子健康记录 | MedDialog, MEDQA-MCMLE, MD-EHR, cMedQA2 | 06/23 | - |
Polaris [130] | SFT+RLHF | 专有医疗数据 | 医疗对话任务 | 03/24 | - |
Zhongjing [131] | PT+SFT+RLHF | 医学书籍、健康档案、临床报告 | CMtMedQA,Huatuo-26M | 08/23 | Link |
Qilin-Med [132] | PT+SFT+DPO | 医疗问答、纯文本、知识图谱 | CMExam, CEval, Huatuo-26M | 04/24 | - |
Aloe-Alpha [133] | PT+SFT+DPO | 医疗问答、思维链、合成数据 | MultiMedQA, MedMCQA, USMLE, PubMedQA等 | 05/24 | - |
* 表示研究侧重评估医疗大语言模型而非提出新模型。PT指预训练(pre-training),ICL指上下文学习(In-context-learning,不更新参数),SFT指监督微调(supervised fine-tuning),RLHF指人类反馈强化学习(reinforcement learning from human feedback),DPO指直接偏好优化(Direct Preference Optimization)。
Unlike PLMs, LLMs have the advantage of eliminating the need for FT and can directly perform inference on various downstream tasks. Moreover, the core research focus no longer revolves around improving neural architectures or developing more efficient pre-training tasks for Healthcare. Consequently, research on LLMs is garnering increased attention.
与PLM不同,大语言模型(LLM)具有无需微调(FT)即可直接在下游任务进行推理的优势。此外,其核心研究方向不再主要围绕改进神经网络架构或开发更高效的医疗领域预训练任务。因此,大语言模型研究正获得越来越多的关注。
3.2. LLMs for healthcare
3.2. 医疗领域的大语言模型
With the surge in general LLM studies, there has also been notable development of LLMs specifically tailored for Healthcare. In contrast to the emphasis on neural architecture designs and pre-training tasks in previous PLM research, studies on LLMs for Healthcare place greater emphasis on collecting diverse, precise, and professional Healthcare data, as well as on data security and privacy protection. In the following sections, we present an overview and analysis of published Healthcare LLMs. For convenience, we have compiled the pertinent information in Tables 3 and 5. We categorize current LLMs based on their training methods, training data, evaluation, and distinct features, and offer detailed comparisons. Table 4 presents a performance summary for the three most popular datasets used to evaluate Healthcare LLMs, enabling more straightforward comparisons and offering a clear perspective on the current capabilities of leading Healthcare LLMs.
随着通用大语言模型研究的激增,专门针对医疗健康领域定制的大语言模型也取得了显著进展。与以往预训练语言模型(PLM)研究侧重于神经架构设计和预训练任务不同,医疗健康领域的大语言模型研究更强调多样化、精准且专业的医疗数据收集,以及数据安全与隐私保护。以下章节我们将对已发布的医疗健康大语言模型进行概述与分析。为便于查阅,我们已将相关信息整理至表3和表5。我们根据训练方法、训练数据、评估指标和特色功能对现有大语言模型进行分类,并提供详细对比。表4汇总了评估医疗健康大语言模型最常用的三个数据集的性能表现,旨在提供更直观的对比基准,同时清晰展现当前优秀医疗健康大语言模型的实际能力。
Table 4 The performance summarization for different Healthcare LLMs on three popular datasets.
(%) | USMLE | MedMCQA | PubMedQA |
FT BERT | 44.62 [67] | 43.03 [60] | 72.20 [67]
Galactica | 44.60 | 52.90 | 77.60
PMC-LLaMA | 44.70 | 50.54 | 69.50
GatorTronGPT | 42.90 | 45.10 | 77.60
DoctorGLM | 67.60 | — | —
MedAlpaca | 60.20 | — | —
Codex | 60.20 | 62.70 | 78.20
Med-PaLM | 67.60 | 57.60 | 79.00
Aloe-Alpha | 71.01 | 64.47 | 80.20 |
Med-PaLM 2 | 86.50 | 72.30 | 81.80 |
GPT-4 | 86.70 | 73.66 | 80.40 |
Human | 87.00 | 90.00 | 78.00 |
表 4 不同医疗大语言模型在三个流行数据集上的性能汇总 (%)
(%) | USMLE | MedMCQA | PubMedQA |
---|---|---|---|
FT BERT | 44.62 [67] | 43.03 [60] | 72.20 [67] |
Galactica | 44.60 | 52.90 | 77.60 |
PMC-LLaMA | 44.70 | 50.54 | 69.50 |
GatorTronGPT | 42.90 | 45.10 | 77.60 |
DoctorGLM | 67.60 | — | — |
MedAlpaca | 60.20 | — | — |
Codex | 60.20 | 62.70 | 78.20 |
Med-PaLM | 67.60 | 57.60 | 79.00 |
Aloe-Alpha | 71.01 | 64.47 | 80.20 |
Med-PaLM 2 | 86.50 | 72.30 | 81.80 |
GPT-4 | 86.70 | 73.66 | 80.40 |
Human | 87.00 | 90.00 | 78.00 |
• Different Training Methods. Unlike PLMs, the strategy of training LLMs from scratch is not popular for Healthcare LLMs. GatorTron [108] and GatorTronGPT [109] are the only two Healthcare LLMs trained from scratch with pre-training (PT) alone. One reason is that acquiring and properly anonymizing medical data for training involves navigating complex legal and ethical issues. Additionally, due to the specialized nature of medical data and the high demands for accuracy, training a model from scratch requires substantial computational resources and an extremely large healthcare corpus, which is more expensive than for general LLMs. Compared with PLMs, which require fewer parameters and less training data, the significance of the PT method is declining.
• 不同训练方法。与预训练语言模型(PLM)不同,从头开始训练大语言模型的策略在医疗领域并不常见。GatorTron [108]和GatorTronGPT [109]是仅有的两个仅通过预训练(PT)从头训练的医疗大语言模型。原因之一在于获取并妥善匿名化医疗训练数据需要处理复杂的法律和伦理问题。此外,由于医疗数据的专业性和对准确性的高要求,从头训练模型需要大量计算资源和海量医疗文本,其成本远高于通用大语言模型。相比参数更少、训练数据量要求更低的预训练语言模型,预训练方法的重要性正在下降。
Table 5 Brief summarization of existing LLMs for Healthcare, sorted in chronological order of publication.
Model name | Size (B) | Features |
GatorTron [108] | 8.9 | Training from scratch |
Galactica [110] | 120 | Reasoning, Multidisciplinary |
Med-PaLM [121] | 540 | CoT, Self-consistency |
ChatDoctor [50] | 7 | Retrieve online, External knowledge |
DoctorGLM [119] | 6 | Extra prompt designer |
MedAlpaca [117] | 13 | Adapt to medicine |
BenTsao [107] | 7 | Knowledge graph |
PMC-LLaMA [122] | 7 | Adapt to medicine |
Visual Med-Alpaca [114] | 7 | Multimodal generative model, Self-Instruct |
BianQue [118] | 6 | Chain of Questioning |
Med-PaLM 2 [1] | 340 | Ensemble refinement, CoT, Self-consistency |
GatorTronGPT [109] | 20 | Training from scratch for medicine |
LLM-CXR [127] | 3 | Multimodal, Chest X-rays |
HuatuoGPT [11] | 7 | Reinforcement learning from AI feedback |
ClinicalGPT [129] | 7 | Multi-round dialogue consultations |
MedAGI [10] | — | Multimodal |
LLaVA-Med [9] | 13 | Multimodal, Self-instruct, Curriculum learning |
OphGLM [12] | 6 | Multimodal, Ophthalmology LLM |
SoulChat [13] | 6 | Mental Healthcare |
Med-Flamingo [8] | 80 | Multimodal, Few-shot medical VQA |
Zhongjing [131] | 13 | Multi-turn Chinese medical dialogue |
MedChatZH [112] | 7 | Traditional Chinese Medicine, Bilingual |
JMLR [128] | 13 | RAG, LLM-Rank loss |
BioMistral [113] | 7 | Multilingual, Model merging emphasis |
BiMediX [124] | 47 | English and Arabic language |
OncoGPT [120] | 7 | Real-world doctor-patient oncology dialogue |
Polaris [130] | — | Several specialized support agents |
HealAI [123] | 540 | RAG, Interactive Editing |
Apollo [115] | 7 | Multilingual, Lightweight, Proxy tuning |
Medical mT5 [125] | 3 | Multilingual |
Qilin-Med [132] | 7 | Domain-specific pre-training, RAG |
Me-LLaMA [111] | 70 | Catastrophic forgetting mitigation |
EpiSemoGPT [126] | 7 | Predicting epileptogenic zones |
Aloe-Alpha [133] | 8 | Synthetic CoT |
CancerLLM [116] | 7 | Specifically for cancer |
表 5: 现有医疗领域大语言模型的简要总结。按发布时间排序。
模型名称 | 参数量 (B) | 主要特性 |
---|---|---|
GatorTron [108] | 8.9 | 从头训练 |
Galactica [110] | 120 | 推理能力,多学科 |
Med-PaLM [121] | 540 | 思维链(CoT),自洽性 |
ChatDoctor [50] | 7 | 在线检索,外部知识 |
DoctorGLM [119] | 6 | 额外提示设计器 |
MedAlpaca [117] | 13 | 医学领域适配 |
BenTsao [107] | 7 | 知识图谱 |
PMC-LLaMA [122] | 7 | 医学领域适配 |
Visual Med-Alpaca [114] | 7 | 多模态生成模型,自指令 |
BianQue [118] | 6 | 问题链 |
Med-PaLM 2 [1] | 340 | 集成优化,思维链,自洽性 |
GatorTronGPT [109] | 20 | 医学领域从头训练 |
LLM-CXR [127] | 3 | 多模态,胸部X光 |
HuatuoGPT [11] | 7 | AI反馈强化学习 |
ClinicalGPT [129] | 7 | 多轮问诊对话 |
MedAGI [10] | - | 多模态 |
LLaVA-Med [9] | 13 | 多模态,自指令,课程学习 |
OphGLM [12] | 6 | 多模态,眼科大模型 |
SoulChat [13] | 6 | 心理健康护理 |
Med-Flamingo [8] | 80 | 多模态,少样本医疗问答 |
Zhongjing [131] | 13 | 中文多轮医疗对话 |
MedChatZH [112] | 7 | 中医,双语支持 |
JMLR [128] | 13 | 检索增强生成(RAG),排序损失 |
BioMistral [113] | 7 | 多语言,模型融合优化 |
BiMediX [124] | 47 | 英语和阿拉伯语 |
OncoGPT [120] | 7 | 真实医患肿瘤对话 |
Polaris [130] | - | 多个专业支持智能体 |
HealAI [123] | 540 | 检索增强生成,交互式编辑 |
Apollo [115] | 7 | 多语言,轻量化,代理调优 |
Medical mT5 [125] | 3 | 多语言 |
Qilin-Med [132] | 7 | 领域预训练,检索增强生成 |
Me-LLaMA [111] | 70 | 缓解灾难性遗忘 |
EpiSemoGPT [126] | 7 | 癫痫灶预测 |
Aloe-Alpha [133] | 8 | 合成思维链 |
CancerLLM [116] | 7 | 癌症专项 |
Besides PT, the prevalent method for adapting a general LLM into a Healthcare LLM is SFT. As shown in Table 3, 21 LLM studies use only SFT to tune their models. In addition, Galactica, Me-LLaMA, MedChatZH, BioMistral, Visual Med-Alpaca, and Apollo employ a two-step training process, namely PT first and then SFT. Among these models, Galactica [110] is an early-stage study that demonstrated the effectiveness of SFT. This LLM is designed to handle the information overload in the scientific domain, including Healthcare. JMLR [128] introduces a method that enhances medical reasoning and question answering by integrating the SFT training method with information retrieval systems during the fine-tuning phase. This approach not only improves the model's ability to utilize medical knowledge effectively but also significantly cuts down on computational resources. Remarkably, JMLR required only 148 GPU hours for training. MedAlpaca [117] addresses privacy concerns in healthcare by employing an open-source policy for on-site implementation, and employs LoRA [148] for task-specific weight updates.
除了PT之外,将通用大语言模型适配为医疗大语言模型的常用方法还包括SFT。如表3所示,有21项大语言模型研究仅使用SFT进行模型调优。此外,Galactica、Me LLaMA、MedChatZH、BioMistral、Visual Med-Alpaca和Apollo采用了两阶段训练流程,即先进行PT再进行SFT。其中Galactica [110]作为早期研究验证了SFT的有效性,该模型专为应对科学领域(包括医疗健康)的信息过载问题而设计。JMLR [128]提出了一种在微调阶段结合SFT训练方法与信息检索系统的方案,显著提升了医学推理和问答能力,不仅优化了模型对医学知识的运用效率,还大幅降低了计算资源消耗——其训练仅需148个GPU小时。MedAlpaca [117]通过采用开源策略实现本地化部署以解决医疗隐私问题,并利用LoRA [148]进行任务特定的权重更新。
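As a concrete illustration of this adapter-based tuning, the sketch below shows a minimal LoRA setup with the Hugging Face PEFT library. The base checkpoint name, rank, and target modules are illustrative assumptions for a LLaMA-7B-class model, not MedAlpaca's published configuration.

```python
# A minimal LoRA sketch (assumed hyperparameters, not MedAlpaca's exact setup).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("llama-7b")  # hypothetical local checkpoint

# Inject low-rank adapters into the attention projections; only these
# small matrices are trained, so the 7B base weights stay frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because only the adapter weights are updated, such tuning fits on modest on-site hardware, which is precisely what makes it attractive for privacy-sensitive clinical deployments.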
Further, the studies [129–132] use multiple advanced training technologies. Among them, Zhongjing [131] is a groundbreaking Chinese medical LLM that integrates PT, SFT, and RLHF to enhance the handling of multi-turn medical dialogues, particularly in Chinese medicine. Qilin-Med [132] is also a Chinese medical LLM, enhanced through a multi-stage training methodology including domain-specific PT, SFT, DPO, and Retrieval-Augmented Generation (RAG).
此外,研究[129–132]采用了多种先进的训练技术。其中,Zhongjing[131]是一款开创性的中文医疗大语言模型,整合了PT(预训练)、SFT(监督微调)和RLHF(人类反馈强化学习)技术,提升了多轮医疗对话(尤其是中医场景)的处理能力。Qilin-Med[132]同样是通过多阶段训练方法增强的中文医疗大语言模型,其训练流程包含领域特定PT、SFT、DPO(直接偏好优化)以及检索增强生成(RAG)技术。
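For readers unfamiliar with the preference-optimization stage of such pipelines, the following sketch writes out the standard DPO objective under our own variable naming; it is a generic formulation, not code released with Qilin-Med.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: each argument is a tensor of summed token
    log-probabilities for the preferred (chosen) or dispreferred (rejected)
    response under the trainable policy or the frozen reference model;
    beta controls how far the policy may drift from the reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Unlike RLHF, this loss needs no separately trained reward model, which partly explains its growing adoption among the models in Table 3.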
• Different Training Data. Diverse and high-quality data is one of the core components of Healthcare LLMs. In the PLM era, plain text dominated the corpora used to pre-train language models with the next-word prediction task. When it comes to Healthcare LLMs, QA pairs and dialogues become one of the more important data types, as shown in Lines 12 to 20 in Table 3. This is because LLMs already possess strong linguistic skills, as well as some degree of extra knowledge about the specifics of each domain, which attenuates the need to use specialized domain data for next-word prediction tasks. More importantly, by using QA pairs and dialogues to construct instruction data (a minimal construction sketch follows this bullet), SFT can inject domain knowledge while enhancing the model's instruction compliance. Besides, some multimodal data (Lines 27 to 30) and structured Electronic Health Record (EHR) databases (Lines 31 to 32) are also commonly used for SFT, constituting other important training data. We can see a trend of synchronization between the different training methods and the training data. More details about training data can be found in Section 4.2.
• 不同的训练数据。多样化和高质量的数据是医疗大语言模型的核心部分之一。在预训练语言模型(PLM)时代,纯文本主导了以"下一个词预测"任务为主的训练语料。而对于医疗大语言模型,问答对(QA pairs)和对话数据(如表3第12-20行所示)成为更重要的数据类型。这是因为大语言模型已具备强大的语言能力,并对各领域专业知识有一定程度的掌握,从而降低了对专业领域数据进行"下一个词预测"任务的需求。更具竞争力的是,通过使用问答对和对话构建指令数据,监督微调(SFT)可以在增强模型指令遵循能力的同时注入领域知识。此外,一些多模态数据(第27-30行)和结构化电子健康记录(EHR)数据库(第31-32行)也常被用于SFT,这些同样是重要的训练数据。我们可以看到不同训练方法与训练数据之间存在同步发展趋势。更多训练数据细节详见第4.2节。
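To make the instruction-construction step concrete, here is a minimal sketch of converting raw QA pairs into SFT records. The template wording and field names follow the common Alpaca-style convention and are illustrative, not taken from any specific Healthcare LLM's release.

```python
def qa_to_instruction(question: str, answer: str) -> dict:
    """Wrap a raw medical QA pair in an Alpaca-style instruction record."""
    return {
        "instruction": "Answer the following medical question accurately and concisely.",
        "input": question,
        "output": answer,
    }

# Illustrative example; real corpora hold thousands of such pairs.
record = qa_to_instruction(
    "What are common early symptoms of type 2 diabetes?",
    "Increased thirst, frequent urination, fatigue, and blurred vision.",
)
```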
• Different Evaluation. Firstly, we investigate studies that focus on evaluating general LLMs on Healthcare tasks and categorize them into four groups: medical examination, medical question answering, medical generation, and medical comprehensive evaluation, as summarized in Table 6. Medical examination involves verifying model performance through standard medical tests or examinations. In contrast, medical question answering utilizes questions posed or collected by human experts to make assessments. Medical generation focuses on generating new medical descriptions or knowledge based on a given input. Studies on medical comprehensive evaluation aim to provide assessments across various application scenarios rather than focusing on a single aspect. From the conclusions of these studies, we generally find that performance on specific tasks is satisfactory, while more concerns are raised about non-technological aspects, such as robustness, bias, and ethics. We discuss these aspects further in Section 5.
• 差异化评估。首先,我们调研了聚焦于医疗任务的大语言模型 (LLM) 评估工作,并将其归纳为四类:医疗考试评估、医疗问答评估、医疗生成评估及医疗综合评估(详见表6)。医疗考试评估通过标准化医学测试验证模型性能;医疗问答评估采用专家提出或收集的问题进行评估;医疗生成评估侧重基于给定输入生成新的医疗描述或知识;医疗综合评估研究旨在提供跨场景的综合评估而非单一维度。这些研究普遍显示:特定任务性能表现良好,但更多担忧集中在非技术层面(如鲁棒性、偏见和伦理问题),我们将在第5节进一步探讨。
Secondly, we summarize the evaluation sections of studies that propose Healthcare LLMs. For example, in Healthcare-related assessments, Galactica notably surpassed previous benchmarks with 77.6% on PubMedQA and achieved 52.9% on MedMCQA. JMLR achieves 72.8% accuracy on the MMLU-Medical dataset and 65.5% on the MedMCQA dataset, surpassing Meditron-70B and Llama2-13B with RAG, which scored 68.9% and 54.9% respectively.
其次,我们汇总了提出医疗大语言模型 (Healthcare LLM) 的研究中的评估部分。例如,在医疗相关评估中,Galactica 以 PubMedQA 77.6% 和 MedMCQA 52.9% 的成绩显著超越先前基准。JMLR 在 MMLU-Medical 数据集上达到 72.8% 准确率,在 MedMCQA 数据集上达到 65.5%,超越了采用 RAG 的 Meditron-70B (68.9%) 和 Llama2-13B (54.9%)。
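Accuracies like these are typically exact-match scores over multiple-choice questions; the sketch below shows the generic evaluation loop under that assumption, with `ask_llm` standing in for any model call rather than a specific benchmark harness.

```python
def mcqa_accuracy(examples, ask_llm):
    """Score a model on multiple-choice QA by exact match on the option letter.

    Each example is assumed to carry a question, an options dict such as
    {"A": ..., "B": ...}, and a gold answer letter; ask_llm is any callable
    mapping a prompt string to a model response string.
    """
    correct = 0
    for ex in examples:
        options = "\n".join(f"({k}) {v}" for k, v in ex["options"].items())
        prompt = f"{ex['question']}\n{options}\nAnswer with a single letter."
        prediction = ask_llm(prompt).strip().upper()[:1]  # keep first letter only
        correct += prediction == ex["answer"]
    return correct / len(examples)
```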
Zhongjing [131] was evaluated using the CMtMedQA test set for multi-turn dialogues and Huatuo-26M for single-turn dialogues, focusing on three main dimensions: safety, professionalism, and fluency. Results show that Zhongjing excels in complex dialogue interactions, surpassing existing models like HuatuoGPT in these aspects by leveraging its diverse training approach. Qilin-Med achieved accuracies of 38.4% and 40.0% in the PT and SFT phases respectively on the CMExam test set. The integration of the RAG approach further enhanced its accuracy to 42.8% on CMExam. These advancements highlight Qilin-Med's capability to generate precise and contextually accurate responses, setting new benchmarks for medical LLMs, particularly in Chinese medical applications.
Zhongjing [131] 在 CMtMedQA-test 上评估了多轮对话能力,在 Huatuo-26M 上评估了单轮对话能力,重点关注安全性、专业性和流畅性三个维度。结果表明,得益于多样化的训练方法,Zhongjing 在复杂对话交互中表现优异,在这些方面超越了 HuatuoGPT 等现有模型。Qilin-Med 在 CMExam 测试集上的 PT 和 SFT 阶段分别达到了 38.4% 和 40.0% 的准确率。结合 RAG 方法后,其在 CMExam 上的准确率进一步提升至 42.8%。这些进展凸显了 Qilin-Med 在生成精准且符合语境的响应方面的能力,为医疗大语言模型(尤其是中文医疗应用)树立了新标杆。
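The RAG step behind that accuracy gain follows a standard retrieve-then-read pattern; the sketch below is a generic rendering with placeholder retriever and generator components, not Qilin-Med's actual implementation.

```python
def rag_answer(question, retriever, generate, top_k=3):
    """Generic retrieve-then-read loop: ground the prompt in retrieved
    medical passages before generation. `retriever.search` and `generate`
    are hypothetical interfaces used purely for illustration."""
    passages = retriever.search(question, top_k=top_k)
    context = "\n\n".join(p["text"] for p in passages)
    prompt = (
        "Use the following medical references to answer.\n"
        f"References:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```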
In summary, by integrating the various training methods detailed in Table 3, we identify several overarching trends regarding the impact of different technologies on performance: (1) PT alone does not ensure high performance in LLMs; (2) SFT proves to be more crucial, with RLHF and DPO becoming increasingly important; (3) techniques that reduce model size tend to result in some loss of performance.
总结来说,通过整合表3中详述的各种训练方法,我们发现了关于不同技术对性能影响的几个总体趋势:(1) 仅靠预训练(PT)并不能确保大语言模型的高性能;(2) 监督微调(SFT)被证明更为关键,而基于人类反馈的强化学习(RLHF)和直接偏好优化(DPO)正变得越来越重要;(3) 减小模型规模的技术往往会导致性能的某些损失。
• Different Features. Further, we discuss LLMs in terms of model size, language, and modality. Model size is a crucial measure because it directly impacts the model's representation capabilities, generalization capacity, and the computational resources and training time required. We divide LLMs into three groups: extremely large (>70B), very large (13B-70B), and large (1B-12B). In this paper, 7/36 Healthcare LLMs are extremely large, 7/36 are very large, and 19/36 are large. Med-PaLM [121] and HealAI [123] are the two largest Healthcare LLMs, each with 540B parameters. Med-PaLM utilizes instruction prompt tuning to adapt LLMs to new domains with a few exemplars. This approach employs a shared soft prompt across multiple datasets, followed by a task-specific human-engineered prompt. Leveraging this extremely large size, Med-PaLM is evaluated on a 12-aspect benchmark with satisfactory results. For example, answers from Med-PaLM and clinicians aligned with scientific consensus 92.6% and 92.9% of the time, respectively. Further, HealAI is based on Med-PaLM; however, no further details about its development are available. Med-PaLM 2 [1] is the second largest Healthcare LLM, with 340B parameters. Despite its smaller size compared to the original PaLM's 540B parameters, Med-PaLM 2 outperforms its predecessor [1]. Long-form answers from Med-PaLM 2 are evaluated against various quality criteria and are often preferred over those from physicians and the original Med-PaLM model. Med-PaLM 2 also introduces ensemble refinement in its prompting strategy, enhancing answer accuracy by generating multiple reasoning paths to refine the final response. Besides Med-PaLM 2, Galactica and Me-LLaMA [111] also offer models with more than 100B parameters. It should be noted that some smaller LLMs already outperform larger ones in general domains. This trend has not yet extended to Healthcare, but we anticipate that in the near future, smaller Healthcare LLMs will surpass the performance of older, larger models.
• 不同特征。此外,我们从模型规模、语言和多模态特征角度讨论大语言模型。模型规模是关键指标,直接影响模型的表征能力、泛化能力以及所需的计算资源和训练时间。我们将大语言模型分为三组:超大型(>70B)、特大型(13B-70B)和大型(1B-12B)。本文中,7/36的医疗大语言模型属于超大型,7/36为特大型,19/36为大型。Med-PaLM [121]和HealAI [123]是参数规模最大的两个医疗大语言模型,均达到540B参数。Med-PaLM采用指令提示调优技术,通过少量示例使大语言模型适应新领域。该方法在多个数据集上使用共享软提示,再结合特定任务的人工设计提示。基于其超大规模,Med-PaLM在12项基准测试中取得满意结果,例如Med-PaLM与临床医生的回答符合科学共识的比例分别达到92.6%和92.9%。HealAI基于Med-PaLM构建,但未披露更多开发细节。Med-PaLM 2 [1]是第二大医疗大语言模型,参数为340B。尽管相比原始PaLM的540B参数规模更小,但其性能优于前代[1]。Med-PaLM 2的长篇回答在多项质量评估中常优于医师和原始Med-PaLM的输出。该模型还在提示策略中引入集成优化方法,通过生成多重推理路径来提升最终回答的准确性。除Med-PaLM 2外,Galactica和Me-LLaMA [111]也拥有超过100B参数的模型。值得注意的是,在通用领域已有较小模型超越较大模型的现象,这一趋势尚未延伸至医疗领域,但我们预计未来较小的医疗大语言模型将超越早期大型模型的性能。
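Ensemble refinement builds on self-consistency-style sampling. As a minimal sketch of that underlying idea (not Google's actual prompting code), one can sample several stochastic reasoning paths and keep the majority answer:

```python
from collections import Counter

def self_consistent_answer(question, sample_once, n_samples=11):
    """Self-consistency baseline: draw several stochastic reasoning paths
    and return the most common final answer. `sample_once` is a
    hypothetical callable returning (reasoning_text, final_answer)."""
    finals = [sample_once(question)[1] for _ in range(n_samples)]
    return Counter(finals).most_common(1)[0][0]
```

Med-PaLM 2's ensemble refinement reportedly goes a step further, conditioning a final generation on the sampled reasoning paths rather than taking a simple vote.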
Table 6 The Healthcare evaluation of LLMs.
Categories | Studies | Models | Scenarios | #Num | Conclusions |
Medical Ex. | [134] | ChatGPT | Primary Care | 674 | Average performance of ChatGPT is below the mean passing mark in the last 2 years. |
[135] | ChatGPT | Medical licensure | 220 | ChatGPT performs at the level of a third-year medical student. |
[136] | ChatGPT | Medical licensure | 376 | ChatGPT performs at or near the passing threshold. |
Medical Q&A | [137] | ChatGPT | Physician queries | 284 | ChatGPT generates largely accurate information to diverse medical queries. |
[138] | ChatGPT, GPT-4, Bard, BLOOMZ | Radiation oncology | 100 | Each LLM generally outperforms the non-expert humans, while only GPT-4 outperforms the medical physicists. |
[41] | ChatGPT, Claude | Patient-specific EHR | — | Both models are able to provide accurate, relevant, and comprehensive answers. |
[139] | ChatGPT | Bariatric surgery | 151 | ChatGPT usually provides accurate and reproducible responses to common questions related to bariatric surgery. |
[140] | ChatGPT | Genetics questions | 85 | ChatGPT does not perform significantly differently than human respondents. |
[141] | ChatGPT | Fertility counseling | 17 | ChatGPT could produce relevant, meaningful responses to fertility-related clinical queries. |
[142] | GPT-3.5, GPT-4 | General surgery | 280 | GPT-3.5 and, in particular, GPT-4 exhibit a remarkable ability to understand complex surgical clinical information. |
[143] | GPT-3.5, GPT-4 | Dementia diagnosis | 981 | GPT-3.5 and GPT-4 cannot outperform traditional AI tools in dementia diagnosis and prediction tasks. |
Medical Gen. | [144] | ChatGPT | Gastroenterology | 20 | ChatGPT would generate relevant and clear research questions, but not original ones. |
[145] | ChatGPT, GPT-4 | Radiology report | 138 | ChatGPT performs well and GPT-4 can significantly improve the quality. |
[146] | ChatGPT | Benchmark tasks | 34.4K | Zero-shot ChatGPT outperforms the state-of-the-art fine-tuned models in datasets that have smaller training sets. |
Medical Ce. | [147] | ChatGPT | Clinical and research | — | ChatGPT could potentially exhibit biases or be susceptible to misuse. |
The Healthcare evaluation of LLMs includes Medical examination (Ex.), medical question answering (Q&A), medical generation (Gen.), and medical comprehensive evaluation (Ce.).
表 6: 大语言模型在医疗领域的评估
类别 | 研究 | 模型 | 场景 | 数量 | 结论 |
---|---|---|---|---|---|
医学考试 | [134] | ChatGPT | 初级护理 | 674 | ChatGPT的平均表现低于近两年的平均及格线 |
医学考试 | [135] | ChatGPT | 医疗执照考试 | 220 | ChatGPT达到医学院三年级学生水平 |
[136] | ChatGPT | 医疗执照考试 | 376 | ChatGPT表现接近及格线 | |
医学问答 | [137] | ChatGPT | 医师咨询 | 284 | ChatGPT能为多样化医疗问题生成基本准确的信息 |
[138] | ChatGPT,GPT-4,Bard,BLOOMZ | 放射肿瘤学 | 100 | 各模型普遍优于非专业人士,仅GPT-4超越医学物理师 | |
[41] | ChatGPT, Claude | 患者特定电子健康档案 | - | 两个模型都能提供准确、相关且全面的答案 | |
[139] | ChatGPT | 减肥手术 | 151 | 通常能对减肥手术相关问题给出准确且可复现的回答 | |
[140] | ChatGPT | 遗传学问题 | 85 | 表现与人类受访者无显著差异 | |
[141] | ChatGPT | 生育咨询 | 17 | 能对生育相关临床问题给出有意义的相关回答 | |
[142] | GPT-3.5, GPT-4 | 普通外科 | 280 | 展现出理解复杂外科临床信息的卓越能力 | |
[143] | GPT-3.5,GPT-4 | 痴呆症诊断 | 981 | 在痴呆诊断预测任务中无法超越传统AI工具 | |
医学生成 | [144] | ChatGPT | 胃肠病学 | 20 | 能生成相关清晰的研究问题,但缺乏原创性 |
[145] | ChatGPT,GPT-4 | 放射学报告 | 138 | ChatGPT表现良好,GPT-4能显著提升质量 | |
[146] | ChatGPT | 基准测试 | 34.4K | 零样本ChatGPT在小型训练集数据上超越微调模型 | |
医学综合评估 | [147] | ChatGPT | 临床研究 | - | 可能表现出偏见或易被滥用 |
大语言模型医疗评估涵盖医学考试(Ex.)、医疗问答(Q&A)、医疗文本生成(Gen.)和医疗综合评估(Ce.)四大领域。
In the realm of language, English LLMs are predominantly mainstream. Following English, the second largest group of LLMs is designed for Chinese. BianQue, HuatuoGPT, BenTsao, SoulChat, DoctorGLM, MedChatZH, Zhongjing, and Qilin-Med are Chinese Healthcare LLMs. Among them, DoctorGLM is a pioneering Chinese LLM focusing on cost-effective medical applications. DoctorGLM's training utilized the ChatDoctor dataset, translating its medical dialogues using the ChatGPT API. Besides the above LLMs, there are also multilingual models, such as Apollo and Medical mT5.
在语言领域,英语大语言模型 (LLM) 占据主流地位。紧随其后的是中文大语言模型,包括 BianQue、华佗GPT (HuatuoGPT)、本草 (BenTsao)、灵心对话 (SoulChat)、DoctorGLM、MedChatZH、仲景 (Zhongjing) 和麒麟-医疗 (Qilin-Med) 等中文医疗大语言模型。其中,DoctorGLM 是开创性的中文大语言模型,专注于高性价比医疗应用,其训练使用了 ChatDoctor 数据集,并通过 ChatGPT API 翻译医疗对话内容。除上述模型外,还存在多语言模型,例如 Apollo 和 Medical mT5。
Besides the above features, multimodal ability is another important development branch, as medical data inherently consist of diverse modalities such as patient medical records, radiographic images, and physiological signals. By integrating varied data types, multimodal models can enhance the understanding of complex medical conditions from multiple dimensions, enabling more accurate interpretations and diagnoses. For example, Visual Med-Alpaca [114] is a LLaMA-7B-based open-source biomedical model that handles multimodal tasks by integrating medical ‘‘visual experts’’. It was trained using a collaboratively curated instruction set from GPT-3.5-Turbo and human experts, incorporating visual modules and instruction-tuning for tasks like radiological image interpretation and complex clinical inquiries. OphGLM [12] is a multimodal model tailored for ophthalmic applications, integrating visual capabilities alongside language processing. It was developed starting from fundus images, creating a pipeline for disease assessment, diagnosis, and lesion segmentation.
除了上述特性外,多模态能力是另一个重要的发展方向,因为医疗数据本质上包含多种模态,如患者病历、放射影像和生理信号等。通过整合不同类型的数据,多模态模型可以从多个维度增强对复杂医疗状况的理解,从而实现更准确的解读和诊断。例如,Visual Med-Alpaca [114] 是一个基于LLaMa-7B的开源生物医学模型,通过整合医学"视觉专家"来处理多模态任务。该模型使用GPT-3.5-Turbo和人类专家协作整理的指令集进行训练,包含视觉模块和指令微调,用于放射影像解读和复杂临床查询等任务。OphGLM [12] 是一款专为眼科应用定制的多模态模型,将视觉能力与语言处理相结合。该模型从眼底图像开发起步,构建了疾病评估、诊断和病灶分割的流程。
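The training data for such models typically pairs an image with a textual instruction and a target response. A minimal, purely illustrative record might look as follows; the field names and clinical wording are assumptions, not drawn from any released dataset.

```python
# Illustrative multimodal instruction record for visual SFT.
record = {
    "image": "fundus_0001.png",  # hypothetical fundus photograph
    "instruction": "Describe any visible retinal abnormalities in this image.",
    "output": "The fundus shows scattered microaneurysms and dot hemorrhages, "
              "consistent with mild non-proliferative diabetic retinopathy.",
}
```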
3.3. Summary
3.3. 总结
In this section, we present an overview of existing PLMs and LLMs in the Healthcare domain, highlighting their respective research focuses. Furthermore, we provide a comprehensive analysis of the performance of Healthcare LLMs on benchmark datasets such as USMLE, MedMCQA, and PubMedQA, as shown in Table 4. The intention behind this analysis is to showcase the progress in Healthcare QA development and offer a clear comparison between different Healthcare LLMs. In conclusion, two of the most robust LLMs identified in this analysis are Med-PaLM 2 and GPT-4. It is important to note that while GPT-4 is a general-purpose LLM, Med-PaLM 2 is specifically tailored for the Healthcare domain.