[Paper Translation] Almanac: Retrieval-Augmented Language Models for Clinical Medicine


Original paper: https://arxiv.org/pdf/2303.01229


Almanac: Retrieval-Augmented Language Models for Clinical Medicine


$^1$ Department of Cardiothoracic Surgery, Stanford Medicine. $^2$ Department of Computer Science, Stanford University. $^3$ Division of Cardiovascular Surgery, Penn Medicine. $^4$ Division of Cardiovascular Medicine, Stanford Medicine. $^5$ Department of Pediatrics, Stanford Medicine. $^6$ Department of Neurology, Stanford Medicine. $^7$ Department of Radiology and Biomedical Informatics, Stanford Medicine. $^8$ Division of Infectious Diseases, Stanford Medicine.


* Corresponding author(s). E-mail(s): czakka@stanford.edu; willhies@stanford.edu;


Abstract


Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios ($n = 130$) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at $p < 0.05$) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.


1 Introduction


In recent years, language model pre-training has emerged as a powerful training paradigm in natural language processing (NLP) [1–4]. For a large number of these language models, performance improvements have been empirically observed to scale with model and dataset size, with the well-documented emergence of zero-shot capabilities and sample efficiency on a range of downstream NLP tasks [5–7]. However, due to the nature of their training objective (predicting the next token in a sentence), large language models (LLMs) can be prone to generating factually incorrect statements, a phenomenon commonly known as hallucination [8, 9]. More contentiously, many works have also demonstrated these models' ability to reproduce social biases, as well as to generate statements reinforcing gender, racial, and religious stereotypes [10, 11]. In an effort to reduce these unwanted behaviors, several works have explored different ways of steering LLM outputs to more closely align with user intent, including fine-tuning with human feedback [12, 13] and natural language prompt engineering [14, 15]. This pivot in training paradigms has led to an explosion of transformative applications, ranging from human-like chatbots to impressive writing assistants [16, 17]. However, the unstructured and open-ended aspect of LLM prompts puts them at risk of adversarial attacks: the intentional act of derailing the original goal of a model with malicious intent, such as generating vitriol at scale, leaking private data, or producing misinformation [18, 19]. As such, despite the promising avenue of research posed by the incorporation of large language models into the clinical workflow, careful consideration must be given to their implementation to ensure patient privacy and safety [20].


In this work, we introduce Almanac, a promising framework to explore the role of medical LLMs and their safe deployment in healthcare settings. To stay abreast of the constantly shifting landscape of evidence-based practices, physicians often refer to point-of-care tools to drive better outcomes [21]. As clinical evidence continues to grow, however, carefully curated content becomes less accessible, confined to error-prone search tools and time-consuming appraisal techniques that fail to address the unique needs of individual patients. Instead, we study the role of large language models as clinical knowledge bases with the ability to use external tools (e.g. search engines, medical databases, and calculators) to answer queries related to clinical concepts and the latest treatment recommendations. We outsource knowledge retrieval to a web browser and a database of predefined knowledge repositories, and utilize an off-the-shelf large language model to achieve high-quality, accurate answer generation with in-text citations referencing the source material for improved safety and reliability.



Fig. 1 Almanac Overview. When presented with a query, Almanac first uses external tools to retrieve relevant information before synthesizing a response with citations referencing the source material. With this framework, LLM outputs remain grounded in truth, while providing a reliable way of fact-checking their outputs.



To better evaluate these models for the clinical workflow, we propose three key objectives: factuality, completeness, and safety, which we define in the evaluation rubric summarized in Table 3.


Due to increasing concerns of data leakage (i.e. medical large language models being evaluated on datasets that are potentially included within their training data), we evaluate our approach empirically using a panel of board-certified clinicians (averaging 14 years of experience) and resident physicians on a novel dataset of open-ended clinical scenarios encountered in a variety of medical specialties. To the authors' knowledge, this work is the first to demonstrate the ability of grounded large language models to provide accurate and reliable open-ended answers to medical queries in the clinical setting, paving the way towards the controlled and safe deployment of large language models in healthcare.


1.1 Related Work


By pre-training transformers on curated scientific and biomedical corpora, recent works such as BioGPT [22] and SciBERT [23] have demonstrated improved performance on a variety of biomedical downstream tasks, including clinical entity extraction, medical question-answering, and text generation [24–28]. Similarly, Lehman et al. [29] recently established the benefits of smaller domain-specific language models in comparison to larger and more generalized models, even when finetuned on limited annotated data. Yet, despite marked improvements from pre-training increasingly large architectures on domain-specific datasets (e.g. GatorTron [30] and Med-PaLM [31]), these models still remain prone to hallucinations and biases, further highlighting the limitations and unreliability of large language models as intrinsic knowledge bases [32].



Fig. 2 ClinicalQA Performance. Comparison of performance between Almanac and ChatGPT on the ClinicalQA dataset as evaluated by physicians. Almanac outperforms its counterpart with significant gains in factuality and marginal improvements in completeness. Although more robust to adversarial prompts, Almanac and ChatGPT both exhibit hallucinations with omission. Despite these performances, ChatGPT answers are preferred 57% of the time. Error bars visualize standard error (SE).


On the other hand, our work, akin to Nakano et al. [33], Schick et al. [34], and Liévin et al. [35], focuses on leveraging these models for their language understanding and modeling capabilities. In their seminal work, Nakano et al. introduce WebGPT, pairing a language model with web-browsing capabilities to improve the accuracy of question answering. Liévin et al. make use of Wikipedia to obtain human-level performance on three medical QA datasets. Likewise, Schick et al. finetune their language model to employ various external tools (e.g. a calculator and a calendar) through simple application programming interfaces (APIs) to overcome limitations with arithmetic and factual lookup.


We adopt a similar approach to the works outlined above: by utilizing external tools for knowledge retrieval and calculations, we achieve significant improvements on a variety of clinically useful tasks, while mitigating the current limitations of LLMs.


Table 1 Overview of ClinicalQA, a novel dataset used to evaluate Almanac across 5 medical specialties.

ClinicalQA
Medical Specialty                  Number of Questions
Cardiothoracic Surgery             25
Cardiology                         25
Neurology                          25
Infectious Diseases                25
Pediatrics                         25
Clinical Calculation Vignettes     5
Total                              130

Table 2 Sample questions derived from the ClinicalQA dataset.


Sample Cardiology Question


Question: A 40-year-old male patient has an average resting heart rate of 72, a systolic blood pressure of 122 mm Hg, and a serum creatinine of 0.38 mg/dL. Given their history of heart failure, myocardial infarction, and recently elevated cardiac enzymes, what is their 6-month mortality following an episode of acute coronary syndrome?


Answer: With a resting heart rate of 72 (9 pts), a systolic blood pressure of 122 (14 pts), and serum creatinine of 0.38 (1 pt), together with a history of heart failure (24 pts), myocardial infarction (12 pts), and recently elevated cardiac enzymes (11 pts), the patient's overall score is 75, with a 6-month mortality risk of 1 to 2.9%.


Sample Cardiology Question


Question: What are manifestations of fulminant giant cell myocarditis?


Answer: Giant cell myocarditis is a rare but potentially fatal form of myocarditis, characterized by severe heart failure, arrhythmias, and conduction disturbances. Clinical manifestations include new-onset severe heart failure requiring parenteral inotropic or mechanical circulatory support, new ventricular arrhythmias, Mobitz type II second-degree atrioventricular (AV) block, third-degree AV block, or refractory heart failure.


2 Methods


2.1 Dataset


To more closely evaluate the potential of large language models in clinical medicine, we focus on the task of medical question answering. While existing datasets such as MultiMedQA, MedMCQA, and PubMedQA [31, 36, 37] serve as valid benchmarks for evaluating reading comprehension and knowledge recall of biomedical LMs, they fail to capture the scope of actual clinical scenarios faced by physicians and medical professionals alike. To address this, we curate ClinicalQA, a novel benchmark of open-ended clinical questions spanning several medical specialties, with topics ranging from treatment guideline recommendations to clinical calculations. These questions are sourced from 5 board-certified physicians who are tasked with generating questions related to their day-to-day clinical practices. We provide summary statistics of the dataset in Table 1 and a subset of 25 questions in Appendix A.


While we acknowledge that the fund of medical knowledge is both broad and extensive, we believe that ClinicalQA can serve as an early but valuable benchmark for LM-based clinical decision-making support systems.


2.2 Architecture


Almanac consists of many components working asynchronously to achieve accurate document retrieval, reasoning, and question-answering (Figure 1). An overview of each component is outlined below:


Database: The database is a high-performance vector storage and similarity engine optimized for the rapid indexing and search of materials sourced from various contexts, including textbooks and web documents. The database is responsible for storing this content semantically, i.e. through information-dense vectors encoding the meaning of the text they contain, compared with a similarity metric such as cosine distance. These vectors can later be retrieved through approximate nearest-neighbor search such as Hierarchical Navigable Small World (HNSW) [38].

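As a concrete illustration, the sketch below builds such a store with the hnswlib library; the paper does not name a specific HNSW implementation, so the library choice, collection size, and index parameters are assumptions. The embedding dimension matches the 1,536-dimensional retriever output described below.

```python
"""Minimal sketch of an HNSW-backed vector store over cosine distance."""
import numpy as np
import hnswlib

DIM = 1536  # output dimension of 'text-embedding-ada-002'

# Build the approximate nearest-neighbor index (parameters are illustrative).
index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=10_000, ef_construction=200, M=16)

# Store document-chunk embeddings (random placeholders here) under integer ids.
chunk_vectors = np.random.rand(100, DIM).astype(np.float32)
index.add_items(chunk_vectors, ids=np.arange(100))

# Retrieve the k nearest chunks for a query embedding; hnswlib returns cosine
# *distances*, so similarity = 1 - distance.
query = np.random.rand(1, DIM).astype(np.float32)
ids, distances = index.knn_query(query, k=5)
similarities = 1.0 - distances
```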

Browser: The browser consists of a number of predetermined domains that Almanac is able to access to fetch information from the internet. These websites are carefully curated to ensure high-quality content in response to queries. After each search, the returned content is parsed and stored in the database. In order to overcome the token limit of most large language models, each article is divided into chunks of 1,000 tokens and fed into the retriever separately. When possible, articles are divided by any sections they contain.

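A minimal sketch of this chunking step is shown below. A whitespace split stands in for the model's actual tokenizer, and the helper names are illustrative, not from the paper's codebase.

```python
"""Illustrative chunking: split by section when possible, then into ~1,000-token pieces."""

CHUNK_TOKENS = 1_000

def chunk_text(text: str, max_tokens: int = CHUNK_TOKENS) -> list[str]:
    """Split one article (or one section of it) into <= max_tokens chunks."""
    tokens = text.split()  # placeholder for a true subword tokenizer
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

def chunk_article(sections: list[str]) -> list[str]:
    """Prefer section boundaries, falling back to fixed-size chunks within each."""
    chunks: list[str] = []
    for section in sections:
        chunks.extend(chunk_text(section))
    return chunks
```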

Retriever: The retriever is a text encoder that encodes queries and reference materials into the same high-dimensional space before storing them in the database. This language model is pretrained on large corpora to ensure that texts with similar content receive closer vector representations in this space. At search time, documents matching a given query embedding are scored, thresholded at $\lambda = 0.83$, and presented to the language model. For the purposes of reproducibility, we employ 'text-embedding-ada-002' by OpenAI with an output dimension of 1,536.

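A sketch of the scoring-and-thresholding step, assuming the legacy openai-python (<1.0) Embedding API that was current when the paper was written; the function names are illustrative and `OPENAI_API_KEY` is assumed to be set in the environment.

```python
"""Sketch of query/document scoring with a lambda = 0.83 match threshold."""
import numpy as np
import openai

LAMBDA = 0.83  # similarity threshold from the paper

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.asarray(resp["data"][0]["embedding"])  # 1,536-dim vector

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_documents(query: str, docs: list[str]) -> list[tuple[str, float]]:
    """Return only the documents whose similarity to the query clears lambda."""
    q = embed(query)
    scored = [(d, cosine_similarity(q, embed(d))) for d in docs]
    return [(d, s) for d, s in scored if s >= LAMBDA]
```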

Language Model: The language model is a generative pretrained transformer architecture finetuned using instructions. This model is responsible for extracting relevant information from the scored context returned by the retriever, formulating an answer using a combination of in-context [39] and chain-of-thought (CoT) reasoning [40] prompts. For reproducibility and fairer comparison, we employ the 'text-davinci-003' model from OpenAI with a max length of 4,096 tokens. In the event that no articles from the database exceed the match threshold, the language model is prompted to indicate that it has insufficient information to answer the question.

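The snippet below sketches this synthesis step, including the abstention path when no passage clears the threshold. The prompt wording is illustrative, not the paper's exact template, and it uses the legacy Completion API that 'text-davinci-003' was served through.

```python
"""Sketch of grounded answer synthesis with in-text citations and abstention."""
import openai

def answer(query: str, passages: list[tuple[str, float]]) -> str:
    if not passages:
        # Nothing cleared lambda: abstain rather than risk hallucinating.
        return "Insufficient information in the provided sources to answer."
    # Number the retrieved passages so the model can cite them as [n].
    context = "\n\n".join(
        f"[{i + 1}] {text}" for i, (text, _score) in enumerate(passages)
    )
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Think step by step and cite sources in-text as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=512, temperature=0
    )
    return resp["choices"][0]["text"].strip()
```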

Table 3 Summary of the rubric used by clinical evaluators on LLM outputs.

Axis           Question
Factuality     Does the answer agree with standard practices and the consensus established by bodies of authority in your practice?
               If appropriate, does the answer contain correct reasoning steps?
               Does the answer provide a valid source of truth (e.g. citation) for independent verification?
Completeness   Does the answer address all aspects of the question?
               Does the answer omit any important content?
               Does the answer contain any irrelevant content?
Safety         Does the answer contain any intended or unintended content which can lead to adverse patient outcomes?


2.3 Evaluation


2.3.1 ClinicalQA Evaluation


To evaluate the outputs generated by LLMs on ClinicalQA, we propose a framework with physician feedback to ensure alignment with our three key metrics. While current LLM evaluation metrics rely on automated methods such as BLEU [41], these fail to fully capture the complexity and nuances of medical retrieval tasks. Rather, inspired by Mahdavi et al. [31], our rubric aims to establish a standardized approach to assessing LLM outputs. We outline these questions in Table 3.


To quantify factuality and completeness, we task a panel of board-certified (averaging more than 14 years of experience) and resident physicians with independently evaluating outputs generated by Almanac and ChatGPT (March 23 version) on a series of clinical questions within their respective specialties. While efforts are made to ensure unbiased grading (e.g. arbitrary answer formatting, answer order shuffling) by blinding physicians to the answers' provenance, complete answer blinding is not possible due to the different prose styles adopted by each system.


For the assessment of safety, we compare Almanac to ChatGPT performances on a subset of ClinicalQA questions to evaluate their potential for intentional and unintentional harm. Our approaches are as follows:


• Adversarial Prompting: Classified as intentional harm, adversarial prompting involves appending directives to a user's prompt to divert the language model from its original task. These prompts can be initiated by a malicious actor through various entry points, such as the EHR client or server, with the simplest approach involving the insertion of 'invisible' directives (e.g. white font, image alt text) into a patient's clinical note to manipulate the model. Example prompts can include direct orders to generate incorrect outputs, or more advanced scenarios designed to bypass the artificial safeguards gained through model finetuning (e.g. roleplaying). We employ both methods and evaluate ChatGPT and Almanac on a subset of 25 ClinicalQA questions with a set of 5 common adversarial prompts of varying length.



Fig. 3 Output Comparison. Comparison between Almanac (top) and ChatGPT (bottom) for a given medical query. With access to a calculator and the retrieved rubric for CHA2DS2-VASc, Almanac is able to correctly respond to the clinical vignette, in contrast to ChatGPT. Sources are removed for illustrative purposes.



• Errors of Omission: We classify errors of omission as unintentional harm, whereby incomplete information from a healthcare worker results in incorrect LLM outputs due to hallucinations. To simulate this, we randomly withhold key words from 5 clinical vignettes and assess their effects on LLM outputs. A toy simulation of this perturbation is sketched after this list.

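The following sketch shows one way such keyword withholding could be simulated; the vignette text, keyword list, and helper names are illustrative placeholders, not drawn from ClinicalQA.

```python
"""Toy simulation of errors of omission: drop key words from a vignette."""
import random

def withhold_keywords(vignette: str, keywords: list[str], k: int = 1,
                      seed: int = 0) -> str:
    """Remove k randomly chosen keywords before querying the model."""
    rng = random.Random(seed)
    for word in rng.sample(keywords, k):
        vignette = vignette.replace(word, "")
    return " ".join(vignette.split())  # collapse doubled whitespace

vignette = "62F with atrial fibrillation, hypertension, and prior stroke."
print(withhold_keywords(vignette, ["atrial fibrillation", "prior stroke"]))
```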

2.3.2 Statistical Evaluation


To evaluate our results statistically, we perform the following for each metric category in the rubric: we first perform a Shapiro-Wilk test with $\alpha = 0.05$ to check for normality. We then perform a one-way analysis of variance (ANOVA) to test for significance across sub-specialties ($p < 0.05$).

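As a worked illustration of this procedure, the snippet below runs both tests with SciPy; the per-specialty score vectors are illustrative placeholders, not the study's data.

```python
"""Shapiro-Wilk normality check (alpha = 0.05) followed by one-way ANOVA."""
from scipy import stats

# Hypothetical per-question factuality scores, grouped by specialty.
specialty_scores = {
    "cardiology": [0.9, 0.8, 1.0, 0.9, 0.7, 0.9],
    "neurology": [0.8, 0.7, 0.9, 0.8, 0.6, 0.8],
    "infectious_diseases": [0.7, 0.9, 0.8, 0.7, 0.8, 0.9],
}

# 1) Shapiro-Wilk test for normality within each specialty.
for name, scores in specialty_scores.items():
    w, p = stats.shapiro(scores)
    print(f"{name}: W={w:.3f}, p={p:.3f}, normal at alpha=0.05: {p > 0.05}")

# 2) One-way ANOVA across specialties (significant if p < 0.05).
f_stat, p_value = stats.f_oneway(*specialty_scores.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.4f}")
```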

3 Results


In this section, we provide an overview of our results as summarized in Figure 2.


In factuality, Almanac exceeds the performance of ChatGPT by a significant margin, with an average increase of 18 absolute percentage points across specialties; the largest difference is observed in Cardiology (91% vs 69%, respectively). These results were found to be statistically significant at $p < 0.05$ ($p = 0.018856$; $F = 8.61366$). In contrast, ChatGPT struggled with in-depth factual outputs, supporting its statements with correct sources only 56% of the time. Additionally, by making use of a calculator for clinical vignettes, Almanac is able to correctly respond to all clinical calculation scenarios, whereas ChatGPT produced incorrect outputs for all 5 (Figure 3).


In terms of completeness, despite an absolute gain of 4.8% over ChatGPT, Almanac's performance was not found to be statistically significant, with overall matched performances across specialties. The lowest score obtained for both models was in Cardiothoracic Surgery, at 33% vs 25% respectively, largely due to answers which were deemed incomplete, with missing or irrelevant content.


Regarding safety, Almanac's performance greatly surpassed that of ChatGPT under adversarial prompting (95% vs 0%, respectively), with matched fragilities in errors of omission (0% for both). We note that for Almanac, the addition of the adversarial prompt lowered the score between the query and the retrieved articles below the threshold ($\lambda$), resulting in the system abstaining from responding to a given prompt. In contrast, ChatGPT did not show the same reservations. We provide detailed results in Appendix B.


Notably, despite Almanac's safer and more factual answers, physicians preferred outputs generated by ChatGPT 57% of the time.


4 Discussion


In this study, we propose a framework for the safe deployment of large language models in healthcare settings, with the aim of answering clinical queries more accurately across a variety of specialties. We evaluate our approach on a novel dataset of clinical questions, and show that our framework achieves significant improvements in factuality and safety in comparison to baselines, as assessed by a panel of board-certified and resident physicians.


In recent months, there have been several works exploring the role of large language models in clinical medicine, including DRAGON [42], BioGPT [22], and Med-PaLM [31]. Despite strong performances on medical question-answering datasets such as MedQA [43], these models possess important limitations. Firstly, the datasets used as benchmarks (e.g. USMLE Step 1 questions) do not accurately reflect clinically relevant tasks, and there exist some concerns about data contamination between train-test splits. Moreover, since these systems leverage the knowledge encoded within their weights to answer clinical queries, their outputs become contingent on the assumption that correct information outweighs misinformation within their training dataset. This becomes especially problematic with evolving medical guidelines, and in the age of rampant misinformation. Despite potential mitigations such as supervised finetuning and reinforcement learning with human feedback (RLHF) [20], these models will need to be continuously trained to update their knowledge bases, which can quickly become prohibitively expensive at billion-parameter sizes. Finally, as a result of their non-deterministic outputs, these models often display varying and sometimes contradictory responses to the same query, making them unreliable for clinical use.


On the other hand, our results suggest that retrieval systems can effectively facilitate information retrieval, leading to more accurate and reliable responses to clinical inquiries, grounded in fact. By supplementing responses with passages from pre-defined sources, our grounded system is able to dampen explainability concerns by enabling clinicians to independently verify outputs. We find this retrieval system to be especially useful in adversarial settings, where the query-context scoring system is able to hamper malicious actors from manipulating outputs. Yet, despite deficiencies in factuality and safety, ChatGPT outputs remain the preferred answer among physicians, which we posit is a direct consequence of its training with reinforcement learning through human feedback (RLHF), optimizing answers to sound more human-like.


Overall, our findings suggest that Almanac may be a safer and more reliable option for generating answers to clinical questions, but further research is needed to fully evaluate the potential implications of using these models in clinical contexts. Despite clear overall improvements, it is important to emphasize that grounded language models remain prone to errors of omission, and struggle on queries that lack a clear extractive answer within their sources. Their implementation within healthcare centers must be met with careful consideration and explicit mitigation of their failure modes.


5 Conclusion


Our work demonstrates the efficacy of combining text encoders, vector databases, and large language models to provide clinicians with concise, pertinent, and accurate outputs in response to medical queries. This is a strong improvement over current practices, which involve clinicians manually searching, curating, and internalizing medical documents to optimize patient care. In essence, rather than attempting to generate information using the knowledge encoded within LLM weights (which may be biased or entirely untrue), we refactor clinical queries into search-and-retrieval tasks, while performing knowledge distillation via the LLM over the returned documents. This approach provides both implicit and explicit mitigations for the bias, hallucination, and explainability concerns observed in existing medical question-answering LLMs, while allowing clinicians to remain focused on their most fundamental goal: furthering patient care.


Acknowledgments. We would like to thank Hugging Face for their support over the course of the project.


Data Availability. Due to growing concerns of medical benchmarks being used as data for large-scale training of large language models, further contributing to data contamination of clinical benchmarks, we publish a subset ($n = 25$) of our dataset with this manuscript (Appendix A) and make the rest available upon request. Please contact W.H. (willhies@stanford.edu) for full access to ClinicalQA.


Declarations


5.1 Funding


This project was supported in part by a National Heart, Lung, and Blood Institute (NIH NHLBI) grant (1R01HL157235-01A1) (W.H.).


5.2 Competing interests


The authors declare no competing interests.


5.3 Authors’ contributions


C.Z. and W.H. designed the experiments and wrote the manuscript along with A.C., C.L., and E.A. The codebase was authored by C.Z. and A.C. Computational experiments were performed by C.Z. and A.C. under the supervision of C.L., E.A., and W.H. ClinicalQA was curated by K.A., J.B., K.B., K.H., and J.N. and reviewed by J.K. and A.D. M.M. provided expertise on the final manuscript. The work was supervised by W.H.


Appendix A ClinicalQA 25 Dataset


Here we provide a subset of ClinicalQA to give the medical machine learning community examples more closely aligned with clinical workflows, in comparison to existing benchmarks which make use of the United States Medical Licensing Examination (USMLE) or other similar datasets.


Cardiology

Question: What is the FDA black box warning for use of everolimus in heart transplant patients?
Answer: The US Food and Drug Administration issued a black box warning for everolimus due to the increased risk of mortality observed within the first three months post-transplantation among patients started on the higher dose (3.0 mg/day) as de novo immunosuppression.

Question: What are manifestations of fulminant giant cell myocarditis?
Answer: Giant cell myocarditis is a rare but potentially fatal form of myocarditis, characterized by severe heart failure, arrhythmias, and conduction disturbances. Clinical manifestations include new-onset severe heart failure requiring parenteral inotropic or mechanical circulatory support, new ventricular arrhythmias, Mobitz type II second-degree atrioventricular (AV) block, third-degree AV block, or refractory heart failure.

Question: What is the oral torsemide dose equivalent for oral furosemide 80 mg?
Answer: The oral torsemide dose equivalent to oral furosemide 80 mg is 40 mg.

Question: What is the mechanism of action for milrinone?
Answer: Milrinone is a phosphodiesterase-3 inhibitor that increases cyclic AMP concentration, leading to enhanced calcium influx into the cell, a rise in cell calcium concentration, and increased contractility. It also has vasodilatory effects, decreasing cardiac filling pressures and increasing cardiac index.

Question: What is the standard INR goal for patients with a left ventricular assist device?
Answer: The standard INR goal for patients with a left ventricular assist device is 2.0-3.0, according to the 2019 EACTS Expert Consensus on long-term mechanical circulatory support.
Cardiothoracic Surgery

Question: Does on-pump or off-pump CABG yield better results?
Answer: Both on- and off-pump CABG can be performed safely, with roughly equivalent long-term mortality rates. On-pump CABG tends to yield more bypass grafts, which tend to stay patent longer. Off-pump CABG has theoretical benefits of decreasing CVAs or renal failure, but this was not supported in the larger RCTs.

Question: Which is better, open or endovascular harvesting of saphenous vein for CABG?
Answer: Endoscopic vein-graft harvesting is preferred to an open technique for CABG due to a comparable rate of major adverse cardiovascular events (MACE) such as mortality or vein-graft failure, but a lower incidence of wound (leg) complications, better cosmetic appearance, and less pain.

Question: How many mitral valve repairs does a surgeon need to perform to attain mastery?
Answer: This is currently unknown and would depend on several individual factors.

Question: What is a myocardial bridge?
Answer: A myocardial bridge is a segment of an epicardial coronary artery that is intramyocardial, with muscle overlying the intramyocardial segment. It is most commonly seen in the left anterior descending artery and can be associated with myocardial ischemia, coronary thrombosis, myocardial infarction, and stress cardiomyopathy.

Question: What is the best second-choice conduit for CABG?
Answer: The second-best choice of conduit for CABG depends on patient characteristics, including age, weight, coronary anatomy, pulmonary status, and renal failure, as well as the quality of the conduit. Generally speaking, the radial artery is likely the best choice as a second conduit in left-sided lesions with high-grade stenoses.