[论文翻译] PharmacyGPT: 人工智能药剂师与ICU药物治疗管理的AI探索


原文地址:https://arxiv.org/pdf/2307.10432


PharmacyGPT: The Artificial Intelligence Pharmacist and an Exploration of AI for ICU Pharmacotherapy Management

PharmacyGPT: 人工智能药剂师与ICU药物治疗管理的AI探索

Zhengliang Liu$^{*1}$, Zihao Wu$^{*1}$, Mengxuan Hu$^{*1}$, Shaochen Xu$^{*1}$, Bokai Zhao$^2$, Lin Zhao$^{1}$, Tianyi Zhang$^3$, Haixing Dai$^{1}$, Yiwei Li$^{1}$, Xianyan Chen$^{3}$, Ye Shen$^2$, Sheng Li$^4$, Quanzheng Li$^5$, Xiang Li$^5$, Brian Murray$^6$, Tianming Liu$^{1}$, and Andrea Sikora$^{7}$

刘郑亮$^{*1}$,吴子豪$^{*1}$,胡梦萱$^{*1}$,徐少辰$^{*1}$,赵博凯$^2$,赵琳$^{1}$,张天一$^3$,戴海星$^{1}$,李一伟$^{1}$,陈先燕$^{3}$,沈烨$^2$,李盛$^4$,李全正$^5$,李翔$^5$,Brian Murray$^6$,刘天明$^{1}$,Andrea Sikora$^{7}$

$^{1}$ School of Computing, University of Georgia $^2$ Department of Epidemiology & Biostatistics, University of Georgia $^{3}$ Department of Statistics, University of Georgia $^4$ School of Data Science, University of Virginia $^5$ Massachusetts General Hospital and Harvard Medical School $^6$ Department of Pharmacy, University of North Carolina Medical Center $^{7}$ Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy

$^{1}$ 佐治亚大学计算学院
$^2$ 佐治亚大学流行病学与生物统计学系
$^{3}$ 佐治亚大学统计学系
$^4$ 弗吉尼亚大学数据科学学院
$^5$ 麻省总医院及哈佛医学院
$^6$ 北卡罗来纳大学医学中心药学系
$^{7}$ 佐治亚大学药学院临床与管理药学系

Abstract

摘要

In this study, we introduce PharmacyGPT, a novel framework to assess the capabilities of large language models (LLMs) such as ChatGPT and GPT-4 in emulating the role of clinical pharmacists. Our methodology encompasses the use of LLMs to generate comprehensible patient clusters, formulate medication plans, and forecast patient outcomes. We conducted our investigation using real data acquired from the intensive care unit (ICU) at the University of North Carolina Health System (UNCHS). Our analysis offers insights into the potential applications and limitations of LLMs in the field of clinical pharmacy, with implications for both patient care and the development of future AI-driven healthcare solutions. By evaluating the performance of PharmacyGPT, we aim to contribute to the ongoing discourse surrounding the integration of artificial intelligence in healthcare settings, ultimately promoting the responsible and efficacious use of such technologies.

在本研究中,我们提出了PharmacyGPT这一创新框架,用于评估ChatGPT、GPT-4等大语言模型(LLM)在模拟临床药师角色时的能力。我们的方法包括利用大语言模型生成可理解的患者分群、制定用药方案以及预测患者预后。研究数据来自北卡罗来纳大学医疗系统(UNCHS)重症监护室(ICU)的真实病例。通过分析,我们揭示了大语言模型在临床药学领域的潜在应用与局限性,这些发现对患者护理和未来AI驱动的医疗解决方案开发具有启示意义。通过评估PharmacyGPT的表现,我们旨在为医疗场景中人工智能整合的持续讨论提供参考,最终促进此类技术的负责任与高效应用。

1 Introduction

1 引言

In recent years, the development of large language models (LLMs) has shown remarkable potential in a multitude of applications across various domains [1, 2, 3, 4]. Despite their impressive generalization capabilities, the performance of LLMs in specific fields (e.g., pharmacy) remains under-investigated and holds untapped potential. Here, we present PharmacyGPT, the first study to apply LLMs, specifically ChatGPT and GPT-4, to a range of clinically significant problems in the realm of comprehensive medication management in the intensive care unit (ICU) [5, 6, 7].

近年来,大语言模型 (LLM) 的发展在跨领域多场景应用中展现出显著潜力 [1, 2, 3, 4]。尽管其泛化能力令人印象深刻,但大语言模型在特定领域 (如药学) 的性能仍缺乏深入研究且蕴含未开发潜力。本文提出 PharmacyGPT,这是首个将大语言模型 (特别是 ChatGPT 和 GPT-4) 应用于重症监护室 (ICU) 综合用药管理领域一系列临床重要问题的研究 [5, 6, 7]。

PharmacyGPT aims to explore the capabilities of LLMs in addressing diverse problems of clinical importance [8], such as patient outcome prediction, AI-based medication decisions, and interpretable clustering analysis of patients. These applications hold promise for improving patient care [9] by empowering a comprehensive suite of AI-enhanced clinical decision support tools for clinical pharmacists.

PharmacyGPT旨在探索大语言模型(LLM)在解决临床重要问题[8]方面的能力,例如患者预后预测、基于AI的用药决策以及可解释的患者聚类分析。这些应用有望通过为临床药剂师提供一套全面的AI增强临床决策支持工具来改善患者护理[9]。

Due to the degree of domain-specific knowledge in fields such as clinical pharmacy, LLMs require domain-specific engineering. Here, a combination of dynamic prompting and iterative optimization was applied. To enhance the performance of LLMs in the pharmacy domain, we developed a dynamic prompting approach that leverages the in-context learning capability [10] of LLMs by constructing dynamic contexts using domain-specific data, which is beneficial for adapting language models to specialized fields [11, 12, 13, 14, 15]. This novel method enables the model to acquire contextual knowledge from semantically similar examples in existing data [10, 16] thereby improving its performance in specialized applications.

由于临床药学等领域的专业知识具有高度特异性,大语言模型(LLM)需要进行领域专项工程化。本研究采用动态提示(dynamic prompting)与迭代优化相结合的方法。为提升大语言模型在药学领域的表现,我们开发了一种动态提示技术:通过领域专用数据构建动态语境,利用大语言模型的上下文学习能力(in-context learning) [10],这种方法有助于语言模型适配专业领域[11, 12, 13, 14, 15]。该创新方法使模型能够从现有数据中语义相似的示例获取上下文知识[10, 16],从而提升其在专业应用中的表现。
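As an illustrative sketch (not the authors' code), the example-selection step behind dynamic prompting can be expressed as a nearest-neighbor search over embedding vectors; `top_k_similar`, `build_dynamic_prompt`, and the pre-computed vectors are hypothetical names:

以下为示意性代码草图 (并非作者的原始实现),展示动态提示中基于语义相似度挑选上下文示例的思路;其中 `top_k_similar`、`build_dynamic_prompt` 及预先计算好的向量均为假设的名称:

```python
import numpy as np

def top_k_similar(query_vec, example_vecs, k=5):
    """Return indices of the k examples most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    E = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = E @ q  # cosine similarity of every stored example to the query
    return np.argsort(-sims)[:k]

def build_dynamic_prompt(patient_text, examples, example_vecs, query_vec, k=3):
    """Prepend the k most semantically similar (patient, plan) pairs as context."""
    idx = top_k_similar(query_vec, example_vecs, k)
    context = "\n\n".join(
        f"Patient: {examples[i][0]}\nPlan: {examples[i][1]}" for i in idx
    )
    return f"{context}\n\nPatient: {patient_text}\nPlan:"
```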

Additionally, we designed an iterative optimization algorithm that performs automatic evaluation on the generated prescription results and composes the corresponding instruction prompts to further optimize the model. This technique allows the refinement of PharmacyGPT without requiring additional training data or fine-tuning the LLMs, paving the way for efficient adaptation of LLMs in other specialized fields.

此外,我们设计了一种迭代优化算法,对生成的处方结果进行自动评估,并编写相应的指令提示以进一步优化模型。该技术可在无需额外训练数据或微调大语言模型的情况下改进PharmacyGPT,为大语言模型在其他专业领域的高效适配开辟了新路径。


Figure 1: Generating context for ChatGPT and GPT-4

图 1: 为ChatGPT和GPT-4生成上下文

We compared various approaches to optimally use LLMs for a range of pharmacy tasks and aimed to establish a use case portfolio to inspire future research exploring LLM application to pharmacotherapy-related questions.

我们比较了多种优化使用大语言模型 (LLM) 处理药学任务的方法,旨在建立一个用例组合,以启发未来探索大语言模型在药物治疗相关问题中的应用研究。

This work has several goals:

本工作有若干目标:

• 1) To open new dialogues and stimulate discussion surrounding the application of LLMs in pharmacy.
• 2) To provide directions for investigators for future improvements in LLM-based pharmacy applications.
• 3) To evaluate the strengths and limitations of ChatGPT and GPT-4 in the pharmacy domain.
• 4) To provide insights for tailored data collection strategies to maximize the potential of LLMs in pharmacy and other specialized fields.
• 5) To present an effective framework and use case collection for applying LLMs to pharmacy.

• 1) 开启新对话并激发关于大语言模型(LLM)在药学领域应用的讨论。
• 2) 为研究者提供基于大语言模型的药学应用未来改进方向。
• 3) 评估ChatGPT和GPT-4在药学领域的优势与局限性。
• 4) 为定制化数据收集策略提供见解,以最大化大语言模型在药学及其他专业领域的潜力。
• 5) 提出应用大语言模型于药学的有效框架及用例集。

In conclusion, this paper introduces PharmacyGPT, a groundbreaking work that applies LLMs to a range of clinically significant problems in pharmacy. By establishing a foundation for future research and development, PharmacyGPT has the potential to revolutionize pharmacy practices and enhance the overall quality of healthcare services while addressing its key motivations and goals, ultimately contributing to a deeper understanding and more effective use of LLMs in specialized domains.

总之,本文介绍了PharmacyGPT这一开创性工作,它将大语言模型应用于药学领域一系列具有临床意义的问题。通过为未来研发奠定基础,PharmacyGPT有望革新药学实践、提升整体医疗服务质量,同时实现其核心动机与目标,最终推动大语言模型在专业领域的深入理解和更有效应用。

2 Related Work

2 相关工作

2.1 Large language models in healthcare

2.1 医疗领域的大语言模型 (Large Language Models)

Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) [17] and the Generative Pre-trained Transformer (GPT) series [18, 19, 10] have revolutionized the field of natural language processing (NLP) by outperforming earlier approaches like Recurrent Neural Network (RNN)-based models [20, 2, 1, 21] in a variety of tasks. Existing transformer-based language models can be broadly classified into three categories: masked language models (e.g., BERT), autoregressive models (e.g., GPT-3), and encoder-decoder models (e.g., Bidirectional Auto-Regressive Transformer, or BART [22]). The recent development of very large language models (LLMs) grounded in the transformer architecture but built on a much grander scale, including ChatGPT and GPT-4 [3], Bloomz [23], LLaMA [24], and Med-PaLM 2 [25], has gained momentum.

基于Transformer的语言模型,如双向编码器表示模型(BERT) [17]和生成式预训练Transformer(GPT)系列[18, 19, 10],通过在各类任务中超越基于循环神经网络(RNN)的早期方法[20, 2, 1, 21],彻底改变了自然语言处理(NLP)领域。现有基于Transformer的语言模型主要可分为三类:掩码语言模型(如BERT)、自回归模型(如GPT-3)以及编码器-解码器模型(如双向自回归Transformer/BART[22])。近期,基于Transformer架构但规模显著扩增的超大语言模型(LLM)发展迅猛,包括ChatGPT和GPT-4[3]、Bloomz[23]、LLAMA[24]以及Med-PaLM 2[25]等。

The primary objective of LLMs is to accurately learn context-specific and domain-specific latent feature representations from input text [17, 3, 26, 27, 28, 29]. For example, the vector representation of the word "prescription" could vary significantly between the pharmacy domain and general usage. Smaller language models such as BERT often necessitate pre-training and supervised fine-tuning on downstream tasks to achieve satisfactory performance. However, LLMs typically do not require further fine-tuning and model updates but still deliver competitive performance on diverse downstream applications [30, 28, 31, 32, 33].

大语言模型的主要目标是从输入文本中准确学习特定上下文和特定领域的潜在特征表示 [17, 3, 26, 27, 28, 29]。例如,"prescription"一词的向量表示在药学领域和日常使用中可能存在显著差异。BERT等较小规模的语言模型通常需要对下游任务进行预训练和有监督微调才能达到理想性能。然而,大语言模型通常无需进一步微调和模型更新,仍能在多样化下游应用中展现出竞争力 [30, 28, 31, 32, 33]。

LLMs like ChatGPT and GPT-4 have significant potential in healthcare applications due to their advanced natural language understanding (NLU) capabilities. The massive amounts of text and other data generated in clinical practices can be leveraged through LLM-powered tools. LLMs can handle diverse and complex tasks like clinical triage classification [34], medical question-answering [35], HIPAA-compliant data anonymization [36], radiology report summarization [16], clinical information extraction [37], or dementia detection [38]. In addition, the Reinforcement Learning from Human Feedback (RLHF) [39] process incorporates human preferences and values into ChatGPT and its successor, GPT-4, making it particularly suitable for understanding patient-centered guidelines and human values in healthcare applications.

像ChatGPT和GPT-4这样的大语言模型(LLM)凭借其先进的自然语言理解(NLU)能力,在医疗健康应用中展现出巨大潜力。通过LLM驱动的工具,可以充分利用临床实践中产生的大量文本及其他数据。这类模型能处理多样且复杂的任务,例如临床分诊分类[34]、医学问答[35]、符合HIPAA标准的数据匿名化[36]、放射学报告摘要生成[16]、临床信息提取[37]以及痴呆症检测[38]。此外,基于人类反馈的强化学习(RLHF)[39]机制将人类偏好与价值观融入ChatGPT及其继任者GPT-4,使其特别适合理解医疗应用中以患者为中心的指导原则和人文关怀价值。

We aim to present the first study employing LLMs for a wide range of tasks and problems of interest to pharmacy practice in the intensive care unit (ICU).

我们旨在首次利用大语言模型(LLM)研究重症监护病房(ICU)药学实践中广泛关注的任务和问题。

2.2 Reasoning with LLMs

2.2 基于大语言模型的推理

LLMs have shown great potential in high-level reasoning abilities. For example, GPT-3 has demonstrated common-sense reasoning through in-context learning, and Wei et al. [40] found that LLMs can perform better in arithmetic, deductive, and common-sense reasoning when given carefully prepared sequential prompts, decomposing multi-step problems.

大语言模型在高阶推理能力方面展现出巨大潜力。例如,GPT-3通过上下文学习展示了常识推理能力,Wei等人[40]发现当提供精心设计的序列提示时,大语言模型在算术、演绎和常识推理任务中表现更优,能够有效分解多步骤问题。

Deductive reasoning: Wu et al. [35] compared the deductive reasoning abilities of ChatGPT and GPT-4 in the specialized domain of radiology, revealing that GPT-4 outperforms ChatGPT, with fine-tuned models requiring significant data to match GPT-4’s performance. The results of this investigation suggest that creating generic reasoning models based on LLMs for diverse tasks across various domains is viable and practical.

演绎推理:Wu等人[35]在放射学专业领域对比了ChatGPT与GPT-4的演绎推理能力,发现GPT-4表现更优,且微调模型需大量数据才能达到GPT-4水平。该研究表明基于大语言模型构建通用推理模型以应对跨领域多样化任务具有可行性和实用性。

Ma et al. [16] explored LLMs’ ability to comprehend radiology reports using a dynamic prompting paradigm. They developed an iterative optimization algorithm for automatic evaluation and prompt composition, achieving state-of-the-art performance on the MIMIC-CXR and OpenI datasets without additional training data or LLM fine-tuning.

Ma et al. 探索了大语言模型 [16] 通过动态提示范式理解放射学报告的能力。他们开发了一种用于自动评估和提示组合的迭代优化算法,在 MIMIC-CXR 和 OpenI 数据集上实现了最先进的性能,且无需额外训练数据或对大语言模型进行微调。

Abductive reasoning: Jung et al. [41] improved LLMs’ ability to make logical explanations through maieutic prompting, but this study was limited to question-answering style problems. Zhong et al. [33] improved from the well-known ABL framework [42] and proposed the first comprehensive abductive learning pipeline based on LLMs.

溯因推理:Jung等人[41]通过启发式提示(maieutic prompting)提升了大语言模型(LLM)的逻辑解释能力,但该研究仅限于问答式问题。Zhong等人[33]从著名的ABL框架[42]改进而来,提出了首个基于大语言模型的完整溯因学习流程。

Chain-of-thought: Chain-of-thought reasoning (CoT) is a problem-solving approach that breaks complex problems into smaller, manageable steps, resembling how humans naturally solve complicated problems [43]. In the context of LLMs, CoT aims to improve the model’s accuracy and coherence by encouraging step-by-step reasoning. A zero-shot approach asking the LLM to "think step by step" [44] significantly improves reasoning performance across benchmarks. Providing LLMs with more examples (few-shot context learning) [40] enhances their performance through in-context learning. Overall, CoT enables LLMs to effectively tackle complex reasoning tasks.

思维链 (Chain-of-thought):思维链推理 (CoT) 是一种将复杂问题分解为更小、可管理步骤的解题方法,类似于人类自然解决复杂问题的方式 [43]。在大语言模型 (LLM) 的背景下,CoT 旨在通过鼓励逐步推理来提高模型的准确性和连贯性。采用零样本方法要求大语言模型"逐步思考" [44],可显著提升其在各类基准测试中的推理表现。而通过提供更多示例(少样本上下文学习)[40],能够借助上下文学习进一步提升大语言模型的性能。总体而言,CoT 使大语言模型能够有效应对复杂推理任务。

3 Methods

3 方法

3.1 Data description

3.1 数据描述

This dataset was based on a study cohort of 1,000 adult patients admitted for at least 24 hours to a medical, surgical, neurosciences, cardiac, or burn ICU at the University of North Carolina Health System (UNCHS) between October 2015 and October 2020. Only the patient’s first ICU admission was included in the analysis. The data were extracted from the UNCHS Electronic Health Record (EHR) housed in the Carolina Data Warehouse (CDW) by a trained in-house data analyst.

该数据集基于2015年10月至2020年10月期间在北卡罗来纳大学医疗系统(UNCHS)内科、外科、神经科学、心脏或烧伤重症监护病房住院至少24小时的1000名成年患者研究队列。分析仅纳入患者的首次ICU入院记录。数据由经过培训的内部数据分析师从卡罗来纳数据仓库(CDW)中的UNCHS电子健康记录(EHR)提取。

The dataset contains patient demographics, medication administration record (MAR) information, and patient outcomes, such as common ICU complications. Demographics include age, sex, admission diagnosis, ICU type, Medication Regimen Complexity-Intensive Care Unit (MRC-ICU) score at 24 hours [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], and APACHE II score [56] at 24 hours. The MRC-ICU [57, 58] is a validated score summarizing the complexity of prescribed medications in the ICU [59, 60]. MAR information consists of drug, dose, route, duration, and timing of administration [61, 62, 63].

数据集包含患者人口统计学信息、用药记录(MAR)信息以及患者结局指标,如常见ICU并发症。人口统计学数据涵盖年龄、性别、入院诊断、ICU类型、24小时重症监护用药方案复杂性评分(MRC-ICU) [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]以及24小时APACHE II评分[56]。MRC-ICU评分[57, 58]是经过验证的、用于评估ICU处方药物复杂性的综合指标[59, 60]。用药记录信息包括药物名称、剂量、给药途径、持续时间和给药时间点[61, 62, 63]。

Patient outcomes included mortality, ICU length of stay, delirium occurrence (defined by a positive CAM-ICU score), duration of mechanical ventilation, duration of vasopressor use, and acute kidney injury (defined by the presence of renal replacement therapy or a serum creatinine greater than 1.5 times baseline). Other than textual descriptions of patient data, a binary value of 1 was assigned to indicate they received a specific drug order, including drug, dose, strength, and formulation/route. Categorical features for patient outcomes were relabeled as numeric values, and any unknown or missing entities were counted as absences from that event.

患者结局指标包括死亡率、ICU住院时长、谵妄发生率 (按CAM-ICU评分阳性判定)、机械通气时长、血管加压药使用时长,以及急性肾损伤 (按接受肾脏替代治疗或血清肌酐值超过基线1.5倍判定)。除患者数据的文本描述外,采用二进制数值1表示患者接受了特定药物医嘱 (含药品名称、剂量、规格及剂型/给药途径)。患者结局的分类特征被重新编码为数值变量,所有未知或缺失实体均视为该事件未发生。
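This encoding can be sketched with a toy MAR table (the column names and values here are hypothetical, not the UNCHS schema): drug orders become binary indicators, and categorical outcomes become numeric flags with missing values counted as absence.

上述编码方式可用一个玩具级MAR表格示意 (以下列名与取值均为假设,并非UNCHS的真实数据结构):药物医嘱转为二进制指示变量,分类结局转为数值标记,缺失值计为未发生:

```python
import pandas as pd

# Hypothetical mini-MAR: one row per administered drug order per patient.
mar = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "order": ["propofol 10 mg/mL infusion IV",
              "fentanyl 50 mcg/mL infusion IV",
              "propofol 10 mg/mL infusion IV"],
})

# Binary indicator: 1 = patient received that specific drug order.
drug_matrix = pd.crosstab(mar["patient_id"], mar["order"]).clip(upper=1)

# Categorical outcome relabeled as numeric; unknown/missing counted as absence (0).
outcomes = pd.Series(["yes", None, "no"], index=[1, 2, 3], name="delirium")
delirium_flag = outcomes.map({"yes": 1, "no": 0}).fillna(0).astype(int)
```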

3.2 Creating Interpretable Patient Clusters

3.2 创建可解释的患者聚类

Here, we describe the process for generating interpretable patient clusters using LLM embeddings and hierarchical clustering. The following pseudo-code algorithm outlines the steps:

我们在此描述利用大语言模型嵌入和层次聚类生成可解释患者分组的流程。以下伪代码算法概述了具体步骤:

  1. Feed patient information (i.e., age, sex, diagnosis, and ICD-10 problem list) into GPT-3 to generate embedding vectors of size 1536.
  2. Use the generated embeddings as input for hierarchical clustering to create patient clusters.

  1. 将患者信息 (即年龄、性别、诊断和ICD-10问题列表) 输入GPT-3,生成大小为1536的嵌入向量。
  2. 将生成的嵌入向量作为层次聚类的输入,创建患者聚类。

Algorithm 1 Interpretable Patient Clustering

算法 1 可解释的患者聚类

The above algorithm describes the process of generating interpretable patient clusters using ChatGPT embeddings and hierarchical clustering. In this method, patient information is first transformed into 1536-dimensional embeddings using GPT-3 [10, 64]. These embeddings are then used as input for a hierarchical clustering algorithm, which groups patients based on the similarity of their embeddings. This approach aims to create accurate and interpretable patient clusters for use in clinical decisions.

上述算法描述了使用ChatGPT嵌入和层次聚类生成可解释患者群组的过程。该方法首先使用GPT-3 [10,64]将患者信息转换为1536维嵌入向量,随后将这些嵌入作为层次聚类算法的输入,根据嵌入相似度对患者进行分组。该方案旨在构建精确且可解释的患者群组,以辅助临床决策。
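A minimal sketch of this pipeline is shown below. Since the embedding API cannot be called here, random 1536-dimensional vectors stand in for the GPT-3 (ada-002) embeddings of each patient's textual profile; the cluster count is likewise an illustrative choice.

以下为该流程的最小示意。由于此处无法调用嵌入API,代码用随机的1536维向量代替每位患者文本画像的GPT-3 (ada-002) 嵌入;聚类数量同样仅为示意性设定。

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Stand-in for the 1536-dim embedding vectors of each patient's textual
# profile (age, sex, diagnosis, ICD-10 problem list).
patient_embeddings = rng.normal(size=(100, 1536))

# Agglomerative (hierarchical) clustering over the embedding vectors.
clusterer = AgglomerativeClustering(n_clusters=8, linkage="ward")
labels = clusterer.fit_predict(patient_embeddings)
```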

3.3 Iterative Optimization

3.3 迭代优化

In the iterative optimization process for PharmacyGPT, we aimed to enhance the performance of the model when generating various patient-related outputs, such as mortality, length of ICU stay, APACHE II score range, or medication plan at 24 hours. This optimization process is based on an iterative feedback loop, which adjusts the input prompts provided to the model based on its performance in previous iterations.

在针对PharmacyGPT的迭代优化过程中,我们的目标是提升模型在生成各类患者相关输出时的性能,例如死亡率、ICU住院时长、APACHE II评分范围或24小时用药方案。该优化过程基于迭代反馈循环机制,会根据模型在前几轮迭代中的表现来调整其输入提示词。

Algorithm 2 Iterative Optimization Algorithm for Patient Data

算法 2: 患者数据迭代优化算法


Figure 2: Initial prompt used for medication plan generation

图 2: 用于生成用药方案的初始提示

Initially, we input a dynamic prompt containing a patient portrait based on demographics and symptoms to the ChatGPT model (see Figure 2). After this iteration, we calculated the ROUGE-1 score by comparing the model-generated output with the ground truth of the medications ordered in the first 24 hours. Based on the score, the input prompt was modified for the next iteration. If the score was above a predefined threshold, we merged the prompt with the generated output in a manner that encouraged the model to produce a similar response in the next iteration. Otherwise, we modified the prompt to discourage the model from producing the same type of output. The iterative process continued until a predetermined number of iterations was reached. Through this optimization method, PharmacyGPT learned to improve its predictions and recommendations over time, as it is continually guided by the evaluation score and feedback from previous iterations. This approach enables PharmacyGPT to adapt and refine its output based on a better understanding of the patient’s condition and the desired outcomes. The final prompt used can be seen in Figure 3.

最初,我们向ChatGPT模型输入了一个包含基于人口统计数据和症状的患者画像的动态提示(见图2)。在此次迭代后,我们通过比较模型生成的输出与最初24小时内所开药物的真实情况,计算了ROUGE-1分数。根据该分数,我们对下一次迭代的输入提示进行了修改。如果分数超过预设阈值,我们会以鼓励模型在下次迭代中产生类似响应的方式,将提示与生成的输出合并。否则,我们会修改提示以防止模型产生相同类型的输出。这一迭代过程持续进行,直到达到预定的迭代次数。通过这种优化方法,PharmacyGPT学会了随着时间的推移改进其预测和建议,因为它持续受到评估分数和先前迭代反馈的指导。这种方法使PharmacyGPT能够基于对患者状况和预期结果的更好理解来调整和完善其输出。最终使用的提示如图3所示。
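The feedback loop just described can be sketched as follows. This is an illustrative simplification, not the authors' code: `llm` is a placeholder for the model call, `rouge1_f` is a minimal unigram ROUGE-1 (no stemming), and the prompt-merging phrases are hypothetical.

上述反馈循环可用如下代码示意。这只是示意性简化,并非作者的原始实现:`llm` 是模型调用的占位符,`rouge1_f` 是最简化的一元ROUGE-1 (无词干化),提示合并所用的措辞亦为假设:

```python
def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap ROUGE-1 F1 score."""
    c, r = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(c.count(w), r.count(w)) for w in set(c))
    if not c or not r or not overlap:
        return 0.0
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec)

def iterative_optimize(llm, base_prompt, reference_plan, threshold=0.3, n_iters=5):
    """Feedback loop: reinforce outputs scoring above threshold, discourage the rest."""
    prompt, best = base_prompt, ("", 0.0)
    for _ in range(n_iters):
        output = llm(prompt)                      # placeholder LLM call
        score = rouge1_f(output, reference_plan)  # compare to 24-hour ground truth
        if score > best[1]:
            best = (output, score)
        if score >= threshold:
            # Merge prompt with the good output to encourage similar responses.
            prompt = base_prompt + f"\nA good plan looks like: {output}"
        else:
            # Modify prompt to discourage this type of output.
            prompt = base_prompt + f"\nAvoid plans like: {output}"
    return best
```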


Figure 3: Final prompt used for medication plan generation

图 3: 用于生成用药方案的最终提示词

4 Results

4 结果

4.1 Interpretable clustering

4.1 可解释聚类

Our clustering methodology yielded clusters that closely aligned with the ICD-10 code categories of patients, indicating a high level of interpretability (see Figure 4). This alignment demonstrates the effectiveness of our approach in generating meaningful and coherent patient groupings.

我们的聚类方法产生的聚类结果与患者的ICD-10代码类别高度吻合,显示出良好的可解释性(见图4)。这种一致性证明了我们的方法在生成有意义且连贯的患者分组方面的有效性。

The group labeled as "Diverse symptoms with neurological impact" is the most sparsely spread among the clusters, exhibiting a diverse range of symptoms on the surface. However, our embedding analysis reveals that the majority of patients in this group share a common underlying condition: neurological disorders. This observation suggests that despite the presence of various other symptoms, patients in this group are primarily affected by neurological issues (see Figure 5). Our approach successfully uncovered this underlying similarity, demonstrating the capability of our clustering method in identifying meaningful connections between patients with seemingly disparate symptom presentations.

标记为"具有神经影响的多样化症状"的群体在各聚类中分布最为稀疏,表面上呈现出一系列多样化的症状。然而,我们的嵌入分析表明,该组大多数患者都有一个共同的潜在病症:神经系统疾病。这一观察结果表明,尽管存在各种其他症状,该组患者主要受到神经系统问题的影响 (见图 5)。我们的方法成功揭示了这种潜在相似性,证明该聚类方法能够识别症状表现看似迥异的患者之间的有意义的关联。

4.2 Predicting patient outcomes

4.2 预测患者结局

Hospital Mortality We used ChatGPT to predict hospital mortality based on patient information including age, race, sex, ICD-10 admission diagnosis, and comorbidities. Table 1 results indicate that imbalanced data significantly reduced precision and F1 scores, as the deceased class in the test set contained only 46 samples, while ChatGPT tended to make predictions in a balanced manner. Also, different prompts did not contribute much to improvement. Specifically, the last three rows attempted to align the diagnosis categories of the test set with those demonstrated in the prompts: Rand-5 randomly selects five demonstrations from the training set and incorporates them into the prompt; Freq-5 randomly selects one demonstration from each of the top five most frequent diagnosis categories; Bcat-rand-5 randomly chooses five samples with the same diagnosis category as the tested sample; and Sim-5 selects the five most similar patient information records as demonstrations.

院内死亡率
我们使用ChatGPT基于患者信息(包括年龄、种族、性别、ICD-10入院诊断和合并症)预测院内死亡率。表1结果显示,数据不平衡显著降低了精确率和F1分数,因为测试集中的死亡类别仅包含46个样本,而ChatGPT倾向于以平衡方式做出预测。此外,不同的提示词对改进效果贡献不大。具体而言,最后三行尝试将测试集的诊断类别与提示词中展示的类别对齐:Rand-5从训练集中随机选择五个示例并纳入提示词;Freq-5从前五个最频繁的诊断类别中各随机选择一个示例;Bcat-rand-5随机选择五个与测试样本具有相同诊断类别的样本;Sim-5选择五个最相似的患者信息记录作为示例。


Figure 4: Interpretable clusters for real ICU data. Patient age, sex, ICD-10 diagnosis, and ICD-10 problem list were input to generate ChatGPT embeddings and hierarchical clustering as depicted above, which resulted in meaningful cluster separation. These patient clusters were created from embeddings generated by ADA-002, clustered via agglomerative clustering techniques, and visualized using t-SNE (t-distributed Stochastic Neighbor Embedding).

图 4: 真实ICU数据的可解释聚类。将患者年龄、性别、ICD-10诊断和ICD10问题清单输入生成ChatGPT嵌入向量,并采用如上所示的层次聚类方法,最终形成具有临床意义的簇群划分。这些患者簇群是通过ADA-002生成的嵌入向量,使用凝聚聚类技术进行聚类,并借助T-SNE(t分布随机邻域嵌入)实现可视化呈现。

Table 1: Mortality prediction

| Model | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| rand_5-shot | 0.7549 | 0.3750 | 0.7095 | 0.4898 |
| freq_5-shot | 0.6455 | 0.1667 | 0.6364 | 0.2642 |
| bcat_rand_5-shot | 0.6602 | 0.2051 | 0.7647 | 0.4262 |
| sim_5-shot | 0.6699 | 0.2821 | 0.6471 | 0.3929 |

表 1: 死亡率预测

| 模型 | 准确率 | 精确率 | 召回率 | F1分数 |
| --- | --- | --- | --- | --- |
| rand_5-shot | 0.7549 | 0.3750 | 0.7095 | 0.4898 |
| freq_5-shot | 0.6455 | 0.1667 | 0.6364 | 0.2642 |
| bcat_rand_5-shot | 0.6602 | 0.2051 | 0.7647 | 0.4262 |
| sim_5-shot | 0.6699 | 0.2821 | 0.6471 | 0.3929 |

APACHE II We employed ChatGPT and GPT-4 to predict APACHE II scores based on patient information (age, race, sex, ICD-10 admission diagnosis, and comorbidities). Table 2 results reveal that GPT-4 significantly enhanced accuracy. Accuracy is the ratio of correctly classified instances to the total number of instances. Precision focuses on the proportion of instances where the model predicts a positive outcome that actually turns out to be positive; it is a measure of the accuracy of positive class predictions. Recall is the model’s ability to correctly identify positive classes: the proportion of correct identifications by the model out of all actual positive examples. The F1 score is the harmonic mean of precision and recall, and it attempts to find a balance between the two.

APACHE II 我们使用 ChatGPT 和 GPT-4 基于患者信息(年龄、种族、性别、ICD-10入院诊断及合并症)预测 APACHE II 评分。表 2 结果显示,GPT-4 显著提升了预测准确率。准确率指正确分类实例占总实例的比例。精确率关注模型预测为正类且实际为正类的实例占比,用于衡量正类预测的准确性。召回率反映模型正确识别正类的能力,即模型从所有实际正例中正确识别的比例。F1分数是精确率与召回率的调和平均数,旨在平衡二者关系。
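The four metrics defined above can be computed directly from prediction counts; a minimal sketch:

上述四个指标可直接由预测计数计算得到;最小示意如下:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary label set."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # correctness of positive calls
    recall = tp / (tp + fn) if tp + fn else 0.0      # coverage of actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return accuracy, precision, recall, f1
```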



Figure 5: Here, dots represent patients who were clustered to the neurology diagnosis group, with black dots showing a neurology-associated ICD-10 diagnosis.

图 5: 此处圆点代表被归类到神经科诊断组的患者,黑色圆点表示具有神经科关联的ICD10诊断。

Table 2: APACHE II score prediction

| Model | Accuracy |
| --- | --- |
| rand_5-shot | 0.1818 |
| freq_5-shot | 0.1091 |
| bcat_rand_5-shot | 0.2000 |
| sim_5-shot | 0.1636 |
| GPT-4_rand_5 | 0.3727 |
| GPT-4_sim_5 | 0.4364 |

表 2: APACHE II 评分预测

| 模型 | 准确率 |
| --- | --- |
| rand_5-shot | 0.1818 |
| freq_5-shot | 0.1091 |
| bcat_rand_5-shot | 0.2000 |
| sim_5-shot | 0.1636 |
| GPT-4_rand_5 | 0.3727 |
| GPT-4_sim_5 | 0.4364 |

4.3 Prescribing medication plans

4.3 制定用药方案

We employed GPT-4 to generate medication plans for ICU patients using the same patient information as above. We then compared these plans to the medication plans at the 24-hour point used in practice. Results can be seen in Table 3. ROUGE-1, ROUGE-2, and ROUGE-L are the three main ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics used to evaluate automatic text summarization. ROUGE-1 focuses on the overlap of words (1-grams), and ROUGE-2 is based on the overlap of bigrams (2-grams). ROUGE-L focuses on the Longest Common Subsequence (LCS); this method does not require matched words to be consecutive, only to appear in the same order.

我们使用GPT-4基于相同的患者信息为ICU患者生成用药方案,并将这些方案与实际应用中24小时节点的用药方案进行对比。结果如表3所示。ROUGE-1、ROUGE-2和ROUGE-L是用于评估自动文本摘要的三种主要ROUGE (Recall-Oriented Understudy for Gisting Evaluation)指标:ROUGE-1关注单词(1-gram)的重叠,ROUGE-2基于二元组(2-gram)的重叠,ROUGE-L则关注最长公共子序列(LCS),该方法不要求匹配的词连续出现,只要求按相同顺序出现。


Figure 6: Medication plan generated by GPT-4 vs. the actual plan

图 6: GPT-4生成的用药方案与实际方案对比

Judging these generated plans requires the evaluation of an expert panel of pharmacists, who would also devise evaluation metrics and supply extra information to help GPT-4 (we suspect it does not have access to other patient records).

评判这些生成的方案需要由药剂师专家小组进行评估,该小组还将制定评估指标并提供额外信息以辅助GPT-4 (我们推测它无法访问其他患者记录)。
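As a minimal token-level sketch of the LCS-based ROUGE-L metric (matches must be in order but need not be contiguous; simplified, no stemming):

以下为基于LCS的ROUGE-L指标的最小词级示意 (匹配只需保持顺序、无需连续;为简化实现,未做词干化):

```python
def rouge_l_f(candidate: str, reference: str) -> float:
    """ROUGE-L F1 via the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    # Classic dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if not lcs:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)  # harmonic mean of LCS precision and recall
```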