[论文翻译] PharmacyGPT: 人工智能药剂师与ICU药物治疗管理的AI探索


原文地址:https://arxiv.org/pdf/2307.10432


PharmacyGPT: The Artificial Intelligence Pharmacist and an Exploration of AI for ICU Pharmacotherapy Management

PharmacyGPT: 人工智能药剂师与ICU药物治疗管理的AI探索

Zhengliang Liu$^{*1}$, Zihao Wu$^{*1}$, Mengxuan Hu$^{*1}$, Shaochen Xu$^{*1}$, Bokai Zhao$^{2}$, Lin Zhao$^{1}$, Tianyi Zhang$^{3}$, Haixing Dai$^{1}$, Yiwei Li$^{1}$, Xianyan Chen$^{3}$, Ye Shen$^{2}$, Sheng Li$^{4}$, Quanzheng Li$^{5}$, Xiang Li$^{5}$, Brian Murray$^{6}$, Tianming Liu$^{1}$, and Andrea Sikora$^{7}$

Zhengliang Liu$^{*1}$, Zihao Wu$^{*1}$, Mengxuan Hu$^{*1}$, Shaochen Xu$^{*1}$, Bokai Zhao$^{2}$, Lin Zhao$^{1}$, Tianyi Zhang$^{3}$, Haixing Dai$^{1}$, Yiwei Li$^{1}$, Xianyan Chen$^{3}$, Ye Shen$^{2}$, Sheng Li$^{4}$, Quanzheng Li$^{5}$, Xiang Li$^{5}$, Brian Murray$^{6}$, Tianming Liu$^{1}$, Andrea Sikora$^{7}$

$^{1}$ School of Computing, University of Georgia $^{2}$ Department of Epidemiology & Biostatistics, University of Georgia $^{3}$ Department of Statistics, University of Georgia $^{4}$ School of Data Science, University of Virginia $^{5}$ Massachusetts General Hospital and Harvard Medical School $^{6}$ Department of Pharmacy, University of North Carolina Medical Center $^{7}$ Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy

$^{1}$ 佐治亚大学计算学院
$^2$ 佐治亚大学流行病学与生物统计学系
$^{3}$ 佐治亚大学统计学系
$^4$ 弗吉尼亚大学数据科学学院
$^5$ 麻省总医院及哈佛医学院
$^6$ 北卡罗来纳大学医学中心药学系
$^{7}$ 佐治亚大学药学院临床与管理药学系

Abstract

摘要

In this study, we introduce PharmacyGPT, a novel framework to assess the capabilities of large language models (LLMs) such as ChatGPT and GPT-4 in emulating the role of clinical pharmacists. Our methodology encompasses the use of LLMs to generate comprehensible patient clusters, formulate medication plans, and forecast patient outcomes. We conducted our investigation using real data acquired from the intensive care unit (ICU) at the University of North Carolina Health System (UNCHS). Our analysis offers insights into the potential applications and limitations of LLMs in the field of clinical pharmacy, with implications for both patient care and the development of future AI-driven healthcare solutions. By evaluating the performance of PharmacyGPT, we aim to contribute to the ongoing discourse surrounding the integration of artificial intelligence in healthcare settings, ultimately promoting the responsible and efficacious use of such technologies.

在本研究中,我们提出了PharmacyGPT这一创新框架,用于评估ChatGPT、GPT-4等大语言模型(LLM)在模拟临床药师角色时的能力。我们的方法包括利用大语言模型生成可理解的患者分群、制定用药方案以及预测患者预后。研究数据来自北卡罗来纳大学医疗系统(UNCHS)重症监护室(ICU)的真实病例。通过分析,我们揭示了大语言模型在临床药学领域的潜在应用与局限性,这些发现对患者护理和未来AI驱动的医疗解决方案开发具有启示意义。通过评估PharmacyGPT的表现,我们旨在为医疗场景中人工智能整合的持续讨论提供参考,最终促进此类技术的负责任与高效应用。

1 Introduction

1 引言

In recent years, the development of large language models (LLMs) has shown remarkable potential in a multitude of applications across various domains [1, 2, 3, 4]. Despite their impressive generalization capabilities, the performance of LLMs in specific fields (e.g., pharmacy) remains under-investigated and holds untapped potential. Here, we present PharmacyGPT, the first study to apply LLMs, specifically ChatGPT and GPT-4, to a range of clinically significant problems in the realm of comprehensive medication management in the intensive care unit (ICU) [5, 6, 7].

近年来,大语言模型 (LLM) 的发展在跨领域多场景应用中展现出显著潜力 [1, 2, 3, 4]。尽管其泛化能力令人印象深刻,但大语言模型在特定领域 (如药学) 的性能仍缺乏深入研究且蕴含未开发潜力。本文提出 PharmacyGPT,这是首个将大语言模型 (特别是 ChatGPT 和 GPT-4) 应用于重症监护室 (ICU) 综合用药管理领域一系列临床重要问题的研究 [5, 6, 7]。

PharmacyGPT aims to explore the capabilities of LLMs in addressing diverse problems of clinical importance [8], such as patient outcome prediction, AI-based medication decisions, and interpretable clustering analysis of patients. These applications hold promise for improving patient care [9] by empowering a comprehensive suite of AI-enhanced clinical decision support tools for clinical pharmacists.

PharmacyGPT旨在探索大语言模型(LLM)在解决临床重要问题[8]方面的能力,例如患者预后预测、基于AI的用药决策以及可解释的患者聚类分析。这些应用有望通过为临床药剂师提供一套全面的AI增强临床决策支持工具来改善患者护理[9]。

Due to the degree of domain-specific knowledge in fields such as clinical pharmacy, LLMs require domain-specific engineering. Here, a combination of dynamic prompting and iterative optimization was applied. To enhance the performance of LLMs in the pharmacy domain, we developed a dynamic prompting approach that leverages the in-context learning capability [10] of LLMs by constructing dynamic contexts using domain-specific data, which is beneficial for adapting language models to specialized fields [11, 12, 13, 14, 15]. This novel method enables the model to acquire contextual knowledge from semantically similar examples in existing data [10, 16], thereby improving its performance in specialized applications.

由于临床药学等领域的专业知识具有高度特异性,大语言模型(LLM)需要进行领域专项工程化。本研究采用动态提示(dynamic prompting)与迭代优化相结合的方法。为提升大语言模型在药学领域的表现,我们开发了一种动态提示技术:通过领域专用数据构建动态语境,利用大语言模型的上下文学习能力(in-context learning) [10],这种方法有助于语言模型适配专业领域[11, 12, 13, 14, 15]。该创新方法使模型能够从现有数据中语义相似的示例获取上下文知识[10, 16],从而提升其在专业应用中的表现。
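As an illustration of the dynamic prompting idea described above, the sketch below retrieves the k most semantically similar prior cases by cosine similarity over their embedding vectors and prepends them as in-context demonstrations. This is a minimal sketch under stated assumptions: the prompt wording is illustrative (the paper's actual templates appear only in its figures), and `examples` is assumed to be a list of (patient description, medication plan) pairs with precomputed embeddings.

```python
import numpy as np

def top_k_similar(query_vec, example_vecs, k=5):
    """Return indices of the k examples whose embeddings are most
    cosine-similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q
    return np.argsort(-sims)[:k]

def build_dynamic_prompt(patient_text, examples, example_vecs, query_vec, k=2):
    """Prepend the k most similar (patient, plan) pairs as in-context
    demonstrations before the new patient's description."""
    idx = top_k_similar(query_vec, example_vecs, k)
    demos = "\n\n".join(
        f"Patient: {examples[i][0]}\nMedication plan: {examples[i][1]}"
        for i in idx
    )
    return f"{demos}\n\nPatient: {patient_text}\nMedication plan:"
```

In the paper's setting the embeddings would come from a GPT-3 embedding model, but any fixed-dimension sentence embedding works with this retrieval step.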

Additionally, we designed an iterative optimization algorithm that performs automatic evaluation on the generated prescription results and composes the corresponding instruction prompts to further optimize the model. This technique allows the refinement of PharmacyGPT without requiring additional training data or fine-tuning the LLMs, paving the way for efficient adaptation of LLMs in other specialized fields.

此外,我们设计了一种迭代优化算法,对生成的处方结果进行自动评估,并编写相应的指令提示以进一步优化模型。该技术可在无需额外训练数据或微调大语言模型的情况下改进PharmacyGPT,为大语言模型在其他专业领域的高效适配开辟了新路径。


Figure 1: Generating context for ChatGPT and GPT-4

图 1: 为ChatGPT和GPT-4生成上下文

We compared various approaches to optimally use LLMs for a range of pharmacy tasks and aimed to establish a use case portfolio to inspire future research exploring LLM application to pharmacotherapy-related questions.

我们比较了多种优化使用大语言模型 (LLM) 处理药学任务的方法,旨在建立一个用例组合,以启发未来探索大语言模型在药物治疗相关问题中的应用研究。

This work has several goals:

本工作有若干目标:

• 1) To open new dialogues and stimulate discussion surrounding the application of LLMs in pharmacy.
• 2) To provide directions for investigators for future improvements in LLM-based pharmacy applications.

• 1) 开启新对话并激发关于大语言模型(LLM)在药学领域应用的讨论。
• 2) 为研究者提供基于大语言模型的药学应用的未来改进方向。

• 3) To evaluate the strengths and limitations of ChatGPT and GPT-4 in the pharmacy domain.
• 4) To provide insights for tailored data collection strategies to maximize the potential of LLMs in pharmacy and other specialized fields.
• 5) To present an effective framework and use case collection for applying LLM to pharmacy.

• 3) 评估ChatGPT和GPT-4在药学领域的优势与局限性。
• 4) 为定制化数据收集策略提供见解,以最大化大语言模型在药学及其他专业领域的潜力。
• 5) 提出应用大语言模型于药学的有效框架及用例集。

In conclusion, this paper introduces PharmacyGPT, a groundbreaking work that applies LLMs to a range of clinically significant problems in pharmacy. By establishing a foundation for future research and development, PharmacyGPT has the potential to revolutionize pharmacy practices and enhance the overall quality of healthcare services while addressing its key motivations and goals, ultimately contributing to a deeper understanding and more effective use of LLMs in specialized domains.

总之,本文介绍了PharmacyGPT这一开创性工作,它将大语言模型应用于药学领域一系列具有临床意义的问题。通过为未来研发奠定基础,PharmacyGPT有望革新药学实践、提升整体医疗服务质量,同时实现其核心动机与目标,最终推动大语言模型在专业领域的深入理解和更有效应用。

2 Related Work

2 相关工作

2.1 Large language models in healthcare

2.1 医疗领域的大语言模型 (Large Language Models)

Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) [17] and the Generative Pre-trained Transformer (GPT) series [18, 19, 10] have revolutionized the field of natural language processing (NLP) by outperforming earlier approaches like Recurrent Neural Network (RNN)-based models [20, 2, 1, 21] in a variety of tasks. Existing transformer-based language models can be broadly classified into three categories: masked language models (e.g., BERT), autoregressive models (e.g., GPT-3), and encoder-decoder models (e.g., Bidirectional Auto-Regressive Transformer or BART [22]). The recent development of very large language models (LLMs) grounded in the transformer architecture but built on a much grander scale, including ChatGPT and GPT-4 [3], Bloomz [23], LLAMA [24] and Med-PaLM 2 [25], has gained momentum.

基于Transformer的语言模型,如双向编码器表示模型(BERT) [17]和生成式预训练Transformer(GPT)系列[18, 19, 10],通过在各类任务中超越基于循环神经网络(RNN)的早期方法[20, 2, 1, 21],彻底改变了自然语言处理(NLP)领域。现有基于Transformer的语言模型主要可分为三类:掩码语言模型(如BERT)、自回归模型(如GPT-3)以及编码器-解码器模型(如双向自回归Transformer/BART[22])。近期,基于Transformer架构但规模显著扩增的超大语言模型(LLM)发展迅猛,包括ChatGPT和GPT-4[3]、Bloomz[23]、LLAMA[24]以及Med-PaLM 2[25]等。

The primary objective of LLMs is to accurately learn context-specific and domain-specific latent feature representations from input text [17, 3, 26, 27, 28, 29]. For example, the vector representation of the word "prescription" could vary significantly between the pharmacy domain and general usage. Smaller language models such as BERT often necessitate pre-training and supervised fine-tuning on downstream tasks to achieve satisfactory performance. However, LLMs typically do not require further fine-tuning and model updates but still deliver competitive performance on diverse downstream applications [30, 28, 31, 32, 33].

大语言模型的主要目标是从输入文本中准确学习特定上下文和特定领域的潜在特征表示 [17, 3, 26, 27, 28, 29]。例如,"prescription"一词的向量表示在药学领域和日常使用中可能存在显著差异。BERT等较小规模的语言模型通常需要对下游任务进行预训练和有监督微调才能达到理想性能。然而,大语言模型通常无需进一步微调和模型更新,仍能在多样化下游应用中展现出竞争力 [30, 28, 31, 32, 33]。

LLMs like ChatGPT and GPT-4 have significant potential in healthcare applications due to their advanced natural language understanding (NLU) capabilities. The massive amounts of text and other data generated in clinical practices can be leveraged through LLM-powered tools. LLMs can handle diverse and complex tasks like clinical triage classification [34], medical question-answering [35], HIPAA-compliant data anonymization [36], radiology report summarization [16], clinical information extraction [37] or dementia detection [38]. In addition, the Reinforcement Learning from Human Feedback (RLHF) [39] process incorporates human preferences and values into ChatGPT and its successor, GPT-4, making it particularly suitable for understanding patient-centered guidelines and human values in healthcare applications.

像ChatGPT和GPT-4这样的大语言模型(LLM)凭借其先进的自然语言理解(NLU)能力,在医疗健康应用中展现出巨大潜力。通过LLM驱动的工具,可以充分利用临床实践中产生的大量文本及其他数据。这类模型能处理多样且复杂的任务,例如临床分诊分类[34]、医学问答[35]、符合HIPAA标准的数据匿名化[36]、放射学报告摘要生成[16]、临床信息提取[37]以及痴呆症检测[38]。此外,基于人类反馈的强化学习(RLHF)[39]机制将人类偏好与价值观融入ChatGPT及其继任者GPT-4,使其特别适合理解医疗应用中以患者为中心的指导原则和人文关怀价值。

We aim to present the first study employing LLMs for a wide range of tasks and problems of interest to pharmacy practice in the intensive care unit (ICU).

我们旨在首次利用大语言模型(LLM)研究重症监护病房(ICU)药学实践中广泛关注的任务和问题。

2.2 Reasoning with LLMs

2.2 基于大语言模型的推理

LLMs have shown great potential in high-level reasoning abilities. For example, GPT-3 has demonstrated common-sense reasoning through in-context learning, and Wei et al. [40] found that LLMs can perform better in arithmetic, deductive, and common-sense reasoning when given carefully prepared sequential prompts, decomposing multi-step problems.

大语言模型在高阶推理能力方面展现出巨大潜力。例如,GPT-3通过上下文学习展示了常识推理能力,Wei等人[40]发现当提供精心设计的序列提示时,大语言模型在算术、演绎和常识推理任务中表现更优,能够有效分解多步骤问题。

Deductive reasoning: Wu et al. [35] compared the deductive reasoning abilities of ChatGPT and GPT-4 in the specialized domain of radiology, revealing that GPT-4 outperforms ChatGPT, with fine-tuned models requiring significant data to match GPT-4’s performance. The results of this investigation suggest that creating generic reasoning models based on LLMs for diverse tasks across various domains is viable and practical.

演绎推理:Wu等人[35]在放射学专业领域对比了ChatGPT与GPT-4的演绎推理能力,发现GPT-4表现更优,且微调模型需大量数据才能达到GPT-4水平。该研究表明基于大语言模型构建通用推理模型以应对跨领域多样化任务具有可行性和实用性。

Ma et al. [16] explored LLMs’ ability to comprehend radiology reports using a dynamic prompting paradigm. They developed an iterative optimization algorithm for automatic evaluation and prompt composition, achieving state-of-the-art performance on MIMIC-CXR and OpenI datasets without additional training data or LLM fine-tuning.

Ma et al. 探索了大语言模型 [16] 通过动态提示范式理解放射学报告的能力。他们开发了一种用于自动评估和提示组合的迭代优化算法,在 MIMIC-CXR 和 OpenI 数据集上实现了最先进的性能,且无需额外训练数据或对大语言模型进行微调。

Abductive reasoning: Jung et al. [41] improved LLMs’ ability to make logical explanations through maieutic prompting, but this study was limited to question-answering style problems. Zhong et al. [33] improved from the well-known ABL framework [42] and proposed the first comprehensive abductive learning pipeline based on LLMs.

溯因推理:Jung等人[41]通过启发式提示(maieutic prompting)提升了大语言模型(LLM)的逻辑解释能力,但该研究仅限于问答式问题。Zhong等人[33]从著名的ABL框架[42]改进而来,提出了首个基于大语言模型的完整溯因学习流程。

Chain-of-thought: Chain-of-thought reasoning (CoT) is a problem-solving approach that breaks complex problems into smaller, manageable steps, resembling how humans naturally solve complicated problems [43]. In the context of LLMs, CoT aims to improve the model’s accuracy and coherence by encouraging step-by-step reasoning. A zero-shot approach asking the LLM to "think step by step" [44] significantly improves reasoning performance across benchmarks. Providing LLMs with more examples (few-shot context learning) [40] enhances their performance through in-context learning. Overall, CoT enables LLMs to effectively tackle complex reasoning tasks.

思维链 (Chain-of-thought):思维链推理 (CoT) 是一种将复杂问题分解为更小、可管理步骤的解题方法,类似于人类自然解决复杂问题的方式 [43]。在大语言模型 (LLM) 的背景下,CoT 旨在通过鼓励逐步推理来提高模型的准确性和连贯性。采用零样本方法要求大语言模型"逐步思考" [44],可显著提升其在各类基准测试中的推理表现。而通过提供更多示例(少样本上下文学习)[40],能够借助上下文学习进一步提升大语言模型的性能。总体而言,CoT 使大语言模型能够有效应对复杂推理任务。
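The two CoT prompting styles discussed above can be expressed as simple prompt constructors. This is a hedged sketch: the zero-shot trigger phrase follows the "think step by step" formulation cited as [44], but the exact formatting around it is illustrative rather than the papers' verbatim templates.

```python
def zero_shot_cot(question):
    """Zero-shot chain-of-thought: append the trigger phrase so the
    model reasons step by step before answering."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot(question, demos):
    """Few-shot CoT: each demo is a (question, worked_reasoning) pair
    shown before the target question, enabling in-context learning."""
    context = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in demos)
    return f"{context}\n\nQ: {question}\nA:"
```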

3 Methods

3 方法

3.1 Data description

3.1 数据描述

This dataset was based on a study cohort of 1,000 adult patients admitted for at least 24 hours to a medical, surgical, neurosciences, cardiac, or burn ICU at the University of North Carolina Health System (UNCHS) between October 2015 and October 2020. Only the patient’s first ICU admission was included in the analysis. The data were extracted from the UNCHS Electronic Health Record (EHR) housed in the Carolina Data Warehouse (CDW) by a trained in-house data analyst.

该数据集基于2015年10月至2020年10月期间在北卡罗来纳大学医疗系统(UNCHS)内科、外科、神经科学、心脏或烧伤重症监护病房住院至少24小时的1000名成年患者研究队列。分析仅纳入患者的首次ICU入院记录。数据由经过培训的内部数据分析师从卡罗来纳数据仓库(CDW)中的UNCHS电子健康记录(EHR)提取。

The dataset contains patient demographics, medication administration record (MAR) information, and patient outcomes, such as common ICU complications. Demographics include age, sex, admission diagnosis, ICU type, Medication Regimen Complexity-Intensive Care Unit (MRC-ICU) score at 24 hours [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], and APACHE II score at 24 hours [56]. The MRC-ICU [57, 58] is a validated score summarizing the complexity of prescribed medications in the ICU [59, 60]. MAR information consists of drug, dose, route, duration, and timing of administration [61, 62, 63, ?].

数据集包含患者人口统计学信息、用药记录(MAR)信息以及患者结局指标,如常见ICU并发症。人口统计学数据涵盖年龄、性别、入院诊断、ICU类型、重症监护用药方案复杂性评分(MRC-ICU) [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]、24小时评分以及24小时APACHE II评分[56]。MRC-ICU评分[57, 58]是经过验证的、用于评估ICU处方药物复杂性的综合指标[59, 60]。用药记录信息包括药物名称、剂量、给药途径、持续时间和给药时间点[61, 62, 63, ?]。

Patient outcomes included mortality, ICU length of stay, delirium occurrence (defined by a positive CAM-ICU score), duration of mechanical ventilation, duration of vasopressor use, and acute kidney injury (defined by the presence of renal replacement therapy or a serum creatinine greater than 1.5 times baseline). In addition to textual descriptions of patient data, a binary value of 1 was assigned to indicate that a patient received a specific drug order, including drug, dose, strength, and formulation/route. Categorical features for patient outcomes were relabeled as numeric values, and any unknown or missing entries were counted as absences of that event.

患者结局指标包括死亡率、ICU住院时长、谵妄发生率 (按CAM-ICU评分阳性判定)、机械通气时长、血管加压药使用时长,以及急性肾损伤 (按接受肾脏替代治疗或血清肌酐值超过基线1.5倍判定)。除患者数据的文本描述外,采用二进制数值1表示患者接受了特定药物医嘱 (含药品名称、剂量、规格及剂型/给药途径)。患者结局的分类特征被重新编码为数值变量,所有未知或缺失实体均视为该事件未发生。
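The relabeling of categorical outcome features, with unknown or missing entries counted as absence of the event, can be sketched as below. The field names and the set of "positive" labels here are hypothetical, since the dataset's exact schema is not given in the text.

```python
def encode_outcomes(record, outcome_fields):
    """Relabel categorical outcome entries as numeric values; unknown
    or missing entries are counted as absence of the event (0)."""
    positive = {"yes", "y", "positive", "true", "1"}  # assumed labels
    encoded = {}
    for field in outcome_fields:
        value = record.get(field)
        if isinstance(value, str):
            encoded[field] = 1 if value.strip().lower() in positive else 0
        elif isinstance(value, (int, float)):
            encoded[field] = 1 if value else 0
        else:  # None / missing / unrecognized -> absence of event
            encoded[field] = 0
    return encoded
```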

3.2 Creating Interpretable Patient Clusters

3.2 创建可解释的患者聚类

Here, we describe the process for generating interpretable patient clusters using LLM embeddings and hierarchical clustering. The following pseudo-code algorithm outlines the steps:

我们在此描述利用大语言模型嵌入和层次聚类生成可解释患者分组的流程。以下伪代码算法概述了具体步骤:

  1. Feed patient information (i.e., age, sex, diagnosis, and ICD-10 problem list) into GPT-3 to generate embedding vectors of size 1536.
  2. Use the generated embeddings as input for hierarchical clustering to create patient clusters.

  1. 将患者信息 (即年龄、性别、诊断和ICD-10问题列表) 输入GPT-3,生成大小为1536的嵌入向量。
  2. 将生成的嵌入向量作为层次聚类的输入,创建患者集群。
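A self-contained sketch of the hierarchical clustering step: average-linkage agglomerative merging on cosine distances between embedding vectors. In practice an off-the-shelf implementation (e.g., scikit-learn's `AgglomerativeClustering`) would be used on the 1536-dimensional GPT-3 embeddings; this toy version only illustrates the bottom-up merging logic.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two vectors (1 - cosine similarity)."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def agglomerative_clusters(embeddings, n_clusters):
    """Bottom-up average-linkage clustering: start with singleton
    clusters of point indices and repeatedly merge the pair of
    clusters with the smallest average pairwise distance."""
    n = len(embeddings)
    clusters = [[i] for i in range(n)]
    # precompute pairwise cosine distances between points
    d = np.array([[cosine_dist(embeddings[i], embeddings[j]) for j in range(n)]
                  for i in range(n)])
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = np.mean([d[p][q] for p in clusters[i] for q in clusters[j]])
                if avg < best_d:
                    best_d, best = avg, (i, j)
        i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```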

Algorithm 1 Interpretable Patient Clustering

算法 1 可解释的患者聚类

The above algorithm describes the process of generating interpretable patient clusters using ChatGPT embeddings and hierarchical clustering. In this method, patient information is first transformed into 1536-dimensional embeddings using GPT-3 [10, 64]. These embeddings are then used as input for a hierarchical clustering algorithm, which groups patients based on the similarity of their embeddings. This approach aims to create accurate and interpretable patient clusters for use in clinical decisions.

上述算法描述了使用ChatGPT嵌入和层次聚类生成可解释患者群组的过程。该方法首先使用GPT-3 [10,64]将患者信息转换为1536维嵌入向量,随后将这些嵌入作为层次聚类算法的输入,根据嵌入相似度对患者进行分组。该方案旨在构建精确且可解释的患者群组,以辅助临床决策。

3.3 Iterative Optimization

3.3 迭代优化

In the iterative optimization process for PharmacyGPT, we aimed to enhance the performance of the model when generating various patient-related outputs, such as mortality, length of ICU stay, APACHE II score range, or medication plan at 24 hours. This optimization process is based on an iterative feedback loop, which adjusts the input prompts provided to the model based on its performance in previous iterations.

在针对PharmacyGPT的迭代优化过程中,我们的目标是提升模型在生成各类患者相关输出时的性能,例如死亡率、ICU住院时长、APACHE II评分范围或24小时用药方案。该优化过程基于迭代反馈循环机制,会根据模型在前几轮迭代中的表现来调整其输入提示词。

Algorithm 2 Iterative Optimization Algorithm for Patient Data

算法 2: 患者数据迭代优化算法


Figure 2: Initial prompt used for medication plan generation

图 2: 用于生成用药方案的初始提示

Initially, we inputted a dynamic prompt containing a patient portrait based on demographics and symptoms to the ChatGPT model (see Figure 2). After this iteration, we calculated the ROUGE-1 score by comparing the model-generated output with the ground truth of the medications ordered in the first 24 hours. Based on the score, the input prompt was modified for the next iteration. If the score was above a predefined threshold, we merged the prompt with the generated output in a manner that encouraged the model to produce a similar response in the next iteration. Otherwise, we modified the prompt to discourage the model from producing the same type of output. The iterative process continued until a predetermined number of iterations was reached. Through this optimization method, PharmacyGPT learned to improve its predictions and recommendations over time, as it is continually guided by the evaluation score and feedback from previous iterations. This approach enables PharmacyGPT to adapt and refine its output based on a better understanding of the patient’s condition and the desired outcomes. The final prompt used can be seen in Figure 3.

最初,我们向ChatGPT模型输入了一个包含基于人口统计数据和症状的患者画像的动态提示(见图2)。在此次迭代后,我们通过比较模型生成的输出与最初24小时内所开药物的真实情况,计算了ROUGE-1分数。根据该分数,我们对下一次迭代的输入提示进行了修改。如果分数超过预设阈值,我们会以鼓励模型在下次迭代中产生类似响应的方式,将提示与生成的输出合并。否则,我们会修改提示以防止模型产生相同类型的输出。这一迭代过程持续进行,直到达到预定的迭代次数。通过这种优化方法,PharmacyGPT学会了随着时间的推移改进其预测和建议,因为它持续受到评估分数和先前迭代反馈的指导。这种方法使PharmacyGPT能够基于对患者状况和预期结果的更好理解来调整和完善其输出。最终使用的提示如图3所示。
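One step of the feedback loop described above can be sketched as follows. Two caveats: the ROUGE-1 scorer here is a set-based simplification (duplicate tokens are ignored), and the reinforcement/discouragement phrasing appended to the prompt is hypothetical, since the paper's exact instruction wording is shown only in Figures 2 and 3.

```python
def rouge1_f1(candidate, reference):
    """Simplified unigram-overlap ROUGE-1 F1 between two strings
    (duplicate tokens ignored)."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if not cand or not ref or overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def refine_prompt(prompt, output, reference, threshold=0.3):
    """One iteration of the feedback loop: reinforce outputs scoring
    above the threshold, discourage those below it."""
    score = rouge1_f1(output, reference)
    if score >= threshold:
        return prompt + f"\nA previous good plan was: {output}\nProduce a similar plan."
    return prompt + f"\nAvoid plans like: {output}\nTry a different approach."
```

In the full algorithm, `refine_prompt` is applied repeatedly until a predetermined number of iterations is reached, with each generated plan scored against the ground-truth 24-hour medication orders.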


Figure 3: Final prompt used for medication plan generation

图 3: 用于生成用药方案的最终提示词

4 Results

4 结果

4.1 Interpretable clustering

4.1 可解释聚类

Our clustering methodology yielded clusters that closely aligned with the ICD-10 code categories of patients, indicating a high level of interpretability (see Figure 4). This alignment demonstrates the effectiveness of our approach in generating meaningful and coherent patient groupings.

我们的聚类方法产生的聚类结果与患者的ICD-10代码类别高度吻合,显示出良好的可解释性(见图4)。这种一致性证明了我们的方法在生成有意义且连贯的患者分组方面的有效性。

The group labeled as "Diverse symptoms with neurological impact" is the most sparsely spread among the clusters, exhibiting a diverse range of symptoms on the surface. However, our embedding analysis reveals that the majority of patients in this group share a common underlying condition: neurological disorders. This observation suggests that despite the presence of various other symptoms, patients in this group are primarily affected by neurological issues (see Figure 5). Our approach successfully uncovered this underlying similarity, demonstrating the capability of our clustering method in identifying meaningful connections between patients with seemingly disparate symptom presentations.

标记为"具有神经影响的多样化症状"的群体在各聚类中分布最为稀疏,表面上呈现出一系列多样化的症状。然而,我们的嵌入分析表明,该组大多数患者都有一个共同的潜在病症:神经系统疾病。这一观察结果表明,尽管存在各种其他症状,该组患者主要受到神经系统问题的影响 (见图 5)。我们的方法成功揭示了这种潜在相似性,证明该聚类方法能够识别症状表现看似迥异的患者之间的有意义的关联。

4.2 Predicting patient outcomes

4.2 预测患者疗效

Hospital Mortality We used ChatGPT to predict hospital mortality based on patient information including age, race, sex, ICD-10 admission diagnosis, and comorbidities. Table 1 results indicate that imbalanced data significantly reduced precision and F1 scores, as the deceased class in the test set contained only 46 samples, while ChatGPT tended to make predictions in a balanced manner. Also, different prompts did not contribute much to improvement. Specifically, the last three rows attempted to align the diagnosis categories of the test set with those demonstrated in the prompts: Rand-5 randomly selects five demonstrations from the training set and incorporates them into the prompt; Freq-5 randomly selects one demonstration from the top five most frequent diagnosis categories; Bcat-rand-5 randomly chooses five samples with the same diagnosis category as the tested sample; and Sim-5 selects the five most similar patient information records as demonstrations.

院内死亡率
我们使用ChatGPT基于患者信息(包括年龄、种族、性别、ICD-10入院诊断和合并症)预测院内死亡率。表1结果显示,数据不平衡显著降低了精确率和F1分数,因为测试集中的死亡类别仅包含46个样本,而ChatGPT倾向于以平衡方式做出预测。此外,不同的提示词对改进效果贡献不大。具体而言,最后三行尝试将测试集的诊断类别与提示词中展示的类别对齐:Rand-5从训练集中随机选择五个示例并纳入提示词;Freq-5从前五个最频繁的诊断类别中随机选择一个示例;Bcat-rand-5随机选择五个与测试样本具有相同诊断类别的样本;Sim-5选择五个最相似的患者信息记录作为示例。


Figure 4: Interpretable clusters for real ICU data. Patient age, sex, ICD-10 diagnosis, and ICD-10 problem list were input to generate ChatGPT embeddings and hierarchical clustering as depicted above, which resulted in meaningful cluster separation. These patient clusters were created by embeddings generated by ADA-002, clustered via agglomerative clustering techniques, and visualized using t-SNE (t-distributed Stochastic Neighbor Embedding).

图 4: 真实ICU数据的可解释聚类。将患者年龄、性别、ICD-10诊断和ICD10问题清单输入生成ChatGPT嵌入向量,并采用如上所示的层次聚类方法,最终形成具有临床意义的簇群划分。这些患者簇群是通过ADA-002生成的嵌入向量,使用凝聚聚类技术进行聚类,并借助T-SNE(t分布随机邻域嵌入)实现可视化呈现。

Table 1: Mortality prediction

Model Accuracy Precision Recall F1 Score
rand_5-shot 0.7549 0.3750 0.7095 0.4898
freq_5-shot 0.6455 0.1667 0.6364 0.2642
bcat-rand_5-shot 0.6602 0.2051 0.7647 0.4262
sim_5-shot 0.6699 0.2821 0.6471 0.3929

表 1: 死亡率预测

Model Accuracy Precision Recall F1 Score
rand_5-shot 0.7549 0.3750 0.7095 0.4898
freq_5-shot 0.6455 0.1667 0.6364 0.2642
bcat-rand_5-shot 0.6602 0.2051 0.7647 0.4262
sim_5-shot 0.6699 0.2821 0.6471 0.3929

APACHE II We employed ChatGPT and GPT-4 to predict APACHE II scores based on patient information (age, race, sex, ICD-10 admission diagnosis, and comorbidities). Table 2 results reveal that GPT-4 significantly enhanced accuracy. Accuracy is the ratio of correctly classified instances to the total number of instances. Precision focuses on the proportion of instances where the model predicts a positive outcome that actually turns out to be positive. It is a measure of the accuracy of positive class predictions. Recall is the model’s ability to correctly identify positive classes. It is the proportion of correct identifications by the model from all actual positive examples. The F1 score is the harmonic mean of precision and recall, and it attempts to find a balance between the two.

APACHE II 我们使用 ChatGPT 和 GPT-4 基于患者信息(年龄、种族、性别、ICD-10入院诊断及合并症)预测 APACHE II 评分。表 2 结果显示,GPT-4 显著提升了预测准确率。准确率指正确分类实例占总实例的比例。精确率关注模型预测为正类且实际为正类的实例占比,用于衡量正类预测的准确性。召回率反映模型正确识别正类的能力,即模型从所有实际正例中正确识别的比例。F1分数是精确率与召回率的调和平均数,旨在平衡二者关系。
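The four metrics defined above can be computed directly from the label counts; this is a minimal sketch for the binary case.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task,
    following the definitions given in the text."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, precision, recall, f1
```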



Figure 5: Here, dots represent patients who were clustered to the neurology diagnosis group, with black dots showing a neurology-associated ICD-10 diagnosis.

图 5: 此处圆点代表被归类到神经科诊断组的患者,黑色圆点表示具有神经科关联的ICD10诊断。

Table 2: APACHE II score prediction

Model Accuracy
rand_5-shot 0.1818
freq_5-shot 0.1091
bcat-rand_5-shot 0.2000
sim_5-shot 0.1636
GPT-4_rand_5 0.3727
GPT-4_sim_5 0.4364

表 2: APACHE II 评分预测

模型 准确率
rand_5-shot 0.1818
freq_5-shot 0.1091
bcat-rand_5-shot 0.2000
sim_5-shot 0.1636
GPT-4_rand_5 0.3727
GPT-4_sim_5 0.4364

4.3 Prescribing medication plans

4.3 制定用药方案

We employed GPT-4 to generate medication plans for ICU patients using the same patient information as above. We then compared these plans to the medication plans at the 24-hour point used in practice. Results can be seen in Table 3. ROUGE-1, ROUGE-2, and ROUGE-L are the three main ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics used to evaluate automatic

我们使用GPT-4基于相同的患者信息为ICU患者生成用药方案,并将这些方案与实际应用中24小时节点的用药方案进行对比。结果如表3所示。ROUGE-1、ROUGE-2和ROUGE-L是用于评估自动摘要的三种主要ROUGE (Recall-Oriented Understudy for Gisting Evaluation)指标。


Figure 6: Medication plan generated by GPT-4 v.s. the actual plan

图 6: GPT-4生成的用药方案与实际方案对比

This application requires evaluation by an expert panel of pharmacists, who will also devise evaluation metrics and supply additional information to help GPT-4 (we suspect it does not have access to other patient records).

需要由药剂师专家小组进行评估,该小组还将制定评估指标并提供额外信息以辅助GPT-4 (我们推测它无法访问其他患者记录)。

text summarization. ROUGE-1 focuses on the overlap of words (1-grams), and ROUGE-2 is based on the overlap of bigrams (2-grams). ROUGE-L focuses on the Longest Common Subsequence (LCS). This method does not require consecutive word sequence matching but rather identifies the longest sequence of sequentially matching words between the candidate abstract and the reference abstract.

文本摘要。ROUGE-1关注单词(1-grams)的重叠率,ROUGE-2基于二元词组(2-grams)的重叠率。ROUGE-L则聚焦最长公共子序列(LCS)。该方法不要求连续词序匹配,而是识别候选摘要与参考摘要之间最长的顺序匹配词序列。
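The LCS-based ROUGE-L described above can be sketched with a standard dynamic program. This is a minimal sketch: it computes a plain F1 from LCS-based precision and recall, without the length-weighted beta variant used in some ROUGE implementations.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists:
    tokens must match in order but need not be consecutive."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: precision and recall derived from the LCS length
    over the candidate and reference token counts."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```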

Rouge-1 Rouge-2 Rouge-L
PharmacyGPT 0.0739 0.0069 0.0519
LLAMA2 0.013 0.0001 0.0101
ChatGPT 0.0404 0.0016 0.0259
GPT4 0.0466 0.0034 0.0287
Rouge-1 Rouge-2 Rouge-L
PharmacyGPT 0.0739 0.0069 0.0519
LLAMA2 0.013 0.0001 0.0101
ChatGPT 0.0404 0.0016 0.0259
GPT4 0.0466 0.0034 0.0287

Table 3: Rouge score of each model. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is an evaluation metric widely used in the field of natural language processing, especially in text summarization and machine translation. It is mainly used to evaluate the quality of automatically generated summaries or translations. The core idea of ROUGE is to evaluate candidate abstracts by comparing them to a set of reference abstracts (usually human-written).

表 3: 各模型的Rouge得分。ROUGE (Recall-Oriented Understudy for Gisting Evaluation) 得分是自然语言处理领域广泛使用的评估指标,尤其在文本摘要和机器翻译任务中。该指标主要用于评估自动生成的摘要或翻译的质量。ROUGE的核心思想是通过将候选摘要与一组参考摘要 (通常由人工撰写) 进行对比来评估其优劣。

5 Discussion

5 讨论

We observed that the GPT-4 model fed with dynamic context and similar samples achieved the highest accuracy among all tested models. This demonstrates the potential of GPT-4 for highly specialized domains when applied with carefully designed context and sample selection. However, we found that the precision and recall scores were not particularly high for any of the approaches, which may be due to imbalances in the dataset (i.e., alive patients outnumbering deceased patients by a 9 to 1 ratio). This imbalance could have adversely impacted the performance of the models, particularly in terms of precision and recall. Future work could involve addressing this data imbalance and exploring other evaluation metrics better suited for imbalanced datasets to gain a more comprehensive understanding of model performance.

我们观察到,在输入动态上下文和相似样本的情况下,GPT-4模型在所有测试模型中达到了最高准确率。这表明当配合精心设计的上下文和样本选择时,GPT-4在高度专业化领域具有应用潜力。然而,我们发现所有方法的精确率和召回率得分均未达到较高水平,这可能是由于数据集存在不平衡问题(即存活患者与死亡患者的比例为9:1)。这种不平衡可能对模型性能产生了负面影响,特别是在精确率和召回率方面。未来工作可包括解决数据不平衡问题,并探索更适合不平衡数据集的其他评估指标,以更全面地理解模型性能。

5.1 Addressing AI anxiety in healthcare

5.1 应对医疗领域的AI焦虑

The issue of the ’black-box effect’ and associated AI anxiety in healthcare [65] must be addressed [66]. It is crucial to recognize the potential benefits of AI, such as improved patient care quality and increased efficiency in clinical practice, while also acknowledging the limitations of current AI systems, specifically large language models (LLMs) like ChatGPT and GPT-4 [4, 2].

医疗领域中的"黑箱效应"及相关AI焦虑问题[65]亟待解决[66]。在认识到AI潜在优势(如提升患者护理质量、优化临床实践效率)的同时,也必须正视现有AI系统(特别是ChatGPT、GPT-4等大语言模型[4,2])的局限性。

Our study has demonstrated that LLMs can provide valuable suggestions and identify patterns in complex ICU patient data, but they possess limitations that underscore the continued importance of human expertise in healthcare. One such limitation is the lack of access to comprehensive patient histories and records. This constraint prevents LLMs from fully understanding the context of a patient’s condition, which is crucial for making accurate and relevant recommendations.

我们的研究表明,大语言模型(LLM)能够为复杂ICU患者数据提供有价值的建议并识别模式,但其局限性凸显了医疗领域人类专业知识的持续重要性。其中一个局限是无法获取完整的患者病史和记录,这一限制使大语言模型无法充分理解患者病情的背景信息,而这对于做出准确且相关的建议至关重要。

Moreover, LLMs struggle to comprehend complex clinical information or produce nuanced and detailed clinical descriptions [67]. These models are primarily trained on textual data and may not have sufficient exposure to specialized medical terminology, complex concepts, or rare conditions, which are vital for effective decision-making in healthcare [67, 68].

此外,大语言模型难以理解复杂的临床信息或生成细致入微的临床描述 [67]。这些模型主要基于文本数据训练,可能对专业医学术语、复杂概念或罕见病症接触不足,而这些要素对医疗领域的有效决策至关重要 [67, 68]。

Another area where LLMs face challenges (to some degree) is in interpreting time series data [69, 70] and numerical data [71]. The dynamic nature of healthcare, particularly the ICU, where clinical status can change in minutes, often requires an understanding of changes in a patient’s condition over time and the ability to analyze numerical information, such as laboratory test results and vital signs. LLMs currently lack the sophistication to fully grasp these aspects of patient care.

大语言模型面临挑战的另一个领域(在一定程度上)是对时间序列数据[69,70]和数值数据[71]的解读。医疗护理尤其是重症监护病房(ICU)的动态特性,患者的临床状态可能在几分钟内发生变化,通常需要理解患者病情随时间的变化,并具备分析实验室检测结果和生命体征等数值信息的能力。目前大语言模型尚缺乏完全掌握这些患者护理方面的成熟度。

The findings of our study suggest that there is no inherent contradiction between embracing AI technologies and preserving the roles of healthcare professionals. Instead, LLMs can be seen as a valuable complement to human expertise, providing support and suggestions while healthcare professionals retain ultimate responsibility for patient care. By combining the strengths of AI with the insights and experience of medical professionals, we can work towards a more efficient and effective healthcare system that benefits patients, providers, and society.

我们的研究结果表明,采用AI技术与保留医疗专业人员角色之间并不存在固有矛盾。相反,大语言模型(LLM)可被视为人类专业知识的宝贵补充——在医疗专业人员保持对患者照护最终责任的同时,AI能提供支持与建议。通过结合AI优势与医疗专业人员的洞察力和经验,我们有望建立一个更高效、更有效的医疗体系,使患者、医疗服务提供者和社会共同受益。

5.2 Improving LLMs for pharmacy

5.2 提升药学领域的大语言模型

Optimizing PharmacyGPT for clinical scenarios will require further engineering. First, training LLMs on domain-specific patient data [11, 21] is an established strategy for enhancing their performance in specialized domains. By exposing the model to a larger amount of pharmacy-related data, it will acquire a deeper understanding of the domain’s nuances, terminology, and knowledge. This targeted training can lead to improvements in the model’s ability to generate relevant, accurate, and contextually appropriate responses for various pharmacy tasks.

优化Pharmacy GPT以适用于临床场景需要进一步的工程改进。首先,在特定领域患者数据上训练大语言模型 [11, 21] 是提升其在专业领域性能的成熟策略。通过让模型接触更多药学相关数据,它将更深入理解该领域的细节、术语和知识。这种针对性训练可以提升模型在各种药学任务中生成相关、准确且符合语境的回答能力。

Second, refining the model architecture to better align with the specific tasks and objectives in pharmacy can also lead to improvements. By designing customized architectures that cater to the unique requirements of pharmacy applications, LLMs can become more effective in addressing domain-specific challenges [21]. This process may involve the development of task-specific layers, attention mechanisms, or other architectural components [13, 21] that are tailored to the needs of pharmacy-related tasks.

其次,通过改进模型架构以更好地适应药学领域的特定任务和目标也能带来提升。通过设计符合药学应用独特需求的定制化架构,大语言模型能更有效地应对该领域的特定挑战[21]。这一过程可能涉及开发针对药学相关任务需求的任务专用层、注意力机制或其他架构组件[13, 21]。

Moreover, localizing LLMs to adhere to Health Insurance Portability and Accountability Act (HIPAA) patient guidelines [36] is crucial for processing sensitive patient data in hospitals. Ensuring that LLMs are compliant with privacy and security regulations is a vital aspect of their adoption in clinical settings [72]. One approach to achieve this is to develop local, on-premises versions of LLMs that can be integrated into hospital systems, thereby eliminating the need to transmit sensitive data over the internet. Alternatively, techniques such as federated learning [73] or secure multi-party computation [74] can be employed to train LLMs on sensitive data while maintaining patient privacy.

此外,将大语言模型本地化以符合《健康保险可携性和责任法案》(HIPAA) 的患者指南 [36] 对于医院处理敏感患者数据至关重要。确保大语言模型符合隐私和安全法规是其在临床环境中采用的关键方面 [72]。实现这一点的一种方法是开发可集成到医院系统中的本地化大语言模型版本,从而消除通过互联网传输敏感数据的需求。或者,可以采用联邦学习 [73] 或安全多方计算 [74] 等技术在保护患者隐私的同时利用敏感数据训练大语言模型。
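
The federated option can be sketched in a few lines. In FedAvg-style training, each hospital site updates a copy of the model on its own data and only the resulting weights travel to a central server; the two-site setup and toy gradient values below are hypothetical:

```python
def local_update(weights, gradient, lr=0.1):
    """One local training step at a hospital site; raw patient data never leaves."""
    return [w - lr * g for w, g in zip(weights, gradient)]

def federated_average(site_weights):
    """The server averages the weight vectors submitted by each site."""
    n = len(site_weights)
    return [sum(ws) / n for ws in zip(*site_weights)]

# Two hypothetical hospital sites start from the same global model.
global_model = [0.0, 0.0]
site_a = local_update(global_model, gradient=[1.0, -2.0])  # trained on site A's private data
site_b = local_update(global_model, gradient=[3.0, 2.0])   # trained on site B's private data
global_model = federated_average([site_a, site_b])
print(global_model)  # only weights, never patient records, were exchanged
```

Real deployments add secure aggregation and differential privacy on top of this loop, but the data-locality property shown here is the core of the HIPAA argument.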

5.3 Building AI-friendly datasets in pharmacy

5.3 构建药学领域的AI友好型数据集

Our results indicate a strong need to create LLM-friendly datasets in pharmacy.

我们的结果表明,在药学领域亟需创建适合大语言模型 (LLM) 的数据集。

The foremost approach is to collect more detailed text-based patient information. This process may include extensive documentation of patient histories, treatment plans, and progress notes, which would provide LLMs with a wealth of contextual information for generating accurate and relevant responses. Additionally, incorporating data on patient symptoms, along with standardized clinical assessment scores such as APACHE II [56], SOFA [75], and MRC-ICU [57], would offer LLMs a more comprehensive understanding of patient conditions, enabling them to make better-informed predictions and recommendations.

首要方法是收集更详细的基于文本的患者信息。这一过程可能包括对患者病史、治疗方案和进展记录的详尽文档记录,从而为大语言模型提供丰富的上下文信息,以生成准确且相关的回答。此外,整合患者症状数据及标准化临床评估评分 (如 APACHE II [56]、SOFA [75] 和 MRC-ICU [57]) 将使大语言模型更全面地理解患者状况,从而做出更明智的预测和建议。
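
As a sketch of what such an LLM-friendly record might look like, the snippet below serializes a structured patient entry, including standardized severity scores, into a text context. The field names and example values are hypothetical, not drawn from the study data:

```python
def patient_to_prompt(record):
    """Serialize a structured patient record, including standardized severity
    scores, into the kind of text context an LLM can consume."""
    lines = [
        f"Age: {record['age']}, Sex: {record['sex']}",
        f"Admission diagnosis: {record['diagnosis']}",
        f"APACHE II: {record['apache_ii']}, SOFA: {record['sofa']}, "
        f"MRC-ICU: {record['mrc_icu']}",
        "Symptoms: " + "; ".join(record["symptoms"]),
    ]
    return "\n".join(lines)

record = {  # hypothetical, de-identified example
    "age": 64, "sex": "F",
    "diagnosis": "septic shock",
    "apache_ii": 24, "sofa": 9, "mrc_icu": 14,
    "symptoms": ["fever", "hypotension", "altered mental status"],
}
print(patient_to_prompt(record))
```

Keeping the serialization deterministic (fixed field order, fixed score names) also makes it easier to audit exactly what information the model saw for each prediction.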

Extending the duration of data collection beyond traditional 24-hour or 72-hour windows is another important consideration. By capturing a more extensive timeline of patient information, LLMs can gain deeper insights into the progression of diseases and the effectiveness of different treatment strategies. This longitudinal data would allow LLMs to identify trends, assess long-term outcomes, and provide more accurate predictions for patient recovery and risk factors.

延长数据收集周期至超出传统的24小时或72小时窗口是另一个重要考量。通过获取更长时间跨度的患者信息,大语言模型(LLM)能够更深入地洞察疾病进展过程和各种治疗策略的有效性。这类纵向数据将使大语言模型能够识别趋势、评估长期疗效,并为患者康复和风险因素提供更精准的预测。

Furthermore, standardizing drug descriptions, including dosage numbers and frequencies, is crucial for creating an LLM-friendly dataset in pharmacy. This standardization can help eliminate ambiguities and inconsistencies in the data, ensuring that LLMs can accurately interpret and generate relevant information about medications. Moreover, the inclusion of drug interactions, contraindications, and side effects can further enhance the LLM’s ability to provide safe and effective medication recommendations. The first steps towards standardizing ICU medication data for AI have begun [7].

此外,标准化药物描述(包括剂量数字和频率)对于构建药学领域的大语言模型友好型数据集至关重要。这种标准化有助于消除数据中的歧义和不一致,确保大语言模型能准确解读并生成相关药物信息。同时,纳入药物相互作用、禁忌症和副作用等内容,可进一步提升大语言模型提供安全有效用药建议的能力。针对AI的ICU用药数据标准化工作已迈出第一步 [7]。
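
A minimal illustration of such standardization is parsing free-text orders into structured fields. The small grammar below covers only a few common dose, route, and frequency patterns and is a sketch, not a complete order parser:

```python
import re

ORDER_RE = re.compile(
    r"(?P<drug>[a-z][a-z\s\-]*?)\s+"                      # drug name
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|mcg|units?)\s+"  # dose + unit
    r"(?P<route>iv|po|im|sc)\s+"                          # route
    r"(?P<freq>q\d+h|daily|bid|tid|qid)",                 # frequency
    re.IGNORECASE,
)

def parse_order(text):
    """Parse a free-text medication order into standardized fields,
    or return None if it does not match the grammar."""
    m = ORDER_RE.search(text)
    if m is None:
        return None
    d = m.groupdict()
    d["drug"] = d["drug"].strip().lower()
    d["dose"] = float(d["dose"])
    d["unit"] = d["unit"].lower()
    d["route"] = d["route"].upper()
    d["freq"] = d["freq"].lower()
    return d

print(parse_order("Vancomycin 1 g IV q12h"))
```

Normalizing case, units, and frequency codes this way removes the surface-level variability ("1g q12", "1000 mg twice daily") that otherwise forces the model to learn many spellings of the same order.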

Incorporating diverse data sources (e.g., electronic health records, clinical trial results, and relevant scientific literature) can provide LLMs with a more comprehensive understanding of the pharmacy domain. By integrating these diverse sources of information, LLMs can develop a more robust knowledge base, enabling them to generate better-informed responses and recommendations for various pharmacy tasks.

整合多样化数据源(如电子健康档案、临床试验结果和相关科学文献)可为大语言模型提供更全面的药学领域认知。通过融合这些多元信息渠道,大语言模型能够构建更完善的知识体系,从而在各类药学任务中生成更具洞察力的响应与建议。

5.4 Define appropriate NLP tasks and metrics for pharmacy

5.4 为药学领域定义合适的自然语言处理任务和指标

To effectively apply LLMs to the pharmacy domain, it is essential to define relevant NLP tasks and develop suitable evaluation metrics. Common NLP tasks include question answering (QA) and summarization [3], while specialized tasks might involve AI-generated prescription plans and outcome predictions. Traditional NLP evaluation metrics, such as n-gram-based ROUGE scores [76] and METEOR scores [77], may not be reasonably applied to these specialized tasks, as they prioritize similarity between generated and reference texts rather than accuracy, safety, and adherence to clinical guidelines.

为有效将大语言模型(LLM)应用于药学领域,需明确相关自然语言处理(NLP)任务并制定合适的评估指标。常见NLP任务包括问答(QA)和摘要生成[3],而专业任务可能涉及AI生成处方方案和疗效预测。传统NLP评估指标(如基于n-gram的ROUGE分数[76]和METEOR分数[77])可能不适用于这些专业任务,因为这些指标优先考虑生成文本与参考文本的相似性,而非准确性、安全性及临床指南依从性。
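
A simplified unigram ROUGE makes the problem concrete: two medication plans can overlap almost completely at the surface level while recommending clinically opposite actions. (The scoring function below is a stripped-down ROUGE-1 F1, not the official implementation.)

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1, a simplified ROUGE-1."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "start vancomycin 1 g IV q12h"
candidate = "hold vancomycin 1 g IV q12h"  # clinically the opposite recommendation
print(round(rouge1_f1(reference, candidate), 2))  # → 0.83
```

A single-word difference flips the clinical meaning while the n-gram score stays high, which is precisely why similarity metrics alone cannot certify a prescription plan.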

To address this challenge, domain experts should collaborate to develop specialized evaluation metrics that capture the unique aspects of pharmacy-related tasks. For instance, metrics could assess the correctness of AI-generated prescriptions or the alignment of predicted outcomes with actual patient outcomes, considering sensitivity, specificity, and overall accuracy.

为解决这一挑战,领域专家应协作开发能捕捉药学任务独特性的专项评估指标。例如,可设计指标评估AI生成处方的正确性,或预测结果与实际患者疗效的吻合度(需综合考虑灵敏度、特异性和整体准确性)。
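
For outcome predictions, these metrics reduce to standard confusion-matrix arithmetic. A sketch with hypothetical binary outcomes (1 = adverse outcome occurred, 0 = did not):

```python
def outcome_metrics(actual, predicted):
    """Sensitivity, specificity, and accuracy for binary outcome predictions."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / len(actual),
    }

# Hypothetical outcomes for eight patients.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
print(outcome_metrics(actual, predicted))
```

Reporting sensitivity and specificity separately matters clinically: a model that misses adverse outcomes (low sensitivity) can look acceptable on accuracy alone when such outcomes are rare.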

Defining appropriate NLP tasks and developing tailored evaluation metrics for pharmacy applications is crucial for the successful integration of LLMs in this domain. Collaboration between NLP researchers and pharmacy experts is essential to ensure that these models can effectively contribute to improved patient care and clinical practice efficiency.

为药学应用定义合适的自然语言处理(NLP)任务并开发定制化评估指标,对于大语言模型在该领域的成功整合至关重要。NLP研究人员与药学专家之间的协作必不可少,以确保这些模型能有效促进患者护理改善和临床实践效率提升。

Because ICU pharmacy regimens are complex and require individualized decision-making, critical care pharmacists, who are experts in this domain, were involved in the evaluation process. Ultimately, critical care pharmacists need to review the medication plans generated by GPT-4 to assess their reasonableness and clinical applicability. This will help ensure that the AI-generated plans align with current best practices and account for the specific needs of each patient.

It is also crucial to devise evaluation metrics that extend beyond traditional n-gram-based metrics, such as ROUGE, which may not be sufficient for capturing the nuances of this new task. Developing tailored evaluation metrics for assessing the performance of AI-generated medication plans will enable a more comprehensive understanding of the models’ capabilities and limitations. This, in turn, will facilitate improvements in the models and promote the responsible integration of AI in the field of clinical pharmacy.

Overall, the outcome and medication prediction results demonstrate a promising approach to using large language models for rapidly assessing patient situations and providing further guidance on medication and treatment plans.

由于ICU药学方案的复杂性和个体化特性需要个性化决策,因此该领域的专家——重症监护药师参与了评估流程。最终,重症监护药师需要审核GPT-4生成的用药方案,评估其合理性与临床适用性。这将有助于确保AI生成的方案符合当前最佳实践,并兼顾每位患者的特殊需求。

设计超越传统n-gram指标(如ROUGE)的评价体系也至关重要,因为这些传统指标可能无法充分捕捉这一新任务的细微差别。开发专门用于评估AI生成用药方案性能的定制化指标,将有助于更全面地理解模型的能力与局限,进而推动模型优化,促进AI在临床药学领域的负责任整合。

总体而言,预测结果和用药方案表明,利用大语言模型快速评估患者状况并为用药治疗方案提供进一步指导,是一种极具前景的方法。

5.5 Developing Multimodal Foundation Models for Pharmacy

5.5 开发面向药学领域的多模态基础模型

As the field of artificial intelligence has evolved, the concept of multimodal learning, encompassing the ability to process and integrate data from various sources such as text, images, audio, and structured data, has gained substantial attention [4]. This approach is of great value in the pharmacy domain, given the wide variety of data types available, from textual patient notes to numerical lab results, diagnostic images, and structured demographic or medication data.

随着人工智能领域的发展,能够处理和整合文本、图像、音频及结构化数据等多源信息的多模态学习 (multimodal learning) 概念获得了广泛关注 [4]。鉴于药学领域存在从文本病历、数值化检验结果、诊断影像到结构化人口统计或用药数据等多种数据类型,该方法具有重要应用价值。

Transformer-based architectures, like those found in large language models (LLMs) such as GPT-4, are well-suited to handle diverse types of sequence data and learn intricate patterns within them [21, 4]. These capabilities make them an excellent foundation for developing multimodal models in healthcare and pharmacy. Transformers have proven performance in NLP [1] and have demonstrated success in domains traditionally dominated by convolutional neural network (CNN) based models, including computer vision [78, 79, 80] and medical image applications [78, 81, 82, 83, 84, 85, 86, 87]. Additionally, they have been utilized in other modalities such as modeling human brain functions [88, 89]. For example, transformers power vision foundation models such as the Segment Anything Model (SAM) [90, 91, 92]. This demonstrates their flexibility and potential in handling multiple data modalities [4].

基于Transformer的架构(例如GPT-4等大语言模型中采用的架构)特别适合处理各类序列数据并学习其中的复杂模式[21,4]。这些能力使其成为开发医疗和制药领域多模态模型的理想基础。Transformer在自然语言处理(NLP)领域已展现出卓越性能[1],并在传统由卷积神经网络(CNN)主导的领域(包括计算机视觉[78,79,80]和医学影像应用[78,81,82,83,84,85,86,87])取得了成功。此外,该架构还被应用于人脑功能建模等其他模态研究[88,89]。例如,Segment Anything Model(SAM)等视觉基础模型就采用了该架构[90,91,92],这充分展现了其处理多模态数据的灵活性和潜力[4]。

Multimodal models for pharmacy would involve creating distinct input pathways for each type of data, each leveraging the self-attention mechanisms of transformers. After processing each modality independently, the outputs can be integrated to generate a comprehensive patient representation. This fusion can be guided by various strategies that determine which features from each modality are most relevant for a given task.

药房多模态模型将为每种数据类型创建独立的输入路径,每条路径都利用Transformer的自注意力机制。在独立处理每种模态后,可将输出结果整合以生成全面的患者表征。这种融合可通过多种策略实现,并能确定每种模态中哪些特征与特定任务最相关。
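
One way to realize such fusion is attention-style weighting over per-modality embeddings. The toy sketch below, with made-up 3-dimensional embeddings, scores each modality against a task query vector and returns their weighted combination; a real model would learn the encoders and query end to end:

```python
import math

def fuse(modality_vectors, query):
    """Attention-weighted late fusion: score each modality embedding against a
    task query, softmax the scores, and return the weighted sum."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [dot(vec, query) for vec in modality_vectors.values()]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    fused = [sum(w * vec[i] for w, vec in zip(weights, modality_vectors.values()))
             for i in range(dim)]
    return fused, dict(zip(modality_vectors, weights))

# Toy embeddings for three modalities (values are illustrative).
modalities = {
    "notes":  [0.9, 0.1, 0.0],
    "labs":   [0.2, 0.8, 0.1],
    "images": [0.1, 0.2, 0.7],
}
fused, weights = fuse(modalities, query=[1.0, 0.5, 0.0])
print(weights)  # modalities most aligned with the task query dominate
```

The per-task weights are also a useful interpretability hook: they show which data source drove a given patient representation.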

When trained on a diverse dataset of patients, a multimodal foundation model could potentially generate more accurate and personalized medication plans by considering the full range of patient data. In this way, the model can provide an in-depth understanding of the patient’s current health status, medication needs, and potential risk factors, leading to more effective and tailored pharmaceutical care.

当基于多样化的患者数据集进行训练时,多模态基础模型 (multimodal foundation model) 能够通过综合分析患者全维度数据,生成更精准且个性化的用药方案。这种方式使模型能深入理解患者当前健康状况、用药需求及潜在风险因素,从而提供更高效、定制化的药学服务。

6 Conclusion

6 结论

In conclusion, our exploration into the application of large language models in pharmacy, embodied by Pharmacy GPT, illuminates promising avenues for future development. Despite the evident challenges, we believe that leveraging LLMs can drastically enhance the accuracy, personalization, and efficiency of medication plan generation. The potential for improved patient outcomes and streamlined pharmaceutical operations is significant. As we continue to refine these models, a keen focus on the incorporation of expert knowledge, carefully designed training datasets, and the development of tailored metrics for model evaluation will be crucial. By continuing to intertwine AI with pharmacy, we stand at the cusp of a new era of healthcare in which AI augments the crucial human component, leading to comprehensive, high-quality patient care.

总之,我们对大语言模型在药学领域应用的探索(以Pharmacy GPT为例)为未来发展指明了前景广阔的路径。尽管存在明显挑战,但我们相信利用大语言模型能显著提升用药方案生成的准确性、个性化程度和效率。这项技术对改善患者预后和优化药学运营具有重大潜力。在持续优化模型的过程中,重点需要关注三大要素:专业知识的整合、精心设计的训练数据集,以及开发针对性的模型评估指标。通过持续推进人工智能与药学的深度融合,我们正站在医疗保健新时代的门槛——在这个时代,AI将增强人类医疗工作者的核心作用,共同实现全面、高质量的患者照护。

References

参考文献
