[Paper Translation] CPLLM: Clinical Prediction with Large Language Models


Original paper: https://arxiv.org/pdf/2309.11295


CPLLM: CLINICAL PREDICTION WITH LARGE LANGUAGE MODELS

Ofir Ben Shoham
Department of Software and Information Systems Engineering
Ben-Gurion University of the Negev
benshoho@post.bgu.ac.il

Nadav Rappoport
Department of Software and Information Systems Engineering
Ben-Gurion University of the Negev
nadavrap@bgu.ac.il

ABSTRACT

We present Clinical Prediction with Large Language Models (CPLLM), a method that fine-tunes a pre-trained Large Language Model (LLM) for clinical disease and readmission prediction. We used quantization and fine-tuned the LLM using prompts. For diagnosis prediction, we predict whether patients will be diagnosed with a target disease during their next visit or in the subsequent diagnosis, leveraging their historical diagnosis records. We compared our results to various baselines, including RETAIN and Med-BERT, the current state-of-the-art model for disease prediction using temporal structured EHR data. We also evaluated CPLLM for patient hospital readmission prediction and compared its performance with benchmark baselines. Our experiments show that CPLLM surpasses all the tested models in terms of PR-AUC and ROC-AUC, achieving state-of-the-art results for diagnosis prediction and patient hospital readmission prediction. Such a method can be easily implemented and integrated into the clinical process to help care providers estimate the next steps of patients.

1 INTRODUCTION

Large Language Models (LLMs) are a type of artificial intelligence (AI) that have been shown to be effective at a variety of Natural Language Processing tasks (Zhao et al., 2023). LLMs are trained on large amounts of textual data, which allows them to learn the statistical relationships between words and phrases. LLMs are used for different types of tasks, including natural language understanding, natural language generation, knowledge-intensive tasks, reasoning, and more (Yang et al., 2023b). This makes them well-suited for tasks that require understanding the meaning of a text, such as text classification (Gasparetto et al., 2022; Sun et al., 2023) and even clinical predictions in the medical domain (Thirunavukarasu et al., 2023; Steinberg et al., 2021).

Clinical predictions are used to estimate a patient's susceptibility to disease, gauge the likelihood of treatment response, or prognosticate the course of a patient's medical condition (Laupacis et al., 1997; Wasson et al., 1985). These predictions have been made with classical models such as Logistic Regression (Hosmer Jr et al., 2013) and Random Forest (Breiman, 2001). However, these traditional methods do not model the order of the medical concept events (diagnoses, procedures, medications, etc.). Instead, they rely solely on the absence or presence of these events (features).

Modern event order prediction models, which are more advanced than the traditional prediction models mentioned above, are based on RNNs or transformers, where the latter were shown to be superior (Vaswani et al., 2017). Examples include BERT-style models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2020). Another transformer-based architecture is the GPT-style language model. GPT models are trained to generate the next word in a sequence and are used in a wide range of downstream tasks such as summarization, translation, question answering, and more (Floridi & Chiriatti, 2020). To name a few GPT models: LLaMA (Touvron et al., 2023a;b), Falcon (Almazrouei et al., 2023), Bloom (Scao et al., 2022), and GPT-4 (OpenAI, 2023). The flexibility and versatility of decoder-only models seem to be advantageous (Yang et al., 2023b).

These language models are particularly relevant for handling sequential data, especially for clinical prediction models that rely on Electronic Health Record (EHR) data. Structured EHR data encompasses a patient's clinical history, notable for its irregular temporal sequence of events and observations (Steinberg et al., 2021). Previous works model EHR diagnosis data as a sequence using BERT models, such as BEHRT (Li et al., 2020; 2022a; Shoham & Rappoport, 2023; Meng et al., 2021), Med-BERT (Rasmy et al., 2021), and Medic-BERT (Hansen et al., 2023) (for length of stay prediction). However, such BERT-based models represent each diagnosis code as an index and do not address the textual description of the ICD code. These models are pre-trained using clinical data and have a limited input sequence length imposed by the BERT architecture.

There is limited research on using LLMs to train clinical prediction models. Applications of LLMs in the clinic have mainly focused on the chat capability of these models (Singhal et al., 2023; Thirunavukarasu et al., 2023) or on medical text-based tasks such as text generation (Lu et al., 2022; Agrawal et al., 2022) and text comprehension (Yang et al., 2022; Sivarajkumar & Wang, 2022; Li et al., 2022b; Jiang et al., 2023). In addition, Chen et al. (2023) proposed a method called ClinTaT for cancer prediction. Their focus was on cancer prognostic prediction using few-shot learning, and their data modeling was not designed for structured EHR data that consists of a sequence of diagnoses.

In contrast, we want to harness the power of LLMs in understanding sequences of tokens derived from structured EHR data, specifically to train prediction models. We represent the structured data as text: each medical concept corresponds to a word, admissions are treated as visits, and patient history is considered a document. The objectives of this study are to develop a novel method for using LLMs to train clinical predictors and to evaluate the performance of this method on real-world datasets.

Our proposed method fine-tunes an LLM to predict future diagnoses and readmission of patients. The medical concepts are represented by text descriptions. Fine-tuning is performed using a prompt that feeds the model with training samples. We used two different LLMs: Llama2, which is a general LLM (Touvron et al., 2023b), and BioMedLM, which was trained on biological and clinical text (Venigalla et al., 2022). We used four prediction tasks and two datasets and compared the performance to baseline models.

The proposed method outperforms the state-of-the-art methods. Our generic method can be used for a variety of tasks and is not specific to any particular LLM. Moreover, our method is also suitable for different clinical domains such as demographics, diagnoses, laboratory test results, measurements, procedures, and more.

Contributions: (1) We propose CPLLM, a novel method for clinical prediction with LLMs that outperforms state-of-the-art models for disease prediction and patient readmission prediction on structured EHR data. In addition, CPLLM doesn't require pre-training on clinical data and achieves better performance than alternative approaches. Moreover, our method has a longer sequence length limit compared to the baseline methods. (2) We show that adding additional tokens to the pre-trained tokenizer of the LLM before fine-tuning improves the performance of the clinical prediction model. (3) Our code is flexible for any LLM, available to use, and easily adaptable to various clinical prediction tasks.

2 METHODS

2.1 DISEASE PREDICTION - PROBLEM DEFINITION

Formally, for a given patient $p$, let $n$ denote the total number of diagnoses in their medical history. Thus, the patient's sequence of diagnoses is represented as $\{D_{p,1}, D_{p,2}, \ldots, D_{p,n}\}$, where each $D_{p,i}$ $(1\leq i\leq n)$ corresponds to a medical diagnosis in the patient's history. We considered two types of diagnosis prediction: next diagnosis and next visit diagnosis.

Next diagnosis prediction: Predict whether patient $p$ will be diagnosed with a specific disease $D_{x}$ as the diagnosis $D_{p,i+1}$, given the previous diagnoses. Our model relies on the patient's medical records up to the $i$-th diagnosis, denoted as $\{D_{p,1}, D_{p,2}, \ldots, D_{p,i}\}$, where $D_{p,i}$ $(1\leq i<n)$ indicates the most recent diagnosis observed for patient $p$. The predictive model uses this patient-specific historical medical information to determine whether patient $p$'s next diagnosis is the specific disease or not.

Next visit diagnosis prediction: Predicting the next diagnosis requires knowledge of the precise timing of each diagnosis. However, these data may occasionally be unavailable, such as when diagnoses are documented at the end of an admission. Consequently, for the MIMIC-IV dataset, we forecast whether a patient will receive a specific diagnosis in their subsequent admission.

2.2 PATIENT HOSPITAL READMISSION PREDICTION

Based on a patient's medical history, including procedures, diagnoses, and medications, our objective is to forecast whether the patient will experience hospital readmission within the next $X$ days. We follow the definition of $X$ as specified by the PyHealth benchmark (Yang et al., 2023a). In our experiments with the MIMIC-IV dataset, we predict hospital readmission within a 15-day window, and for the eICU-CRD dataset, the prediction time-frame is 5 days (see section 2.3).
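The labeling rule described above can be sketched as follows. This is a minimal illustration, not the PyHealth implementation: the function name `readmission_label`, measuring the gap from discharge to the next admission, and treating the window boundary as inclusive are all assumptions made here, and the benchmark may define the gap differently.

```python
from datetime import datetime, timedelta

def readmission_label(discharge_time, next_admission_time, window_days):
    """Return 1 if the next admission falls within `window_days`, else 0.

    Assumption: the gap is measured from discharge to the next admission
    and the boundary is inclusive; the benchmark may differ on both points.
    """
    if next_admission_time is None:
        # No later admission on record: treated as not readmitted.
        return 0
    gap = next_admission_time - discharge_time
    return int(timedelta(0) <= gap <= timedelta(days=window_days))

discharge = datetime(2020, 1, 1)
label_15_days = readmission_label(discharge, datetime(2020, 1, 10), 15)  # 15-day window
label_5_days = readmission_label(discharge, datetime(2020, 1, 10), 5)    # 5-day window
```

The same 9-day gap yields a positive label under the 15-day window and a negative label under the 5-day window, which is why the window $X$ has to be fixed per dataset.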

2.3 DATA

In this study, we used data from the eICU-CRD database (Pollard et al., 2018) and data from the MIMIC-IV database (Johnson et al., 2020). Our datasets include ICD-9-CM (eICU-CRD) and ICD-10-CM (MIMIC-IV) diagnoses and their descriptions. In the eICU-CRD database, each diagnosis is associated with a timestamp. Consequently, we arranged the diagnoses in chronological order based on their respective diagnosis times. Our disease prediction task for this dataset aims to anticipate whether the forthcoming diagnosis will correspond to a specific disease. Unlike the eICU-CRD dataset, the MIMIC-IV data lacks information on the exact time of each diagnosis assignment. However, it provides the admission start time and the discharge time for each patient. As a result, our prediction task for this dataset is to determine whether a patient will be diagnosed with a specific disease during their subsequent visit.

Med-BERT adopts a pre-training strategy and trains BERT using Masked Language Modeling (MLM) and Length of stay (LOS) prediction tasks (Rasmy et al., 2021). Therefore, we extracted the necessary data from the databases, including the diagnosis codes for each patient, the LOS of each admission, and the number of visits of each patient. In our approach, on the other hand, we did not conduct an additional pre-training step, as we focused on fine-tuning an LLM. Our proposed method does not need to know at which visit each diagnosis was given, nor does it need the duration of the hospital stay. Notably, our method attains superior results even in the absence of these particulars. This matters because in certain situations these data may not be accessible, for example when a patient has not been admitted to the hospital but is under the care of a family doctor.

Data Preprocessing: For readmission prediction, we follow PyHealth's data preprocessing methodology. We include drugs, procedures, and diagnosis codes alongside their respective descriptions. Additionally, we incorporate both ICD-9 and ICD-10 codes and convert them to Clinical Classifications Software (CCS) codes (Elixhauser, 2009). For drugs, we convert the codes to ATC codes (Nahler & Nahler, 2009). For procedures, we include ICD-9 and ICD-10 procedure codes and convert them to CCS codes using PyHealth.

For diagnosis prediction on the MIMIC-IV dataset, we excluded patients with only one visit, as there is no medical history in such a case. Similarly, for the eICU-CRD dataset, patients with just one diagnosis were removed. We also excluded patients who have the disease we are trying to predict at the first visit (or the first diagnosis for eICU-CRD data).

We converted our ICD-10 codes to their corresponding CCS categories for MIMIC-IV, while for eICU-CRD, we retained the ICD-9 codes as they were. This decision was motivated by the higher number of ICD-10 codes compared to ICD-9 codes (Manchikanti et al., 2013). Based on the sequence of diagnoses for each patient, we determined whether the patient exhibited a specific diagnosis based on ICD diagnosis codes related to the specific disease according to the relevant CCS category (Elixhauser et al., 2014).

Table 1 provides an overview of the number of patients, the count of final patients after preprocessing, average diagnoses, and average visits for each disease prediction task.

2.3.1 CLINICAL OUTCOMES

We evaluated our model on four prediction tasks: patient hospital readmission prediction and three diagnosis predictions covering Chronic kidney disease, Acute and unspecified renal failure, and Adult respiratory failure. The first two diagnoses were derived from the MIMIC-IV dataset, and the last was derived from the eICU-CRD dataset. The corresponding CCS codes for these diseases are 157 for Acute and unspecified renal failure, 158 for Chronic kidney disease, and 131 for Adult respiratory failure. For each prediction task, patients with specific disease ICD codes were assigned a positive label, and their diagnosis history encompassed all diagnostic codes recorded up to, but not including, the specific code that indicated the outcome of interest.
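The labeling and exclusion rules can be sketched as below for a flat, chronologically ordered diagnosis sequence. The helper name `make_sample` and the placeholder codes are hypothetical. The text does not say how the history of a negative patient is formed, so holding out the last diagnosis as the non-target next event is an assumption of this sketch.

```python
def make_sample(diagnoses, is_target):
    """Build a (history, label) pair from an ordered list of diagnosis codes.

    Returns None when the patient is excluded: only one diagnosis on record,
    or the target disease already present at the first diagnosis.
    """
    if len(diagnoses) < 2:
        return None  # no medical history to condition on
    for position, code in enumerate(diagnoses):
        if is_target(code):
            if position == 0:
                return None  # target disease at the first diagnosis
            # Positive: history is everything recorded before the target code.
            return diagnoses[:position], 1
    # Negative (assumption): hold out the last diagnosis as the next event.
    return diagnoses[:-1], 0

# Placeholder codes; "T" stands in for any code of the target disease.
is_target_code = lambda code: code == "T"
positive_sample = make_sample(["A", "B", "T", "C"], is_target_code)
negative_sample = make_sample(["A", "B", "C"], is_target_code)
excluded_sample = make_sample(["T", "A"], is_target_code)
```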

Table 1: Task statistics of the prediction tasks. Visit and diagnosis counts are calculated from the patient's medical history after preprocessing. IQR - Interquartile range.

| Dataset | Task | # of patients | Final # of patients | Median # of visits (IQR) | Median # of diagnoses (IQR) |
| --- | --- | --- | --- | --- | --- |
| MIMIC-IV | Chronic kidney disease | 84,453 | 26,161 | 1 (1-2) | 11 (7-19) |
| MIMIC-IV | Acute and unspecified renal failure | 84,453 | 26,736 | 1 (1-2) | 11 (7-19) |
| eICU-CRD | Adult respiratory failure | 132,677 | 56,419 | 1 (1-1) | 1 (1-2) |

2.4 BASELINE METHODS

We conducted a rigorous performance assessment of CPLLM against three baseline methods. For the diagnosis prediction tasks, we used the following baseline models. First, Med-BERT with a classification layer (Rasmy et al., 2021). Second, Logistic Regression (Hosmer Jr et al., 2013). Third, RETAIN, a disease prediction model featuring double GRUs and attention modules (Choi et al., 2016). We compared CPLLM with these baseline methods to gain insight into its performance in clinical prediction downstream tasks. The comparison was conducted using two metrics: the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (ROC-AUC). Disease prediction tasks are typically imbalanced, and ROC-AUC is less suitable for binary classifiers with imbalanced data (Davis & Goadrich, 2006). Therefore, our main evaluation metric is PR-AUC, but we also report ROC-AUC for consistency with the baseline methods. For readmission prediction, as mentioned earlier, we compared CPLLM with PyHealth baselines. The models we compared with include ConCare (Ma et al., 2020), RETAIN (Choi et al., 2016), Deepr (Nguyen et al., 2016), and GRASP (Zhang et al., 2021).
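To illustrate why PR-AUC is the stricter metric on imbalanced data, the sketch below computes both metrics in plain Python on a toy ranking with 2 positives among 10 samples. The scores are invented for illustration and are not from the paper. Average precision is used here as a common estimate of PR-AUC, which is an assumption about how the area is computed.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(positives) * len(negatives))

def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

# Toy data: the positives are ranked 1st and 4th out of 10 samples.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_roc_auc = roc_auc(toy_labels, toy_scores)       # 14 of 16 pairs correct
toy_pr_auc = average_precision(toy_labels, toy_scores)  # (1/1 + 2/4) / 2
```

On this ranking the ROC-AUC is 0.875 while the average precision is 0.75: the two negatives ranked above the second positive barely affect ROC-AUC but halve the precision at that point.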

2.5 OUR PROPOSED METHOD

We propose a method called Clinical Prediction with Large Language Models (CPLLM). This method involves fine-tuning an LLM using prompts tailored to medical concept sequences. Through fine-tuning using prompts (inputs for LLM guidance), we direct the LLM to grasp intricate relationships among medical concepts.

We used two LLMs: Llama2 (13B parameters) (Touvron et al., 2023b) and BioMedLM (also called PubMedGPT, 2.7B parameters) (Venigalla et al., 2022). To enhance the time and memory efficiency of fine-tuning these LLMs, we used QLoRA (Dettmers et al., 2023) and PEFT (Houlsby et al., 2019). QLoRA is a PEFT approach that decreases the number of parameters requiring fine-tuning and also performs quantization (Dettmers et al., 2023). This combined approach enabled single-GPU fine-tuning for both the BioMedLM and Llama2 models.

We fine-tuned each LLM separately, using prompts tailored to our patients' medical codes and their corresponding labels. Figure 1 presents an example of the prompts used during the fine-tuning process for both Llama2 and BioMedLM. The prompt also states the target disease, and the prompts were designed to incorporate the patients' individual medical code histories, with the goal of improving the models' performance. For readmission prediction, the prompt was very similar, but it additionally included drugs and procedures. For diagnosis prediction tasks, we added tokens of diagnosis descriptions missing from the original tokenizer vocabulary of the LLM. We performed an ablation study that compared the performance with and without changing the vocabulary of the pre-trained tokenizer.


Figure 1: Illustration of the fine-tuning process for diagnosis prediction. (A) An example of EHR structured data. The patient has three diagnoses. (B) Patient's historical data is extracted from the EHR and decoded to a textual list of descriptions. (C) The decoded textual data is then injected into a designed prompt for fine-tuning the LLM. Fine-tuning prompts consist of a general description, the patient's diagnosis history, and a label. The label is set to 1 when the patient is diagnosed with the outcome of interest (e.g., Adult Respiratory Failure) in the subsequent diagnosis or during the next admission, depending on the task.
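The prompt structure in the caption (general description, diagnosis history, label) can be sketched as follows. The exact wording of the authors' prompt is not reproduced in this text, so the template string, the function name `build_training_example`, and the sample descriptions are hypothetical.

```python
def build_training_example(target_disease, diagnosis_descriptions, label):
    """Assemble one fine-tuning example: prompt text plus a binary label.

    Hypothetical template: a general description that names the target
    disease, followed by the patient's diagnosis history as text.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    history = "\n".join(f"- {description}" for description in diagnosis_descriptions)
    text = (
        "The following is a list of diagnoses for a patient. "
        f"Predict whether the patient will be diagnosed with {target_disease}.\n"
        f"Diagnoses:\n{history}"
    )
    return {"text": text, "label": label}

example = build_training_example(
    "Adult respiratory failure",
    ["Essential hypertension", "Pneumonia", "Acute bronchitis"],  # illustrative history
    1,
)
```

Keeping the label outside the prompt text matches a setup where a classification layer produces the binary output, which is how the fine-tuning is described in the text.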

For the clinical prediction downstream task, we performed fine-tuning as depicted in Figure 1. We used prompts to ask the LLMs to produce a single binary output (1 or 0), which we implemented by adding a classification layer whose size corresponds to the number of labels. By training the models with all patients' data for the specified number of epochs, we obtained the fine-tuned LLM tailored to our specific clinical prediction task.

3 EXPERIMENTS

3.1 EXPERIMENTAL SETUP

For readmission prediction, we compared our method to the PyHealth benchmark. For the diagnosis prediction tasks, we compared our method to three baseline models. The first is a simple Logistic Regression that does not model the data as a sequence but as simple independent, unordered variables (Manogaran & Lopez, 2018). The second is RETAIN, which is a two-level neural attention model (Choi et al., 2016). The third baseline is Med-BERT, which is the state-of-the-art for disease prediction on structured EHR data. RETAIN was the baseline of Med-BERT. We split our data using a 70-10-20 ratio for the train, validation, and test sets, respectively.

For Med-BERT, we trained the pre-training model with the MLM and LOS tasks, using the TensorFlow package (Abadi et al., 2015). The training of Med-BERT's MLM phase was performed according to the fixed number of steps in the original implementation. The training took about 1.5 days on an RTX1080 GPU. Subsequently, we fine-tuned the pre-trained model for the specific clinical prediction downstream tasks. The RETAIN and Med-BERT baselines were trained for up to 500 epochs with early stopping based on the PR-AUC value derived from the validation set, stopping after 5 epochs without improvement (Prechelt, 2002).

During the training of the baselines, we experimented with batch sizes $\{32, 100\}$ and learning rates $\{1e^{-5}, 2e^{-5}\}$. For each prediction task, we selected the hyper-parameters that achieved the best results on the validation set. For Logistic Regression training, we used the scikit-learn package (Pedregosa et al., 2011) and trained the model on a CPU. To determine the optimal hyper-parameters for Logistic Regression, we conducted a grid search over the penalty (L1 and L2 regularization), $C$, the solver, and the maximum number of iterations. We explored values of $\{0.1, 1, 10\}$ for $C$, $\{$'liblinear', 'saga'$\}$ for the solver, and $\{100, 200, 500\}$ for the number of iterations. We took the best hyper-parameters based on the validation PR-AUC for each prediction task.
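The early-stopping rule used for the RETAIN and Med-BERT baselines (stop after 5 epochs without improvement in validation PR-AUC) can be sketched as below. The function name and the per-epoch scores are illustrative assumptions. The actual training loop would compute each score after an epoch rather than receive the whole list up front.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return (best_epoch, last_epoch_run) for a list of validation scores.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation score seen so far. Epochs are 0-indexed.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# Illustrative validation PR-AUC per epoch (made-up values).
illustrative_scores = [0.30, 0.32, 0.31, 0.31, 0.30, 0.29, 0.28, 0.40]
best_epoch, stopped_at = early_stopping_epoch(illustrative_scores, patience=5)
```

In this example the best checkpoint is epoch 1 and training halts at epoch 6, so the later score of 0.40 is never reached. That is the usual trade-off of a small patience value.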

For CPLLM experiments, we fine-tuned two LLMs, Llama2 (13B) and BioMedLM (2.7B), using Hugging Face (Wolf et al., 2019) with QLoRA (Dettmers et al., 2023). Specifically, we used a learning rate of $2e^{-5}$, LoRA alpha of 32, LoRA dropout of 0.1, and bias of none. Given the resource constraints, we used the maximum batch size that our GPU memory could accommodate. We fine-tuned each model for six epochs (four epochs for readmission, due to the larger dataset), selecting the best checkpoint based on validation PR-AUC. Fine-tuning Llama2 for six epochs required about a day of training on an RTX 6000 GPU, while BioMedLM took about two hours on the same hardware. Our fine-tuning process used PEFT, and we didn't perform additional pre-training in the clinical domain, yet our CPLLM method outperformed the baseline models.
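A configuration along these lines could be expressed with the Hugging Face `peft` and `transformers` libraries as sketched below. Only the learning rate, LoRA alpha, LoRA dropout, and bias come from the text. The LoRA rank, the 4-bit NF4 quantization settings, and the sequence-classification task type are assumptions of this sketch, and the authors' actual code may differ.

```python
# Hyper-parameters stated in the text.
CPLLM_HPARAMS = {
    "learning_rate": 2e-5,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "bias": "none",
}

def build_qlora_configs(lora_rank=8):
    """Return (quantization_config, lora_config).

    `lora_rank` is an assumed value; the rank is not given in the text.
    Libraries are imported lazily so the constants above stay usable
    without a GPU stack installed.
    """
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    # Assumption: 4-bit NF4 quantization, as commonly used with QLoRA.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=CPLLM_HPARAMS["lora_alpha"],
        lora_dropout=CPLLM_HPARAMS["lora_dropout"],
        bias=CPLLM_HPARAMS["bias"],
        task_type="SEQ_CLS",  # assumption: binary classification head
    )
    return quantization_config, lora_config
```

The quantization config would typically be passed when loading the base model and the LoRA config when wrapping it with adapters. The learning rate belongs to the optimizer or trainer settings rather than to either config.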

3.2 RESULTS

3.2.1 DIAGNOSIS PREDICTION RESULTS

We consider various models for the clinical prediction task: Logistic Regression, Med-BERT with a classification layer, RETAIN, and our proposed method, CPLLM. To examine the statistical significance of the results, we ran each model three times. Table 2 shows the mean and 95% confidence interval of the PR-AUC and ROC-AUC of these models.
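The text does not say how the 95% confidence interval was computed from three runs. One common choice for so few runs is a Student's t interval around the mean, sketched below as an assumption. The critical values are the standard two-sided 95% values for 2 and 4 degrees of freedom, and the run scores are made up for illustration.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRITICAL_95 = {2: 4.303, 4: 2.776}

def mean_and_ci_half_width(run_scores):
    """Return (mean, half-width of the 95% t interval) over repeated runs."""
    n = len(run_scores)
    degrees_of_freedom = n - 1
    if degrees_of_freedom not in T_CRITICAL_95:
        raise ValueError("add the t critical value for this number of runs")
    mean = statistics.mean(run_scores)
    sample_std = statistics.stdev(run_scores)  # n - 1 in the denominator
    half_width = T_CRITICAL_95[degrees_of_freedom] * sample_std / math.sqrt(n)
    return mean, half_width

# Illustrative PR-AUC values from three hypothetical runs.
illustrative_runs = [0.35, 0.36, 0.37]
run_mean, run_half_width = mean_and_ci_half_width(illustrative_runs)
```

With only three runs the critical value is large (4.303), so even a small spread between runs produces a wide interval. Adding runs tightens the estimate noticeably.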

Our findings demonstrate that our method, CPLLM, outperforms all tested models, including RETAIN, Med-BERT, and Logistic Regression, across both PR-AUC and ROC-AUC metrics. Specifically, for the Adult respiratory failure task, CPLLM-Llama2 achieved a PR-AUC of 35.962%, an absolute improvement of 0.912% over the best-performing baseline model, Logistic Regression, which obtained a PR-AUC of 35.05%. This corresponds to a relative improvement of 2.6% in PR-AUC. Additionally, our method exhibits a relative increase of 5.1% in PR-AUC compared to RETAIN and a 3.31% increase compared to Med-BERT. CPLLM also outperforms the baseline models in ROC-AUC. Furthermore, CPLLM-Llama2 performs better on this task than CPLLM-BioMedLM. Among the baselines, Logistic Regression outperforms RETAIN in both PR-AUC (35.05%) and ROC-AUC (74.664%). It also outperforms Med-BERT in PR-AUC, although not in ROC-AUC (74.664% compared to 75.407% for Med-BERT).
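As a check on the arithmetic, the absolute and relative PR-AUC gains of CPLLM-Llama2 over Logistic Regression on the Adult respiratory failure task follow directly from the two reported scores (the symbols $\Delta_{\text{abs}}$ and $\Delta_{\text{rel}}$ are introduced here only for illustration):

$$\Delta_{\text{abs}} = 35.962\% - 35.05\% = 0.912\%, \qquad \Delta_{\text{rel}} = \frac{0.912}{35.05} \approx 0.026 = 2.6\%$$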

For Chronic kidney disease using the MIMIC-IV dataset, RETAIN had the worst performance in both metrics. Med-BERT outperformed Logistic Regression and RETAIN. CPLLM-Llama2 had the highest PR-AUC score of 33.992%, followed by CPLLM-BioMedLM with 33.984% and Med-BERT with 33.37%. However, in ROC-AUC, CPLLM-BioMedLM outperformed all models with a score of 83.404%, followed by Med-BERT with 83.12% and CPLLM-Llama2 with 83.034%.

For Acute and unspecified renal failure, CPLLM-Llama2 achieved the highest scores, with a PR-AUC of 45.442% and an ROC-AUC of 78.504%. This is an improvement of 4.22% in PR-AUC over the leading baseline model for this task, RETAIN. It is also a 1.31% improvement in ROC-AUC over the best-performing baseline, Logistic Regression, which has an ROC-AUC of 77.486%. In this task, RETAIN outperforms Med-BERT in terms of PR-AUC but not ROC-AUC. Because CPLLM-Llama2 outperformed CPLLM-BioMedLM, the rest of the analysis is based on CPLLM-Llama2.

3.2.2 HOSPITAL READMISSION PREDICTION RESULTS

3.2.2 医院再入院预测结果

To demonstrate the robustness of CPLLM, we extended our analysis beyond diagnoses to include procedures and drugs, and compared CPLLM against several baseline methods from the PyHealth benchmark. Table 3 presents the results for hospital readmission prediction. On MIMIC-IV, CPLLM with Llama2-13B achieved a PR-AUC of $68.986\%$, outperforming ConCare, the second-best model, by $1.46$ percentage points (absolute). On eICU-CRD, CPLLM achieved a PR-AUC of $94.115\%$, higher than every baseline. CPLLM also achieved the highest ROC-AUC on both datasets.

为了验证CPLLM的鲁棒性,我们将分析范围从诊断扩展到包含治疗程序和药物。在PyHealth基准测试中,我们将CPLLM与多种基线方法进行了对比。表3展示了患者再入院预测的结果。在MIMIC-IV数据集上,采用Llama2-13B的CPLLM取得了$68.986\%$的PR-AUC值,以$1.46$个百分点(绝对值)的优势超越第二名模型ConCare。在eICU-CRD数据集上,CPLLM取得了$94.115\%$的PR-AUC值,高于所有基线模型。此外,CPLLM在两个数据集上的ROC-AUC指标均达到最高值。

3.3 ABLATION STUDY

3.3 消融实验

We conducted an ablation study to investigate the impact of adding tokens to the pre-trained tokenizer of the LLMs before fine-tuning. Table 4 reports PR-AUC and ROC-AUC with and without the extra tokens. For Acute and unspecified renal failure, adding the tokens improves both metrics for CPLLM-Llama2 (an absolute increase of $0.499$ percentage points in PR-AUC and $0.554$ in ROC-AUC). CPLLM-BioMedLM improves more: an absolute increase of $1.631$ percentage points in PR-AUC, a relative improvement of $3.746\%$, and an absolute increase of $0.414$ in ROC-AUC. For Chronic kidney disease, by contrast, the extra tokens do not significantly affect PR-AUC or ROC-AUC for CPLLM-Llama2, whereas CPLLM-BioMedLM improves by $0.686$ percentage points (absolute) in ROC-AUC and from $32.638\%$ to $33.984\%$ in PR-AUC. The PR-AUC of BioMedLM is less stable without the additional tokens, as shown by its larger confidence interval ($4.358\%$). We therefore conducted two additional runs to obtain a better estimate; over the resulting five runs, the mean PR-AUC was $33.078\%$ and the confidence interval narrowed to $1.773\%$. For Adult respiratory failure, the additional tokens improve both PR-AUC and ROC-AUC for CPLLM-Llama2, while for CPLLM-BioMedLM they improve PR-AUC but do not affect ROC-AUC. In summary, in the majority of cases (9 out of 12 measurements across the three prediction tasks), adding the tokens improves performance on clinical prediction tasks.

我们进行了一项消融研究,以探究在微调前向大语言模型的预训练分词器添加token的影响。表4全面对比了添加额外token前后的PR-AUC和ROC-AUC指标。在预测急性未特指肾衰竭任务中,添加token使CPLLM-Llama2的PR-AUC绝对提升0.499%,ROC-AUC绝对提升0.554%。同样,CPLLM-BioMedLM表现出显著改进:PR-AUC绝对提升1.631%(相对提升3.746%),ROC-AUC绝对提升0.414%。相比之下,在预测慢性肾病时,添加额外token对CPLLM-Llama2的PR-AUC和ROC-AUC影响不大,但CPLLM-BioMedLM的ROC-AUC绝对提升0.686%,PR-AUC从32.638%升至33.984%。值得注意的是,未使用额外token时BioMedLM的PR-AUC稳定性较差(置信区间达4.358%)。为此我们追加两次实验,最终五次实验的PR-AUC均值为33.078%,置信区间降至1.773%。在成人呼吸衰竭预测任务中,额外token使CPLLM-Llama2的PR-AUC和ROC-AUC均获提升,而CPLLM-BioMedLM仅PR-AUC有所改善。总体而言,消融研究表明:在三个预测任务的12项指标测量中,有9项(75%)案例显示添加token能提升临床预测性能。
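
The "9 out of 12" tally can be partly re-derived from the with/without-token values quoted in this section. The sketch below is an illustration under stated assumptions: the dictionary `TABLE4` and the helper `tally` are hypothetical names, and the CPLLM-BioMedLM row for Adult respiratory failure is left out because its without-token values are not given numerically here.

```python
# (task, model) -> ((PR-AUC with, PR-AUC without), (ROC-AUC with, ROC-AUC without))
# Values are the means reported for the ablation study, in percent.
TABLE4 = {
    ("Acute and unspecified renal failure", "CPLLM-Llama2"): ((45.442, 44.943), (78.504, 77.95)),
    ("Acute and unspecified renal failure", "CPLLM-BioMedLM"): ((45.161, 43.53), (78.484, 78.07)),
    ("Chronic kidney disease", "CPLLM-Llama2"): ((33.992, 34.563), (83.034, 83.178)),
    ("Chronic kidney disease", "CPLLM-BioMedLM"): ((33.984, 32.638), (83.404, 82.718)),
    ("Adult respiratory failure", "CPLLM-Llama2"): ((35.962, 35.683), (76.407, 75.776)),
}


def tally(table):
    """Count measurements whose mean is higher with added tokens."""
    improved = total = 0
    for metric_pairs in table.values():
        for with_tokens, without_tokens in metric_pairs:
            total += 1
            improved += with_tokens > without_tokens
    return improved, total


improved, total = tally(TABLE4)
print(f"{improved} of {total} measurements improve with added tokens")
```

The omitted row contributes, according to the text, one improvement (PR-AUC) and one unchanged measurement (ROC-AUC), which brings the count to the reported 9 of 12. The comparison uses means only and ignores the confidence intervals.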

Table 2: Performances of various models assessed across multiple tasks and datasets. The highest score per task is highlighted in bold.

| Task | Model | PR-AUC | ROC-AUC |
|---|---|---|---|
| Adult respiratory failure | Logistic Regression | 35.050 | 74.664 |
| | RETAIN | 34.22 ± 0.299 | 74.454 ± 0.173 |
| | Med-BERT | 34.81 ± 0.208 | 75.407 ± 0.073 |
| | CPLLM-Llama2 | **35.962 ± 0.380** | **76.407 ± 0.262** |
| | CPLLM-BioMedLM | 35.494 ± 0.352 | 75.975 ± 0.214 |
| Chronic kidney disease | Logistic Regression | 32.230 | 83.016 |
| | RETAIN | 31.407 ± 1.379 | 81.692 ± 0.899 |
| | Med-BERT | 33.37 ± 0.891 | 83.12 ± 0.173 |
| | CPLLM-Llama2 | **33.992 ± 1.262** | 83.034 ± 0.511 |
| | CPLLM-BioMedLM | 33.984 ± 1.077 | **83.404 ± 0.429** |
| Acute and unspecified renal failure | Logistic Regression | 42.075 | 77.486 |
| | RETAIN | 43.603 ± 0.409 | 77.364 ± 0.394 |
| | Med-BERT | 42.237 ± 0.408 | 77.427 ± 0.185 |
| | CPLLM-Llama2 | **45.442 ± 0.839** | **78.504 ± 0.684** |
| | CPLLM-BioMedLM | 45.161 ± 1.622 | 78.484 ± 0.403 |

表 2: 各模型在多项任务和数据集上的性能评估。每项任务的最高分以粗体标出。

| 任务 | 模型 | PR-AUC | ROC-AUC |
|---|---|---|---|
| 成人呼吸衰竭 | Logistic Regression | 35.050 | 74.664 |
| | RETAIN | 34.22 ± 0.299 | 74.454 ± 0.173 |
| | Med-BERT | 34.81 ± 0.208 | 75.407 ± 0.073 |
| | CPLLM-Llama2 | **35.962 ± 0.380** | **76.407 ± 0.262** |
| | CPLLM-BioMedLM | 35.494 ± 0.352 | 75.975 ± 0.214 |
| 慢性肾病 | Logistic Regression | 32.230 | 83.016 |
| | RETAIN | 31.407 ± 1.379 | 81.692 ± 0.899 |
| | Med-BERT | 33.37 ± 0.891 | 83.12 ± 0.173 |
| | CPLLM-Llama2 | **33.992 ± 1.262** | 83.034 ± 0.511 |
| | CPLLM-BioMedLM | 33.984 ± 1.077 | **83.404 ± 0.429** |
| 急性未特指肾衰竭 | Logistic Regression | 42.075 | 77.486 |
| | RETAIN | 43.603 ± 0.409 | 77.364 ± 0.394 |
| | Med-BERT | 42.237 ± 0.408 | 77.427 ± 0.185 |
| | CPLLM-Llama2 | **45.442 ± 0.839** | **78.504 ± 0.684** |
| | CPLLM-BioMedLM | 45.161 ± 1.622 | 78.484 ± 0.403 |

Table 3: PR-AUC and ROC-AUC for the hospital readmission prediction task on the MIMIC-IV and eICU-CRD datasets.

| Dataset | Model | PR-AUC | ROC-AUC |
|---|---|---|---|
| MIMIC-IV | CPLLM-Llama2 | 68.986 ± 0.499 | 68.155 ± 0.38 |
| | ConCare | 67.523 ± 0.697 | 67.242 ± 0.269 |
| | RETAIN | 67.343 ± 0.558 | 66.893 ± 0.421 |
| | Deepr | 66.891 ± 0.604 | 66.575 ± 0.371 |
| | GRASP | 65.656 ± 2.929 | 65.302 ± 3.369 |
| eICU-CRD | CPLLM-Llama2 | 94.115 ± 0.704 | 77.916 ± 1.026 |
| | ConCare | 93.429 ± 0.733 | 77.024 ± 1.156 |
| | RETAIN | 93.615 ± 0.340 | 77.149 ± 1.048 |
| | Deepr | 93.814 ± 0.422 | 77.814 ± 0.385 |
| | GRASP | 93.677 ± 1.824 | 77.515 ± 3.899 |

表 3: MIMIC-IV和eICU-CRD数据集在再入院预测任务中的PR-AUC和ROC-AUC性能表现。

| 数据集 | 模型 | PR-AUC | ROC-AUC |
|---|---|---|---|
| MIMIC-IV | CPLLM-Llama2 | 68.986 ± 0.499 | 68.155 ± 0.38 |
| | ConCare | 67.523 ± 0.697 | 67.242 ± 0.269 |
| | RETAIN | 67.343 ± 0.558 | 66.893 ± 0.421 |
| | Deepr | 66.891 ± 0.604 | 66.575 ± 0.371 |
| | GRASP | 65.656 ± 2.929 | 65.302 ± 3.369 |
| eICU-CRD | CPLLM-Llama2 | 94.115 ± 0.704 | 77.916 ± 1.026 |
| | ConCare | 93.429 ± 0.733 | 77.024 ± 1.156 |
| | RETAIN | 93.615 ± 0.340 | 77.149 ± 1.048 |
| | Deepr | 93.814 ± 0.422 | 77.814 ± 0.385 |
| | GRASP | 93.677 ± 1.824 | 77.515 ± 3.899 |

4 DISCUSSION

4 讨论

Our proposed CPLLM method outperformed the baselines on all four tasks (three diagnosis prediction tasks and hospital readmission prediction) across two different datasets. We used the MIMIC-IV and eICU-CRD datasets to assess the model's ability to handle two diagnosis coding systems (ICD9 and ICD10) and two data types (homogeneous data from a single hospital in MIMIC-IV and multi-center data in eICU-CRD). CPLLM-Llama2 was the best model overall; CPLLM-BioMedLM outperformed it only for Chronic kidney disease, and only in ROC-AUC. Using CPLLM-Llama2, we achieved PR-AUC relative improvements of $3.309\%$, $1.864\%$, and $7.588\%$ over Med-BERT on the three tasks, and ROC-AUC relative improvements of $1.326\%$ and $1.391\%$ on the Adult respiratory failure and Acute and unspecified renal failure prediction tasks. For hospital readmission prediction on MIMIC-IV, CPLLM achieved a relative PR-AUC improvement of $2.17\%$ over ConCare. For readmission prediction on eICU-CRD, CPLLM showed a relative improvement of $0.31\%$ over the second-best result, Deepr.

我们提出的CPLLM方法在两个不同数据集的所有四项任务(3项诊断预测和再入院预测)上均优于基线模型。使用MIMIC-IV和eICU-CRD数据集评估模型处理两种诊断编码系统(ICD9和ICD10)及两种数据类型(MIMIC-IV中来自同一医院的同质数据和eICU-CRD中的多中心数据)的能力。CPLLM在所有基线模型中表现最优。CPLLM-Llama2是整体最佳模型,仅在慢性肾脏病预测任务中CPLLM-BioMedLM的ROC-AUC指标优于CPLLM-Llama2。使用CPLLM-Llama2时,我们在三项任务上相比Med-BERT实现了PR-AUC相对提升$3.309%$、$1.864%$和$7.588%$,在成人呼吸衰竭及急性和未特指肾功能衰竭预测任务上ROC-AUC相对提升分别为$1.326%$和$1.391%$。对于MIMIC-IV的医院再入院预测,CPLLM相比ConCare在PR-AUC上实现了$2.17%$的相对提升。在eICU-CRD再入院预测任务中,CPLLM相比次优模型deeper实现了$0.31%$的相对提升。

Table 4: PR-AUC and ROC-AUC for CPLLM-Llama2 and CPLLM-BioMedLM across three distinct medical tasks. The Added Tokens column indicates whether additional tokens were incorporated into the pre-trained tokenizer: "+" means they were added and "-" means they were not.

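| Task | Model | Added Tokens | PR-AUC | ROC-AUC |
|---|---|---|---|---|
| Acute and unspecified renal failure | CPLLM-Llama2 | + | 45.442 ± 0.839 | 78.504 ± 0.684 |
| | CPLLM-Llama2 | - | 44.943 ± 1.268 | 77.95 ± 0.814 |
| | CPLLM-BioMedLM | + | 45.161 ± 1.622 | 78.484 ± 0.403 |
| | CPLLM-BioMedLM | - | 43.53 ± 1.101 | 78.07 ± 0.625 |
| Chronic kidney disease | CPLLM-Llama2 | + | 33.992 ± 1.262 | 83.034 ± 0.511 |
| | CPLLM-Llama2 | - | 34.563 ± 1.578 | 83.178 ± 1.02 |
| | CPLLM-BioMedLM | + | 33.984 ± 1.077 | 83.404 ± 0.429 |
| | CPLLM-BioMedLM | - | 32.638 ± 4.358 | 82.718 ± 1.191 |
| Adult respiratory failure | CPLLM-Llama2 | + | 35.962 ± 0.38 | 76.407 ± 0.262 |
| | CPLLM-Llama2 | - | 35.683 ± 0.164 | 75.776 ± 0.085 |
| | CPLLM-BioMedLM | + | 35.494 ± 0.352 | (missing) |
| | CPLLM-BioMedLM | - | (missing) | (missing) |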

表 4: CPLLM-Llama2 和 CPLLM-BioMedLM 在三种不同医疗任务中的 PR-AUC 和 ROC-AUC 表现。新增 Token 列表示是否在预训练分词器中加入了额外 Token。"+" 和 "-" 分别表示已添加或未添加额外 Token。

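| 任务 | 模型 | 新增 Token | PR-AUC | ROC-AUC |
|---|---|---|---|---|
| 急性和未特指肾功能衰竭 | CPLLM-Llama2 | + | 45.442 ± 0.839 | 78.504 ± 0.684 |
| | CPLLM-Llama2 | - | 44.943 ± 1.268 | 77.95 ± 0.814 |
| | CPLLM-BioMedLM | + | 45.161 ± 1.622 | 78.484 ± 0.403 |
| | CPLLM-BioMedLM | - | 43.53 ± 1.101 | 78.07 ± 0.625 |
| 慢性肾脏病 | CPLLM-Llama2 | + | 33.992 ± 1.262 | 83.034 ± 0.511 |
| | CPLLM-Llama2 | - | 34.563 ± 1.578 | 83.178 ± 1.02 |
| | CPLLM-BioMedLM | + | 33.984 ± 1.077 | 83.404 ± 0.429 |
| | CPLLM-BioMedLM | - | 32.638 ± 4.358 | 82.718 ± 1.191 |
| 成人呼吸衰竭 | CPLLM-Llama2 | + | 35.962 ± 0.38 | 76.407 ± 0.262 |
| | CPLLM-Llama2 | - | 35.683 ± 0.164 | 75.776 ± 0.085 |
| | CPLLM-BioMedLM | + | 35.494 ± 0.352 | (缺失) |
| | CPLLM-BioMedLM | - | (缺失) | (缺失) |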

Unlike existing approaches that require pre-training on medical concept sequences, our method needs no additional pre-training tasks. Med-BERT, for instance, is pre-trained with both MLM and LOS prediction tasks on patient sequences of medical concepts. Our results indicate that LLMs can represent sequential clinical data well without specific pre-training on clinical sequences. Our method can also be used without the LOS data of each patient's hospitalizations, which Med-BERT pre-training requires. Such data are sometimes unavailable, for example when the data come from outpatient physician visits rather than hospitalizations, or when LOS is simply not recorded, as in claims data. Furthermore, fine-tuning CPLLM does not require knowing which diagnoses were given in which visit, only the diagnoses as a sequence. Med-BERT, by contrast, relies on this information for fine-tuning. Notably, we achieved superior performance even without these details.

与现有方法需要利用医学概念序列进行预训练不同,我们的方法无需额外预训练任务。例如,Med-BERT需同时执行MLM和基于患者医学概念序列的LOS预测任务。根据我们的研究发现,大语言模型具备出色表征临床序列数据的能力,而无需基于临床序列的特定预训练。此外,即使没有Med-BERT预训练所需的患者住院LOS数据,我们的方法仍可适用——这类数据在某些场景(如门诊患者就诊记录或理赔数据)可能缺失。值得注意的是,CPLLM在微调训练时仅需诊断序列信息,而无需知晓具体诊断与就诊次数的对应关系(这点与依赖此类信息进行微调的Med-BERT不同)。尽管未使用这些细节信息,我们仍实现了更优性能。
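
To make the "diagnoses as a sequence" input concrete, the sketch below flattens a visit-structured history into one ordered list and wraps it in a prompt. It is an illustration only: the function names, the example patient, and the prompt wording are hypothetical and are not taken from the CPLLM code or from the prompt used in the paper.

```python
from typing import List


def flatten_visits(visits: List[List[str]]) -> List[str]:
    """Drop visit boundaries, keeping diagnoses in chronological order."""
    return [diagnosis for visit in visits for diagnosis in visit]


def build_prompt(diagnoses: List[str], target: str) -> str:
    """Build a yes/no prompt from a flat diagnosis list (hypothetical wording)."""
    history = ", ".join(diagnoses)
    return (
        f"The patient was diagnosed with: {history}. "
        f"Will the patient be diagnosed with {target}? Answer yes or no."
    )


# Hypothetical patient history: two visits, oldest first.
visits = [
    ["Essential hypertension", "Type 2 diabetes mellitus"],
    ["Acute kidney failure"],
]
sequence = flatten_visits(visits)
print(build_prompt(sequence, "Chronic kidney disease"))
```

Because only the flat `sequence` reaches the prompt, neither visit membership nor length of stay is needed at fine-tuning time, which is the property the paragraph above describes.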

We found that including additional tokens in the LLM’s tokenizer before fine-tuning improves the measurement of the prediction model in most cases. For instance, as Llama2 was not initially pretrained on clinical data, supplementing it with missing description codes can enhance its understanding of the medical domain.

我们发现,在微调前向大语言模型的tokenizer中添加额外的token在大多数情况下能提升预测模型的测量效果。例如,由于Llama2最初并未在临床数据上进行预训练,为其补充缺失的描述代码可以增强其对医学领域的理解。

In the original Med-BERT paper, improvements over RETAIN were demonstrated in terms of ROC-AUC for three disease prediction tasks (Rasmy et al., 2021). We also found that Med-BERT consistently outperformed RETAIN in all prediction tasks based on ROC-AUC. However, as previously mentioned, ROC-AUC may not be an optimal metric for imbalanced datasets (Davis & Goadrich, 2006). In terms of PR-AUC, Med-BERT outperformed RETAIN in two out of three tasks, but not in the prediction of Acute and unspecified renal failure (PR-AUC of $43.603\%$ for RETAIN and $42.237\%$ for Med-BERT), despite achieving a higher ROC-AUC than RETAIN on that task.

在原版Med-BERT论文中,针对三项疾病预测任务的ROC-AUC指标,研究团队展示了其相对于RETAIN模型的改进(Rasmy et al., 2021)。我们也发现基于ROC-AUC评估,Med-BERT在所有预测任务中始终优于RETAIN。但需注意的是,如前所述,ROC-AUC可能并非不平衡数据集的最佳评估指标(Davis & Goadrich, 2006)。相比之下,在PR-AUC指标下,Med-BERT在三项任务中有两项表现优于RETAIN,尽管在"急性和未特指肾功能衰竭"预测中未能超越RETAIN(PR-AUC值分别为RETAIN $43.603%$ 和Med-BERT $42.237%$),此时Med-BERT的ROC-AUC仍高于RETAIN。
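
The caveat about ROC-AUC on imbalanced data can be illustrated on a toy example. The sketch below is self-contained and uses synthetic scores; it computes ROC-AUC as the fraction of correctly ordered positive/negative pairs and uses average precision as the PR-AUC estimate. Whether the paper computed PR-AUC in exactly this way is an assumption.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic imbalanced data: 100 negatives and 5 positives.
# Only 4 negatives are scored above the positives.
neg_scores = [i / 100 for i in range(100)]
pos_scores = [0.951, 0.952, 0.953, 0.954, 0.955]
labels = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

print(f"ROC-AUC: {roc_auc(labels, scores):.3f}")
print(f"Average precision: {average_precision(labels, scores):.3f}")
```

Here four high-scoring negatives barely move ROC-AUC, because they are a small share of the 100 negatives, but they sharply lower precision at the top of the ranking. This is one way two models can be ordered differently by the two metrics, as observed for Med-BERT and RETAIN.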

In our readmission prediction experiment, which included diagnoses, drugs, and procedures, we demonstrated the flexibility of our method: it incorporates medical concepts from different domains into the sequence with minimal adjustments to the prompt text.

在我们的再入院预测实验中(包括诊断、药物和治疗程序),我们展示了该方法的灵活性。它只需对提示文本进行最小调整,就能将来自不同领域的医学概念无缝整合到序列中。

Another strength of our proposed method is its capacity to handle longer sequences than the current state-of-the-art models for structured EHR data. With maximum sequence lengths of 1024 tokens for CPLLM-BioMedLM and 4096 tokens for CPLLM-Llama2, our approach goes well beyond the limits of Med-BERT and BEHRT (Li et al., 2020). Both Med-BERT and BEHRT are constrained by BERT's maximum of 512 tokens, which significantly restricts their ability to handle longer inputs (Devlin et al., 2018). Without any additional training, our method also handles longer sequences than Hi-BEHRT, which is specially designed and trained to handle sequences of up to 1220 tokens (Li et al., 2022a).

我们提出的方法另一个优势在于,相比当前结构化电子健康记录(EHR)数据的最先进模型,其处理长序列的能力显著提升。CPLLM-BioMedLM和CPLLM-Llama2分别支持1024和4096 token的最大序列长度,远超Med-BERT和BEHRT (Li et al., 2020) 的512 token限制 (Devlin et al., 2018)。即使不进行额外训练,我们的方法也优于专门设计用于处理最长1220 token序列的Hi-BEHRT (Li et al., 2022a)。
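
The limits quoted above can be collected into a small lookup for checking which models could accept a patient history of a given length. This is a sketch only: the dictionary and function names are hypothetical, and it assumes the token count has already been computed with each model's own tokenizer, since counts are not directly comparable across tokenizers.

```python
# Maximum input lengths (in tokens) as stated in the text.
MAX_TOKENS = {
    "Med-BERT": 512,
    "BEHRT": 512,
    "CPLLM-BioMedLM": 1024,
    "Hi-BEHRT": 1220,
    "CPLLM-Llama2": 4096,
}


def models_that_fit(n_tokens: int) -> list:
    """Return the models whose maximum sequence length covers n_tokens."""
    return [name for name, limit in MAX_TOKENS.items() if n_tokens <= limit]


for length in (400, 600, 1500):
    print(length, models_that_fit(length))
```

Inputs longer than a model's limit would have to be truncated, which discards part of the patient history.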

Healthcare stakeholders are looking for ways to improve caregiving without putting patients' data at risk. Both LLMs we tested can be deployed on-premise or in a secure environment and do not require sharing personal data over the web.

医疗保健领域的相关方正在寻求既能提升护理质量又不危及患者数据安全的解决方案。我们测试的两款大语言模型均可部署在本地或安全环境中使用,无需通过网络共享个人数据。

While our method shows promising results in using LLMs for clinical prediction tasks, it has several limitations. Although it accommodates sequences of up to 4096 tokens for CPLLM-Llama2 and 1024 tokens for CPLLM-BioMedLM, our tests did not include exceptionally long sequences that would fully explore this extended token limit, because the datasets we used do not contain very long observation periods or many diagnoses for a single patient. Moreover, because LLMs have more parameters, our method demands more computational resources, inference time, and training time. Specifically, CPLLM-Llama2 had a longer training time than Med-BERT. CPLLM-BioMedLM, however, requires less training time than Med-BERT (Section 3.1), because it does not require additional pre-training, unlike the MLM and LOS pre-training that Med-BERT needs.

虽然我们的方法在利用大语言模型(LLM)进行临床预测任务方面展现出良好效果,但仍需认识到若干局限性。尽管CPLLM-Llama2支持最长4096个token的序列,CPLLM-BioMedLM支持1024个token,但我们的测试并未包含能充分探索这种扩展token限制影响的超长序列。这是因为我们使用的数据集不包含非常长的观察记录或单个患者的多次诊断数据。此外,由于大语言模型参数量更大,我们的方法需要更多计算资源、推理时间和训练时间。具体而言,CPLLM-Llama2的训练时间比Med-BERT更长,但CPLLM-BioMedLM的训练时间比Med-BERT更短(3.1)。这是因为与Med-BERT需要进行MLM和LOS预训练不同,CPLLM-BioMedLM不需要额外的预训练过程。

In addition, our method requires a specific prompt, which the baseline models do not. As a result, the prompt sometimes needs to be adapted to the base model.

此外,在我们的方法中,需要使用特定的提示(prompt),而基线模型则无此要求。因此,有时需要根据基础模型调整提示。

Future work: We hypothesize that combining retrieval augmentation (Mialon et al., 2023; Zakka et al., 2024) with an LLM can improve performance, because it makes it possible to include up-to-date general knowledge about the diseases in the patient's medical history. This approach can also bring in general knowledge and known risk factors from research on the disease we are trying to predict.

未来工作:我们假设将检索增强 (retrieval augmentation) [Mialon et al., 2023; Zakka et al., 2024] 与大语言模型结合可以提升性能。这是因为该方法能够将患者病史中已确诊疾病的最新通用知识纳入考量。此外,这种方案还能整合通用知识和已知风险因素到我们试图预测疾病的研究中。

5 CONCLUSION

5 结论

In this work, we presented CPLLM, a novel method for clinical disease prediction and patient hospital readmission prediction based on the clinical history of patients. CPLLM has practical application potential. By surpassing the state-of-the-art in clinical prediction tasks, our method enables more accurate and robust disease forecasting, as well as hospital readmission forecasting. CPLLM demonstrated superior performance across all four tasks on two datasets (MIMIC-IV and eICU-CRD). It processes ICD9 and ICD10 diagnoses, procedures, and drugs, and we showed its robustness on both homogeneous and multi-center data. Unlike Med-BERT, our method needs no additional pre-training tasks. It also remains applicable when length-of-stay data are unavailable, making it suitable for a broader range of healthcare scenarios, including those involving non-hospitalized patients. In addition, CPLLM's fine-tuning process requires patients' diagnoses only as a sequence, without knowing which diagnoses were given in which visit. Notably, our method can handle much longer sequences than existing state-of-the-art models.

在本研究中,我们提出了CPLLM,这是一种基于患者临床病史的新型临床疾病预测和患者再入院预测方法。CPLLM具有实际应用潜力。通过在临床任务预测方面超越现有技术,我们的方法能够实现更准确、更稳健的疾病预测以及患者再入院预测。CPLLM在两个数据集(MIMIC-IV和eICUCRD)上均展现出优于所有现有方法的性能。该方法可处理ICD9和ICD10诊断、手术及药物数据。我们验证了其在处理同质化和多中心数据时的鲁棒性。与Med-BERT不同,我们的方法优势在于无需额外的预训练任务。此外,即使在缺少住院时长数据的情况下,该方法仍保持适应性,使其适用于更广泛的医疗场景,包括非住院患者的情况。值得注意的是,CPLLM的微调过程仅需将患者诊断作为序列输入,而无需区分具体就诊记录中的诊断信息。特别地,我们的方法能处理比现有最优模型长得多的序列数据。

6 REPRODUCIBILITY

6 可复现性

Our code is available at the following link: https://github.com/nadavlab/CPLLM. Implementation details can be found in the Experimental Setup section 3.1. To execute the baseline code, we used the source code published as part of the Med-BERT paper (Rasmy et al., 2021).

我们的代码可在以下链接获取:https://github.com/nadavlab/CPLLM。具体实现细节见实验设置章节3.1。执行基线代码时,我们采用了Med-BERT论文(Rasmy et al., 2021)中发布的源代码。

For our experiments, we used the MIMIC-IV v2.0 dataset (Johnson et al., 2020), accessible at https://physionet.org/content/mimiciv/2.0/, as well as the eICU-CRD multi-center dataset (Pollard et al., 2018), which can be found at https://physionet.org/content/eicu-crd/2.0/.

在我们的实验中,我们使用了MIMIC-IV v2.0数据集 (Johnson et al., 2020),可通过https://physionet.org/content/mimiciv/2.0/访问,以及eICU-CRD多中心数据集 (Pollard et al., 2018),可在https://physionet.org/content/eicu-crd/2.0/获取。

REFERENCES

参考文献
