[Paper Translation] CPLLM: Clinical Prediction with Large Language Models


Original paper: https://arxiv.org/pdf/2309.11295


CPLLM: CLINICAL PREDICTION WITH LARGE LANGUAGE MODELS


Ofir Ben Shoham
Department of Software and Information Systems Engineering
Ben-Gurion University of the Negev
benshoho@post.bgu.ac.il


Nadav Rappoport
Department of Software and Information Systems Engineering
Ben-Gurion University of the Negev
nadavrap@bgu.ac.il


ABSTRACT


We present Clinical Prediction with Large Language Models (CPLLM), a method that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical disease and readmission prediction. We utilized quantization and fine-tuned the LLM using prompts. For diagnosis prediction, we predict whether patients will be diagnosed with a target disease during their next visit or in the subsequent diagnosis, leveraging their historical diagnosis records. We compared our results to various baselines, including RETAIN and Med-BERT, the current state-of-the-art model for disease prediction using temporal structured EHR data. In addition, we evaluated CPLLM for patient hospital readmission prediction and compared our method's performance with benchmark baselines. Our experiments show that CPLLM surpasses all tested models in terms of PR-AUC and ROC-AUC metrics, achieving state-of-the-art results for both diagnosis prediction and patient hospital readmission prediction. Such a method can easily be implemented and integrated into the clinical process to help care providers estimate the next steps of patients.


1 INTRODUCTION


Large Language Models (LLMs) are a type of artificial intelligence (AI) that has been shown to be effective at a variety of Natural Language Processing tasks (Zhao et al., 2023). LLMs are trained on large amounts of textual data, which allows them to learn the statistical relationships between words and phrases. LLMs are used for different types of tasks, including natural language understanding, natural language generation, knowledge-intensive tasks, reasoning, and more (Yang et al., 2023b). This makes them well-suited for tasks that require understanding the meaning of a text, such as text classification (Gasparetto et al., 2022; Sun et al., 2023) and even clinical prediction in the medical domain (Thirunavukarasu et al., 2023; Steinberg et al., 2021).


Clinical predictions are used to estimate a patient's susceptibility to disease, gauge the likelihood of treatment response, or prognosticate the course of a patient's medical condition (Laupacis et al., 1997; Wasson et al., 1985). These predictions have traditionally been made with classical models such as Logistic Regression (Hosmer Jr et al., 2013) and Random Forest (Breiman, 2001). However, these traditional methods do not model the order of medical concept events (diagnoses, procedures, medications, etc.); instead, they rely solely on the absence or presence of these events (features).


Modern event-order prediction models, which are more advanced than the traditional prediction models mentioned above, are based on RNNs or transformers, the latter of which were shown to be superior (Vaswani et al., 2017); examples include BERT-style models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2020). Another transformer-based architecture is the GPT-style language model. GPT models are trained to generate the next word in a sequence and are used in a wide range of downstream tasks such as summarization, translation, question answering, and more (Floridi & Chiriatti, 2020). To name a few GPT models: LLaMA (Touvron et al., 2023a;b), Falcon (Almazrouei et al., 2023), Bloom (Scao et al., 2022), and GPT-4 (OpenAI, 2023). The flexibility and versatility of decoder-only models seem to be advantageous (Yang et al., 2023b).


These language models are particularly significant for handling sequential data, especially in the context of clinical prediction models that rely on Electronic Health Record (EHR) data. Structured EHR data encompasses a patient's clinical history, notable for its irregular temporal sequence of events and observations (Steinberg et al., 2021). Previous works model EHR diagnosis data as a sequence using BERT models, such as BEHRT (Li et al., 2020; 2022a; Shoham & Rappoport, 2023; Meng et al., 2021), Med-BERT (Rasmy et al., 2021), and Medic-BERT (Hansen et al., 2023) (for length-of-stay prediction). However, such BERT-based models represent each diagnosis code as an index and do not address the textual description of the ICD code. These models are pre-trained on clinical data, and their input sequence length is limited by the BERT architecture.


There is limited research on using LLMs to train clinical prediction models. Most applications of LLMs in the clinic focus on the chat capability of these models (Singhal et al., 2023; Thirunavukarasu et al., 2023) or use an LLM for medical-text tasks such as text generation (Lu et al., 2022; Agrawal et al., 2022) and text comprehension (Yang et al., 2022; Sivarajkumar & Wang, 2022; Li et al., 2022b; Jiang et al., 2023). In addition, Chen et al. (2023) proposed a method called ClinTaT for cancer prediction. Their focus was on cancer prognostic prediction using few-shot learning, and their data modeling was not designed for structured EHR data that consists of a sequence of diagnoses. In contrast, we want to harness the power of LLMs in understanding sequences of tokens derived from structured EHR data, specifically to train prediction models. We represent the structured data as text: each medical concept corresponds to a word, admissions are treated as visits, and a patient's history is considered a document. The objectives of this study are to develop a novel method for using LLMs to train clinical predictors and to evaluate the performance of this method on real-world datasets.


Our proposed method uses an LLM to predict future diagnoses and readmission of patients by fine-tuning LLMs. Medical concepts are represented by their text descriptions. Fine-tuning is performed using a prompt that feeds the model with training samples. We used two different LLMs: Llama2, a general-purpose LLM (Touvron et al., 2023b), and BioMedLM, which was trained on biological and clinical text (Venigalla et al., 2022). We used four prediction tasks and two datasets and compared the performance to baseline models.


The proposed method outperforms the state-of-the-art methods. Our generic method can be used for a variety of tasks and is not specific to any particular LLM. Moreover, our method is also suitable for different clinical domains such as demographics, diagnoses, laboratory test results, measurements, procedures, and more.


Contributions: (1) We propose CPLLM, a novel method for clinical prediction with LLMs that outperforms state-of-the-art models for disease prediction and patient readmission prediction on structured EHR data. In addition, CPLLM doesn't require pre-training on clinical data and achieves better performance than alternative approaches. Moreover, our method supports a longer sequence length than the baseline methods. (2) We show that adding tokens to the pre-trained tokenizer of the LLM before fine-tuning improves the performance of the clinical prediction model. (3) Our code is flexible for any LLM, available to use, and easily adaptable to various clinical prediction tasks.


2 METHODS


2.1 DISEASE PREDICTION - PROBLEM DEFINITION


Formally, for a given patient $p$, let $n$ denote the total number of diagnoses in their medical history. Thus, the patient's sequence of diagnoses is represented as $\{D_{p,1}, D_{p,2}, D_{p,3}, \ldots, D_{p,n}\}$, where each $D_{p,i}$ $(1 \leq i \leq n)$ corresponds to a medical diagnosis in the patient's history. We considered two types of diagnosis prediction: next diagnosis and next visit diagnosis.


Next diagnosis prediction: predict whether patient $p$ will be diagnosed with a specific disease $D_x$ as the $D_{p,i+1}$ diagnosis, given previous diagnoses. Our model relies on the patient's medical records up to the $i$-th diagnosis, denoted as $\{D_{p,1}, D_{p,2}, \ldots, D_{p,i}\}$, where $D_{p,i}$ $(1 \leq i < n)$ indicates the most recent diagnosis observed for patient $p$. The predictive model utilizes this patient-specific historical medical information to determine whether patient $p$'s next diagnosis is a specific disease or not.


Next visit diagnosis prediction: predicting the next diagnosis requires knowledge of the precise timing of each diagnosis. However, these data may occasionally be unavailable, such as when diagnoses are documented at the end of an admission. Consequently, in the context of the MIMIC-IV dataset, we undertake the task of forecasting whether a patient will receive a specific diagnosis in their subsequent admission.

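To make the label construction concrete, here is a minimal sketch (with hypothetical visit data and a hypothetical set of target ICD codes) of how a patient's visit-ordered history could be turned into a (history, label) pair for next-visit diagnosis prediction:

```python
from typing import List, Tuple

# Hypothetical ICD-10 codes that map to the outcome's CCS category:
TARGET_CODES = {"N17.0", "N17.9"}

def make_sample(visits: List[List[str]]) -> Tuple[List[str], int]:
    """visits[i] holds the ICD codes assigned during visit i (time-ordered)."""
    history = [code for visit in visits[:-1] for code in visit]  # all but last visit
    label = int(any(code in TARGET_CODES for code in visits[-1]))
    return history, label

history, label = make_sample([["I10", "E11.9"], ["N17.9", "I10"]])
print(history, label)  # ['I10', 'E11.9'] 1
```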

2.2 PATIENT HOSPITAL READMISSION PREDICTION


Based on a patient's medical history, including procedures, diagnoses, and medications, our objective is to forecast whether the patient will experience hospital readmission within the next $X$ days. We follow the definition of $X$ specified by the PyHealth benchmark (Yang et al., 2023a). In our experiments with the MIMIC-IV dataset, we predict hospital readmission within a 15-day window, and for the eICU-CRD dataset, the prediction time-frame is 5 days (see Section 2.3).

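As an illustration, the following sketch computes the readmission label from a pair of consecutive admissions; the function and timestamps are hypothetical, assuming only that discharge and admission times are available:

```python
from datetime import datetime, timedelta

def readmission_label(discharge: datetime, next_admit: datetime,
                      window_days: int) -> int:
    """1 if the next admission starts within `window_days` of discharge."""
    return int(next_admit - discharge <= timedelta(days=window_days))

# Discharged Jan 5, readmitted Jan 12 -> 7 days -> positive for a 15-day window:
print(readmission_label(datetime(2023, 1, 5), datetime(2023, 1, 12),
                        window_days=15))  # 1
```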

2.3 DATA


In this study, we used data from the eICU-CRD database (Pollard et al., 2018) and data from the MIMIC-IV database (Johnson et al., 2020). Our datasets include ICD-9-CM (eICU-CRD) and ICD-10-CM (MIMIC-IV) diagnoses and their descriptions. In the eICU-CRD database, each diagnosis is associated with a timestamp; consequently, we arranged the diagnoses in chronological order based on their respective diagnosis times. Our disease prediction task aims to anticipate whether the forthcoming diagnosis will correspond to a specific disease. Unlike the eICU-CRD dataset, the MIMIC-IV data lacks information on the exact time of each diagnosis assignment, but it provides the admission and discharge times for each patient. As a result, our prediction task for this dataset revolves around determining whether a patient will be diagnosed with a specific disease during their subsequent visit.


Med-BERT adopts a pre-training strategy and trains BERT using Masked Language Modeling (MLM) and length of stay (LOS) prediction tasks (Rasmy et al., 2021). Therefore, we extracted the necessary data from the databases, including the diagnosis codes for each patient, as well as the LOS of each admission and the number of visits of each patient. In our approach, by contrast, we did not conduct an additional pre-training step, as we focused on fine-tuning an LLM. Our proposed method requires neither the visit at which each diagnosis was given nor the duration of hospital stay. Notably, our method attains superior results even in the absence of these particulars. This aspect is significant, since in certain situations such data may not be accessible, for example, when a patient has not been admitted to the hospital but is under the care of a family doctor.


Data Preprocessing: For readmission prediction, we follow PyHealth's data preprocessing methodology. We include drugs, procedures, and diagnosis codes alongside their respective descriptions. We incorporate both ICD-9 and ICD-10 diagnosis codes and convert them to Clinical Classifications Software (CCS) codes (Elixhauser, 2009). For drugs, we convert the codes to ATC codes (Nahler & Nahler, 2009). For procedures, we include ICD-9 and ICD-10 procedure codes and convert them to CCS codes using PyHealth.

For diagnosis prediction on the MIMIC-IV dataset, we excluded patients with only one visit, as there is no medical history in such a case. Similarly, for the eICU-CRD dataset, patients with just one diagnosis were removed. We also excluded patients who already had the disease we are trying to predict at the first visit (or at the first diagnosis for the eICU-CRD data). We converted ICD-10 codes to their corresponding CCS categories for MIMIC-IV, while for eICU-CRD we retained the ICD-9 codes as they were; this decision was motivated by the higher number of ICD-10 codes compared to ICD-9 codes (Manchikanti et al., 2013). Based on the sequence of diagnoses for each patient, we determined whether the patient exhibited a specific diagnosis according to the ICD codes belonging to the relevant CCS category (Elixhauser et al., 2014). Table 1 provides an overview of the number of patients, the count of final patients after preprocessing, and the median number of visits and diagnoses for each disease prediction task.

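As a concrete illustration of these cohort filters, here is a minimal sketch in Python; the ICD-10-to-CCS mapping fragment and the patient records are hypothetical:

```python
# Hypothetical fragment of the ICD-10 -> CCS mapping:
ICD10_TO_CCS = {"N18.9": 158, "N17.9": 157}

def keep_patient(visits, target_ccs):
    """Apply the two cohort filters described above."""
    if len(visits) < 2:  # single visit: no history to learn from
        return False
    first_visit_ccs = {ICD10_TO_CCS.get(code) for code in visits[0]}
    return target_ccs not in first_visit_ccs  # positive at first visit: drop

patients = {"p1": [["N18.9"]],           # one visit only -> dropped
            "p2": [["I10"], ["N18.9"]]}  # kept
kept = {pid: v for pid, v in patients.items() if keep_patient(v, target_ccs=158)}
print(sorted(kept))  # ['p2']
```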

2.3.1 CLINICAL OUTCOMES


We evaluated our model on four prediction tasks: patient hospital readmission prediction and three diagnosis predictions covering Chronic kidney disease, Acute and unspecified renal failure, and Adult respiratory failure. The first two diagnoses were derived from the MIMIC-IV dataset, and the last from the eICU-CRD dataset. The corresponding CCS codes for these diseases are 157 for Acute and unspecified renal failure, 158 for Chronic kidney disease, and 131 for Adult respiratory failure. For each prediction task, patients with the specific disease's ICD codes were assigned a positive label, and their diagnosis history encompassed all diagnostic codes recorded until the specific code indicating the outcome of interest.


Table 1: Task statistics of the prediction tasks. Visit and diagnosis counts are calculated from the patient’s medical history after preprocessing. IQR - Interquartile range.

| Dataset | Task | # of patients | Final # of patients | Median # of visits (IQR) | Median # of diagnoses (IQR) |
| --- | --- | --- | --- | --- | --- |
| MIMIC-IV | Chronic kidney disease | 84,453 | 26,161 | 1 (1-2) | 11 (7-19) |
| MIMIC-IV | Acute and unspecified renal failure | 84,453 | 26,736 | 1 (1-2) | 11 (7-19) |
| eICU-CRD | Adult respiratory failure | 132,677 | 56,419 | 1 (1-1) | 1 (1-2) |


2.4 BASELINE METHODS


We conducted a rigorous performance assessment of CPLLM against baseline methods. For the diagnosis prediction task, we used the following baseline models: first, Med-BERT with a classification layer (Rasmy et al., 2021); second, Logistic Regression (Hosmer Jr et al., 2013); and third, RETAIN, a disease prediction model featuring double GRUs and attention modules (Choi et al., 2016). Comparing CPLLM with these baselines provides insight into its performance on clinical prediction downstream tasks. The comparison was conducted using two metrics: the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (ROC-AUC). Disease prediction tasks are typically imbalanced, and ROC-AUC is less suitable for binary classifiers with imbalanced data (Davis & Goadrich, 2006). Therefore, our main evaluation metric is PR-AUC, but we also report ROC-AUC for consistency with the baseline methods. For readmission prediction, as mentioned earlier, we compared CPLLM with PyHealth baselines: ConCare (Ma et al., 2020), RETAIN (Choi et al., 2016), Deepr (Nguyen et al., 2016), and GRASP (Zhang et al., 2021).

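For reference, a minimal sketch of how the two metrics can be computed with scikit-learn on synthetic scores; average precision is a standard estimator of the area under the precision-recall curve:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]                       # imbalanced labels
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.2, 0.9, 0.3]  # model scores

print(f"PR-AUC:  {average_precision_score(y_true, y_score):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_true, y_score):.3f}")
```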

2.5 OUR PROPOSED METHOD


We propose a method called Clinical Prediction with Large Language Models (CPLLM). This method involves fine-tuning an LLM using prompts tailored to medical concept sequences. Through fine-tuning with prompts (inputs that guide the LLM), we direct the LLM to grasp intricate relationships among medical concepts.


We utilized two LLMs: Llama2 (13B parameters) (Touvron et al., 2023b) and BioMedLM (also called PubMedGPT, 2.7B parameters) (Venigalla et al., 2022). To enhance the time and memory efficiency of fine-tuning these LLMs, we used QLoRA (Dettmers et al., 2023) and PEFT (Houlsby et al., 2019). QLoRA is a PEFT approach that decreases the number of parameters requiring fine-tuning and also performs quantization (Dettmers et al., 2023). This combined approach effectively optimized the models' efficiency, enabling single-GPU fine-tuning for both BioMedLM and Llama2.

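A minimal sketch of what such a setup might look like with Hugging Face transformers and peft: 4-bit NF4 quantization plus a LoRA adapter on a sequence-classification head. The LoRA alpha (32), dropout (0.1), and bias setting match Section 3.1; the rank r=16, the compute dtype, and the checkpoint name are assumptions, not taken from the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA-style 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # assumed checkpoint name
    num_labels=2,                           # binary clinical outcome
    quantization_config=bnb_config,
)
model.config.pad_token_id = model.config.eos_token_id  # Llama has no pad token

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                 # rank: an assumption, not stated in the paper
    lora_alpha=32,        # as in Section 3.1
    lora_dropout=0.1,     # as in Section 3.1
    bias="none",          # as in Section 3.1
    task_type="SEQ_CLS",  # sequence classification with the added head
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter (and head) weights train
```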

We performed separate fine-tuning of each LLM, leveraging prompts tailored to our patients' medical codes and their corresponding labels. Figure 1 presents an example of the prompts used during the fine-tuning process for both Llama2 and BioMedLM. The prompt also specifies the target disease, and prompts were designed to incorporate each patient's medical code history, with the goal of improving the models' performance. For readmission prediction, the prompt was very similar but additionally included drugs and procedures. For diagnosis prediction tasks, we added tokens for diagnosis descriptions missing from the original tokenizer vocabulary of the LLM. We performed an ablation study comparing performance with and without changing the vocabulary of the pre-trained tokenizer.

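A minimal sketch of the tokenizer extension; the added tokens here are hypothetical stand-ins for diagnosis-description words absent from the base vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # assumed
# Hypothetical diagnosis-description words missing from the vocabulary:
num_added = tokenizer.add_tokens(["nephropathy", "hyperlipidemia"])
print(f"added {num_added} tokens")

# After loading the model (see the QLoRA sketch above), the embedding matrix
# must be resized to cover the new tokens:
# model.resize_token_embeddings(len(tokenizer))
```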


Figure 1: Illustration of the fine-tuning process for diagnosis prediction. (A) An example of EHR structured data; the patient has three diagnoses. (B) The patient's historical data is extracted from the EHR and decoded into a textual list of descriptions. (C) The decoded textual data is then injected into a designed prompt for fine-tuning the LLM. Fine-tuning prompts consist of a general description, the patient's diagnosis history, and a label. The label is set to 1 when the patient is diagnosed with the outcome of interest (e.g., Adult respiratory failure in the subsequent diagnosis or during the next admission, depending on the task).

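For illustration, a sketch of a prompt builder in the spirit of Figure 1; the template wording here is an assumption, not the paper's verbatim prompt:

```python
def build_prompt(diagnosis_descriptions: list, target_disease: str) -> str:
    """Compose a fine-tuning prompt from decoded diagnosis descriptions."""
    history = "; ".join(diagnosis_descriptions)
    return (f"The patient's diagnosis history is: {history}. "
            f"Will the patient be diagnosed with {target_disease} "
            f"in the subsequent diagnosis or next admission?")

print(build_prompt(
    ["Essential hypertension", "Type 2 diabetes mellitus", "Pneumonia"],
    target_disease="Adult respiratory failure"))
```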

For the clinical prediction downstream task, we performed fine-tuning as depicted in Figure 1. We added a classification layer corresponding to the number of labels and used prompts asking the LLM to output a single binary response (1 or 0). By training the models on all patients' data for the specified number of epochs, we obtained a fine-tuned LLM tailored to our specific clinical prediction task.


3 EXPERIMENTS


3.1 EXPERIMENTAL SETUP


For readmission prediction, we compared our method to the PyHealth benchmark. For the diagnosis prediction tasks, we compare our method to three baseline models. The first is a simple Logistic Regression that does not model the data as a sequence but as independent, unordered variables (Manogaran & Lopez, 2018). The second is RETAIN, a two-level neural attention model (Choi et al., 2016). The third baseline is Med-BERT, the state-of-the-art for disease prediction from structured EHR data; RETAIN was the baseline of Med-BERT. We split our data using a 70-10-20 ratio into train, validation, and test sets, respectively.

For Med-BERT, we trained the pre-training model with the MLM and LOS tasks using the TensorFlow package (Abadi et al., 2015). The training of Med-BERT's MLM phase was performed according to the fixed number of steps in the original implementation and took about 1.5 days on an RTX 1080 GPU. Subsequently, we fine-tuned the pre-trained model for the specific clinical prediction downstream tasks. The RETAIN and Med-BERT baselines trained for up to 500 epochs with early stopping based on the PR-AUC value on the validation set, with a patience of 5 epochs without improvement (Prechelt, 2002). During the training of the baselines, we experimented with batch sizes $\{32, 100\}$ and learning rates $\{1e^{-5}, 2e^{-5}\}$. For each prediction task, we selected the hyper-parameters that achieved the best results on the validation set.

For Logistic Regression training, we utilized the scikit-learn package (Pedregosa et al., 2011) and trained the model on a CPU. To determine the optimal hyper-parameters, we conducted a grid search over the penalty (L1 and L2 regularization), $C$, the solver, and the maximum number of iterations. We explored values of $\{0.1, 1, 10\}$ for $C$, {'liblinear', 'saga'} for the solver, and $\{100, 200, 500\}$ for the maximum number of iterations, taking the best hyper-parameters based on the validation PR-AUC for each prediction task.

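The grid described above maps directly onto scikit-learn's GridSearchCV; a minimal sketch, with X_train and y_train standing in for the preprocessed features and labels:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.1, 1, 10],
    "solver": ["liblinear", "saga"],  # both support L1 and L2 penalties
    "max_iter": [100, 200, 500],
}
search = GridSearchCV(LogisticRegression(), param_grid,
                      scoring="average_precision",  # select on PR-AUC
                      n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train: assumed bag-of-codes data
# print(search.best_params_)
```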

For CPLLM experiments, we fine-tuned two LLMs, Llama2 (13B) and BioMedLM (2.7B), using Hugging Face (Wolf et al., 2019) and QLoRA (Dettmers et al., 2023). Specifically, we used a learning rate of $2e^{-5}$, a LoRA alpha of 32, a LoRA dropout of 0.1, and no bias. Given the resource constraints, we used the maximum batch size that our GPU memory could accommodate. We fine-tuned each model for six epochs (four epochs for readmission due to the larger dataset), selecting the best checkpoint based on validation PR-AUC. Fine-tuning Llama2 for six epochs required about a day of training on an RTX 6000 GPU, while BioMedLM took about two hours on the same hardware. Our fine-tuning process used PEFT, and we did not perform additional pre-training in the clinical domain, yet our CPLLM method outperformed the baseline models.


3.2 RESULTS


3.2.1 DIAGNOSIS PREDICTION RESULTS


We consider various models for the clinical prediction task: Logistic Regression, Med-BERT with a classification layer, RETAIN, and our proposed method, CPLLM. To examine the statistical significance of the results, we ran each model three times. Table 2 shows the mean and 95% confidence interval of PR-AUC and ROC-AUC for these models.


Our findings demonstrate that CPLLM outperforms all tested models, including RETAIN, Med-BERT, and Logistic Regression, across both PR-AUC and ROC-AUC metrics. Specifically, on the Adult respiratory failure task, CPLLM-Llama2 achieved a PR-AUC of 35.962%, an absolute improvement of 0.912% over the best-performing baseline, Logistic Regression, which obtained a PR-AUC of 35.05%. This corresponds to a relative improvement of 2.6% in PR-AUC. Additionally, our method exhibits a relative increase of 5.1% in PR-AUC compared to RETAIN and of 3.31% compared to Med-BERT. Regarding ROC-AUC, CPLLM also outperforms the baseline models, and CPLLM-Llama2 performs better on this task than CPLLM-BioMedLM. Logistic Regression outperforms RETAIN in both PR-AUC (35.05%) and ROC-AUC (74.664%), and it also outperforms Med-BERT in PR-AUC, albeit not in ROC-AUC (74.664% compared to 75.407% for Med-BERT).


For Chronic kidney disease on the MIMIC-IV dataset, RETAIN had the worst performance on both metrics, and Med-BERT outperformed Logistic Regression and RETAIN. CPLLM-Llama2 had the highest PR-AUC score of 33.992%, followed by CPLLM-BioMedLM with 33.984% and Med-BERT with 33.37%. However, in ROC-AUC, CPLLM-BioMedLM outperformed all models with a score of 83.404%, followed by Med-BERT with 83.12% and CPLLM-Llama2 with 83.034%.
