[Paper Translation] Patient Trajectory Prediction: Integrating Clinical Notes with Transformers


Original paper: https://arxiv.org/pdf/2502.18009v1


Patient Trajectory Prediction: Integrating Clinical Notes with Transformers


Keywords:


Trajectory prediction, Transformers, Knowledge integration, Deep learning


Abstract:


Predicting disease trajectories from electronic health records (EHRs) is a complex task due to major challenges such as data non-stationarity, high granularity of medical codes, and integration of multimodal data. EHRs contain both structured data, such as diagnostic codes, and unstructured data, such as clinical notes, which hold essential information often overlooked. Current models, primarily based on structured data, struggle to capture the complete medical context of patients, resulting in a loss of valuable information. To address this issue, we propose an approach that integrates unstructured clinical notes into transformer-based deep learning models for sequential disease prediction. This integration enriches the representation of patients’ medical histories, thereby improving the accuracy of diagnosis predictions. Experiments on MIMIC-IV datasets demonstrate that the proposed approach outperforms traditional models relying solely on structured data.


1 INTRODUCTION


In healthcare, the exponential growth of Electronic Health Records (EHRs) has revolutionized patient care while posing new challenges. Healthcare professionals now frequently interact with medical records spanning several decades, having to process and analyze this vast amount of information to make informed decisions about patients’ future health status. This evolution has accelerated the development of automated systems to predict future diagnoses from past medical data, thus becoming a key element of personalized and proactive medicine (Figure 1). Machine learning techniques, particularly deep learning, have seen increasing growth in medicine (Egger et al., 2022), thanks to their adaptability and good results. In medical imaging, for example, deep learning models have achieved a high level of performance in predicting medical diagnoses, sometimes comparable to or even surpassing that of human experts (Mall et al., 2023). These results have led researchers to apply similar techniques to the task of sequential disease prediction (Choi et al., 2016a; Rodrigues-Jr et al., 2021; Shankar et al., 2023), where the goal is to predict a patient’s diagnosis at their next visit (N+1) based on the content of their previous visits (N). However, modeling patient trajectories from EHR data presents unique challenges:


• The non-stationarity of EHR data, which leads to variations in the data, limiting the generalizability of models.


Addressing these challenges is essential to develop both accurate and reliable patient trajectory prediction systems capable of assisting physicians in decisionmaking by providing comprehensive forecasts based on a patient’s clinical history.


In light of these challenges, this article focuses on improving the accuracy of automated medical prognosis systems, particularly in predicting future diagnoses based on patients’ historical medical records.



Figure 1: Sequential disease predictions


Current coding systems, such as the International Classification of Diseases (ICD), often do not fully capture the richness of information contained in clinical notes, which can lead to a loss of valuable information for predicting patient trajectories. To overcome this problem, we propose an approach aimed at improving the accuracy of diagnostic code predictions by integrating clinical note embeddings into transformers, which typically rely solely on medical codes. This method incorporates a discriminating factor that reduces prediction errors by enriching the representation of embeddings. This also allows for the recovery of valuable information often lost in coding systems such as ICD. By incorporating additional context, our approach addresses challenges related to understanding the reasons behind medication prescriptions, procedures performed, and diagnoses made.


This article is organized as follows: Section 2 reviews the literature. Section 3 describes our approach, including the process of generating embeddings and their integration into transformers. In Section 4, we present our experimental results. Finally, Section 5 concludes this article and presents future work.


2 State of the Art


Various methods, whether based on deep learning or traditional approaches, have been explored to predict patient trajectories. Among them, Doctor AI (Choi et al., 2016a), a temporal model based on recurrent neural networks (RNNs), was developed and applied to longitudinal time-stamped EHR data. Doctor AI predicts a patient’s medical codes and estimates the time until the next visit. However, it is limited by a fixed window width, which proves inadequate, as a patient’s future diagnosis may depend on medical conditions outside this window. LIG-Doctor (Rodrigues-Jr et al., 2021) is an artificial neural network architecture designed to efficiently predict patient trajectories using minimal bidirectional recurrent networks (MGRUs). MGRUs handle the granularity of ICD-9 codes, but suffer from the same limitations as Doctor AI. In (Choi et al., 2016b), the authors propose RETAIN, an interpretable predictive model for healthcare using a reverse time attention mechanism. Two RNNs are trained in reverse time order to learn the importance of previous visits, offering improved interpretability. DeepCare (Pham et al., 2017) employs Long Short-Term Memory (LSTM) networks for predicting next-visit diagnosis codes, intervention recommendations, and future risk. The Life Model (LM) framework (Manashty and Light, 2019) proposed an efficient representation of temporal data in concise sequences for training RNN-based models, introducing the Mean Tolerance Error (MTE) as both a loss function and a metric. Deep Patient (Miotto et al., 2016) introduced an unsupervised deep learning approach using Stacked Denoising Autoencoders (SDAs) to extract meaningful feature representations from EHR data, but it does not consider temporal characteristics, which is limiting, as this notion of time is inherent to the trajectory of a patient.
In parallel, more classical methods such as Markov chains (Severson et al., 2020), Bayesian networks (Longato et al., 2022), and Hawkes processes (Lima, 2023) have been explored, but suffer from computational complexity when faced with massive data.


The introduction of transformers marked an advancement, with Clinical GAN (Shankar et al., 2023), a Generative Adversarial Network (GAN) method based on the Transformer architecture. In this approach, an encoder-decoder model serves as the generator, while an encoder-only Transformer acts as the critic. The goal was to address exposure bias (Arora et al., 2022), a general issue (i.e., not specific to the Transformer) that arises from the teacher forcing training strategy. However, the use of GANs is challenged by scalability issues, such as training instability, non-convergence, and mode collapse (Saad et al., 2024).


Despite the development of various approaches for predicting medical codes, it is important to note that most proposed models have been trained on electronic health records (EHRs) containing only structured data on diagnoses and procedures, such as ICD and CCS codes. However, these data omit some essential contextual information, such as medical reasoning and patient-specific nuances, which can be captured through clinical notes.


Moreover, comparing results between different studies poses several challenges:


• Dataset Variation: Studies utilize different datasets (e.g., MIMIC-III vs. MIMIC-IV), which encompass varying patient populations and time periods (Johnson et al., 2016; Johnson et al., 2020). This variation can lead to discrepancies in results, as one dataset may present more challenging diagnoses to predict than another due to differing distributions. Consequently, such discrepancies complicate the reliability of comparisons between studies and may impact the applicability of findings to clinical practice.

• Test Set Size: The size of the test set can significantly impact results. For instance, Shankar et al. (Shankar et al., 2023) used a test set of only 5% of their dataset (approximately 1700 visits), which may not adequately represent the diversity and complexity of patients’ profiles.

• Lack of Standardization: There is often a lack of transparency regarding specific training datasets and preprocessing steps. Additionally, inconsistencies in the implementation of evaluation metrics can lead to discrepancies in reported results.

• Preprocessing Variations: Different preprocessing steps, such as tokenization and data cleaning, can affect model performance and hinder direct comparisons (Edin et al., 2023).

• Code Mapping Inconsistencies: Some approaches predict medical codes directly, while others map to ICD codes. Variations in mapping schemes, such as choosing whether to apply the Clinical Classifications Software Refined (CCSR), can lead to inconsistencies in final code representations (i.e., different target labels).


These challenges underscore the importance of careful consideration when comparing results across different studies in this field. To enhance comparability and reproducibility in research on patient trajectory prediction, it is crucial to standardize datasets, preprocessing methods, and evaluation metrics.


3 Proposed Methodology


In this section, we describe our approach for predicting patient trajectories, which relies on the MIMIC-IV datasets. We detail our methodology, including data preprocessing, model architecture, and integration of clinical notes.


3.1 Data Preprocessing


Preprocessing of MIMIC-IV data includes several operations:


Table 1 presents code statistics before and after applying processing steps. This is further illustrated in Figure 2, which reveals a significant data imbalance, showing that a larger number of patients made only a single visit compared to those with multiple visits.


In addition to structured data processing, clinical notes were also preprocessed to ensure consistency. Working with limited textual datasets poses challenges, particularly due to subword tokenizers that fragment similar tokens differently because of slight structural variations. Therefore, we standardized clinical notes by unifying medical abbreviations (e.g., "hr", "hrs", and "hr(s)" to "hours"), removing accents, converting Danish characters (such as "æ" to "ae"), and putting all notes in lowercase, following the approach of (Alsentzer et al., 2019).

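The normalization described above can be sketched as follows; this is an illustrative Python version, assuming a small hand-written abbreviation map rather than the authors' exact rules:

```python
import re
import unicodedata

# Illustrative abbreviation map (ordered: longest pattern first), not the
# authors' full rule set.
ABBREVIATIONS = {
    r"\bhr\(s\)": "hours",
    r"\bhrs\b": "hours",
    r"\bhr\b": "hours",
}

def normalize_note(text: str) -> str:
    """Lowercase, map Danish characters, strip accents, unify abbreviations."""
    text = text.lower()
    # Danish characters have no combining-mark decomposition, so map them
    # explicitly before stripping accents.
    text = text.replace("æ", "ae").replace("ø", "o").replace("å", "a")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    for pattern, replacement in ABBREVIATIONS.items():
        text = re.sub(pattern, replacement, text)
    return text

print(normalize_note("Pt stable for 2 hr(s), BP élevée"))
# → pt stable for 2 hours, bp elevee
```

Running every note through one deterministic function like this keeps the subword tokenizer from splitting near-identical strings into different token sequences.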

3.2 Integration of Clinical Notes


Electronic Health Records (EHRs) typically contain both structured data (e.g., ICD and CCS codes) and unstructured clinical notes. While structured codes provide a standardized representation of diagnoses and procedures, we hypothesize that clinical notes contain additional valuable information that may not be fully captured by these codification systems.



Figure 2: Sample distribution of patients by visit count


Table 1: Code statistics before and after processing.
Note: For each code type, the first row shows the number of distinct codes, and the second row shows the mean ± standard deviation per visit.

Code type           At load          After preprocessing
Procedure codes     8482             470
                    3.03 ± 2.81      2.99 ± 2.77
Diagnosis codes     15763            762
                    12.50 ± 7.67     13.18 ± 8.58
Medication codes    1609             1609
                    24.12 ± 28.19    24.12 ± 28.19

In light of this, we propose incorporating clinical note embeddings into transformer-based models, which have traditionally focused solely on clinical codes. Our model, Clinical Mosaic, aims to leverage both structured and unstructured data, offering a more comprehensive view of a patient’s clinical state and history. This integration not only enriches the model’s input but also enhances its ability to understand and predict patient trajectories more accurately.


3.2.1 Clinical Mosaic Model


To effectively exploit the information contained in clinical notes, it is crucial to obtain vector representations. BERT models (Devlin et al., 2018), and particularly Clinical BERT (Alsentzer et al., 2019), have proven their ability to capture relevant semantic representations in the medical domain. Clinical BERT is a model pretrained on MIMIC-III, specifically designed for medical notes. However, this model has certain limitations that may affect its performance in our current context:


  1. Limited sequence length: Clinical BERT was primarily pretrained on sequence lengths of 128 tokens. This limitation may cause the model to underperform when generating representations for longer clinical texts, such as comprehensive discharge summaries. Many studies (Wang et al., 2024) show that models trained with larger context lengths tend to outperform those trained on shorter sequences, as they can capture more long-range dependencies and contextual information.
  2. Outdated training data: Clinical BERT was pretrained on MIMIC-III, which is an older version of the MIMIC database. We are currently using MIMIC-IV-NOTES 2.2, which contains more recent and potentially more diverse clinical data. To the best of our knowledge, no publicly available model has been pretrained on this latest version of MIMIC-IV-NOTES.

These limitations may hinder the model’s ability to fully capture the richness and complexity of clinical narratives. The mismatch between the pretraining data (MIMIC-III) and the target data (MIMIC-IV-NOTES 2.2) could result in suboptimal performance due to differences in language patterns, terminology, and structure.


To address this, we introduce Clinical Mosaic, an adaptation of the Mosaic BERT architecture (Portes et al., 2024) designed for clinical text. The model is pretrained with a sequence length of 512 tokens, leveraging Attention with Linear Biases (ALiBi) to improve extrapolation beyond this limit without requiring learned positional embeddings. This allows for better generalization to downstream tasks that may require longer contexts. Pretraining is conducted on 331,794 clinical notes (approximately 170 million tokens) from MIMIC-IV-NOTES 2.2, utilizing 7 A40 GPUs with distributed data parallelism (DDP). The training parameters are detailed in Table 2. To facilitate further research and reproducibility, we publicly release the model weights.

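As a concrete illustration of the ALiBi mechanism mentioned above, the sketch below builds the linear distance penalty that is added to the attention logits. It is a simplified, symmetric (bidirectional) NumPy version using the geometric head slopes from the ALiBi paper, assuming the number of heads is a power of two; it is not the authors' training code.

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """ALiBi: a head-specific linear penalty on query-key distance, added to
    attention logits so that no learned positional embeddings are needed."""
    # Geometric slopes per head: 2^(-8/num_heads), 2^(-16/num_heads), ...
    start = 2.0 ** (-8.0 / num_heads)
    slopes = np.array([start ** (i + 1) for i in range(num_heads)])
    pos = np.arange(seq_len)
    distance = np.abs(pos[None, :] - pos[:, None])  # |i - j|, shape (L, L)
    # Broadcast to (num_heads, L, L); farther keys get a larger penalty.
    return -slopes[:, None, None] * distance

bias = alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)  # (8, 4, 4)
```

Because the penalty is a fixed linear function of distance, the same formula extends to sequences longer than the 512 tokens seen in pretraining, which is what enables the extrapolation described above.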

Table 2: Training parameters of the Clinical Mosaic model

Parameter                 Value
Effective batch size      224
Training steps            80,000
Sequence length           512 tokens
Optimizer                 AdamW
Initial learning rate     5e-4
Learning rate schedule    Linear warmup for 33,000 steps, then cosine annealing for 46,000 steps
Final learning rate       1e-5
Masking probability       30%

3.2.2 Evaluation of Clinical Reasoning of Clinical Mosaic


To evaluate the performance of Clinical Mosaic, we fine-tuned the model on the Medical Natural Language Inference (MedNLI) dataset (Romanov and Shivade, 2018). MedNLI is a dataset designed for natural language inference tasks in the clinical domain, derived from MIMIC-III clinical notes. It consists of 14,049 pairs of premises and hypotheses, with the objective of classifying the relationship between each pair as entailment, contradiction, or neutral. We report our results on the same test set used by other models for consistency and comparability.


The MedNLI task evaluates several essential aspects of clinical language understanding, including semantic comprehension of medical terminology, logical reasoning in a clinical context, as well as the ability to discern nuanced relationships between clinical statements. Performance on this dataset serves as an indicator of a model’s ability to understand and reason about clinical language, a crucial foundation for predicting patient trajectories.


Experimental Setup. We fine-tuned Clinical Mosaic using the AdamW optimizer with a linear warmup and decay learning rate schedule. The backbone of the model was initialized from the publicly released Sifal/ClinicalMosaic checkpoint on Hugging Face. The classifier head was trained with an increased learning rate compared to the backbone, allowing for targeted adaptation to the MedNLI classification task. The batch size was set to 64, with gradient clipping applied to stabilize training. We used early stopping with a patience of 10 epochs based on validation loss. The model achieved its best result at epoch 27, with an average loss of 0.0221 and a validation accuracy of 86.5%.

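The two-tier learning-rate setup above maps directly onto AdamW parameter groups. A minimal sketch using the Table 3 values; the two `nn.Linear` modules are stand-ins for the real encoder and classifier head, not the actual architecture:

```python
import torch
from torch import nn

# Stand-in modules: a 768-d encoder output feeding a 3-way NLI classifier
# (entailment / contradiction / neutral).
model = nn.ModuleDict({
    "backbone": nn.Linear(768, 768),
    "classifier": nn.Linear(768, 3),
})

# One optimizer, two parameter groups: the classifier head adapts with a
# 10x higher learning rate than the pretrained backbone.
optimizer = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 2e-5},
        {"params": model["classifier"].parameters(), "lr": 2e-4},
    ],
    betas=(0.9, 0.98),   # lower beta2 than the usual 0.999 default
    eps=1e-6,
    weight_decay=1e-6,
)
```

Keeping both groups in a single optimizer lets one scheduler scale both learning rates in lockstep while preserving their 10x ratio.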

From the few experiments we conducted, we found the model to be highly sensitive to the learning rate. This could be an artifact of using a lower Adam $\beta_2$ (0.98) than the commonly used value of 0.999, potentially affecting the optimizer’s adaptation dynamics.


Table 3: Hyperparameter configuration used for fine-tuning Clinical Mosaic on MedNLI.

Hyperparameter                Value
Learning rate (backbone)      2e-5
Learning rate (classifier)    2e-4
Weight decay                  1e-6
Optimizer                     AdamW
Betas                         (0.9, 0.98)
Epsilon (ε)                   1e-6
Batch size                    64
Max gradient norm             1.0
Warmup epochs                 5
Total epochs                  40
Early stopping patience       10
Freeze backbone               False

Table 4 presents the performance of Clinical Mosaic on the MedNLI task in comparison with other state-of-the-art models, including the original Clinical BERT model (Alsentzer et al., 2019).


Table 4: Comparison of performance of BERT variants and Clinical Mosaic on the MedNLI test set.

Model                       Accuracy
BERT                        77.6%
BioBERT                     80.8%
Discharge Summary BERT      80.6%
Clinical Discharge BERT     84.1%
Bio+Clinical BERT           82.7%
Clinical Mosaic             86.5%


The results show that Clinical Mosaic achieves superior accuracy (86.5%) compared to existing models, including the original Clinical BERT (84.1%). This improvement suggests that our model optimizations and pretraining approach have strengthened its clinical language comprehension capabilities. To ensure reproducibility, our training script is available in the GitHub repository.


3.3 Fusion of Clinical Representations


To evaluate the impact of integrating clinical note embeddings into an encoder-decoder transformer, we experimented with different fusion points and determined that introducing embeddings at the first layer, before attention mechanisms, led to the best results. This ensures that multi-head attention fully leverages the fused representations, promoting richer interactions (Figure 3).


Each layer of BERT encoders generates different representations of clinical notes. Inspired by prior work (Hosseini et al., 2023), which demonstrated the benefits of aggregating multiple layers, we hypothesized that a similar strategy would enhance our clinical tasks. To balance computational efficiency and performance, we aggregated representations from the last six layers of Clinical Mosaic, ensuring robust embeddings while keeping model complexity manageable.

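A minimal sketch of this aggregation step, assuming a 12-layer encoder with hidden size 768; mean pooling over layers and tokens is our illustrative choice, not a detail stated in the text:

```python
import numpy as np

# Hidden states as a BERT-style encoder would return them:
# the embedding layer plus 12 transformer layers, for a 128-token note.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 128, 768))  # (layers, tokens, hidden)

last_six = hidden_states[-6:]                 # keep only the last six layers
note_embedding = last_six.mean(axis=(0, 1))   # pool over layers and tokens

print(note_embedding.shape)  # (768,)
```

Averaging only the deepest layers keeps the contextualized signal while avoiding the cost of carrying all thirteen per-layer representations downstream.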

We explored three strategies for embedding generation:

• Mean: averaging the selected embedding layers and the visit notes into a single vector.
• Concat: averaging only the embedding layers, keeping one embedding per visit.
• Projection: passing the note representations through a projection layer before fusion (Figure 4).


After generating embeddings, we integrate them with CCS code embeddings within a transformer framework. CCS codes attend to clinical note embeddings through self-attention, forming a unified representation. The decoder then applies causal cross-attention to predict future diagnoses. This fusion of structured (CCS codes) and unstructured (clinical notes) data provides a more comprehensive view of patient trajectories, improving predictive performance.

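One plausible reading of this fusion point, sketched below: the pooled note representation is prepended to the visit history's CCS code embeddings before the first encoder layer, so self-attention can mix both sources from the start. The shapes and the prepending strategy are illustrative assumptions, not the authors' exact wiring:

```python
import numpy as np

hidden = 768
code_embeddings = np.zeros((10, hidden))  # 10 CCS code tokens from past visits
note_embedding = np.ones((1, hidden))     # pooled Clinical Mosaic vector

# Fuse at the encoder input: the note vector participates in self-attention
# alongside the structured codes from the very first layer.
encoder_input = np.concatenate([note_embedding, code_embeddings], axis=0)
print(encoder_input.shape)  # (11, 768)
```

Fusing before attention, rather than after, is what lets every code token attend to the note context in all subsequent layers.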

We also investigated the impact of explicitly adding positional information to the clinical note embeddings, exploring static, sinusoidal, and learned positional embeddings. However, our experiments did not show any noticeable performance improvements.


This could be an artifact of the relatively small dataset size, limiting the model’s ability to benefit from explicit positional encoding. Alternatively, it is possible that the model implicitly learns to capture temporal relationships through the timestamps included in the clinical notes, making additional positional encoding redundant.


4 Experiments


In this section, we describe the experiments we conducted to evaluate our approach with the MIMIC-IV and MIMIC-IV-NOTES datasets (37k source-target pairs after preprocessing). The source code is made available for reproducibility purposes.


4.1 Metrics


We evaluate the performance of our models with Mean Average Precision at K (MAP@K) (Equation 1) and Mean Average Recall at K (MAR@K) (Equation 2) for K = 20, 40, 60. These metrics are appropriate for our problem, which can be considered a recommendation task where order is crucial, and they allow direct comparison with previous work, even if some studies use only one metric (Rodrigues-Jr et al., 2021).


$$
\mathbf{MAP@K}=\frac{1}{|Q|}\sum_{u=1}^{|Q|}\frac{1}{\min(m,K)}\sum_{k=1}^{K}P(k)\cdot rel(k)
$$


$$
\mathbf{MAR@K}=\frac{1}{|Q|}\sum_{u=1}^{|Q|}\frac{1}{m}\sum_{k=1}^{K}rel(k)
$$


Where $|Q|$ is the number of target sequences, $m$ is the number of relevant items in a target sequence, $K$ is the rank limit, $P(k)$ is the precision at rank $k$, and $rel(k)$ is a function that equals 1 if the item at rank $k$ is relevant, and 0 otherwise.

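Equations 1 and 2 translate directly into code; a self-contained sketch (the toy example at the end is ours):

```python
def apk(predicted, relevant, k):
    """Inner sum of Equation 1 for one target sequence."""
    relevant = set(relevant)
    score, hits = 0.0, 0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank        # P(k) * rel(k)
    return score / min(len(relevant), k)

def ark(predicted, relevant, k):
    """Inner sum of Equation 2 for one target sequence."""
    relevant = set(relevant)
    return sum(1 for item in predicted[:k] if item in relevant) / len(relevant)

def map_at_k(all_predicted, all_relevant, k):
    """Equation 1: mean of average precision over all target sequences."""
    return sum(apk(p, r, k) for p, r in zip(all_predicted, all_relevant)) / len(all_relevant)

def mar_at_k(all_predicted, all_relevant, k):
    """Equation 2: mean of recall at k over all target sequences."""
    return sum(ark(p, r, k) for p, r in zip(all_predicted, all_relevant)) / len(all_relevant)

# Toy example: two target sequences of codes with ranked predictions.
preds = [["c1", "c2", "c3"], ["c4", "c5", "c6"]]
golds = [["c1", "c3"], ["c9"]]
print(map_at_k(preds, golds, k=3))  # (5/6 + 0) / 2 ≈ 0.4167
print(mar_at_k(preds, golds, k=3))  # (1.0 + 0) / 2 = 0.5
```

Note the min(m, K) normalization in MAP@K: a query with fewer than K relevant codes is not penalized for ranks it cannot possibly fill.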

4.2 Baselines


We compare our approach to state-of-the-art models:


• LIG-Doctor (Rodrigues-Jr et al., 2021): We use an embedding dimension and a hidden dimension equal to the size of the prediction label, which is 714. This model is followed by a linear layer that merges the bidirectional context with a second linear layer. A softmax layer is then applied. The model is trained for 100 epochs with a patience of 10 epochs (the model converges in 13 epochs), using a batch size of 512 and the Adadelta optimizer.



Figure 3: Architecture for integrating notes



• Doctor AI (Choi et al., 2016a): We adopted the recommended hyperparameters, with a hidden dimension and embedding dimension set to 2000. A dropout rate of 0.5 is applied. The model is trained over 20 epochs, with a batch size of 384, and uses the Adadelta optimizer.

• Clinical GAN (Shankar et al., 2023): The generator uses a 3-layer, 8-head encoder-decoder, with a hidden dimension of 256, and the discriminator uses a 1-layer, 4-head transformer encoder. The model is trained for 100 epochs with a batch size of 8 and converges after 11 epochs. Adam is used for the generator, and SGD for the discriminator, with a Noam scheduler.


4.3 Results


The obtained results are presented in Table 5. All models were evaluated using 5-fold cross-validation and 95% confidence intervals.


The first key observation is that injecting clinical note embeddings into the architecture significantly improves performance, particularly in terms of MAR@K (refer to Figure 6). However, we consider that this improvement could be hindered by the limited size of our dataset (37k samples), preventing the model from fully learning to exploit the representations of the injected embeddings. The strategy of averaging the embedding layers and visits (Mean) yields the lowest MAR@K among the embedding injection approaches. This may be due to excessive compression of information, leading to information loss. However, this method is the most computationally efficient, as it only adds one vector. This efficiency is important, particularly due to the $O(N^2)$ computational complexity of the transformer’s attention mechanism.


The concatenation method, which averages only the embedding layers, brings significant improvements in terms of MAP@K, as shown in Figure 5, and also outperforms all literature methods in terms of MAR@K, with low variance across different cross-validation splits. This can be justified by two main reasons: on one hand, the rich representation obtained from the notes, averaged over several layers, enhances the richness of information while preserving critical elements, unlike the Mean approach, which loses information. On the other hand, this approach also allows the model to be more selective in processing information, leveraging independent elements from different medical visits.

LIG-Doctor is designed as a classification task, in which a linear layer is used to predict subsequent diagnoses. This setup introduces two important distinctions in the evaluation. First, the model does not generate predictions in a specific order, making the direct calculation of metrics such as MAP@K impossible. To address this issue, we propose sorting the logits to establish a generation order. However, since the model was not trained with this information in mind, performance does not improve efficiently as K increases, as shown in Figure 5. Second, classification prevents repetitive predictions, improving MAR@K results. Figure 5 shows that LIG-Doctor benefits from increasing K.

连接方法仅对嵌入层进行平均处理,如图5所示,该方法在 $\mathrm{MAP}@K$ 指标上带来显著提升,同时在 $\mathrm{MAR}@K$ 指标上优于所有文献方法,且在不同交叉验证分割中方差较低。这一现象可归因于两个主要原因:一方面,通过对多层笔记信息进行平均处理,该方法在保留关键要素的同时增强了信息丰富度,这与均值方法导致信息丢失的特性形成鲜明对比;另一方面,该方法为每次就诊保留一个向量,使模型能够更有选择性地处理来自不同就诊记录的独立信息。LIG-Doctor 被设计为分类任务,通过线性层预测后续诊断。这一设计在评估中引入两个重要差异:首先,模型不会按特定顺序生成预测,导致无法直接计算 $\mathrm{MAP}@K$ 等指标。为此我们通过对 logits 排序建立生成顺序,但由于模型训练时未考虑该目标,如图5所示,性能并未随 K 值增加而有效提升;其次,分类机制避免了重复预测,从而改善了 $\mathrm{MAR}@K$ 结果。图5显示 LIG-Doctor 的性能随 K 值增加而提升。
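The logit-sorting workaround described above can be sketched as follows. This is an illustrative NumPy snippet with hypothetical values, not LIG-Doctor's actual code: a ranked prediction order is imposed on an orderless multi-label classifier by sorting its output logits in descending order.

```python
import numpy as np

def ranked_predictions(logits: np.ndarray, k: int) -> np.ndarray:
    """Impose a generation order on a classifier's orderless output by
    sorting logits in descending order and keeping the top-k code indices,
    so that rank-sensitive metrics such as MAP@K become computable."""
    return np.argsort(-logits)[:k]

# Hypothetical logits over a 5-code vocabulary.
logits = np.array([0.1, 2.3, -0.5, 1.7, 0.9])
print(ranked_predictions(logits, k=3))  # [1 3 4]
```

Because the classifier was never trained to produce this ranking, the resulting order carries no learned sequential signal, which is consistent with the flat MAP@K curve observed for LIG-Doctor as K grows.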


Figure 4: Approach using a projection layer

图 4: 使用投影层的方法

Table 5: Performance of different models using $\mathrm{MAP}@k$ and $\mathrm{MAR}@k$. Values are presented as mean(standard deviation in the last decimal place); for example, 0.425(5) represents $0.425\pm0.005$. ¹(Shankar et al., 2023), ²(Rodrigues-Jr et al., 2021), ³(Choi et al., 2016a)

| 模型 | MAR (K = 20) | MAP (K = 40) | MAR (K = 60) |
| --- | --- | --- | --- |
| Projection | 0.425(5) | 0.556(21) | 0.439(4) |
| Concat | 0.420(6) | 0.569(6) | 0.425(5) |
| Mean | 0.416(6) | 0.538(84) | 0.423(6) |
| Clinical GAN¹ | 0.410(5) | 0.558(11) | 0.414(5) |
| Transformer Only | 0.398(23) | 0.565(23) | 0.405(25) |
| LIG-Doctor² | 0.267(48) | 0.474(94) | 0.361(42) |
| Doctor AI³ | 0.233(5) | 0.206(46) | 0.233(5) |

1(Shankar et al., 2023), 2(Rodrigues-Jr et al., 2021), 3(Choi et al., 2016a) 注: 数值以均值(标准差)表示。例如, 0.425(5) 表示 $0.425{\scriptstyle\pm0.005}$。表 5: 不同模型使用 $\mathbf{MAP@k}$ 和 MAR $\ @\mathbf{k}$ 的性能表现。数值以均值(最后一位小数标准差)表示。

The results of Doctor AI are lower than those of the other models, as it relies on a single GRU layer. Its performance appears to suffer from the increase in the dimension of the prediction space; improvements could be made by increasing the hidden dimensions and the number of GRU layers. Clinical GAN shows good results in terms of $\mathrm{MAP}@K$ but struggles to generate a broader set of relevant predictions, as indicated by its lower $\mathrm{MAR}@K$ scores (see Figure 6). This model also exhibits instability during training, a frequent issue with GAN-based architectures that limits their scalability. This limitation could not be fully explored due to the small size of the dataset used, leaving the question open for future research.

Doctor AI 的结果低于其他模型,因为该模型仅依赖单一的 GRU 层。其性能似乎受到预测空间维度增加的影响,可以通过增加隐藏维度和 GRU 层数来改进。Clinical GAN 在 $\mathrm{MAP}@K$ 方面表现良好,但难以生成更广泛的相关预测,其较低的 $\mathrm{MAR}@K$ 分数也表明了这一点(参见图 6)。该模型在训练过程中也表现出不稳定性,这是基于 GAN 架构的常见问题,限制了其可扩展性。由于所用数据集规模较小,这一限制未能充分探究,有待未来研究解决。


Figure 5: Mean average precision @20, @40, and @60 for different models

图 5: 不同模型在 @20、@40 和 @60 下的平均精度均值


Figure 6: Mean average recall @20, @40, and @60 for different models

图 6: 不同模型在 @20、@40 和 @60 下的平均召回率均值

5 Conclusion

5 结论

In this study, we addressed the challenge of predicting patient trajectories by leveraging complementary information from clinical notes to enhance predictive accuracy. Our approach integrates clinical note embeddings into transformer models to forecast patient disease trajectories based on their electronic medical records (EMRs). By combining structured medical data with rich, unstructured information from clinical notes, this method offers a more comprehensive view, potentially leading to more accurate predictions of patient outcomes.

本研究通过利用临床记录中的互补信息来提高预测准确性,解决了预测患者病程轨迹的难题。我们的方法将临床记录嵌入(embeddings)整合到Transformer模型中,基于患者的电子病历(EMRs)预测疾病发展轨迹。该方法通过结合结构化医疗数据与临床记录中的非结构化丰富信息,提供了更全面的视角,有望实现更精准的患者预后预测。

Our experimental results on MIMIC-IV datasets showed that the proposed approach significantly outperforms traditional models that rely solely on structured codes. These findings highlight the considerable potential of utilizing unstructured medical information to improve predictive modeling in healthcare, with the possibility of transforming patient care and resource allocation.

我们在MIMIC-IV数据集上的实验结果表明,所提出的方法显著优于仅依赖结构化编码的传统模型。这些发现凸显了利用非结构化医疗信息改进医疗健康预测建模的巨大潜力,有望改变患者护理和资源分配方式。

For future work, we plan to investigate strategies for incrementally updating embeddings without disrupting the overall pipeline. A key challenge is ensuring that newly generated embeddings remain compatible with previously learned representations while minimizing computational overhead. We will explore continual learning techniques and efficient adaptation mechanisms to maintain model stability and prevent catastrophic forgetting.

在未来的工作中,我们计划研究如何在不干扰整体流程的情况下增量更新嵌入向量。一个关键挑战是确保新生成的嵌入向量与先前学习的表征保持兼容,同时最小化计算开销。我们将探索持续学习技术和高效适应机制,以维持模型稳定性并防止灾难性遗忘。

Additionally, we aim to develop a more automated framework that integrates medical coding with predictive modeling, creating a seamless end-to-end system for clinical decision support. This involves leveraging structured (e.g., CCS codes) and unstructured (e.g., clinical notes) data within a unified architecture, reducing manual intervention in feature extraction and improving interpretability. A fully automated pipeline could enable real-time adaptation to evolving medical knowledge, enhancing predictive accuracy and clinical utility.

此外,我们致力于开发一个更自动化的框架,将医疗编码与预测建模相结合,构建无缝衔接的端到端临床决策支持系统。该框架需在统一架构中整合结构化数据(如CCS代码)和非结构化数据(如临床记录),减少特征提取中的人工干预并提升可解释性。全自动化流程有望实时适应不断演进的医学知识,从而提升预测准确性与临床实用性。

ACKNOWLEDGEMENTS

致谢

The project leading to this publication has received funding from the Excellence Initiative of Aix Marseille Université - A*Midex, a French “Investissements d’Avenir programme” AMX-21-IET-017.

本出版项目由法国"未来投资计划"项目AMX-21-IET-017资助,该资金来自艾克斯-马赛大学卓越计划(Aix Marseille Université - A*Midex)的卓越倡议。

We thank LIS (Laboratoire d'Informatique et Systèmes), Aix-Marseille University, for providing the GPU resources necessary for pre-training and for conducting extensive experiments. Additionally, we acknowledge CEDRE (CEntre de formation et de soutien aux Données de la REcherche), Programme 2 of the France 2030 IDeAL project, for supporting early-stage experiments and hosting part of the computational infrastructure.

我们感谢艾克斯-马赛大学信息与系统实验室(LIS, Laboratoire d'Informatique et Systèmes)为预训练和大规模实验提供必要的 GPU 资源。同时,我们感谢法国 2030 计划 IDeAL 项目第二子计划下的 CEDRE(研究数据培训与支持中心, CEntre de formation et de soutien aux Données de la REcherche)对早期实验的支持以及对部分计算基础设施的托管。
