Multimodal LLMs for health grounded in individual-specific data
Anastasiya Belyaeva$^{1}$∗, Justin Cosentino$^{1}$∗, Farhad Hormozdiari$^{2}$, Krish Eswaran$^{1}$, Shravya Shetty$^{1}$, Greg Corrado$^{1}$, Andrew Carroll$^{1}$, Cory Y. McLean$^{2\dagger}$, and Nicholas A. Furlotte$^{1\dagger}$
$^{1}$ Google Research, San Francisco, CA 94105, USA; $^{2}$ Google Research, Cambridge, MA 02142, USA. nickfurlotte@google.com
Abstract. Foundation large language models (LLMs) have shown an impressive ability to solve tasks across a wide range of fields including health. To effectively solve personalized health tasks, LLMs need the ability to ingest a diversity of data modalities that are relevant to an individual's health status. In this paper, we take a step towards creating multimodal LLMs for health that are grounded in individual-specific data by developing a framework (HeLM: Health Large Language Model for Multimodal Understanding) that enables LLMs to use high-dimensional clinical modalities to estimate underlying disease risk. HeLM encodes complex data modalities by learning an encoder that maps them into the LLM's token embedding space, and simple modalities like tabular data by serializing them into text. Using data from the UK Biobank, we show that HeLM can effectively use demographic and clinical features in addition to high-dimensional time-series data to estimate disease risk. For example, HeLM achieves an AUROC of 0.75 for asthma prediction when combining tabular and spirogram data modalities compared with 0.49 when only using tabular data. Overall, we find that HeLM outperforms or performs at parity with classical machine learning approaches across a selection of eight binary traits. Furthermore, we investigate the downstream uses of this model such as its generalizability to out-of-distribution traits and its ability to power conversations around individual health and wellness.
Keywords: Multimodal Large Language Models · Health · UK Biobank.
1 Introduction
Foundation large language models (LLMs) have been shown to solve a range of natural language processing (NLP) tasks without having been explicitly trained to do so [4,36]. As a result, researchers are adapting LLMs to solve a variety of non-traditional NLP problems across domains. A recent perspective [23] outlined a variety of health-related use cases that could benefit from foundation LLMs that have not only generalist medical knowledge but that are also infused with individual-specific information such as lab values (e.g., cholesterol and triglycerides), imaging, time-series data, health tracker metrics (e.g., daily step count and heart rate), genome sequences, genetic risk scores, and other omics data modalities. These use cases range from AI clinician assistants to AI-powered early warning systems to user-facing health and wellness chatbots.
Fig. 1: Overview of HeLM, a multimodal LLM for health. Text features (orange) are tokenized and embedded into the token embedding space via a standard embedding matrix. Non-text modalities such as clinical data (blue) or high-dimensional lung function measures (green) are encoded into the same token embedding space via modality-specific encoders. The LLM is tuned to quantify disease risk given complex multimodal inputs.
While the potential applications for foundation LLMs in health are wide-ranging, at the core of each there is a fundamental need for the model to ingest complex multimodal individual-specific data and use it to gain an understanding of the individual's underlying health risks. The model can then condition responses to queries on the derived risk profile of an individual. Though there has been promising recent work in developing generalist medical LLMs [38,30,31,32], the problem of using multimodal individual-specific information as context for health-related tasks remains understudied. More broadly, this capability represents one aspect of the general movement towards personalization of LLMs [18,28], which encompasses not only the technical challenges of data integration, but also the complex ethical questions around how the model can and should be used.
Providing relevant health information to an LLM could be as simple as including important disease risk factors, such as lab values, in the prompt by representing these factors as text [16,10]. Furthermore, in-context learning techniques such as few-shot prompting [4] can be employed by giving the model examples that help it to connect risk factors to underlying disease. However, this solution isn’t likely to work for the many complex data modalities found in the health space [1]. For example, it isn’t clear how to incorporate images, time-series data, or even rich structured tabular data into the LLM prompt. Furthermore, health factors may not be clearly captured by a single modality but rather may be best predicted by simultaneously incorporating features drawn from multiple modalities.
Recently, a variety of methods have been introduced to extend LLMs to the multimodal setting. The majority of these methods have focused on images and text [25,21,2,17,39,24] with some models adding the ability to incorporate diverse modalities, such as the movements of robots, video, and audio [11,12,22]. However, these nascent methods have not yet been applied to the health domain. On the other hand, there are a variety of classical machine learning methods for integrating multiple data modalities that are routinely applied in the health domain [1]. Examples include logistic regression classifiers that take multiple modalities as input, various fusion models [19], autoencoder-based models [37], and cross-supervision models [40]. However, these traditional approaches lack several potential advantages when compared to LLM-based methods.
First, foundation LLMs may have encoded extensive prior knowledge about health-related traits. For example, an LLM likely understands that hypertension is the same as high blood pressure and may even know something about which number ranges correspond to normal or high values. This prior knowledge can be useful when dealing with heterogeneous data or with data that only has fuzzy labels. Secondly, LLMs can incorporate additional prior knowledge through prompt engineering, whereas in traditional ML methods including priors can be cumbersome. Thirdly, LLMs may have a high degree of flexibility in working with missing data. Whereas traditional methods require imputation of missing values or dropping samples, LLMs can be prompted to ignore missing values or they can be omitted completely without architectural changes. Finally, many foundation LLMs are conversational by design, so they can more naturally be used for applications such as those mentioned previously: user-facing health and wellness chatbots.
In this paper, we take a step towards multimodal LLMs for health that are grounded in individual-specific data by developing a framework called HeLM that enables LLMs to use health-related features to estimate underlying disease risk. The idea is that an LLM with a representation of background risk can use this as context in answering health-related queries. We formulate a disease classification task and then use the LLM, Flan-PaLMChilla 62b [7], to score potential outcomes. We compare the LLM score with classic supervised machine learning methods such as logistic regression and gradient-boosted decision trees by evaluating their ability to distinguish between individuals with and without the disease.
Using data from the UK Biobank [5], we evaluate different ways of prompting and encoding data and passing it to the LLM to classify disease status. First, we serialize health-related features into text similar to previously proposed approaches [16,10]. We evaluate zero-shot, few-shot, and a parameter-efficient soft-prompt tuning approach [20]. Generally, we find that the LLM performs better than random in the zero-shot and few-shot cases, and has comparable and often equivalent performance to standard approaches such as logistic regression and gradient-boosted decision trees (implemented in XGBoost [6]) after soft-prompt tuning. Additionally, for some diseases (e.g., diabetes), the zero-shot and few-shot approaches perform surprisingly well when compared with logistic regression and XGBoost, giving evidence to the model's use of prior knowledge related to disease risk. However, we observe that performance degrades with an increase in the number of input features, indicating that the model does not always fully capture signal from serialized text.
To better capture signal from quantitative data, we propose HeLM, a multimodal approach to disease risk estimation based on the PaLM-E framework [11]. In HeLM, non-text modalities are mapped via encoders (trained over input examples) into the same token embedding space as text, where a separate encoder is learned for each modality (see Figure 1). We show that HeLM matches the performance of classical machine learning methods (logistic regression and XGBoost) and, in some cases, outperforms both the logistic regression and XGBoost models. We show results for encoding tabular data and spirogram curves, a time-series representation of lung function used for respiratory disease diagnosis [33,8]. Finally, we explore how a HeLM model that has been tuned to quantify disease risk can be used in downstream conversational tasks. Here, we find that conversational ability degrades in the tuned LLM, which is consistent with what others have reported [35]. We discuss future directions such as fine-tuning schemes that may mitigate degradation.
2 Methods
2.1 LLMs with Tabular Data
An individual's health status is often represented by a set of values stored in tabular format. We explore whether serializing tabular data into natural language and passing the resulting text to the LLM (Flan-PaLMChilla 62b [7]) is sufficient to achieve strong predictive performance. We construct serialized inputs by mapping table column names and corresponding values into JSON-like format, which is combined with the base prompt. For example, for diabetes prediction, we formulate the following sentence using the data: “Predict if a patient has the condition or not. bmi: {28.1}. age: {67.0}. sex: {male}.” We then compute the log-likelihood of the sentence being completed with “{yes}” or “{no}”. This log-likelihood serves as a risk score and can be evaluated using metrics such as AUROC and AUPRC to assess discriminatory power.
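A minimal sketch of this serialization and scoring, assuming a hypothetical `score_completion(prompt, completion)` helper that returns the frozen LLM's log-likelihood of a completion given a prompt:

```python
def serialize_example(features: dict) -> str:
    """Render tabular features into the JSON-like prompt format described above."""
    parts = [f"{name}: {{{value}}}." for name, value in features.items()]
    return "Predict if a patient has the condition or not. " + " ".join(parts)

def risk_score(prompt: str, score_completion) -> float:
    """Risk score: log-likelihood that the prompt is completed with "{yes}".

    `score_completion` is a hypothetical stand-in for the LLM scoring API,
    returning log p(completion | prompt) under the frozen model.
    """
    return score_completion(prompt, "{yes}")

prompt = serialize_example({"bmi": 28.1, "age": 67.0, "sex": "male"})
# -> "Predict if a patient has the condition or not. bmi: {28.1}. age: {67.0}. sex: {male}."
```

Only the ranking of these scores across individuals matters for AUROC and AUPRC, so no probability calibration is required.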
We evaluate the LLM's performance in the zero-shot, few-shot, and soft-prompt tuning settings. In the zero-shot setting, the serialized text input is given directly to the model. In this case, the LLM heavily leverages prior knowledge. In the few-shot scenario, we prefix 10 examples randomly sampled from the training dataset to the model's prompt. For soft-prompt tuning, following [20], we learn a soft prompt to condition the frozen language model to perform well on the diseases of interest.
Briefly, soft-prompt tuning is a parameter-efficient tuning technique that is commonly used to steer a frozen LLM to perform well on a downstream task given labeled data. Instead of fine-tuning the weights of the LLM or forming a hard prompt out of tokens as in the few-shot scenario, soft-prompt tuning learns a matrix $P\in\mathbb{R}^{p\times k}$ in the token embedding space, where $p$ is the length of the prompt and $k$ is the size of the language embedding space. The soft prompt $P$ is then concatenated with the embedded tokens and passed to the decoder. We train the soft prompt using pairs of examples $(X,Y)$ and backpropagation, updating the soft prompt with the goal of maximizing the probability of $Y$. We train for 20,000 steps and use 1,000 training examples.
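A minimal sketch of how the learned matrix $P$ is prepended to the embedded tokens, with illustrative dimensions and a random stand-in for the frozen embedding matrix; during tuning, only $P$ would be updated:

```python
import numpy as np

p, k, V = 20, 1024, 32_000          # prompt length, embedding size, vocab size (illustrative)
rng = np.random.default_rng(0)

P = rng.normal(scale=0.02, size=(p, k)).astype(np.float32)      # learnable soft prompt
gamma = rng.normal(scale=0.02, size=(V, k)).astype(np.float32)  # frozen token embeddings

def build_decoder_input(token_ids: np.ndarray) -> np.ndarray:
    """Concatenate the soft prompt with the embedded tokens before the decoder."""
    embedded = gamma[token_ids]                   # (seq_len, k)
    return np.concatenate([P, embedded], axis=0)  # (p + seq_len, k)

x = build_decoder_input(np.array([17, 408, 2095]))
# Training backpropagates -log p(Y | X) through the frozen decoder and updates only P.
```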
2.2 Multimodal LLMs for health: HeLM
To enable the LLM to reason over complex high-dimensional inputs, we embed non-text data modalities, including time-series data like spirograms and tabular data, into the same latent space as the language tokens (see Figure 1). We use separate encoders for each non-text data modality, where each encoder learns a mapping to the token embedding space. This approach is based on the PaLM-E framework [11]. All of the inputs are then processed together by a pre-trained LLM. More precisely, LLMs typically tokenize the words in a sentence and map the resulting tokens $w_{i}$ into a language embedding space $\mathcal{X}\subset\mathbb{R}^{k}$ via a large embedding matrix, $x_{i}=\gamma(w_{i})$ where $\gamma:\mathcal{W}\to\mathcal{X}$. In HeLM, non-text modalities are mapped to a sequence of vectors in $\mathcal{X}$ via encoders. For example, to map the spirogram time series we trained an encoder $\phi_{s}:\mathcal{S}\rightarrow\mathcal{X}^{q_{s}}$, and to map tabular data we trained an encoder $\phi_{t}:\mathcal{T}\to\mathcal{X}^{q_{t}}$, where $q_{s}$ and $q_{t}$ correspond to the number of vectors in $\mathcal{X}$ space, or in other words how many “multimodal” tokens each modality is mapped to. We set $q_{s}=6$ and $q_{t}=6$ for all experiments; in general, however, these can be treated as tunable hyper-parameters. In summary, each vector $x_{i}$ is formed from either the token embedder $\gamma$ or a modality-specific encoder.
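Concretely, following the PaLM-E formulation, this mapping can be sketched as follows (the piecewise notation is our reconstruction rather than a quotation of the original equation):

$$
x_{i} =
\begin{cases}
\gamma(w_{i}) & \text{if } i \text{ corresponds to a text token } w_{i},\\
\phi_{m}(o_{m})_{j} & \text{if } i \text{ is the } j\text{-th multimodal token for modality } m,
\end{cases}
$$

where $o_{m}$ denotes the raw input for modality $m$ (e.g., the spirogram for $\phi_{s}$ or the tabular feature vector for $\phi_{t}$).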
Non-text and text modalities can be injected in any order. We train the modality-specific encoders keeping the pre-trained LLM weights frozen. This is similar to soft-prompt tuning but conditioned on data from a specific modality. For experiments considering multiple diseases we train a single HeLM model on a mixture of all diseases as opposed to one model per disease.
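A sketch of how text embeddings and encoder outputs can be interleaved before the frozen decoder; the linear stand-in encoders, dimensions, and token IDs are illustrative, whereas the actual encoders are learned networks trained by backpropagation through the frozen decoder:

```python
import numpy as np

k, q_s, q_t, V = 1024, 6, 6, 32_000   # embedding size, multimodal token counts, vocab size
rng = np.random.default_rng(0)
gamma = rng.normal(scale=0.02, size=(V, k)).astype(np.float32)  # frozen token embedder

def phi_t(tabular: np.ndarray) -> np.ndarray:
    """Stand-in tabular encoder: maps the feature vector to q_t vectors in token space."""
    W = rng.normal(scale=0.02, size=(tabular.size, q_t * k)).astype(np.float32)
    return (tabular @ W).reshape(q_t, k)

def phi_s(spirogram: np.ndarray) -> np.ndarray:
    """Stand-in spirogram encoder: maps the length-1,000 curve to q_s vectors."""
    W = rng.normal(scale=0.02, size=(spirogram.size, q_s * k)).astype(np.float32)
    return (spirogram @ W).reshape(q_s, k)

# Text and non-text embeddings can be interleaved in any order before the frozen decoder.
sequence = np.concatenate([
    gamma[[17, 408]],          # text tokens, e.g. part of the base prompt
    phi_t(rng.random(14)),     # tabular features -> q_t "multimodal" tokens
    phi_s(rng.random(1000)),   # spirogram -> q_s "multimodal" tokens
    gamma[[2095]],             # more text tokens
], axis=0)                     # shape: (2 + q_t + q_s + 1, k)
```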
2.3 UK Biobank Dataset Preparation
We obtain clinical features, spirometry, and disease labels from the UK Biobank (UKB) [5]. Similar to previous work, we focus on the European genetic ancestry subpopulation [3]. Limiting to European ancestry within the UK Biobank is a standard heuristic for reducing phenotypic heterogeneity, due to the correlation between population structure and phenotypic variation [9]. Differences in phenotypes across ancestries are multifactorial—socio-economic, cultural, etc.—but are often highly correlated with genetic background and thus population structure. Therefore, selecting study individuals based on a single genetic background is a convenient way to reduce heterogeneity in underlying disease risk, at the expense of creating bias in the dataset and subsequent analyses that utilize this data.
We defined binary phenotype labels using medical records comprising ICD-9 hospital inpatient (HESIN) billing codes, ICD-10 primary care and general practitioner (GP) read codes, and self-report data. Diseases include asthma, diabetes, hypertension, stroke, myocardial infarction, major depression, migraine, all cause mortality, cataracts, gastroesophageal reflux disease (GERD), hay fever/eczema (atopic eczema), osteoarthritis, and pneumonia. The following clinical features, lab values, and self-reported statuses sourced from questionnaires were used as model inputs: age, sex, body mass index (BMI), high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, total cholesterol, triglycerides, diastolic blood pressure, smoking status, snoring, insomnia, daytime napping, average nightly sleeping, and chronotype. See Table S1 for details describing feature definitions.
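A hypothetical sketch of how such binary labels could be assembled from code tables and self-report data; the table layouts, column names, and code prefixes below are invented for illustration and are not the UK Biobank field definitions:

```python
import pandas as pd

# Illustrative long-format sources of (participant ID, code) pairs.
hesin_icd = pd.DataFrame({"eid": [1, 2], "icd10": ["J45.9", "I10"]})
self_report = pd.DataFrame({"eid": [2, 3], "condition": ["hypertension", "asthma"]})

ASTHMA_ICD10_PREFIXES = ("J45",)  # example prefix set

def asthma_label(eids: pd.Index) -> pd.Series:
    """1 if any source (billing codes or self-report) indicates asthma, else 0."""
    from_icd = hesin_icd.loc[hesin_icd["icd10"].str.startswith(ASTHMA_ICD10_PREFIXES), "eid"]
    from_self = self_report.loc[self_report["condition"] == "asthma", "eid"]
    cases = set(from_icd) | set(from_self)
    return pd.Series([int(e in cases) for e in eids], index=eids)

labels = asthma_label(pd.Index([1, 2, 3], name="eid"))  # -> 1, 0, 1
```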
Spirometry data was prepared following the preprocessing procedures outlined in [8]. In short, we converted raw volumetric flow curves containing exhalation volume sampled at 10-ms intervals to liters and then computed the corresponding flow curve by approximating the first derivative with respect to time using a finite difference. The volume-time and flow-time curves were then normalized to length 1,000 and combined to generate a one-dimensional flow-volume spirogram. Following well-accepted spirometry quality control standards, we filter the dataset to individuals with at least one acceptable blow from the first visit [27,29,8].
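A NumPy sketch of this preprocessing; the interpolation scheme and the way flow is resampled onto a volume grid to form the flow-volume curve are assumptions rather than the reference implementation of [8]:

```python
import numpy as np

def preprocess_spirogram(volume_ml: np.ndarray, dt: float = 0.01) -> np.ndarray:
    """Build a one-dimensional flow-volume spirogram from an exhalation volume trace.

    `volume_ml` is exhalation volume in milliliters sampled every 10 ms.
    """
    volume = volume_ml / 1000.0              # convert to liters
    flow = np.gradient(volume, dt)           # finite-difference first derivative (L/s)

    # Normalize both curves to length 1,000.
    t = np.linspace(0.0, 1.0, volume.size)
    grid = np.linspace(0.0, 1.0, 1000)
    volume_n = np.interp(grid, t, volume)
    flow_n = np.interp(grid, t, flow)

    # Combine into flow as a function of volume on an evenly spaced volume grid.
    v_grid = np.linspace(volume_n.min(), volume_n.max(), 1000)
    return np.interp(v_grid, volume_n, flow_n)

spirogram = preprocess_spirogram(np.cumsum(np.full(600, 8.0)))  # synthetic monotone trace
```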
We randomly partitioned patients with valid data entries for all phenotype labels and clinical features into distinct training and validation datasets.
3 Experimental Results
We present LLM risk prediction performance across eight disease classification tasks under varied experimental settings. We begin by assessing the effectiveness of zero-shot and few-shot prompting as well as soft-prompt tuning on a pre-trained foundation model using health data serialized into text. We then demonstrate that directly mapping quantitative features into the LLM’s latent token space significantly improves model performance. Finally, using asthma as a case study, we show that this mapping procedure generalizes to high-dimensional lung function data.
3.1 Quantifying disease risk using zero-shot, few-shot, and soft-prompt tuning
We first establish a baseline for LLM disease risk prediction using zero-shot, few-shot, and soft-prompt tuning with a frozen Flan-PaLMChilla 62b model [7]. We define classification tasks using a diverse set of binary phenotype targets (see Section 2.3). For each task and prompting method, we evaluate a “baseline” set of model inputs consisting of age, sex, and BMI as well as an “expanded” set that includes eleven additional clinical and wellness features: HDL cholesterol, LDL cholesterol, total cholesterol, total triglycerides, diastolic blood pressure, smoking status, average sleep duration, insomnia, snoring, daytime napping, and chronotype (i.e., whether a patient is a morning or evening person). These predictors were chosen based on prior knowledge that they should be informative about the selected targets.
We score a validation dataset ($n=3,000$) using the methodology outlined in Section 2. For each validation sample, we generate a disease risk score by computing the log probability of a positive disease label. Similarly, to obtain the logistic regression and XGBoost scores, we compute the probability of a positive disease label given the respective model fit on a separate training set ($n=10,000$).
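A sketch of the baseline scoring pipeline with synthetic arrays standing in for the UK Biobank features and labels; the LLM score itself comes from the prompting procedure in Section 2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(10_000, 14)), rng.integers(0, 2, 10_000)
X_val, y_val = rng.normal(size=(3_000, 14)), rng.integers(0, 2, 3_000)

scores = {}
for name, model in {
    "logreg": LogisticRegression(max_iter=1000),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}.items():
    model.fit(X_train, y_train)
    scores[name] = model.predict_proba(X_val)[:, 1]  # probability of a positive label

for name, s in scores.items():
    print(name, roc_auc_score(y_val, s), average_precision_score(y_val, s))
```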
Table 1 shows performance for each prompting method and input set compared to the baseline models. We observe that at least one prompting technique is competitive with the baselines for most tasks. In some cases (e.g., hypertension and diabetes), zero-shot and few-shot models perform surprisingly well despite seeing little to no training data compared to the baselines, an observation also made by [16]. This suggests that the LLM uses prior knowledge about the relationships between age, sex, BMI and disease likelihood.
Focusing on the baseline feature set (age, sex, and BMI), we aimed to understand how the LLM derives scores. We estimated the importance of each input feature by regressing the features against the scores output by the model. Coefficients from the linear regression model are used to measure feature importance and concordance across methods. Figure S1 shows the result of this analysis for four traits. For diabetes, hypertension, and stroke, we see concordance between logistic regression, XGBoost, and the LLM models in terms of direction and relative magnitude of effects. Additionally, we find a strong correlation between logistic regression and LLM scores (Spearman = 0.46–0.93 across prompting methods), while the correlation between LLM and XGBoost scores is weaker (Spearman = 0.39–0.65). On the other hand, for migraine, we see little concordance in direction and relative magnitude of effects for zero-shot and few-shot LLMs, indicating that the LLM doesn't have sufficient prior knowledge to relate migraine with the input features. However, this is corrected in the soft-prompt case, where we see concordance and high Spearman correlation between the soft-prompt tuned LLM and logistic regression (0.85). On the other hand, the soft-prompt tuned LLM has low Spearman correlation with XGBoost (0.31), which is a non-linear model. Given this, we hypothesize that the LLM is effectively scoring outcomes using what translates to a simple linear function of the input features and that this linear mapping is effectively learned via soft-prompting.
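A sketch of this analysis, with synthetic scores standing in for the model outputs:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_val = rng.normal(size=(3_000, 3))  # standardized age, sex, BMI (illustrative)
llm_scores = X_val @ np.array([0.5, 0.2, 0.8]) + rng.normal(scale=0.1, size=3_000)
logreg_scores = X_val @ np.array([0.6, 0.25, 0.7])

# Feature importance: regress the input features against each model's scores and
# compare the sign and relative magnitude of the coefficients across methods.
importance = LinearRegression().fit(X_val, llm_scores).coef_

# Concordance between scoring methods.
rho, _ = spearmanr(llm_scores, logreg_scores)
```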
Table 1: Comparison of AUC and AUPRC between LLM-based classifiers and classical machine learning approaches on the validation set. The models with “baseline” input features use age, sex, and BMI as model features, while the models with the “expanded” set also include 11 additional clinical and wellness features. The mean AUC/AUPRC and the corresponding 95% confidence intervals were calculated across 1,000 bootstrapping iterations. Bold cells denote the best models for a given phenotype and input feature set, where statistical significance is determined via paired bootstrapping. Logistic regression and XGBoost models were trained on 10,000 samples, few-shot on 10 samples, and soft-prompt tuning on 1,000 samples.
| Phenotype | Model | Baseline AUC | Expanded AUC | Baseline AUPRC | Expanded AUPRC |
|---|---|---|---|---|---|
| All Cause Mortality (prevalence = 6.83%) | Zero-shot | 0.61 (0.57-0.64) | 0.61 (0.57-0.65) | 0.10 (0.08-0.12) | 0.11 (0.09-0.14) |
| | Few-shot | 0.65 (0.61-0.69) | 0.65 (0.61-0.69) | 0.12 (0.10-0.15) | 0.12 (0.09-0.15) |
| | Soft-prompt | | | | 0.14 (0.11-0.18) |
| | LogReg | | | | 0.16 (0.13-0.21) |
| | XGBoost | 0.67 (0.62-0.70) | 0.67 (0.62-0.70) | 0.13 (0.10-0.17) | 0.13 (0.10-0.17) |
| Diabetes (prevalence = 7.60%) | Zero-shot | 0.70 (0.67-0.73) | 0.61 (0.57-0.65) | 0.18 (0.14-0.23) | 0.12 (0.10-0.15) |
| | Few-shot | 0.72 (0.69-0.76) | 0.67 (0.64-0.70) | 0.19 (0.15-0.24) | 0.14 (0.11-0.17) |
| | Soft-prompt | 0.72 (0.69-0.76) | 0.68 (0.64-0.72) | 0.23 (0.18-0.28) | 0.17 (0.14-0.22) |
| | LogReg | | | | 0.26 (0.21-0.32) |
| | XGBoost | 0.73 (0.69-0.76) | 0.73 (0.69-0.76) | 0.22 (0.17-0.26) | 0.22 (0.17-0.26) |
| Hypertension (prevalence = 40.03%) | Zero-shot | 0.70 (0.68-0.72) | 0.68 (0.66-0.70) | 0.60 (0.56-0.62) | 0.57 (0.54-0.60) |
| | Few-shot | 0.73 (0.71-0.75) | 0.72 (0.70-0.73) | 0.62 (0.59-0.64) | 0.59 (0.56-0.62) |
| | Soft-prompt | 0.72 (0.70-0.74) | 0.72 (0.70-0.74) | 0.60 (0.57-0.63) | 0.59 (0.56-0.62) |
| | LogReg | 0.74 (0.72-0.76) | 0.74 (0.72-0.76) | 0.63 (0.60-0.66) | 0.65 (0.62-0.68) |
| | XGBoost | 0.77 (0.75-0.78) | 0.77 (0.75-0.78) | 0.66 (0.63-0.69) | 0.66 (0.63-0.69) |
| Major Depression (prevalence = 13.03%) | Zero-shot | 0.54 (0.51-0.57) | 0.58 (0.55-0.61) | 0.15 (0.13-0.17) | 0.17 (0.14-0.19) |
| | Few-shot | 0.50 (0.47-0.53) | 0.55 (0.52-0.58) | 0.14 (0.12-0.15) | 0.15 (0.13-0.17) |
| | Soft-prompt | 0.60 (0.57-0.63) | 0.47 (0.44-0.50) | 0.20 (0.17-0.23) | 0.12 (0.11-0.14) |
| | LogReg | | | | 0.19 (0.17-0.22) |
| | XGBoost | 0.54 (0.52-0.57) | 0.54 (0.52-0.57) | 0.16 (0.14-0.18) | 0.16 (0.14-0.18) |
| Migraine (prevalence = 3.77%) | Zero-shot | 0.51 (0.45-0.56) | 0.49 (0.45-0.55) | 0.05 (0.03-0.07) | 0.04 (0.03-0.05) |
| | Few-shot | 0.45 (0.40-0.50) | 0.47 (0.42-0.53) | 0.03 (0.03-0.04) | 0.03 (0.03-0.04) |
| | Soft-prompt | 0.64 (0.59-0.69) | 0.51 (0.46-0.56) | 0.07 (0.05-0.09) | 0.04 (0.03-0.06) |
| | LogReg | | | | 0.06 (0.04-0.08) |
| | XGBoost | | | | 0.06 (0.04-0.07) |
| Myocardial Infarction (prevalence = 5.93%) | Zero-shot | 0.61 (0.57-0.65) | 0.61 (0.56-0.65) | 0.08 (0.06-0.10) | 0.09 (0.07-0.12) |
| | Few-shot | 0.67 (0.63-0.71) | 0.65 (0.61-0.68) | 0.11 (0.09-0.14) | 0.10 (0.08-0.12) |
| | Soft-prompt | | | | 0.11 (0.09-0.14) |
| | LogReg | | | | 0.16 (0.12-0.21) |
| | XGBoost | | | | 0.14 (0.11-0.19) |
| Stroke (prevalence = 4.00%) | Zero-shot | 0.61 (0.56-0.66) | 0.55 (0.49-0.60) | 0.06 (0.05-0.09) | 0.05 (0.04-0.06) |
| | Few-shot | 0.65 (0.60-0.69) | 0.60 (0.56-0.65) | 0.07 (0.05-0.09) | 0.05 (0.04-0.07) |
| | Soft-prompt | 0.63 (0.58-0.68) | 0.50 (0.45-0.55) | 0.09 (0.06-0.12) | 0.05 (0.03-0.06) |
| | LogReg | | | | 0.14 (0.09-0.19) |
| | XGBoost | 0.68 (0.63-0.72) | 0.68 (0.63-0.72) | 0.09 (0.06-0.12) | 0.09 (0.06-0.12) |
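The confidence intervals in Table 1 come from 1,000 bootstrap iterations; a minimal percentile-bootstrap sketch for AUROC (a paired test between two models would reuse the same resampled indices for both score vectors):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot: int = 1000, seed: int = 0):
    """Mean AUROC and percentile 95% CI from resampling validation examples."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples containing a single class
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(np.mean(stats)), (float(lo), float(hi))
```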
In the expanded feature set, we often find that the LLM with zero-shot prompting has poorer performance (e.g., on diabetes) when compared with the LLM with zero-shot prompting using the smaller feature set. This may indicate that the LLM has insufficient prior knowledge about the features contained in the expanded set. However, unlike with the baseline features, the model is not able to match the performance of the classical baselines even when using soft-prompt tuning. This indicates that for a complex set of input features, text serialization does not yield a representation that fully captures the available signal. Together, these results motivate the use of a multimodal approach to directly embed quantitative data into the prompt.
3.2 Encoding quantitative data using HeLM
To assess the benefit of directly embedding quantitative data into the LLM's latent token space, we repeat the experiments from Section 3.1 but encode the “expanded” inputs using the HeLM framework. We learn an encoder $\phi_{t}$, an MLP (two hidden layers of size 1024 and 4096, respectively), that takes quantitative features as input and maps them into the embedding space. We train this model over a mixture of all seven binary traits ($n=10,000$ for each trait).
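A plain NumPy sketch of this encoder as a forward pass; the ReLU activation, the initialization, and the final projection to $q_{t}\times k$ values are assumptions about details the text does not specify:

```python
import numpy as np

k, q_t, n_features = 1024, 6, 14           # k is the LLM embedding size (illustrative here)
sizes = [n_features, 1024, 4096, q_t * k]  # two hidden layers, then projection
rng = np.random.default_rng(0)
params = [
    (rng.normal(scale=0.02, size=(m, n)).astype(np.float32), np.zeros(n, dtype=np.float32))
    for m, n in zip(sizes[:-1], sizes[1:])
]

def phi_t(x: np.ndarray) -> np.ndarray:
    """MLP encoder mapping tabular features to q_t vectors in the token embedding space."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)      # ReLU on hidden layers (assumed)
    return h.reshape(q_t, k)            # six "multimodal" tokens

tokens = phi_t(rng.random(n_features).astype(np.float32))
```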
Table 2 shows performance metrics for HeLM compared with logistic regression and XGBoost. We see that by directly encoding the tabular data into token space, HeLM performs at parity with the best baseline methods and for all cause mortality, hypertension and myocardial infarction HeLM outperforms the best baseline. In comparing the scoring methods, we again find that HeLM scores are highly correlated with logistic regression scores (Spearman = 0.70–0.87; Figure S4a-Figure S5c). In Figure S2, we repeat the feature importance analysis, this time comparing HeLM with logistic regression and XGBoost. Similar to Section 3.1, we see strong concordance between feature weights, particularly for features that have the most importance. Taken together, these results are consistent with our previous hypothesis that the LLM scores outcomes in a way that is highly consistent with logistic regression and that the LLM is arriving at these scores by similarly linearly weighting input features.
In the case of hypertension, we see that HeLM significantly outperforms logistic regression and XGBoost, while XGBoost appears to have a slight advantage over logistic regression. In addition, the HeLM score is more correlated with the XGBoost score when compared with logistic regression (Spearman 0.87 vs. 0.76), which is not the case for any of the other traits. Although in no way conclusive, this may indicate that by mapping the tabular data into the token embedding space, the LLM takes advantage of non-linear relationships between input features and hypertension. This phenomenon may account for some of the cases where HeLM outperforms the baseline methods.
We hypothesized that HeLM has an advantage over logistic regression due to being trained over a mixture of traits and thus benefiting from transfer learning across traits. To evaluate this, we selected three traits (hypertension, all cause mortality and myocardial infarction) where HeLM performed better than logistic regression in terms of AUROC. For each, we trained a single task HeLM (a separate model for each trait) and compared with the mixture. Overall, we see that the mixture trained HeLM does not have an advantage over the single task model (Table S2). Training on a mixture of diseases is still advantageous since it yields a single model that can be used for a variety of diseases as opposed to separate models for each disease (as is the case for logistic regression and XGBoost).
Table 2: Comparison of AUC and AUPRC between the HeLM, logistic regression, and XGBoost models on the validation set. The binary phenotypes are predicted from the 14 “expanded set” input features from Table 1. The features are both encoded as text as in Table 1 and also included as a secondary quantitative data modality.
| Phenotype | Model | AUC | AUPRC |
|---|---|---|---|
| All Cause Mortality | HeLM | 0.71 (0.68-0.75) | 0.18 (0.14-0.23) |
| | LogReg | 0.68 (0.64-0.72) | 0.16 (0.13-0.21) |
| | XGBoost | 0.67 (0.62-0.70) | 0.13 (0.10-0.17) |
| Diabetes | HeLM | 0.77 (0.74-0.80) | 0.27 (0.22-0.33) |
| | LogReg | 0.75 (0.72-0.79) | 0.26 (0.21-0.32) |
| | XGBoost | 0.73 (0.69-0.76) | 0.22 (0.17-0.26) |
| Hypertension | HeLM | 0.79 (0.77-0.81) | 0.69 (0.66-0.72) |
| | LogReg | 0.74 (0.72-0.76) | 0.65 (0.62-0.68) |
| | XGBoost | 0.77 (0.75-0.78) | 0.66 (0.63-0.69) |
| Major Depression | HeLM | 0.63 (0.60-0.65) | 0.19 (0.17-0.22) |
| | LogReg | 0.62 (0.59-0.65) | 0.19 (0.17-0.22) |
| | XGBoost | 0.54 (0.52-0.57) | 0.16 (0.14-0.18) |
| Migraine | HeLM | 0.61 (0.55-0.66) | 0.06 (0.04-0.07) |
| | LogReg | 0.62 (0.56-0.67) | 0.06 (0.04-0.08) |
| | XGBoost | 0.60 (0.54-0.66) | 0.06 (0.04-0.07) |
| Myocardial Infarction | HeLM | 0.73 (0.69-0.77) | 0.15 (0.12-0.19) |
| | LogReg | 0.70 (0.65-0.74) | 0.16 (0.12-0.21) |
| | XGBoost | 0.69 (0.65-0.73) | 0.14 (0.11-0.19) |
| Stroke | HeLM | 0.73 (0.68-0.77) | 0.14 (0.09-0.20) |
| | LogReg | | |
| | XGBoost | | |