[Paper translation] Multimodal LLMs for health grounded in individual-specific data


Original paper: https://arxiv.org/pdf/2307.09018


Multimodal LLMs for health grounded in individual-specific data

Anastasiya Belyaeva $^{1}$∗, Justin Cosentino $^{1}$∗, Farhad Hormozdiari $^{2}$, Krish Eswaran $^{1}$, Shravya Shetty $^{1}$, Greg Corrado $^{1}$, Andrew Carroll $^{1}$, Cory Y. McLean $^{2}$†, and Nicholas A. Furlotte $^{1}$†

$^{1}$ Google Research, San Francisco CA 94105, USA
$^{2}$ Google Research, Cambridge MA 02142, USA
nickfurlotte@google.com

Abstract. Foundation large language models (LLMs) have shown an impressive ability to solve tasks across a wide range of fields including health. To effectively solve personalized health tasks, LLMs need the ability to ingest a diversity of data modalities that are relevant to an individual’s health status. In this paper, we take a step towards creating multimodal LLMs for health that are grounded in individual-specific data by developing a framework (HeLM: Health Large Language Model for Multimodal Understanding) that enables LLMs to use high-dimensional clinical modalities to estimate underlying disease risk. HeLM encodes complex data modalities by learning an encoder that maps them into the LLM’s token embedding space and for simple modalities like tabular data by serializing the data into text. Using data from the UK Biobank, we show that HeLM can effectively use demographic and clinical features in addition to high-dimensional time-series data to estimate disease risk. For example, HeLM achieves an AUROC of 0.75 for asthma prediction when combining tabular and spirogram data modalities compared with 0.49 when only using tabular data. Overall, we find that HeLM outperforms or performs at parity with classical machine learning approaches across a selection of eight binary traits. Furthermore, we investigate the downstream uses of this model such as its generalizability to out-of-distribution traits and its ability to power conversations around individual health and wellness.

Keywords: Multimodal Large Language Models $\cdot$ Health · UK Biobank.

1 Introduction

Foundation large language models (LLMs) have been shown to solve a range of natural language processing (NLP) tasks without having been explicitly trained to do so [4,36]. As a result, researchers are adapting LLMs to solve a variety of non-traditional NLP problems across domains. A recent perspective [23] outlined a variety of health-related use cases that could benefit from foundation LLMs that have not only generalist medical knowledge but that are also infused with individual-specific information such as lab values (e.g., cholesterol and triglycerides), imaging, time-series data, health tracker metrics (e.g., daily step count and heart rate), genome sequences, genetic risk scores, and other omics data modalities. These use cases range from AI clinician assistants to AI-powered early warning systems to user-facing health and wellness chatbots.

Fig. 1: Overview of HeLM, a multimodal LLM for health. Text features (orange) are tokenized and embedded into the token embedding space via a standard embedding matrix. Non-text modalities such as clinical data (blue) or high-dimensional lung function measures (green) are encoded into the same token embedding space via modality-specific encoders. The LLM is tuned to quantify disease risk given complex multimodal inputs.

While the potential applications for foundation LLMs in health are wide-ranging, at the core of each there is a fundamental need for the model to ingest complex multimodal individual-specific data and use it to gain an understanding of the individual’s underlying health risks. The model can then condition responses to queries on the derived risk profile of an individual. Though there has been promising recent work in developing generalist medical LLMs [38,30,31,32], the problem of using multimodal individual-specific information as context for health-related tasks remains understudied. More broadly, this capability represents one aspect of the general movement towards personalization of LLMs [18,28], which encompasses not only the technical challenges of data integration, but also the complex ethical questions around how the model can and should be used.

Providing relevant health information to an LLM could be as simple as including important disease risk factors, such as lab values, in the prompt by representing these factors as text [16,10]. Furthermore, in-context learning techniques such as few-shot prompting [4] can be employed by giving the model examples that help it to connect risk factors to underlying disease. However, this solution isn’t likely to work for the many complex data modalities found in the health space [1]. For example, it isn’t clear how to incorporate images, time-series data, or even rich structured tabular data into the LLM prompt. Furthermore, health factors may not be clearly captured by a single modality but rather may be best predicted by simultaneously incorporating features drawn from multiple modalities.

Recently, a variety of methods have been introduced to extend LLMs to the multimodal setting. The majority of these methods have focused on images and text [25,21,2,17,39,24] with some models adding the ability to incorporate diverse modalities, such as the movements of robots, video, and audio [11,12,22]. However, these nascent methods have not yet been applied to the health domain. On the other hand, there are a variety of classical machine learning methods for integrating multiple data modalities that are routinely applied in the health domain [1]. Examples include logistic regression classifiers that take multiple modalities as input, various fusion models [19], autoencoder-based models [37], and cross-supervision models [40]. However, these traditional approaches lack several potential advantages when compared to LLM-based methods.

First, foundation LLMs may have encoded extensive prior knowledge about health-related traits. For example, an LLM likely understands that hypertension is the same as high blood pressure and may even know something about what number ranges correspond to normal or high. This prior knowledge can be useful when dealing with heterogeneous data or with data that only has fuzzy labels. Secondly, LLMs can incorporate additional prior knowledge through prompt engineering, whereas in traditional ML methods including priors can be cumbersome. Thirdly, LLMs may have a high degree of flexibility in working with missing data. Whereas traditional methods require imputation of missing values or dropping samples, LLMs can be prompted to ignore missing values or they can be omitted completely without architectural changes. Finally, many foundation LLMs are conversational by design, so they can more naturally be used for applications such as those mentioned previously: user-facing health and wellness chatbots.

In this paper, we take a step towards multimodal LLMs for health that are grounded in individual-specific data by developing a framework called HeLM that enables LLMs to use health-related features to estimate underlying disease risk. The idea is that an LLM with a representation of background risk can use this as context in answering health-related queries. We formulate a disease classification task and then use the LLM, Flan-PaLMChilla 62b [7], to score potential outcomes. We compare the LLM score with classic supervised machine learning methods such as logistic regression and gradient-boosted decision trees by evaluating their ability to distinguish between individuals with and without the disease.

Using data from the UK Biobank [5], we evaluate different ways of prompting and encoding data and passing it to the LLM to classify disease status. First, we serialize health-related features into text similar to previously proposed approaches [16,10]. We evaluate zero-shot, few-shot, and a parameter-efficient soft-prompt tuning approach [20]. Generally, we find that the LLM performs better than random in the zero-shot and few-shot cases, and has comparable and often equivalent performance to standard approaches such as logistic regression and gradient-boosted decision trees (implemented in XGBoost [6]) after soft-prompt tuning. Additionally, for some diseases (e.g., diabetes), the zero-shot and few-shot approaches perform surprisingly well when compared with logistic regression and XGBoost, giving evidence of the model’s use of prior knowledge related to disease risk. However, we observe that performance degrades with an increase in the number of input features, indicating that the model does not always fully capture signal from serialized text.

To better capture signal from quantitative data, we propose HeLM, a multimodal approach to disease risk estimation based on the PaLM-E framework [11]. In HeLM, non-text modalities are mapped via encoders (trained over input examples) into the same token embedding space as text, where a separate encoder is learned for each modality (see Figure 1). We show that HeLM matches the performance of classical machine learning methods (logistic regression and XGBoost) and, in some cases, outperforms both the logistic regression and XGBoost models. We show results for encoding tabular data and spirogram curves, a time series representation of lung function used for respiratory disease diagnosis [33,8]. Finally, we explore how HeLM that has been tuned to quantify disease risk can be used in downstream conversational tasks. Here, we find that conversational ability degrades in the tuned LLM, which is consistent with what others have reported [35]. We discuss future directions such as fine-tuning schemes that may mitigate degradation.

2 Methods

2.1 LLMs with Tabular Data

An individual’s health status is often represented by a set of values stored in tabular format. We explore whether serializing tabular data into natural language and passing the resulting text to the LLM (Flan-PaLMChilla 62b [7]) is sufficient to achieve strong predictive performance. We construct serialized inputs by mapping table column names and corresponding values into a JSON-like format, which is combined with the base prompt. For example, for diabetes prediction, we formulate the following sentence using the data: “Predict if a patient has the condition or not. bmi: {28.9}. age: {67.0}. sex: {male}. diabetes: {”. We then compute the log-likelihood of the sentence being completed with “yes}” or “no}”. This log-likelihood serves as a risk score and can be evaluated using metrics such as AUROC and AUPRC to assess discriminatory power.
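
The serialization and scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `serialize`, `risk_score`, and `toy_loglik` are hypothetical, the feature values are illustrative, and `toy_loglik` merely stands in for a real LLM call that would return the log-likelihood of a completion.

```python
def serialize(features, target):
    """Render column names and values as JSON-like text ending in an open brace."""
    base = "Predict if a patient has the condition or not."
    fields = " ".join(f"{name}: {{{value}}}." for name, value in features.items())
    return f"{base} {fields} {target}: {{"


def risk_score(prompt, loglik_fn):
    """Risk score = log-likelihood of the positive completion given the prompt."""
    return loglik_fn(prompt, "yes}")


def toy_loglik(prompt, completion):
    """Stand-in for an LLM call (hypothetical): returns fixed log-probabilities."""
    return {"yes}": -1.2, "no}": -0.4}[completion]


# Illustrative feature values; the target name "diabetes" is an assumption.
prompt = serialize({"bmi": 28.9, "age": 67.0, "sex": "male"}, target="diabetes")
score = risk_score(prompt, toy_loglik)
print(prompt)
print(score)
```

Because the prompt ends with an open brace, the two candidate completions “yes}” and “no}” each close the JSON-like field, which keeps the scored continuations short and comparable.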

We evaluate the LLM’s performance in the zero-shot, few-shot and soft-prompt tuning settings. In the zero-shot setting, the serialized text input is given directly to the model. In this case, the LLM heavily leverages prior knowledge. In the few-shot scenario, we prefix 10 examples randomly sampled from the training dataset to the model’s prompt. For soft-prompt tuning, following [20], we learn a soft prompt to condition the frozen language model to perform well on the diseases of interest.
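
As a rough sketch of the few-shot setting, the prompt can be built by prefixing labeled, serialized training examples to the unlabeled query. The exact layout (one example per line, the base instruction stated once at the top) is an assumption made for illustration, and the helper names are hypothetical.

```python
import random

BASE = "Predict if a patient has the condition or not."


def serialize_example(features, target, label=None):
    """Serialize one record; labeled examples close the brace with their answer."""
    fields = " ".join(f"{name}: {{{value}}}." for name, value in features.items())
    answer = "" if label is None else f"{label}}}"
    return f"{fields} {target}: {{{answer}"


def few_shot_prompt(train, query, target, k=10, seed=0):
    """Prefix k randomly sampled labeled examples to the unlabeled query."""
    rng = random.Random(seed)
    shots = rng.sample(train, k)
    lines = [BASE]
    lines += [serialize_example(feats, target, label) for feats, label in shots]
    lines.append(serialize_example(query, target))
    return "\n".join(lines)


# Synthetic training records, for illustration only.
train = [
    ({"bmi": 20.0 + i, "age": 40.0 + i, "sex": "male" if i % 2 else "female"},
     "yes" if i % 3 == 0 else "no")
    for i in range(30)
]
prompt = few_shot_prompt(train, {"bmi": 28.9, "age": 67.0, "sex": "male"}, "diabetes")
print(prompt)
```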

Briefly, soft-prompt tuning is a parameter-efficient tuning technique that’s commonly used to steer the frozen LLM to perform well on a downstream task given labeled data. Instead of fine-tuning the weights of the LLM or forming a hard prompt out of tokens as in the few-shot scenario, soft-prompt tuning learns a matrix $P\in\mathbb{R}^{p\times k}$ in the token embedding space, where $p$ is the length of the prompt and $k$ is the size of the language embedding space. The soft-prompt $P$ is then concatenated with the embedded tokens and passed to the decoder. We train the soft-prompt using pairs of examples $(X,Y)$ and backpropagation to update the soft-prompt with the goal of maximizing the probability of $Y$. We train for 20,000 steps and use 1,000 training examples.
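
Written out, the objective described above takes roughly the following form. This is the standard soft-prompt tuning formulation rather than an equation from the paper; the symbols $\theta$ (the frozen LLM weights), $\mathcal{D}$ (the set of training pairs), and $[P;\gamma(X)]$ (row-wise concatenation of the soft prompt with the embedded input tokens) are notation introduced here for illustration.

$$
P^{*}=\arg\max_{P\in\mathbb{R}^{p\times k}}\sum_{(X,Y)\in\mathcal{D}}\log p_{\theta}\big(Y\mid[P;\gamma(X)]\big),\qquad \theta \text{ held fixed.}
$$

Only the $p\times k$ entries of $P$ receive gradient updates, which is what makes the method parameter-efficient relative to fine-tuning all of $\theta$.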

2.2 Multimodal LLMs for health: HeLM

To enable the LLM to reason over complex high-dimensional inputs, we embed non-text data modalities, including time-series data like spirograms and tabular data, into the same latent space as the language tokens (see Figure 1). We use separate encoders for each non-text data modality, where each encoder learns a mapping to the token embedding space. This approach is based on the PaLM-E framework [11]. All of the inputs are then processed together by a pre-trained LLM. More precisely, LLMs typically tokenize the words in a sentence and map the resulting tokens $w_{i}$ into a language embedding space $\mathcal{X}\subset\mathbb{R}^{k}$ via a large embedding matrix, $x_{i}=\gamma(w_{i})$, where $\gamma:\mathcal{W}\to\mathcal{X}$. In HeLM, non-text modalities are mapped to a sequence of vectors in $\mathcal{X}$ via encoders. For example, to map the spirogram time series we train an encoder $\phi_{s}:\mathcal{S}\to\mathcal{X}^{q_{s}}$, and to map tabular data we train an encoder $\phi_{t}:\mathcal{T}\to\mathcal{X}^{q_{t}}$, where $q_{s}$ and $q_{t}$ are the numbers of vectors in $\mathcal{X}$, that is, how many “multimodal” tokens each modality is mapped to. We set $q_{s}=6$ and $q_{t}=6$ for all experiments, although in general these can be treated as tunable hyperparameters. In summary, each vector $x_{i}$ is formed from either the token embedder $\gamma$ or a modality-specific encoder:
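
$$x_{i}=\begin{cases}\gamma(w_{i}) & \text{if } i \text{ is a text token,}\\ \phi_{s}(s_{i}) & \text{if } i \text{ corresponds to the spirogram modality,}\\ \phi_{t}(t_{i}) & \text{if } i \text{ corresponds to the tabular modality.}\end{cases}$$
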
Non-text and text modalities can be injected in any order. We train the modality-specific encoders keeping the pre-trained LLM weights frozen. This is similar to soft-prompt tuning but conditioned on data from a specific modality. For experiments considering multiple diseases we train a single HeLM model on a mixture of all diseases as opposed to one model per disease.

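
The way text tokens and encoded modalities end up in one input sequence can be sketched as follows. This is a toy illustration under stated assumptions: the encoders here are single random linear maps (the section does not specify the encoder architectures), the embedding width `K`, vocabulary size, and input dimensions are arbitrary, and all names are hypothetical. Only the values $q_s = q_t = 6$ come from the text.

```python
import random

K = 8     # toy embedding width; the real k is the LLM's embedding size
Q_S = 6   # multimodal tokens per spirogram (value used in the text)
Q_T = 6   # multimodal tokens per tabular record (value used in the text)


def make_linear_encoder(in_dim, q, k, seed):
    """Hypothetical encoder: one linear map from the input to q vectors of size k."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(q * k)]

    def encode(x):
        flat = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return [flat[j * k:(j + 1) * k] for j in range(q)]

    return encode


def embed_text(token_ids, table):
    """gamma: look up each token id in the (frozen) embedding table."""
    return [table[t] for t in token_ids]


rng = random.Random(0)
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(50)]
phi_s = make_linear_encoder(in_dim=1000, q=Q_S, k=K, seed=1)  # spirogram encoder
phi_t = make_linear_encoder(in_dim=14, q=Q_T, k=K, seed=2)    # tabular encoder

spirogram = [rng.random() for _ in range(1000)]
tabular = [rng.random() for _ in range(14)]
prefix, suffix = [3, 17, 42], [8, 5]  # toy token ids around the modalities

# Text and non-text vectors share one space, so they simply concatenate.
sequence = (embed_text(prefix, embedding_table) + phi_s(spirogram)
            + phi_t(tabular) + embed_text(suffix, embedding_table))
print(len(sequence), len(sequence[0]))
```

In training, only the encoder weights would be updated while the embedding table and the LLM stay frozen, which mirrors the analogy to soft-prompt tuning drawn in the text.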

2.3 UK Biobank Dataset Preparation

We obtain clinical features, spirometry, and disease labels from the UK Biobank (UKB) [5]. Similar to previous work, we focus on the European genetic ancestry subpopulation [3]. Limiting to European ancestry within the UK Biobank is a standard heuristic for reducing phenotypic heterogeneity, due to the correlation between population structure and phenotypic variation [9]. Differences in phenotypes across ancestries are multifactorial (socio-economic, cultural, etc.) but are often highly correlated with genetic background and thus population structure. Therefore, selecting study individuals based on a single genetic background is a convenient way to reduce heterogeneity in underlying disease risk, at the expense of creating bias in the dataset and subsequent analyses that utilize this data.

We defined binary phenotype labels using medical records comprising ICD-9 hospital inpatient (HESIN) billing codes, ICD-10 primary care and general practitioner (GP) read codes, and self-report data. Diseases include asthma, diabetes, hypertension, stroke, myocardial infarction, major depression, migraine, all-cause mortality, cataracts, gastroesophageal reflux disease (GERD), hay fever eczema (atopic eczema), osteoarthritis, and pneumonia. The following clinical features, lab values, and self-reported statuses sourced from questionnaires were used as model inputs: age, sex, body mass index (BMI), high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, total cholesterol, triglycerides, diastolic blood pressure, smoking status, snoring, insomnia, daytime napping, average nightly sleeping, and chronotype. See Table S1 for details describing feature definitions.

Spirometry data was prepared following the preprocessing procedures outlined in [8]. In short, we converted raw volumetric flow curves containing exhalation volume sampled at 10-ms intervals to liters and then computed the corresponding flow curve by approximating the first derivative with respect to time by taking a finite difference. The volume-time and flow-time curves were then normalized to length 1,000 and combined to generate a one-dimensional flow-volume spirogram. Following well-accepted spirometry quality control standards, we filter the dataset to individuals with at least one acceptable blow from the first visit [27,29,8].
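
The first steps of this preprocessing can be sketched as below. This is an illustrative approximation, not the pipeline from [8]: it assumes the raw volume is recorded in milliliters, uses a backward difference with the first flow value set to zero, and resamples by linear interpolation. The final step that combines the two curves into a flow-volume spirogram is omitted because its details are not given here.

```python
def to_liters(volume_ml):
    """Convert raw volume samples to liters (assumes the raw unit is mL)."""
    return [v / 1000.0 for v in volume_ml]


def finite_difference(volume_l, dt_s=0.01):
    """Approximate flow (L/s) as the backward difference of volume over time.

    dt_s = 0.01 matches the 10-ms sampling interval; flow[0] is set to 0.0.
    """
    flow = [0.0]
    for prev, curr in zip(volume_l, volume_l[1:]):
        flow.append((curr - prev) / dt_s)
    return flow


def resample(curve, length=1000):
    """Linearly interpolate a curve onto `length` evenly spaced points."""
    n = len(curve)
    if n == 1:
        return [curve[0]] * length
    out = []
    for i in range(length):
        pos = i * (n - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(curve[lo] * (1.0 - frac) + curve[hi] * frac)
    return out


volume_ml = [0.0, 100.0, 300.0, 600.0]   # toy blow, four samples
volume = to_liters(volume_ml)
flow = finite_difference(volume)
volume_1000 = resample(volume)
flow_1000 = resample(flow)
print(len(volume_1000), len(flow_1000))
```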

We randomly partitioned patients with valid data entries for all phenotype labels and clinical features into distinct training and validation datasets.

3 Experimental Results

We present LLM risk prediction performance across eight disease classification tasks under varied experimental settings. We begin by assessing the effectiveness of zero-shot and few-shot prompting as well as soft-prompt tuning on a pre-trained foundation model using health data serialized into text. We then demonstrate that directly mapping quantitative features into the LLM’s latent token space significantly improves model performance. Finally, using asthma as a case study, we show that this mapping procedure generalizes to high-dimensional lung function data.

3.1 Quantifying disease risk using zero-shot, few-shot, and soft-prompt tuning

We first establish a baseline for LLM disease risk prediction using zero-shot, few-shot, and soft-prompt tuning with a frozen Flan-PaLMChilla 62b model [7]. We define classification tasks using a diverse set of binary phenotype targets (see Section 2.3). For each task and prompting method, we evaluate a “baseline” set of model inputs consisting of age, sex, and BMI as well as an “expanded” set that includes eleven additional clinical and wellness features: HDL cholesterol, LDL cholesterol, total cholesterol, total triglycerides, diastolic blood pressure, smoking status, average sleep duration, insomnia, snoring, daytime napping, and chronotype (i.e., whether a patient is a morning or evening person). These predictors were chosen based on prior knowledge that they should be informative about the selected targets.

We score a validation dataset ($n=3,000$) using the methodology outlined in Section 2. For each validation sample, we generate a disease risk score by computing the log probability of a positive disease label. Similarly, to obtain the logistic regression and XGBoost scores, we compute the probability of a positive disease label given the respective model fit on a separate training set ($n=10,000$).
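
To make the evaluation concrete, the following sketch computes AUROC from labels and risk scores and estimates a bootstrap confidence interval. It is a simplified stand-in for the paper's evaluation code: the function names are hypothetical, the percentile interval is one common choice among several, and the pairwise AUROC is quadratic in sample size, so it suits small examples rather than thousands of samples.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Mean AUROC and an approximate percentile interval over bootstrap resamples."""
    rng = random.Random(seed)
    indices = range(len(labels))
    stats = []
    while len(stats) < n_iter:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:      # skip resamples that contain a single class
            continue
        stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_iter)]
    upper = stats[int((1 - alpha / 2) * n_iter) - 1]
    return sum(stats) / n_iter, lower, upper


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]    # log-probabilities would work equally well
print(auroc(labels, scores))
print(bootstrap_auroc(labels, scores, n_iter=200))
```

AUROC depends only on how the scores rank the samples, so a log probability from the LLM and a probability from logistic regression can be compared on this metric without rescaling either one.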

Table 1 shows performance for each prompting method and input set compared to the baseline models. We observe that at least one prompting technique is competitive with the baselines for most tasks. In some cases (e.g., hypertension and diabetes), zero-shot and few-shot models perform surprisingly well despite seeing little to no training data compared to the baselines, an observation also made by [16]. This suggests that the LLM uses prior knowledge about the relationships between age, sex, BMI and disease likelihood.

Focusing on the baseline feature set (age, sex, and BMI), we aimed to understand how the LLM derives scores. We estimated the importance of each input feature by regressing the features against the scores output by the model. Coefficients from the linear regression model are used to measure feature importance and concordance across methods. Figure S1 shows the result of this analysis for four traits. For diabetes, hypertension, and stroke, we see concordance between logistic regression, XGBoost and the LLM models in terms of direction and relative magnitude of effects. Additionally, we find a strong correlation between logistic regression and LLM scores $\mathrm{\Delta(Spearman=0.46\mathrm{-}0.93}$ across prompting methods), while the correlation between LLM and XGBoost scores is weaker (Spearman = 0.39–0.65). On the other hand, for migraine, we see little concordance in direction and relative magnitude of effects for zero-shot and few-shot LLMs, indicating that the LLM doesn’t have sufficient prior knowledge to relate migraine with the input features. However, this is corrected in the soft-prompt case, where we see concordance and high Spearman correlation between soft-prompt tuned LLM and logistic regression (0.85). On the other hand, the soft-prompt tuned LLM has low Spearman correlation with XGBoost (0.31), which is a non-linear model. Given this, we hypothesize that the LLM is effectively scoring outcomes using what translates to a simple linear func

聚焦基线特征集(年龄、性别和BMI),我们旨在理解大语言模型如何推导评分。通过将输入特征与模型输出的评分进行回归分析,我们评估了各特征的重要性。线性回归模型的系数用于衡量特征重要性及方法间一致性。图 S1 展示了四项特征的分析结果:对于糖尿病、高血压和卒中,逻辑回归、XGBoost与大语言模型在效应方向和相对强度上表现一致。此外,逻辑回归与大语言模型评分呈现强相关性(不同提示方法间Spearman=0.46-0.93),而大语言模型与XGBoost评分的相关性较弱(Spearman=0.39-0.65)。在偏头痛分析中,零样本和少样本大语言模型在效应方向与强度上均缺乏一致性,表明模型缺乏将偏头痛与输入特征关联的先验知识。但软提示调整后的大语言模型与逻辑回归呈现高度一致性(Spearman=0.85),而与非线性模型XGBoost的相关性较低(0.31)。据此我们推测,大语言模型实质上是通过等效于简单线性函数的方式进行结果评分。

Table 1: Comparison of AUC and AUPRC between LLM-based classifiers and classical machine learning approaches on the validation set. The models with “baseline” input features use age, sex and BMI as model features, while the models with the “expanded” set also include 11 additional clinical and wellness features. The mean AUC/AUPRC and the corresponding $95%$ confidence intervals were calculated across 1,000 boots trapping iterations. Bold cells denote the best models for a given phenotype and input feature set, where statistical significance is determined via paired boots trapping. Logistic regression and XGBoost models were trained on 10,000 samples, few-shot on 10 samples and soft-prompt tuning on 1,000 samples.

表 1: 验证集上基于大语言模型的分类器与经典机器学习方法的AUC和AUPRC对比。标注"baseline"输入特征的模型使用年龄、性别和BMI作为特征,而标注"expanded"特征的模型额外包含11项临床与健康指标。所有AUC/AUPRC均值及对应的$95%$置信区间均通过1,000次自助采样计算得出。加粗单元格表示特定表型和输入特征组合下的最优模型,其统计学显著性通过配对自助采样检验确定。逻辑回归和XGBoost模型使用10,000个样本训练,少样本学习使用10个样本,软提示调优使用1,000个样本。

PhenotypeModelBaseline AUCExpanded AUC Baseline AUPRC Expanded AUPRC
All Cause Mortality (prevalence =6.83%)Zero-shot0.61 (0.57-0.64)0.61 (0.57-0.65)0.10 (0.080.12)0.11 (0.090.14)
Few-shot0.65 (0.61-0.69)0.65 (0.61-0.69)0.12 (0.10-0.15)0.12 (0.09-0.15)
(910-01:0) 010(20-990)690(20-990)690 4001-9090.14 (0.11-0.18)
LogReg0.16 (0.13-0.21)
XGBoost0.67 (0.62-0.70)0.67 (0.62-0.70) 0.13 (0.10-0.17)0.13 (0.10-0.17)
Diabetes (prevalence = 7.60%)Zero-shot0.70 (0.67-0.73)0.61 (0.57-0.65)0.18 (0.140.23)0.12 (0.100.15)
Few-shot0.72 (0.69-0.76)0.67 (0.640.70)0.19 (0.15-0.24)0.14 (0.110.17)
Soft-prompt 0.72 (0.69-0.76)0.68 (0.640.72)0.23 (0.18-0.28)0.17 (0.140.22)
LogReg0.26 (0.21-0.32)
XGBoost0.73 (0.69-0.76) 0.73 (0.69-0.76)0.22 (0.17-0.26)0.22 (0.17-0.26)
Hypertension (prevalence = 40.03%)Few-shotZero-shot0.70 (0.68-0.72)0.68 (0.66-0.70)0.60 (0.56-0.62)0.57 (0.54-0.60)
0.73 (0.71-0.75)0.72 (0.700.73)0.62 (0.59-0.64)0.59 (0.56-0.62)
Soft-prompt 0.72 (0.70-0.74)0.72 (0.700.74)0.60 (0.57-0.63)0.59 (0.56-0.62)
LogReg0.74 (0.720.76)0.74 (0.720.76)0.63 (0.60-0.66)0.65 (0.62-0.68)
XGBoost0.77 (0.75-0.78) 0.77 (0.75-0.78) 0.66 (0.63-0.69)0.66 (0.63-0.69)
Major Depression (prevalence = 13.03%)Few-shotZero-shot0.54 (0.51-0.57)0.58 (0.55-0.61)0.15 (0.130.17)0.17 (0.140.19)
0.50 (0.47-0.53)0.55 (0.52-0.58)0.14 (0.12-0.15)0.15 (0.13-0.17)
Soft-prompt 0.60 (0.57-0.63) 0.47 (0.44-0.50)0.20 (0.17-0.23)0.12 (0.110.14)
LogReg0.19 (0.17-0.22)
XGBoost0.54 (0.52-0.57)0.54 (0.52-0.57)0.16 (0.140.18)0.16 (0.140.18)
Migraine (prevalence = 3.77%)Zero-shot0.51 (0.45-0.56)0.49 (0.45-0.55)0.05 (0.03-0.07)0.04 (0.03-0.05)
Few-shot0.45 (0.40-0.50)0.47 (0.420.53)0.03 (0.03-0.04)0.03 (0.03-0.04)
| Phenotype | Model | Baseline AUC | Expanded AUC | Baseline AUPRC | Expanded AUPRC |
| --- | --- | --- | --- | --- | --- |
| All cause mortality (prevalence = 6.83%) | Zero-shot | 0.61 (0.57-0.64) | 0.61 (0.57-0.65) | 0.10 (0.08-0.12) | 0.11 (0.09-0.14) |
| | Few-shot | 0.65 (0.61-0.69) | 0.65 (0.61-0.69) | 0.12 (0.10-0.15) | 0.12 (0.09-0.15) |
| | Soft-prompt | [not recoverable] | [not recoverable] | [not recoverable] | 0.14 (0.11-0.18) |
| | LogReg | [not recoverable] | [not recoverable] | [not recoverable] | 0.16 (0.13-0.21) |
| | XGBoost | 0.67 (0.62-0.70) | 0.67 (0.62-0.70) | 0.13 (0.10-0.17) | 0.13 (0.10-0.17) |
| Diabetes (prevalence = 7.60%) | Zero-shot | 0.70 (0.67-0.73) | 0.61 (0.57-0.65) | 0.18 (0.14-0.23) | 0.12 (0.10-0.15) |
| | Few-shot | 0.72 (0.69-0.76) | 0.67 (0.64-0.70) | 0.19 (0.15-0.24) | 0.14 (0.11-0.17) |
| | Soft-prompt | 0.72 (0.69-0.76) | 0.68 (0.64-0.72) | 0.23 (0.18-0.28) | 0.17 (0.14-0.22) |
| | LogReg | [not recoverable] | [not recoverable] | [not recoverable] | 0.26 (0.21-0.32) |
| | XGBoost | 0.73 (0.69-0.76) | 0.73 (0.69-0.76) | 0.22 (0.17-0.26) | 0.22 (0.17-0.26) |
| Hypertension (prevalence = 40.03%) | Zero-shot | 0.70 (0.68-0.72) | 0.68 (0.66-0.70) | 0.60 (0.56-0.62) | 0.57 (0.54-0.60) |
| | Few-shot | 0.73 (0.71-0.75) | 0.72 (0.70-0.73) | 0.62 (0.59-0.64) | 0.59 (0.56-0.62) |
| | Soft-prompt | 0.72 (0.70-0.74) | 0.72 (0.70-0.74) | 0.60 (0.57-0.63) | 0.59 (0.56-0.62) |
| | LogReg | 0.74 (0.72-0.76) | 0.74 (0.72-0.76) | 0.63 (0.60-0.66) | 0.65 (0.62-0.68) |
| | XGBoost | 0.77 (0.75-0.78) | 0.77 (0.75-0.78) | 0.66 (0.63-0.69) | 0.66 (0.63-0.69) |
| Major depression (prevalence = 13.03%) | Zero-shot | 0.54 (0.51-0.57) | 0.58 (0.55-0.61) | 0.15 (0.13-0.17) | 0.17 (0.14-0.19) |
| | Few-shot | 0.50 (0.47-0.53) | 0.55 (0.52-0.58) | 0.14 (0.12-0.15) | 0.15 (0.13-0.17) |
| | Soft-prompt | 0.60 (0.57-0.63) | 0.47 (0.44-0.50) | 0.20 (0.17-0.23) | 0.12 (0.11-0.14) |
| | LogReg | [not recoverable] | [not recoverable] | [not recoverable] | 0.19 (0.17-0.22) |
| | XGBoost | 0.54 (0.52-0.57) | 0.54 (0.52-0.57) | 0.16 (0.14-0.18) | 0.16 (0.14-0.18) |
| Migraine (prevalence = 3.77%) | Zero-shot | 0.51 (0.45-0.56) | 0.49 (0.45-0.55) | 0.05 (0.03-0.07) | 0.04 (0.03-0.05) |
| | Few-shot | 0.45 (0.40-0.50) | 0.47 (0.42-0.53) | 0.03 (0.03-0.04) | 0.03 (0.03-0.04) |
| | Soft-prompt | 0.64 (0.59-0.69) | 0.51 (0.46-0.56) | 0.07 (0.05-0.09) | 0.04 (0.03-0.06) |
| | LogReg | [not recoverable] | [not recoverable] | [not recoverable] | 0.06 (0.04-0.08) |
| | XGBoost | [not recoverable] | [not recoverable] | [not recoverable] | 0.06 (0.04-0.07) |
| Myocardial infarction (prevalence = 5.93%) | Zero-shot | 0.61 (0.57-0.65) | 0.61 (0.56-0.65) | 0.08 (0.06-0.10) | 0.09 (0.07-0.12) |
| | Few-shot | 0.67 (0.63-0.71) | 0.65 (0.61-0.68) | 0.11 (0.09-0.14) | 0.10 (0.08-0.12) |
| | Soft-prompt | [not recoverable] | [not recoverable] | [not recoverable] | 0.11 (0.09-0.14) |
| | LogReg | [not recoverable] | [not recoverable] | [not recoverable] | 0.16 (0.12-0.21) |
| | XGBoost | [not recoverable] | [not recoverable] | [not recoverable] | 0.14 (0.11-0.19) |
| Stroke (prevalence = 4.00%) | Zero-shot | 0.61 (0.56-0.66) | 0.55 (0.49-0.60) | 0.06 (0.05-0.09) | 0.05 (0.04-0.06) |
| | Few-shot | 0.65 (0.60-0.69) | 0.60 (0.56-0.65) | 0.07 (0.05-0.09) | 0.05 (0.04-0.07) |
| | Soft-prompt | 0.63 (0.58-0.68) | 0.50 (0.45-0.55) | 0.09 (0.06-0.12) | 0.05 (0.03-0.06) |
| | LogReg | [not recoverable] | [not recoverable] | [not recoverable] | 0.14 (0.09-0.19) |
| | XGBoost | 0.68 (0.63-0.72) | 0.68 (0.63-0.72) | 0.09 (0.06-0.12) | 0.09 (0.06-0.12) |

tion of the input features and that this linear mapping is effectively learned via soft-prompting.

In the expanded feature set, we often find that the LLM with zero-shot prompting has poorer performance (e.g., on diabetes) when compared with the LLM with zero-shot prompting using the smaller feature set. This may indicate that the LLM has insufficient prior knowledge about the features contained in the expanded set. However, unlike the baseline features, the model is not able to match baseline performance even when using soft-prompt tuning. This indicates that for a complex set of input features, text serialization does not yield a representation that fully captures the available signal. Together, these results motivate the use of a multimodal approach to directly embed quantitative data into the prompt.

3.2 Encoding quantitative data using HeLM

To assess the benefit of directly embedding quantitative data into the LLM’s latent token space, we repeat the experiments from Section 3.1 but encode the “extended” inputs using the HeLM framework. We learn an encoder $\phi_{t}$, an MLP (two hidden layers of size 1024 and 4096, respectively), that takes quantitative features as input and maps them into the embedding space. We train this model over a mixture of all seven binary traits ($n=10,000$ for each trait).
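
The encoder described above can be pictured with a small forward-pass sketch. This is only an illustration under stated assumptions: ReLU activations, random initialization, and an output that is reshaped into a fixed number of soft tokens of the LLM's embedding width are all assumptions, since the text specifies only the two hidden layer sizes. The names `make_encoder` and `encode` are hypothetical, and small layer sizes stand in for 1024 and 4096 so the example runs quickly in pure Python.

```python
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative init)."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, relu):
    """Apply y = W x + b, optionally followed by ReLU (assumed activation)."""
    weights, biases = layer
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in y] if relu else y

def make_encoder(n_features, hidden_sizes, n_tokens, embed_dim, seed=0):
    """Build an MLP mapping n_features values to n_tokens * embed_dim outputs."""
    rng = random.Random(seed)
    sizes = [n_features, *hidden_sizes, n_tokens * embed_dim]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def encode(features, layers, n_tokens, embed_dim):
    """Map quantitative features to a list of n_tokens embedding vectors."""
    x = list(features)
    for i, layer in enumerate(layers):
        # ReLU on hidden layers only; the output layer stays linear.
        x = dense(x, layer, relu=(i < len(layers) - 1))
    return [x[t * embed_dim:(t + 1) * embed_dim] for t in range(n_tokens)]

# Hypothetical sizes: 14 features, hidden layers 32 and 64 (1024 and 4096 in
# the text), 4 soft tokens of width 8.
layers = make_encoder(14, (32, 64), n_tokens=4, embed_dim=8)
tokens = encode([0.5] * 14, layers, n_tokens=4, embed_dim=8)
print(len(tokens), len(tokens[0]))
```

In a setup like this, the resulting token vectors would be placed in the prompt alongside the ordinary text-token embeddings, and only the encoder weights would be updated during training.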

Table 2 shows performance metrics for HeLM compared with logistic regression and XGBoost. By directly encoding the tabular data into token space, HeLM performs at parity with the best baseline methods, and for all cause mortality, hypertension, and myocardial infarction it outperforms the best baseline. In comparing the scoring methods, we again find that HeLM scores are highly correlated with logistic regression scores (Spearman = 0.70–0.87; Figure S4a-Figure S5c). In Figure S2, we repeat the feature importance analysis, this time comparing HeLM with logistic regression and XGBoost. Similar to Section 3.1, we see strong concordance between feature weights, particularly for the most important features. Taken together, these results are consistent with our previous hypothesis that the LLM scores outcomes in a way that is highly consistent with logistic regression and that the LLM arrives at these scores by similarly linearly weighting input features.
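
For readers unfamiliar with the statistic used to compare scoring methods, the following is a minimal sketch of Spearman's rank correlation: rank both score lists (averaging ranks over ties) and take the Pearson correlation of the ranks. The score values are made up for illustration and are not taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

# Illustrative (fabricated) risk scores from two methods for five individuals.
helm_scores = [0.10, 0.40, 0.35, 0.80, 0.55]
logreg_scores = [0.15, 0.30, 0.45, 0.70, 0.60]
print(round(spearman(helm_scores, logreg_scores), 2))  # 0.9
```

Because only ranks matter, two methods can be perfectly rank-correlated even when their raw scores are on different scales.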

In the case of hypertension, we see that HeLM significantly outperforms logistic regression and XGBoost, while XGBoost appears to have a slight advantage over logistic regression. In addition, the HeLM score is more correlated with the XGBoost score than with the logistic regression score (Spearman 0.87 vs. 0.76), which is not the case for any of the other traits. Although in no way conclusive, this may indicate that by mapping the tabular data into the token embedding space, the LLM takes advantage of non-linear relationships between input features and hypertension. This phenomenon may account for some of the cases where HeLM outperforms the baseline methods.

We hypothesized that HeLM has an advantage over logistic regression because it is trained over a mixture of traits and thus benefits from transfer learning across traits. To evaluate this, we selected three traits (hypertension, all cause mortality, and myocardial infarction) where HeLM performed better than logistic regression in terms of AUROC. For each, we trained a single-task HeLM (a separate model for each trait) and compared it with the mixture. Overall, we see that the mixture-trained HeLM does not have an advantage over the single-task model (Table S2). Training on a mixture of diseases is still advantageous since it yields a single model that can be used for a variety of diseases, as opposed to separate models for each disease (as is the case for logistic regression and XGBoost).

Table 2: Comparison of AUC and AUPRC between HeLM, logistic regression, and XGBoost models on the validation set. The binary phenotypes are predicted from the 14 “expanded set” input features from Table 1. The features are both encoded as text as in Table 1 and also included as a secondary quantitative data modality.

| Phenotype | Model | AUC | AUPRC |
| --- | --- | --- | --- |
| All Cause Mortality | HeLM | 0.71 (0.68-0.75) | 0.18 (0.14-0.23) |
| | LogReg | 0.68 (0.64-0.72) | 0.16 (0.13-0.21) |
| | XGBoost | 0.67 (0.62-0.70) | 0.13 (0.10-0.17) |
| Diabetes | HeLM | 0.77 (0.74-0.80) | 0.27 (0.22-0.33) |
| | LogReg | 0.75 (0.72-0.79) | 0.26 (0.21-0.32) |
| | XGBoost | 0.73 (0.69-0.76) | 0.22 (0.17-0.26) |
| Hypertension | HeLM | 0.79 (0.77-0.81) | 0.69 (0.66-0.72) |
| | LogReg | 0.74 (0.72-0.76) | 0.65 (0.62-0.68) |
| | XGBoost | 0.77 (0.75-0.78) | 0.66 (0.63-0.69) |
| Major Depression | HeLM | 0.63 (0.60-0.65) | 0.19 (0.17-0.22) |
| | LogReg | 0.62 (0.59-0.65) | 0.19 (0.17-0.22) |
| | XGBoost | 0.54 (0.52-0.57) | 0.16 (0.14-0.18) |
| Migraine | HeLM | 0.61 (0.55-0.66) | 0.06 (0.04-0.07) |
| | LogReg | 0.62 (0.56-0.67) | 0.06 (0.04-0.08) |
| | XGBoost | 0.60 (0.54-0.66) | 0.06 (0.04-0.07) |
| Myocardial Infarction | HeLM | 0.73 (0.69-0.77) | 0.15 (0.12-0.19) |
| | LogReg | 0.70 (0.65-0.74) | 0.16 (0.12-0.21) |
| | XGBoost | 0.69 (0.65-0.73) | 0.14 (0.11-0.19) |
| Stroke | HeLM | 0.73 (0.68-0.77) | 0.14 (0.09-0.20) |
| | LogReg | [not recoverable] | [not recoverable] |
| | XGBoost | [not recoverable] | [not recoverable] |

3.3 Estimating asthma risk using multiple modalities

Using HeLM to leverage tabular data clearly showed that this is a promising direction for quantifying disease risk. Next, we evaluate whether HeLM can incorporate more complex data modalities such as a spirogram, a time-series curve which measures the amount of air an individual can breathe in and out of their lungs. Spirometry is commonly used to assess pulmonary function and the presence of respiratory diseases. Thus, we focus on the task of quantifying asthma risk.

To this end, we trained a HeLM model with three modalities as input: tabular data (14 “expanded set” input features) as well as spirometry, along with tabular data serialized to text. We do not include a textual description of the spirogram since it’s unclear how to summarize this data in text. We encode spirometry data into the token embedding space via a one-dimensional variant of the ResNet18 architecture [14,15], followed by an MLP (two hidden layers of size 1024 and 4096, respectively). We use a pre-trained model from [8] for the ResNet18 part of the encoder and only update the weights of the MLP. The ResNet18 model was trained to predict asthma and COPD. We take as input the 128-dimensional embedding, corresponding to the penultimate layer. Similar to Section 3.2, we use an MLP to encode tabular data into the token embedding space.

The model is instructed to predict asthma using supervised labels (“yes” or “no”) as targets in training. We trained on $n=16,724$ samples from the UK Biobank, which were obtained by subsampling the dataset to achieve a one-to-one case-control class distribution to ensure a balanced representation of categories. As a direct baseline, we also trained a model that linearly combined the 128-dimensional embedding from the ResNet18 model and tabular data, which we term ResNet18 1D (tabular data + spirogram).
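
The one-to-one case-control subsampling can be sketched as below. This is a hedged illustration, not the authors' procedure: it assumes the larger class is randomly downsampled without replacement to the size of the smaller class, and the function name `balance_case_control` and the toy labels are hypothetical.

```python
import random

def balance_case_control(labels, seed=0):
    """Return sorted indices with equal numbers of cases (1) and controls (0).

    Assumption: the larger class is randomly downsampled without replacement.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(cases), len(controls))
    keep = rng.sample(cases, n) + rng.sample(controls, n)
    return sorted(keep)

# Toy cohort: 3 asthma cases and 17 controls.
labels = [1] * 3 + [0] * 17
idx = balance_case_control(labels)
print(len(idx), sum(labels[i] for i in idx))  # 6 3
```

Balancing in this way changes the class prior seen during training, so scores from such a model are better read as rankings than as calibrated prevalence-level probabilities.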

In order to assess whether the multimodal HeLM model is leveraging the additional spirogram modality for asthma prediction, we trained several models on tabular data only, without the spirogram. Following Section 3.2 we trained a logistic regression model, XGBoost, an LLM with soft-prompt tuning and HeLM using tabular data only.


Fig. 2: Inclusion of additional input modalities improves HeLM asthma detection. (a) ROC and (b) Precision-recall (PR) for asthma phenotype prediction using tabular data only versus tabular data and spirometry. We compare HeLM, ResNet18 1D, logistic regression, XGBoost, and a soft-prompt tuned LLM trained on either a single modality (tabular) or two modalities (tabular and spirometry).

Figure 2 shows the model performance on a held-out set of individuals with valid spirograms ($n=2,289$). We observe that the HeLM model trained on two non-text modalities (tabular data and spirogram) is successfully leveraging the additional spirogram modality to boost the performance on asthma prediction. For example, the AUROC (AUPRC) increases from $0.49\pm0.02$ ($0.14\pm0.01$) for the LLM trained on tabular data only via soft-prompt tuning to $0.75\pm0.01$ ($0.38\pm0.02$) for HeLM trained on tabular data and a spirogram. The comparison with the soft-prompt tuned LLM on tabular data is particularly important since it is the only other method that can also generate natural language text and thus provide recommendations, answer questions, and summarize given an individual’s health context. We also observe that HeLM trained on tabular and spirogram data performs on par with the linear combination of the ResNet18 1D model on spirogram and tabular data.

3.4 Using HeLM for out-of-distribution traits

Deep learning methods have shown significant performance degradation when applied to out-of-distribution (OOD) data [26]. We investigated HeLM OOD performance by applying it to a set of traits not in the training set. Considering cataract, gastroesophageal reflux disease (GERD), hay fever eczema (atopic eczema), osteoarthritis, and pneumonia, we compared HeLM with trait-specific logistic regression models (Table 3). We observed that HeLM (not trained on the trait) performs on par with logistic regression (trained on the trait) in four out of five of the tested traits. Taken at face value, this implies that HeLM is leveraging prior information about the impact of risk factors on disease learned from related traits in the training data. However, truly understanding how the model arrives at the OOD scores will take additional research.

3.5 Natural language generation

Finally, we assess whether an LLM can incorporate multimodal data to tailor conversational tasks. Approximating a two-stage approach in which a model is trained to quantify risk and then trained to use these risk estimates in conversational tasks, we first use HeLM to compute asthma risk from a spirogram embedding and encode the predicted risk as text into the PaLM-2 model [13] for exercise recommendations: “I have a p% chance of having asthma. What are some exercises you would suggest, given this information? Please tailor your recommendations to my health status.” Over 110 examples that span the asthma risk spectrum, we observe differential recommendations based on predicted asthma risk (Figure S3).
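
The second stage of this two-stage approach amounts to filling the predicted risk into the quoted prompt. A minimal sketch follows; the template text is the one quoted above, while the helper name `build_prompt` and the choice to round the risk to a whole percent are assumptions made for illustration.

```python
PROMPT = (
    "I have a {p}% chance of having asthma. What are some exercises you "
    "would suggest, given this information? Please tailor your "
    "recommendations to my health status."
)

def build_prompt(risk):
    """Format a predicted probability in [0, 1] as the text prompt.

    Rounding to a whole percent is an assumption; the source only shows "p%".
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    return PROMPT.format(p=round(risk * 100))

print(build_prompt(0.25))
```

The resulting string would then be sent to the conversational model; that call is omitted here because the serving interface is not described in the text.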

This approach is inefficient as it requires transforming risk predictions back to the textual domain. Ideally, the LLM should condition recommendations directly upon the embedded multimodal data. To this end, we qualitatively explore the natural language generation capability of a HeLM model trained for the asthma task: “Let’s think step by step. Given the following spirogram:⟨spirogram⟩, do they have asthma? Based on that what is the most recommended exercise?”. A sample answer for an individual with asthma is “low intensity aerobic exercise. The answer: yes.” and for an individual without asthma “Swimming. The answer: no.”.

Table 3: Comparison of AUC and AUPRC between zero-shot and few-shot LLM-based classifiers, HeLM, and a logistic regression model on the validation set in the OOD setting. While logistic regression models were trained on the data for the trait, HeLM was not trained for the trait. The binary phenotypes are predicted from the 14 “expanded set” input features from Table 1. The features are both encoded as text as in Table 1 and also included as a secondary quantitative data modality. “prev” denotes prevalence.

| Phenotype | Model | AUC | AUPRC |
| --- | --- | --- | --- |
| Cataract (prev=13.51%) | Zero-shot | 0.58 (0.55-0.61) | 0.17 (0.15-0.20) |
| | Few-shot | 0.63 (0.60-0.65) | 0.18 (0.16-0.20) |
| | HeLM | 0.71 (0.69-0.73) | 0.24 (0.21-0.27) |
| | LogReg | 0.72 (0.69-0.74) | 0.27 (0.23-0.31) |
| GERD (prev=17.33%) | Zero-shot | 0.58 (0.55-0.60) | 0.22 (0.19-0.25) |
| | Few-shot | 0.57 (0.54-0.60) | 0.21 (0.18-0.23) |
| | HeLM | 0.61 (0.59-0.64) | 0.25 (0.22-0.28) |
| | LogReg | 0.61 (0.59-0.64) | 0.26 (0.23-0.29) |
| Hay Fever Eczema (prev=23.84%) | Zero-shot | 0.48 (0.46-0.51) | 0.23 (0.21-0.24) |
| | Few-shot | 0.46 (0.44-0.49) | 0.23 (0.21-0.25) |
| | HeLM | 0.47 (0.44-0.49) | 0.22 (0.20-0.24) |
| | LogReg | 0.57 (0.55-0.60) | 0.28 (0.26-0.31) |
| Osteoarthritis (prev=28.13%) | Zero-shot | 0.60 (0.58-0.62) | 0.36 (0.33-0.39) |
| | Few-shot | 0.63 (0.61-0.65) | 0.38 (0.35-0.40) |
| | HeLM | 0.65 (0.63-0.67) | 0.41 (0.38-0.44) |
| | LogReg | 0.67 (0.65-0.69) | 0.44 (0.40-0.47) |
| Pneumonia (prev=6.50%) | Zero-shot | 0.56 (0.52-0.60) | 0.08 (0.06-0.10) |
| | Few-shot | 0.59 (0.55-0.63) | 0.08 (0.07-0.10) |
| | HeLM | 0.68 (0.65-0.72) | 0.16 (0.12-0.21) |
| | LogReg | 0.67 (0.63-0.71) | 0.16 (0.12-0.20) |

We observe that while the model gives sample exercise recommendations that are reasonable (e.g., a low intensity exercise for someone with asthma), it also learned to include yes/no in its answer. This is likely because it has associated our particular question/answer format with the presence of multimodal tokens. The risk prediction results demonstrated that the model can learn to operate on spirograms in the token space, so we expect that including more diverse input and output pairs and generative tasks may improve conversational ability and plan to explore this in future work.

4 Discussion

Grounding LLMs in individual-specific information is required to create personalized experiences across a large set of applications. Effective application in health presents unique challenges owing to the high-dimensional and multimodal nature of relevant input data that do not clearly map into text. In this paper, we defined a framework (HeLM) for mapping non-text data modalities into token embedding space and providing this information as context for a foundation LLM to perform disease risk prediction. Using data from the UK Biobank, we showed that HeLM is competitive with classic ML approaches and that—in some cases—the LLM outperforms these methods. The results highlight the promise of enabling LLMs to quantify underlying health risks by leveraging complex multimodal health data.

While these results demonstrate the effectiveness of a multimodal approach, there are many extensions to explore in future work. Though we have focused solely on scalar lab values and high-dimensional lung function data, biobanks contain individual-level data spanning a wide array of clinical modalities. An immediate next step is exploring the effectiveness of simultaneously embedding additional imaging data, such as fundus images or cardiac MRI, in a single input prompt to better understand how jointly modeling such inputs impacts predictive power. Additionally, this experimental setting motivates the study of how an LLM handles missing data and whether the model could “impute” predictive signal from various modality combinations. Furthermore, we have only explored whether an LLM can learn to understand non-text data, but have not assessed whether a similar approach can generate non-text-based outputs.

A major question that arises from this work is whether enabling LLMs to use multimodal data for health risk prediction can improve their ability to provide relevant personalized suggestions. To this end, we experimented with using HeLM to offer health-related recommendations that are conditioned on an understanding of risk. However, we found that conversational ability degraded after model tuning. This is consistent with previous observations [35], though there is some evidence that larger models are more robust to degradation [11].

In addition, it will be important for conversational agents to explain why they placed individuals into high or low risk categories, and to quantify their level of uncertainty. Alignment issues are also keenly important, as some health-related topics are sensitive and complex to explain, lack expert consensus for best practices, or are sensitive to the culture and preferences of the user.

Finally, LLM-based models that have been trained on available health-related data to quantify disease may show differential performance across demographic groups due to computational, systemic, and human biases [34]. Evaluation and mitigation of this is key to avoid perpetuating and increasing existing health disparities. These are complex issues that we have not attempted to address in this work, but will be extremely important to address before deploying multimodal LLMs for health.

Acknowledgements

The authors would like to thank Katrin Tomanek for providing software, inspiration, and know-how that influenced the direction of this work. We also thank Ted Yun for helpful discussions and feedback.

References

A Data Availability

Phenotypes and genotypes are available for approved projects through the UK Biobank study (https://www.ukbiobank.ac.uk). This research has been conducted under Application Number 65275.

B Definitions of UK Biobank phenotypes

We restricted analyses to European ancestry individuals within the UK Biobank to reduce phenotypic heterogeneity and limit the impact of population structure. To define European ancestry, we first filtered to individuals with self-reported “British” ancestry according to UKB field 21000. We then computed the medoid of the British ancestry set in the 15-dimensional genetic principal component (PC) space and calculated the distance of each individual in the UK Biobank to this medoid. Finally, we constructed the “European” set by selecting all individuals with a British-medoid distance of less than 40. This cutoff is based on the 99th percentile of distances of individuals who self-identified as British or Irish.
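
The ancestry filter above has two steps: find the medoid of the self-reported British set in PC space, then keep everyone within a distance cutoff of it. The sketch below illustrates both steps on 2-D toy points rather than 15-dimensional genetic PCs. Euclidean distance is an assumption (the distance metric is not stated), the function names are hypothetical, and the exhaustive medoid search shown here is quadratic in the number of points, so a cohort-scale implementation would need something more efficient.

```python
import math

def medoid(points):
    """Return the point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def select_within(points, center, cutoff):
    """Indices of points whose Euclidean distance to center is below cutoff."""
    return [i for i, p in enumerate(points) if math.dist(p, center) < cutoff]

# Toy 2-D stand-ins for genetic PCs of the self-reported British set.
british = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
center = medoid(british)

# Apply the cutoff to the whole (toy) cohort, not only the British set.
cohort = british + [(1.0, 1.0), (50.0, 50.0)]
print(center, select_within(cohort, center, cutoff=5.0))
# (0.0, 1.0) [0, 1, 2, 4]
```

Unlike a centroid, the medoid is always an actual member of the set, which keeps the reference point from being pulled toward outliers such as the fourth toy point.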

Table S1: The set of UK Biobank phenotypes and their Data-Field IDs. International Classification of Disease version 9 (ICD-9) hospital inpatient (HESIN) codes are taken from UKB field 41271 while International Classification of Disease version 10 (ICD-10) general practitioner note codes are taken from UKB field 42040. We define self-reported statuses according to the target disease coding using UKB field 20002. If multiple label sources are provided for a given phenotype (e.g., hypertension lists ICD-9, ICD-10, and self-report codes), we perform a binary OR across label sources to determine case-control status.

| Phenotype | UKB Data-Fields |
| --- | --- |
| Age | 21003 |
| BMI | 21001 |
| Chronotype | 1180 (“morning people” responded with “definitely a” or “more of a” morning person) |
| Daytime napping | 1190 (cases responded with “sometimes” or “usually”) |
| Diastolic blood pressure | 4079 |
| HDL cholesterol | 30760 |
| Insomnia | 1200 (cases responded with “sometimes” or “usually”) |
| LDL cholesterol | 30780 |
| Sex | 31 |
| Sleep duration | 1160 |
| Smoking status | 20160 |
| Snoring | 1210 (cases responded with “yes”) |
| Total cholesterol | 30690 |
| Triglycerides | 30870 |
| All cause mortality | 40000 |
| Asthma | ICD-9 code 493, ICD-10 codes J45 and J46, or self-report code 1111 |
| Cataract | ICD-10 codes H25 and H26 and fields 4700 and 131164-131167 |
| Diabetes | ICD-9 code 250, ICD-10 codes E10-E14, self-report codes 1220, 1222, 1223, and fields 2443, 6153, and 6177 |
| Gastro-oesophageal reflux (GERD) | ICD-10 codes K20 and K21 and field 131584 |
| Hay fever and eczema | ICD-10 codes L20-L30 and field 3761 |
| Hypertension | ICD-9 codes 401 and 405, ICD-10 codes I10 and I15, or self-report code 1065 |
| Major depression | ICD-10 codes F32 and F33 and field 20126 |
| Migraine | ICD-9 code 346, ICD-10 code G43, and self-report code 1265 |
| Myocardial infarction | ICD-9 codes 410, 412, 4109, and 4129, ICD-10 codes I21, I252, and Z034, self-report code 1075, and fields 6150, 131298, and 131299 |
| Osteoarthritis | ICD-10 codes M15-M19 and fields 131868, 131869, 131876, 131877, and 131870-131873 |
| Pneumonia | ICD-9 codes 480-484 and 486 and ICD-10 codes J12-J18 |
| Stroke | ICD-9 field 434.91, ICD-10 fields I63 and I64, and fields 6150, 131368, and 131369 |
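
The "binary OR across label sources" rule in the Table S1 caption can be illustrated in a few lines. The helper name `case_status` and the four toy individuals are hypothetical; each list holds one boolean per individual indicating whether that source flags the phenotype.

```python
def case_status(*sources):
    """Combine boolean label sources with a logical OR, per individual."""
    return [any(flags) for flags in zip(*sources)]

# Toy flags for four individuals from three label sources.
icd9 = [False, True, False, False]
icd10 = [False, False, True, False]
self_report = [False, True, False, False]

print(case_status(icd9, icd10, self_report))  # [False, True, True, False]
```

An individual is therefore a case if any single source flags them, and a control only if every source is negative.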

C LLM-based vs. logistic regression and XGBoost feature importances


Fig. S1: Comparison of feature weights between different scoring methods across four phenotypes. For each disease risk score (i.e., Zero-shot, Few-shot, Soft-prompt, and logistic regression), we fit a linear regression to predict the score given the original input features (i.e., age, sex, and BMI). By comparing the regression coefficients, we can get a sense of how much weight each scoring method gives to each input feature.
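
The feature-weight analysis in this caption boils down to an ordinary least-squares fit of each method's score on the input features. The following is a rough sketch using the normal equations and a small Gaussian-elimination solver; the helper names `solve` and `ols`, the toy individuals, and the synthetic score are all hypothetical. Whether the features were standardized before fitting is not stated, and coefficients are only directly comparable across features when they are on a common scale.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(features, scores):
    """Least-squares coefficients [intercept, w_1, ..., w_k] for the scores."""
    rows = [[1.0, *f] for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(k)]
    return solve(xtx, xty)

# Toy individuals as (age, sex, BMI) and a synthetic score that is exactly linear.
features = [(40, 0, 22), (50, 1, 27), (60, 0, 31), (45, 1, 24), (70, 1, 29), (55, 0, 35)]
scores = [0.1 + 0.01 * age + 0.2 * sex + 0.03 * bmi for age, sex, bmi in features]
print([round(c, 3) for c in ols(features, scores)])
```

Fitting the same regression to each method's scores and comparing the coefficient vectors gives the kind of side-by-side view the figure describes.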


Fig. S2: Comparison of feature weights between logistic regression, XGBoost and HeLM.

D Single task vs. mixture of tasks comparison

Table S2: Comparison of validation AUC and AUPRC across all cause mortality, hypertension, and myocardial infarction between a HeLM model trained on a mixture of all seven diseases and a HeLM model trained on only the target disease. Models were trained on the extended feature set comprising 14 clinical and wellness features. The mean AUC/AUPRC and 95% confidence intervals were calculated across 1,000 bootstrapping iterations. Bold cells denote the best models for a given phenotype and input feature set, where statistical significance is determined via paired bootstrapping.

| Phenotype | Model | AUC | AUPRC |
| --- | --- | --- | --- |
| All Cause Mortality | HeLM (Mixture) | 0.71 (0.68-0.75) | 0.18 (0.14-0.23) |
| | HeLM (Single Task) | 0.71 (0.67-0.74) | 0.17 (0.13-0.22) |
| Hypertension | HeLM (Mixture) | 0.79 (0.77-0.81) | 0.69 (0.66-0.72) |
| | HeLM (Single Task) | 0.79 (0.78-0.81) | 0.69 (0.66-0.73) |
| Myocardial Infarction | HeLM (Mixture) | 0.73 (0.69-0.77) | 0.15 (0.12-0.19) |
| | HeLM (Single Task) | 0.72 (0.68-0.76) | 0.16 (0.13-0.21) |
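
The bootstrapped means and 95% confidence intervals reported in these tables can be approximated with a percentile bootstrap. The sketch below is one common way to do it and is not necessarily the authors' exact procedure: it resamples individuals with replacement, skips resamples that contain only one class, and reads the interval off the sorted AUROC values. The names `auroc` and `bootstrap_ci` and the toy data are hypothetical, and the paired bootstrap used for significance testing is not shown.

```python
import random

def auroc(labels, scores):
    """AUROC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap mean and (1 - alpha) interval for the AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUROC needs at least one case and one control
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[round((alpha / 2) * n_boot)]
    hi = stats[round((1 - alpha / 2) * n_boot) - 1]
    return sum(stats) / n_boot, lo, hi

# Toy labels and risk scores for ten individuals (fabricated for illustration).
labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.3, 0.6]
mean_auc, lo, hi = bootstrap_ci(labels, scores, n_boot=1000)
print(round(auroc(labels, scores), 2), lo <= hi)
```

The pairwise AUROC used here is quadratic in sample size, which is fine for a toy example; a rank-based formulation would be preferable on a validation set of thousands of individuals.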

E Differential LLM recommendations based on predicted asthma risk


Fig. S3: Exercise recommendation frequency as a function of predicted asthma risk.


F Spearman’s rank correlation between different methods

Fig. S4: Spearman’s rank correlation between different methods’ scores across (a) all cause mortality, (b) diabetes, (c) hypertension, and (d) major depression.


Fig. S5: Spearman’s rank correlation between different methods’ scores across (a) migraine, (b) myocardial infarction, and (c) stroke.
