Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes
Abstract
The development of large language models tailored for handling patients’ clinical notes is often hindered by the limited accessibility and usability of these notes due to strict privacy regulations. To address these challenges, we first create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. While Asclepius is trained on synthetic data, we assess its potential performance in real-world applications by evaluating it using real clinical notes. We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives. To further validate our approach using synthetic notes, we also compare Asclepius with its variants trained on real clinical notes. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones when constructing high-performing clinical language models. This conclusion is supported by detailed evaluations conducted by both GPT-4 and medical professionals. All resources—including weights, codes, and data—used in the development of Asclepius will be made publicly accessible for future research.
1 Introduction
Clinical notes serve as an extensive repository of information specific to individual patients. Applying Natural Language Processing (NLP) techniques to these notes can significantly enhance the decision-making processes of medical professionals (Demner-Fushman et al., 2009; Lederman et al., 2022; Wu et al., 2022). Recent advances in large language models (LLMs) such as OpenAI’s GPT series (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023) have shown promising results in analyzing these clinical notes (Agrawal et al., 2022; Hu et al., 2023; Liu et al., 2023b; Tang et al., 2023). However, when health organizations try to utilize these API-based external LLMs, they encounter two major challenges.
Figure 1: The large clinical language model, Asclepius, trained solely on synthetic clinical notes, can effectively handle various clinical NLP tasks on real notes in a zero-shot setting.
The first challenge is privacy and security. Hospitals must transmit sensitive patient information beyond their internal systems when using these API-based external LLMs. This could potentially infringe on privacy regulations. Even when the external model adheres to regulations such as Health Insurance Portability and Accountability Act (HIPAA), hospitals should still undertake careful measures such as de-identifying clinical notes and setting up secure transmission protocols to avoid privacy breaches. This complicates the usage of external models. The second challenge relates to the autonomy that a health organization would need to exercise over its LLMs. Given each organization’s unique environment and characteristics, they may prefer a model specifically tailored to their needs. In light of these challenges, there is an increasing demand for a clinical LLM that can operate securely in an offline environment while still offering the effectiveness of powerful online LLMs such as GPT series.
To develop a specialized LLM capable of handling clinical notes, a specific training dataset is required. This dataset would consist of instruction-answer pairs drawn from real clinical notes. Creating such a dataset, however, introduces its own set of challenges. The first is the daunting task of acquiring clinical notes, which is almost impossible for external developers and even challenging for internal developers associated with a health organization due to privacy regulations. Secondly, even when clinical notes are procured, creating a clinical instruction set necessitates either direct annotation from medical professionals or leveraging external models that have a strong understanding of clinical practices, such as the GPT series (Liévin et al., 2022; Nori et al., 2023; Dash et al., 2023; Javaid et al., 2023). The former approach is laborious and costly, making it impractical for large-scale use, while the latter approach presents the previously mentioned challenges related to privacy and security that are inherent to API-based models.
To address these multifaceted challenges in clinical settings, we introduce Asclepius, a clinical LLM constructed based on a comprehensive collection of synthetic clinical notes and corresponding instruction-answer pairs. These synthetic notes are generated from PMC-Patients (Zhao et al., 2023), containing anonymized case reports extracted from PubMed Central, a publicly available biomedical literature archive. Usage of synthetic notes, unlike real ones, not only enables us to leverage advanced online LLMs to produce comprehensive and high-quality clinical instruction datasets, but also allows for the sharing of these resources and the models trained on them as open-source. Throughout the entire process of generating these data, we utilized GPT-3.5-turbo, and medical professionals were involved in prompt tuning to ensure the output’s clinical accuracy and relevancy. As a culmination of these efforts, we developed Asclepius-7B and Asclepius-13B, our advanced clinical LLMs capable of handling diverse clinical NLP tasks (see Figure 1).
We evaluated Asclepius using a rigorous framework that aligns with its intended real-world applications, utilizing real clinical notes as our primary evaluation dataset. For this evaluation, we gathered clinical notes from a diverse set of sources including MIMIC-III (Johnson et al., 2016), MIMIC-IV (Johnson et al., 2023), i2b2 (Uzuner et al., 2007), CASI (Moon et al., 2014), and MTSamples, thereby ensuring a broad coverage of notes from various institutions. The first goal of our evaluation involved a comparison between our model and GPT-3.5-turbo. This comparison allowed us to assess Asclepius’s capability to perform on par with an API-based LLM across different clinical NLP tasks. Additionally, we compared Asclepius against a diverse array of open-source LLMs, including both general domain and clinical-biomedical domain models. This comparison aimed to validate our model’s performance against other locally available LLMs. We were particularly interested in Asclepius-R, a variant trained with real clinical notes. Comparing Asclepius with Asclepius-R enabled us to assess the relative performance of models trained with synthetic notes against those trained with real ones. If a significant performance gap was found, it could possibly challenge our approach’s validity. Hence, this comparison emphasizes the effectiveness of our method in training a clinical LLM using synthetic notes. In the overall evaluation process, we utilized GPT-4 as an evaluator, which is known to have advanced medical knowledge, to assess the models’ performance. Furthermore, for the crucial comparison between Asclepius and Asclepius-R, four clinicians were involved in the evaluation to substantiate our claim.
Our key contributions can be summarized as follows:
Figure 2: The first column is a part of the real discharge summary from MIMIC-III (Johnson et al., 2016). The second is a case report from PMC-Patients (Zhao et al., 2023), and the third is the synthetic discharge summary created from this case report. Initially, the case report did not resemble the real clinical note in terms of format, but after the transformation, it more closely resembles the real clinical note. The last column shows an instruction-answer pair generated from the synthetic clinical note. GPT-3.5-turbo was used in all generation processes.
By leveraging our methodology, any entity – from healthcare organizations to individual researchers – can develop an LLM capable of understanding and interacting with clinical notes. This breakthrough will serve as a crucial steppingstone to accelerate the research and development of healthcare AI, which has been previously deterred by stringent (yet essential) privacy regulations.
2 Data Generation
In Section 2.1, we discuss the differentiation between clinical notes and case reports and detail how to convert case reports into synthetic clinical notes. Section 2.2 delves into extracting specific instruction-answer pairs from these notes for training the clinical LLM. Figure 2 illustrates this process with an accompanying example. It is important to note that our method uses only public data, allowing unrestricted use of an LLM (GPT-3.5-turbo). All prompts utilized are listed in Appendix A.
In this research, we specifically focus on the discharge summary, a specific type of clinical note that is extensively used in a variety of clinical tasks. Henceforth in this paper, the term clinical note will specifically refer to the discharge summary.
2.1 Synthetic Clinical Notes
Clinical notes are comprehensive records created by healthcare providers to document the care administered to a patient during their stay in a medical facility. These notes contain sensitive personal health information of patients, and as such, their access and usage are strictly regulated. Although public datasets like MIMIC-III (Johnson et al., 2016) and MIMIC-IV (Johnson et al., 2023) exist, access is limited to credentialed individuals, such as those who have completed CITI training. This limitation also applies to any products derived from these datasets, such as synthetic data or generative models trained using MIMIC, making it challenging to share them publicly. On the other hand, case reports are detailed reports on individual patients prepared for academic or educational purposes. They are fully anonymized and publicly available through medical journals. The contents of a case report mirror those of clinical notes, encompassing admission details, laboratory test results, official diagnoses, and treatment plans. Given these similarities, we hypothesized that creating a clinical large language model using case reports would yield a model with performance comparable to one built using authentic clinical notes. Additionally, this approach would make the model widely accessible without any restrictions, as it would be based on publicly available, anonymized data.
However, using case reports to directly create a large clinical language model as a substitute for real clinical notes presents a problem due to differences in their characteristics. Firstly, case reports are written with the intention of being published in academia; thus, they use well-organized and standardized language, whereas clinical notes often contain frequent abbreviations, non-standard terminology, and occasional grammatical errors (Lehman et al., 2023). Second, case reports are presented in a continuous narrative form, written in plain text paragraphs. Clinical notes, in contrast, are designed for quick referencing by healthcare professionals. These notes are typically semi-structured through headers such as ’History’, ’Physical Examination’, ’Assessment’, and ’Plan’. To bridge this gap, we used GPT-3.5 to transform case reports into synthetic clinical notes, giving an instruction to mimic the traits found in real clinical notes. Another consideration during this process was the hallucination risk of GPT-3.5 (Ji et al., 2023). Even if the synthetic clinical note closely resembles a real clinical note, any hallucination leading to clinical inconsistency would undermine its validity as a clinical note. Therefore, we explicitly specified in the prompt that clinical entities should not be generated in the synthetic clinical notes if they were not mentioned in the case report. During the prompt tuning process, clinicians participated and reviewed 50 random samples for each prompt to ensure that the outputs resembled real clinical notes and did not contain inaccuracies or inconsistencies with the original case report.
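To make this transformation step concrete, the sketch below shows how a case report could be converted into a synthetic discharge summary through the OpenAI chat API. It is a minimal illustration rather than the exact pipeline: the prompt wording is a paraphrase of the constraints described above (the prompts actually used are listed in Appendix A), and the client setup assumes the `openai` Python package with an API key in the environment.

```python
# Illustrative sketch of the case-report-to-note transformation (not the exact prompt;
# see Appendix A for the prompts actually used).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a physician writing a hospital discharge summary. "
    "Rewrite the given case report as a semi-structured discharge summary with headers "
    "such as 'History', 'Physical Examination', 'Assessment', and 'Plan', using the "
    "abbreviations and terse style typical of real clinical notes. "
    "Do not introduce any clinical entity that is not mentioned in the case report."
)

def case_report_to_note(case_report: str) -> str:
    """Convert one PMC-Patients case report into a synthetic discharge summary."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": case_report},
        ],
    )
    return response.choices[0].message.content
```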
Consequently, we have obtained 158k high-quality synthetic clinical notes using case reports from the PMC-Patients dataset (Zhao et al., 2023). An example of a case report and its converted synthetic clinical note can be found in Appendix B. We used perplexity as a measurement to quantitatively evaluate the similarity of these synthetic clinical notes to real ones. For this comparison, we further fine-tuned a pre-trained language model, LLaMA (Touvron et al., 2023), on a corpus of 57k real discharge summaries from the MIMIC-III database (Johnson et al., 2016). Then, we measured the perplexity of 200 discharge summaries from three different actual hospital datasets: MIMIC-III (unseen during training), MIMIC-IV (Johnson et al., 2023), and i2b2 (Uzuner et al., 2007). The MIMIC-III and MIMIC-IV datasets originate from Beth Israel Deaconess Medical Center, whereas i2b2 comes from a different institution, Partners Healthcare. We also calculated the perplexity of the 200 case reports from PMC-Patients using the same model. Finally, we evaluated the perplexity of the synthetic notes transformed from the specific 200 case reports that we had previously measured for perplexity. Our findings, summarized below, show that the perplexity of real hospital data ranges from 2.186 (in-domain data from MIMIC-III) to 5.178 (data from another hospital, i2b2). The PMC-Patients’ case reports initially had a perplexity of 71.719, but upon transformation into synthetic notes, while preserving the same contents, it dropped to 4.816, thus falling within the range observed for real hospital data. These results suggest that our synthetic notes likely exhibit a high degree of validity, comparable to real hospital data.
| MIMIC-III | MIMIC-IV | i2b2 | PMC-Patients | Synthetic |
| --- | --- | --- | --- | --- |
| 2.186 | 2.809 | 5.178 | 71.719 | 4.816 |
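As a rough sketch of how the perplexity comparison above could be reproduced, the snippet below scores a set of documents with a causal language model using Hugging Face Transformers. The checkpoint path is a hypothetical placeholder for a LLaMA model fine-tuned on MIMIC-III discharge summaries, and the truncation length is illustrative rather than the exact configuration used in the paper.

```python
# Minimal sketch of the perplexity measurement: score each note with a causal LM
# (here a hypothetical LLaMA checkpoint fine-tuned on MIMIC-III discharge summaries).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-finetuned-on-mimic3"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).cuda().eval()

@torch.no_grad()
def perplexity(text: str, max_length: int = 2048) -> float:
    """Token-level perplexity of one document under the fine-tuned model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).input_ids.cuda()
    loss = model(ids, labels=ids).loss  # mean cross-entropy over the tokens
    return math.exp(loss.item())

def corpus_perplexity(notes: list[str]) -> float:
    """Average document perplexity over a sample of notes (e.g., 200 per source)."""
    return sum(perplexity(n) for n in notes) / len(notes)
```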
2.2 Clinical Instruction Generation
To develop a clinical large language model capable of performing various clinical NLP tasks, a specific training dataset, in the form of instruction-answer pairs, is necessary. Considering that our model is targeted towards healthcare professionals, we aimed to incorporate their diverse needs into the instruction sets. We initiated the process by defining clinical NLP tasks, based on a comprehensive survey by Wu et al. (2022), which analyzed widely used clinical NLP tasks. This task list was further refined through consultations with professionals, leading to eight specific task types: Named Entity Recognition, Relation Extraction, Temporal Information Extraction, Coreference Resolution, Question Answering, Abbreviation Expansion, Summarization, and Paraphrasing. We created instruction-answer pairs for these eight clinical NLP tasks using GPT-3.5-turbo, based on synthetic clinical notes. The method for creating these pairs is as follows.
- GPT-3.5-turbo was first prompted with each synthetic clinical note to generate task-specific instructions covering the eight task types defined above.
- The generated instructions were fed back into the model along with the notes, prompting the model to generate the corresponding answers (see the sketch below). While many studies attempt to generate instructions and answers simultaneously for efficiency (Wang et al., 2023; Taori et al., 2023), our empirical findings indicated that a sequential generation results in more detailed instructions and answers.
Employing this approach, we were able to generate high-quality clinical instruction-answer pairs for each synthetic note, culminating in a total of 158,114 pairs. Similar to the synthetic notes generation process, physicians were directly involved in the prompt tuning process, thereby ensuring the quality of the instruction-answer pairs. Examples of the generated instructions can be found in Appendix C.
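A minimal sketch of this two-pass generation is shown below. The task list comes from this section, but the prompt wording, the one-pair-per-note sampling, and the helper names are illustrative assumptions; the prompts actually used are given in Appendix A.

```python
# Sketch of the sequential (two-pass) instruction-answer generation: an instruction is
# produced first, then fed back with the note to obtain its answer.
import random
from openai import OpenAI

client = OpenAI()

TASKS = [
    "Named Entity Recognition", "Relation Extraction", "Temporal Information Extraction",
    "Coreference Resolution", "Question Answering", "Abbreviation Expansion",
    "Summarization", "Paraphrasing",
]

def chat(system: str, user: str) -> str:
    """Single-turn call to GPT-3.5-turbo."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return out.choices[0].message.content

def build_pair(note: str) -> dict:
    task = random.choice(TASKS)  # illustrative: one task type sampled per note
    # Pass 1: derive a task-specific instruction grounded in this note.
    instruction = chat(
        f"Write one {task} instruction that can be answered using only the clinical note below.",
        note,
    )
    # Pass 2: feed the instruction back together with the note to get the answer.
    answer = chat(
        "Answer the instruction using only information contained in the clinical note.",
        f"Clinical note:\n{note}\n\nInstruction:\n{instruction}",
    )
    return {"note": note, "instruction": instruction, "answer": answer}
```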
3 Clinical Large Language Model
3.1 Training
Recent research (Taori et al., 2023; Chiang et al., 2023; Geng et al., 2023; Han et al., 2023; Yunxiang et al., 2023; Toma et al., 2023) has demonstrated the effectiveness of fine-tuning with instruction datasets on foundation language models, such as LLaMA (Touvron et al., 2023). Inspired by these findings, we designed a language model specifically for clinical notes, using LLaMA as the base and incorporating instructions from synthetic clinical notes. Distinct from other studies, we added an additional step to our process to address a persistent challenge: language models, trained on general domain texts, often struggle to accurately capture the peculiarities found in clinical texts (Laparra et al., 2020). Previous research has attempted to solve this problem by pre-training base models on clinical notes (Alsentzer et al., 2019; Lewis et al., 2020). Adopting this approach, we applied domain adaptation to LLaMA by pre-training it on synthetic clinical notes before fine-tuning it with clinical instructions. Detailed information about the pre-training and instruction fine-tuning processes can be found in Appendix E. As a result, we developed two models, Asclepius-7B and Asclepius-13B. To our knowledge, these are the first publicly accessible clinical LLMs capable of managing multiple tasks without necessitating task-specific fine-tuning.
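The two-stage recipe can be summarized with the rough Hugging Face sketch below: one causal-LM pass over the raw synthetic notes for domain adaptation, followed by a second pass over the instruction-answer pairs rendered into a prompt template. The base checkpoint identifier, file paths, and hyperparameters are placeholders; the settings actually used are described in Appendix E.

```python
# Rough sketch of the two-stage training: (1) continued pretraining on synthetic notes,
# (2) instruction fine-tuning on the generated pairs. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "huggyllama/llama-7b"  # placeholder identifier for the LLaMA base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

def causal_lm_stage(model_path: str, data_file: str, output_dir: str, epochs: int) -> str:
    """One causal-LM training stage, reused for both pretraining and fine-tuning."""
    model = AutoModelForCausalLM.from_pretrained(model_path)
    data = load_dataset("json", data_files=data_file)["train"].map(tokenize, batched=True)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=4, learning_rate=2e-5, bf16=True)
    trainer = Trainer(model=model, args=args, train_dataset=data,
                      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir

# Stage 1: domain adaptation, where each record's "text" field is a raw synthetic note.
stage1 = causal_lm_stage(BASE, "synthetic_notes.jsonl", "asclepius-7b-pt", epochs=1)
# Stage 2: instruction tuning, where "text" holds note + instruction + answer in one template.
causal_lm_stage(stage1, "instruction_pairs.jsonl", "asclepius-7b", epochs=3)
```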
3.2 Evaluation
In our study, we utilized the capabilities of GPT-4 to assess the performance of our models. GPT-4 has been applied in numerous studies as a means to evaluate the results of Natural Language Generation (NLG) models (Liu et al., 2023a; Chiang et al., 2023). According to these studies, evaluations derived from GPT-4 – using indicators such as helpfulness and fluency – closely align with human judgment.
However, in the context of the clinical domain, the consequences of mistakes are dire, and there is less tolerance for inaccuracies than in the general domain. Consequently, we tailored our evaluation criteria to prioritize accuracy, relevancy, and completeness, as any misinformation or omission could potentially result in adverse patient outcomes. We designed evaluation prompts for GPT-4 to address these specific clinical concerns, and responses are scored on a clinician-certified four-point scale.
The full prompt can be found in Appendix A.4.
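As an illustration only, a GPT-4 scoring call along these lines might look as follows. The rubric text here is a paraphrase standing in for the full evaluation prompt in Appendix A.4, and the exact inputs given to the evaluator follow that prompt rather than this sketch.

```python
# Hedged sketch of scoring one model response with GPT-4 on a four-point scale.
# The real rubric and prompt are in Appendix A.4; this wording is an illustrative stand-in.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are a clinical expert grading an AI model's answer to an instruction about a "
    "clinical note. Prioritize accuracy, relevancy, and completeness, since misinformation "
    "or omission can harm patients. Reply with a single integer from 1 (unacceptable) to 4 (ideal)."
)

def gpt4_score(note: str, instruction: str, response: str) -> int:
    out = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Clinical note:\n{note}\n\nInstruction:\n{instruction}\n\nModel response:\n{response}"
            )},
        ],
    )
    return int(out.choices[0].message.content.strip())
```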
4 Comparative Analysis
Despite the various benefits of a clinical LLM trained on synthetic notes, a model’s ultimate value lies in its performance on real clinical notes. Accordingly, our evaluation framework employs real discharge summaries as an evaluation dataset, establishing a more authentic and applicable testing ground for our model, Asclepius. For this evaluation, we gathered clinical notes from a diverse set of sources including MIMIC-III (Johnson et al., 2016), MIMIC-IV (Johnson et al., 2023), i2b2 (Uzuner et al., 2007), MTSamples, and CASI (Moon et al., 2014), thereby ensuring a broad coverage of notes from various institutions.
Figure 3: The evaluation score from GPT-4 across diverse tasks and models. These tasks include: (A) MIMIC-III and MIMIC-IV (B) i2b2 and MTSamples (C) CASI (D) DiSCQ. The percentages listed beneath the GPT-4 scores represent the ratio of each model’s score compared to the highest score achieved within that same model size category. The error bars represent a 95% confidence interval.
We conduct a comparative study to analyze the performance of Asclepius against several others using the GPT-4 evaluation specified in Section 3.2. Our initial point of comparison is GPT-3.5-turbo, wherein we aim to ascertain whether Asclepius can match the performance and versatility of an API-based LLM in various clinical NLP tasks. We also include an evaluation of other open-source instruction fine-tuned LLMs that are trained on general domain data, such as Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023), and those tailored for the clinical-biomedical domain, such as MedAlpaca (Han et al., 2023), ChatDoctor (Yunxiang et al., 2023), and Clinical-Camel (Toma et al., 2023). Including these models aims to compare Asclepius’s performance on clinical NLP tasks with other locally available models, thereby validating our methodology in developing our model.
Lastly, we include Asclepius-R, a variant of our model, trained using 57k real clinical notes from the MIMIC-III dataset (Johnson et al., 2016). Asclepius-R, having been both pre-trained and fine-tuned on these real notes, is directly compared to Asclepius, our model trained on 158k synthetic notes. This comparison allows us to explore the performance of models developed with synthetic data in relation to those trained with real data. It is important to clarify that our objective is not to claim that synthetic notes can completely replace real ones, but rather to show that a model trained on synthetic notes can be a viable alternative to one trained on real data. To optimize their performances, we leveraged the maximum amount of data available for each model. The performances of Asclepius and Asclepius-R, when trained on datasets of the same size, are detailed in the ablation study in Appendix F. By including Asclepius-R in our analysis, we can thoroughly assess the potential of our method in training large language models using synthetic clinical notes instead of real ones.
4.1 Preliminary Evaluation
We conducted a comparative analysis of our model, Asclepius, with Asclepius-R, GPT-3.5-turbo, and other open-source instruction-tuned large language models. The initial performance assessment involved MIMIC-III (unseen during the training of Asclepius-R) and MIMIC-IV discharge summaries, which are in-domain data for Asclepius-R as they come from the same health institution used for its training. We then extracted instruction data from these summaries to compile a test set, following the methodology outlined in Section 2.2. As shown in Figure 3-(A), Asclepius, trained on synthetic notes, demonstrated performance closely aligned with that of Asclepius-R, which was trained on in-domain data. Moreover, when we conducted the same evaluation on i2b2 notes and MTSamples (Figure 3-(B)), which originated from different institutions and were of types not used in the training of Asclepius-R, the performance gap between Asclepius and Asclepius-R narrowed further for both the 7B and 13B models.
Another key observation is that Asclepius outperforms all open-source LLMs and even exhibits performance comparable to GPT-3.5-turbo. However, it is important to consider that the test set, created from the aforementioned discharge summaries, followed the same process used for the training sets of Asclepius and Asclepius-R. This could potentially bias the comparison in their favor. To ensure a fairer comparison, we broadened our evaluation to directly employ prompts that were used in Agrawal et al. (2022), which addresses Coreference Resolution and Abbreviation Expansion tasks on the CASI dataset (Moon et al., 2014). As illustrated in Figure 3-(C), even for previously unseen types of prompts, the Asclepius model 1) outperformed all other open-source LLMs and 2) displayed performance closely aligned with GPT-3.5 for the 13B model. This pattern is consistent across all individual benchmarks, detailed in Appendix D.
4.2 Practical Evaluation
Since clinical LLMs are designed for use by professionals in actual healthcare settings, it is crucial to test their effectiveness on actual queries posed by healthcare professionals. As such, we utilized the DiSCQ dataset (Lehman et al., 2022) - a set of clinician-posed questions derived from MIMIC-III discharge summaries - for practical evaluation. However, since the authors of the DiSCQ dataset allowed clinicians to annotate questions freely while reading the discharge summaries, without providing specific guidance, it is often impossible to find answers to the questions within the corresponding discharge summaries. This presents a significant challenge when evaluating a model’s performance in answering these questions. To mitigate this issue, we first used GPT-4 to filter the dataset, tasking it with identifying any evidence within the discharge summary that could potentially answer a given question. We then randomly selected 100 questions from this filtered dataset for our evaluation. Refer to Appendix G for more detail.
The results depicted in Figure 3-(D) confirm that the performance on questions annotated by real clinicians shows the same pattern as before. Our model, Asclepius, demonstrated significant superiority over other baseline models. In the case of the 13B model, its performance was on par with GPT-3.5-turbo, despite being ten times smaller in model size. Moreover, when compared with Asclepius-R, the performance remains comparable. Based on these findings, it may be suggested that building a clinical LLM from real patient notes - which poses a privacy risk - might not be necessary. It is possible that a model with similar performance could be achieved using synthetic notes.
4.3 Professional Evaluation
Figure 4: Professional and GPT-4 evaluation of Asclepius-13B and Asclepius-R-13B responses to 100 DiSCQ questions, featuring inter-professional Krippendorff’s alpha $(\alpha)$ agreement and GPT-4 to professional average alignment via Pearson, Kendall-Tau, and Spearman coefficients $(\sigma,\tau,\rho)$. The error bars represent a 95% confidence interval.
Despite GPT-4’s advanced medical knowledge, boasting an accuracy of 86% on the United States Medical Licensing Examination (Nori et al., 2023), it is uncertain whether the conclusions drawn by GPT-4 match those of actual healthcare professionals. Considering that professionals are the most likely users of our model, it is necessary to involve them in validating our previous conclusion that a model trained on synthetic notes performs comparably to one trained on real notes.
To address this, we solicited evaluations from healthcare professionals for Asclepius-13B and Asclepius-R-13B on the DiSCQ dataset. Concurrently, we measured the alignment of the professionals’ evaluations with that of GPT-4, thus bolstering the validity of our previous evaluations. The evaluation was carried out by a team of four clinicians. We asked them to rate the quality of responses generated by the two models (using the criteria in Section 3.2) to the same 100 questions from the DiSCQ dataset that were used in Section 4.2. We ensured that each question was evaluated by at least two experts, allowing us to also assess inter-rater agreement among them. The overall process and its result are visualized in Figure 4, and the user interface employed for this process can be seen in Appendix H.
Our statistical analysis revealed a Krippendorff’s alpha $(\alpha)$ of 0.53. As a measure of agreement among evaluators, this value signifies a moderate level of inter-annotator agreement (Landis and Koch, 1977), offering preliminary assurance of the credibility of our evaluations. The clinicians assigned average scores of 3.03 and 3.15 to Asclepius-13B and Asclepius-R-13B, respectively. We conducted a paired sample t-test on the evaluations from the clinicians, comparing Asclepius and Asclepius-R. The result did not reject the null hypothesis stating that the performance of the two models is equivalent (p-value $= 0.18$). Additionally, when a statistical test of the same kind was applied to the scores provided by GPT-4, the null hypothesis could not be rejected in this case either (p-value $= 0.40$). While the interpretation of these results is limited by the sample size of 100, it nonetheless offers a preliminary conclusion that the performance of the two models is approximately similar. When comparing the alignment of GPT-4 and the professionals’ evaluations, we found a moderate level of Pearson ($\sigma = 0.41$), Kendall-Tau ($\tau = 0.36$), and Spearman ($\rho = 0.39$) correlation coefficients (Landis and Koch, 1977), lending further validity to our previous experiments that were solely evaluated by GPT-4.
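For reference, the agreement and significance statistics reported above can be computed along the following lines, assuming the third-party `krippendorff` package and SciPy. The arrays below are random placeholders standing in for the collected ratings, and the variable names are illustrative.

```python
# Sketch of the reported statistics: Krippendorff's alpha, a paired t-test between models,
# and GPT-4-to-clinician correlations. Placeholder data; replace with the collected scores.
import numpy as np
import krippendorff
from scipy import stats

rng = np.random.default_rng(0)
clinician_ratings = rng.integers(1, 5, size=(2, 100)).astype(float)    # raters x questions
clinician_avg_asclepius = clinician_ratings.mean(axis=0)                # per-question mean, Asclepius-13B
clinician_avg_asclepius_r = rng.integers(1, 5, size=100).astype(float)  # per-question mean, Asclepius-R-13B
gpt4_scores = rng.integers(1, 5, size=100).astype(float)                # GPT-4 scores on the same questions

# Inter-annotator agreement among the clinicians (ordinal 1-4 scale).
alpha = krippendorff.alpha(reliability_data=clinician_ratings, level_of_measurement="ordinal")

# Paired t-test: are Asclepius and Asclepius-R rated differently on the same 100 questions?
t_stat, p_value = stats.ttest_rel(clinician_avg_asclepius, clinician_avg_asclepius_r)

# Alignment of GPT-4 with the clinicians' average scores.
pearson_r, _ = stats.pearsonr(gpt4_scores, clinician_avg_asclepius)
kendall_tau, _ = stats.kendalltau(gpt4_scores, clinician_avg_asclepius)
spearman_rho, _ = stats.spearmanr(gpt4_scores, clinician_avg_asclepius)
print(f"alpha={alpha:.2f}, p={p_value:.2f}, r={pearson_r:.2f}, tau={kendall_tau:.2f}, rho={spearman_rho:.2f}")
```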
5 Related Work
5.1 Synthetic Clinical Notes
Efforts to create synthetic clinical notes include Melamud and Shivade (2019) using LSTM (Hochreiter and Schmidhuber, 1997) on MIMIC-III discharge summaries, Ive et al. (2020) employing the transformer architecture (Vaswani et al., 2017) with the MHR (Perera et al., 2016) and MIMIC-III databases, Li et al. (2021) using text generation models like GPT-2 (Radford et al., 2019) on the i2b2 2010 (Uzuner et al., 2011) and n2c2 2018 datasets (Henry et al., 2020) for data augmentation, and Zhou et al. (2022)’s BERT-based method (Devlin et al., 2019) for de-identifying MIMIC-III clinical records. All of these synthetic notes were derived from real hospital data, which implies certain constraints on their usage. Distinctively, our approach harnesses publicly accessible case reports for generating synthetic notes. Thus, our synthetic data does not possess the limitations seen in the aforementioned works. Consequently, models trained on our data are free from such constraints, making them shareable with the public.
5.2 Language Models for Clinical NLP tasks
Several clinical language models have been developed, each designed to address specific clinical NLP tasks. Notable examples include ClinicalBERT (Alsentzer et al., 2019), Clinical-Longformer (Li et al., 2023), and GatorTron (Yang et al., 2022), which are based on the transformer encoder structure, and Clinical-T5 (Lehman et al., 2023), which is based on the transformer encoder-decoder structure. All these models are pre-trained using clinical notes and then fine-tuned for each specific task. While this approach has shown to be effective, the limited size of these models restricts their ability to perform multiple tasks simultaneously. This limitation reduces their practicality in real-world scenarios, as it is more convenient to address various tasks with a single model rather than managing multiple models specialized for each task. Asclepius is the first clinical large language model that is capable of handling multiple clinical NLP tasks.
6 Conclusion
In this paper, we present Asclepius, trained on 158k high-quality synthetic clinical notes and instruction sets. Evaluations across diverse benchmark datasets against other LLMs demonstrate that Asclepius performs on par with GPT-3.5-turbo while being locally executable in hospital settings. Most importantly, it exhibits no significant disparity with models trained on actual clinical notes, thereby validating the use of synthetic notes for training clinical large language models. The evaluations were not solely reliant on GPT-4 but also involved appraisals by four clinicians, reinforcing the validity of our conclusions. For future research, all synthetic data, model weights, and code used in these experiments are publicly available. This opens the door for not only healthcare institutions but also businesses and researchers to develop clinical large language models. We believe this has the potential to significantly advance healthcare AI, especially in areas previously held back by privacy concerns.
Acknowledgements
This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No.RS-2019-II190075), National Research Foundation of Korea (NRF) grant (NRF-2020H1D3A2A03100945, RS-2023- 00262527), funded by the Korea government (MSIT), and NAVER Digital Bio Innovation Research Fund, funded by NAVER Corporation (Grant No.3720230020).
Limitations
This study has several limitations that should be acknowledged. Firstly, our model was primarily designed and tested only on discharge summaries, which may limit its application and generalizability to other types of clinical notes, such as progress notes, nursing notes, or radiology notes. Future research should aim to develop a model that can effectively function with a wider variety of note types. Secondly, our model is currently only capable of handling one-turn instruction-following tasks. This may constrain its use in more dynamic and interactive healthcare settings where conversations between the model and healthcare professionals are required for a comprehensive understanding of the patient’s clinical notes. We plan to extend this model in future studies to allow interactive dialogues, thereby increasing its utility and applicability in real-world clinical scenarios. Third, we initially used GPT for data generation, but its terms of use prohibit using its output to train models for business competition. However, with the recent advances in open-source LLMs, this issue could be addressed by replacing GPT’s role with one of them. Lastly, but most importantly, we did not extensively investigate the model’s hallucination capacity, which may affect its reliability and accuracy when implemented in practice. Our model can generate hallucinated responses, which may cause critical issues in practical applications (see Appendix I). It is important to note that the current model is intended for research purposes and should not yet be used in actual clinical practice. Further research is required to rigorously test and enhance the model’s performance and ensure its safe and effective use in clinical settings.
References
Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag. 2022. Large language models are few-shot clinical information extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1998–2022.
Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, WA Redmond, and Matthew BA McDermott. 2019. Publicly available clinical BERT embeddings. NAACL HLT 2019, page 72.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.
Debadutta Dash, Rahul Thapa, Juan M Banda, Akshay Swaminathan, Morgan Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H Chen, Saurabh Gombar, Lance Downing, et al. 2023. Evaluation of gpt-3.5 and gpt-4 for supporting real-world information needs in healthcare delivery. arXiv preprint arXiv:2304.13714.
Dina Demner-Fushman, Wendy W Chapman, and Clement J McDonald. 2009. What can natural language processing do for clinical decision support? Journal of biomedical informatics, 42(5):760–772.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research. Blog post, April, 1.
Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. 2023. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247.
Sam Henry, Kevin Buchan, Michele Filannino, Amber Stubbs, and Ozlem Uzuner. 2020. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1):3– 12.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Yan Hu, Iqra Ameer, Xu Zuo, Xueqing Peng, Yujia Zhou, Zehan Li, Yiming Li, Jianfu Li, Xiaoqian Jiang, and Hua Xu. 2023. Zero-shot clinical entity recognition using chatgpt. arXiv preprint arXiv:2303.16416.
Julia Ive, Natalia Viani, Joyce Kam, Lucia Yin, Somain Verma, Stephen Puntis, Rudolf N Cardinal, Angus Roberts, Robert Stewart, and Sumithra Velupillai. 2020. Generation and evaluation of artificial mental health records for natural language processing. NPJ digital medicine, 3(1):69.
Mohd Javaid, Abid Haleem, and Ravi Pratap Singh. 2023. Chatgpt for healthcare services: An emerging stage for an innovative perspective. Bench Council Transactions on Benchmarks, Standards and Evaluations, 3(1):100105.
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
Alistair Johnson, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. 2023. MIMIC-IV-Note: Deidentified free-text clinical notes.
Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data, 3(1)