[论文翻译]MEDALPACA - 开源医疗对话AI模型及训练数据集合


原文地址:https://arxiv.org/pdf/2304.08247


MEDALPACA - AN OPEN-SOURCE COLLECTION OF MEDICAL CONVERSATIONAL AI MODELS AND TRAINING DATA

MEDALPACA - 开源医疗对话AI模型及训练数据集合

A PREPRINT

预印本

Tianyu $\mathrm{Han^{1,+}}$, Lisa C. Adams2,+, Jens-Michalis Papaioannou4, Paul Grundmann4, Tom Oberhauser4, Alexei Figueroa4, Alexander Löser4, Daniel Truhn1,+, and Keno K. Bressem2,5,+

Tianyu $\mathrm{Han^{1,+}}$、Lisa C. Adams2,+、Jens-Michalis Papaioannou4、Paul Grundmann4、Tom Oberhauser4、Alexei Figueroa4、Alexander Löser4、Daniel Truhn1,+和Keno K. Bressem2,5,+

1 Department of Radiology, University Hospital Aachen, Aachen, Germany Email: {tianyu.han, dtruhn}@ukaachen.de

1 德国亚琛大学医院放射科 邮箱: {tianyu.han, dtruhn}@ukaachen.de

2 Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany Email: lisa.adams@tum.de

2 德国慕尼黑工业大学医学与健康学院附属伊萨尔右岸医院介入放射诊断科
电子邮箱:lisa.adams@tum.de

4 Berliner Hochschule für Technik (BHT), Berlin, Germany Email: {michalis.papaioannou, pgrundmann, tom.oberhauser, alexei.figueroa, aloeser}@bht-berlin.de

4柏林工程应用技术大学 (BHT), 德国柏林
邮箱: {michalis.papaioannou, pgrundmann, tom.oberhauser, alexei.figueroa, aloeser}@bht-berlin.de

5 Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany Email: keno.bressem@tum.de

5 德国慕尼黑工业大学医学与健康学院心血管放射学与核医学系,德国心脏中心,慕尼黑工业大学附属医院
电子邮箱:keno.bressem@tum.de

+Contributed equally

+同等贡献

March 20, 2025

2025年3月20日

ABSTRACT

摘要

As large language models (LLMs) like OpenAI’s GPT series continue to make strides, we witness the emergence of artificial intelligence applications in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet, there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present an innovative dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning publicly accessible pre-trained LLMs on these datasets, and we then compare the performance of pre-trained-only models against the fine-tuned models on the examinations that future medical doctors must pass to achieve certification.

随着 OpenAI 的 GPT 系列等大语言模型 (LLM) 不断发展,我们见证了人工智能应用在日益广泛的领域中出现。在医学领域,这些大语言模型对改善医疗工作流程、诊断、患者护理和教育具有巨大潜力。然而,当前亟需可本地部署的开源模型来保护患者隐私。我们的工作提出了一个包含超过 16 万条目的创新数据集,专门用于微调大语言模型以实现有效的医疗应用。我们研究了这些数据集对公开可用的预训练大语言模型进行微调的影响,随后将纯预训练模型与微调模型在未来医生认证考试中的表现进行了对比分析。

Keywords Natural Language Processing $\cdot$ Artificial Intelligence $\cdot$ Medicine

关键词 自然语言处理 (Natural Language Processing) $\cdot$ 人工智能 (Artificial Intelligence) $\cdot$ 医学

1 Introduction

1 引言

The advent of large language models (LLMs), trained using reinforcement learning from human feedback (RLHF) and exemplified by OpenAI’s GPT series, has profoundly influenced the fields of natural language processing (NLP) and artificial intelligence (AI) research [1]. Their remarkable capacity to produce coherent, contextually apt, and intricate responses has increased their value across diverse domains. Notably, the medical field is poised to reap substantial benefits from the implementation of these models.

大语言模型 (LLM) 的出现深刻影响了自然语言处理 (NLP) 和人工智能 (AI) 研究领域 [1]。这类模型通过人类反馈强化学习 (RLHF) 训练而成,以OpenAI的GPT系列为代表,其生成连贯、语境恰当且复杂响应的卓越能力,使其在多个领域价值凸显。值得注意的是,医疗领域有望从这些模型的实施中获得巨大收益。

A salient benefit of these LLMs lies in their ability to perform tasks following instructions in natural language, thereby eliminating the necessity for users to have programming proficiency. This feature empowers medical professionals to seamlessly engage with and steer the models through diverse medical workflows.

这些大语言模型的一个显著优势在于,它们能够按照自然语言指令执行任务,从而消除了用户需要具备编程能力的要求。这一特性使医疗专业人员能够无缝参与并通过多样化的医疗工作流程来引导模型。

Potential applications include aiding medical professionals in note-taking, composing discharge letters, retrieving information from extensive documents, summarizing content, and converting free-form texts into structured formats [2, 3]. Provided the model has been trained on a sufficient number of medical documents, it may possess the medical knowledge necessary to assist in consultations by supplying accurate information derived from its base texts [4]. Furthermore, the training of medical students can also benefit from these models, wherein they assume the role of a study partner, capable of quizzing students or elucidating complex subjects, provided the model demonstrates sufficient coherence and accuracy. However, the most adept LLM models are currently not openly accessible, being available exclusively through APIs that necessitate data transmission to the parent company for processing.

潜在应用包括协助医疗专业人员记录笔记、撰写出院信函、从大量文档中检索信息、总结内容,以及将自由格式文本转换为结构化格式 [2, 3]。如果模型已在足够数量的医疗文档上进行训练,它可能具备必要的医学知识,通过从其基础文本中提取准确信息来协助诊疗咨询 [4]。此外,医学生的培训也能从这些模型中受益,即模型可充当学习伙伴的角色,只要其表现出足够的连贯性和准确性,就能对学生进行测验或解释复杂主题。然而,目前最先进的大语言模型并未公开开放,仅能通过需将数据传输至母公司进行处理的API接口使用。

Considering the sensitive nature of medical data and the imperative for robust privacy safeguards, non-transparent models with unclear data management practices are ill-suited for medical applications. To tackle this challenge and avert unauthorized data transfers, it is essential to employ open-source models that enable on-site implementation, thus mitigating privacy concerns.

考虑到医疗数据的敏感性以及对强有力隐私保护措施的迫切需求,那些数据管理实践不明确的不透明模型并不适合医疗应用。为解决这一挑战并防止未经授权的数据传输,必须采用支持现场实施的开源模型,从而缓解隐私问题。

Addressing this demand, we present a compilation of language models specifically fine-tuned for biomedical tasks. Utilizing a blend of new and established open-source biomedical datasets, we adapt them into an instruction-following format. This structure facilitates supervised fine-tuning as the initial phase, as detailed in [1].

为满足这一需求,我们整理了一套专为生物医学任务微调的语言模型。通过结合使用新旧开源生物医学数据集,我们将其调整为遵循指令的格式。如[1]所述,该结构以监督式微调作为初始阶段。

To assess the effectiveness of these models, we evaluate their performance on the United States Medical Licensing Examination (USMLE), a standardized assessment undertaken by medical students in the United States as part of their qualification process to become physicians. This evaluation offers valuable insights into the models’ competencies and prospective applications within the medical domain.

为评估这些模型的有效性,我们以美国医师执照考试(USMLE)作为基准进行测试。该标准化考试是美国医学生取得执业资格的关键环节,其评估结果能有效反映模型在医疗领域的核心能力与应用潜力。

We make all models and datasets publicly available, anticipating that they will confer significant advantages to both medical and AI researchers as well as practitioners in their respective fields.

我们公开所有模型和数据集,预计它们将为医学和AI研究人员以及各领域从业者带来显著优势。

2 Materials and Methods

2 材料与方法

2.1 Datasets

2.1 数据集

In this section, we present Medical Meadow, a collection of medical tasks that we have compiled for fine-tuning and evaluating the performance of large language models in the context of medicine. Medical Meadow consists of two main categories: a collection of established medical NLP tasks reformatted for instruction tuning, and a crawl of various internet resources. Each dataset focuses on different aspects of medical knowledge and practice, providing a comprehensive training and evaluation framework. See Table 1 for a detailed overview of the datasets.

在本节中,我们介绍 Medical Meadow,这是一个为微调和评估大语言模型在医学领域性能而整理的医疗任务集合。Medical Meadow 包含两大类:一组以指令调优格式重构的经典医疗 NLP 任务,以及从各类互联网资源爬取的内容。每个数据集聚焦于医学知识和实践的不同方面,提供了全面的训练与评估框架。具体数据集概览请参见表 1。

2.1.1 Dataset 1: Flash Cards Used by Medical Students

2.1.1 数据集1:医学生使用的抽认卡

Medicine as a whole encompasses a wide range of subjects that medical students and graduates must master in order to practice effectively. This includes a profound understanding of basic medical sciences, clinical knowledge, and clinical skills. The Anki Medical Curriculum flashcards are created and updated by medical students and cover the entirety of the medical school curriculum, addressing subjects such as anatomy, physiology, pathology, pharmacology, and more. These flashcards frequently feature succinct summaries and mnemonics to aid in the learning and retention of important medical concepts. In our investigation, we leveraged flashcards as a source to create question-answer pairs for training purposes. Upon excluding cards containing images, we harnessed OpenAI’s GPT-3.5-Turbo to restructure the cards into coherent, contextually pertinent question-answer pairs. Generally, the questions and answers are concise and targeted, as the flashcards offer limited space for incorporating extensive information. See Table 3 for representative Q/A pairs.

医学作为一个整体涵盖了医学生和毕业生必须掌握的广泛学科,以便有效行医。这包括对基础医学、临床知识和临床技能的深刻理解。Anki医学课程闪卡由医学生创建并更新,覆盖了医学院全部课程内容,涉及解剖学、生理学、病理学、药理学等学科。这些闪卡常采用简洁的总结和助记法来帮助学习和记忆重要医学概念。在我们的研究中,我们利用闪卡作为来源创建用于训练的问答对。在排除包含图像的卡片后,我们使用OpenAI的GPT-3.5-Turbo将卡片重组为连贯且上下文相关的问答对。通常,问题和答案都简洁且有针对性,因为闪卡提供的空间有限,无法包含大量信息。代表性问答对参见表3。

2.1.2 Dataset 2: Stack Exchange Medical Sciences

2.1.2 数据集 2: Stack Exchange医学科学版块

The Stack Exchange dataset consists of 52,475 question-answer pairs obtained from five Stack Exchange forums related to biomedical sciences and related fields:

Stack Exchange 数据集包含从五个与生物医学及相关领域相关的 Stack Exchange 论坛获取的 52,475 个问答对:

  1. Academia: This forum offers insights into research methodologies, scientific publication processes, and career paths within the scientific community. While not directly affiliated with medicine, considering the volume of medical research, it is likely that medical professionals will also consult models pertaining to this subject matter.
     学术界:该论坛提供科研方法、科学出版流程及科学界职业发展路径的见解。虽不直接隶属医学领域,但鉴于医学研究体量庞大,医疗从业者也很可能咨询与此主题相关的模型。

Table 1: Summary of medical datasets created for this work. For information regarding other, already published data, please refer to the respective original publication.

表 1: 本研究创建的医学数据集汇总。有关已发布数据的更多信息,请参阅相应原始文献。

| Dataset | Source | Description | n |
|---|---|---|---|
| **Finetuning** | | | |
| Medical Flash Cards | Anki Flashcards | Rephrased Q&A pairs derived from the front and back sides of medical flashcards | 33,955 |
| Stack Exchange | Academia | Q&A pairs generated from questions and their top-rated answers | 39,633 |
| | Biology | | 7,482 |
| | Fitness | | 3,026 |
| | Health | | 1,428 |
| | Bioinformatics | | 906 |
| Wikidoc | Living Textbook | Q&A pairs generated from paragraphs, where questions were formulated from rephrased paragraph titles, and answers were extracted from paragraph text | 67,704 |
| | Patient Information | Q&A pairs generated from paragraph headings and associated text content | 5,942 |
| **Evaluation** | | | |
| USMLE | Step 1 | Multiple choice questions from the USMLE self-assessment with image-based questions excluded | 119 |
| | Step 2 | | 120 |
| | Step 3 | | 135 |
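| 数据集 | 来源 | 描述 | 数量 |
|---|---|---|---|
| **微调数据集** | | | |
| 医学闪卡 | Anki闪卡 | 从医学闪卡正反面内容重构的问答对 | 33,955 |
| Stack Exchange | 学术 | 根据问题及其高赞回答生成的问答对 | 39,633 |
| | 生物学 | | 7,482 |
| | 健身 | | 3,026 |
| | 健康 | | 1,428 |
| | 生物信息学 | | 906 |
| Wikidoc | 活体教材 | 通过段落标题重构生成问题,并从段落文本提取答案形成的问答对 | 67,704 |
| | 患者信息 | 根据段落标题及相关文本内容生成的问答对 | 5,942 |
| **评估数据集** | | | |
| USMLE | 第一阶段 | 美国医师执照考试自测题库中的选择题(排除图像题) | 119 |
| | 第二阶段 | | 120 |
| | 第三阶段 | | 135 |
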
  2. Bioinformatics: As an interdisciplinary field combining biology, computer science, and data analysis, the Bioinformatics forum offers valuable information on the techniques and tools used for analyzing complex biological data, which is increasingly important in modern medical research.
     生物信息学:作为结合生物学、计算机科学和数据分析的交叉学科领域,生物信息学论坛提供了有关分析复杂生物数据的技术和工具的宝贵信息,这些信息在现代医学研究中日益重要。

To maintain a high level of answer quality, we collected data exclusively from responses that received a minimum of five up-votes within the forum discussions and paired them with their corresponding questions. See Table 4 for representative Q/A pairs.

为保持回答质量的高水准,我们仅收集论坛讨论中获得至少五个赞的回复数据,并将其与对应问题配对。代表性问答对请参见表 4。
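
The filtering rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' crawler: the `Answer` record, the `build_qa_pairs` helper, and the `input`/`output` field names are hypothetical, and only the five-up-vote threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """Hypothetical record for one forum answer."""
    question_id: int
    body: str
    upvotes: int


def build_qa_pairs(questions, answers, min_upvotes=5):
    """Pair each sufficiently up-voted answer with the text of its question."""
    pairs = []
    for ans in answers:
        # Keep only answers at or above the threshold whose question is known.
        if ans.upvotes >= min_upvotes and ans.question_id in questions:
            pairs.append({"input": questions[ans.question_id], "output": ans.body})
    return pairs


# Toy data, invented for illustration only.
questions = {1: "What is a therapeutic window?", 2: "How is death determined?"}
answers = [
    Answer(1, "The dose range between effect and toxicity.", 12),
    Answer(1, "Not sure.", 1),
    Answer(2, "By observing cardiac and neurological activity.", 5),
]
pairs = build_qa_pairs(questions, answers)
```

With the toy data, the answer with a single up-vote is dropped and the two remaining answers are each paired with their question.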

2.1.3 Dataset 3: Wikidoc

2.1.3 数据集3:Wikidoc

We incorporated medical question-answer pairs extracted from WikiDoc, a collaborative platform for medical professionals to share and contribute up-to-date medical knowledge. The platform has two main sub-sites, the "Living Textbook" and "Patient Information". The "Living Textbook" contains chapters for various medical specialties, which we crawled. We then used GPT-3.5-Turbo to rephrase each paragraph heading as a question and used the paragraph text as the answer. Patient Information is structured differently, in that each section subheading is already a question, making rephrasing unnecessary. See Table 5 for representative Q/A pairs.

我们整合了从WikiDoc平台提取的医学问答对。该平台是供医学专业人士共享和贡献最新医学知识的协作平台,包含两大子站点:"Living Textbook"和"Patient Information"。其中"Living Textbook"收录了各医学专科的章节内容,我们通过爬取这些章节,使用GPT-3.5-Turbo将段落标题改写为问题形式,并将段落内容作为答案。而"Patient Information"的结构有所不同,其每个小节标题本身已是问题形式,无需进行改写。代表性问答对参见表5。

2.1.4 Dataset 4: Medical NLP Benchmarks

2.1.4 数据集4:医疗NLP基准

We additionally use data from open NLP datasets and benchmarks, including:

我们还使用了来自开放NLP数据集和基准测试的数据,包括:

  1. The COVID-19 Open Research Dataset Challenge (CORD-19), consisting of more than one million scholarly articles [5]
  2. Benchmark data from Measuring Massive Multitask Language Understanding [6, 7]

  1. COVID-19开放研究数据集挑战赛 (CORD-19) ,包含超过一百万篇学术文献 [5]
  2. 来自《测量海量多任务语言理解》的基准数据 [6, 7]

2.2 Model Training

2.2 模型训练

Our models are built upon the LLaMA (Large Language Model Meta AI) foundation models. LLaMA represents a cutting-edge large language model released by Meta, demonstrating their commitment to open science. It is available in various sizes, including 7 billion, 13 billion, 33 billion, and 65 billion parameters. In this study, we fine-tuned the 7 and 13 billion parameter LLaMA variants, adhering to the approach delineated by Taori et al. [11].

我们的模型基于LLaMA (Large Language Model Meta AI) 基础模型构建。LLaMA是Meta发布的前沿大语言模型,体现了其对开放科学的承诺。该模型提供多种参数量版本,包括70亿、130亿、330亿和650亿参数。本研究遵循Taori等人[11]的方法,对70亿和130亿参数的LLaMA变体进行了微调。

We trained each model for five epochs, employing a learning rate of $2 \times 10^{-5}$ for the 7b model and $1 \times 10^{-5}$ for the 13b model, using a cosine learning rate scheduler. Gradient accumulation facilitated training with an effective batch size of 256. Given that this training impacts all model parameters, the hardware requirements are substantial. Consequently, we explored alternative training procedures.

我们对每个模型进行了五个周期的训练,7b模型采用$2 \times 10^{-5}$的学习率,13b模型采用$1 \times 10^{-5}$的学习率,并使用余弦学习率调度器。通过梯度累积实现了256的有效批次大小训练。由于该训练会影响所有模型参数,硬件需求较高。因此,我们探索了替代训练方案。
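
To make the schedule and batch arithmetic concrete, the sketch below computes a cosine-decayed learning rate and the number of gradient accumulation steps needed to reach the effective batch size of 256. It is illustrative only: the text does not say whether warmup was used (none is assumed here), and the per-device batch size and device count are hypothetical values.

```python
import math


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def accumulation_steps(effective_batch, per_device_batch, num_devices):
    """Number of micro-batches to accumulate before each optimizer step."""
    per_forward = per_device_batch * num_devices
    if effective_batch % per_forward != 0:
        raise ValueError("effective batch must be a multiple of the per-forward batch")
    return effective_batch // per_forward


PEAK_LR_7B = 2e-5  # learning rate stated for the 7b model

# Hypothetical hardware: 8 devices with 4 samples each per forward pass.
steps = accumulation_steps(256, per_device_batch=4, num_devices=8)
lr_midway = cosine_lr(step=50, total_steps=100, peak_lr=PEAK_LR_7B)
```

Under these assumed hardware values, each optimizer step accumulates gradients over 8 forward/backward passes, and the learning rate reaches half its peak value midway through training.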

First, we implemented Low-Rank Adaptation (LoRA) for weight updates to adapt the pre-trained language models to our specific tasks. LoRA is a method that involves freezing the pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture [12]. This approach substantially diminishes the number of trainable parameters and GPU memory requirements for downstream tasks, making it more efficient compared to full fine-tuning and significantly reducing training time.

首先,我们实现了低秩适应 (LoRA) 来进行权重更新,以使预训练语言模型适应我们的特定任务。LoRA 是一种冻结预训练模型权重并在 Transformer 架构的每一层中加入可训练秩分解矩阵的方法 [12]。与全参数微调相比,这种方法大幅减少了可训练参数数量和下游任务对 GPU 内存的需求,从而提高了效率并显著缩短了训练时间。
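
As a rough illustration of the idea, the sketch below forms the effective weight $W + \frac{\alpha}{r} BA$ for a toy layer and compares trainable parameter counts. It uses plain Python lists rather than a deep learning framework, and the rank, scaling factor, and layer dimensions are illustrative assumptions, not the settings used in this work.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A. W stays frozen; only A and B would be trained."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Parameter counts for full fine-tuning vs. a rank-r LoRA update of one matrix."""
    return d_out * d_in, r * (d_out + d_in)


# Toy sizes chosen for illustration only.
d_out, d_in, r, alpha = 6, 4, 2, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the initial update is zero

w_eff = lora_effective_weight(w, a, b, alpha, r)
full, lora = trainable_params(4096, 4096, 8)  # hypothetical 4096 x 4096 projection, rank 8
```

Because B is zero at initialization, the adapted layer starts out identical to the frozen one. For the hypothetical 4096 x 4096 projection at rank 8, the update has 65,536 trainable parameters instead of 16,777,216.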

To further decrease memory and compute demands, we employed 8-bit matrix multiplication for the feed-forward and attention projection layers, along with an 8-bit optimizer. When combined with LoRA, this strategy further reduces the memory needed for training [13] [14]. All models trained with LoRA underwent three epochs of training at a learning rate of 2e-5.

为了进一步降低内存和计算需求,我们在前馈网络和注意力投影层采用了8位矩阵乘法,并配合8位优化器。结合LoRA技术后,该策略能进一步减少训练所需内存 [13] [14]。所有使用LoRA训练的模型均以2e-5的学习率进行了三轮训练。
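
The memory saving from 8-bit storage can be illustrated with a simple absmax quantization round trip. This is a deliberately simplified sketch and not the scheme of [13] or [14], which the text does not describe in detail; the function names and sample weights are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.031, 0.9]  # toy values for illustration
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_err = max(abs(orig - rec) for orig, rec in zip(weights, restored))
```

Each value is stored in one byte plus a shared scale, at the cost of a rounding error of at most half a quantization step. This loss of precision is one plausible contributor to the accuracy gap reported in the Results section.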

2.3 Evaluation Procedure

2.3 评估流程

To evaluate the performance of the fine-tuned language models, we devised an assessment methodology centered on their zero-shot performance across the United States Medical Licensing Examination (USMLE) Step 1, Step 2, and Step 3 self-assessment datasets. We excluded all questions containing images, as our primary interest lies in the models’ language capabilities, and they lack visual abilities. We instructed the models to present answers in the format "Option: Answer" (e.g., "A: Penicillin"). If a model’s output did not adhere to this format, it was prompted again, up to five times, until the response was generated in the desired format. If the model still failed to provide the response in the desired format, the last response was retained.

为评估微调后语言模型的性能,我们设计了一种以美国医师执照考试(USMLE)第一阶段、第二阶段和第三阶段自测数据集上的零样本表现为核心的评估方法。我们排除了所有包含图像的问题,因为研究重点在于模型的语言能力,且它们不具备视觉能力。我们要求模型以"选项: 答案"的格式输出结果(例如"A: 青霉素")。若模型输出不符合该格式,系统会最多提示五次直至生成符合要求的响应。若模型最终仍无法按指定格式响应,则保留最后一次生成的答案。
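
The retry procedure above can be sketched as follows. This is an assumed reading, not the authors' evaluation script: `generate` stands in for any model call, the regular expression is one possible check for the "Option: Answer" format, "up to five times" is interpreted as five attempts in total, and malformed final replies are assumed to count as incorrect.

```python
import re

# One possible check for "Option: Answer", e.g. "A: Penicillin".
ANSWER_PATTERN = re.compile(r"^\s*([A-Z]):\s*\S.*")


def ask_with_retries(generate, prompt, max_attempts=5):
    """Query the model until the reply is well formatted; otherwise keep the last reply."""
    reply = ""
    for _ in range(max_attempts):
        reply = generate(prompt)
        if ANSWER_PATTERN.match(reply):
            break
    return reply


def extract_option(reply):
    """Return the option letter, or None if the reply is malformed."""
    match = ANSWER_PATTERN.match(reply)
    return match.group(1) if match else None


def accuracy(replies, gold):
    """Fraction of replies whose option letter matches the gold label."""
    correct = sum(extract_option(r) == g for r, g in zip(replies, gold))
    return correct / len(gold)


def make_stub(outputs):
    """Hypothetical stand-in for a model: returns canned replies in order."""
    it = iter(outputs)
    return lambda prompt: next(it)


stub = make_stub(["I think penicillin", "A: Penicillin"])
reply = ask_with_retries(stub, "Which antibiotic is first-line?")
score = accuracy(["A: Penicillin", "B: Amoxicillin", "nonsense"], ["A", "C", "A"])
```

Calling `accuracy` once per exam step yields the separate Step 1, Step 2, and Step 3 scores.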

Interestingly, most of the fine-tuned models typically produced answers in the correct format after the first prompt, while only the base LLaMA models required multiple prompts. We conducted separate evaluations for each model, measuring their accuracy on the USMLE Step 1, Step 2, and Step 3 datasets individually. This approach allowed us to gain a comprehensive understanding of the models’ performance across the various stages of the medical licensing examination.

有趣的是,大多数经过微调的模型通常能在首次提示后生成格式正确的答案,而基础LLaMA模型则需要多次提示。我们对每个模型分别进行了评估,单独测量它们在美国医师执照考试(USMLE)第一步、第二步和第三步数据集上的准确率。这种方法使我们能全面了解这些模型在医学执照考试不同阶段的表现。

3 Results

3 结果

Our findings on the USMLE test set are displayed in Table 2. Fine-tuned LLMs consistently surpassed the performance of their pre-trained-only counterparts. It is worth noting that while LoRA and 8-bit fine-tuning expedited the training process, employing these methods resulted in reduced accuracy.

我们在USMLE测试集上的发现如表2所示。经过微调的大语言模型持续超越仅预训练模型的性能。值得注意的是,虽然LoRA和8位量化微调加速了训练过程,但采用这些方法会导致准确率下降。

4 Discussion and conclusion

4 讨论与结论

In this study, we introduced a novel, high-quality collection of medical text data specifically designed for training instruction-following, medical large language models (LLMs). This dataset serves as a comprehensive resource for enhancing LLM performance in the medical domain, laying the groundwork for potential integration into medical education and practice.

在本研究中,我们引入了一个新颖、高质量的医疗文本数据集,专为训练遵循指令的医疗大语言模型(LLM)而设计。该数据集作为提升大语言模型在医疗领域表现的综合性资源,为未来融入医学教育和临床实践奠定了基础。

Table 2: Zero-shot performance on the USMLE self-assessment

| Model | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| LLaMA 7b [15] | 0.198 | 0.202 | 0.203 |
| Alpaca 7b naive [11] | 0.275 | 0.266 | 0.293 |
| Alpaca 7b LoRA | 0.220 | 0.138 | 0.252 |
| MedAlpaca 7b | 0.297 | 0.312 | 0.398 |
| MedAlpaca 7b LoRA | 0.231 | 0.202 | 0.179 |
| MedAlpaca 7b LoRA 8bit | 0.231 | 0.241 | 0.211 |
| ChatDoctor (7b) [10] | 0.187 | 0.185 | 0.148 |
| LLaMA 13b [15] | 0.222 | 0.248 | 0.276 |
| Alpaca 13b naive | 0.319 | 0.312 | 0.301 |
| MedAlpaca 13b | 0.473 | 0.477 | 0.602 |
| MedAlpaca 13b LoRA | 0.250 | 0.255 | 0.255 |
| MedAlpaca 13b LoRA 8bit | | | |

表 2: USMLE自评估零样本性能

| 模型 | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| LLaMA 7b [15] | 0.198 | 0.202 | 0.203 |
| Alpaca 7b naive [11] | 0.275 | 0.266 | 0.293 |
| Alpaca 7b LoRA | 0.220 | 0.138 | 0.252 |
| MedAlpaca 7b | 0.297 | 0.312 | 0.398 |
| MedAlpaca 7b LoRA | 0.231 | 0.202 | 0.179 |
| MedAlpaca 7b LoRA 8bit | 0.231 | 0.241 | 0.211 |
| ChatDoctor (7b) [10] | 0.187 | 0.185 | 0.148 |
| LLaMA 13b [15] | 0.222 | 0.248 | 0.276 |
| Alpaca 13b naive | 0.319 | 0.312 | 0.301 |
| MedAlpaca 13b | 0.473 | 0.477 | 0.602 |
| MedAlpaca 13b LoRA | 0.250 | 0.255 | 0.255 |
| MedAlpaca 13b LoRA 8bit | | | |

Using our medical text data, we fine-tuned several open-source LLM variants, adopting parameter-efficient tuning methodologies to address limited computing resources [16]. This approach is vital, as full fine-tuning of language model parameters is often unfeasible for most academic institutions. Our study demonstrates the viability of parameter-efficient fine-tuning.

利用我们的医学文本数据,我们对多个开源大语言模型变体进行了微调,采用参数高效调优方法以应对有限的计算资源[16]。这一方法至关重要,因为对语言模型参数进行全面微调对大多数学术机构而言往往不可行。我们的研究证明了参数高效微调的可行性。

We evaluated LLM performance using the United States Medical Licensing Examination (USMLE) for Steps 1, 2, and 3, which assess medical knowledge at various complexity levels. As expected, performance improved with larger pre-trained models. Applying approximation techniques, such as 8-bit precision and LoRA, during fine-tuning yielded less optimal results. However, due to considerable computational costs, we did not conduct extensive hyperparameter optimization and fine-tuning; thus, it may be possible to achieve performance comparable to vanilla-trained models through more thorough hyperparameter optimization, which we leave for future research.

我们使用美国医师执照考试(USMLE)的步骤1、2和3来评估大语言模型(LLM)性能,这些考试评估不同复杂程度的医学知识。正如预期的那样,预训练模型越大,性能越好。在微调过程中应用8位精度和LoRA等近似技术时,得到的结果不太理想。然而,由于计算成本较高,我们没有进行大量的超参数优化和微调;因此,通过更彻底的超参数优化,有可能达到与普通训练模型相当的性能,我们将此留待未来研究。

The availability of additional medical datasets will likely enhance the applicability and performance of these models, creating various potential applications such as extracting structured medical information from unstructured text, supporting medical students’ education through question-answering interactions to reinforce their knowledge and clarify lecture uncertainties, or assisting patients in understanding their health and improving communication between doctors and patients who often find medical language challenging.

额外医疗数据集的可用性可能会提升这些模型的适用性和性能,从而创造多种潜在应用场景,例如从非结构化文本中提取结构化医疗信息、通过问答互动支持医学生教育以巩固其知识并澄清讲座中的疑问,或帮助患者理解自身健康状况并改善医患沟通(患者通常认为医学术语难以理解)。

Nevertheless, implementing LLMs for these application scenarios presents challenges and concerns. Ensuring data privacy and compliance with ethical standards is critical when handling sensitive patient data; these concerns can be addressed by deploying models locally within secure hospital networks. Moreover, models must be thoroughly evaluated and safeguarded for potential biases and inaccuracies to prevent unintended consequences in medical decision-making.

然而,在这些应用场景中部署大语言模型(LLM)仍面临挑战与隐忧。处理敏感患者数据时,确保数据隐私和符合伦理标准至关重要——可通过在医院安全网络内部署本地模型来解决这些问题。此外,必须对模型进行彻底评估并防范潜在偏见与错误,以避免对医疗决策造成意外影响。

A significant limitation is LLMs’ tendency to confabulate or generate text that appears plausible but is factually incorrect [17]. This issue is especially concerning in the medical domain, where disseminating incorrect information can have serious implications for patient care and safety. Guaranteeing the accuracy and reliability of generated information is therefore essential, necessitating rigorous evaluation and continuous monitoring to mitigate confabulation risks and the potential harm they may cause in medical settings.

一个显著的限制是大语言模型 (LLM) 容易产生虚构内容,即生成看似合理但事实错误的信息 [17]。这一问题在医疗领域尤为令人担忧,因为传播错误信息可能对患者护理和安全造成严重影响。因此,确保生成信息的准确性和可靠性至关重要,需要通过严格评估和持续监控来降低虚构风险及其在医疗环境中可能造成的潜在危害。

In conclusion, our work substantially contributes to the field of LLMs in medicine by providing a novel, high-quality medical dataset for research and application purposes. Further, we successfully fine-tuned and evaluated various LLMs, demonstrating that their medical domain performance increases with pre-trained model size and high-quality data availability. This progress paves the way for further exploration and development of LLMs in medicine, with potential implications for medical education, patient care, and healthcare communication.

总之,我们的工作通过提供一个新颖、高质量的医学数据集供研究和应用,对大语言模型在医学领域的发展做出了重要贡献。此外,我们成功微调并评估了多种大语言模型,证明其医学领域性能会随着预训练模型规模扩大和高质量数据可用性而提升。这一进展为医学领域大语言模型的进一步探索和开发铺平了道路,可能对医学教育、患者护理和医疗沟通产生深远影响。

5 Acknowledgements

5 致谢

The authors acknowledge the Scientific Computing of the IT Division at the Charité - Universitätsmedizin Berlin for providing computational resources that have contributed to the research results reported in this paper. URL: https://www.charite.de/en/research/research_support_services/research_infrastructure/science_it/#c30646

作者感谢柏林Charité大学医学院IT部门的科学计算团队为本研究提供的计算资源支持。URL: https://www.charite.de/en/research/research_support_services/research_infrastructure/science_it/#c30646

References

参考文献

6 Appendix

6 附录

Table 3: Representative question from the medical flashcards dataset.

| Question | Answer |
|---|---|
| What is the relationship between very low Mg2+ levels, PTH levels, and Ca2+ levels? | Very low Mg2+ levels correspond to low PTH levels which in turn results in low Ca2+ levels. |
| What are the cautionary considerations and contraindications for the use of β-blockers? | β-blockers must be used cautiously in decompensated heart failure and are contraindicated in cardiogenic shock. |
| What is the recommended next step in management for a stable child who is suspected of having intussusception and a "target sign" on ultrasound? | The recommended next step in management for a stable child with suspected intussusception and a "target sign" on ultrasound is to undergo an ultrasound-guided air contrast enema. This procedure involves using a small amount of air and contrast material to inflate the colon and push the telescoped intestine back into place. It is a non-invasive and effective method for both diagnosing and treating intussusception in children. In some cases, the procedure may need to be repeated or followed by surgery if the intussusception cannot be reduced. [...] |
| What condition should be suspected in a young infant who is healthy and active and has blood-tinged stools? What is the cause of this condition and how is it managed? | The presence of blood-tinged stools in a young infant who is healthy and active should raise suspicion of protein-induced proctocolitis. This condition is caused by an intolerance to milk or soy protein, which can cause inflammation in the lining of the rectum. The management of protein-induced proctocolitis involves continuing breastfeeding and advising the mother to avoid dairy and soy products in her diet. In most cases, this will lead to resolution of the symptoms within a few days to a few weeks. If the symptoms persist, further [...] |
| Which muscle abducts the arm from 0 to 15° and what is the name of the nerve that innervates it? | The muscle that abducts the arm from 0 to 15° is the supraspinatus, which is innervated by the suprascapular nerve. |
| What is the most common cause of community-acquired pneumonia? | Streptococcus pneumoniae is the most common cause of community-acquired pneumonia [...] acquired outside of a hospital or healthcare setting. It can be caused by a variety of bacteria, viruses, and other microorganisms, but Streptococcus pneumoniae is the most frequent cause. Other common causes of community-acquired pneumonia include Haemophilus influenzae, Mycoplasma pneumoniae, and Legionella pneumophila, among others. The specific cause of community-acquired pneumonia can vary depending on a person's age, underlying health conditions, and other factors. |
| What is a risk factor for pyelonephritis related to the urinary system? | One risk factor for pyelonephritis is vesicoureteral reflux. |
| Which type of fibrillation is a fatal arrhythmia that requires immediate CPR and defibrillation to prevent death? | Ventricular fibrillation is the type of fibrillation that is a fatal arrhythmia without immediate CPR and defibrillation. |

表 3: 医学抽认卡数据集中的代表性题目。

| 问题 | 答案 |
|---|---|
| 极低 Mg2+ 水平、PTH 水平和 Ca2+ 水平之间的关系是什么? | 极低 Mg2+ 水平对应低 PTH 水平,进而导致低 Ca2+ 水平。 |
| 使用 β-受体阻滞剂时有哪些注意事项和禁忌症? | β-受体阻滞剂在失代偿性心力衰竭中需谨慎使用,心源性休克患者禁用。 |
| 对于疑似肠套叠且超声检查发现"靶征"的稳定患儿,推荐下一步处理措施是什么? | 对疑似肠套叠且超声显示"靶征"的稳定患儿,推荐进行超声引导下空气灌肠复位。该操作通过注入少量空气和造影剂使结肠膨胀,将套叠肠管推回原位,是一种无创且有效的诊断治疗方法。若复位失败,可能需重复操作或手术干预。[...] |
| 健康活跃的婴儿出现血便应考虑什么疾病?其病因和处理方法是什么? | 健康活跃婴儿出现血便应怀疑蛋白诱导性直肠结肠炎。该病由牛奶或大豆蛋白不耐受引发,可导致直肠黏膜炎症。处理方法包括继续母乳喂养并建议母亲避免摄入乳制品和大豆蛋白,多数症状会在数日至数周内缓解。若症状持续需进一步 [...] |
| 负责手臂0-15°外展的肌肉及其支配神经是什么? | 冈上肌负责手臂0-15°外展,由肩胛上神经支配。 |
| 社区获得性肺炎最常见的病原体是什么? | 肺炎链球菌是社区获得性肺炎最常见病原体。其他常见病原体包括流感嗜血杆菌、肺炎支原体和嗜肺军团菌等,具体病原体因患者年龄、基础健康状况等因素而异。 |
| 与泌尿系统相关的肾盂肾炎危险因素有哪些? | 膀胱输尿管反流是肾盂肾炎的危险因素之一。 |
| 哪种颤动属于致命性心律失常,需立即进行心肺复苏和除颤? | 室颤是需立即进行心肺复苏和除颤的致命性心律失常。 |

Table 4: Representative question from the Stack Exchange dataset.

| Question | Answer |
|---|---|
| Given the thalidomide birth defect disaster of the 1960s, why was it later approved by the U.S. FDA for treatment of leprosy (1996) and of multiple myeloma (2006)? Presumably, the teratogenic issues and problems with permanent nerve damage didn't go away, as the liver can convert the "good" enantiomer to the "bad." Thalidomide affects even non-pregnant women, so why approve a drug with known severe complications? | Because leprosy and multiple myeloma are conditions for which other treatment options are limited. All treatments have side effects, some more debilitating than others. The trick with pharmacology is to balance the benefit of treatment with the hazards of the side effects. Take cancer chemotherapy. The drugs we use for cancer chemotherapy are horrible poisons. We literally use derivatives of chemical weapons to treat cancer. Why do we deliberately poison cancer patients with what amounts to a weapon of mass destruction? Because if we don't, the cancer will kill them. In pharmacology there's a concept of a therapeutic window. That is, there's a particular amount of compound that will cause death or other severe disability. There's another (hopefully lower) level of the compound that will effectively treat the disease. In treating, we hope to hit that middle ground: enough to treat the disease, but not enough to kill the patient. Different drugs have different therapeutic window. The window for over-the-counter drugs like asprin is rather large: the amount needed for severe side effects is many times higher than what people take to treat headaches. That's why we can allow untrained people to self-dose. (Though look at acetaminophen/paracetamol as an over-the-counter example where the window is not as large as it probably should be.) [...] |
| What is the way they determine someone died before declaring them dead? There are stories I've heard of over the years, people who's heart stopped, were brain dead, fell under ice rivers for half an hour and miraculously survived. So there are many instances of unexplained recoveries. How do the medical teams or doctors determine then that this person won't all of sudden get a heart beat back? | [...] say, compared to true deaths), but it occasionally happens that someone - even in a hospital - is thought to be dead when they are not actually dead. In one of your stories, the girl was presumed to be brain dead, not dead. So strike that one. Hypothermia is a beast all unto itself, and declaring a cold person dead is a bit tricky.* Determining death is not simple. The International Guidelines for the Determination of Death - Phase I (May 30-31, 2012) Montreal Forum Report is 46 pages long and it still doesn't have a definitive conclusion. For the most part (and to simplify a bit), death is determined to have occurred when someone is exceedingly unlikely (determined from experience of millions of deaths) to regain function of their heart. It can be from a very wide variety of causes, but basically it follows cardiac arrest or respiratory arrest leading to cardiac arrest. The procedure is to observe the patient carefully. In hospital, that usually includes electronic monitors of one sort or another. Out of hospital it's by observation. When there is no evidence of cardiac electrical activity capable of generating a pulse, the patient has not been breathing for some time, oxygenation of blood has fallen to beyond critical levels, and there is no neurological activity, they are pronounced dead. [...] |

表 4: Stack Exchange 数据集中的代表性问答

| 问题 | 回答 |
|---|---|
| 考虑到 20 世纪 60 年代的沙利度胺 (thalidomide) 致畸灾难,为何美国 FDA 后来仍批准其用于治疗麻风病 (1996) 和多发性骨髓瘤 (2006)?理论上致畸问题和永久性神经损伤风险并未消除,因为肝脏会将"良性"对映体转化为"恶性"。沙利度胺甚至会影响非孕妇,为何还要批准这种已知具有严重并发症的药物? | 因为麻风病和多发性骨髓瘤的治疗选择非常有限。所有治疗都存在副作用,有些比其他更严重。药理学的关键在于平衡治疗效果与副作用风险。以化疗为例:我们使用的化疗药物本质上是剧毒物质,实际应用的是化学武器衍生物。为何要用大规模杀伤性武器级别的毒素治疗患者?因为不治疗癌症就会致命。药理学中有个"治疗窗 (therapeutic window)"概念:既存在导致死亡或严重残疾的化合物剂量,也存在(期望更低的)有效治疗剂量。理想情况是找到中间值:足以治病但不足以致命。不同药物的治疗窗各异,例如阿司匹林等非处方药的治疗窗很大,引发严重副作用的剂量远高于治疗头痛的剂量(但对乙酰氨基酚/扑热息痛的治疗窗就比理想值窄得多)。[...] |
| 医生如何确定患者真正死亡而非假死状态?多年来我听过许多案例:心脏停跳者、脑死亡者、冰河溺水半小时后奇迹生还者。既然存在这么多无法解释的复苏案例,医疗团队如何判定患者不会突然恢复心跳? | [...](假死案例极为罕见)但医院偶尔也会误判死亡。你提到的案例中女孩是被判定为脑死亡而非临床死亡。低温症的情况更为特殊,判断体温过低者是否死亡尤为困难*。死亡判定非常复杂,《国际死亡判定指南-第一阶段》(2012年蒙特利尔论坛报告)长达46页仍无定论。简而言之,当患者极不可能(基于数百万死亡案例经验)恢复心脏功能时即判定死亡。可能由多种原因导致,但基本遵循心脏骤停或呼吸停止引发心脏骤停的路径。判定程序包括密切观察:院内通常使用电子监护设备,院外则依靠观察。当确认无有效心电活动、长时间无自主呼吸、血氧低于临界值且无神经活动时,即可宣布死亡。[...] |

It seems there are a ton of easy ab exercises, but very few for the lower back to keep the muscles balanced as your abs get stronger. What are simple home exercises for the lower back?

似乎有很多简单的腹肌训练动作,但针对下背部的练习却很少,难以在腹肌变强时保持肌肉平衡。有哪些适合在家进行的简单下背部锻炼?

Try rolling over onto your stomach and doing a "superman". You can hold tension in the position, or do reps similar to doing "reverse situps". ref site Also, you might think about doing slow squats or wall sits where you focus on contracting both your abs and lower back. Tense up so as to make the line between anus and belly-button as short as possible during this exercise - maybe you can replace your ab work at the same time!

试着翻身趴下做一个"超人式"。你可以保持这个姿势的张力,或者像做"反向仰卧起坐"那样进行多次重复。参考网站 此外,你可以考虑做慢速深蹲或靠墙静蹲,同时专注于收缩腹部和下背部肌肉。锻炼时要用力绷紧,尽量缩短肛门到肚脐的距离——说不定能顺便替代你的腹部训练!

Question: The accepted range for the wavelengths of light that the human eye can detect is roughly between $400\mathrm{nm}$ and $700\mathrm{nm}$. Is it a coincidence that these wavelengths are identical to those in the Photosynthetically Active Radiation (PAR) range (the wavelength of light used for normal photosynthesis)? Alternatively is there something special about photons with those energy levels that is leading to stabilising selection in multiple species as diverse as humans and plants?

Answer: Good question. If you look at the spectral energy distribution in the accepted answer here, we see that photons with wavelengths less than $300 \mathrm{nm}$ are absorbed by species such as ozone. Much beyond 750 nm infrared radiation is largely absorbed by species such as water and carbon dioxide. Therefore the vast majority of solar photons reaching the surface have wavelengths that lie between these two extremes. Therefore, I would suggest that surface organisms will have adapted to use these wavelengths of light whether it be used in photoreceptors or in photosynthesis since these are the wavelengths available; i.e., organisms have adapted to use these wavelengths of light, rather than these wavelengths being special per se (although in the specific case of photosynthesis there is a photon energy sweet spot). For example this study suggests that some fungi might actually be able to utilize ionizing radiation in metabolism. This suggests that hypothetical organisms on a world bathed in ionizing radiation may evolve mechanisms to utilize this energy.

问题:人眼可检测的光波长范围大致在 $400\mathrm{nm}$ 至 $700\mathrm{nm}$ 之间。这些波长恰好与光合有效辐射 (PAR) 范围(用于正常光合作用的光波长)一致,这是巧合吗?或者说,具有这些能量水平的光子是否存在某种特殊性,从而在人类和植物等差异巨大的多个物种中导致了稳定选择?

回答:好问题。如果查看此处已采纳回答中的光谱能量分布,会发现波长小于 $300 \mathrm{nm}$ 的光子会被臭氧等物质吸收,而远超过 750 nm 的红外辐射则主要被水和二氧化碳等物质吸收。因此到达地表的太阳光子绝大多数波长都介于这两个极端值之间。因此我认为,地表生物已经适应了利用这些波长的光(无论是用于光感受器还是光合作用),因为可用的正是这些波长;也就是说,是生物适应了这些波长,而非这些波长本身有何特殊之处(不过就光合作用而言,确实存在一个光子能量的最佳区间)。例如这项研究表明,某些真菌可能真的能在新陈代谢中利用电离辐射。这表明,假想中沐浴在电离辐射下的星球上的生物可能会进化出利用这种能量的机制。

Table 5: Representative question from the Wikidoc Living Textbook and Patient Information.
Question: What is the explanation for the epidemiology and demographics of anaphylaxis?
Answer: In 2006, the lifetime prevalence of Anaphylaxis was estimated to be 0.05% to 2% in the United States. However it is believed that this is an underestimation because the disease is underdiagnosed. The mortality rate of Anaphylaxis is approximately 186 to 225 deaths/year. Patients of all age groups may develop anaphylaxis, however, children and adolescents account for the majority of cases. There is no racial predilection to anaphylaxis. Anaphylaxis affects men and women equally with more women being admitted to the hospital for the disease. There is no regional predilection to anaphylaxis. An estimated 1.24% to 16.8% of the United States population is considered at risk for developing anaphylaxis if they are exposed to one or more allergens. Anaphylaxis results [...]tion is cardiovascular collapse

Question: What do coronary angiographic projections refer to?
Answer: For the beginner angiographer the anatomic landmarks formed by the spine [...] view from which the image is obtained. In the LAO view the catheter and spine are seen on the right side of the image, while in the RAO they are found on the left. PA imaging places these landmarks in the center of the image. Cranial angulation can usually be distinguished from caudal angulation by the presence of the diaphragm. For cranial imaging, the patient should be asked to inspire to remove the diaphragmatic shadow from the image.

Question: What are the symptoms of Hypothermia?
Answer: As people develop hypothermia, their abilities to think and move are often lost slowly. In fact, they may even be unaware that they need emergency treatment. Someone with hypothermia also is likely to have frostbite. The symptoms include: Drowsiness; Weakness and loss of coordination; Pale and cold skin; Confusion; Uncontrollable shivering (although at extremely low body temperatures [...] arrest, shock, and coma can set in without prompt treatment. Hypothermia can be fatal.

Question: Who is at risk for hereditary pancreatitis?
Answer: Studies demonstrate that cationic trypsinogen gene mutations are associated [...] trypsinogen "R122H", "N29I". Furthermore, hereditary pancreatitis has also been linked to an increased lifetime risk of pancreatic cancer.

Question: Who is at highest risk for Glomerular disease?
Answer: The following may increase your risk of this condition: Blood or lymphatic system disorders; Exposure to hydrocarbon solvents; History of cancer; Infections such as strep infections, viruses, heart infections, or abscesses. Many conditions cause or increase the risk for glomerulonephritis, including: Amyloidosis; Anti-glomerular basement membrane antibody disease; Blood vessel diseases such as vasculitis or polyarteritis; Focal segmental glomerulosclerosis; Goodpasture syndrome; Heavy use of pain relievers, especially NSAIDs; Henoch-Schonlein purpura; IgA nephropathy; Lupus nephritis; Membranoproliferative GN

Question: What to expect if I have Intracranial aneurysms (Outlook/Prognosis)?
Answer: The outcome varies. Patients who are in deep comas after an aneurysm rupture generally do not do as well as those with less severe symptoms. Ruptured cerebral aneurysms are often deadly. About 25% of people die within 1 day, and another 25% die within about 3 months. Of those who survive, about 25% will have some sort of permanent disability.

表 5: Wikidoc活体教科书与患者信息中的代表性问答。

问题 答案
如何解释过敏反应的流行病学和人口统计学特征? 2006年,美国过敏反应的终生患病率估计为0.05%至2%。但由于漏诊情况普遍,实际数字可能更高。过敏反应的年死亡率约为186至225例。各年龄段人群均可发病,但儿童和青少年占多数。该病无种族倾向,男女发病率相当,但女性住院率更高。无地域性差异。约1.24%至16.8%的美国人群在接触过敏原时存在发病风险。过敏反应最严重的后果是心血管性虚脱。
冠状动脉造影投影指的是什么? 对初学者而言,脊柱形成的解剖标志是判断影像来源视角的关键。在左前斜位(LAO)视图中,导管和脊柱位于影像右侧;右前斜位(RAO)则相反。后前位(PA)成像时这些标志位于影像中心。头侧成角可通过膈肌影与尾侧成角区分,进行头侧成像时应嘱患者吸气以消除膈肌阴影。
低温症有哪些症状? 患者随着体温降低会逐渐丧失思维和行动能力,甚至意识不到需要急救,常伴随冻伤。症状包括:嗜睡、虚弱及协调性丧失、皮肤苍白冰冷、意识模糊、不可控颤抖(极低体温时可能停止)。如不及时治疗会导致心跳骤停、休克和昏迷,可能致命。
哪些人群易患遗传性胰腺炎? 研究表明阳离子胰蛋白酶原基因突变(如"R122H"、"N291")与该病相关。此外,遗传性胰腺炎患者终生罹患胰腺癌的风险显著增加。
哪些人群是肾小球疾病的高危群体? 危险因素包括:血液/淋巴系统疾病、碳氢溶剂接触史、癌症病史、链球菌感染等感染性疾病。多种病症可引发或加重肾小球肾炎,如:淀粉样变性、抗GBM抗体病、血管炎等血管病变、局灶节段性肾小球硬化、古德帕斯彻综合征、过量使用止痛药(尤其是NSAIDs)、过敏性紫癜、IgA肾病、狼疮性肾炎、膜性增生性GN等。
颅内动脉瘤患者的预后如何? 预后差异较大。动脉瘤破裂后深度昏迷患者恢复情况通常不如症状较轻者。破裂性脑动脉瘤致死率高:约25%患者在24小时内死亡,另有25%在3个月内死亡。幸存者中约25%会遗留永久性残疾。