MEDALPACA - AN OPEN-SOURCE COLLECTION OF MEDICAL CONVERSATIONAL AI MODELS AND TRAINING DATA

A PREPRINT

Tianyu Han1,+, Lisa C. Adams2,+, Jens-Michalis Papaioannou4, Paul Grundmann4, Tom Oberhauser4, Alexei Figueroa4, Alexander Löser4, Daniel Truhn1,+, and Keno K. Bressem2,5,+

1 Department of Radiology, University Hospital Aachen, Aachen, Germany Email: {tianyu.han, dtruhn}@ukaachen.de

2 Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany Email: lisa.adams@tum.de

4 Berliner Hochschule für Technik (BHT), Berlin, Germany Email: {michalis.papaioannou, pgrundmann, tom.oberhauser, alexei.figueroa, aloeser}@bht-berlin.de

5 Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany Email: keno.bressem@tum.de

+Contributed equally

March 20, 2025

ABSTRACT

As large language models (LLMs) like OpenAI’s GPT series continue to make strides, we witness the emergence of artificial intelligence applications in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet, there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present an innovative dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning publicly accessible pre-trained LLMs on these datasets, and we then compare the performance of pre-trained-only models against the fine-tuned models on the examinations that future medical doctors must pass to achieve certification.

Keywords: Natural Language Processing · Artificial Intelligence · Medicine

1 Introduction

The advent of large language models (LLMs), trained using reinforcement learning from human feedback (RLHF) and exemplified by OpenAI’s GPT series, has profoundly influenced the fields of natural language processing (NLP) and artificial intelligence (AI) research [1]. Their remarkable capacity to produce coherent, contextually apt, and intricate responses has increased their value across diverse domains. Notably, the medical field is poised to reap substantial benefits from the implementation of these models.

A salient benefit of these LLMs lies in their ability to perform tasks following instructions in natural language, thereby eliminating the necessity for users to have programming proficiency. This feature empowers medical professionals to seamlessly engage with and steer the models through diverse medical workflows.

Potential applications include aiding medical professionals in note-taking, composing discharge letters, retrieving information from extensive documents, summarizing content, and converting free-form texts into structured formats [2, 3]. Provided the model has been trained on a sufficient number of medical documents, it may possess the medical knowledge necessary to assist in consultations by supplying accurate information derived from its base texts [4]. Furthermore, the training of medical students can also benefit from these models, which can assume the role of a study partner capable of quizzing students or elucidating complex subjects, provided the model demonstrates sufficient coherence and accuracy. However, the most capable LLMs are currently not openly accessible, being available exclusively through APIs that necessitate data transmission to the parent company for processing.

Considering the sensitive nature of medical data and the imperative for robust privacy safeguards, non-transparent models with unclear data management practices are ill-suited for medical applications. To tackle this challenge and avert unauthorized data transfers, it is essential to employ open-source models that enable on-site implementation, thus mitigating privacy concerns.

Addressing this demand, we present a compilation of language models specifically fine-tuned for biomedical tasks. Utilizing a blend of new and established open-source biomedical datasets, we adapt them into an instruction-following format. This structure facilitates supervised fine-tuning as the initial phase, as detailed in [1].

To assess the effectiveness of these models, we evaluate their performance on the United States Medical Licensing Examination (USMLE), a standardized assessment undertaken by medical students in the United States as part of their qualification process to become physicians. This evaluation offers valuable insights into the models’ competencies and prospective applications within the medical domain.

We make all models and datasets publicly available, anticipating that they will confer significant advantages to both medical and AI researchers as well as practitioners in their respective fields.

2 Materials and Methods

2.1 Datasets

In this section, we present Medical Meadow, a collection of medical tasks that we have compiled for fine-tuning and evaluating the performance of large language models in the context of medicine. Medical Meadow consists of two main categories: a collection of established medical NLP tasks reformatted into instruction-tuning formats, and a crawl of various internet resources. Each dataset focuses on different aspects of medical knowledge and practice, providing a comprehensive training and evaluation framework. See Table 1 for a detailed overview of the datasets.
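
As a rough illustration of what reformatting a question-answer pair into an instruction-tuning record might look like, the sketch below assumes an Alpaca-style record with `instruction`, `input`, and `output` fields. The field names, the instruction wording, the helper `to_instruction_record`, and the example pair are assumptions made for illustration; they are not taken from the Medical Meadow files.

```python
import json


def to_instruction_record(
    question: str,
    answer: str,
    instruction: str = "Answer this medical question truthfully.",  # assumed wording
) -> dict:
    """Wrap a question-answer pair in an Alpaca-style record (assumed schema)."""
    return {
        "instruction": instruction,
        "input": question.strip(),
        "output": answer.strip(),
    }


# Made-up example pair, only to show the shape of a record.
record = to_instruction_record(
    "  What is the mechanism of action of penicillin? ",
    "Penicillin inhibits bacterial cell wall synthesis. ",
)
print(json.dumps(record, indent=2))
```

Keeping the question in a separate `input` field is one possible design; a single-field prompt would work equally well as long as training and inference use the same template.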

2.1.1 Dataset 1: Flash Cards Used by Medical Students

Medicine as a whole encompasses a wide range of subjects that medical students and graduates must master in order to practice effectively. This includes a profound understanding of basic medical sciences, clinical knowledge, and clinical skills. The Anki Medical Curriculum flashcards are created and updated by medical students and cover the entirety of the medical school curriculum, addressing subjects such as anatomy, physiology, pathology, pharmacology, and more. These flashcards frequently feature succinct summaries and mnemonics to aid in the learning and retention of important medical concepts. In our investigation, we leveraged flashcards as a source to create question-answer pairs for training purposes. Upon excluding cards containing images, we harnessed OpenAI’s GPT-3.5-Turbo to restructure the cards into coherent, contextually pertinent question-answer pairs. Generally, the questions and answers are concise and targeted, as the flashcards offer limited space for incorporating extensive information. See Table 3 for representative Q/A pairs.

2.1.2 Dataset 2: Stack Exchange Medical Sciences

The Stack Exchange dataset consists of 52,475 question-answer pairs obtained from five Stack Exchange forums covering biomedical sciences and related fields:

Academia: This forum offers insights into research methodologies, scientific publication processes, and career paths within the scientific community. While not directly affiliated with medicine, considering the volume of medical research, it is likely that medical professionals will also consult models pertaining to this subject matter.

Table 1: Summary of medical datasets created for this work. For information regarding other, already published data, please refer to the respective original publication.

Dataset	Source	Description	n
Finetuning
Medical Flash Cards	Anki Flashcards	Rephrased Q&A pairs derived from the front and back sides of medical flashcards	33,955
Stack Exchange	Academia	Q&A pairs generated from questions and their top-rated answers	39,633
	Biology		7,482
	Fitness		3,026
	Health		1,428
	Bioinformatics		906
Wikidoc	Living Textbook	Q&A pairs generated from paragraphs, where questions were formulated from rephrased paragraph titles, and answers were extracted from paragraph text	67,704
	Patient Information	Q&A pairs generated from paragraph headings and associated text content	5,942
Evaluation
USMLE	Step 1	Multiple choice questions from the USMLE self-assessment with image-based questions excluded	119
	Step 2		120
	Step 3		135

Bioinformatics: As an interdisciplinary field combining biology, computer science, and data analysis, the Bioinformatics forum offers valuable information on the techniques and tools used for analyzing complex biological data, which is increasingly important in modern medical research.

To maintain a high level of answer quality, we collected data exclusively from responses that received a minimum of five up-votes within the forum discussions and paired them with their corresponding questions. See Table 4 for representative Q/A pairs.
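
The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Answer` record, its field names, and the helper `build_qa_pairs` are hypothetical, and the example posts are made up. Only the threshold of five up-votes comes from the text.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_UPVOTES = 5  # threshold stated in the text


@dataclass
class Answer:
    question_id: int  # hypothetical field linking an answer to its question
    body: str
    score: int        # net up-votes of the answer


def build_qa_pairs(
    questions: dict[int, str],
    answers: list[Answer],
    min_upvotes: int = MIN_UPVOTES,
) -> list[tuple[str, str]]:
    """Pair every sufficiently up-voted answer with its question."""
    pairs = []
    for ans in answers:
        if ans.score >= min_upvotes and ans.question_id in questions:
            pairs.append((questions[ans.question_id], ans.body))
    return pairs


# Made-up example posts.
questions = {1: "How do I cite a preprint?", 2: "What is a FASTA file?"}
answers = [
    Answer(1, "Cite it like an article and mark it as a preprint.", 12),
    Answer(1, "Don't.", 1),  # dropped: fewer than five up-votes
    Answer(2, "A text format for sequences.", 5),
]
pairs = build_qa_pairs(questions, answers)
print(len(pairs))  # 2
```

This variant keeps every answer that meets the threshold; keeping only the single highest-scoring answer per question would be an equally plausible reading and needs only a small change to the loop.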

2.1.3 Dataset 3: Wikidoc

We incorporated medical question-answer pairs extracted from WikiDoc, a collaborative platform for medical professionals to share and contribute up-to-date medical knowledge. The platform has two main sub-sites, the "Living Textbook" and "Patient Information". The "Living Textbook" contains chapters for various medical specialties, which we crawled. We then used GPT-3.5-Turbo to rephrase each paragraph heading as a question and used the paragraph text as the answer. Patient Information is structured differently, in that each section subheading is already a question, making rephrasing unnecessary. See Table 5 for representative Q/A pairs.

2.1.4 Dataset 4: Medical NLP Benchmarks

We additionally use data from open NLP datasets and benchmarks, including:

1. The COVID-19 Open Research Dataset Challenge (CORD-19), consisting of more than one million scholarly articles [5]
2. Benchmark data from Measuring Massive Multitask Language Understanding [6, 7]

2.2 Model Training

Our models are built upon the LLaMA (Large Language Model Meta AI) foundation models. LLaMA represents a cutting-edge large language model released by Meta, demonstrating their commitment to open science. It is available in various sizes, including 7 billion, 13 billion, 33 billion, and 65 billion parameters. In this study, we fine-tuned the 7 and 13 billion parameter LLaMA variants, adhering to the approach delineated by Taori et al. [11].

We trained each model for five epochs, employing a learning rate of $2 \times 10^{-5}$ for the 7b model and $1 \times 10^{-5}$ for the 13b model, using a cosine learning rate scheduler. Gradient accumulation facilitated training with an effective batch size of 256. Because this training updates all model parameters, the hardware requirements are substantial. Consequently, we explored alternative training procedures.
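
To make the schedule and the batch-size arithmetic concrete, the following sketch computes a plain cosine decay and the number of gradient accumulation steps needed to reach a given effective batch size. The peak learning rate and the effective batch size of 256 come from the text; the absence of warmup, the total step count, the per-device batch size, and the number of devices are assumptions chosen only for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def accumulation_steps(effective_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Number of micro-batches to accumulate before each optimizer step."""
    per_step = per_device_batch * num_devices
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be divisible by per_device_batch * num_devices")
    return effective_batch // per_step


# Learning rate of the 7b model at the start, middle, and end of a hypothetical 1000-step run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000, peak_lr=2e-5))

# Hypothetical hardware: 8 devices with a micro-batch of 4 samples each.
print(accumulation_steps(effective_batch=256, per_device_batch=4, num_devices=8))  # 8
```

With these assumed numbers, each optimizer step sums gradients over 8 micro-batches of 32 samples, which yields the effective batch size of 256 without holding all 256 samples in memory at once.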

First, we implemented Low-Rank Adaptation (LoRA) for weight updates to adapt the pre-trained language models to our specific tasks. LoRA is a method that involves freezing the pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture [12]. This approach substantially diminishes the number of trainable parameters and GPU memory requirements for downstream tasks, making it more efficient compared to full fine-tuning and significantly reducing training time.
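
The idea behind LoRA can be illustrated with a small, dependency-free sketch. A frozen weight matrix `W` is augmented by a low-rank product `B A`, so the layer computes `W x + (alpha / r) * B (A x)` and only `A` and `B` are trained. The toy dimensions, the rank `r`, the scaling factor `alpha`, and the initialization below are assumptions for illustration and are not the settings used in this work.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def parameter_counts(d_out, d_in, r):
    """Return (parameters of the full matrix, trainable parameters of the LoRA factors)."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 6, 4, 2, 4  # toy sizes
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # d_out x r, zero-initialized
x = [1.0, -2.0, 0.5, 3.0]

# With B initialized to zero, the adapted layer starts out identical to the frozen layer.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True

# Illustrative 4096 x 4096 projection with an assumed rank of 8.
full, lora = parameter_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

For this illustrative projection the trainable parameters shrink by a factor of 256, which is the source of the memory savings described above.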

To further decrease memory and compute demands, we employed 8-bit matrix multiplication for the feed-forward and attention projection layers, along with an 8-bit optimizer. When combined with LoRA, this strategy further reduces the memory needed for training [13] [14]. All models trained with LoRA underwent three epochs of training at a learning rate of $2 \times 10^{-5}$.
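
The general principle of storing values in 8 bits can be sketched with simple absmax quantization: each float is mapped to an integer code in [-127, 127] using one shared scale, and multiplied back by that scale when needed. This is a simplified illustration of the idea only and should not be read as the scheme of [13] or [14], which are more elaborate; the function names and example weights are made up.

```python
def quantize_absmax(values):
    """Map floats to int8 codes in [-127, 127] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.031, 0.49, -0.27]  # made-up example values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(max_err <= scale / 2 + 1e-12)  # rounding error is at most half a quantization step
```

Each code fits into one byte, compared with two bytes for a 16-bit float, at the cost of a rounding error bounded by half the quantization step.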

2.3 Evaluation Procedure

To evaluate the performance of the fine-tuned language models, we devised an assessment methodology centered on their zero-shot performance across the United States Medical Licensing Examination (USMLE) Step 1, Step 2, and Step 3 self-assessment datasets. We excluded all questions containing images, as our primary interest lies in the models’ language capabilities, and they lack visual abilities. We instructed the models to present answers in the format "Option: Answer" (e.g., "A: Penicillin"). If a model’s output did not adhere to this format, the model was prompted up to five times until it produced a response in the desired format. If it still failed to do so, the last response was retained.
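
The retry logic described above can be sketched as follows. The sketch makes several assumptions: `generate` stands for any function that sends a prompt to a model and returns its text, the regular expression is one possible way to recognize the "Option: Answer" format, and "up to five times" is interpreted as at most five prompts in total. None of these details are confirmed by the text.

```python
import re
from typing import Callable, Optional

ANSWER_PATTERN = re.compile(r"^\s*([A-Z])\s*:\s*\S")  # e.g. "A: Penicillin" (assumed pattern)
MAX_PROMPTS = 5  # assumed to mean five prompts in total


def parse_option(response: str) -> Optional[str]:
    """Return the option letter if the response follows 'Option: Answer', else None."""
    match = ANSWER_PATTERN.match(response)
    return match.group(1) if match else None


def ask_with_retries(generate: Callable[[str], str], prompt: str,
                     max_prompts: int = MAX_PROMPTS) -> str:
    """Prompt until the format is correct or the budget is used up; keep the last response."""
    response = ""
    for _ in range(max_prompts):
        response = generate(prompt)
        if parse_option(response) is not None:
            break
    return response


def accuracy(responses, gold_options) -> float:
    """Fraction of responses whose parsed option matches the gold option."""
    correct = sum(parse_option(r) == g for r, g in zip(responses, gold_options))
    return correct / len(gold_options)


# Stub model that answers in the wrong format once, then correctly.
replies = iter(["I think it is penicillin.", "A: Penicillin"])
answer = ask_with_retries(lambda _prompt: next(replies), "Question text and options ...")
print(answer)  # A: Penicillin

print(accuracy(["A: Penicillin", "no idea", "C: Aspirin"], ["A", "B", "C"]))
```

A response that never matches the format is still scored here, but it parses to `None` and therefore counts as incorrect.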

Interestingly, most of the fine-tuned models produced answers in the correct format after the first prompt, while only the base LLaMA models required multiple prompts. We conducted separate evaluations for each model, measuring their accuracy on the USMLE Step 1, Step 2, and Step 3 datasets individually. This approach allowed us to gain a comprehensive understanding of the models’ performance across the various stages of the medical licensing examination.

3 Results

Our findings on the USMLE test set are displayed in Table 2. Fine-tuned LLMs consistently surpassed the performance of their pre-trained-only counterparts. It is worth noting that while LoRA and 8-bit fine-tuning expedited the training process, employing these methods resulted in reduced accuracy.

4 Discussion and conclusion

In this study, we introduced a novel, high-quality collection of medical text data specifically designed for training instruction-following, medical large language models (LLMs). This dataset serves as a comprehensive resource for enhancing LLM performance in the medical domain, laying the groundwork for potential integration into medical education and practice.

Table 2: Zero-shot performance on the USMLE self-assessment

Model	Step 1	Step 2	Step 3
LLaMA 7b [15]	0.198	0.202	0.203
Alpaca 7b naive [11]	0.275	0.266	0.293
Alpaca 7b LoRA	0.220	0.138	0.252
MedAlpaca 7b	0.297	0.312	0.398
MedAlpaca 7b LoRA	0.231	0.202	0.179
MedAlpaca 7b LoRA 8bit	0.231	0.241	0.211
ChatDoctor (7b) [10]	0.187	0.185	0.148
LLaMA 13b [15]	0.222	0.248	0.276
Alpaca 13b naive	0.319	0.312	0.301
MedAlpaca 13b	0.473	0.477	0.602
MedAlpaca 13b LoRA	0.250	0.255	0.255
MedAlpaca 13b LoRA 8bit			

Using our medical text data, we fine-tuned several open-source LLM variants, adopting parameter-efficient tuning methodologies to address limited computing resources [16]. This approach is vital, as full fine-tuning of language model parameters is often unfeasible for most academic institutions. Our study demonstrates the viability of parameter-efficient fine-tuning.

We evaluated LLM performance using the United States Medical Licensing Examination (USMLE) Steps 1, 2, and 3, which assess medical knowledge at various complexity levels. As expected, performance improved with larger pre-trained models. Applying approximation techniques, such as 8-bit precision and LoRA, during fine-tuning yielded lower accuracy. However, due to considerable computational costs, we did not conduct extensive hyperparameter optimization; thus, it may be possible to achieve performance comparable to vanilla-trained models through more thorough hyperparameter optimization, which we leave for future research.

The availability of additional medical datasets will likely enhance the applicability and performance of these models, creating various potential applications such as extracting structured medical information from unstructured text, supporting medical students’ education through question-answering interactions to reinforce their knowledge and clarify lecture uncertainties, or assisting patients in understanding their health and improving communication between doctors and patients who often find medical language challenging.

Nevertheless, implementing LLMs for these application scenarios presents challenges and concerns. Ensuring data privacy and compliance with ethical standards is critical when handling sensitive patient data; these concerns can be addressed by deploying models locally within secure hospital networks. Moreover, models must be thoroughly evaluated and safeguarded for potential biases and inaccuracies to prevent unintended consequences in medical decision-making.

A significant limitation is LLMs’ tendency to confabulate, that is, to generate text that appears plausible but is factually incorrect [17]. This issue is especially concerning in the medical domain, where disseminating incorrect information can have serious implications for patient care and safety. Guaranteeing the accuracy and reliability of generated information is therefore essential, necessitating rigorous evaluation and continuous monitoring to mitigate confabulation risks and the potential harm they may cause in medical settings.

In conclusion, our work substantially contributes to the field of LLMs in medicine by providing a novel, high-quality medical dataset for research and application purposes. Further, we successfully fine-tuned and evaluated various LLMs, demonstrating that their medical domain performance increases with pre-trained model size and high-quality data availability. This progress paves the way for further exploration and development of LLMs in medicine, with potential implications for medical education, patient care, and healthcare communication.

5 Acknowledgements

The authors acknowledge the Scientific Computing of the IT Division at the Charité - Univ

Original paper: https://arxiv.org/pdf/2304.08247