Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding
Augustin Toma1,2,∗ Patrick R. Lawler3,4,5 Jimmy Ba1,6 Rahul G. Krishnan1,6,7 Barry B. Rubin3 Bo Wang1,3,6,7,8,∗,† 1Vector Institute for Artificial Intelligence, Toronto, Canada 2Department of Medical Biophysics, University of Toronto, Toronto, Canada 3Peter Munk Cardiac Centre, University Health Network, Toronto, Canada 4McGill University, Montreal, Canada 5Division of Cardiology, University of Toronto, Toronto, Canada 6Department of Computer Science, University of Toronto, Toronto, Canada 7Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada 8AI Hub, University Health Network, Toronto, Canada augustin.toma@mail.utoronto.ca bowang@vectorinstitute.ai
Abstract
We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research. Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs. Leveraging efficient single-GPU training, Clinical Camel surpasses GPT-3.5 in five-shot evaluations on all assessed benchmarks, including 64.3% on the USMLE Sample Exam (compared to 58.5% for GPT-3.5), 77.9% on PubMedQA (compared to 60.2%), 60.7% on MedQA (compared to 53.6%), and 54.2% on MedMCQA (compared to 51.0%). In addition to these benchmarks, Clinical Camel demonstrates its broader capabilities, such as synthesizing plausible clinical notes. This work introduces dialogue-based knowledge encoding, a novel method to synthesize conversational data from dense medical texts. While benchmark results are encouraging, extensive and rigorous human evaluation across diverse clinical scenarios is imperative to ascertain safety before implementation. By openly sharing Clinical Camel, we hope to foster transparent and collaborative research, working towards the safe integration of LLMs within the healthcare domain. Significant challenges concerning reliability, bias, and the potential for outdated knowledge persist. Nonetheless, the transparency provided by an open approach reinforces the scientific rigor essential for future clinical applications.
1 Introduction
Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities in various applications. However, their deployment in healthcare settings raises concerns due to their proprietary nature, particularly regarding privacy, stability, and transparency. Although open medical LLMs exist, they fall short in performance compared to proprietary alternatives and offer limited context lengths, restricting their use cases.
The performance gap between proprietary and open models is concerning in healthcare, as the latter allows for more rigorous evaluation and validation. Ensuring the safe integration of LLMs into clinical care requires thorough validation, which is not feasible with the current landscape of proprietary models. Moreover, challenges arise when sending healthcare data to private companies, highlighting the value of institutions being able to serve their own models for reliable and safe access.
We introduce Clinical Camel, an openly available and high-performing medical LLM fine-tuned from LLaMA-2 [Touvron et al., 2023] to address these issues. Clinical Camel is trained via QLoRA [Dettmers et al., 2023] on a single commercial GPU, enabling it to surpass GPT-3.5 in performance on standardized medical benchmarks: biomedical subsections of the MMLU, MedMCQA, MedQA, PubMedQA, and the USMLE sample exam. We also introduce a novel method called Dialogue-Based Knowledge Encoding (DBKE) to develop our training corpus, which converts dense clinical review articles into synthetic conversations.
Our research demonstrates the feasibility of efficiently fine-tuning domain-specific LLMs without the need for massive datasets or computing power. Clinical Camel is an example of open medical LLMs that compare favorably with proprietary counterparts. Nonetheless, evaluating LLMs in healthcare remains challenging, and performance on automated benchmarks does not equate to clinical utility or safety.
By making Clinical Camel openly available for research, we aim to promote further investigation into the safe integration of LLMs into clinical care and contribute to the advancements of machine learning applications in health.
1.1 The Application of Large Language Models in Healthcare
LLMs have a broad scope of potential medical applications due to their ability to process unstructured clinical text; these range from automated clinical note creation and patient record summarization to more advanced tasks like clinical decision support, medical triaging, patient counseling, and medical education. These applications could improve healthcare delivery and access for providers and patients if proven effective.
Proprietary models like OpenAI’s GPT-3.5 and GPT-4 demonstrate strong performance on medical benchmarks without domain-specific fine-tuning [Nori et al., 2023]. GPT-4’s capabilities have prompted efforts to integrate it into clinical care, but sending healthcare data to private servers creates access equity issues globally. Critically, rigorously studying proprietary models is challenging. For example, OpenAI updates models on a three-month basis, complicating deployment in patient care where even small prompt changes can drastically alter outputs.
Google’s Med-PaLM 2 surpassed GPT-4 when tested with an ensemble refinement strategy (an inference-heavy prompting strategy requiring 44 generations), demonstrating superior performance on the MedQA, PubMedQA, MMLU-Professional Medicine, and MMLU-College Medicine benchmarks [Singhal et al., 2023]. Human evaluations also showed that physicians and laypeople preferred Med-PaLM 2 answers over physician-generated responses, although the human evaluation group was modestly sized, with 15 physicians and six laypersons. The Med-PaLM 2 work is commendable for going beyond automated benchmarks; however, Med-PaLM 2 remains unavailable publicly, preventing external validation of these results.
The inability to rigorously study proprietary models due to the lack of public information, access, and privacy constraints motivates the development of open alternatives. High-performing publicly available models will enhance access and enable the rigorous evaluation needed for the safe clinical integration of LLMs.
2 Open Medical Language Models: Pushing for Transparency and Better Public Health Outcomes
Several open medical language models have been released, including MedAlpaca [Han et al., 2023] and ChatDoctor [Li et al., 2023]. Benchmark evaluations of these models have been limited, and no rigorous comparisons have been made to proprietary models such as GPT-3.5/4.
ChatDoctor was fine-tuned on online physician-patient dialogues and compared favorably to GPT-3.5 on BERTScore, computed over ChatDoctor’s responses to a dataset of patient questions and answers; however, no other benchmarks were evaluated. Its short context length of 512 tokens restricts its utility beyond question answering.
MedAlpaca reported high performance on the USMLE self-assessment test. However, it also has a trained context length of 512 tokens. A parameter-efficient variant was trained alongside the fully fine-tuned version but significantly underperformed it. No other benchmark results were reported.
In conclusion, while existing open models show promise, benchmark evaluations have been limited and lack comparisons to proprietary models. Their short contexts likely restrict utility as well. In contrast, Clinical Camel has an expanded 4096 token context length and can perform tasks beyond question answering. Consequently, Clinical Camel represents a substantial advancement for deploying large language models in healthcare.
3 Methodology
3.1 Dialogue-Based Knowledge Encoding
Our work introduces Dialogue-Based Knowledge Encoding (DBKE), a method designed to transform input text into a multi-turn dialogue. The methodology we have developed acts as a form of domain adaptation that we hypothesize strengthens the recall capabilities of the downstream conversational models. DBKE allows us to convert dense medical literature into dialogues and instill soft alignment.
The DBKE process consists of dialogue creation and student model training. The process is initiated with a dense knowledge text input, paired with an input prompt containing alignment constraints and instructions for generating a dialogue. A teacher model, denoted by $M_ {T}$ , generates a dialogue based on the provided context while following the constraints stated in the prompt. The generated dialogue is then used as a transformed training text for fine-tuning a student model, denoted by $M_ {S}$ .
We illustrate the steps of the DBKE methodology in Algorithm 1:
Algorithm 1 Dialogue-Based Knowledge Encoding (DBKE)
1: procedure DBKE(T, P, $M_T$, $M_S$) ▷ T is the input text, P is the prompt (containing alignment rules), $M_T$ is the teacher model, $M_S$ is the student model
2:   for each target text $t_i$ in T do
3:     D ← generate a dialogue from $t_i$ using $M_T$ and P ▷ teacher model generates a multi-turn dialogue
4:   end for
5:   Fine-tune $M_S$ on D, masking the user’s inputs during training
6:   return $M_S$ ▷ return the fine-tuned student model
7: end procedure
The DBKE method combines knowledge encoding with soft behavioral alignment. Although not strictly enforced, the alignment constraints embedded in the input prompt guide the generated output of medical models. For example, these constraints could instruct the model to gather more information before suggesting diagnoses. The alignment objectives can be modified to cater to the requirements of specific domains. See Appendix B for an example.
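The loop in Algorithm 1 can be sketched in a few lines of code. This is an illustrative sketch, not the authors' implementation: `teacher_generate` is a hypothetical stand-in for the teacher model $M_{T}$ (in practice, a call to a strong LLM with the alignment-constrained prompt), and `fine_tune` stands in for QLoRA fine-tuning of the student $M_{S}$.

```python
def teacher_generate(text, prompt):
    """Hypothetical teacher call (M_T): in practice, a request to a strong
    LLM asking it to rewrite `text` as a multi-turn dialogue under the
    alignment constraints contained in `prompt`."""
    return [
        ("user", "Can you walk me through this topic?"),
        ("assistant", f"Certainly. {text}"),
    ]

def dbke(texts, prompt, fine_tune):
    """Algorithm 1: generate one dialogue per target text, then fine-tune
    the student model M_S on the resulting dialogues."""
    dialogues = [teacher_generate(t, prompt) for t in texts]
    # Mask the user's inputs during training: approximated here by keeping
    # only assistant turns, so loss would be computed on assistant tokens.
    masked = [[(role, utt) for role, utt in d if role == "assistant"]
              for d in dialogues]
    return fine_tune(masked)  # returns the fine-tuned student model
```

In the real pipeline, masking is applied at the loss level rather than by dropping user turns; the sketch only mirrors the control flow of the algorithm.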

Figure 1: Schematic representation of the Dialogue-Based Knowledge Encoding (DBKE) methodology. The process starts with a knowledge-dense input text $T$ and a prompt $P$ containing alignment constraints. The teacher model $M_ {T}$ then generates a multi-turn dialogue $D$ , which is used to fine-tune the student model $M_ {S}$ . The result is a fine-tuned student model capable of improved conversational performance.
3.2 Dataset
We use data from the ShareGPT project [noa, 2023], data from the MedQA training set [Jin et al., 2020], and clinical review articles available in the public domain from PubMed, published before 2021 to minimize the over-representation of COVID-19-related content. The clinical review articles are transformed through the DBKE process to produce synthetic dialogues. The dataset is truncated to 4096 tokens, and non-English text is filtered out.
Table 1: Summary of datasets in Clinical Camel
Name | Description | Preprocessing
---|---|---
ShareGPT | Multi-step conversations | Removed non-English text, segmented conversations (4096 tokens), filtered degenerate conversations
Clinical Articles | 20,000 pre-2021 open-access articles | Transformed into 100,000 dialogues (5 utterance exchanges avg.)
MedQA | 4,000 randomly selected (from a pool of 10,178 multiple-choice questions) | Transformed into dialogue by retrieving relevant source articles and prompting GPT-4 to produce a detailed justification for the correct answer from the retrieved articles
Table 1 provides an overview of the datasets used in developing the Clinical Camel model, including their description and preprocessing steps. The ShareGPT data includes general multi-step conversations and comprises 70,000 conversations before preprocessing. Clinical articles include 20,000 open-access clinical articles from various sources published before 2021 that were transformed into 100,000 multi-step dialogues. The MedQA training set has 10,178 multiple-choice questions with non-descriptive answers. We processed a subset of 4000 into dialogues using retrieval augmented generation to identify relevant source texts and provide the correct answer to guide the model’s response. The model is encouraged to explain why a particular option is correct and why other options are wrong.
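The truncation and language-filtering steps described above can be sketched as follows. The whitespace "tokenizer" and ASCII-ratio language heuristic are illustrative assumptions, not the paper's actual tokenizer or language-identification method.

```python
MAX_TOKENS = 4096  # Clinical Camel's training context length

def mostly_english(text, threshold=0.9):
    """Crude language filter based on the fraction of ASCII characters;
    a placeholder for a real language-identification step."""
    return bool(text) and sum(c.isascii() for c in text) / len(text) >= threshold

def preprocess(conversations):
    kept = []
    for conv in conversations:
        if not mostly_english(conv):
            continue                        # filter out non-English text
        tokens = conv.split()               # whitespace "tokenizer" (illustrative)
        kept.append(" ".join(tokens[:MAX_TOKENS]))  # truncate to 4096 tokens
    return kept
```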
3.3 Clinical Camel
The LLaMA-2 models serve as the foundation for developing Clinical Camel. We trained 13B and 70B parameter variants using the same dataset. Training utilized QLoRA with masking of human input, an approach that enabled training Clinical Camel on a single H100 GPU. Training was conducted for one epoch with the parameters specified in Table 2.
Table 2: Training Parameters
Parameter | 13B Model | 70B Model
---|---|---
Sequence Length | 4096 | 4096
Lora_r | 64 | 64
Lora_alpha | 16 | 16
Lora_dropout | 0.00 | 0.00
Lora_target_modules | All linear layers | All linear layers
Gradient Accumulation Steps | 16 | 32
Mini-batch Size | 1 | 1
Number of Epochs | 1 | 1
Optimizer | paged_adamw_32bit | paged_adamw_32bit
Learning Rate Scheduler | Cosine | Cosine
Learning Rate | 0.0002 | 0.0001
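The 13B-model settings in Table 2 could be expressed with the `peft` and `transformers` libraries roughly as below. This is an illustrative configuration fragment, not the authors' training script; in particular, the explicit module list is one common way to target all linear layers in a LLaMA-style model.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(            # 4-bit quantization for QLoRA
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(                   # Table 2: r=64, alpha=16, dropout=0
    r=64,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="clinical-camel-13b",
    per_device_train_batch_size=1,          # mini-batch size 1
    gradient_accumulation_steps=16,         # 32 for the 70B variant
    num_train_epochs=1,
    learning_rate=2e-4,                     # 1e-4 for the 70B variant
    lr_scheduler_type="cosine",
    optim="paged_adamw_32bit",
)
```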
4 Evaluation
We evaluated Clinical Camel’s performance on standard medical benchmarks in zero- and five-shot settings. Table 3 presents the zero-shot results compared to GPT-3.5 and GPT-4. Table 4 shows the five-shot results alongside GPT-3.5, GPT-4, and Med-PaLM 2.
The GPT and Med-PaLM 2 scores were sourced from studies by Microsoft [Nori et al., 2023] and Google [Singhal et al., 2023]. Clinical Camel scores were computed using EleutherAI’s evaluation framework [Gao et al., 2021], which compares response likelihoods; we report accuracy scores for all benchmarks.
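Likelihood-based multiple-choice scoring of this kind can be sketched as follows: the model scores each answer option, the highest-scoring option is taken as the prediction, and accuracy is the fraction of questions answered correctly. `score_option` is a hypothetical hook for the model's per-option log-likelihood, not part of the actual harness API.

```python
def accuracy(questions, score_option):
    """questions: list of (prompt, options, correct_index) triples.
    score_option(prompt, option) -> log-likelihood the model assigns to
    `option` as a continuation of `prompt` (hypothetical model hook)."""
    correct = 0
    for prompt, options, gold in questions:
        scores = [score_option(prompt, opt) for opt in options]
        pred = max(range(len(options)), key=scores.__getitem__)  # argmax
        correct += (pred == gold)
    return correct / len(questions)
```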
In five-shot testing, our model outperforms GPT-3.5 across all metrics. However, it currently falls short of GPT-4 and Med-PaLM 2 on every benchmark except PubMedQA, where it surpasses GPT-4.
Table 3: Performance of Clinical Camel-13B (C13), Clinical Camel-70B (C70), GPT-3.5, and GPT-4 on various medical datasets in a zero-shot setting.

Dataset | C13 (0-shot) | C70 (0-shot) | GPT-3.5 (0-shot) | GPT-4 (0-shot)
---|---|---|---|---
MMLU Anatomy | 50.4 | 62.2 | 56.3 | 80.0
MMLU Clinical Knowledge | 54.0 | 69.8 | 69.8 | 86.0
MMLU College Biology | 54.9 | 79.2 | 72.2 | 95.1
MMLU College Medicine | 48.0 | 67.0 | 61.3 | 76.9
MMLU Medical Genetics | 59.0 | 69.0 | 70.0 | 91.0
MMLU Professional Medicine | 51.8 | 71.3 | 70.2 | 93.0
MedMCQA | 39.1 | 47.0 | 50.1 | 69.5
MedQA (USMLE) | 34.4 | 53.4 | 50.8 | 78.9
PubMedQA | 72.9 | 74.3 | 71.6 | 75.2
USMLE Sample Exam | 26.9 | 54.3 | 49.2 | 83.2
Table 4: Performance of Clinical Camel-13B (C13), Clinical Camel-70B (C70), GPT-3.5, GPT-4, and Med-PaLM 2 on various medical datasets in a five-shot setting.

Dataset | C13 (5-shot) | C70 (5-shot) | GPT-3.5 (5-shot) | GPT-4 (5-shot) | Med-PaLM 2 (5-shot)
---|---|---|---|---|---
MMLU Anatomy | 48.2 | 65.2 | 60.7 | 80.0 | 77.8
MMLU Clinical Knowledge | 60.4 | 72.8 | 68.7 | 86.4 | 88.3
MMLU College Biology | 59.0 | 81.2 | 72.9 | 93.8 | 94.4
MMLU College Medicine | 52.6 | 68.2 | 63.6 | 76.3 | 80.9
MMLU Medical Genetics | 59.0 | 69.0 | 68.0 | 92.0 | 90.0
MMLU Professional Medicine | 53.3 | 75.0 | 69.8 | 93.8 | 95.2
MedMCQA | 44.8 | 54.2 | 51.0 | 72.4 | 71.3
MedQA (USMLE) | 45.2 | 60.7 | 53.6 | 81.4 | 79.7
PubMedQA | 74.8 | 77.9 | 60.2 | 74.4 | 79.2
USMLE Sample Exam | 39.5 | 64.3 | 58.5 | 86.6 | —
5 Capabilities, challenges, and future directions of the Clinical Camel
In addition to strong performance on medical question-answering benchmarks, Clinical Camel shows promise for other healthcare applications such as automated clinical note generation. As demonstrated in Figure 2, the model can effectively synthesize plausible clinical notes from long patient-physician conversations (see Appendix A) while adhering to alignment objectives. This ability to handle extended contexts stems from Clinical Camel’s 4096-token context length.
However, several challenges remain in applying Clinical Camel more broadly in healthcare settings. A primary concern is the potential for generating misleading or inappropriate content [Ji et al., 2023]. Evaluating model outputs and developing techniques to improve reliability and alignment will be critical for future research directions.
Another challenge stems from updating medical LLMs as knowledge evolves continually. Retraining models on new data requires significant computational resources. Alternative approaches like memory editing [Meng et al., 2022] and retrieval-augmented generation [Shuster et al., 2021] may enable more efficient knowledge updating and will be essential to explore.
Additionally, Clinical Camel is not multi-modal, which is a significant limitation in healthcare. Extending the model to multi-modal inputs could improve its utility for diagnostic and other visual tasks.
We also note that we have yet to systematically evaluate DBKE against other methods of processing training data; therefore, we cannot make definitive claims about its effectiveness.
In summary, while Clinical Camel demonstrates promising capabilities on medical benchmarks, further research is needed to improve reliability, update knowledge, and incorporate multi-modal data. As an open model, Clinical Camel will facilitate this continued study into safely and effectively applying LLMs in healthcare.

Figure 2: Clinical note generated by Clinical Camel from the dialogue in Appendix A
5.1 Bridging the Divide
Recent advances in parameter-efficient training methods, along with the release of models like MetaAI’s LLaMA, have led to rapid improvements in open language models; as a result, Clinical Camel outp