Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding
Augustin Toma1,2,∗ Patrick R. Lawler3,4,5 Jimmy Ba1,6 Rahul G. Krishnan1,6,7 Barry B. Rubin3 Bo Wang1,3,6,7,8,∗,†
1Vector Institute for Artificial Intelligence, Toronto, Canada
2Department of Medical Biophysics, University of Toronto, Toronto, Canada
3Peter Munk Cardiac Centre, University Health Network, Toronto, Canada
4McGill University, Montreal, Canada
5Division of Cardiology, University of Toronto, Toronto, Canada
6Department of Computer Science, University of Toronto, Toronto, Canada
7Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada
8AI Hub, University Health Network, Toronto, Canada
augustin.toma@mail.utoronto.ca bowang@vectorinstitute.ai
Abstract
We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research. Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs. Leveraging efficient single-GPU training, Clinical Camel surpasses GPT-3.5 in five-shot evaluations on all assessed benchmarks, including 64.3% on the USMLE Sample Exam (compared to 58.5% for GPT-3.5), 77.9% on PubMedQA (compared to 60.2%), 60.7% on MedQA (compared to 53.6%), and 54.2% on MedMCQA (compared to 51.0%). In addition to these benchmarks, Clinical Camel demonstrates broader capabilities, such as synthesizing plausible clinical notes. This work introduces dialogue-based knowledge encoding, a novel method for synthesizing conversational data from dense medical texts. While benchmark results are encouraging, extensive and rigorous human evaluation across diverse clinical scenarios is imperative to ascertain safety before implementation. By openly sharing Clinical Camel, we hope to foster transparent and collaborative research, working towards the safe integration of LLMs within the healthcare domain. Significant challenges concerning reliability, bias, and the potential for outdated knowledge persist. Nonetheless, the transparency provided by an open approach reinforces the scientific rigor essential for future clinical applications.
1 Introduction
Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities in various applications. However, their deployment in healthcare settings raises concerns due to their proprietary nature, particularly regarding privacy, stability, and transparency. Although open medical LLMs exist, they fall short in performance compared to proprietary alternatives and offer limited context lengths, restricting their use cases.
The performance gap between proprietary and open models is concerning in healthcare, as the latter allows for more rigorous evaluation and validation. Ensuring the safe integration of LLMs into clinical care requires thorough validation, which is not feasible with the current landscape of proprietary models. Moreover, challenges arise when sending healthcare data to private companies, highlighting the value of institutions being able to serve their own models for reliable and safe access.
We introduce Clinical Camel, an openly available and high-performing medical LLM fine-tuned from LLaMA-2 [Touvron et al., 2023] to address these issues. Clinical Camel is trained via QLoRA [Dettmers et al., 2023] on a single commercial GPU, enabling it to surpass GPT-3.5 in performance on standardized medical benchmarks: the biomedical subsections of MMLU, MedMCQA, MedQA, PubMedQA, and the USMLE sample exam. We introduce a novel method called Dialogue-Based Knowledge Encoding (DBKE) to develop our training corpus, which converts dense clinical review articles into synthetic conversations.
Our research demonstrates the feasibility of efficiently fine-tuning domain-specific LLMs without the need for massive datasets or computing power. Clinical Camel is an example of open medical LLMs that compare favorably with proprietary counterparts. Nonetheless, evaluating LLMs in healthcare remains challenging, and performance on automated benchmarks does not equate to clinical utility or safety.
By making Clinical Camel openly available for research, we aim to promote further investigation into the safe integration of LLMs into clinical care and contribute to the advancements of machine learning applications in health.
1.1 The Application of Large Language Models in Healthcare
LLMs have a broad scope of potential medical applications due to their ability to process unstructured clinical text; these range from automated clinical note creation and patient record summarization to more advanced tasks like clinical decision support, medical triaging, patient counseling, and medical education. These applications could improve healthcare delivery and access for providers and patients if proven effective.
Proprietary models like OpenAI’s GPT-3.5 and GPT-4 demonstrate strong performance on medical benchmarks without domain-specific fine-tuning [Nori et al., 2023]. GPT-4’s capabilities have prompted efforts to integrate it into clinical care, but sending healthcare data to private servers creates access equity issues globally. Critically, rigorously studying proprietary models is challenging. For example, OpenAI updates models on a three-month basis, complicating deployment in patient care, where even small prompt changes can drastically alter outputs.
Google’s Med-PaLM 2 surpassed GPT-4 when tested with an ensemble refinement strategy (an inference-heavy prompting strategy requiring 44 generations), demonstrating superior performance on the MedQA, PubMedQA, MMLU-Professional Medicine, and MMLU-College Medicine benchmarks [Singhal et al., 2023]. Human evaluations also showed physicians and laypeople preferred Med-PaLM 2 answers over physician-generated responses, although the human evaluation group was modestly sized with 15 physicians and six laypersons. The Med-PaLM 2 work is commendable for going beyond automated benchmarks; however, Med-PaLM 2 remains unavailable publicly, preventing external validation of these results.
The inability to rigorously study proprietary models due to the lack of public information, access, and privacy constraints motivates the development of open alternatives. High-performing publicly available models will enhance access and enable the rigorous evaluation needed for the safe clinical integration of LLMs.
2 Open Medical Language Models: Pushing for Transparency and Better Public Health Outcomes
Several open medical language models have been released, including MedAlpaca [Han et al., 2023] and ChatDoctor [Li et al., 2023]. Benchmark evaluations of these models have been limited, and no rigorous comparisons have been made to proprietary models such as GPT-3.5/4.
ChatDoctor was fine-tuned on online physician-patient dialogues and compared favorably to GPT-3.5 on BERTScore, computed over ChatDoctor’s responses to a dataset of patient questions and answers; however, no other benchmarks were evaluated. Its short context length of 512 tokens restricts its utility beyond question answering.
MedAlpaca reported high performance on the USMLE self-assessment test. However, it also has a trained context length of 512 tokens. A parameter-efficient variant was trained alongside a fully fine-tuned version; however, it significantly underperformed. No other benchmark results were reported.
In conclusion, while existing open models show promise, benchmark evaluations have been limited and lack comparisons to proprietary models. Their short contexts likely restrict utility as well. In contrast, Clinical Camel has an expanded 4096 token context length and can perform tasks beyond question answering. Consequently, Clinical Camel represents a substantial advancement for deploying large language models in healthcare.
3 Methodology
3.1 Dialogue-Based Knowledge Encoding
Our work introduces Dialogue-Based Knowledge Encoding (DBKE), a method designed to transform input text into a multi-turn dialogue. The methodology we have developed acts as a form of domain adaptation that we hypothesize strengthens the recall capabilities of the downstream conversational models. DBKE allows us to convert dense medical literature into dialogues and instill soft alignment.
The DBKE process consists of dialogue creation and student model training. The process is initiated with a dense knowledge text input, paired with an input prompt containing alignment constraints and instructions for generating a dialogue. A teacher model, denoted by $M_{T}$, generates a dialogue based on the provided context while following the constraints stated in the prompt. The generated dialogue is then used as a transformed training text for fine-tuning a student model, denoted by $M_{S}$.
We illustrate the steps of the DBKE methodology in Algorithm 1:
Algorithm 1 Dialogue-Based Knowledge Encoding (DBKE)
1: procedure DBKE(T, P, M_T, M_S)  ▷ T is the input text, P is the prompt (containing alignment rules), M_T is the teacher model, M_S is the student model
2:   for each target text t_i in T do
3:     D ← generate a dialogue from t_i using M_T and P  ▷ teacher model generates a multi-turn dialogue
4:   end for
5:   fine-tune M_S on D, masking the user’s inputs during training
6:   return M_S  ▷ return the fine-tuned student model
7: end procedure
The DBKE method combines knowledge encoding with soft behavioral alignment. Although not strictly enforced, the alignment constraints embedded in the input prompt guide the generated output of medical models. For example, these constraints could instruct the model to gather more information before suggesting diagnoses. The alignment objectives can be modified to cater to the requirements of specific domains. See Appendix B for an example.
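As an illustration, the DBKE loop of Algorithm 1 can be sketched in a few lines of Python. The teacher and fine-tuning calls below are hypothetical stand-ins, not the actual models or training code used in this work:

```python
# Minimal sketch of the DBKE loop (Algorithm 1). The teacher model call is a
# placeholder: in practice the teacher is a large instruction-following LLM
# and fine_tune is a full training run with user-input masking.

def dbke(texts, prompt, teacher, fine_tune):
    """Transform each source text into a dialogue, then fine-tune the student.

    texts     : list of knowledge-dense input texts (T)
    prompt    : instruction string with alignment constraints (P)
    teacher   : callable (prompt, text) -> multi-turn dialogue (M_T)
    fine_tune : callable (dialogues) -> student model (M_S)
    """
    dialogues = [teacher(prompt, t) for t in texts]  # teacher generates D
    return fine_tune(dialogues)                      # student trained on D

# Toy stand-ins to illustrate the data flow.
def toy_teacher(prompt, text):
    return [("user", f"Can you explain: {text}?"),
            ("assistant", f"Certainly. {text}")]

def toy_fine_tune(dialogues):
    return {"n_training_dialogues": len(dialogues)}

student = dbke(["Beta-blockers reduce heart rate"],
               "Ask clarifying questions before suggesting diagnoses.",
               toy_teacher, toy_fine_tune)
```

The alignment constraints live entirely in the prompt passed to the teacher, which is what makes the alignment "soft": nothing in the pipeline enforces them beyond the teacher's adherence.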

Figure 1: Schematic representation of the Dialogue-Based Knowledge Encoding (DBKE) methodology. The process starts with a knowledge-dense input text $T$ and a prompt $P$ containing alignment constraints. The teacher model $M_{T}$ then generates a multi-turn dialogue $D$, which is used to fine-tune the student model $M_{S}$. The result is a fine-tuned student model capable of improved conversational performance.
3.2 Dataset
We use data from the ShareGPT project [noa, 2023], data from the MedQA training set [Jin et al., 2020], and clinical review articles available in the public domain from PubMed, published before 2021 to minimize the over-representation of COVID-19-related content. The clinical review articles are transformed through the DBKE process to produce synthetic dialogues. The dataset is truncated to 4096 tokens, and non-English text is filtered out.
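The preprocessing just described (language filtering plus truncation to 4096 tokens) can be sketched as follows. The whitespace "tokenizer" and ASCII-ratio language heuristic are illustrative assumptions, since the exact tooling is not specified here:

```python
# Illustrative sketch of the dataset preprocessing. A whitespace split stands
# in for the model tokenizer, and an ASCII-ratio heuristic stands in for a
# real language-identification filter.

MAX_TOKENS = 4096

def is_mostly_english(text, threshold=0.9):
    # Crude heuristic: treat a document as English if most chars are ASCII.
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    return ascii_chars / max(len(text), 1) >= threshold

def truncate(text, max_tokens=MAX_TOKENS):
    tokens = text.split()  # placeholder for the model tokenizer
    return " ".join(tokens[:max_tokens])

def preprocess(corpus):
    return [truncate(doc) for doc in corpus if is_mostly_english(doc)]

docs = ["A clinical review of hypertension management.", "临床综述文章"]
clean = preprocess(docs)  # the non-English document is filtered out
```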
Table 1: Summary of datasets in Clinical Camel
| Name | Description | Preprocessing |
|---|---|---|
| ShareGPT | Multi-step conversations | Removed non-English text, segmented conversations (4096 tokens), filtered degenerate conversations |
| Clinical Articles | 20,000 pre-2021 open-access articles | Transformed into 100,000 dialogues (5 utterance exchanges avg.) |
| MedQA | 4,000 randomly selected (10,178 multiple-choice question pool) | Transformed into dialogues by retrieving relevant source articles and prompting GPT-4 to produce a detailed justification for the correct answer from the retrieved articles |
Table 1 provides an overview of the datasets used in developing the Clinical Camel model, including their descriptions and preprocessing steps. The ShareGPT data includes general multi-step conversations and comprises 70,000 conversations before preprocessing. The clinical articles comprise 20,000 open-access clinical articles from various sources, published before 2021, that were transformed into 100,000 multi-step dialogues. The MedQA training set has 10,178 multiple-choice questions with non-descriptive answers. We processed a subset of 4,000 into dialogues using retrieval-augmented generation to identify relevant source texts and provide the correct answer to guide the model’s response. The model is encouraged to explain why a particular option is correct and why the other options are wrong.
3.3 Clinical Camel
The LLaMA-2 models serve as the foundation for developing Clinical Camel. We trained 13B and 70B parameter variants using the same dataset. The training utilized QLoRA with masking of human input. This approach enabled training Clinical Camel on a single H100 GPU. Training was conducted for one epoch with the parameters specified in Table 2.
Table 2: Training Parameters
| Parameter | 13B Model | 70B Model |
|---|---|---|
| Sequence Length | 4096 | 4096 |
| Lora_r | 64 | 64 |
| Lora_alpha | 16 | 16 |
| Lora_dropout | 0.00 | 0.00 |
| Lora_target_modules | All linear layers | All linear layers |
| Gradient Accumulation Steps | 16 | 32 |
| Mini-batch Size | 1 | 1 |
| Number of Epochs | 1 | 1 |
| Optimizer | paged_adamw_32bit | paged_adamw_32bit |
| Learning Rate Scheduler | Cosine | Cosine |
| Learning Rate | 0.0002 | 0.0001 |
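For reference, the Table 2 hyperparameters can be captured in a plain config object. This is a summary of the table, not the authors' training script:

```python
# The Table 2 hyperparameters as a plain dataclass. Field names mirror the
# table; defaults are the 13B settings, with 70B overrides shown below.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    sequence_length: int = 4096
    lora_r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.0
    lora_target_modules: str = "all linear layers"
    gradient_accumulation_steps: int = 16   # 32 for the 70B model
    mini_batch_size: int = 1
    num_epochs: int = 1
    optimizer: str = "paged_adamw_32bit"
    lr_scheduler: str = "cosine"
    learning_rate: float = 2e-4             # 1e-4 for the 70B model

cfg_13b = TrainingConfig()
cfg_70b = TrainingConfig(gradient_accumulation_steps=32, learning_rate=1e-4)

# Effective batch size = mini-batch size x gradient accumulation steps.
effective_batch_70b = cfg_70b.mini_batch_size * cfg_70b.gradient_accumulation_steps
```

Note that with a mini-batch size of 1, the gradient accumulation steps alone determine the effective batch size (16 for the 13B model, 32 for the 70B model).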
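The masking of human input during fine-tuning can be illustrated with a small sketch (an assumption about the exact mechanics): user tokens receive the ignore label -100, the convention used by common training frameworks, so the cross-entropy loss is computed only on the assistant's tokens. The token IDs and roles below are made up for illustration:

```python
# Sketch of human-input masking: every non-assistant token gets label -100,
# which common frameworks treat as "ignore" in the cross-entropy loss, so
# only the assistant's tokens are supervised.

IGNORE_INDEX = -100

def build_labels(turns):
    """turns: list of (role, token_ids) pairs for one dialogue.

    Returns flat (input_ids, labels) where every non-assistant token
    is masked out of the loss.
    """
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # supervise these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask user tokens
    return input_ids, labels

ids, labels = build_labels([("user", [5, 6]), ("assistant", [7, 8, 9])])
```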
4 Evaluation
We evaluated Clinical Camel’s performance on standard medical benchmarks in zero- and five-shot settings. Table 3 presents the zero-shot results compared to GPT-3.5 and GPT-4. Table 4 shows the five-shot results alongside GPT-3.5, GPT-4, and Med-PaLM 2.
The GPT and Med-PaLM 2 scores were sourced from studies by Microsoft [Nori et al., 2023] and Google [Singhal et al., 2023]. Clinical Camel scores were computed using EleutherAI’s evaluation framework [Gao et al., 2021], which compares response likelihoods; we report accuracy scores for all benchmarks.
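Likelihood-based multiple-choice scoring of the kind used by such harnesses can be sketched as follows. The scoring function is a stub standing in for a real language model, and the question and scores are invented for illustration:

```python
# Sketch of likelihood-based multiple-choice scoring: score each answer
# option by the model's log-likelihood of that option as a continuation of
# the question, and predict the highest-scoring option.

def pick_answer(question, options, loglikelihood):
    """loglikelihood: callable (context, continuation) -> float."""
    scores = [loglikelihood(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

# Toy scorer: pretends the model strongly prefers one continuation.
fake_scores = {"metformin": -1.2, "aspirin": -5.0, "warfarin": -6.3}
pred = pick_answer("First-line drug for type 2 diabetes?",
                   ["aspirin", "metformin", "warfarin"],
                   lambda q, opt: fake_scores[opt])
```

Because the prediction is an argmax over option likelihoods rather than free-form generation, accuracy can be computed deterministically without sampling.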
In five-shot testing, our model outperforms GPT-3.5 across all metrics. However, it currently falls short of GPT-4 and Med-PaLM 2 on all benchmarks except PubMedQA, where it surpasses GPT-4.
Table 3: Performance of Clinical Camel-13B (C13), Clinical Camel-70B (C70), GPT3.5, and GPT4 on various medical datasets in a zero-shot setting.
| Dataset | C13 (0-shot) | C70 (0-shot) | GPT3.5 (0-shot) | GPT4 (0-shot) |
|---|---|---|---|---|
| MMLU Anatomy | 50.4 | 62.2 | 56.3 | 80.0 |
| MMLU Clinical Knowledge | 54.0 | 69.8 | 69.8 | 86.0 |
| MMLU College Biology | 54.9 | 79.2 | 72.2 | 95.1 |
| MMLU College Medicine | 48.0 | 67.0 | 61.3 | 76.9 |
| MMLU Medical Genetics | 59.0 | 69.0 | 70.0 | 91.0 |
| MMLU Professional Medicine | 51.8 | 71.3 | 70.2 | 93.0 |
| MedMCQA | 39.1 | 47.0 | 50.1 | 69.5 |
| MedQA (USMLE) | 34.4 | 53.4 | 50.8 | 78.9 |
| PubMedQA | 72.9 | 74.3 | 71.6 | 75.2 |
| USMLE Sample Exam | 26.9 | 54.3 | 49.2 | 83.2 |
Table 4: Performance of Clinical Camel-13B (C13), Clinical Camel-70B (C70), GPT3.5, GPT4, and Med-PaLM 2 on various medical datasets in a five-shot setting.
| Dataset | C13 (5-shot) | C70 (5-shot) | GPT3.5 (5-shot) | GPT4 (5-shot) | Med-PaLM 2 (5-shot) |
|---|---|---|---|---|---|
| MMLU Anatomy | 48.2 | 65.2 | 60.7 | 80.0 | 77.8 |
| MMLU Clinical Knowledge | 60.4 | 72.8 | 68.7 | 86.4 | 88.3 |
| MMLU College Biology | 59.0 | 81.2 | 72.9 | 93.8 | 94.4 |
| MMLU College Medicine | 52.6 | 68.2 | 63.6 | 76.3 | 80.9 |
| MMLU Medical Genetics | 59.0 | 69.0 | 68.0 | 92.0 | 90.0 |
| MMLU Professional Medicine | 53.3 | 75.0 | 69.8 | 93.8 | 95.2 |
| MedMCQA | 44.8 | 54.2 | 51.0 | 72.4 | 71.3 |
| MedQA (USMLE) | 45.2 | 60.7 | 53.6 | 81.4 | 79.7 |
| PubMedQA | 74.8 | 77.9 | 60.2 | 74.4 | 79.2 |
| USMLE Sample Exam | 39.5 | 64.3 | 58.5 | 86.6 | |
5 Capabilities, challenges, and future directions of Clinical Camel
In addition to strong performance on medical question-answering benchmarks, Clinical Camel shows promise for other healthcare applications like automated clinical note generation. As demonstrated in Figure 2, the model can effectively synthesize plausible clinical notes from long patient-physician conversations (see Appendix A) while adhering to alignment objectives. This ability to handle extended contexts is a crucial capability arising from Clinical Camel’s 4096-token context length.
However, several challenges remain in applying Clinical Camel more broadly in healthcare settings. A primary concern is the potential for generating misleading or inappropriate content [Ji et al., 2023]. Evaluating model outputs and developing techniques to improve reliability and alignment will be critical for future research directions.
Another challenge stems from updating medical LLMs as knowledge evolves continually. Retraining models on new data requires significant computational resources. Alternative approaches like memory editing [Meng et al., 2022] and retrieval-augmented generation [Shuster et al., 2021] may enable more efficient knowledge updating and will be essential to explore.
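As a sketch of the retrieval-augmented approach, the example below retrieves documents by a simple bag-of-words cosine similarity and prepends them to the prompt; knowledge can then be updated by editing the document store instead of retraining. Real systems would use dense embeddings and a vector index, and the document store here is invented:

```python
# Minimal retrieval-augmented generation sketch: rank documents by
# bag-of-words cosine similarity to the query and prepend the best matches
# to the prompt before generation.
import math
from collections import Counter

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    return sorted(docs, key=lambda d: cosine(query, d), reverse=True)[:k]

def build_prompt(query, docs, k=1):
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

store = ["New 2024 guidance updates sepsis antibiotic timing.",
         "Beta-blockers reduce mortality after myocardial infarction."]
prompt = build_prompt("When should sepsis antibiotics be given?", store)
```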
Additionally, Clinical Camel is not multi-modal, which is a significant limitation in healthcare. Extending the model to multi-modal inputs could improve its utility for diagnostic and other visual tasks.
We also note that we have yet to systematically evaluate the effectiveness of DBKE against other methods of processing training data. Therefore, we cannot make definitive statements about its effectiveness.
In summary, while Clinical Camel demonstrates promising capabilities on medical benchmarks, further research is needed to improve reliability, update knowledge, and incorporate multi-modal data. As an open model, Clinical Camel will facilitate this continued study into safely and effectively applying LLMs in healthcare.

Figure 2: Clinical note generated by Clinical Camel from the dialogue in Appendix A
5.1 Bridging the Divide
Recent advances in parameter-efficient training methods, along with the release of models like Meta AI’s LLaMA, have led to rapid improvements in open language models; as a result, Clinical Camel outperforms GPT-3.5 on medical benchmarks despite being trained on a single commercial GPU. However, a significant gap remains compared to top-performing models such as GPT-4 and Med-PaLM 2.
Open initiatives have the potential to continue closing this gap through data rather than computing. In many countries, public health institutions control massive datasets that could help train open medical models. Collaborations between public and private entities can enable responsible access to these records, creating fine-tuned models based on anonymized electronic health record data - this stands in contrast to the fragmented efforts undertaken by competing private companies.
The strategic harnessing of public healthcare data resources could help democratize model development for equitable public benefit. With patient consent and privacy techniques, health records could be used to co-develop open models designed for patients first.
Additionally, open development enables transparency and collaboration fundamental for scientific study. Openness facilitates engaging diverse experts and patients to provide critical input.
In summary, while open model development efforts may lack the computing scale of private corporations, they could leverage extensive public data. Responsible data initiatives could help democratize development toward open models finely tuned for serving all patients.
6 Ethical Considerations
Deploying LLMs like Clinical Camel raises many ethical concerns [Harrer, 2023]; paramount is patient safety, as these models can generate misleading or incorrect information, potentially causing inappropriate diagnoses or treatments. Thorough evaluation and real-world testing are essential to ensure safe deployment.
Bias in model outputs, fueled by skewed training data, may lead to unfair outcomes for diverse populations. Proactively assessing and mitigating dataset and output biases is crucial. Any ethical efforts require clear accountability, regular accuracy checks, and comprehensive monitoring and reporting.

Deploying healthcare LLMs demands rigorous ethical precautions. Foremost is ensuring patient safety, as inaccurate model outputs risk inappropriate diagnoses or treatments. Extensive evaluation across diverse clinical contexts is essential pre-deployment, and ongoing real-world monitoring post-deployment, to enable early error detection and prevent patient harm.
Imbalanced training data may fuel model biases, yielding unfair outcomes for underrepresented groups. Proactive bias detection and mitigation in datasets and outputs are imperative, alongside mandated ongoing accountability through accuracy benchmarking and progress reporting.
Furthermore, upholding safety and equity requires close collaboration with patients, clinicians, ethicists, and experts from marginalized communities throughout development, centering patient voices.
Clinical Camel is not ready for actual clinical application. By openly releasing the model, we aim to promote the rigorous study needed to integrate similar LLMs safely. Much work remains to evaluate and improve performance across diverse populations and prevent potential harm before clinical use. Transparent development and evaluation of open models like Clinical Camel are essential to realizing benefits while acting in a principled manner.
7 Conclusion
Clinical Camel demonstrates performance competitive with proprietary LLMs via efficient training, achieving state-of-the-art results among open medical models and surpassing GPT-3.5 on QA benchmarks. However, benchmark metrics alone are insufficient evidence of real-world efficacy and safety. Extensive human assessment across diverse clinical contexts is essential pre-deployment, and ongoing monitoring post-deployment, to enable early error detection. Sustained accountability around updating, transparency, and integrating patient perspectives is vital to uphold ethics as applications progress toward practice. By openly releasing Clinical Camel, we aim to promote collaboration on rigorously evaluating LLMs pre-clinically to harness their possibilities for patients safely. However, significant work remains to prevent potential harm before clinical integration. Open development and assessment of models like Clinical Camel is essential to realizing benefits while upholding scientific ethics.
8 Model Access
The model can be found online on Hugging Face: https://huggingface.co/wanglab
Disclaimer: Please note that users must agree not to use the model for actual patient care; the model is released for research purposes only.
Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
A Appendix A: Dialogue
DOCTOR: How can I help you?
PATIENT: Hi, Dr. You know, it’s been 20 years that I’ve been having this problem. Sorry. But it’s really been, it’s just been affecting my life. I’m having problems with my toes, I can’t feel them. And now I’m having difficulty to walk.
DOCTOR: How far can you walk?
PATIENT: Half a block.
DOCTOR: What happens then?
PATIENT: I start to get some pain in the left leg. I have to stop. And then if I push myself and I have to go because I have somewhere to get, then I can get pain in both legs.
DOCTOR: Where in your left leg does it start?
PATIENT: Kind of starts in the ankle and the calf and kind of migrates upwards.
DOCTOR: If you’re sitting around not doing anything, do you ever get pain?
PATIENT: No, I get numbness.
DOCTOR: Where is the numbness?
PATIENT: It’s in the feet. I can’t feel my toes most of the time.
DOCTOR: Does it go up past your ankle, your calf, your knee?
PATIENT: I don’t know actually, I’ve really just noticed it in my toes.
DOCTOR: Ever wake you up at night?
PATIENT: Yes.
DOCTOR: What wakes you up at night?
PATIENT: This weird sensation of numbness and tingling that can sometimes be painful.
DOCTOR: Is it both feet or just one?
PATIENT: Starts off in one, but sometimes it’s mostly the left, but sometimes it can be both.
DOCTOR: Do you have diabetes?
PATIENT: Well, kind of. I’ve been told I’ve been borderline diabetic for 20 years.
DOCTOR: Do you take medication for diabetes?
PATIENT: I’m supposed to.
DOCTOR: Do you smoke?
PATIENT: I do.
DOCTOR: How much do you smoke?
PATIENT: About a pack or two.
DOCTOR: High blood pressure? High cholesterol?
PATIENT: I don’t know, I haven’t really seen my family doctor in about five years. I went to the walk-in because of the feet and that’s how I ended up here.
DOCTOR: Allergies?
PATIENT: Shellfish.
DOCTOR: Do you have any brothers or sisters?
PATIENT: I do.
DOCTOR: Any of your brothers or sisters or parents have heart problems?
PATIENT: They’ve all died in their sleep. They’ve all died in their sleep.
DOCTOR: Do you know, was there ever a post-mortem exam to understand what happened to them?
PATIENT: No.
DOCTOR: Yeah, that’s very sad. I’m sorry to hear that.
PATIENT: Thank you.
DOCTOR: Okay. And do you ever get any chest pain?
PATIENT: I get this weird heartburn sensation.
DOCTOR: Tell me about that.
PATIENT: So, if I go for a walk and I have the burning in my feet or the pain and the burning, sometimes feel burning in the stomach. That goes away when I have to rest.
DOCTOR: Do you get sweaty when that happens?
PATIENT: Maybe. Not consistently, but yeah.
DOCTOR: Do you get any pain in one arm or another?
PATIENT: No.
DOCTOR: Does the feeling you have in your stomach go up into your neck or into your head?
PATIENT: No, it’s kind of stuck there. It’s sort of this burning sensation.
DOCTOR: You’re short of breath?
PATIENT: All the time.
DOCTOR: When you walk, what’s more likely to stop you from walking? The pain in your left leg or the shortness of breath.
PATIENT: The pain. The pain comes first. I don’t notice really the breathing. It’s more the pain and then, because I’m sitting quietly, then I notice that I have some heartburn.
DOCTOR: Okay. And have you ever had an episode where you suddenly lost vision in one eye or the other, like a curtain came over your eye?
PATIENT: No.
DOCTOR: Do you ever have any difficulty speaking?
PATIENT: No.
DOCTOR: Any problems moving one arm or one leg?
PATIENT: No.
DOCTOR: Any numbness other than the numbness of your feet?
PATIENT: I have some numbness in my fingers.
DOCTOR: Okay. So we did an ultrasound of your legs and we can see that there’s quite a significant narrowing in the main artery in your left leg. Why? So it’s maybe because of previous smoking. It may be because of borderline diabetes. It may be that this runs in your family, but it doesn’t really matter the why. It’s there and we need to do more tests to understand how to treat this because with what you’re describing, you just have enough blood flow to keep your leg alive and if we don’t improve that, you could end up losing a leg.
PATIENT: So you’re telling me I’m going to lose my leg?
DOCTOR: I’m telling you that we have to do some tests so that we can see exactly what’s going on and then see if there’s a way to improve the circulation in your leg so you don’t end up losing a leg. I’m not sure what’s going to happen yet.
PATIENT: So is what’s happening in my leg also happening in my chest?
DOCTOR: So it could be and we’re going to also investigate that. So I’m going to get a CAT scan of your arteries in your legs and that’s going to tell me where the blockages or narrowings are.
PATIENT: And what if I don’t want anything done?
DOCTOR: So that’s fine. It’s always the patient’s choice about what to do. The way that this works is I give you options and then you tell me what you want to do and as long as I’m satisfied that you really understand what I’ve told you, then it’s completely your choice. It would be helpful if any family members come in for our next meeting. We can discuss this with other people. And you don’t have to decide right this moment, but we do have to decide fairly soon because this can progress. So I’m going to get a CAT scan of your leg arteries. I’m going to get an ultrasound of your heart and a stress test of your heart. Because you have problems walking, I’m going to get a specific type of stress test called a dobutamine echocardiogram that doesn’t require any walking. We’ll just be able to put this all together and we’ll see, do you have any narrowings in your heart arteries? Do you have narrowings in your leg arteries? And then I’ll make a recommendation about what to do about this.
PATIENT: Okay, that sounds reasonable.
DOCTOR: Do you have any questions for me?
PATIENT: Do you think that this could be why my siblings all died?
DOCTOR: Yes. It could be that they had narrowed heart arteries. But there’s lots of other reasons, so it would really be speculation. What’s important is to figure out what’s going on with you.
B Appendix B: Example of DBKE
Prompt
Create a realistic chat dialogue between a patient and a medical chat bot using the passage provided below:
1. Bot empathetically communicates medical information in a simple manner.
2. Bot admits limitations if unsure about information.
3. Patient inquiries cover diverse topics (test results, medications, physical findings, symptoms) related to the passage.
4. Bot asks follow-up questions for better understanding.
5. Focus is on guiding the patient towards understanding their diagnosis.
6. Bot explains its reasoning upon request.
7. Patient provides lab values, imaging descriptions, or ECG findings explicitly.
8. Bot inquires about patient’s medical history, medications, symptoms, lab results, and imaging or ECG findings using non-expert language.
9. Bot explains imaging or ECG features suggestive of a diagnosis without claiming to view images.
10. Bot encourages the patient to consult a healthcare provider for further evaluation, not booking appointments or ordering tests directly.
Input Text:
There is significant overlap between the lifestyle and activity recommendations for preventing myocardial infarction and those used for secondary prevention after an initial myocardial infarction... For primary prevention, lifestyle changes such as engaging in regular physical activity, maintaining a healthy weight, drinking alcohol within recommended limits, and quitting smoking can reduce the risk of cardiovascular disease... Public health measures such as reducing unhealthy diets through food labeling and marketing requirements, and stimulating physical activity can also help reduce the risk of myocardial infarction at a population level... Medications such as statins, which lower blood cholesterol, and aspirin have been studied extensively for their efficacy in preventing myocardial infarction...

Output Dialogue:
Patient: Hi, I just read an article about preventing myocardial infarction...
Bot: Of course, I’ll do my best to help...
Patient: How effective is the influenza vaccine in protecting against myocardial infarction?
Bot: Studies have shown that the influenza vaccine can reduce the risk of myocardial infarction by 15-45%...
Patient: What lifestyle changes can I make to reduce my risk of myocardial infarction?
Bot: Maintaining a healthy weight, engaging in regular physical activity, quitting smoking, and drinking alcohol within recommended limits are all lifestyle changes...
Patient: What kind of diet is recommended?
Bot: A healthy diet should include wholegrain starch, fruits, vegetables, fish, unsalted nuts, seeds, and legumes...
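The DBKE prompt shown in this appendix can be assembled programmatically by pairing a fixed rule list with each dense source passage. The following is a minimal sketch of that templating step; the `build_dbke_prompt` helper and its exact formatting are illustrative assumptions, not the released pipeline, and the resulting string would be sent to a chat model of the reader's choosing.

```python
# Sketch of DBKE prompt assembly. The rule list mirrors the prompt in this
# appendix; the function name and layout are hypothetical, for illustration.

DBKE_RULES = [
    "Bot empathetically communicates medical information in a simple manner.",
    "Bot admits limitations if unsure about information.",
    "Patient inquiries cover diverse topics (test results, medications, "
    "physical findings, symptoms) related to the passage.",
    "Bot asks follow-up questions for better understanding.",
    "Focus is on guiding the patient towards understanding their diagnosis.",
    "Bot explains its reasoning upon request.",
    "Patient provides lab values, imaging descriptions, or ECG findings "
    "explicitly.",
    "Bot inquires about patient's medical history, medications, symptoms, lab "
    "results, and imaging or ECG findings using non-expert language.",
    "Bot explains imaging or ECG features suggestive of a diagnosis without "
    "claiming to view images.",
    "Bot encourages the patient to consult a healthcare provider for further "
    "evaluation, not booking appointments or ordering tests directly.",
]


def build_dbke_prompt(passage: str) -> str:
    """Format a dense medical passage into a dialogue-synthesis prompt."""
    # Number the behavioral constraints 1..10, as in the appendix prompt.
    rules = "\n".join(f"{i}. {r}" for i, r in enumerate(DBKE_RULES, start=1))
    return (
        "Create a realistic chat dialogue between a patient and a medical "
        f"chat bot using the passage provided below:\n{rules}"
        f"\n\nPassage:\n{passage}"
    )


prompt = build_dbke_prompt("Statins lower blood cholesterol...")
```

One prompt per passage keeps the synthesized dialogues grounded in the source text while the fixed rules enforce a consistent conversational style across the whole corpus.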
