[论文翻译]ChatDoctor:基于大语言模型Meta-AI (LLaMA)和医学领域知识微调的医疗对话模型


原文地址:https://pmc.ncbi.nlm.nih.gov/articles/PMC10364849/pdf/cureus-0015-00000040895.pdf


ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge

ChatDoctor:基于大语言模型Meta-AI (LLaMA)和医学领域知识微调的医疗对话模型

© Copyright 2023

© 版权 2023

Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Li等人。本文是一篇开放获取文章,遵循知识共享署名许可协议CC-BY 4.0条款发布,允许在任何媒介上不受限制地使用、分发和复制,前提是注明原作者和来源。

Yunxiang Li 1 , Zihan Li 2 , Kai Zhang 3 , Ruilong Dan 4 , Steve Jiang 1 , You Zhang 1

李云翔 1, 李梓涵 2, 张凯 3, 丹瑞龙 4, Steve Jiang 1, 张游 1

  1. Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, USA
  2. Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA
  3. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
  4. College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, CHN

  1. 德克萨斯大学西南医学中心放射肿瘤科,达拉斯,美国
  2. 伊利诺伊大学厄巴纳-香槟分校计算机科学系,伊利诺伊州,美国
  3. 俄亥俄州立大学计算机科学与工程系,哥伦布市,美国
  4. 杭州电子科技大学计算机科学与技术学院,杭州,中国

Corresponding author: You Zhang, you.zhang@utsouthwestern.edu

通讯作者:You Zhang, you.zhang@utsouthwestern.edu

Abstract

摘要

Objective

目标

The primary aim of this research was to address the limitations observed in the medical knowledge of prevalent large language models (LLMs) such as ChatGPT, by creating a specialized language model with enhanced accuracy in medical advice.

本研究的主要目标是针对当前主流大语言模型(如ChatGPT)在医学知识方面的局限性,通过构建一个专业语言模型来提升医疗建议的准确性。

Methods

方法

We achieved this by adapting and refining the large language model meta-AI (LLaMA) using a large dataset of 100,000 patient-doctor dialogues sourced from a widely used online medical consultation platform. These conversations were cleaned and anonymized to respect privacy concerns. In addition to the model refinement, we incorporated a self-directed information retrieval mechanism, allowing the model to access and utilize real-time information from online sources like Wikipedia and data from curated offline medical databases.

我们通过使用从广泛使用的在线医疗咨询平台获取的10万条医患对话大型数据集,对大语言模型meta-AI (LLaMA)进行适配和优化,实现了这一目标。这些对话经过清洗和匿名化处理以保护隐私。除了模型优化外,我们还加入了自主信息检索机制,使模型能够访问并利用来自维基百科等在线资源的实时信息,以及精选离线医疗数据库中的数据。

Results

结果

The fine-tuning of the model with real-world patient-doctor interactions significantly improved the model's ability to understand patient needs and provide informed advice. By equipping the model with self-directed information retrieval from reliable online and offline sources, we observed substantial improvements in the accuracy of its responses.

通过真实医患互动数据对模型进行微调,显著提升了模型理解患者需求并提供专业建议的能力。通过赋予模型从可靠线上/线下资源自主检索信息的功能,我们观察到其回答准确率得到实质性提升。

Conclusion

结论

Our proposed ChatDoctor represents a significant advancement in medical LLMs, demonstrating a marked improvement in understanding patient inquiries and providing accurate advice. Given the high stakes and low error tolerance in the medical field, such enhancements in providing accurate and reliable information are not only beneficial but essential.

我们提出的ChatDoctor代表了医疗大语言模型(Large Language Model)领域的重大进步,显著提升了对患者问询的理解能力和提供准确建议的水平。鉴于医疗领域的高风险性和低容错率,这种在提供准确可靠信息方面的改进不仅有益,更是至关重要。

Categories: Family/General Practice, Medical Physics, Integrative/Complementary Medicine
Keywords: ai chatbot, large language model, llama, chat gpt, gpt

分类:家庭/全科医学、医学物理学、整合/补充医学
关键词:AI聊天机器人 (AI chatbot)、大语言模型 (Large Language Model)、Llama、Chat GPT、GPT

Introduction

引言

The development of instruction-following large language models (LLMs), such as ChatGPT [1], has gained significant attention due to their remarkable success in instruction understanding and human-like response generation. These auto-regressive LLMs [2] are pre-trained on web-scale natural language by predicting the next token and then fine-tuned to follow large-scale human instructions. These models show robust performance on a wide range of natural language processing (NLP) tasks and can generalize to unseen tasks, demonstrating their potential as unified solutions to various problems in natural language understanding, text generation, and conversational artificial intelligence. However, the exploration of such general-domain LLMs in the medical domain remains relatively scarce [3], despite their great potential in revolutionizing medical communication and decision-making [4]. In general, these common-domain models were not trained to capture the medical-domain knowledge specifically or in detail, resulting in models that often provide incorrect medical responses.

指令跟随大语言模型(LLM)的发展(如ChatGPT[1])因其在指令理解和类人响应生成方面的显著成功而备受关注。这些自回归大语言模型[2]通过预测下一个token在网页级自然语言数据上进行预训练,随后通过大规模人类指令进行微调。这些模型在广泛自然语言处理(NLP)任务中展现出强大性能,并能泛化至未见任务,展示了其作为自然语言理解、文本生成和会话人工智能领域各类问题的统一解决方案的潜力。然而,尽管这类通用领域大语言模型在革新医疗沟通与决策方面具有巨大潜力[4],其在医疗领域的探索仍相对匮乏[3]。通常,这些通用领域模型并未经过专门或详细的医学领域知识训练,导致模型经常提供错误的医疗响应。

By fine-tuning large language dialogue models on data from real-world patient-physician conversations, these models' ability to understand patients' inquiries and needs can be significantly improved. In addition, to further enhance the models' credibility, a knowledge brain based on online sources such as Wikipedia or offline sources like medical-domain databases can be incorporated into the models to retrieve real-time information to facilitate answering medical questions. The enhanced reliability of such answers is vital for the medical field, as a wrong answer can be detrimental to patients' treatments and well-being. In this study, we investigated the use of these two strategies, model fine-tuning and knowledge brain instillation, to enhance the capability of LLMs to serve as medical chatbots. Since the prevalent ChatGPT model is not open source, we used Meta's public large language model meta-AI (LLaMA) model as the platform for development and evaluation. In detail, we first trained a generic conversation model based on LLaMA, using 52K instruction-following data from Stanford University's Alpaca project [5]. We then fine-tuned the conversation model on our collected dataset of 100K patient-physician conversations from an online medical consultation website (www.healthcaremagic.com). Through extensive experiments, we found that the model fine-tuned on patient-physician dialogues outperforms ChatGPT in terms of precision, recall, and the F1 score [6]. In addition, the autonomous ChatDoctor model, which is able to retrieve the latest online/offline information, can also answer medical questions about relatively new diseases that are not included in the patient-physician training dialogues, for instance, the Monkeypox (Mpox) disease [7,8].

通过在真实医患对话数据上微调大语言对话模型,可以显著提升模型理解患者询问和需求的能力。此外,为进一步提高模型可信度,可为其集成基于在线资源(如Wikipedia)或离线医疗领域数据库的知识大脑,实时检索信息以辅助医学问答。此类答案可靠性的提升对医疗领域至关重要,因为错误答案可能损害患者的治疗和健康。本研究探索了两种策略(模型微调与知识大脑植入),用于增强大语言模型作为医疗聊天机器人的能力。由于当前流行的ChatGPT模型未开源,我们采用Meta公开的大语言模型meta-AI(LLaMA)作为开发和评估平台。具体而言,我们首先基于斯坦福大学Alpaca项目的52K指令跟随数据[5],在LLaMA上训练通用对话模型;随后在我们从在线医疗咨询网站(www.healthcaremagic.com)收集的10万条医患对话数据集上对该对话模型进行微调。大量实验表明,经医患对话微调的模型在精确率、召回率和F1值[6]上均优于ChatGPT。此外,具备在线/离线信息检索能力的自主ChatDoctor模型,还能回答训练数据未涵盖的新发疾病(如猴痘(Mpox))相关问题[7,8]。

How to cite this article

如何引用本文

Li Y, Li Z, Zhang K, et al. (June 24, 2023) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 15(6): e40895. DOI 10.7759/cureus.40895

Li Y, Li Z, Zhang K 等 (2023年6月24日) ChatDoctor: 基于大语言模型Meta-AI (LLaMA)和医学领域知识微调的医疗对话模型。Cureus 15(6): e40895. DOI 10.7759/cureus.40895

In summary, the ChatDoctor model has the following three main contributions:

ChatDoctor模型的主要贡献可归纳为以下三点:

  1. We established a methodology for fine-tuning LLMs for application in the medical field.
  2. We compiled and publicly shared a comprehensive dataset of 100,000 patient-doctor interactions to serve as a training resource for refining the LLM. This dataset includes a wealth of terms, knowledge, and expertise essential for training LLMs in the medical domain. Additionally, we curated and openly shared another dataset consisting of 10,000 patient-doctor conversations from a separate source (www.icliniq.com) to serve as a testing resource for the model. To support and stimulate future advancements in the development of dialogue models in healthcare, we provide public access to all relevant resources such as source code, datasets, and model weights. These can be found at github.com/Kent0n-Li/ChatDoctor.
  3. We proposed an autonomous ChatDoctor model that can retrieve online and offline medical domain knowledge to answer medical questions on up-to-date medical terms and diseases, which can potentially reduce the errors and hallucinations of LLMs [9-11].

  1. 我们建立了一套针对医疗领域应用的大语言模型 (LLM) 微调方法。
  2. 我们整理并公开共享了一个包含10万条医患互动的综合数据集,作为优化大语言模型的训练资源。该数据集涵盖了训练医疗领域大语言模型所需的大量术语、知识和专业经验。此外,我们还从独立数据源(www.icliniq.com)筛选并公开了另一组1万条医患对话数据集,作为模型的测试资源。为支持和促进未来医疗对话模型的开发进展,我们公开了所有相关资源,包括源代码、数据集和模型权重。这些资源可通过github.com/Kent0n-Li/ChatDoctor获取。
  3. 我们提出了一种自主的ChatDoctor模型,能够检索线上和线下医疗领域知识,以回答关于最新医学术语和疾病的医疗问题,这有望减少大语言模型的错误和幻觉现象 [9-11]。

This article was previously posted to the arXiv preprint server on March 24, 2023.

本文曾于2023年3月24日发布在arXiv预印本服务器上。

Materials And Methods

材料与方法

Collection and preparation of patient-physician conversation dataset

医患对话数据集的收集与准备

The initial step in refining our model involved curating a dataset comprising patient-physician interactions. Often, patients describe their symptoms in casual and somewhat superficial language. If we attempted to generate these dialogues synthetically, similar to Alpaca [5], it could lead to over-specific descriptions with limited diversity and relevance to the real world. Hence, we chose to gather authentic patient-doctor conversations, collecting around 100k such interactions from the online medical consultation website HealthCareMagic. The data were filtered both manually and automatically. Specifically, we automatically filtered out conversations that were too short, most of which did not answer anything of practical significance, and we manually filtered out responses that contained errors. To maintain privacy, we erased any information identifying the doctor or the patient and employed LanguageTool to rectify any grammatical errors. This dataset was labeled HealthCareMagic-100k, as illustrated in Figure 1. We also sourced roughly 10k additional conversations from another independent online medical consultation site, iCliniq, to test our model's performance. The iCliniq dataset was chosen randomly in a stratified manner to guarantee representation across various medical specialties. We also ensured that the selected data contained no identifiable patient information, in strict compliance with privacy and ethical standards.

优化模型的第一步是整理包含医患互动的数据集。患者通常会用随意且相对表面的语言描述症状。如果我们尝试像Alpaca [5]那样人工合成这些对话,可能导致描述过于具体,缺乏多样性且与现实世界关联有限。因此,我们选择收集真实的医患对话,从在线医疗咨询网站HealthCareMagic获取了约10万条此类互动数据。数据经过人工和自动双重筛选:自动过滤过短的对话(这类内容大多缺乏实际意义),人工剔除存在错误的回复内容。为保护隐私,我们删除了所有可识别医生或患者身份的信息,并使用LanguageTool修正语法错误。该数据集被标记为HealthCareMagic-100k,如图1所示。我们还从另一独立在线医疗咨询平台iCliniq获取了约1万条额外对话用于测试模型性能。iCliniq数据集采用分层随机抽样,确保覆盖不同医学专科领域,并严格遵循隐私与伦理标准,确保所选数据不含可识别患者信息。
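The automatic length filter described above can be sketched as follows. This is a minimal illustration only: the `Dialogue` type, the function names, and both word-count thresholds are assumptions made for the example, since the paper does not state the cutoff it used.

```python
from dataclasses import dataclass


@dataclass
class Dialogue:
    patient: str  # the patient's question
    doctor: str   # the physician's reply


def keep_dialogue(d: Dialogue,
                  min_patient_words: int = 5,
                  min_doctor_words: int = 10) -> bool:
    """Return True when both turns meet the (hypothetical) minimum lengths."""
    return (len(d.patient.split()) >= min_patient_words
            and len(d.doctor.split()) >= min_doctor_words)


def filter_dialogues(dialogues: list[Dialogue]) -> list[Dialogue]:
    """Drop conversations that are too short to carry useful medical content."""
    return [d for d in dialogues if keep_dialogue(d)]
```

Manual review of response errors and the removal of identifying information are separate steps that this sketch does not attempt to model.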


FIGURE 1: A summary of the process involved in gathering the patient-physician conversation dataset and the steps involved in training the ChatDoctor model.

图 1: 患者-医生对话数据集收集流程及ChatDoctor模型训练步骤概览

Creation of external knowledge database

创建外部知识库

LLMs typically predict the next word in a sequence, leading to potential inaccuracies or erroneous responses to questions (hallucinations) [12]. In addition, the model's output can be unpredictable to some extent, which is unacceptable in the medical field. However, the accuracy of these models could be significantly improved if they could generate or assess responses based on a reliable knowledge database, depicted in Figure 2. Consequently, we curated a database (sample shown in Figure 3) encompassing diseases, their symptoms, relevant medical tests/treatment procedures, and potential medications. This database serves as an external and offline knowledge brain for ChatDoctor. Continually updatable without requiring model retraining, this database can be tailored to specific diseases or medical specialties. We utilized MedlinePlus to construct this disease database, but other reliable sources can also be used. Additionally, online information sources like Wikipedia can supplement the knowledge base of our autonomous model. It is worth noting that Wikipedia may not be a fully reliable database, but our framework can be easily extended to more reliable online databases such as reputable academic journals.

大语言模型通常预测序列中的下一个词,这可能导致对问题的回答存在潜在不准确或错误(幻觉)[12]。此外,模型输出在一定程度上具有不可预测性,这在医疗领域是不可接受的。然而,如果这些模型能够基于可靠的知识数据库生成或评估回答(如图 2 所示),其准确性将显著提升。为此,我们构建了一个涵盖疾病、症状、相关医学检查/治疗流程及潜在药物的数据库(示例见图 3),作为 ChatDoctor 的外部离线知识大脑。该数据库无需重新训练模型即可持续更新,并能针对特定疾病或医学专科定制。我们使用 MedlinePlus 构建该疾病数据库,但其他可靠来源同样适用。此外,维基百科等在线信息源可补充我们自主模型的知识库。值得注意的是,维基百科并非完全可靠的数据库,但我们的框架可轻松扩展至更可靠的在线数据库(如权威学术期刊)。


Q: What is the treatment for Mpox?

FIGURE 2: Overview of the autonomous ChatDoctor model based on information retrieval from an external knowledge brain.

图 2: 基于外部知识大脑信息检索的自主ChatDoctor模型概览

Disease Database

疾病数据库

Disease: Appendicitis

疾病:阑尾炎

Symptoms: Pain in the abdomen, often on the right side. It is usually sudden and gets worse over time. Other symptoms may include: Swelling in the abdomen, Loss of appetite, Nausea and vomiting, Constipation or diarrhea, Inability to pass gas, Low fever

症状:腹部疼痛,常见于右侧。通常突然发作并随时间加重。其他症状可能包括:腹部肿胀、食欲不振、恶心呕吐、便秘或腹泻、无法排气、低烧

Further test: Abdominal and pelvic CT (Computed Tomography), Abdominal ultrasound, Blood test to check for signs of infection, Urine test to rule out a urinary tract infection

进一步检查:腹部和盆腔CT (Computed Tomography)、腹部超声、血液检查以确认感染迹象、尿液检查以排除尿路感染

Treatment: Appendectomy, cefotetan (Cefotan), cefotaxime (Claforan), piperacillin and tazobactam (Zosyn), ampicillin and sulbactam (Unasyn), ceftriaxone (Rocephin), cefepime (Maxipime), gentamicin (Garamycin), meropenem (Merrem), ertapenem (Invanz), metronidazole (Flagyl), clindamycin (Cleocin), levofloxacin (Levaquin). In the case of a ruptured appendix, doctors will prescribe an intravenous (IV) antibiotic to treat abdominal infection.

治疗:阑尾切除术,头孢替坦(Cefotan)、头孢噻肟(Claforan)、哌拉西林他唑巴坦(Zosyn)、氨苄西林舒巴坦(Unasyn)、头孢曲松(Rocephin)、头孢吡肟(Maxipime)、庆大霉素(Garamycin)、美罗培南(Merrem)、厄他培南(Invanz)、甲硝唑(Flagyl)、克林霉素(Cleocin)、左氧氟沙星(Levaquin)。若阑尾穿孔,医生会开具静脉注射抗生素以治疗腹腔感染。

Disease: Allergic rhinitis

疾病:过敏性鼻炎

Symptoms: Symptoms that occur shortly after you come into contact with the substance you are allergic to may include: Itchy nose, mouth, eyes, throat, skin, or any area, Problems with smell, Runny nose, Sneezing, Watery eyes. Symptoms that may develop later include: Stuffy nose (nasal congestion), Coughing, Clogged ears and decreased sense of smell, Sore throat, Dark circles under the eyes, Puffiness under the eyes, Fatigue and irritability, Headache.

症状:
接触过敏原后短期内可能出现的症状包括:鼻、口腔、眼睛、喉咙、皮肤或任何部位发痒,嗅觉异常,流鼻涕,打喷嚏,流泪。
后续可能发展的症状包括:鼻塞,咳嗽,耳闷和嗅觉减退,喉咙痛,黑眼圈,眼睑浮肿,疲劳易怒,头痛。

Further test: Allergy testing, Complete blood count (CBC) testing

进一步检查:过敏测试、全血细胞计数 (CBC) 检测

Treatment: Antihistamines, Antihistamine nasal sprays, Corticosteroids, Decongestants

治疗:抗组胺药、抗组胺鼻喷剂、皮质类固醇、减充血剂

Disease: Malignant otitis externa

疾病:恶性外耳道炎

Symptoms: Ongoing drainage from the ear that is yellow or green and smells bad. Ear pain deep inside the ear. Pain may get worse when you move your head. Hearing loss, Itching of the ear or ear canal, Fever, Trouble swallowing, Weakness in the muscles of the face.

症状:耳朵持续流出黄色或绿色且有异味的分泌物。耳部深处疼痛,头部活动时疼痛可能加剧。听力下降、耳部或耳道瘙痒、发热、吞咽困难、面部肌肉无力。

Further test: Look into the ear for signs of an outer ear infection. The head around and behind the ear may be tender to touch. A nervous system (neurological) exam may show that the cranial nerves are affected. If there is any drainage, the provider may send a sample of it to the lab. The lab will culture the sample to try to find the cause of the infection. To look for signs of a bone infection next to the ear canal, the following tests may be done: CT scan of the head, MRI scan of the head, Radionuclide scan.

进一步检查:检查外耳是否有感染迹象。耳周及耳后头部区域可能有触痛感。神经系统检查可能显示颅神经受累。若有分泌物排出,医生可能取样送检实验室。实验室将进行培养以确定感染病原体。为排查耳道邻近骨组织感染,可能需进行以下检查:头部CT扫描、头部MRI扫描、放射性核素扫描。

Treatment: The goal of treatment is to cure the infection. Treatment often lasts for several months, because it is difficult to treat the bacteria and reach an infection in bone tissue. You will need to take antibiotic medicines for a long period of time. The medicines may be given through a vein (intravenously), or by mouth. Antibiotics should be continued until scans or other tests show the inflammation has gone down. Dead or infected tissue may need to be removed from the ear canal. In some cases surgery may be needed to remove dead or damaged tissue in the skull.

治疗:
治疗目标是根除感染。由于细菌难以杀灭且骨组织感染难以触及,疗程通常持续数月。患者需长期服用抗生素,给药方式包括静脉注射或口服。抗生素应持续使用至影像学或其他检查显示炎症消退。耳道内的坏死或感染组织可能需要手术清除。部分病例需通过开颅手术切除颅骨内的坏死或受损组织。

FIGURE 3: Some samples in our offline disease database consist of symptoms, clinical test/treatment approaches, and medication suggestions.

图 3: 我们的离线疾病数据库中的部分样本包含症状、临床检测/治疗方案和用药建议。

Development of autonomous ChatDoctor with knowledge brain

开发具备知识大脑的自主ChatDoctor

Armed with the external knowledge brain, i.e., Wikipedia or our custom disease database, ChatDoctor can more accurately answer patient inquiries by retrieving reliable information. Upon establishing the external knowledge brain, we devised a mechanism to enable ChatDoctor to autonomously retrieve necessary information to answer queries. This was accomplished by constructing appropriate prompts to input into the ChatDoctor model. Specifically, we designed keyword mining prompts (Figure 4) as the initial step for ChatDoctor to extract key terms from patient queries for relevant knowledge search. Based on these keywords, top-ranked information was retrieved from the knowledge brain using a term-matching retrieval system [13]. Given the LLM's word limit (token size), we divided the texts to be read into equal sections and ranked each section by the number of keyword hits. The ChatDoctor model then reads the first N sections (five used in our study) sequentially, selecting and summarizing pertinent information via prompts (Figure 5). Ultimately, the model processes and compiles all the knowledge entries to generate a final response (Figure 6). This information retrieval approach ensures patients receive precise, well-informed responses backed by credible sources and can serve as a verification method for responses generated by ChatDoctor from prior knowledge.

借助外部知识大脑(即维基百科或我们自定义的疾病数据库),ChatDoctor能够通过检索可靠信息更准确地回答患者咨询。在建立外部知识大脑后,我们设计了一套机制使ChatDoctor能自主检索必要信息来回答问题。这是通过构建适当的提示词输入ChatDoctor模型实现的。具体而言,我们首先设计了关键词挖掘提示(图4),让ChatDoctor从患者查询中提取关键术语以进行相关知识搜索。基于这些关键词,系统使用术语匹配检索方法[13]从知识大脑中获取排名靠前的信息。考虑到大语言模型的字数限制(token大小),我们将待阅读文本分割为等长段落,并按关键词命中次数对每段进行排序。ChatDoctor模型随后依次读取前N段(本研究采用五段),通过提示词筛选并总结相关信息(图5)。最终,模型处理并整合所有知识条目生成最终回复(图6)。这种信息检索方法确保患者获得基于可信来源的精准答复,同时可作为验证ChatDoctor基于先验知识生成的回复的一种方法。
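The section-splitting and ranking step described above can be sketched as below. This is a simplified illustration under stated assumptions: sections are measured in whitespace-separated words rather than model tokens, keywords are single words, and the function names and default section size are hypothetical. Only the "rank by keyword hits, read the top N (five in the study)" logic comes from the text.

```python
import re


def split_into_sections(text: str, words_per_section: int) -> list[str]:
    """Cut the text into consecutive sections of equal word count (the last may be shorter)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_section])
            for i in range(0, len(words), words_per_section)]


def keyword_hits(section: str, keywords: list[str]) -> int:
    """Count case-insensitive occurrences of each single-word keyword in a section."""
    tokens = re.findall(r"[a-z0-9]+", section.lower())
    return sum(tokens.count(k.lower()) for k in keywords)


def top_sections(text: str, keywords: list[str],
                 words_per_section: int = 50, n: int = 5) -> list[str]:
    """Return the n sections with the most keyword hits.

    sorted() is stable, so sections with equal hit counts keep document order.
    """
    sections = split_into_sections(text, words_per_section)
    ranked = sorted(sections, key=lambda s: keyword_hits(s, keywords), reverse=True)
    return ranked[:n]
```

In the full pipeline the keywords would come from the model's answer to the keyword-extraction prompt, and each returned section would be passed back to the model for summarization before the final answer is composed.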

Prompt for extracting keywords

提取关键词的提示词

FIGURE 4: Autonomously extract keywords for information retrieval.

图 4: 自主提取关键词用于信息检索。

Prompt for autonomous knowledge retrieval

自主知识检索提示

FIGURE 5: Autonomous information retrieval from the disease database through the prompt.

图 5: 通过提示词从疾病数据库中自主检索信息

Prompt to make a final answer

生成最终答案的提示词

FIGURE 6: Instruct the ChatDoctor to read the retrieved domain knowledge and provide a reliable answer.

图 6: 指导 ChatDoctor 阅读检索到的领域知识并提供可靠答案。

Model training

模型训练

We developed the ChatDoctor model using Meta's publicly accessible LLaMA-7B model [14], a decoder-only Transformer. Despite its relatively modest 7 billion parameters, the LLaMA model exhibits comparable performance to the much larger GPT-3 model (with 175 billion parameters) across several NLP benchmarks. This performance enhancement was achieved by diversifying the training data rather than increasing network parameters. Specifically, LLaMA was trained on 1.0 trillion tokens from publicly accessible data sources like Common Crawl and arXiv documents. We used conversations from HealthCareMagic-100k to fine-tune the LLaMA model [15] in line with the Stanford Alpaca [5] training methodology. The model was first fine-tuned with Alpaca's data to acquire basic conversation skills, followed by further refinement on HealthCareMagic-100k using six A100 GPUs for three hours. The training process used the following hyperparameters: total batch size of 192, learning rate of $2\times10^{-5}$, 3 epochs, maximum sequence length of 512 tokens, and a warmup ratio of 0.03, with no weight decay.

我们基于Meta公司公开的LLaMA-7B模型[14]开发了ChatDoctor模型,该模型采用仅含解码器的Transformer架构。尽管LLaMA模型仅有70亿参数,但在多项自然语言处理基准测试中表现出与参数量达1750亿的GPT-3模型相当的性能。这种性能提升是通过扩展训练数据多样性而非增加网络参数实现的。具体而言,LLaMA使用来自Common Crawl、arXiv文档等公开数据源的1万亿token进行训练。我们采用斯坦福Alpaca[5]的训练方法,使用HealthCareMagic-100k的对话数据对LLaMA进行微调[15]。模型首先通过Alpaca数据获得基础对话能力,随后使用6块A100 GPU耗时3小时在HealthCareMagic-100k上进一步优化。训练超参数设置为:总批次大小192、学习率$2\times10^{-5}$、3个训练周期、最大序列长度512 token、预热比例0.03,且未采用权重衰减。
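To make the reported hyperparameters concrete, the sketch below collects them in a config object and derives the implied step counts. The class and function names are hypothetical, the dataset size of 100,000 is the approximate figure from the text, and the ceiling-based rounding is an assumption; training frameworks differ in how they round steps and warmup.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values reported in the paper.
    total_batch_size: int = 192
    learning_rate: float = 2e-5
    epochs: int = 3
    max_seq_len: int = 512
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0
    num_gpus: int = 6


def schedule(cfg: FineTuneConfig, num_examples: int) -> dict[str, int]:
    """Derive step counts from the config (rounding up is an assumption)."""
    steps_per_epoch = math.ceil(num_examples / cfg.total_batch_size)
    total_steps = steps_per_epoch * cfg.epochs
    return {
        "examples_per_gpu_per_step": cfg.total_batch_size // cfg.num_gpus,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * cfg.warmup_ratio),
    }
```

With roughly 100,000 dialogues this gives 521 optimizer steps per epoch, 1,563 steps in total, and about 47 warmup steps, with each GPU contributing 32 examples per step (whether through a per-device batch of 32 or gradient accumulation is not stated in the text).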

Results

结果

To evaluate the proficiency of the autonomous ChatDoctor model, we tested it using a variety of contemporary medical queries. One of these included a question related to "Monkeypox" (abbreviated as Mpox), as illustrated in Figure 7. The name Mpox was designated by the World Health Organization (WHO) on November 28, 2022, making it a relatively novel term. While ChatGPT was incapable of providing a satisfactory response, ChatDoctor, due to its autonomous knowledge retrieval feature, was able to extract pertinent information about Monkeypox from Wikipedia and deliver a precise answer. Similarly, for more general medical inquiries such as "Otitis," as shown in Figure 8, ChatDoctor was able to provide a reliable response following the retrieval of relevant knowledge. In another instance, a question about "Daybue," a drug that received FDA approval in March 2023, was accurately addressed by our model after it autonomously retrieved relevant information, demonstrating an advantage over ChatGPT, as shown in Figure 9.

为了评估自主 ChatDoctor 模型的熟练程度,我们使用各种当代医学问题对其进行了测试。其中一个问题涉及"猴痘" (Mpox),如图 7 所示。Mpox 是世界卫生组织 (WHO) 于 2022 年 11 月 28 日新确定的名称,因此是一个相对较新的术语。虽然 ChatGPT 无法提供令人满意的回答,但 ChatDoctor 凭借其自主知识检索功能,能够从维基百科提取有关猴痘的相关信息并给出准确答案。同样地,对于"耳炎"等更一般的医疗咨询(如图 8 所示),ChatDoctor 在检索相关知识后能够提供可靠回答。另一个例子中,针对 2023 年 3 月获 FDA 批准的药物"Daybue"的提问,我们的模型通过自主检索相关信息后给出了准确解答,如图 9 所示,这展现了相较于 ChatGPT 的优势。

Example of Mpox

Mpox示例

Q: How to test for Mpox?

Q: 如何检测猴痘 (Mpox)?

FIGURE 7: Comparison between the ChatGPT and the autonomous ChatDoctor for relatively new medical diseases/terms. The ChatGPT cannot recognize the word Mpox (aka, Monkeypox), while our ChatDoctor can provide the precise answer for the relevant medical tests of Mpox, with the help of the external knowledge brain.

图 7: ChatGPT与自主ChatDoctor在较新医学疾病/术语上的对比。ChatGPT无法识别Mpox(即猴痘)一词,而我们的ChatDoctor借助外部知识库能准确回答关于Mpox相关医学检测的问题。

Example of Otitis

耳炎示例

Q: How to treat Otitis?

Q: 如何治疗耳炎?

ChatGPT: The treatment of otitis depends on the underlying cause and severity of the condition. Treatment may involve antibiotics, pain relief, ear drops, warm compresses, rest, and prevention measures such as avoiding foreign objects in the ear, keeping ears clean and dry, and protecting ears from loud noises and sudden changes in air pressure. It is important to see a doctor for proper diagnosis and treatment.

ChatGPT: 耳炎的治疗取决于病因和病情严重程度。治疗可能包括抗生素、止痛药、滴耳液、热敷、休息等,以及预防措施如避免耳道异物、保持耳部清洁干燥、防护耳朵免受噪音和气压骤变影响。务必就医以获得准确诊断和规范治疗。

ChatDoctor: Treatment depends on the severity of the symptoms. If there is only mild discomfort, then pain relievers such as ibuprofen or acetaminophen can help. Antibiotics may be prescribed if the infection has spread beyond the outer ear. For more serious forms of Otitis, surgery may be needed to remove infected areas of the ear.

ChatDoctor: 治疗方案取决于症状的严重程度。若仅有轻微不适,可使用布洛芬或对乙酰氨基酚等止痛药。若感染扩散至外耳以外,则可能需要开具抗生素。对于更严重的耳炎 (Otitis) 病例,可能需通过手术切除耳部感染区域。

FIGURE 8: Comparison between the ChatGPT and the autonomous ChatDoctor. The ChatGPT provided a more general answer about otitis, while the ChatDoctor provided a more specialized response about the treatments of otitis, with the help of the external knowledge brain.

图 8: ChatGPT与自主ChatDoctor的对比。ChatGPT提供了关于耳炎(otitis)的更通用回答,而ChatDoctor借助外部知识大脑针对耳炎治疗方案给出了更专业的响应。


Example of Daybue

Daybue示例

FIGURE 9: Comparison between the ChatGPT and the autonomous ChatDoctor. The ChatGPT is unfamiliar with the "Daybue" medication which received approval from the Food and Drug Administration (FDA) in early 2023. The ChatDoctor accurately pointed out the purpose of Daybue (trofinetide), with the help of the external knowledge brain.

图 9: ChatGPT与自主ChatDoctor的对比。ChatGPT对2023年初获得美国食品药品监督管理局(FDA)批准的"Daybue"药物不熟悉。ChatDoctor在外置知识库的帮助下准确指出了Daybue(曲芬肽)的用途。

For a quantitative evaluation of ChatDoctor's performance, we utilized questions from the independently sourced iCliniq database as inputs, with the corresponding responses from actual human physicians serving as the benchmark or "ground truth." We compared these with responses generated by both ChatDoctor and ChatGPT. In this evaluation, we employed BERTScore [6] to compute Precision, Recall, and F1 scores for both ChatDoctor and ChatGPT. BERTScore leverages pre-trained BERT to match words in the candidate and reference sentences via cosine similarity. We chose it for its ability to evaluate the semantic similarity between our model's responses and the reference sentences, which we believe is of utmost importance in the medical context. This method of evaluation closely aligns with human judgment at both sentence and system levels. In all three metrics, a higher value denotes a better match. As seen in the results in Table 1, the fine-tuned ChatDoctor model outperforms ChatGPT across all three metrics, with specific dialogue examples detailed in Figures 10-13.

为定量评估ChatDoctor的表现,我们采用独立来源的iCliniq数据库问题作为输入,并以真实医生的对应回答作为基准或"标准答案"。我们将这些回答与ChatDoctor和ChatGPT生成的回答进行对比。在此评估中,我们使用BERTScore [6]计算ChatDoctor和ChatGPT的精确率、召回率和F1分数。BERTScore利用预训练的BERT模型通过余弦相似度匹配候选句子和参考句子中的词汇,选择该方法是因为它能评估模型回答与参考句子之间的语义相似度——我们认为这在医疗场景中至关重要。该评估方法在句子级和系统级都与人类判断高度一致。三项指标中,数值越高表示匹配度越好。如表1所示结果,经过微调的ChatDoctor模型在所有指标上均优于ChatGPT,具体对话示例详见图10-13。
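The greedy matching behind BERTScore can be illustrated with a small sketch. It assumes the token embeddings are already available as plain vectors (in the real metric they come from a pre-trained BERT model) and it omits the optional importance weighting and baseline rescaling, so it shows the core idea rather than reproducing the evaluation in the paper. The function names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(candidate: list[list[float]],
                     reference: list[list[float]]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy token matching.

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, a two-token candidate in which only one token matches a single-token reference scores a precision of 0.5 and a recall of 1.0, which is why a long answer that contains the reference content plus extra material keeps high recall but loses precision.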

TABLE 1: Quantitative comparison with BERTScore between ChatDoctor and ChatGPT. The p-values in the table are derived from our paired t-test.

| | ChatGPT | ChatDoctor | P-value |
|---|---|---|---|
| Precision | 0.837±0.0188 | 0.8444±0.0185 | $6.66\times10^{-195}$ |
| Recall | 0.8445±0.0164 | 0.8451±0.0157 | $4.71\times10^{-4}$ |
| F1 Score | 0.8406±0.0143 | 0.8446±0.0138 | $2.14\times10^{-111}$ |

表 1: ChatDoctor与ChatGPT基于BERTScore的量化对比。表中p值来自配对t检验。

| | ChatGPT | ChatDoctor | P值 |
|---|---|---|---|
| 精确率 | 0.837±0.0188 | 0.8444±0.0185 | $6.66\times10^{-195}$ |
| 召回率 | 0.8445±0.0164 | 0.8451±0.0157 | $4.71\times10^{-4}$ |
| F1分数 | 0.8406±0.0143 | 0.8446±0.0138 | $2.14\times10^{-111}$ |
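The paired t-test behind the p-values in Table 1 compares the two models question by question, which is why small mean differences can still yield very small p-values when the sample is large. The sketch below computes only the t statistic and degrees of freedom; the function name is hypothetical, and converting the statistic to a p-value requires the Student t distribution, which is left out here.

```python
import math
import statistics


def paired_t_statistic(scores_a: list[float], scores_b: list[float]) -> tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples, testing mean(b - a) = 0."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1
```

Each pair would consist of the BERTScore values that ChatGPT and ChatDoctor obtained on the same iCliniq question, so per-question difficulty cancels out in the differences.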

Example 1: ChatGPT vs ChatDoctor.

示例1: ChatGPT与ChatDoctor对比

Q: Hi doctor, I am getting one-sided headache. I do not get them often. They are behind my right eye and temple area. Now this time I am also fighting with heavy cold and cough. It came on late morning today and it hurts more when I cough or shake my head or bend over. Prior to this, I had the same headache about four months ago. At that time it came on mid afternoon and lasted till next morning. Prior to that, I had the same headache exactly on the right side and lasted for a day. I am on CCB and Protonix. I am scared of brain tumor. Please help

问:医生您好,我最近出现单侧头痛。头痛并不频繁,位置在右眼和太阳穴后方。这次发作还伴有重感冒和咳嗽症状,今晨晚些时候开始疼痛,咳嗽、摇头或弯腰时痛感加剧。四个月前曾出现过相同头痛,当时是下午三点左右发作并持续到次日清晨。更早之前也有过完全同侧的头痛,持续了一整天。目前我正在服用CCB(钙通道阻滞剂)和质子泵抑制剂。很担心是脑肿瘤,请您帮助诊断。

iCliniq: Hi. Based on what you have told, this is more likely to be due to sinusitis. Sinuses are small air filled cavities in your skull bone. They can get infected and when they are infected, you will get headaches behind eyes. Such headaches will be more common during cold and cough. Brain tumor headache will be on both sides, more in forehead and you will also have vomiting in that case. Based on what you have told, this is more likely due to simple sinus infection instead of brain tumor. CT brain and CT paranasal sinuses.

iCliniq: 您好。根据您的描述,这种情况更可能是由鼻窦炎引起的。鼻窦是颅骨内充满空气的小腔室,当它们发生感染时,会引起眼部后方的头痛。这类头痛在感冒咳嗽期间更为常见。脑瘤引起的头痛通常位于双侧前额区域,并伴随呕吐症状。根据您提供的信息,这更可能是单纯的鼻窦感染而非脑瘤。建议进行脑部CT和鼻窦CT检查。

ChatGPT: Hello! I'm sorry to hear you're experiencing these headaches. Based on the location of the pain, it sounds like it may be a tension headache or a migraine. These types of headaches can sometimes be triggered by colds or other illnesses. However, it's always best to get checked out by your doctor to rule out any other underlying conditions. In terms of a brain tumor, that is a very rare cause of headaches. Most people who experience headaches do not have a brain tumor. That being said, if you have any concerning symptoms such as vision changes, weakness, or numbness, it is important to seek medical attention right away. In the meantime, you can try taking over-the-counter pain relievers such as ibuprofen or acetaminophen to help with the headache. It's also important to get plenty of rest and stay hydrated. If the headaches persist or become more frequent, please make an appointment with your doctor. Take care!

ChatGPT:你好!听到你正遭受头痛困扰我很抱歉。根据疼痛部位判断,这可能是紧张性头痛或偏头痛。这类头痛有时会由感冒或其他疾病引发。不过最好还是让医生检查一下,以排除其他潜在病因。

关于脑肿瘤的可能性,这是非常罕见的头痛诱因,绝大多数头痛患者并不存在脑肿瘤。但如果你出现视力变化、肢体无力或麻木等警示症状,请务必立即就医。

目前你可以尝试服用布洛芬或对乙酰氨基酚等非处方止痛药缓解症状,同时保持充足休息和水分摄入。若头痛持续或发作频率增加,请及时预约医生就诊。保重!

ChatDoctor: Hi, Thank you for posting your query. First, I would like to reassure you that there is no need to worry about the brain tumor. Also, we should keep migraine apart as well. The most likely cause of your headaches could be sinusitis. This can be confirmed by doing an X-ray of the face. Treatment would consist of using decongestant tablets/nasal sprays. Best wishes, Chat Doctor.

ChatDoctor:您好,感谢您的咨询。首先请放心,您无需担忧脑肿瘤问题,同时也可以排除偏头痛的可能性。您头痛最可能的原因是鼻窦炎,通过拍摄面部X光片即可确诊。治疗方案包括使用减充血药片/鼻腔喷雾剂。祝好,Chat Doctor。

FIGURE 10: Example 1: a patient suffering from a unilateral headache expressed concerns about a potential association with a brain tumor. Our ChatDoctor accurately proposed sinusitis as a possible cause for the headache, mirroring the diagnosis provided by the physician from iCliniq. On the other hand, ChatGPT failed to deliver a congruent interpretation regarding the root cause of the one-sided headache.

图 10: 示例1:一名单侧头痛患者对可能与脑肿瘤相关的担忧。我们的ChatDoctor准确提出了鼻窦炎作为头痛的可能原因,与iCliniq医生提供的诊断一致。另一方面,ChatGPT未能对单侧头痛的根本原因给出相符的解释。

Example 2: ChatGPT vs ChatDoctor.

示例2: ChatGPT与ChatDoctor对比

Q: Hi doctor, I have one white lump around the tonsil area in my throat. It has been there a good few months now and the area is swollen. Is this cancer?

Q: 医生您好,我喉咙扁桃体区域有一个白色肿块,已经存在好几个月了,周围还有肿胀。这是癌症吗?

iCliniq: Hi. Cancer cannot be diagnosed by inspection only. To exclude tonsillar stone you have to perform lateral neck X-ray and show me it later on. I advise you to use a mouthwash and gargle your pharynx then expectorate it many times during the day for a week. Then if the white spot still exists you may have to undergo tonsillectomy. Then examine the tonsillar tissue histologically to detect its nature. For more information consult an ENT-otolaryngologist online. Take care.

iCliniq: 您好。仅凭外观无法确诊癌症。为排除扁桃体结石 (tonsillar stone) 的可能,您需要拍摄颈部侧位X光片并后续提供给我查看。建议您使用漱口水进行咽喉漱洗,并在一周内每日多次咳吐分泌物。若白斑仍持续存在,则可能需接受扁桃体切除术 (tonsillectomy),之后对切除组织进行组织学 (histologically) 检查以明确性质。更多信息请在线咨询耳鼻喉科 (ENT) 医师。保重。

ChatGPT: As an AI language model, I am not qualified to provide medical diagnosis or advice. However, a persistent white lump in the tonsil area that has been present for several months with associated swelling should be evaluated by a medical professional. It could be a benign condition such as a tonsillar cyst or an inflamed lymph node, or something more serious like oral cancer. Therefore, it is recommended that you seek medical attention from a doctor, otolaryngologist (ENT), or head and neck surgeon for further evaluation, diagnosis, and appropriate management.

ChatGPT:作为AI语言模型,我无法提供医学诊断或建议。但扁桃体区域持续数月的白色肿块伴随肿胀,应由专业医疗人员评估。可能为良性病症(如扁桃体囊肿或淋巴结发炎),也可能是口腔癌等严重问题。建议尽快就医,由耳鼻喉科(ENT)或头颈外科医生进行进一步检查、诊断和规范治疗。

ChatDoctor: Hi dear, thanks for your contact to Chat Doctor. I read your query and understand your concern. As per your complaint it seems that you may be having some kind of enlarged lymph nodes due to some infective cause like reactive hyperplasia or tuberculosis as a possibility. In case if it's not responding to treatment then biopsy should be taken to rule out malignancy. So consult an ENT specialist and get examined. Doctor may order blood test, CT scan or ultrasound of neck to confirm the diagnosis. Doctor may prescribe antibiotics, anti-inflammatory medications depending upon the cause. Hope this will help you for sure. Contact further if follow up needed. Best regards.

ChatDoctor: 亲爱的用户您好,感谢联系Chat Doctor。我已阅读您的咨询并理解您的担忧。根据您描述的症状,可能是由于感染性原因(如反应性增生或结核)导致的淋巴结肿大。若治疗无效,则需进行活检以排除恶性肿瘤可能。建议您就诊耳鼻喉科专家进行详细检查。医生可能会要求血常规、颈部CT或超声检查以明确诊断,并根据病因开具抗生素或抗炎药物。希望以上建议对您有所帮助。如需后续跟进请随时联系。此致 敬礼。

FIGURE 11: Example 2: a patient reported having a white lump in their throat for several months and expressed concerns about potential cancer. All three entities, iCliniq, ChatGPT, and ChatDoctor suggested that the patient could be dealing with abnormally enlarged lymph nodes. Both iCliniq and ChatDoctor additionally recommended that a biopsy and radiological diagnosis would be necessary if initial treatments proved unsuccessful. However, ChatGPT's response was limited to advising the patient to consult with an Ear, Nose, and Throat (ENT) specialist.

图 11: 示例2:一位患者报告称喉咙有白色肿块已持续数月,并表达了对潜在癌症的担忧。iCliniq、ChatGPT和ChatDoctor三个实体均认为患者可能面临异常肿大的淋巴结问题。iCliniq和ChatDoctor进一步建议,若初始治疗无效则需进行活检和放射学诊断。而ChatGPT的回应仅建议患者咨询耳鼻喉(ENT)专科医生。

Example 3: ChatGPT vs ChatDoctor.

示例3: ChatGPT与ChatDoctor对比

Q: Hi doctor, While working out today, I was doing a shoulder press. I was doing it too fast, and at one point, I got a sharp pain in my back, near where my ribs meet my spine. Since then, I am also having very sharp pain when inhaling and exhaling, as well as when rotating my torso or neck. Should I need to go for urgent care? Currently, I am taking Zyrtec, Amino acid supplements for energy and muscle recovery, hormonal IUD.

问:医生您好,今天锻炼时我在做肩部推举动作。由于动作过快,背部靠近肋骨与脊柱连接处突然出现锐痛。此后呼吸(吸气和呼气)、扭转躯干或颈部时都会引发剧烈疼痛。是否需要紧急就医?目前我正在服用:Zyrtec(过敏药)、氨基酸补充剂(用于能量补充和肌肉恢复)、激素型宫内节育器。

iCliniq: Hi. It looks like you have muscle spasm or strain. You can use over-the-counter Motrin 600 mg (Ibuprofen) twice or thrice a day with the meal for two to three days and there are topical analgesics available for the gentle massage. It will get better within two to three days. If it does not get relieved, you can see your doctor. Take rest from the shoulder exercise for a week. Get immediate help if you feel that this pain is significantly disturbing breathing activity. For more information consult an internal medicine physician online.

iCliniq:您好。根据症状描述,您可能出现了肌肉痉挛或拉伤。建议服用非处方药布洛芬 (Motrin) 600毫克,每日2-3次随餐服用,持续2-3天,同时可配合外用镇痛药膏轻柔按摩。症状通常会在2-3天内缓解。若未见好转,请及时就医。建议暂停肩部训练一周。若疼痛严重影响呼吸功能,请立即寻求医疗援助。更多诊疗建议可在线咨询内科医师。

ChatGPT: Yes, it is recommended that you seek urgent medical care for your symptoms. The sharp pain you are experiencing when inhaling and exhaling, as well as when rotating your torso or neck, could be a sign of a rib or back injury. It is important to have an evaluation with a healthcare provider to rule out any serious injuries or conditions that could require further treatment or management. Make sure to inform your provider about your current medications and supplements.

ChatGPT:是的,鉴于您的症状,建议立即就医。您描述的吸气和呼气时、以及扭转躯干或颈部时出现的尖锐疼痛,可能是肋骨或背部受伤的征兆。必须由医疗专业人员评估以排除需要进一步治疗或管理的严重损伤或病症。请务必告知医生您目前服用的药物和补充剂信息。

ChatDoctor: 1. The symptoms seem to be due to musculoskeletal origin and not related to any organic pathology. 2. Muscles of back, chest wall and abdomen get strained during vigorous physical activity like weight training. 3. This can cause soreness, stiffness and spasms. 4. This is a self limiting condition and will subside over few days time with adequate rest. 5. You may take anti-inflammatory medications such as ibuprofen for relief from your symptoms. 6. For local application, you can apply warm compresses over affected areas. 7. I hope this information has been both informative and helpful for you. Wish you good health.

ChatDoctor:

  1. 症状似乎源于肌肉骨骼问题,与任何器质性病理无关。
  2. 背部、胸壁和腹部肌肉在剧烈体力活动(如负重训练)中可能拉伤。
  3. 这会导致酸痛、僵硬和痉挛。
  4. 此为自限性症状,充分休息数日后会消退。
  5. 可服用抗炎药物(如布洛芬)缓解症状。
  6. 局部可对患处进行热敷。
  7. 希望以上信息对您有所帮助。祝您健康。

FIGURE 12: Example 3: a patient reported experiencing a sharp back pain during exercise, which intensified during breathing and rotation of the torso or neck. The patient was unsure whether urgent medical attention was necessary. ChatDoctor generated a closer answer to iCliniq than ChatGPT.

图 12: 示例3:一位患者报告在运动时出现背部锐痛,呼吸及躯干或颈部旋转时疼痛加剧。患者不确定是否需要紧急就医。ChatDoctor生成的回答比ChatGPT更接近iCliniq的答案。

Example 4: ChatGPT vs ChatDoctor.

示例4: ChatGPT与ChatDoctor对比

Q: Hello doctor, My vision was crystal clear this morning. I just woke up from a nap, and the vision in my left eye is blurry. It has not cleared up in the past half an hour. Twice over the years, this has happened, and it normalized over a few hours, but I am in a jungle now. So, I would like to err on the side of caution. I have had previous retinal detachments.

问: 医生您好,今天早上我的视力还非常清晰。刚从小睡中醒来,左眼视力变得模糊,过去半小时仍未恢复。过去几年这种情况发生过两次,几小时后自行恢复正常,但我现在身处丛林,所以想谨慎行事。我之前有过视网膜脱离病史。

iCliniq: Hello. Did you get any surgery done for retinal detachment earlier? It may be a recurrent episode of retinal detachment. If you have diabetes or hypertension, that may be a cause of this detachment. It can also happen in vitreous hemorrhage in which you will see a slightly reddish vision and will not clear up. If this is the case, then it will clear up in few weeks. What is the exact cause can be told only on examination. You should rush to the emergency asap. If detachment surgery is required then visual prognosis depends on the time elapsed between episode and surgery. So, you should not delay. There is not much you can do at the moment by yourself. Just seek help from someone to reach a hospital. For more information consult an eye care ophthalmologist online. Take care.

iCliniq:您好。您之前是否因视网膜脱落接受过手术?这可能是视网膜脱落的复发情况。若您患有糖尿病或高血压,可能是导致此次脱落的原因。玻璃体出血也可能引发此类症状,此时视野会呈现轻微红色且无法恢复清晰。若是这种情况,通常数周内会自行好转。具体病因需通过检查才能确诊。请立即前往急诊就医。如需进行视网膜脱落手术,视力预后取决于发病与手术间隔时间,因此切勿延误治疗。目前您能自行采取的措施有限,请尽快寻求他人协助前往医院。更多信息可在线咨询眼科医生。保重。

ChatGPT: Hello, I am an AI language model and I cannot provide medical diagnosis or advice. However, based on your description, a sudden change in vision with a history of retinal detachments raises concern for a possible new detachment or other serious eye condition. It is important that you seek medical attention as soon as possible from an eye specialist. It is better to have it evaluated and treated quickly rather than risking permanent vision loss.

ChatGPT:你好,我是一个AI语言模型,无法提供医疗诊断或建议。但根据你的描述,在有视网膜脱离病史的情况下突然出现视力变化,可能预示着新的视网膜脱离或其他严重眼部问题。建议你尽快就医检查,眼科专家的及时评估和治疗能有效降低永久性视力丧失的风险。

ChatDoctor: 1) You should see your ophthalmologist for a thorough examination and evaluation. 2) There are many causes of sudden loss of vision in one eye, including retina problems such as retinal detachment or bleeding. 3) Your history of prior retinal detachment makes me think that there is something wrong with the retina again and it needs prompt treatment. 4) This can be treated by laser surgery or if needed, a vitrectomy (surgical removal of the vitreous gel). 5) I recommend seeing your ophthalmologist soon for a proper evaluation.

ChatDoctor:

  1. 你应该去看眼科医生进行全面的检查和评估。
  2. 单眼突然失明的原因有很多,包括视网膜问题,如视网膜脱离或出血。
  3. 你有过视网膜脱离的病史,这让我认为可能是视网膜再次出现问题,需要及时治疗。
  4. 可以通过激光手术或必要时进行玻璃体切除术(手术移除玻璃体凝胶)来治疗。
  5. 我建议你尽快去看眼科医生,以便进行正确的评估。

FIGURE 13: Example 4: a patient experienced blurred vision and was particularly concerned about the health of their left eye. Taking into consideration the patient's past medical history of retinal detachment, all three sources—iCliniq, ChatGPT, and ChatDoctor—advised the individual to seek professional consultation with ophthalmologists for a comprehensive assessment and swift treatment. Due to possible limitations in providing medical diagnoses (and advice), ChatGPT did not speculate on the cause of the diminished vision. On the other hand, both iCliniq and ChatDoctor identified the possibility of retinal detachment or bleeding as potential issues.

图 13: 示例4:一位患者出现视力模糊症状,尤其担忧左眼健康状况。结合患者既往视网膜脱离 (retinal detachment) 病史,iCliniq、ChatGPT和ChatDoctor三方均建议其立即寻求眼科医生的专业诊疗以进行全面评估和及时治疗。由于医疗诊断(及建议)可能存在局限性,ChatGPT未对视力下降原因进行推测;而iCliniq与ChatDoctor均指出视网膜脱离或出血可能是潜在病因。

Discussion

讨论

The medical LLM, ChatDoctor, which has been fine-tuned on medical data, has extensive potential uses. These range from preliminary patient assessment and automated case adjudication to proactive healthcare measures. Nevertheless, owing to the complex nature of medical information [16], any concealed inaccuracies in diagnoses and health advice could lead to severe outcomes [17]. LLMs are known to occasionally generate fallacious and harmful assertions (hallucinations) about areas beyond their knowledge expertise, potentially causing medical malpractice [18]. To mitigate this, ChatDoctor has been trained using real-world patient-doctor interactions to better understand patients' questions and deliver more knowledgeable responses. To make the model most capable of answering questions about the latest medical terms (which may not be contained in the training dataset), and to introduce additional external references for verification, we also equipped the ChatDoctor model with the ability to autonomously retrieve information from external knowledge brains to provide answers, further enhancing the credibility of the model [19]. Such external knowledge retrieval can be called by inputting pre-configured prompts into the model. In future developments, the internal prior knowledge of the ChatDoctor model (gained through training) and the external knowledge brain can be further combined by training ChatDoctor to select a more trustworthy answer, or merge and fuse both answers or provide alternative opinions.

经过医疗数据微调的医疗大语言模型 ChatDoctor 具有广泛的应用潜力,涵盖从初步患者评估、自动化病例裁决到主动健康管理等多个领域。然而,由于医疗信息的复杂性 [16],诊断和健康建议中任何潜在错误都可能导致严重后果 [17]。已知大语言模型偶尔会对其专业领域外的问题生成错误且有害的论断(幻觉),这可能引发医疗事故 [18]。为缓解这一问题,ChatDoctor 通过真实医患对话数据进行训练,以更精准理解患者问题并给出更专业的答复。为使模型能够应对训练数据未覆盖的最新医学术语提问,并引入外部验证依据,我们还为 ChatDoctor 模型配备了自主检索外部知识库的能力,通过获取外部知识来提供答案,从而进一步提升模型的可信度 [19]。该外部知识检索功能可通过向模型输入预配置提示词来触发。在后续开发中,ChatDoctor 模型的内部先验知识(通过训练获得)与外部知识库可进一步结合:通过训练模型选择可信度更高的答案,或对两种答案进行合并融合,亦可提供替代性诊疗建议。
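The retrieval step described above, in which pre-configured prompts trigger a lookup in an external knowledge brain, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt templates, the keyword-counting retriever, and the names `KEYWORD_PROMPT`, `ANSWER_PROMPT`, `retrieve`, and `answer_with_retrieval` are all hypothetical, and `llm` stands for any callable that maps a prompt string to a completion string.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
KEYWORD_PROMPT = (
    "A question is provided below. Extract comma-separated keywords "
    "needed to look it up.\nQuestion: {question}\nKeywords:"
)
ANSWER_PROMPT = (
    "Use the reference text to answer the question.\n"
    "Reference: {reference}\nQuestion: {question}\nAnswer:"
)


def retrieve(keywords: List[str], knowledge_base: Dict[str, str], top_k: int = 1) -> List[str]:
    """Rank knowledge-base entries by how many keywords they mention (assumed retriever)."""
    scored = []
    for title, text in knowledge_base.items():
        haystack = (title + " " + text).lower()
        hits = sum(1 for kw in keywords if kw.lower() in haystack)
        if hits > 0:
            scored.append((hits, title, text))
    # Most keyword hits first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_k]]


def answer_with_retrieval(
    question: str,
    llm: Callable[[str], str],
    knowledge_base: Dict[str, str],
) -> str:
    """Two prompt calls: one to extract keywords, one to answer from the retrieved text."""
    raw_keywords = llm(KEYWORD_PROMPT.format(question=question))
    keywords = [kw.strip() for kw in raw_keywords.split(",") if kw.strip()]
    references = retrieve(keywords, knowledge_base)
    reference = " ".join(references) if references else "No reference found."
    return llm(ANSWER_PROMPT.format(reference=reference, question=question))
```

Keeping the model behind a plain callable means the same flow could be pointed at an online source or an offline database by swapping the `knowledge_base` mapping, which is one way to read the online/offline distinction the paper draws.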

Limitations

局限性

It is important to emphasize that the current ChatDoctor model is still in the investigation phase and has been developed for academic research only. The actual clinical use is subject to the risk of wrong answers being output by the model, and the use of exclusively LLMs in medical diagnosis is still plagued by false positives and false negatives for the time being. Additional security measures, including automated reference checking and human expert evaluation, are needed to cross-validate the answers provided by ChatDoctor to flag potentially inaccurate answers and prevent hallucinations. The exact design, development and deployment of such security measures remains an important topic for further research. A more secure application at this stage is the use of LLMs to assist physicians in their face-to-face consultations. Physicians and ChatDoctor work together to ensure not only that the technology is consistent with clinical practice, but also that patient safety is ensured. The evaluation and potential approval of such tools for healthcare-related purposes also needs further investigation.

需要强调的是,当前ChatDoctor模型仍处于研究阶段,仅用于学术研究。实际临床使用存在模型输出错误答案的风险,且目前单纯依赖大语言模型进行医学诊断仍存在假阳性和假阴性问题。需采用包括自动参考文献核查和人类专家评估在内的额外安全措施,对ChatDoctor提供的答案进行交叉验证,以标记潜在不准确答案并防止幻觉现象。这类安全措施的具体设计、开发和部署仍是未来研究的重要课题。现阶段更安全的应用方式是让大语言模型辅助医生进行面对面诊疗。医生与ChatDoctor协同工作,既能确保技术符合临床实践,又能保障患者安全。此类工具在医疗健康领域的评估与潜在认证仍需进一步研究。
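One possible shape for the automated reference checking mentioned above is sketched below. It is only an illustration, assuming a crude lexical-overlap test in place of whatever verification method a real system would use: each sentence of a generated answer is compared against trusted reference passages, and sentences with little word overlap are flagged for human expert review. The function names and the 0.5 threshold are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphabetic word set of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def support(sentence: str, references: List[str]) -> float:
    """Fraction of the sentence's words found in the best-matching reference."""
    words = tokens(sentence)
    if not words:
        return 1.0  # nothing to verify
    return max(
        (len(words & tokens(ref)) / len(words) for ref in references),
        default=0.0,  # no references at all: treat as unsupported
    )


def flag_for_review(answer: str, references: List[str], threshold: float = 0.5) -> List[str]:
    """Return the sentences whose lexical support falls below the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if support(s, references) < threshold]
```

Word overlap cannot detect a fluent but wrong claim that reuses the reference's vocabulary, so a check like this could at most prioritize answers for the human expert evaluation the paper calls for, not replace it.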

Conclusions

结论

With adequate training and online/offline supervision, ChatDoctor can potentially improve accuracy and efficiency in medical diagnosis and reduce the workload for medical professionals. It may also increase access to high-quality medical consultations, especially for patients in underserved regions with limited medical resources. The further developments and applications of ChatDoctor may eventually help to improve patient outcomes and advance medical research.

在充分的训练和线上线下监督下,ChatDoctor有望提升医疗诊断的准确性与效率,减轻医务工作者的负担。该技术还能增加优质医疗咨询的可及性,尤其对医疗资源匮乏地区的患者具有显著意义。ChatDoctor的持续开发与应用或将最终改善患者预后,推动医学研究发展。

Additional Information

附加信息

Disclosures

披露声明

Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue. Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: This work was supported by the National Institutes of Health (Grant No. R01 CA240808, R01 CA258987). Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.

人体实验声明:所有作者均确认本研究未涉及人类受试者或人体组织。
动物实验声明:所有作者均确认本研究未涉及动物实验或动物组织。
利益冲突声明:遵照国际医学期刊编辑委员会(ICMJE)统一披露要求,所有作者声明如下:
资金资助声明:本研究获得美国国立卫生研究院资助(项目编号R01 CA240808、R01 CA258987)。
财务关系声明:所有作者声明当前及过去三年内,不存在与本研究存在利益关系的财务往来。
其他关系声明:所有作者声明不存在可能影响本研究的其他关系或活动。

References

参考文献
