[论文翻译]ChatDoctor:基于大语言模型Meta-AI (LLaMA)和医学领域知识微调的医疗对话模型


原文地址:https://pmc.ncbi.nlm.nih.gov/articles/PMC10364849/pdf/cureus-0015-00000040895.pdf


ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge

ChatDoctor:基于大语言模型Meta-AI (LLaMA)和医学领域知识微调的医疗对话模型

© Copyright 2023

© 版权 2023

Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0., which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Li等人。本文是一篇开放获取文章,遵循知识共享署名许可协议CC-BY 4.0条款发布,允许在任何媒介上不受限制地使用、分发和复制,前提是注明原作者和来源。

Yunxiang Li 1 , Zihan Li 2 , Kai Zhang 3 , Ruilong Dan 4 , Steve Jiang 1 , You Zhang 1

李云翔 1, 李梓涵 2, 张凯 3, 丹瑞龙 4, Steve Jiang 1, 张游 1

  1. Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, USA 2. Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA 3. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA 4. College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, CHN

  1. 德克萨斯大学西南医学中心放射肿瘤科,达拉斯,美国
  2. 伊利诺伊大学厄巴纳-香槟分校计算机科学系,伊利诺伊州,美国
  3. 俄亥俄州立大学计算机科学与工程系,哥伦布市,美国
  4. 杭州电子科技大学计算机科学与技术学院,杭州,中国

Corresponding author: You Zhang, you.zhang@utsouthwestern.edu

通讯作者:You Zhang, you.zhang@utsouthwestern.edu

Abstract

摘要

Objective

目标

The primary aim of this research was to address the limitations observed in the medical knowledge of prevalent large language models (LLMs) such as ChatGPT, by creating a specialized language model with enhanced accuracy in medical advice.

本研究的主要目标是针对当前主流大语言模型(如ChatGPT)在医学知识方面的局限性,通过构建一个专业语言模型来提升医疗建议的准确性。

Methods

方法

We achieved this by adapting and refining the large language model meta-AI (LLaMA) using a large dataset of 100,000 patient-doctor dialogues sourced from a widely used online medical consultation platform. These conversations were cleaned and anonymized to respect privacy concerns. In addition to the model refinement, we incorporated a self-directed information retrieval mechanism, allowing the model to access and utilize real-time information from online sources like Wikipedia and data from curated offline medical databases.

我们通过使用从广泛使用的在线医疗咨询平台获取的10万条医患对话大型数据集,对大语言模型meta-AI (LLaMA)进行适配和优化,实现了这一目标。这些对话经过清洗和匿名化处理以保护隐私。除了模型优化外,我们还加入了自主信息检索机制,使模型能够访问并利用来自维基百科等在线资源的实时信息,以及精选离线医疗数据库中的数据。

Results

结果

The fine-tuning of the model with real-world patient-doctor interactions significantly improved the model's ability to understand patient needs and provide informed advice. By equipping the model with self-directed information retrieval from reliable online and offline sources, we observed substantial improvements in the accuracy of its responses.

通过真实医患互动数据对模型进行微调,显著提升了模型理解患者需求并提供专业建议的能力。通过赋予模型从可靠线上/线下资源自主检索信息的功能,我们观察到其回答准确率得到实质性提升。

Conclusion

结论

Our proposed ChatDoctor represents a significant advancement in medical LLMs, demonstrating a marked improvement in understanding patient inquiries and providing accurate advice. Given the high stakes and low error tolerance of the medical field, such enhancements in providing accurate and reliable information are not only beneficial but essential.

我们提出的ChatDoctor代表了医疗大语言模型(Large Language Model)领域的重大进步,显著提升了对患者问询的理解能力和提供准确建议的水平。鉴于医疗领域的高风险性和低容错率,这种在提供准确可靠信息方面的改进不仅有益,更是至关重要。

Categories: Family/General Practice, Medical Physics, Integrative/Complementary Medicine Keywords: ai chatbot, large language model, llama, chat gpt, gpt

分类:家庭/全科医学、医学物理学、整合/补充医学
关键词:AI聊天机器人 (AI chatbot)、大语言模型 (Large Language Model)、Llama、Chat GPT、GPT

Introduction

引言

The development of instruction-following large language models (LLMs), such as ChatGPT [1], has gained significant attention due to their remarkable success in instruction understanding and human-like response generation. These auto-regressive LLMs [2] are pre-trained on web-scale natural language by predicting the next token and then fine-tuned to follow large-scale human instructions. These models show robust performance on a wide range of natural language processing (NLP) tasks and can generalize to unseen tasks, demonstrating their potential as unified solutions to various problems in natural language understanding, text generation, and conversational artificial intelligence. However, the exploration of such general-domain LLMs in the medical domain remains relatively scarce [3], despite their great potential in revolutionizing medical communication and decision-making [4]. In general, these common-domain models were not trained to capture the medical-domain knowledge specifically or in detail, resulting in models that often provide incorrect medical responses.

指令跟随大语言模型(LLM)的发展(如ChatGPT[1])因其在指令理解和类人响应生成方面的显著成功而备受关注。这些自回归大语言模型[2]通过预测下一个token在网页级自然语言数据上进行预训练,随后通过大规模人类指令进行微调。这些模型在广泛自然语言处理(NLP)任务中展现出强大性能,并能泛化至未见任务,展示了其作为自然语言理解、文本生成和会话人工智能领域各类问题的统一解决方案的潜力。然而,尽管这类通用领域大语言模型在革新医疗沟通与决策方面具有巨大潜力[4],其在医疗领域的探索仍相对匮乏[3]。通常,这些通用领域模型并未经过专门或详细的医学领域知识训练,导致模型经常提供错误的医疗响应。

By fine-tuning large linguistic dialogue models on data from real-world patient-physician conversations, these models’ ability in understanding patients’ inquiries and needs can be significantly improved. In addition, to further enhance the models’ credibility, a knowledge brain based on online sources such as Wikipedia or offline sources like medical-domain databases can be incorporated into the models to retrieve real-time information to facilitate answering medical questions. The enhanced reliability of such answers is

通过在真实医患对话数据上微调大语言模型,可以显著提升模型理解患者询问和需求的能力。此外,为提高模型可信度,可为其集成基于在线资源(如Wikipedia)或离线医疗领域数据库的知识中枢,实时检索信息以辅助医学问答。这种增强机制能有效提升答案的可靠性

How to cite this article

如何引用本文

Li Y, Li Z, Zhang K, et al. (June 24, 2023) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 15(6): e40895. DOI 10.7759/cureus.40895

Li Y, Li Z, Zhang K 等 (2023年6月24日) ChatDoctor: 基于大语言模型Meta-AI (LLaMA)和医学领域知识微调的医疗对话模型。Cureus 15(6): e40895. DOI 10.7759/cureus.40895

vital for the medical field, as a wrong answer can be detrimental to patients’ treatments and well-being. In this study, we investigated the use of these two strategies: model fine-tuning and knowledge brain instillation, to enhance the capability of LLMs to serve as medical chatbots. Since the prevalent ChatGPT model is not open source, we used Meta’s public large language model meta-AI (LLaMA) model as the platform for development and evaluation. In detail, we first trained a generic conversation model based on LLaMA, using 52K instruction-following data from Stanford University’s Alpaca project [5]. We then fine-tuned the conversation model on our collected dataset of 100K patient-physician conversations from an online medical consultation website (www.healthcaremagic.com). Through extensive experiments, we found that the model fine-tuned on patient-physician dialogues outperforms ChatGPT in terms of precision, recall, and the F1 score [6]. In addition, the autonomous ChatDoctor model, which is able to retrieve the latest online/offline information, can also answer medical questions about relatively new diseases that are not included in the patient-physician training dialogues, for instance, the Monkeypox (Mpox) disease [7,8].

对医疗领域至关重要,因为错误答案可能损害患者的治疗和健康。本研究探索了两种策略——模型微调与知识库植入,用于增强大语言模型作为医疗聊天机器人的能力。由于当前流行的ChatGPT模型未开源,我们采用Meta公开的大语言模型meta-AI(LLaMA)作为开发和评估平台。具体而言,我们首先基于斯坦福大学Alpaca项目的52K指令跟随数据[5],在LLaMA上训练通用对话模型;随后通过在线医疗咨询网站(www.healthcaremagic.com)收集的10万条医患对话数据集进行微调。大量实验表明,经医患对话微调的模型在精确率、召回率和F1值[6]上均优于ChatGPT。此外,具备在线/离线信息检索能力的自主ChatDoctor模型,还能回答训练数据未涵盖的新发疾病(如猴痘(Mpox))相关问题[7,8]。

In summary, the ChatDoctor model has the following three main contributions:

ChatDoctor模型的主要贡献可归纳为以下三点:

  1. We established a methodology for fine-tuning LLMs for application in the medical field.
  1. 我们建立了一套针对医疗领域应用的大语言模型 (LLM) 微调方法。
  2. We compiled and publicly shared a comprehensive dataset of 100,000 patient-doctor interactions to serve as a training resource for refining the LLM. This dataset includes a wealth of terms, knowledge, and expertise essential for training LLMs in the medical domain. Additionally, we curated and openly shared another dataset consisting of 10,000 patient-doctor conversations from a separate source (www.icliniq.com) to serve as a testing resource for the model. To support and stimulate future advancements in the development of dialogue models in healthcare, we provide public access to all relevant resources such as source codes, datasets, and model weights. These can be found at github.com/Kent0n-Li/ChatDoctor.
  2. 我们整理并公开共享了一个包含10万条医患互动的综合数据集,作为优化大语言模型的训练资源。该数据集涵盖了训练医疗领域大语言模型所需的大量术语、知识和专业知识。此外,我们还从独立数据源(www.icliniq.com)筛选并公开了另一组1万条医患对话数据集,作为模型的测试资源。为支持和促进未来医疗对话模型的开发进展,我们公开了所有相关资源,包括源代码、数据集和模型权重。这些资源可通过github.com/Kent0n-Li/ChatDoctor获取。
  3. We proposed an autonomous ChatDoctor model that can retrieve online and offline medical domain knowledge to answer medical questions on up-to-date medical terms and diseases, which can potentially reduce the errors and hallucinations of LLMs [9-11].
  3. 我们提出了一种自主的ChatDoctor模型,能够检索线上和线下医疗领域知识,以回答关于最新医学术语和疾病的医疗问题,这有望减少大语言模型 [9-11] 的错误和幻觉现象。

This article was previously posted to the arXiv preprint server on March 24, 2023.

本文曾于2023年3月24日发布在arXiv预印本服务器上

Materials And Methods

材料与方法

Collection and preparation of patient-physician conversation dataset

医患对话数据集的收集与准备

The initial step in refining our model involved curating a dataset comprising patient-physician interactions. Often, patients describe their symptoms in casual and somewhat superficial language. If we attempted to generate these dialogues synthetically, similar to Alpaca [5], it could lead to over-specific descriptions with limited diversity and relevance to the real world. Hence, we chose to gather authentic patient-doctor conversations, collecting around 100k such interactions from the online medical consultation website, HealthCareMagic. The data were filtered both manually and automatically. Specifically, we automatically filtered out conversations that were too short, most of which did not answer anything of practical significance. And we manually filtered the content of the responses that had errors. To maintain privacy, we erased any information identifying the doctor or the patient and employed LanguageTool to rectify any grammatical errors. This dataset was labeled HealthCareMagic-100k, illustrated in Figure 1. We also sourced roughly 10k additional conversations from another independent online medical consultation site, iCliniq, to test our model's performance. The iCliniq dataset was chosen randomly in a stratified manner to guarantee representation across various medical specialties. It was also made certain that the selected data contained no identifiable patient information, in strict compliance with privacy and ethical standards.

优化模型的第一步是整理包含医患互动的数据集。患者通常会用随意且相对表面的语言描述症状。如果我们尝试像Alpaca [5]那样人工合成这些对话,可能导致描述过于具体,缺乏多样性且与现实世界关联有限。因此,我们选择收集真实的医患对话,从在线医疗咨询网站HealthCareMagic获取了约10万条此类互动数据。数据经过人工和自动双重筛选:自动过滤过短的对话(这类内容大多缺乏实际意义),人工剔除存在错误的回复内容。为保护隐私,我们删除了所有可识别医生或患者身份的信息,并使用LanguageTool修正语法错误。该数据集被标记为HealthCareMagic-100k,如图1所示。我们还从另一独立在线医疗咨询平台iCliniq获取了约1万条额外对话用于测试模型性能。iCliniq数据集采用分层随机抽样,确保覆盖不同医学专科领域,并严格遵循隐私与伦理标准,确保所选数据不含可识别患者信息。
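The automatic-filtering and anonymization steps described above can be sketched as follows. The length threshold, field names, and redaction patterns are illustrative assumptions for this sketch, not the authors' exact pipeline:

```python
import re

# Hypothetical threshold and redaction patterns (assumptions, not from the paper).
MIN_ANSWER_WORDS = 10
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),            # email addresses
    (re.compile(r"\b(?:Dr|Mr|Ms|Mrs)\.\s+\w+"), "[NAME]"),          # titled names
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),  # phone numbers
]

def anonymize(text: str) -> str:
    """Replace identifying strings with placeholder tokens."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def clean_dialogues(dialogues):
    """Drop conversations whose answers are too short, then anonymize both turns."""
    kept = []
    for d in dialogues:
        if len(d["answer"].split()) < MIN_ANSWER_WORDS:
            continue  # automatic filter: very short answers rarely say anything useful
        kept.append({"question": anonymize(d["question"]),
                     "answer": anonymize(d["answer"])})
    return kept
```

The manual review pass for erroneous responses would happen on the output of a filter like this.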


FIGURE 1: A summary of the process involved in gathering the patient-physician conversation dataset and the steps involved in training the ChatDoctor model.

图 1: 患者-医生对话数据集收集流程及ChatDoctor模型训练步骤概览

Creation of external knowledge database

创建外部知识库

LLMs typically predict the next word in a sequence, leading to potential inaccuracies or erroneous responses to questions (hallucinations) [12]. In addition, the model's output can be unpredictable to some extent, which is unacceptable in the medical field. However, the accuracy of these models could be significantly improved if they could generate or assess responses based on a reliable knowledge database, depicted in Figure 2. Consequently, we curated a database (sample shown in Figure 3) encompassing diseases, their symptoms, relevant medical tests/treatment procedures, and potential medications. This database serves as an external and offline knowledge brain for ChatDoctor. Continually updatable without requiring model retraining, this database can be tailored to specific diseases or medical specialties. We utilized MedlinePlus to construct this disease database, but other reliable sources can also be used. Additionally, online information sources like Wikipedia can supplement the knowledge base of our autonomous model. It is worth noting that Wikipedia may not be a fully reliable database, but our framework can be easily extended to more reliable online databases such as reputable academic journals.

大语言模型通常预测序列中的下一个词,这可能导致对问题的回答存在潜在不准确或错误(幻觉)[12]。此外,模型输出在一定程度上具有不可预测性,这在医疗领域是不可接受的。然而,如果这些模型能够基于可靠的知识数据库生成或评估回答(如图 2 所示),其准确性将显著提升。为此,我们构建了一个涵盖疾病、症状、相关医学检查/治疗流程及潜在药物的数据库(示例见图 3),作为 ChatDoctor 的外部离线知识大脑。该数据库无需重新训练模型即可持续更新,并能针对特定疾病或医学专科定制。我们使用 MedlinePlus 构建该疾病数据库,但其他可靠来源同样适用。此外,维基百科等在线信息源可补充我们自主模型的知识库。值得注意的是,维基百科并非完全可靠的数据库,但我们的框架可轻松扩展至更可靠的在线数据库(如权威学术期刊)。
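The database entries shown in Figure 3 each pair a disease with its symptoms, follow-up tests, and treatments. A minimal sketch of such a record; the class and field names are illustrative, not the authors' schema:

```python
from dataclasses import dataclass

# Hypothetical record layout mirroring the fields shown in Figure 3.
@dataclass
class DiseaseEntry:
    disease: str        # canonical disease name
    symptoms: str       # free-text symptom description
    further_tests: str  # recommended clinical tests
    treatment: str      # treatment procedures and medications

appendicitis = DiseaseEntry(
    disease="Appendicitis",
    symptoms="Pain in the abdomen, often on the right side; nausea; low fever.",
    further_tests="Abdominal and pelvic CT; abdominal ultrasound; blood test.",
    treatment="Appendectomy; antibiotics such as cefotaxime or meropenem.",
)
```

Because records are plain data rather than model weights, the database can be updated or swapped for a specialty-specific one without retraining, as the text notes.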


Q: What is the treatment for Mpox?

Q: 猴痘 (Mpox) 的治疗方法是什么?

FIGURE 2: Overview of the autonomous ChatDoctor model based on information retrieval from an external knowledge brain.

图 2: 基于外部知识大脑信息检索的自主ChatDoctor模型概览

Disease Database

疾病数据库

Disease: Appendicitis

疾病:阑尾炎

Symptoms: Pain in the abdomen, often on the right side. It is usually sudden and gets worse over time. Other symptoms may include: Swelling in the abdomen, Loss of appetite, Nausea and vomiting, Constipation or diarrhea, Inability to pass gas, Low fever

症状:腹部疼痛,常见于右侧。通常突然发作并随时间加重。其他症状可能包括:腹部肿胀、食欲不振、恶心呕吐、便秘或腹泻、无法排气、低烧

Further test: Abdominal and pelvic CT (Computed Tomography), Abdominal ultrasound, Blood test to check for signs of infection, Urine test to rule out a urinary tract infection

进一步检查:腹部和盆腔CT (Computed Tomography)、腹部超声、血液检查以确认感染迹象、尿液检查以排除尿路感染

Treatment: Appendectomy, cefotetan (Cefotan), cefotaxime (Claforan), piperacillin and tazobactam (Zosyn), ampicillin and sulbactam (Unasyn), ceftriaxone (Rocephin), cefepime (Maxipime), gentamicin (Garamycin), meropenem (Merrem), ertapenem (Invanz), metronidazole (Flagyl), clindamycin (Cleocin), levofloxacin (Levaquin). In the case of a ruptured appendix, doctors will prescribe an intravenous (IV) antibiotic to treat abdominal infection.

治疗:阑尾切除术,头孢替坦(Cefotan)、头孢噻肟(Claforan)、哌拉西林他唑巴坦(Zosyn)、氨苄西林舒巴坦(Unasyn)、头孢曲松(Rocephin)、头孢吡肟(Maxipime)、庆大霉素(Garamycin)、美罗培南(Merrem)、厄他培南(Invanz)、甲硝唑(Flagyl)、克林霉素(Cleocin)、左氧氟沙星(Levaquin)。若阑尾穿孔,医生会开具静脉注射抗生素以治疗腹腔感染。

Disease: Allergic rhinitis

疾病:过敏性鼻炎

Symptoms: Symptoms that occur shortly after you come into contact with the substance you are allergic to may include: Itchy nose, mouth, eyes, throat, skin, or any area, Problems with smell, Runny nose, Sneezing, Watery eyes. Symptoms that may develop later include: Stuffy nose (nasal congestion), Coughing, Clogged ears and decreased sense of smell, Sore throat, Dark circles under the eyes, Puffiness under the eyes, Fatigue and irritability, Headache.

症状:
接触过敏原后短期内可能出现的症状包括:鼻、口腔、眼睛、喉咙、皮肤或任何部位发痒,嗅觉异常,流鼻涕,打喷嚏,流泪。
后续可能发展的症状包括:鼻塞,咳嗽,耳闷和嗅觉减退,喉咙痛,黑眼圈,眼睑浮肿,疲劳易怒,头痛。

Further test: Allergy testing, Complete blood count (CBC) testing

进一步检查:过敏测试、全血细胞计数 (CBC) 检测

Treatment: Antihistamines, Antihistamine nasal sprays, Corticosteroids, Decongestants

治疗:抗组胺药、抗组胺鼻喷剂、皮质类固醇、减充血剂

Disease: Malignant otitis externa

疾病:恶性外耳道炎

Symptoms: Ongoing drainage from the ear that is yellow or green and smells bad. Ear pain deep inside the ear. Pain may get worse when you move your head. Hearing loss, Itching of the ear or ear canal, Fever, Trouble swallowing, Weakness in the muscles of the face.

症状:耳朵持续流出黄色或绿色且有异味的分泌物。耳部深处疼痛,头部活动时疼痛可能加剧。听力下降、耳部或耳道瘙痒、发热、吞咽困难、面部肌肉无力。

Further test: Look into the ear for signs of an outer ear infection. The head around and behind the ear may be tender to touch. A nervous system (neurological) exam may show that the cranial nerves are affected. If there is any drainage, the provider may send a sample of it to the lab. The lab will culture the sample to try to find the cause of the infection. To look for signs of a bone infection next to the ear canal, the following tests may be done: CT scan of the head, MRI scan of the head, Radionuclide scan.

进一步检查:检查外耳是否有感染迹象。耳周及耳后头部区域可能有触痛感。神经系统检查可能显示颅神经受累。若有分泌物排出,医生可能取样送检实验室。实验室将进行培养以确定感染病原体。为排查耳道邻近骨组织感染,可能需进行以下检查:头部CT扫描、头部MRI扫描、放射性核素扫描。

Treatment: The goal of treatment is to cure the infection. Treatment often lasts for several months, because it is difficult to treat the bacteria and reach an infection in bone tissue. You will need to take antibiotic medicines for a long period of time. The medicines may be given through a vein (intravenously), or by mouth. Antibiotics should be continued until scans or other tests show the inflammation has gone down. Dead or infected tissue may need to be removed from the ear canal. In some cases surgery may be needed to remove dead or damaged tissue in the skull.

治疗:
治疗目标是根除感染。由于细菌难以杀灭且骨组织感染难以触及,疗程通常持续数月。患者需长期服用抗生素,给药方式包括静脉注射或口服。抗生素应持续使用至影像学或其他检查显示炎症消退。耳道内的坏死或感染组织可能需要手术清除。部分病例需通过开颅手术切除颅骨内的坏死或受损组织。

FIGURE 3: Some samples in our offline disease database consist of symptoms, clinical test/treatment approaches, and medication suggestions.

图 3: 我们的离线疾病数据库中的部分样本包含症状、临床检测/治疗方案和用药建议。

Development of autonomous ChatDoctor with knowledge brain

开发具备知识大脑的自主ChatDoctor

Armed with the external knowledge brain, i.e., Wikipedia or our custom disease database, ChatDoctor can more accurately answer patient inquiries by retrieving reliable information. Upon establishing the external knowledge brain, we devised a mechanism to enable ChatDoctor to autonomously retrieve necessary information to answer queries. This was accomplished by constructing appropriate prompts to input into the ChatDoctor model. Specifically, we designed keyword mining prompts (Figure 4) as the initial step for ChatDoctor to extract key terms from patient queries for relevant knowledge search. Based on these keywords, top-ranked information was retrieved from the knowledge brain using a term-matching retrieval system [13]. Given the LLM's word limit (token size), we divided the texts to be read into equal sections and ranked each section by the number of keyword hits. The ChatDoctor model then reads the first N sections (five used in our study) sequentially, selecting and summarizing pertinent information via prompts (Figure 5). Ultimately, the model processes and compiles all the knowledge entries to generate a final response (Figure 6). This information retrieval approach ensures patients receive precise, well-informed responses backed by credible sources and can serve as a verification method for responses generated by ChatDoctor from prior knowledge.

借助外部知识库(即维基百科或我们自定义的疾病数据库),ChatDoctor能够通过检索可靠信息更准确地回答患者咨询。在建立外部知识库后,我们设计了一套机制使ChatDoctor能自主检索必要信息来回答问题。这是通过构建适当的提示词输入ChatDoctor模型实现的。具体而言,我们首先设计了关键词挖掘提示(图4),让ChatDoctor从患者查询中提取关键术语以进行相关知识搜索。基于这些关键词,系统使用术语匹配检索方法[13]从知识库中获取排名靠前的信息。考虑到大语言模型的字数限制(token大小),我们将待阅读文本分割为等长段落,并按关键词命中次数对每段进行排序。ChatDoctor模型随后依次读取前N段(本研究采用五段),通过提示词筛选并总结相关信息(图5)。最终,模型处理并整合所有知识条目生成最终回复(图6)。这种信息检索方法确保患者获得基于可信来源的精准答复,同时可作为验证ChatDoctor基于先验知识生成回复的依据。
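The retrieval steps described above (equal-length sectioning, keyword-hit ranking, reading the top-N sections) can be sketched in a few lines. The section size and scoring below are simplified assumptions; in the actual system, keyword extraction and summarization are themselves performed by the LLM via the prompts in Figures 4-6:

```python
def split_sections(text, words_per_section=100):
    """Divide retrieved text into roughly equal sections (by word count)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_section])
            for i in range(0, len(words), words_per_section)]

def rank_sections(sections, keywords):
    """Rank sections by the number of keyword hits (term-matching retrieval)."""
    def hits(section):
        lower = section.lower()
        return sum(lower.count(k.lower()) for k in keywords)
    return sorted(sections, key=hits, reverse=True)

def retrieve(knowledge_text, keywords, top_n=5):
    """Return the top-N sections the model would then read and summarize."""
    sections = rank_sections(split_sections(knowledge_text), keywords)
    return sections[:top_n]
```

With `top_n=5` this mirrors the five sections read in the study; the summarization and final-answer steps would each be one more LLM call over these sections.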

Prompt for extracting keywords

提取关键词的提示词

FIGURE 4: Autonomously extract keywords for information retrieval.

图 4: 自主提取关键词用于信息检索。

Prompt for autonomous knowledge retrieval

自主知识检索提示

FIGURE 5: Autonomous information retrieval from the disease database through the prompt.

图 5: 通过提示词从疾病数据库中自主检索信息

Prompt to make a final answer

生成最终答案的提示词

FIGURE 6: Instruct the ChatDoctor to read the retrieved domain knowledge and provide a reliable answer.

图 6: 指导 ChatDoctor 阅读检索到的领域知识并提供可靠答案。

Model training

模型训练

We developed the ChatDoctor model using Meta’s publicly accessible LLaMA-7B model [14], which uses a decoder-only Transformer architecture. Despite its relatively modest 7 billion parameters, the LLaMA model exhibits comparable performance to the much larger GPT-3 model (with 175 billion parameters) across several NLP benchmarks. This performance enhancement was achieved by diversifying the training data rather than increasing network parameters. Specifically, LLaMA was trained on 1.0 trillion tokens from publicly accessible data sources like Common Crawl and arXiv documents. We used conversations from HealthCareMagic-100k to fine-tune the LLaMA model [15] in line with the Stanford Alpaca [5] training methodology. The model was first fine-tuned with Alpaca’s data to acquire basic conversation skills, followed by further refinement on HealthCareMagic-100k using six A100 GPUs for three hours. The training process followed these hyperparameters: total batch size of 192, learning rate of $2\times10^{-5}$, 3 epochs, maximum sequence length of 512 tokens, and a warmup ratio of 0.03, with no weight decay.

我们基于Meta公司公开的LLaMA-7B模型[14]开发了ChatDoctor模型,该模型采用仅解码器结构的Transformer架构。尽管LLaMA模型仅有70亿参数,但在多项自然语言处理基准测试中表现出与参数量达1750亿的GPT-3模型相当的性能。这种性能提升是通过扩展训练数据多样性而非增加网络参数实现的——LLaMA使用来自Common Crawl、arXiv文档等公开数据源的1万亿token进行训练。我们采用斯坦福Alpaca[5]的训练方法,使用HealthCareMagic-100k的对话数据对LLaMA进行微调[15]。模型首先通过Alpaca数据获得基础对话能力,随后使用6块A100 GPU耗时3小时在HealthCareMagic-100k上进一步优化。训练超参数设置为:总批次大小192、学习率2×10^-5、3个训练周期、最大序列长度512 token、预热比例0.03,且未采用权重衰减。
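The reported hyperparameters can be collected into a configuration sketch. The per-device micro-batch and gradient-accumulation split below are assumptions added to illustrate how a total batch size of 192 might be reached on six GPUs, not the authors' actual launch settings:

```python
# Hyperparameters reported in the paper; the per-device split is an assumption.
NUM_GPUS = 6          # six A100 GPUs
PER_DEVICE_BATCH = 4  # assumed per-GPU micro-batch
GRAD_ACCUM_STEPS = 8  # assumed gradient-accumulation steps

config = {
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "max_seq_length": 512,   # tokens
    "warmup_ratio": 0.03,
    "weight_decay": 0.0,
    # 6 GPUs x 4 per device x 8 accumulation steps = 192
    "total_batch_size": NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS,
}
```

Any per-device/accumulation split whose product is 192 would satisfy the reported total batch size; the split chosen in practice depends on available GPU memory.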

Results

结果

To evaluate the proficiency of the autonomous ChatDoctor model, we tested it using a variety of contemporary medical queries. One of these included a question related to “Monkeypox” (abbreviated as Mpox), as illustrated in Figure 7. Mpox was designated as the preferred term by the World Health Organization (WHO) on November 28, 2022, making it a relatively novel term. While ChatGPT was incapable of providing a satisfactory response, ChatDoctor, due to its autonomous knowledge retrieval feature, was able to extract pertinent information about Monkeypox from Wikipedia and deliver a precise answer. Similarly, for more general medical inquiries such as “Otitis,” as shown in Figure 8, ChatDoctor was able to provide a reliable response following the retrieval of relevant knowledge. In another instance, a question about “Daybue,” a drug that received FDA approval in March 2023, was accurately addressed by our model after it autonomously retrieved relevant information, demonstrating an advantage over ChatGPT, as shown in Figure 9.

为了评估自主ChatDoctor模型的能力,我们使用各种当代医学问题对其进行了测试。其中一个问题涉及"猴痘"(Mpox),如图7所示。Mpox是世界卫生组织(WHO)于2022年11月28日指定的推荐术语,因而相对较新。虽然ChatGPT无法提供令人满意的回答,但ChatDoctor凭借其自主知识检索功能,能够从维基百科提取有关猴痘的相关信息并给出准确答案。同样地,如图8所示,对于"中耳炎"(Otitis)等较常见的医疗咨询,ChatDoctor在检索相关知识后也能提供可靠回答。另一个例子中,针对2023年3月获FDA批准的药物"Daybue"的提问,我们的模型通过自主检索相关信息后给出了准确解答(图9),展现了相较于ChatGPT的优势。

Example of Mpox

Mpox示例

Q: How to test for Mpox?

Q: 如何检测猴痘 (Mpox)?

FIGURE 7: Comparison between the ChatGPT and the autonomous ChatDoctor for relatively new medical diseases/terms. The ChatGPT cannot recognize the word Mpox (aka, Monkeypox), while our ChatDoctor can provide the precise answer for the relevant medical tests of Mpox, with the help of the external knowledge brain.

图 7: ChatGPT与自主ChatDoctor在较新医学疾病/术语上的对比。ChatGPT无法识别Mpox(即猴痘)一词,而我们的ChatDoctor借助外部知识库能准确回答关于Mpox相关医学检测的问题。

Example of Otitis

中耳炎示例

Q: How to treat Otitis?

Q:如何治疗中耳炎?

ChatGPT: The treatment of otitis depends on the underlying cause and severity of the condition. Treatment may involve antibiotics, pain relief, ear drops, warm compresses, rest, and prevention measures such as avoiding foreign objects in the ear, keeping ears clean and dry, and protecting ears from loud noises and sudden changes in air pressure. It is important to see a doctor for proper diagnosis and treatment.

ChatGPT:中耳炎的治疗取决于病因和病情严重程度。治疗可能包括抗生素、止痛药、滴耳液、热敷、休息等,以及预防措施如避免耳道异物、保持耳部清洁干燥、防护耳朵免受噪音和气压骤变影响。务必就医以获得准确诊断和规范治疗。

ChatDoctor: Treatment depends on the severity of the symptoms. If there is only mild discomfort, then pain relievers such as ibuprofen or acetaminophen can help. Antibiotics may be prescribed if the infection has spread beyond the outer ear. For more serious forms of Otitis, surgery may be needed to remove infected areas of the ear.

ChatDoctor: 治疗方案取决于症状的严重程度。若仅有轻微不适,可使用布洛芬或对乙酰氨基酚等止痛药。若感染扩散至外耳以外,则可能需要开具抗生素。对于更严重的中耳炎 (Otitis) 病例,可能需通过手术切除耳部感染区域。

FIGURE 8: Comparison between the ChatGPT and the autonomous ChatDoctor. The ChatGPT provided a more general answer about otitis, while the ChatDoctor provided a more specialized response about the treatments of otitis, with the help of the external knowledge brain.

图 8: ChatGPT与自主ChatDoctor的对比。ChatGPT提供了关于中耳炎(otitis)的更通用回答,而ChatDoctor借助外部知识库针对中耳炎治疗方案给出了更专业的响应。


Example of Daybue

Daybue示例

FIGURE 9: Comparison between the ChatGPT and the autonomous ChatDoctor. The ChatGPT is unfamiliar with the “Daybue” medication which received approval from the Food and Drug Administration (FDA) in early 2023. The ChatDoctor accurately pointed out the purpose of Daybue (trofinetide), with the help of the external knowledge brain.

图 9: ChatGPT与自主ChatDoctor的对比。ChatGPT对2023年初获得美国食品药品监督管理局(FDA)批准的"Daybue"药物不熟悉。ChatDoctor在外置知识库的帮助下准确指出了Daybue(曲芬肽)的用途。

For a quantitative evaluation of ChatDoctor's performance, we utilized questions from the independently sourced iCliniq database as inputs, with the corresponding responses from actual human physicians serving as the benchmark or “ground truth.” We compared these with responses generated by both ChatDoctor and ChatGPT. In this evaluation, we employed BERTScore [6] to compute Precision, Recall, and F1 scores for both ChatDoctor and ChatGPT. BERTScore leverages a pre-trained BERT model to match words in the candidate and reference sentences via cosine similarity, and it was chosen for its ability to evaluate the semantic similarity between our model's responses and the reference sentences, which we believe is of utmost importance in the medical context. This method of evaluation closely aligns with human judgment at both the sentence and system levels. In all three metrics, a higher value denotes a better match. As seen in the results illustrated in Table 1, the fine-tuned ChatDoctor model outperforms ChatGPT across all three metrics, with specific dialogue examples detailed in Figures 10-13.

为定量评估ChatDoctor的表现,我们采用独立来源的iCliniq数据库问题作为输入,并以真实医生的对应回答作为基准或"标准答案"。我们将这些回答与ChatDoctor和ChatGPT生成的回答进行对比。在此评估中,我们使用BERTScore [6]计算ChatDoctor和ChatGPT的精确率、召回率和F1分数。BERTScore利用预训练的BERT模型通过余弦相似度匹配候选句子和参考句子中的词汇,选择该方法是因为它能评估模型回答与参考句子之间的语义相似度——我们认为这在医疗场景中至关重要。该评估方法在句子级和系统级都与人类判断高度一致。三项指标中,数值越高表示匹配度越好。如表1所示结果,经过微调的ChatDoctor模型在所有指标上均优于ChatGPT,具体对话示例详见图10-13。
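At its core, BERTScore performs greedy cosine-similarity matching between candidate and reference token embeddings: precision averages, for each candidate token, its best match in the reference; recall does the reverse; F1 is their harmonic mean. A minimal sketch of that matching step with toy (non-BERT) embeddings; the real metric uses contextual BERT embeddings and IDF weighting:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_like(cand_emb, ref_emb):
    """Greedy matching: each token pairs with its most similar counterpart."""
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In practice one would call the `bert-score` package rather than reimplement this, but the sketch shows why the metric captures semantic rather than exact-string overlap.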

TABLE 1: Quantitative comparison with BERTScore between ChatDoctor and ChatGPT. The p-values in the table are derived from our paired t-test.

|           | ChatGPT       | ChatDoctor    | P-value        |
|-----------|---------------|---------------|----------------|
| Precision | 0.8370±0.0188 | 0.8444±0.0185 | 6.66 × 10^-195 |
| Recall    | 0.8445±0.0164 | 0.8451±0.0157 | 4.71 × 10^-4   |
| F1 Score  | 0.8406±0.0143 | 0.8446±0.0138 | 2.14 × 10^-111 |

表 1: ChatDoctor与ChatGPT基于BERTScore的量化对比。表中p值来自配对t检验。

|        | ChatGPT       | ChatDoctor    | P值            |
|--------|---------------|---------------|----------------|
| 精确率 | 0.8370±0.0188 | 0.8444±0.0185 | 6.66 × 10^-195 |
| 召回率 | 0.8445±0.0164 | 0.8451±0.0157 | 4.71 × 10^-4   |
| F1分数 | 0.8406±0.0143 | 0.8446±0.0138 | 2.14 × 10^-111 |
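The p-values above come from a paired t-test over per-example scores: the test statistic is the mean of the per-question score differences divided by its standard error, with n-1 degrees of freedom. A sketch of that statistic on hypothetical toy scores (not the paper's data):

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on per-example score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)       # sample standard deviation of differences
    return mean / (sd / math.sqrt(n))  # compare against t distribution, df = n - 1
```

The two-sided p-value then follows from the t distribution with n-1 degrees of freedom; in practice `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call.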

Example 1: ChatGPT vs ChatDoctor.

示例1: ChatGPT与ChatDoctor对比

Q: Hi doctor, I am getting one-sided headache. I do not get them often. They are behind my r