HuaTuo (华驼): Tuning LLaMA Model with Chinese Medical Knowledge
Haochun Wang∗, Chi Liu∗, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin and Ting Liu Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China {hcwang,cliu,nwxi,zwqiang,sdzhao,bqin,tliu}@ir.hit.edu.cn
Abstract
Large Language Models (LLMs), such as the LLaMA model, have demonstrated their effectiveness in various general-domain natural language processing (NLP) tasks. Nevertheless, LLMs have not yet performed optimally in biomedical-domain tasks due to the need for medical expertise in the responses. In response to this challenge, we propose HuaTuo (华驼), a LLaMA-based model that has been supervised fine-tuned with generated QA (Question-Answer) instances. The experimental results demonstrate that HuaTuo generates responses that possess more reliable medical knowledge. Our proposed HuaTuo model is accessible at https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese.
1 Introduction
The advent of instruction-following large language models (LLMs), represented by ChatGPT (OpenAI, 2022), has generated significant interest due to their exceptional performance in understanding instructions and generating human-like responses. Compared to smaller models, LLMs exhibit strong generalization across various natural language processing (NLP) tasks and a unique emergent ability to solve unseen or complicated tasks. Despite ChatGPT's non-open-source status, open-source communities have provided several alternatives, such as LLaMA (Touvron et al., 2023), with relatively affordable training costs. This positions LLMs as potential solutions for real-world scenarios requiring communication and reasoning.
However, despite their numerous merits, LLMs are not designed to cater specifically to the medical domain. Their general domain knowledge often falls short when addressing such specialized fields, where accurate and domain-specific expert knowledge is critical. This can lead to sub-optimal diagnostic precision, drug recommendations, and medical advice, potentially endangering patients. Few efforts have been made to address this problem, with existing approaches primarily focusing on supplying LLMs with medical information retrieved from conversations, where human errors may occur more frequently. Additionally, LLMs are typically trained in English, constraining their comprehension and response capabilities in languages that differ significantly from English, such as Chinese, rendering their direct application in Chinese contexts less than ideal.
In this paper, we present the HuaTuo (华驼) model, an LLM tailored for the biomedical domain, focusing on the Chinese language. By generating diverse instruction data based on medical knowledge from CMeKG, we emphasize the correctness of facts in the model's responses, which is vital in the biomedical domain. Through this process, we collect over 8,000 pieces of instruction data for supervised fine-tuning. Our model builds upon the open-source LLaMA-7B base model, integrates structured and unstructured medical knowledge from the Chinese medical knowledge graph (CMeKG), and employs knowledge-based instruction data for fine-tuning.
Our contributions can be summarized as follows:
• We introduce the HuaTuo model, the first open-source Chinese biomedical LLM tuned with knowledge-based instruction data;
• We integrate structured and unstructured medical knowledge from CMeKG, ensuring our model has accurate and domain-specific knowledge;
• We propose SUS, a novel metric for evaluating LLMs in the biomedical domain, considering safety, usability and smoothness.
2 Related Works
2.1 Large Language Models
Recent advancements in large language models (LLMs) have demonstrated their superiority over the previous pretrain-and-fine-tune paradigm. The significant increase in model scale has led to qualitative changes in LLMs, commonly referred to as emergent abilities. These include in-context learning for zero-shot tasks and chains of thought that enhance the model's performance on complex tasks.
OpenAI's development of ChatGPT and GPT-4 has revolutionized the perception of LLMs. Although these models exhibit remarkable performance, OpenAI has not disclosed details regarding their training strategies or weight parameters. LLaMA serves as an open-source alternative to GPT, with sizes ranging from 7 billion to 65 billion parameters. Taori et al. trained Alpaca based on LLaMA with instruction tuning.
While comparable in performance to GPT-3.5, LLaMA performs poorly on Chinese tasks because its training data are primarily limited to English corpora. To address Chinese-specific applications, Du et al. and Zeng et al. introduced GLM, a 130 billion-parameter auto-regressive pre-trained model with multiple training objectives. ChatGLM further incorporates code training and aligns with human intentions through supervised fine-tuning, offering a tailored solution for Chinese contexts.
2.2 Pre-trained Models in Biomedical Domain
Although large language models (LLMs) exhibit remarkable performance in general domains, their lack of domain-specific knowledge results in suboptimal performance in fields that require specialized expertise, such as biomedicine. The biomedical field's inherent nature necessitates models to possess comprehensive knowledge bases for relevant queries, particularly when applied to real-world situations where patients seek health and medical advice. Several efforts have been made to adapt LLMs to the biomedical domain.
Existing approaches primarily employ ChatGPT for data assistance and train smaller models using its distilled or translated knowledge. ChatDoctor (Li et al., 2023) represents the first attempt to adapt LLMs to the biomedical field by fine-tuning LLaMA using conversation demonstrations synthesized via ChatGPT. DoctorGLM (Xiong et al.) leverages ChatGLM-6B as the base model and fine-tunes it with a Chinese translation of the ChatDoctor dataset, obtained through ChatGPT. Additionally, Chen et al. develop a Chinese and medically enhanced language model within their collection of LLMs. Collectively, these works illustrate the potential for LLMs to be successfully applied within the biomedical domain.
3 HuaTuo Model
In this section, we will introduce the training process of our HuaTuo (华驼) model.
3.1 Base Model
LLaMA (Touvron et al., 2023) is a collection of multi-lingual base models with parameters ranging from 7 billion to 65 billion, which have been open-sourced to the research community. Here, we adopted the LLaMA-7B model for more accessible training.
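As a minimal sketch, loading the 7B base model for supervised fine-tuning might look as follows. This assumes the Hugging Face transformers port of LLaMA; the checkpoint identifier is a placeholder, and the paper does not specify its training framework.

```python
# Minimal sketch: load LLaMA-7B for supervised fine-tuning. Assumes the
# Hugging Face `transformers` port of LLaMA; the checkpoint id below is
# a placeholder, not the authors' actual path.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder checkpoint id

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,  # half precision keeps the 7B weights tractable
)
model.train()  # ready for supervised fine-tuning on instruction data
```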
3.2 Medical Knowledge
There are various kinds of medical knowledge, generally including (1) structured medical knowledge like medical knowledge graphs and (2) unstructured medical knowledge like medical guidelines. We utilized CMeKG (Odmaa et al., 2019), a Chinese medical knowledge graph from which medical knowledge about diseases, drugs, symptoms, etc. can be retrieved. Table 1 shows several knowledge cases from the CMeKG knowledge base.
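To make the shape of this knowledge concrete, the sketch below holds a CMeKG-style entry (mirroring the Symptom case in Table 1) as a plain dict and samples attribute fields from it to seed instruction generation; the sampling helper is illustrative, not part of CMeKG.

```python
# A CMeKG-style entry (cf. the Symptom row of Table 1) held as a dict,
# plus a toy sampler for picking fields to seed instruction generation.
# The helper is illustrative, not part of CMeKG itself.
import random

symptom_entry = {
    "中心词": "毛发脱落",                  # key word: hair loss
    "检查": ["毛发矿物质检查"],            # examinations
    "相关疾病": ["斑秃", "慢性疲劳综合症"],  # related diseases
    "发病部位": ["头部"],                  # affected area: head
}

def sample_knowledge(entry: dict, k: int = 2) -> dict:
    """Keep the key word and k randomly chosen attribute fields."""
    attrs = [f for f in entry if f != "中心词"]
    chosen = random.sample(attrs, min(k, len(attrs)))
    return {"中心词": entry["中心词"], **{f: entry[f] for f in chosen}}

print(sample_knowledge(symptom_entry))
```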
3.3 Knowledge-based Instruction Data
Instruct-tuning has proven effective for tuning large language models (Wei et al., 2022; Ouyang et al., 2022): it helps models perform satisfactorily under zero-shot scenarios at the cost of sufficient annotated instructions. Inspired by the automatic construction of instructions along with their instances (inputs and outputs) (Wang et al., 2022; Taori et al., 2023), we generate our instruction data based on the above medical knowledge.
As demonstrated in Table 2, instruct-tuning involves supervised fine-tuning on training instances together with an instruction that describes the task in natural language. However, for a large language model for medical dialogue, inputs are mostly stated as questions and the instructions all amount to "Answer the following question". Therefore, we discard the instructions and preserve only the inputs for HuaTuo.
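A minimal sketch of what a training instance might look like once the instruction field is dropped, leaving only the question (input) and reference answer (output); the 问/答 prompt template below is an assumption, as the paper does not specify one.

```python
# Hypothetical formatting of a HuaTuo training instance after the
# instruction field is discarded: only the question and the reference
# answer remain. The "问/答" template is an assumption for illustration.
from typing import Optional

def build_prompt(question: str, answer: Optional[str] = None) -> str:
    """Format a QA pair; at inference time, pass answer=None."""
    prompt = f"问:{question}\n答:"
    return prompt + answer if answer is not None else prompt

# Example drawn from Table 2 / Table 1 content (answer abbreviated).
train_example = build_prompt(
    "肝癌可能的原因有什么?",
    "肝癌的高危因素包括肥胖、慢性酗酒、慢性乙型肝炎感染等。",
)
```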
Table 1: Knowledge cases in the CMeKG.
| Type | Knowledge in Chinese | Knowledge translated to English |
| --- | --- | --- |
| Disease | {"class": "百种常见病", "中心词": "肝癌", "药物治疗": ["瑞格非尼", "对乙型或丙型肝炎有效的抗病毒药物", "索拉非尼"], "多发地区": ["撒哈拉以南的非洲"], "高危因素": ["肥胖", "HBV DNA过高", "慢性酗酒", "男性", "慢性乙型肝炎感染", "肝癌家族史", "慢性丙型肝炎肝硬化", "核心启动子突变", "肝硬化", "HCV重叠感染", "老年性心瓣膜病", "乙型肝炎e抗原", "糖尿病"], "发病部位": ["肝脏"], "辅助检查": ["肝功能检查"], "病史": ["长期慢性乙肝病史"]} | {"class": "Common Diseases", "Key Word": "Liver Cancer", "Drug Treatment": ["Regorafenib", "Antiviral drugs effective against hepatitis B or C", "Sorafenib"], "High Prevalence Regions": ["Sub-Saharan Africa"], "High Risk Factors": ["Obesity", "High HBV DNA levels", "Chronic alcoholism", "Male gender", "Chronic hepatitis B infection", "Family history of liver cancer", "Cirrhosis due to chronic hepatitis C", "Core promoter mutation", "Liver cirrhosis", "HCV co-infection", "Senile valvular heart disease", "Hepatitis B e antigen", "Diabetes"], "Affected Area": ["Liver"], "Auxiliary Examination": ["Liver function test"], "Medical History": ["Long-term chronic hepatitis B history"]} |
| Drug | {"class": "西药", "中心词": "二甲双胍", "性状": ["糖衣或薄膜衣片,除去包衣后显白色"], "英文名称": ["异福片", "格华止"], "分类": ["双胍类", "抗结核病药"], "规格": ["0.25g"], "OTC类型": ["乙类OTC", "甲类OTC"], "适应证": ["糖尿病", "肥胖"], "通用名": ["异福片"], "成份": ["利福平及异烟肼", "异烟肼", "异烟肼0.1克", "异烟肼150毫克", "本品为复方制剂", "利福平", "利福平300毫克", "利福平0.15克", "盐酸二甲双胍", "盐酸"]} | {"Class": "Western Medicine", "Key Word": "Metformin", "Appearance": ["Sugar-coated or film-coated tablets, white after removal of coating"], "English Names": ["Yifupian", "Gehuazhi"], "Classification": ["Biguanide class", "Anti-tuberculosis drug"], "Specifications": ["0.25g"], "OTC Types": ["OTC Class B", "OTC Class A"], "Indications": ["Diabetes", "Obesity"], "Generic Name": ["Yifupian"], "Ingredients": ["Rifampicin and isoniazid", "Isoniazid", "0.1g isoniazid", "150mg isoniazid", "This product is a compound preparation", "Rifampicin", "300mg rifampicin", "0.15g rifampicin", "Metformin hydrochloride", "Hydrochloride"]} |
| Symptom | {"中心词": "毛发脱落", "检查": ["毛发矿物质检查"], "相关疾病": ["斑秃", "慢性疲劳综合症"], "相关症状": ["毛发色淡而呈棕色", "毛发干燥易断", "皮肤变硬"], "所属科室": ["内科", "皮肤性病", "放疗、化疗科"], "发病部位": ["头部"]} | {"Key Word": "Hair Loss", "Examinations": ["Hair mineral analysis"], "Related Diseases": ["Alopecia areata", "Chronic Fatigue Syndrome"], "Related Symptoms": ["Hair color is light and brown", "Hair is dry and brittle", "Skin becomes hardened"], "Related Departments": ["Internal Medicine", "Dermatology and Venereology", "Radiation and Chemotherapy"], "Affected Area": ["Head"]} |
Table 2: Instance with an instruction.
| Instruction: Translate the following sentence into Chinese. |
| Input: What are the possible reasons for liver cancer? |
| Output: 肝癌可能的原因有什么? |
While the generated instructions are required to be diverse enough for unseen tasks in the general domain (Wang et al., 2022), the correctness of the facts in the responses from the large language model is of greater concern in the biomedical domain (Gilson et al., 2023). Thus, we first sample knowledge instances from the knowledge graph and then generate the instances based on the specific knowledge with the OpenAI API (OpenAI, 2022). Finally, we collect over 8,000 pieces of instruction data, like the examples in Table 3, as training instances for supervised fine-tuning.
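The generation step can be pictured with a short sketch. The prompt wording, model name, and JSON output format below are assumptions for illustration; the paper only states that instances are generated from sampled knowledge via the OpenAI API.

```python
# Hedged sketch of generating a QA training instance from a sampled
# CMeKG knowledge entry via the OpenAI API (0.x-style client assumed;
# prompt wording and model choice are illustrative assumptions).
import json
import openai  # assumes openai.api_key has been set

def knowledge_to_instance(knowledge: dict) -> str:
    prompt = (
        "根据下面的医学知识,生成一个患者可能提出的问题和一段事实正确的回答,"
        "以 JSON 输出,包含 question 和 answer 两个字段。\n"
        f"医学知识:{json.dumps(knowledge, ensure_ascii=False)}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # The raw JSON string is post-processed into (input, output) pairs.
    return resp["choices"][0]["message"]["content"]
```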
4 Experiment
4.1 Baselines
In order to demonstrate the superior performance of HuaTuo, we conducted a comparative analysis with four baseline models.
4.2 Metrics
For generation tasks in the general domain, evaluation metrics such as BLEU and ROUGE are utilized to determine whether a generative model can produce responses similar to the ground truth. However, such similarity-based metrics do not capture what matters in medical QA tasks, so we evaluate responses with our proposed SUS metric along three dimensions: (1) safety, (2) usability, and (3) smoothness. Safety determines whether the response includes anything that can mislead the user into danger, such as wrong medicine recommendations. Usability reflects the medical expertise of a specific response. And smoothness represents the basic fluency of the response, i.e., the ability as a language model.
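For concreteness, SUS scores can be aggregated as simple per-dimension means over annotator ratings; the rating scale and averaging scheme in this sketch are assumptions, since the text above only names the three dimensions.

```python
# Illustrative aggregation of SUS annotations: per-dimension means over
# human ratings. The 1-3 scale and simple averaging are assumptions,
# not taken from the paper.
from statistics import mean

ratings = [  # one dict per annotator for a single model response
    {"safety": 3, "usability": 2, "smoothness": 3},
    {"safety": 2, "usability": 2, "smoothness": 3},
]

sus = {dim: mean(r[dim] for r in ratings)
       for dim in ("safety", "usability", "smoothness")}
print(sus)  # {'safety': 2.5, 'usability': 2.0, 'smoothness': 3.0}
```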