[Paper Translation] HuaTuo (华驼): Tuning LLaMA Model with Chinese Medical Knowledge


Original paper: https://arxiv.org/pdf/2304.06975


HuaTuo (华驼): Tuning LLaMA Model with Chinese Medical Knowledge


Haochun Wang∗, Chi Liu∗, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin and Ting Liu Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China {hcwang,cliu,nwxi,zwqiang,sdzhao,bqin,tliu}@ir.hit.edu.cn


Abstract


Large Language Models (LLMs), such as the LLaMA model, have demonstrated their effectiveness in various general-domain natural language processing (NLP) tasks. Nevertheless, LLMs have not yet performed optimally in biomedical domain tasks due to the need for medical expertise in the responses. In response to this challenge, we propose HuaTuo (华驼), a LLaMA-based model that has been supervised-fine-tuned with generated QA (Question-Answer) instances. The experimental results demonstrate that HuaTuo generates responses that possess more reliable medical knowledge. Our proposed HuaTuo model is accessible at https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese.


1 Introduction


The advent of instruction-following large language models (LLMs), represented by ChatGPT (OpenAI, 2022), has generated significant interest due to their exceptional performance in understanding instructions and generating human-like responses. Compared to smaller models, LLMs exhibit strong generalization across various natural language processing (NLP) tasks and a unique emergent ability to solve unseen or complicated tasks. Despite ChatGPT's non-open-source status, open-source communities have provided several alternatives, such as LLaMA (Touvron et al., 2023), with relatively affordable training costs. This positions LLMs as potential solutions for real-world scenarios requiring communication and reasoning.


However, despite their numerous merits, LLMs are not designed to cater specifically to the medical domain. Their general domain knowledge often falls short when addressing such specialized fields, where accurate and domain-specific expert knowledge is critical. This can lead to sub-optimal diagnostic precision, drug recommendations, and medical advice, potentially endangering patients. Few efforts have been made to address this problem, with existing approaches primarily focusing on supplying LLMs with medical information retrieved from conversations, where human errors may occur more frequently. Additionally, LLMs are typically trained in English, constraining their comprehension and response capabilities in languages that differ significantly from English, such as Chinese, rendering their direct application in Chinese contexts less than ideal.


In this paper, we present the HuaTuo (华驼) model, an LLM tailored for the biomedical domain, focusing on the Chinese language. By generating diverse instruction data based on medical knowledge from CMeKG, we emphasize ensuring the correctness of the facts in the model's responses, which is vital in the biomedical domain. Through this process, we collect over 8,000 instruction instances for supervised fine-tuning. Our model builds upon the open-source LLaMA-7B base model, integrates structured and unstructured medical knowledge from the Chinese medical knowledge graph (CMeKG), and employs knowledge-based instruction data for fine-tuning.


In summary, our contributions are as follows:


• We introduce the HuaTuo model, the first open-source Chinese biomedical LLM tuned with knowledge-based instruction data;
• We integrate structured and unstructured medical knowledge from CMeKG, ensuring our model has accurate and domain-specific knowledge;
• We propose SUS, a novel metric for evaluating LLMs in the biomedical domain, considering safety, usability, and smoothness.


2 Related Works


2.1 Large Language Models


Recent advancements in large language models (LLMs) have demonstrated their superiority over previous-generation paradigms, such as pretraining and fine-tuning. The significant increase in model scale has led to qualitative changes in LLMs, commonly referred to as emergent abilities. These include in-context learning for zero-shot tasks and chains of thought that enhance the model’s performance on complex tasks.


OpenAI's development of ChatGPT and GPT-4 has revolutionized the perception of LLMs. Although these models exhibit remarkable performance, OpenAI has not disclosed details regarding their training strategies or weight parameters. LLaMA serves as an open-source alternative to GPT, with sizes ranging from 7 billion to 65 billion parameters. Taori et al. trained Alpaca based on LLaMA with instruction tuning.


While comparable in performance to GPT-3.5, LLaMA performs poorly on Chinese tasks because its training data is primarily limited to English corpora. To address Chinese-specific applications, Du et al. and Zeng et al. introduced GLM, a 130 billion-parameter auto-regressive pre-trained model with multiple training objectives. ChatGLM further incorporates code training and aligns with human intentions through supervised fine-tuning, offering a tailored solution for Chinese contexts.


2.2 Pre-trained Models in Biomedical Domain


Although large language models (LLMs) exhibit remarkable performance in general domains, their lack of domain-specific knowledge results in suboptimal performance in fields that require specialized expertise, such as biomedicine. The biomedical field's inherent nature necessitates models to possess comprehensive knowledge bases for relevant queries, particularly when applied to real-world situations where patients seek health and medical advice. Several efforts have been made to adapt LLMs to the biomedical domain.


Existing approaches primarily employ ChatGPT for data assistance and train smaller models using its distilled or translated knowledge. ChatDoctor (Li et al., 2023) represents the first attempt to adapt LLMs to the biomedical field by fine-tuning LLaMA using conversation demonstrations synthesized via ChatGPT. DoctorGLM (Xiong et al.) leverages ChatGLM-6B as the base model and fine-tunes it with the Chinese translation of the ChatDoctor dataset, obtained through ChatGPT. Additionally, Chen et al. develop a Chinese and medically enhanced language model within their collection of LLMs. Collectively, these works illustrate the potential for LLMs to be successfully applied within the biomedical domain.

3 HuaTuo Model


In this section, we will introduce the training process of our HuaTuo (华驼) model.


3.1 Base Model


LLaMA (Touvron et al., 2023) is a collection of multi-lingual base models with parameters ranging from 7 billion to 65 billion, which are open-sourced to the research community. Here, we adopted the LLaMA-7B model for more accessible training.


3.2 Medical Knowledge


There are various kinds of medical knowledge, generally including (1) structured medical knowledge like medical knowledge graphs and (2) unstructured medical knowledge like medical guidelines. We utilized a Chinese medical knowledge graph, CMeKG (Odmaa et al., 2019), which also provides retrieved medical knowledge about diseases, drugs, symptoms, etc. Table 1 shows several knowledge cases in the CMeKG knowledge base.

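Before such structured entries can be used for instruction generation, they have to be turned into text. A minimal sketch, assuming a simple one-sentence-per-attribute flattening of a CMeKG-style entry (the helper name and field handling are illustrative, not from the paper):

```python
# Flatten a structured CMeKG-style entry (cf. Table 1) into a
# natural-language knowledge string for downstream instruction generation.
# The entry below is abridged from Table 1; the schema follows that table.

def flatten_entry(entry: dict) -> str:
    """Turn a {attribute: [values]} knowledge entry into one clause per attribute."""
    key = entry["中心词"]  # the head entity, e.g. a disease name
    parts = []
    for attr, values in entry.items():
        if attr in ("class", "中心词"):  # skip metadata fields
            continue
        parts.append(f"{key}的{attr}: {'、'.join(values)}")
    return ";".join(parts)

liver_cancer = {
    "class": "百种常见病",
    "中心词": "肝癌",
    "药物治疗": ["瑞格非尼", "索拉非尼"],
    "发病部位": ["肝脏"],
}

print(flatten_entry(liver_cancer))
# 肝癌的药物治疗: 瑞格非尼、索拉非尼;肝癌的发病部位: 肝脏
```

The flattened string can then be embedded in a generation prompt, as described in Section 3.3.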

3.3 Knowledge-based Instruction Data


Instruct-tuning has proven effective for tuning large language models (Wei et al., 2022; Ouyang et al., 2022), helping the models perform satisfactorily in zero-shot scenarios at the cost of sufficient annotated instructions. Inspired by the automatic construction of instructions along with their instances (inputs and outputs) (Wang et al., 2022; Taori et al., 2023), we generate our instruction data based on the above medical knowledge.


As demonstrated in Table 2, instruct-tuning involves supervised fine-tuning on training instances together with an instruction that describes the task in natural language. However, for a large language model for medical dialogue, inputs are mostly stated as questions, and the instructions are all variants of "Answer the following question". Therefore, we dispose of the instructions and preserve only the inputs for HuaTuo.

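The idea of dropping the instruction field and keeping only the input can be sketched as follows; the prompt template (`问:…\n答:`) and function name are assumptions for illustration, not the paper's actual template:

```python
# Sketch of building a supervised fine-tuning example without a separate
# instruction field: the medical question itself serves as the full prompt,
# in contrast to the Alpaca-style (instruction, input, output) triple.

def build_example(question: str, answer: str) -> dict:
    """Map a QA pair to a (prompt, completion) pair with no instruction field."""
    prompt = f"问:{question}\n答:"  # illustrative template, an assumption
    return {"prompt": prompt, "completion": answer}

ex = build_example("肝癌有哪些高危因素?", "高危因素包括慢性乙型肝炎感染、肝硬化等。")
print(ex["prompt"])
```

During training, the loss would then be computed only on the completion tokens that follow the prompt.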


Table 1: Knowledge cases in the CMeKG.

| Type | Knowledge (in Chinese) | Knowledge (translated to English) |
| Disease | {"class": "百种常见病", "中心词": "肝癌", "药物治疗": ["瑞格非尼", "对乙型或丙型肝炎有效的抗病毒药物", "索拉非尼"], "多发地区": ["撒哈拉以南的非洲"], "高危因素": ["肥胖", "HBVDNA过高", "慢性酗酒", "男性", "慢性乙型肝炎感染", "肝癌家族史", "慢性丙型肝炎肝硬化", "核心启动子突变", "肝硬化", "HCV重叠感染", "老年性心瓣膜病", "乙型肝炎e抗原", "糖尿病"], "发病部位": ["肝脏"], "辅助检查": ["肝功能检查"], "病史": ["长期慢性乙肝病史"]} | {"class": "Common Diseases", "Key Word": "Liver Cancer", "Drug Treatment": ["Regorafenib", "Antiviral drugs effective against hepatitis B or C", "Sorafenib"], "High Prevalence Regions": ["Sub-Saharan Africa"], "High Risk Factors": ["Obesity", "High HBV DNA levels", "Chronic alcoholism", "Male gender", "Chronic hepatitis B infection", "Family history of liver cancer", "Cirrhosis due to chronic hepatitis C", "Core promoter mutation", "Liver cirrhosis", "HCV co-infection", "Senile valvular heart disease", "Hepatitis B e antigen", "Diabetes"], "Auxiliary Examination": ["Liver function test"], "Medical History": ["Long-term chronic hepatitis B history"]} |
| Drug | {"class": "西药", "中心词": "二甲双胍", "性状": ["糖衣或薄膜衣片,除去包衣后显白色"], "英文名称": ["异福片", "格华止"], "分类": ["双胍类", "抗结核病药"], "规格": ["0.25g"], "OTC类型": ["乙类OTC", "甲类OTC"], "适应证": ["糖尿病", "肥胖"], "通用名": ["异福片"], "成份": ["利福平及异烟肼", "异烟肼", "异烟肼0.1克", "异烟肼150毫克", "本品为复方制剂", "利福平", "利福平300毫克", "利福平0.15克", "盐酸二甲双胍", "盐酸"]} | {"Class": "Western Medicine", "Key Word": "Metformin", "Appearance": ["Sugar-coated or film-coated tablets, white after removal of coating"], "English Names": ["Yifupian", "Gehuazhi"], "Classification": ["Biguanide class", "Anti-tuberculosis drug"], "Specifications": ["0.25g"], "OTC Types": ["OTC Class B", "OTC Class A"], "Indications": ["Diabetes", "Obesity"], "Generic Name": ["Yifupian"], "Ingredients": ["Isoniazid and pyrazinamide", "Pyrazinamide", "0.1g isoniazid", "150mg isoniazid", "This product is a compound preparation", "Isoniazid", "300mg isoniazid", "0.15g isoniazid", "Metformin hydrochloride", "Hydrochloride"]} |
| Symptom | {"中心词": "毛发脱落", "检查": ["毛发矿物质检查"], "相关疾病": ["斑秃", "慢性疲劳综合症"], "相关症状": ["毛发色淡而呈棕色", "毛发干燥易断", "皮肤变硬"], "所属科室": ["内科", "皮肤性病", "放疗、化疗科"], "发病部位": ["头部"]} | {"Key Word": "Hair Loss", "Examinations": ["Hair mineral analysis"], "Related Diseases": ["Alopecia areata", "Chronic Fatigue Syndrome"], "Related Symptoms": ["Hair color is light and brown", "Hair is dry and brittle", "Skin becomes hardened"], "Related Departments": ["Internal Medicine", "Dermatology and Venereology", "Radiation and Chemotherapy"], "Affected Area": ["Head"]} |

Table 2: Instance with an instruction.

Instruction: Translate the following sentence into Chinese.
Input:
What are the possible reasons for liver cancer?
Output: 肝癌可能的原因有什么?


While the generated instructions are required to be diverse enough for unseen tasks (Wang et al., 2022) in the general domain, the correctness of the facts in the responses from the large language model is of greater concern in the biomedical domain (Gilson et al., 2023). Thus, we first sample knowledge instances from the knowledge graph and then generate the instances based on the specific knowledge with the OpenAI API (OpenAI, 2022). Finally, we collect over 8,000 instruction instances, like the examples in Table 3, as training instances for supervised fine-tuning.
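The sampling-and-generation pipeline above can be sketched as follows, with a stand-in `complete` function in place of the actual OpenAI API call; the prompt wording, function names, and example knowledge strings are assumptions:

```python
import random

# Sketch of knowledge-based instruction-data generation: sample entries from
# the knowledge graph, then ask an LLM to write a QA instance grounded in
# each entry. `complete` stands in for the OpenAI API call, so the sketch
# stays runnable without network access.

def generate_instances(entries, complete, n=2, seed=0):
    """Sample up to n knowledge entries and generate one QA instance per entry."""
    rng = random.Random(seed)
    sampled = rng.sample(entries, k=min(n, len(entries)))
    instances = []
    for entry in sampled:
        prompt = f"根据以下医学知识生成一个问答对:{entry}"  # illustrative prompt
        instances.append({"knowledge": entry, "qa": complete(prompt)})
    return instances

# Usage with a stand-in completion function:
fake_complete = lambda p: "问:...答:..."
data = generate_instances(["肝癌的药物治疗: 索拉非尼", "二甲双胍的适应证: 糖尿病"], fake_complete)
print(len(data))  # 2
```

Keeping the sampled knowledge entry alongside the generated QA pair makes it possible to spot-check generated instances against the knowledge graph before fine-tuning.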

4 Experiment


4.1 Baselines


In order to demonstrate the superior performance of HuaTuo, we conducted a comparative analysis against three baseline models: LLaMA, Alpaca, and ChatGLM.


4.2 Metrics


For generation tasks in the general domain, evaluation metrics such as BLEU and ROUGE are utilized to determine whether a generative model can produce responses similar to the ground truth. For medical QA tasks, however, we introduce a new evaluation metric, SUS, with three dimensions: (1) Safety, (2) Usability, and (3) Smoothness. The Safety dimension assesses whether the generated response could mislead the user and endanger their health, for example through incorrect medication recommendations. The Usability dimension evaluates the extent to which the response reflects medical expertise, while the Smoothness dimension measures the proficiency of the generative model as a language model.

4.3 Results


In this study, we constructed a test set of potential questions in Chinese dialogue scenarios and compared the generated responses of our HuaTuo model with those of the three baseline models. To evaluate model performance, we recruited five annotators with medical backgrounds, who assessed the randomly mixed responses of the models on a three-point scale for each dimension of Safety, Usability, and Smoothness (SUS). The SUS scale ranges from 1 (not acceptable) to 3 (good), with 2 indicating an acceptable response. The average SUS scores are presented in Table 4. Although LLaMA achieves the highest safety score, its responses are often uninformative and merely rephrase the questions, resulting in a low usability score. Our HuaTuo model, on the other hand, markedly improves knowledge usability without significantly compromising safety.

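The score aggregation described above can be sketched as a simple average over annotations; the ratings below are invented for illustration, not taken from the actual annotation data:

```python
# Sketch of SUS aggregation: each annotation is a (safety, usability,
# smoothness) tuple on a 1-3 scale; per-model scores are the means over
# all annotations (annotators x questions). Ratings here are made up.

def mean_sus(ratings):
    """ratings: list of (safety, usability, smoothness) tuples, one per annotation."""
    n = len(ratings)
    totals = [sum(r[i] for r in ratings) / n for i in range(3)]
    return {"safety": totals[0], "usability": totals[1], "smoothness": totals[2]}

# Five hypothetical annotations of one model's responses:
scores = mean_sus([(3, 2, 2), (3, 2, 3), (2, 2, 2), (3, 3, 3), (3, 2, 2)])
print(scores)
```

Averaging over randomly mixed responses, as done in the study, keeps annotators blind to which model produced each answer.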

Ethics Statement


The HuaTuo initiative is primarily dedicated to research and is not intended to offer medical advice. The medical information utilized in this study is sourced from an open-access medical knowledge graph. It is important to note that the accuracy of responses generated by large language models cannot be guaranteed, and the medical knowledge utilized therein should not be construed as a substitute for professional medical advice. If one experiences any discomfort or distress, it is strongly advised to seek the guidance of a qualified medical professional.


Table 3: Example for the supervised-fine-tuning training instances.

| Instance (in Chinese) | Instance (translated to English) |
| Question: 患者XXX因头部及眼后部疼痛并能听到连续不断的隆隆声来到医院,医生对其进行了检查后诊断为偏头痛。请问,该病有什么治疗方法? | Patient XXX came to the hospital due to pain in the head and behind the eyes, accompanied by continuous rumbling sounds. After an examination, the doctor diagnosed the patient with a migraine. What are the treatment options for this condition? |
| Answer: 治疗偏头痛的方法包括药物治疗和非药物治疗。药物治疗主要包括三大类:急性发作期治疗、预防性治疗和混合治疗。非药物治疗包括改变生活方式,如减少熬夜、避免过度劳累、避免饮酒、戒烟等;此外还包括针灸、按摩、脉冲放电治疗等。在治疗中应该综合评估患者的病情、临床表现、个人情况等因素,制定个性化的治疗方案。 | The methods for treating migraines include medication and non-pharmacological therapies. Medication therapy primarily consists of three categories: acute attack treatment, preventative treatment, and mixed treatment. Non-pharmacological therapies include lifestyle changes such as reducing staying up late, avoiding overwork, avoiding alcohol consumption, and quitting smoking; in addition, they also include treatments such as acupuncture, massage, and pulse discharge therapy. During treatment, a comprehensive assessment of the patient's condition, clinical manifestations, personal circumstances, and other factors should be taken into account to develop a personalized treatment plan. |

Table 4: Experimental results of SUS score for the models.

| Model | Safety | Usability | Smoothness |
| LLaMA | 2.93 | 1.21 | 1.58 |
| Alpaca | 2.64 | 2.05 | 2.30 |
| ChatGLM | 2.59 | 1.93 | 2.41 |
| HuaTuo (华驼) | 2.88 | 2.12 | 2.47 |

References


Zhihong Chen, Junying Chen, Hongbo Zhang, Feng Jiang, Guiming Chen, Fei Yu, Tiannan Wang, Juhao Liang, Chen Zhang, Zhiyi Zhang, Jianquan Li, Xiang Wan, Haizhou Li, and Benyou Wang. 2023. LLM Zoo: Democratizing ChatGPT. https://github.com/FreedomIntelligence/LLMZoo.


Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.

Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash, et al. 2023. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education, 9(1):e45312.


Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, and You Zhang. 2023. ChatDoctor: A medical chat model fine-tuned on LLaMA model using medical domain knowledge.


BYAMBASUREN Odmaa, YANG Yunfei, SUI Zhifang, DAI Damai, CHANG Baobao, LI Sujian, and ZAN Hongying. 2019. Preliminary study on the construction of Chinese medical knowledge graph. Journal of Chinese Information Processing, 33(10):1–7.


OpenAI. 2022. ChatGPT. https://chat.openai.com.


Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.


Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.


Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.


Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.


Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.


Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Qian Wang, and Dinggang Shen. DoctorGLM: Fine-tuning your Chinese doctor is not a herculean task.


Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130b: An open bilingual pretrained model. In The Eleventh International Conference on Learning Representations (ICLR).

