MedChatZH: a Better Medical Adviser Learns from Better Instructions
Yang Tan, Mingchen Li, Zijie Huang, Huiqun Yu and Guisheng Fan
Department of Computer Science and Technology, East China University of Science and Technology, China {tyang,lmc,hzj}@mail.ecust.edu.cn {yhq,gsfan}@ecust.edu.cn
Abstract
Generative large language models (LLMs) have shown great success in various applications, including question-answering (QA) and dialogue systems. However, in specialized domains like traditional Chinese medical QA, these models may perform unsatisfactorily without fine-tuning on domain-specific datasets. To address this, we introduce MedChatZH, a dialogue model designed specifically for traditional Chinese medical QA. Our model is pre-trained on traditional Chinese medical books and fine-tuned with a carefully curated medical instruction dataset. It outperforms several solid baselines on a real-world medical dialogue dataset. We release our model, code, and dataset on https://github.com/tyang816/MedChatZH to facilitate further research in the domain of traditional Chinese medicine and LLMs.
1 Introduction
The ChatGPT series has achieved remarkable success in both academic and industrial circles, serving as a catalyst for numerous subsequent studies. Through a combination of instruction tuning and human feedback, these models have consistently demonstrated state-of-the-art performance across a wide range of Natural Language Processing (NLP) tasks. However, it is worth noting that these models are not openly available and do not divulge many specifics about their training process.
In recent years, several alternative foundational models have emerged in response to this limitation. For instance, LLaMa (Touvron et al., 2023), BLOOM (Scao et al., 2022), and GLM (Du et al., 2021) are notable examples. These models have been trained on extensive collections of general raw texts derived from real-world sources, thereby introducing a new paradigm for comprehending fundamental knowledge within human society. By leveraging such diverse and expansive training data, these models offer unique insights and capabilities in understanding and processing natural language.
Given the limited availability of high-quality corpora, most large language models (LLMs) are primarily tailored to English-speaking users, and their performance deteriorates significantly when deployed in scenarios involving other languages. Furthermore, general-purpose LLMs cannot be uniformly strong across specialized domains (Zhang et al., 2023). An illustrative example is the commercial deployment of ChatGPT, which imposes certain restrictions on answers in the medical field. Consequently, a considerable gap arises: medical QA resources remain scarce and limited in scope. This disconnect presents a challenge for harnessing the full potential of LLMs in the medical domain.
Our main contributions can be summarized as follows:
• We enhanced the Chinese-specific language model by training it on an extensive collection of traditional Chinese medicine (TCM) books. As a result, the model is capable of providing answers that combine knowledge from both traditional Chinese and Western medicine.
• We curated a new dataset of medical dialogue instructions through a sophisticated pipeline that meticulously removed any irrelevant or sensitive data, such as private information and colloquial responses.
• We demonstrated state-of-the-art performance on a real-world medical QA benchmark, outperforming other baseline models across several evaluation metrics. Furthermore, we have made our dataset and model open-source for the benefit of the research community.
2 Related Work
2.1 Training General Language Models
Training general language models consumes trillions of tokens and costly computational resources, as the models learn the structure, syntax, and semantics of human language through unsupervised methods. This stage allows the model to learn general language patterns and representations.
The Transformer (Vaswani et al., 2017) revolutionized natural language processing with its introduction of attention mechanisms, inspiring subsequent encoder-only architectures such as BERT (Devlin et al., 2018), which leverages masked language modeling, as well as causal models such as the GPT series (Radford et al., 2018, 2019; Brown et al., 2020), which utilize a next-token prediction strategy. However, since OpenAI released ChatGPT and GPT-4, causal language models have shown even greater potential in modeling the real world, yet their weights and training details are not open to the public.
As alternatives, both LLaMa (Touvron et al., 2023) and BLOOM (Scao et al., 2022) have released model weights with more than 10 billion parameters for research purposes, but they focus on English applications and were trained on massive English corpora. Recognizing the need to bridge the language gap in Chinese applications, ChatGLM (Du et al., 2021; Zeng et al., 2022) employs an auto-regressive GLM with multiple training objectives and a bilingual corpus, achieving superior performance on Chinese-specific tasks. To address Chinese language requirements, TigerBot 1 and Baichuan 2 have been developed based on the BLOOM and LLaMa architectures, respectively. These models are commercially available and cater to Chinese language processing needs.
2.2 Medical Language Models
While general-purpose Language Models (LMs) have demonstrated remarkable capabilities in various scenarios, it is often necessary to fine-tune them on specific, smaller datasets that are tailored to the target task or domain. This fine-tuning process helps the models to better understand and adapt to the specific requirements of downstream tasks.
In comparison to general-purpose models, specialized models for specific verticals are relatively scarce. For instance, BenTsao (Wang et al., 2023) constructed a Chinese medical instruction dataset by leveraging a medical knowledge graph and the GPT-3.5 API, and fine-tuned LLaMA on these instructions to enhance its query and answer effectiveness in the medical field. HuatuoGPT (Zhang et al., 2023) is a large language model trained on an extensive Chinese medical corpus, with the goal of constructing a more proficient 'ChatGPT' for medical consultation scenarios.
Additionally, Google’s Med-PaLM (Singhal et al., 2022) harnesses the power of Google’s large language models. These models have been aligned with the medical domain and evaluated using medical exams, medical research, and consumer queries in the English language. This alignment and evaluation process ensures that the model is well-suited for handling medical-related tasks and inquiries.
By developing and fine-tuning these specialized models, we aim to provide more accurate and reliable language processing solutions in domains such as healthcare and medicine. These models bridge the gap between general-purpose LMs and specific vertical applications, enabling more effective and targeted language understanding and generation in specialized fields.
3 MedChatZH
In this section, we introduce the data processing pipeline and training details of MedChatZH.
3.1 Data Collection
Our training dataset consists of two main components: TCM books and raw instructions.
For the medical books, we have gathered a comprehensive collection of over 1,000 books, including renowned works such as the Yellow Emperor's Canon of Internal Medicine and the Treatise on Febrile Diseases, as well as valuable folk doctor notes. We primarily focus on extracting relevant text from these books and perform only minimal cleaning on this dataset.
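The exact corpus-preparation scripts are not part of this paper; the following is a minimal sketch, assuming the TCM books are stored as plain-text UTF-8 files under a hypothetical `tcm_books/` directory, of how raw book text could be collected with only light cleaning (whitespace normalization and exact de-duplication of paragraphs).

```python
from pathlib import Path
import re

def load_tcm_corpus(book_dir: str = "tcm_books") -> list[str]:
    """Collect paragraphs from plain-text TCM books with minimal cleaning.

    Hypothetical layout: one UTF-8 .txt file per book.
    """
    paragraphs, seen = [], set()
    for path in sorted(Path(book_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for para in text.splitlines():
            para = re.sub(r"\s+", " ", para).strip()  # normalize whitespace
            if len(para) < 10 or para in seen:        # drop stubs and exact duplicates
                continue
            seen.add(para)
            paragraphs.append(para)
    return paragraphs
```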
In contrast, for the instructions component, we have created a mixture of general and medical Chinese data known as med-mix-2M. This dataset combines both general and medical Chinese instructions, providing a diverse range of language patterns and medical contexts. The med-mix-2M dataset serves as a valuable resource for training models with a broad understanding of both general language usage and medical terminology.
Table 1: Results on the webMedQA benchmark.
Model | Parameter | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | GLEU | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---|---|---|---|---|---|---
GPT-3.5-turbo* | - | 18.06 | 6.74 | 2.73 | 1.09 | 4.71 | 20.01 | 2.81 | 12.58
HuatuoGPT* | 13B | 24.61 | 12.84 | 7.23 | 4.19 | 7.73 | 27.38 | 7.09 | 17.66
ChatGLM-Med | 6B | 32.18 | 18.37 | 8.87 | 3.79 | 6.09 | 26.14 | 8.08 | 18.87
BenTsao | 7B | 32.02 | 17.41 | 8.36 | 3.92 | 6.12 | 17.72 | 3.21 | 14.15
MedChatZH | 7B | 56.31 | 32.14 | 17.58 | 9.17 | 10.32 | 35.99 | 10.31 | 21.77
† Scores for models marked with * are copied from HuatuoGPT.
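The evaluation scripts are not reproduced in this section; as a rough illustration of how character-level BLEU-n and GLEU scores like those in Table 1 could be computed for Chinese answers, the sketch below uses NLTK and treats each Chinese character as a token. The character-level tokenization and the smoothing choice are assumptions rather than the authors' exact protocol; the ROUGE scores would be computed analogously with an LCS-based implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

def char_tokens(text: str) -> list[str]:
    """Character-level tokenization, a common choice for Chinese evaluation."""
    return [ch for ch in text if not ch.isspace()]

def score_pair(reference: str, hypothesis: str) -> dict:
    ref, hyp = [char_tokens(reference)], char_tokens(hypothesis)
    smooth = SmoothingFunction().method1
    scores = {
        f"BLEU-{n}": sentence_bleu(ref, hyp,
                                   weights=tuple(1.0 / n for _ in range(n)),
                                   smoothing_function=smooth)
        for n in range(1, 5)
    }
    scores["GLEU"] = sentence_gleu(ref, hyp)
    return scores

# Toy example: reference answer vs. model answer.
print(score_pair("多喝温水,注意休息。", "建议多喝温水并注意休息。"))
```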
3.2 Data Processing Pipeline
The BELLE-3.5M instruction dataset (Ji et al., 2023) is derived from ChatGPT, employing AI-style instructions known for their high quality. To ensure the dataset's reliability and coherence, we employ heuristic methods during the curation process. Specifically, we discard short answers that consist of fewer than 200 tokens and lack logical consistency. This approach helps to enhance the quality of the question-answer pairs in the dataset, resulting in more accurate and meaningful QA interactions.
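The 200-token threshold can be implemented as a simple heuristic filter; the sketch below is one possible version, assuming the instructions are stored as {'instruction', 'output'} dictionaries and that the Baichuan-7B tokenizer (an assumed Hugging Face repository id) is used for token counting. The consistency check described above is not captured here.

```python
from transformers import AutoTokenizer

# Assumed identifiers: the Baichuan-7B tokenizer and an {'instruction', 'output'} schema.
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B",
                                          trust_remote_code=True)

def filter_short_answers(examples: list[dict], min_tokens: int = 200) -> list[dict]:
    """Drop instruction pairs whose answers are shorter than min_tokens tokens."""
    return [ex for ex in examples
            if len(tokenizer.encode(ex.get("output", ""))) >= min_tokens]
```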
To ensure domain-specific knowledge, we have amassed over 7,000,000 medical instructions from the Internet and various Chinese hospitals. These instructions exhibit variations in expression, quality, length, and style. In order to curate a high-quality dataset, we apply the following filtering steps (a code sketch of the first two steps follows the list):
• Filtering Personal Data: We utilize heuristics, such as regular-expression matching, to identify and remove responses containing personal information like email addresses or phone numbers. This step ensures the protection of individuals' privacy.
• Self-labeling and Training: We perform self-labeling on a subset of 3,000 preference-ranking examples in the medical domain. This subset is then used to train a model called Ziya-LLaMA-7B-Reward 3. Data with scores lower than 0.5 are discarded, ensuring the selection of high-quality training examples.
• Numerical Symbol Harmonization: We harmonize various numerical list markers, such as '1,', '1:', and '(1)', into a standardized format represented by a number followed by a dot, e.g., '1.'. This standardization ensures consistency and ease of processing for numerical information.
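The following is a minimal sketch of the first two filtering steps above, assuming instructions are {'question', 'answer'} dictionaries. The regular expressions and the 0.5 threshold mirror the description in the list, while the reward-model call is represented by a hypothetical `score(question, answer)` helper wrapping Ziya-LLaMA-7B-Reward rather than the authors' actual code.

```python
import re
from typing import Callable

# Simple patterns for personal data; a production pipeline would use a broader set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?<!\d)(?:\+?86[- ]?)?1\d{10}(?!\d)")  # mainland-China mobile numbers

def contains_personal_data(text: str) -> bool:
    return bool(EMAIL.search(text) or PHONE.search(text))

def filter_instructions(examples: list[dict],
                        score: Callable[[str, str], float],
                        threshold: float = 0.5) -> list[dict]:
    """Drop examples that contain personal data or receive a low reward-model score."""
    kept = []
    for ex in examples:
        q, a = ex["question"], ex["answer"]
        if contains_personal_data(q) or contains_personal_data(a):
            continue
        if score(q, a) < threshold:  # score() wraps Ziya-LLaMA-7B-Reward (hypothetical helper)
            continue
        kept.append(ex)
    return kept
```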
As a result of these steps, we obtain a curated dataset comprising 763,629 medical instructions and 1,305,194 general instructions. This dataset serves as the foundation for fine-tuning our model, enabling it to acquire the necessary dialogue capabilities specific to the medical domain.
3.3 Base Model
Our base model is Baichuan-7B, which is based on the Transformer and shares the same architecture as LLaMa. This 7-billion-parameter model is trained on about 1.2 trillion tokens, supports both Chinese and English, and has a context window of 4,096 tokens. It achieves the best results among models of the same size on standard Chinese and English benchmarks (C-Eval/MMLU).
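As a usage illustration (not taken from the paper's code), Baichuan-7B can presumably be loaded through the Hugging Face `transformers` API roughly as follows; the repository id `baichuan-inc/Baichuan-7B` and the `trust_remote_code` flag are assumptions based on how Baichuan models are commonly distributed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan-7B"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)

# The base model is a plain causal LM, so this is raw continuation, not chat-style dialogue.
prompt = "中医如何看待感冒?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```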
3.4 Training Details
Our model is developed using PyTorch 2.0.1, with Baichuan-7B serving as the foundational architecture. During the further pre-training stage, we employ specific settings to optimize the model's performance: the learning rate is set to 2e-5, the batch size per device is 4, and the maximum context length is restricted to 2048 tokens. In the subsequent instruction fine-tuning stage, we deviate from the LoRA (Hu et al., 2021) strategy and instead opt for full-parameter fine-tuning. Here, the learning rate is adjusted to 2e-4, the batch size per device is increased to 8, and the maximum context length is limited to 1024 tokens. For optimization, we employ the AdamW optimizer (Loshchilov and Hutter, 2017) with a weight decay of 1e-5 to mitigate overfitting. To execute our experiments, we utilize 8 NVIDIA A800 GPUs and leverage ZeRO stage 2 (Rajbhandari et al., 2020), which optimizes memory consumption and accelerates training.
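The exact training scripts are not reproduced here; the snippet below is a hedged sketch of how the instruction fine-tuning hyper-parameters stated above (full-parameter tuning, learning rate 2e-4, per-device batch size 8, AdamW with weight decay 1e-5, ZeRO stage 2) might be expressed with the Hugging Face `Trainer` and a DeepSpeed config. The `train_dataset` placeholder stands for a tokenized version of med-mix-2M truncated to 1,024 tokens, and any value not stated in the paper is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# DeepSpeed ZeRO stage-2 settings; fields beyond the stated hyper-parameters are assumptions.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 8,
}

args = TrainingArguments(
    output_dir="medchatzh-sft",
    learning_rate=2e-4,                 # instruction fine-tuning stage
    per_device_train_batch_size=8,
    weight_decay=1e-5,
    num_train_epochs=1,                 # assumption; not stated in the paper
    bf16=True,
    deepspeed=ds_config,
    optim="adamw_torch",                # AdamW, as described above
    logging_steps=50,
)

# `train_dataset` is a stand-in for the tokenized med-mix-2M instructions (max length 1,024).
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```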
Figure 1: Chinese reward model scores on different categories in Medical QA.
4 Experiment
4.1 Baselines
In our evaluation, we compare the performance of our model with that of the state-of-the-art zero-shot model, OpenAI's ChatGPT (GPT-3.5-turbo), as well as several Chinese-specific large language models (LLMs) that have been fine-tuned specifically on medical domain knowledge.
• BenTsao 4 (Wang et al., 2023) is a fine-tuned Chinese LLM developed by SCIR-HI, leveraging the LoRA strategy and Chinese medical knowledge. It consists of two series, LLaMA-7B and Chinese-LLaMA-Alpaca (Cui et al., 2023). Our comparison focuses on LLaMA-7B, which is fine-tuned exclusively on the medical knowledge database, excluding medical literature.
• ChatGLM-Med 5 is another model based on the same dataset as BenTsao, but it utilizes the more Chinese-friendly ChatGLM-6B (Du et al., 2021) as its foundational model. It represents an enhanced version of ChatGLM, specifically designed for improved question-answering effectiveness in the medical field.
• ChatGPT 6 is a sibling model to InstructGPT (Ouyang et al., 2022), which is trained to follow instructions in a prompt and provide a detailed response. It is considered one of the leading dialogue models, and we compare our model against GPT-3.5-turbo.
• HuatuoGPT 7 (Zhang et al., 2023) releases model weights of HuatuoGPT-13B, which is trained on Ziya-LLaMA-13B-Pretrain-v1 8. It combines distilled data from ChatGPT and real-world data from doctors to enhance its medical dialogue capabilities.
Table 2: The distribution of the webMedQA dataset is highly skewed, with the largest category being ’internal medicine,’ comprising over 17,000 data points. The category with the least representation is ’other,’ containing only 30 questions and answers.
Dataset Size | Count | Category
---|---|---
>10000 | 2 | Internal Medicine; Surgery
5000-10000 | 2 | Pediatrics; Gynaecology and Obstetrics
1000-5000 | 7 | Otorhinolaryngology; Oncology; Dermatovenereology; Infectious Diseases; Mental Health; Plastic Surgery; TCM Health Care
<1000 | 12 | Aesthetic Medicine; Auxiliary Examination; Rehabilitation Medicine; Nutrition and Health; Home Environment; Exercise and Fitness; Physical Examination; Childcare Knowledge; Drug; Heredity; Other