[论文翻译]MedChatZH: 更优秀的医疗顾问源于更优质的指令


原文地址:https://arxiv.org/pdf/2309.01114


MedChatZH: a Better Medical Adviser Learns from Better Instructions

MedChatZH: 更优秀的医疗顾问源于更优质的指令

Yang Tan, Mingchen Li, Zijie Huang, Huiqun Yu and Guisheng Fan

杨坦、李明臣、黄子杰、余慧群、范贵生

Department of Computer Science and Technology, East China University of Science and Technology, China {tyang,lmc,hzj}@mail.ecust.edu.cn {yhq,gsfan}@ecust.edu.cn

华东理工大学计算机科学与技术系,中国 {tyang,lmc,hzj}@mail.ecust.edu.cn {yhq,gsfan}@ecust.edu.cn

Abstract

摘要

Generative large language models (LLMs) have shown great success in various applications, including question-answering (QA) and dialogue systems. However, in specialized domains like traditional Chinese medical QA, these models may perform unsatisfactorily without fine-tuning on domain-specific datasets. To address this, we introduce MedChatZH, a dialogue model designed specifically for traditional Chinese medical QA. Our model is pre-trained on Chinese traditional medical books and fine-tuned with a carefully curated medical instruction dataset. It outperforms several solid baselines on a real-world medical dialogue dataset. We release our model, code, and dataset on https://github.com/tyang816/MedChatZH to facilitate further research in the domain of traditional Chinese medicine and LLMs.

生成式大语言模型(LLM)在问答(QA)和对话系统等多种应用中取得了巨大成功。然而在中医问答等专业领域,若未经领域特定数据集的微调,这些模型可能表现欠佳。为此,我们推出了专为中医问答设计的对话模型MedChatZH。该模型基于中医典籍进行预训练,并通过精心构建的医疗指令数据集进行微调。在实际医疗对话数据集上的测试表明,其性能优于多个强基线模型。我们已将模型、代码和数据集发布于https://github.com/tyang816/MedChatZH,以促进中医药与大语言模型领域的进一步研究。

1 Introduction

1 引言

The ChatGPT series has achieved remarkable success in both academic and industrial circles, serving as a catalyst for numerous subsequent studies. Through a combination of instruction tuning and human feedback, these models have consistently demonstrated state-of-the-art performance across a wide range of Natural Language Processing (NLP) tasks. However, it is worth noting that these models are not openly available and do not divulge many specifics about their training process.

ChatGPT系列在学术界和工业界均取得了显著成功,成为众多后续研究的催化剂。通过指令微调 (instruction tuning) 与人类反馈的结合,这些模型在各类自然语言处理 (NLP) 任务中持续展现出最先进的性能。但值得注意的是,这些模型并未开源,也未透露其训练过程的具体细节。

In recent years, several alternative foundational models have emerged in response to this limitation. For instance, LLaMa (Touvron et al., 2023), BLOOM (Scao et al., 2022), and GLM (Du et al., 2021) are notable examples. These models have been trained on extensive collections of general raw texts derived from real-world sources, thereby introducing a new paradigm for comprehending fundamental knowledge within human society. By leveraging such diverse and expansive training data, these models offer unique insights and capabilities in understanding and processing natural language.

近年来,为应对这一局限性,出现了几种替代性的基础模型。例如,LLaMa (Touvron et al., 2023)、BLOOM (Scao et al., 2022) 和 GLM (Du et al., 2021) 都是值得关注的代表。这些模型基于从现实世界获取的大规模通用原始文本进行训练,从而为理解人类社会的基础知识提供了新范式。通过利用这种多样化且海量的训练数据,这些模型在理解和处理自然语言方面展现出独特的洞察力与能力。

Given the constraints imposed by the limited availability of high-quality corpora, most Large Language Models (LLMs) are primarily tailored to cater to English-speaking users. Unfortunately, their performance significantly deteriorates when deployed in scenarios involving other languages. Furthermore, the performance of general-purpose large language models cannot be universally remarkable across various specialized domains (Zhang et al., 2023). An illustrative example of this phenomenon lies in the commercialization of ChatGPT, which imposes certain restrictions on the provision of answers within the medical field. Consequently, a considerable disparity arises: capable medical dialogue resources remain scarce even though the demand for them is substantial. This disconnect presents a challenge in terms of harnessing the full potential of LLMs in the medical domain.

鉴于高质量语料库的有限可用性所带来的限制,大多数大语言模型(LLM)主要针对英语用户定制。遗憾的是,当这些模型应用于其他语言场景时,其性能会显著下降。此外,通用大语言模型在不同专业领域的表现也无法做到全面优异 (Zhang et al., 2023)。这种现象的一个典型例证是ChatGPT的商业化应用,该模型对医学领域的问题回答设置了特定限制。由此产生了一个显著矛盾:尽管应用范围有限,医疗资源却呈现严重匮乏状态。这种脱节现象对充分发挥这些资源在医疗领域的潜力构成了挑战。

Our main contributions can be summarized as follows:

我们的主要贡献可概括如下:

• We enhanced the Chinese-specific language model by training it on an extensive collection of traditional Chinese medicine (TCM) books. As a result, the model is capable of providing answers that combine knowledge from both traditional Chinese and Western medicine.

• 我们通过大量中医典籍的训练增强了中文专用语言模型,使其能够提供融合中西医知识的答案。

• We curated a new dataset of medical dialogue instructions through a sophisticated pipeline that meticulously removed any irrelevant or sensitive data, such as private information and colloquial responses.

• 我们通过一套精细的流程整理出一套新的医疗对话指令数据集,该流程严格剔除了所有不相关或敏感数据,例如私人信息和口语化回复。

• We demonstrated state-of-the-art performance on a real-world medical QA benchmark, outperforming other baseline models across several evaluation metrics. Furthermore, we have made our dataset and model open-source for the benefit of the research community.

• 我们在真实世界的医疗问答基准测试中展示了最先进的性能,在多项评估指标上超越其他基线模型。此外,我们已将数据集和模型开源,以造福研究社区。

2 Related Work

2 相关工作

2.1 Training General Language Models

2.1 通用语言模型训练

Training general language models consumes trillions of tokens and costly computation resources to learn the structure, syntax, and semantics of human language through unsupervised methods. This stage allows the model to learn general language patterns and representations.

训练通用大语言模型需要消耗数万亿token和昂贵的计算资源,通过无监督方法学习人类语言的结构、语法和语义。这一阶段使模型能够掌握通用语言模式和表征。

The Transformer (Vaswani et al., 2017) revolutionized natural language processing with its introduction of attention mechanisms, inspiring subsequent encoder-only architectures like BERT (Devlin et al., 2018) that leverage masked language modeling, as well as causal models such as the GPT series (Radford et al., 2018, 2019; Brown et al., 2020) that utilize a next-token prediction strategy. However, since OpenAI released ChatGPT and GPT-4, causal language models have shown even greater potential for modeling the real world, but their weights and training details are not open to the public.

Transformer (Vaswani等人,2017) 通过引入注意力机制彻底改变了自然语言处理领域,启发了后续仅编码器架构如 BERT (Devlin等人,2018) 采用掩码语言建模,以及因果模型如 GPT 系列 (Radford等人,2018,2019;Brown等人,2020) 使用下一token预测策略。然而自OpenAI发布ChatGPT和GPT-4以来,因果语言模型在现实世界建模中展现出更大潜力,但其模型权重和训练细节未向公众开放。

As alternatives, both LLaMa (Touvron et al., 2023) and BLOOM (Scao et al., 2022) have released model weights with more than 10 billion parameters for research purposes, but they focus on English applications and were trained on massive English corpora. Recognizing the need to bridge the language gap in Chinese applications, ChatGLM (Du et al., 2021; Zeng et al., 2022) employs an auto-regressive GLM with multiple training objectives and a bilingual corpus, achieving superior performance in Chinese-specific tasks. To address Chinese language requirements, TigerBot and Baichuan have been developed based on the BLOOM and LLaMa architectures, respectively. These models are commercially available and cater to Chinese language processing needs.

作为替代方案,LLaMa (Touvron et al., 2023) 和 BLOOM (Scao et al., 2022) 都发布了参数超过100亿的模型权重供研究使用,但它们主要面向英语应用,并在海量英语语料库上进行了训练。为填补中文应用的语言空白,ChatGLM (Du et al., 2021; Zeng et al., 2022) 采用具有多重训练目标和双语语料库的自回归GLM,在中文特定任务中表现出色。针对中文需求,基于BLOOM和LLaMa架构分别开发了TigerBot和Baichuan,这些模型为商业产品,专注于中文语言处理。

2.2 Medical Language Models

2.2 医疗语言模型

While general-purpose Language Models (LMs) have demonstrated remarkable capabilities in various scenarios, it is often necessary to fine-tune them on specific, smaller datasets that are tailored to the target task or domain. This fine-tuning process helps the models to better understand and adapt to the specific requirements of downstream tasks.

虽然通用语言模型 (Language Model) 在各种场景中展现出了卓越的能力,但通常仍需针对目标任务或领域特定的较小数据集进行微调。这种微调过程有助于模型更好地理解并适应下游任务的具体需求。

In comparison to general-purpose models, specialized models for specific verticals are relatively scarce. For instance, BenTsao (Wang et al., 2023) constructed a Chinese medical instruction dataset by leveraging a medical knowledge graph and the GPT-3.5 API, and fine-tuned LLaMA on these instructions to enhance its query-and-answer effectiveness specifically in the medical field. HuatuoGPT (Zhang et al., 2023) is a large language model trained on an extensive Chinese medical corpus, with the goal of constructing a more proficient 'ChatGPT' for medical consultation scenarios.

与通用模型相比,针对特定垂直领域的专用模型相对稀缺。例如,BenTsao (Wang et al., 2023) 通过结合医学知识图谱和GPT-3.5 API构建了中文医疗指令数据集,并在这些指令上对LLaMA进行微调,以提升其在医疗领域的查询和回答效果。华佗GPT (HuatuoGPT) (Zhang et al., 2023) 则是一个基于海量中文医疗语料训练的大语言模型,旨在为医疗咨询场景构建更专业的"ChatGPT"。

Additionally, Google’s Med-PaLM (Singhal et al., 2022) harnesses the power of Google’s large language models. These models have been aligned with the medical domain and evaluated using medical exams, medical research, and consumer queries in the English language. This alignment and evaluation process ensures that the model is well-suited for handling medical-related tasks and inquiries.

此外,Google的Med-PaLM (Singhal et al., 2022) 利用了Google大语言模型的能力。这些模型已与医学领域对齐,并通过英语医学考试、医学研究和消费者查询进行评估。这种对齐和评估过程确保了该模型非常适合处理与医学相关的任务和查询。

By developing and fine-tuning these specialized models, we aim to provide more accurate and reliable language processing solutions in domains such as healthcare and medicine. These models bridge the gap between general-purpose LMs and specific vertical applications, enabling more effective and targeted language understanding and generation in specialized fields.

通过开发和微调这些专用模型,我们旨在为医疗健康等领域提供更精准可靠的语言处理解决方案。这些模型弥合了通用大语言模型与垂直领域应用之间的鸿沟,使专业领域的语言理解与生成更具针对性和实效性。

3 MedChatZH

3 MedChatZH

In this section, we will introduce the data process pipeline and training details of MedChatZH.

在本节中,我们将介绍 MedChatZH 的数据处理流程和训练细节。

3.1 Data Collection

3.1 数据收集

Our training dataset consists of two main components: TCM books and raw instructions.

我们的训练数据集包含两个主要组成部分:中医 (TCM) 典籍和原始指令。

For the medical books, we have gathered a comprehensive collection of over 1,000 books, including renowned works such as the Yellow Emperor's Canon of Internal Medicine and the Treatise on Febrile Diseases, as well as valuable folk doctor notes. We have primarily focused on extracting relevant texts from these books, and only minimal cleaning has been performed on this dataset.

在医学书籍方面,我们已收集了1000多本涵盖全面的著作,包括《黄帝内经》《伤寒论》等经典医籍,以及珍贵的民间医家手札。目前主要从这些书籍中提取相关文本,该数据集仅进行了基础清洗。

In contrast, for the instructions component, we have created a mixture of general and medical Chinese data known as med-mix-2M. This dataset combines both general and medical Chinese instructions, providing a diverse range of language patterns and medical contexts. The med-mix-2M dataset serves as a valuable resource for training models with a broad understanding of both general language usage and medical terminology.

相比之下,在指令组件方面,我们创建了一个名为med-mix-2M的通用中文与医学中文混合数据集。该数据集融合了通用中文指令和医学中文指令,提供了多样化的语言模式和医学语境。med-mix-2M数据集是训练同时理解通用语言用法和医学术语模型的宝贵资源。

Table 1: Results on the webMedQA benchmark.

Model Parameter BLEU-1 BLEU-2 BLEU-3 BLEU-4 GLEU ROUGE-1 ROUGE-2 ROUGE-L
GPT-3.5-turbo* 18.06 6.74 2.73 1.09 4.71 20.01 2.81 12.58
HuatuoGPT* 13B 24.61 12.84 7.23 4.19 7.73 27.38 7.09 17.66
ChatGLM-Med 6B 32.18 18.37 8.87 3.79 6.09 26.14 8.08 18.87
BenTsao 7B 32.02 17.41 8.36 3.92 6.12 17.72 3.21 14.15
MedChatZH 7B 56.31 32.14 17.58 9.17 10.32 35.99 10.31 21.77

$\dagger$ Models marked with * have scores copied from HuatuoGPT.

表 1: webMedQA基准测试结果。

Model Parameter BLEU-1 BLEU-2 BLEU-3 BLEU-4 GLEU ROUGE-1 ROUGE-2 ROUGE-L
GPT-3.5-turbo* 18.06 6.74 2.73 1.09 4.71 20.01 2.81 12.58
HuatuoGPT* 13B 24.61 12.84 7.23 4.19 7.73 27.38 7.09 17.66
ChatGLM-Med 6B 32.18 18.37 8.87 3.79 6.09 26.14 8.08 18.87
BenTsao 7B 32.02 17.41 8.36 3.92 6.12 17.72 3.21 14.15
MedChatZH 7B 56.31 32.14 17.58 9.17 10.32 35.99 10.31 21.77

$\dagger$ 标有* 的模型表示分数来自HuatuoGPT。

3.2 Data Process Pipeline

3.2 数据处理流程

The BELLE-3.5M instruction dataset (Ji et al., 2023) is derived from ChatGPT, employing AI-style instructions known for their high quality. To ensure the dataset's reliability and coherence, we employ heuristic methods during the curation process. Specifically, we discard short answers that consist of fewer than 200 tokens and lack logical consistency. This approach helps to enhance the quality of the question-answer pairs in the dataset, resulting in more accurate and meaningful QA interactions.

BELLE-3.5M指令数据集(Yunjie Ji等, 2023)源自ChatGPT,采用以高质量著称的AI风格指令。为确保数据集的可靠性与连贯性,我们在整理过程中采用了启发式方法。具体而言,我们会剔除少于200个token且缺乏逻辑一致性的简短回答。这种方法有助于提升数据集中问答对的质量,从而产生更准确、更有意义的问答交互。
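As a rough sketch of this length-based heuristic (the 200-token threshold comes from the text above; counting Chinese characters as tokens is our simplifying assumption, since the paper does not specify its tokenizer):

```python
def keep_answer(answer: str, min_tokens: int = 200) -> bool:
    """Length-based heuristic filter: drop answers shorter than min_tokens.

    Counting characters as tokens is a simplification for Chinese text;
    the paper does not state which tokenizer it uses.
    """
    return len(answer) >= min_tokens

# A terse, low-information reply is discarded; a detailed one survives.
assert not keep_answer("多喝水。")
assert keep_answer("感冒初期应注意休息、补充水分。" * 20)
```

In practice this filter would run alongside the logical-consistency check; here only the length criterion is shown.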

To ensure domain-specific knowledge, we have amassed over 7,000,000 medical instructions from the Internet and various Chinese hospitals. These instructions exhibit variations in expression, quality, length, and style. In order to curate a high-quality dataset, we apply the following filtering steps:

为确保领域专业知识,我们从互联网和中国多家医院收集了超过7,000,000条医疗指令。这些指令在表达方式、质量、长度和风格上存在差异。为构建高质量数据集,我们采取了以下过滤步骤:

• Filtering Personal Data: We utilize heuristics, such as regular-expression matching, to identify and remove responses containing personal information like email addresses or phone numbers. This step ensures the protection of individuals' privacy.

• Self-labeling and Training: We perform self-labeling on a subset of 3,000 preference-ranking data points in the medical domain. This subset is then used to train a model called Ziya-LLaMA-7B-Reward. Data with scores lower than 0.5 are discarded, ensuring the selection of high-quality training examples.

• 过滤个人数据:我们采用启发式方法(如正则匹配)来识别并删除包含电子邮件地址或电话号码等个人信息的回答。这一步骤确保了对个人隐私的保护。
• 自标注与训练:我们对医疗领域的3000条偏好排序数据子集进行自标注,随后用该子集训练名为ZiyaLLaMA-7B-Reward的模型。得分低于0.5的数据将被丢弃,从而确保筛选出高质量的训练样本。

• Numerical Symbol Harmonization: We harmonize various numerical list markers, such as '1、', '1:', and '(1)', into a standardized format represented by a number followed by a dot, e.g., '1.'. This standardization ensures consistency and ease of processing for numerical information.

• 数值符号统一化:我们将各种数值编号符号(如"1、""1:""(1)"等)统一为数字后跟一个点的标准格式(例如"1.")。这种标准化确保了数值信息的一致性和处理便捷性。
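The three filtering steps above can be sketched as follows; the regular expressions and marker formats are illustrative assumptions on our part, since the paper does not publish its actual patterns:

```python
import re

# Hypothetical patterns for the personal-data filter; the paper's
# actual regular expressions are not published.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"1[3-9]\d{9}")  # mainland-China mobile numbers

def contains_personal_data(text: str) -> bool:
    """Step 1: flag responses containing emails or phone numbers."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))

def keep_by_reward(score: float, threshold: float = 0.5) -> bool:
    """Step 2: instructions scored below 0.5 by the reward model are dropped."""
    return score >= threshold

def harmonize_markers(text: str) -> str:
    """Step 3: normalize list markers like '(1)' or '1、' to '1.' at line starts."""
    text = re.sub(r"^\((\d+)\)\s*", r"\1. ", text, flags=re.M)
    text = re.sub(r"^(\d+)[、:]\s*", r"\1. ", text, flags=re.M)
    return text

assert contains_personal_data("联系方式: doctor@example.com")
assert not contains_personal_data("每日三次,饭后服用。")
assert keep_by_reward(0.8) and not keep_by_reward(0.4)
assert harmonize_markers("(1) 休息\n2、多喝水") == "1. 休息\n2. 多喝水"
```

Each instruction would pass through all three checks in sequence; only the privacy filter rejects data outright, while harmonization rewrites it in place.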

As a result of these steps, we obtain a curated dataset comprising 763,629 medical instructions and 1,305,194 general instructions. This dataset serves as the foundation for fine-tuning our model, enabling it to acquire the necessary dialogue capabilities specific to the medical domain.

经过上述步骤,我们最终获得了一个包含763,629条医疗指令和1,305,194条通用指令的精选数据集。该数据集作为微调模型的基础,使其能够习得医疗领域所需的特定对话能力。

3.3 Base Model

3.3 基础模型

Our base model is Baichuan-7B, which is based on the Transformer, and its architecture is the same as that of LLaMa. This 7-billion-parameter model is trained on about 1.2 trillion tokens, supports both Chinese and English, and has a context window length of 4096. It achieves the best results among models of the same size on the standard Chinese and English benchmarks (C-Eval/MMLU).

我们的基础模型是Baichuan-7B,它基于Transformer架构,与LLaMa相同。这个70亿参数的模型在约1.2万亿token的中英双语数据上训练,上下文窗口长度为4096。在标准中英文基准测试(C-Eval/MMLU)上取得了同规模模型的最佳成绩。

3.4 Training Details

3.4 训练细节

Our model is developed using PyTorch 2.0.1, with Baichuan-7B serving as the foundational architecture. During the further pre-training stage, we employ specific settings to optimize the model's performance. The learning rate is set to 2e-5, the batch size per device is 4, and the maximum context length is restricted to 2048 tokens. In the subsequent instruction fine-tuning stage, we deviate from the LoRA (Hu et al., 2021) strategy and instead opt for a full-parameter fine-tuning approach. Here, the learning rate is adjusted to 2e-4, the batch size per device is increased to 8, and the maximum context length is limited to 1024 tokens. For optimization, we employ the AdamW optimizer (Loshchilov and Hutter, 2017), and weight decay is set to 1e-5 to mitigate overfitting. To execute our experiments, we utilize 8 NVIDIA A800 GPUs and leverage the ZeRO-2 (Rajbhandari et al., 2020) stage, which optimizes memory consumption and accelerates training.

我们的模型基于PyTorch 2.0.1开发,采用Baichuan-7B作为基础架构。在增量预训练阶段,我们采用特定参数优化模型性能:学习率设为2e-5,单设备批处理大小为4,最大上下文长度限制为2048个token。在后续指令微调阶段,我们未采用LoRA (Hu et al., 2021)策略,而是选择全参数微调方案,此时学习率调整为2e-4,单设备批处理量提升至8,最大上下文长度限制为1024个token。优化器选用AdamW (Loshchilov and Hutter, 2017),权重衰减系数设为1e-5以防止过拟合。实验使用8块NVIDIA A800 GPU,通过ZeRO-2 (Rajbhandari et al., 2020)技术阶段实现内存优化与训练加速。
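For reference, the reported hyperparameters can be collected as follows; the dictionary layout is ours, not the authors' actual configuration files:

```python
# Hyperparameters as reported in the text; grouping is illustrative.
PRETRAIN = {
    "learning_rate": 2e-5,
    "per_device_batch_size": 4,
    "max_context_length": 2048,  # tokens
}
SFT = {  # full-parameter instruction fine-tuning (no LoRA)
    "learning_rate": 2e-4,
    "per_device_batch_size": 8,
    "max_context_length": 1024,  # tokens
}
COMMON = {
    "optimizer": "AdamW",
    "weight_decay": 1e-5,
    "gpus": "8x NVIDIA A800",
    "parallelism": "ZeRO-2",
}

# The fine-tuning stage trades context length for batch size and a
# 10x higher learning rate relative to further pre-training.
assert round(SFT["learning_rate"] / PRETRAIN["learning_rate"]) == 10
```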


Figure 1: Chinese reward model scores on different categories in Medical QA.

图 1: 医疗问答领域中文奖励模型在不同类别上的得分。

4 Experiment

4 实验

4.1 Baselines

4.1 基线方法

In our evaluation, we compare the performance of our model with that of the state-of-the-art zero-shot model, OpenAI’s ChatGPT (GPT-3.5-turbo), as well as several Chinese-specific Language Models (LLMs) that have been fine-tuned specifically on medical domain knowledge.

在我们的评估中,我们将模型性能与当前最先进的零样本模型 OpenAI 的 ChatGPT (GPT-3.5-turbo) ,以及多个针对医疗领域知识专门微调的中文大语言模型进行了对比。

• BenTsao (Wang et al., 2023) is a fine-tuned Chinese Language Model (LLM) developed by SCIR-HI, leveraging the LoRA strategy and Chinese medical knowledge. It consists of two series, LLaMA-7B and Chinese-LLaMA-Alpaca (Cui et al., 2023). Our comparison focuses on LLaMA-7B, which is fine-tuned exclusively on the medical knowledge database, excluding medical literature.

• BenTsao (Wang et al., 2023) 是由SCIR-HI基于LoRA策略和中文医学知识微调开发的中文大语言模型。该模型包含LLaMA-7B和Chinese-LLaMA-Alpaca (Cui et al., 2023) 两个系列。我们的对比研究聚焦于仅在医学知识库(不含医学文献)上微调的LLaMA-7B版本。

• ChatGLM-Med is another model based on the same dataset as BenTsao, but utilizing the more Chinese-friendly ChatGLM-6B (Du et al., 2021) as its foundational model. It represents an enhanced version of ChatGLM, specifically designed for improved question-answering effectiveness in the medical field.

• ChatGLM-Med 是基于与BenTsao相同数据集的另一模型,但采用了更适配中文场景的ChatGLM-6B (Du et al., 2021) 作为基础模型。它是ChatGLM的增强版本,专门针对医疗领域问答效果进行了优化。

• ChatGPT is a sibling model to InstructGPT (Ouyang et al., 2022), which is trained to follow instructions in a prompt and provide a detailed response. It is considered one of the leading dialogue models, and we compare our model against GPT-3.5-turbo.

• ChatGPT 是 InstructGPT (Ouyang et al., 2022) 的姊妹模型,经过训练能够遵循提示指令并提供详细响应。它被视为领先的对话模型之一,我们将自己的模型与 GPT-3.5-turbo 进行了对比。

• HuatuoGPT (Zhang et al., 2023) releases the model weights of HuatuoGPT-13B, which is trained on Ziya-LLaMA-13B-Pretrain-v1. It combines distilled data from ChatGPT and real-world data from doctors to enhance its medical dialogue capabilities.

• HuatuoGPT (Zhang et al., 2023) 发布了基于Ziya-LLaMA-13B-Pretrain-v1训练的HuatuoGPT-13B模型权重,通过融合ChatGPT蒸馏数据和医生真实诊疗数据来强化医疗对话能力。

Table 2: The distribution of the webMedQA dataset is highly skewed, with the largest category being ’internal medicine,’ comprising over 17,000 data points. The category with the least representation is ’other,’ containing only 30 questions and answers.

Dataset Size Count Category
>10000 2 Internal Medicine; Surgery
5000-10000 2 Pediatrics; Gynaecology and Obstetrics
1000-5000 7 Pentaphthaliaceae; Oncology; Dermatovenereology; Infectious Diseases; Mental Health; Plastic Surgery; TCM Health Care
<1000 12 Aesthetic Medicine; Auxiliary Examination; Rehabilitation Medicine; Nutrition and Health; Home Environment; Exercise and Fitness; Physical Examination; Childcare Knowledge; Drug; Heredity; Other

表 2: webMedQA数据集分布极不均衡,最大类别"内科"包含超过17,000条数据,最小类别"其他"仅含30组问答。

数据集规模 数量 类别
>10000 2 内科; 外科
5000-10000 2 儿科; 妇产科
1000-5000 7 五官科; 肿瘤科; 皮肤性病科; 传染病科; 心理健康; 整形外科; 中医保健
<1000 12 医学美容; 辅助检查; 康复医学; 营养健康; 家庭环境; 运动健身; 体检; 育儿知识; 药品; 遗传学; 其他


Figure 2: Ablation study on webMedQA, evaluated by traditional NLP metrics.

图 2: 基于传统 NLP 指标的 webMedQA 消融实验结果。

4.2 Benchmark

4.2 基准测试

The webMedQA dataset (He et al., 2019) is a real-world collection of Chinese medical question-answering (QA) data sourced from online health consultancy websites. It comprises 63,255 questions. This dataset offers the advantage of multiple candidate answers corresponding to each question, allowing for the evaluation of answer accuracy using multiple references. It further categorizes the data into 23 different domains, including Health Care, Internal Medicine, and other departments, enabling more targeted analysis and exploration. All basic information can be found in Table 2.

webMedQA数据集 (He et al., 2019) 是一个真实世界的中文医疗问答数据集合,来源于在线健康咨询网站。该数据集包含63,255个问题,其优势在于每个问题对应多个候选答案,可通过多参考评估答案准确性。数据集进一步细分为23个不同领域,包括保健科、内科等科室,便于进行更有针对性的分析和探索。所有基础信息见表2。

4.3 Evaluation Metrics

4.3 评估指标

Our evaluation methodology comprises two primary components: traditional Natural Language Processing (NLP) metrics and reward model scores.

我们的评估方法包含两个主要部分:传统自然语言处理 (NLP) 指标和奖励模型分数。

To quantify the similarity between generated and reference sentences, we employ the BLEU metric (Papineni et al., 2002). It calculates the n-gram overlap, enabling us to assess the similarity of n-grams in the generated output and the reference sentences.

为了量化生成句子与参考句子之间的相似度,我们采用BLEU指标 (Papineni et al., 2002)。它计算n元组重叠度,使我们能够评估生成输出与参考句子中n-gram的相似性。
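As a self-contained illustration of the clipped n-gram precision at the core of BLEU (a single-reference sketch; full BLEU combines n = 1..4 geometrically and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Modified (clipped) n-gram precision over a single reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference, then divided by the candidate count.
    """
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / len(cand)

# Character-level unigrams for a Chinese example:
cand = list("感冒时应多喝水")
ref = list("感冒应多喝水多休息")
precision = ngram_precision(cand, ref, n=1)  # 6 of 7 unigrams matched
```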

For evaluating sentence-level fluency, we utilize the GLEU metric (Mutton et al., 2007). This metric automatically evaluates the fluency of generated responses, taking into account both adequacy and fluency aspects.

为评估句子层面的流畅性,我们采用GLEU指标 (Mutton et al., 2007)。该指标能自动衡量生成回复的流畅性,同时兼顾充分性和流畅性两个维度。

To gauge the overlap of n-grams between the generated output and the reference summaries, we employ the ROUGE metric (Lin, 2004). Specifically, we use ROUGE-L, which measures the longest common subsequence of word matches.

为了衡量生成输出与参考摘要之间的n-gram重叠度,我们采用ROUGE指标 (Lin, 2004)。具体而言,我们使用ROUGE-L来测量最长公共子序列的单词匹配情况。
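ROUGE-L can be sketched directly from the longest-common-subsequence definition (a simplified single-reference F1; library implementations add recall weighting options):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from LCS-based precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)

# Character-level example: the whole candidate is a subsequence of the reference.
score = rouge_l_f1(list("多喝水注意休息"), list("建议多喝水并注意休息"))
```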

Additionally, we incorporate a Reward Model Score as a more flexible and nuanced evaluation metric. In this study, we utilize the Ziya-LLaMA-7B-Reward model. This reward model is specifically designed to accurately assess the quality of model-generated output, including factors such as text repetition, abnormal interruptions, and adherence to instruction requirements. It assigns a lower reward value to outputs that exhibit low-quality generation characteristics.

此外,我们引入奖励模型评分 (Reward Model Score) 作为更灵活细致的评估指标。本研究采用Ziya-LLaMA-7B-Reward模型,该奖励模型专为精准评估模型生成内容质量而设计,能检测文本重复、异常中断及指令遵循度等要素,对存在低质量生成特征的输出会赋予较低奖励值。

By combining these traditional NLP metrics and reward-based evaluation, our evaluation framework provides a comprehensive and rigorous assessment of the model’s performance. These metrics enable us to evaluate similarity, fluency, adherence to instructions, and overall quality of the generated responses in a systematic and objective manner.

通过结合这些传统 NLP (Natural Language Processing) 指标和基于奖励的评估方法,我们的评估框架为模型性能提供了全面而严谨的评判。这些指标使我们能够以系统化、客观的方式评估生成响应的相似性、流畅性、指令遵循度以及整体质量。

4.4 Results

4.4 结果

In this research study, our primary focus is on evaluating single-turn questions. The results of all the models are presented in Table 1. It's important to note that the score results for GPT-3.5-turbo and HuatuoGPT are directly taken from the original paper of HuatuoGPT, and we have not re-run the experimental validation for these models. However, for the remaining models, we have used official checkpoints and conducted inference on the dataset to ensure that all results are reproducible. Our model demonstrates a significant performance improvement over other baseline models in single-turn Chinese medical dialogue situations.

在本研究中,我们主要关注单轮问答任务的评估。所有模型的实验结果如表1所示。需特别说明的是,GPT-3.5-turbo和HuatuoGPT的评分结果直接引自HuatuoGPT原始论文,我们未对这些模型重新进行实验验证。对于其余模型,我们使用官方检查点并在数据集上进行了推理测试,确保所有结果具备可复现性。我们的模型在中文医疗单轮对话场景中展现出显著优于其他基线模型的性能提升。

Due to the limitations of traditional metrics commonly used in machine translation scenarios, which may not be entirely suitable for evaluating dialogue quality, we have also employed a fine-tuned reward model to score answers. For this purpose, we utilized a medical-specific language model in the Chinese domain to compare the performance of our model against other baselines, as shown in Fig 1.

由于传统机器翻译场景中常用的评估指标可能不完全适用于对话质量评估,我们还采用了一个微调后的奖励模型对回答进行评分。为此,我们使用了中文领域的医疗专用大语言模型,将本模型与其他基线模型的表现进行对比,如图1所示。

To ensure accurate evaluation and avoid unnecessary confusion, it is essential to consider that different versions of the evaluation kit can yield different results (Shi et al., 2022). Therefore, we have used the latest version of NLTK-3.8.1 for our evaluation.

为确保评估准确并避免不必要的混淆,必须考虑不同版本的评估工具包可能产生不同结果 (Shi et al., 2022)。因此,我们使用了最新版NLTK-3.8.1进行评估。

5 Discussion

5 讨论

5.1 Ablation Study

5.1 消融研究

Given the constraints imposed by limited computational resources, we have conducted an ablation study focusing solely on whether to use distilled medical instructions. The results, as depicted in Fig 2, clearly demonstrate that after fine-tuning the model using high-quality medical instruction data, the medical question-answering (QA) ability has shown a substantial improvement. This outcome highlights the crucial role played by fine-tuning with relevant medical instruction data in enhancing the performance of our model in the medical QA domain.

鉴于有限计算资源的限制,我们仅针对是否使用蒸馏医学指令进行了消融研究。如图2所示,结果表明:在使用高质量医学指令数据对模型进行微调后,其医疗问答(QA)能力展现出显著提升。这一结果凸显了相关医学指令数据微调对增强模型在医疗QA领域性能的关键作用。

5.2 Limitation

5.2 局限性

Our model is trained for Chinese speakers in the non-commercial medical domain, so it’s not suitable for other languages or domains. Medical advice is sensitive and critical, and if the model provides unreasonable advice, it could lead to bad negative effects. We cannot guarantee the authenticity of our model’s output, and it may suffer from hallucination phenomena common in language models. Caution, human verification, and transparent communication are essential when using the model.

我们的模型专为非商业医疗领域的中文使用者训练,因此不适用于其他语言或领域。医疗建议敏感且关键,若模型提供不合理建议,可能引发严重负面影响。我们无法保证模型输出的真实性,且可能出现大语言模型常见的幻觉现象。使用该模型时,必须谨慎对待、人工核验并保持透明沟通。

6 Conclusion

6 结论

This paper compiles and organizes a significant amount of traditional Chinese medicine texts to further train Chinese large models. This process enhances the models' localization and adaptability to specific language environments. Additionally, the data quality is improved through a rigorous data-cleansing process that involves heuristic methods and reward models.

本文汇编整理了大量中医药典籍,用于进一步训练中文大语言模型。这一过程增强了模型在特定语言环境下的本地化能力和适应性。此外,通过结合启发式方法和奖励模型的严格数据清洗 (Data cleansing) 流程,数据质量得到了显著提升。

To evaluate the effectiveness of the approach, comprehensive tests were performed using real medical consultation data. These tests compared MedChatZH with multiple powerful baselines, including traditional NLP indicators and AI model scoring. The results demonstrate the robustness of MedChatZH in the medical domain, validating its performance and efficacy.

为评估该方法的有效性,我们使用真实医疗咨询数据进行了全面测试。这些测试将MedChatZH与多个强大基线进行了对比,包括传统NLP指标和AI模型评分。结果表明MedChatZH在医疗领域具有稳健性,验证了其性能与功效。

7 Acknowledgements

7 致谢

Supported by the Research Programme of the National Engineering Laboratory for Big Data Distribution and Exchange Technologies, Shanghai Municipal Special Fund for Promoting High-Quality Development (No. 2021-GYHLW-01007).

国家大数据流通与交易技术国家工程实验室研究计划资助
上海市促进高质量发展专项资金资助 (No. 2021-GYHLW-01007)

References

参考文献

He et al. 2019. Applying deep matching networks to Chinese medical question answering: a study and a dataset. BMC Medical Informatics and Decision Making, 19(2):91–100.

He et al. 2019. 将深度匹配网络应用于中文医疗问答:一项研究及数据集。BMC医学信息学与决策制定,19(2):91-100。

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Edward J Hu、Yelong Shen、Phillip Wallis、Zeyuan Allen-Zhu、Yuanzhi Li、Shean Wang、Lu Wang 和 Weizhu Chen。2021. Lora: 大语言模型的低秩自适应。arXiv预印本 arXiv:2106.09685。

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.

Chin-Yew Lin. 2004. ROUGE: 自动摘要评估工具包。In Text summarization branches out, pages 74-81.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Ilya Loshchilov and Frank Hutter. 2017. 解耦权重衰减正则化. arXiv preprint arXiv:1711.05101.

Andrew Mutton, Mark Dras, Stephen Wan, and Robert Dale. 2007. Gleu: Automatic evaluation of sentencelevel fluency. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 344–351.

Andrew Mutton、Mark Dras、Stephen Wan 和 Robert Dale。2007. GLEU:句子流畅度的自动评估。载于《第45届计算语言学协会年会论文集》,第344-351页。

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Long Ouyang、Jeffrey Wu、Xu Jiang、Diogo Almeida、Carroll Wainwright、Pamela Mishkin、Chong Zhang、Sandhini Agarwal、Katarina Slama、Alex Ray 等. 2022. 通过人类反馈训练语言模型遵循指令. Advances in Neural Information Processing Systems, 35:27730–27744.

Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.

Kishore Papineni、Salim Roukos、Todd Ward 和 WeiJing Zhu。2002。BLEU:一种机器翻译自动评估方法。载于《第40届计算语言学协会年会论文集》,第311-318页。

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.

Alec Radford、Karthik Narasimhan、Tim Salimans、Ilya Sutskever 等. 2018. 通过生成式预训练提升语言理解能力.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Alec Radford、Jeffrey Wu、Rewon Child、David Luan、Dario Amodei、Ilya Sutskever 等. 2019. 语言模型是无监督多任务学习者. OpenAI 博客, 1(8):9.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.

Samyam Rajbhandari、Jeff Rasley、Olatunji Ruwase 和 Yuxiong He。2020年。Zero:面向万亿参数模型训练的内存优化技术。收录于SC20:国际高性能计算、网络、存储与分析会议,第1-16页。IEEE。

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé 等. 2022. Bloom: 一个1760亿参数的开源多语言大语言模型. arXiv预印本 arXiv:2211.05100.

Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, and Hongbin Sun. 2022. On the evaluation of neural code summarization. In Proceedings of the 44th International Conference on Software Engineering, pages 1597–1608.

Ensheng Shi、Yanlin Wang、Lun Du、Junjie Chen、Shi Han、Hongyu Zhang、Dongmei Zhang 和 Hongbin Sun。2022. 神经代码摘要评估研究。见《第44届国际软件工程会议论文集》,第1597–1608页。

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2022. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.

Karan Singhal、Shekoofeh Azizi、Tao Tu、S Sara Mahdavi、Jason Wei、Hyung Won Chung、Nathan Scales、Ajay Tanwani、Heather Cole-Lewis、Stephen Pfohl等。2022。大语言模型编码临床知识。arXiv预印本arXiv:2212.13138。

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron、Thibaut Lavril、Gautier Izacard、Xavier Martinet、Marie-Anne Lachaux、Timothée Lacroix、Baptiste Rozière、Naman Goyal、Eric Hambro、Faisal Azhar 等。2023。Llama: 开放高效的基础语言模型。arXiv预印本 arXiv:2302.13971。

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Ashish Vaswani、Noam Shazeer、Niki Parmar、Jakob Uszkoreit、Llion Jones、Aidan N Gomez、Łukasz Kaiser 和 Illia Polosukhin。2017. Attention is all you need。神经信息处理系统进展,30。

Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975.

王浩春、刘驰、席女娲、强泽文、赵森东、秦兵和刘挺。2023。华佗:基于中文医学知识调优的Llama模型。arXiv预印本arXiv:2304.06975。

Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Lei Zhang, Baochang Ma, and Xiangang Li. 2023. Exploring the impact of instruction data scaling on large language models: An empirical study on real-world use cases. arXiv preprint arXiv:2303.14742.

纪云杰、邓勇、龚燕、彭一平、牛强、张磊、马宝昌、李贤刚. 2023. 探究指令数据规模对大语言模型的影响: 基于真实应用场景的实证研究. arXiv预印本 arXiv:2303.14742.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: 一个开放的双语预训练模型. arXiv preprint arXiv:2210.02414.

Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, et al. 2023. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075.

Hongbo Zhang、Junying Chen、Feng Jiang、Fei Yu、Zhihong Chen、Jianquan Li、Guiming Chen、Xiangbo Wu、Zhiyi Zhang、Qingying Xiao等. 2023. 华佗GPT:驯化大语言模型成为医生. arXiv预印本 arXiv:2305.15075.
