Guangyu Wang∗, Guoxing Yang, Zongxin Du, Longjun Fan, Xiaohu Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
∗guangyu.wang24@gmail.com
ABSTRACT
Large language models have exhibited exceptional performance on various Natural Language Processing (NLP) tasks, leveraging techniques such as pre-training and instruction fine-tuning. Despite these advances, their effectiveness in medical applications is limited, due to challenges such as factual inaccuracies, limited reasoning abilities, and a lack of grounding in real-world experience. In this study, we present ClinicalGPT, a language model explicitly designed and optimized for clinical scenarios. By incorporating extensive and diverse real-world data, such as medical records, domain-specific knowledge, and multi-round dialogue consultations in the training process, ClinicalGPT is better prepared to handle multiple clinical tasks. Furthermore, we introduce a comprehensive evaluation framework that includes medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Our results demonstrate that ClinicalGPT significantly outperforms other models in these tasks, highlighting the effectiveness of our approach in adapting large language models to the critical domain of healthcare.
Keywords deep learning $\cdot$ large language model $\cdot$ medical knowledge $\cdot$ electronic medical record $\cdot$ text generation
1 Introduction
In recent years, the paradigm of pre-training and fine-tuning large language models has brought about significant advancements in the Natural Language Processing (NLP) domain. The earliest approaches, like BERT [1], utilized optimized objectives like Masked Language Modeling (MLM) to pre-train on large text corpora such as BookCorpus [2] in an unsupervised manner to learn good representations. These representations can be fine-tuned and adapted to one or more specific downstream tasks to improve their performance. Further research aims to develop competent generalists, i.e., generalized systems that can perform multiple NLP tasks without the need for a manually labeled training dataset for each task. For instance, T5 [3] treats multiple NLP tasks as text-to-text transformation tasks and leverages an encoder-decoder architecture, achieving promising results on tasks such as text classification, question answering, and summarization, though with a larger number of parameters. In contrast, GPT-3 [4] uses a large auto-regressive model for few-shot predictions, improving performance without parameter fine-tuning by incorporating few-shot demonstrations through text interaction with the model. PaLM [5] is a Transformer-based, Pathways-enabled large-scale language model. Compared to other models, PaLM is more resource-efficient in terms of computation and achieves state-of-the-art few-shot results across hundreds of natural language, code, and mathematical reasoning tasks.
With their substantial generalization capabilities in NLP tasks, large pre-trained models are increasingly utilized for various tasks and facilitate human interaction through dialogue models. LaMDA [6], a transformer-based model designed for dialogues, leverages annotated data and external knowledge to augment its helpfulness and role consistency. InstructGPT [7] aligns with user intent across various tasks through fine-tuning and reinforcement learning with human feedback, resulting in improved truthfulness and reduced toxicity in output generation. ChatGPT can simulate human interaction, write abstracts, or create movie scripts in response to prompts, driving the AI revolution. Large language models are also effective for writing assistance and generating efficient code for programmers.
As we know, medicine and health care still face many challenges, including aging populations, lack of equitable access, rising costs, doctor and nurse burnout, and global pandemics. Information technology has the potential to transform modern medicine by offering new tools and insights for healthcare, with ChatGPT and GPT-4 promising to revolutionize clinical decision support, clinical trial recruitment, clinical data management, research support, and patient education [8, 9]. Google researchers developed Flan-PaLM, an instruction-tuned variant of PaLM, showing improved task performance via natural language instructions. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on the MultiMedQA multiple-choice datasets, but remains outperformed by clinicians. A recent perspective suggests generalist medical AI (GMAI) built on foundation models may disrupt task-specific paradigms, enabling versatile applications like interactive note-taking, bedside decision support, and patient chatbots [10]. However, there are considerable challenges to overcome in applying generative language models to the medical field. The output of generative language models may have factual errors, logic inconsistencies, and problems with coherence, such as citing article references that do not exist [11]. The models have limited reasoning abilities and lack grounding in real-world experience, leading to general and vague responses. ChatGPT has been found lacking in depth and insight [4], likely due to its alignment model used for reward-based training, which produces overly generalized answers that lack medical expertise. This evidence implies that employing these technologies in the medical field brings unique hurdles, such as the necessity for high accuracy, interpretability, and secure handling of sensitive health data.
In this study, we present ClinicalGPT, a large language model that is specifically designed for tasks across medical applications. To train the model, we leverage extensive and diverse datasets consisting of real-world medical records, allowing us to transfer domain-specific knowledge to the model. In addition, we establish a comprehensive evaluation framework that includes medical knowledge question-answering, medical examinations, patient consultations, and medical record analysis. By utilizing parameter-efficient fine-tuning methods, we were able to further improve the performance of ClinicalGPT. The results demonstrate that ClinicalGPT outperforms existing models in terms of performance, thus confirming the effectiveness of our approach.

Figure 1: Overview of ClinicalGPT.
2 Methods
2.1 Dataset
In this study, we incorporated large and diverse medical datasets, including cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, and MedDialog, for the training and evaluation of our model.
The cMedQA2 dataset [12] is a Chinese medical question-and-answer dataset that consists of 120k questions and 226k answers. The data is aggregated from a Chinese medical question-and-answer online forum. For training purposes, we followed the original dataset partition as proposed by the authors, and then we randomly selected one answer per question. We annotated 10k questions from the training set for training reward models and used 4k questions from the validation set for reinforcement learning. We sampled questions from the testing set for evaluation.
The cMedQA-KG is a medical question-answer dataset curated from knowledge graphs. It is built on three knowledge graphs: cMeKG, xywy-KG, and 39Health-KG. These knowledge graphs cover comprehensive medical entities such as diseases, medications, and symptoms, and their relationships. Detailed descriptions of the knowledge graphs can be found in Appendix A. We designed templates (see Appendix B) to transform each knowledge triplet into fine-tuning instruction data, i.e., text-to-text pairs for text generation, yielding 100k question-answer pairs. cMedQA-KG is used exclusively for training purposes.
The MEDQA-MCMLE dataset is a subset of the original MEDQA dataset [13], consisting of Chinese medical examination questions in a multiple-choice format. It includes 34k questions, each offering multiple choices, typically 4 or 5. We have followed the original authors' division of the dataset into training, validation, and testing sets. As this dataset is derived from professional medical board examinations, it effectively evaluates applied knowledge, clinical reasoning, and patient-centric skills.
The MedDialog dataset [14] is a collection of multi-turn medical conversations obtained from an online platform. MedDialog comprises 1.1 million dialogues and 4 million utterances. Due to the large volume of data, we randomly sampled 100k, 1k, and 1k dialogues for the training, validation, and testing sets, respectively. These multi-turn dialogues closely resemble real interactions between doctors and patients, aiding the model in understanding the process of clinical inquiry and decision-making.
The MD-EHR dataset is comprised of electronic health records from multi-center, large-scale hospitals in China. This dataset contains 100k records covering a range of disease groups, including Respiratory, Digestive, Urinary, Psychiatry, Neurology, Gynecology, and Hematology.
Each record within the MD-EHR dataset provides a comprehensive overview of the patient’s complaints, medical history, findings from physical examinations, ancillary test results, and the final diagnosis. We have divided the dataset into three sets: 2,000 records for the validation set, 2,000 records for the testing set, and the remaining entries for the training set. Following T5[3], we transformed the medical records into a text generation task by concatenating the notes from the records as input and using the diagnosis as the output.
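As an illustration of this record-to-text transformation, the sketch below flattens one record into an input/output pair; the field names are hypothetical stand-ins, not the actual MD-EHR schema.

```python
def ehr_to_text_pair(record):
    """Flatten one EHR record into a text-to-text training pair, T5-style:
    concatenated notes as input, diagnosis as output. The field names are
    illustrative, not the actual MD-EHR schema."""
    sections = ["complaint", "history", "physical_exam", "ancillary_tests"]
    source = " ".join(f"{name}: {record[name]}" for name in sections if name in record)
    return source, record["diagnosis"]

record = {
    "complaint": "Abdominal pain for 1 day.",
    "history": "Vomiting, previous fever.",
    "diagnosis": "Appendicitis",
}
src, tgt = ehr_to_text_pair(record)
```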
2.2 Fine-tuning
We adopt the T5 model's [3] strategy of utilizing text generation grounded in language models to complete all tasks in our study. Language models, pre-trained on extensive corpora, have demonstrated a remarkable ability to understand and generate human-like text [4]. These models calculate the probability of a sequence of words in a text, $T=(w_{1},w_{2},...,w_{L})$. Specifically, the causal language model calculates the probability of the text $T$ as $p(T)=p(w_{1})p(w_{2}|w_{1})...p(w_{L}|w_{1},w_{2},...,w_{L-1})$, where $L$ represents the length of the text. Several large language models, such as BLOOM, GLM, and others, are available for public use.
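The chain-rule factorization above can be made concrete with a toy sketch, in which a uniform conditional distribution stands in for a trained network:

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Chain-rule log-probability of a token sequence:
    log p(T) = sum_i log p(w_i | w_1, ..., w_{i-1})."""
    total = 0.0
    for i, w in enumerate(tokens):
        total += math.log(cond_prob(tokens[:i], w))
    return total

# Toy conditional model: uniform over a 4-word vocabulary.
uniform = lambda history, w: 1.0 / 4

lp = sequence_log_prob(["the", "patient", "has", "fever"], uniform)  # 4 * log(1/4)
```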
To enhance the utility of large models for downstream tasks, we apply an instruction-tuning approach with supervised fine-tuning (SFT). The language model $p_{\theta}$ is trained to generate a response $R=v_{1:n}$ for a given input prompt $I=w_{1:m}$, optimizing the likelihood $p_{\theta}(R|I)=p_{\theta}(v_{1:n}|w_{1:m})$, where $n$ and $m$ represent the lengths of the response and input prompt, respectively. Thus, the loss function is $\frac{1}{n}\sum_{i=m+1}^{m+n}-\log p_{\theta}(w_{i}|w_{1},...,w_{i-1})$, where $w_{1:m+n}$ denotes the concatenation of the prompt and response tokens.
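A minimal sketch of this SFT objective, assuming per-token log-probabilities are already available and prompt tokens are excluded from the loss:

```python
import math

def sft_loss(token_log_probs, prompt_len):
    """Supervised fine-tuning loss: mean negative log-likelihood over the n
    response tokens only; the m prompt tokens are excluded from the loss,
    i.e. (1/n) * sum_{i=m+1}^{m+n} -log p_theta(w_i | w_1, ..., w_{i-1})."""
    response_lps = token_log_probs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Toy example: m = 2 prompt tokens, n = 3 response tokens.
lps = [math.log(0.5)] * 2 + [math.log(0.25)] * 3
loss = sft_loss(lps, prompt_len=2)  # equals -log(0.25)
```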
To incorporate domain-specific knowledge into LLMs, we turn to knowledge graphs (KGs) specific to the domain for constructing prompt-response pairs. KGs capture knowledge in the form of structured triples $(s,r,o)$ , where $s$ denotes the subject, $r$ the relationship, and $o$ the object. An example of such a triple could be (Cough, SymptomOf, Pneumonia). We leverage a set of manually designed templates to transform these triples into question-answer pairs, rendering them suitable for instruction tuning. The manually designed templates can be found in Appendix B.
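A sketch of this triple-to-instruction transformation; the template strings here are illustrative stand-ins, since the templates actually used are listed in Appendix B:

```python
# Hypothetical templates keyed by relation type; the templates actually used
# are given in Appendix B, so these strings are illustrative only.
TEMPLATES = {
    "SymptomOf": ("What disease can {s} be a symptom of?",
                  "{s} can be a symptom of {o}."),
}

def triple_to_qa(s, r, o):
    """Turn a (subject, relation, object) triple into a question-answer pair
    suitable for instruction tuning."""
    q_tpl, a_tpl = TEMPLATES[r]
    return q_tpl.format(s=s, o=o), a_tpl.format(s=s, o=o)

q, a = triple_to_qa("Cough", "SymptomOf", "Pneumonia")
```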
2.3 Reward model
Existing works have demonstrated that reinforcement learning can incorporate human feedback to enhance large language models. For instance, WebGPT [15] is a browser-assisted question-answering system that utilizes human feedback for performance improvement. InstructGPT [7] also aligns with human feedback via reinforcement learning to generate helpful and safe responses.
We follow the work of [7], constructing a reward model (RM) $r_{\mu}$ to furnish the reward signal crucial for the reinforcement learning process. We employ rank-based training for the RM. Human labelers rank responses for a given input prompt $I$, generating a comparison pair for each prompt. For a comparison pair with a human-preferred response $R_{w}$ and a less preferred response $R_{l}$, the loss is given by $-\log(\sigma(r_{\mu}(I,R_{w})-r_{\mu}(I,R_{l})))$.
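The pairwise ranking loss can be sketched directly from the formula, with scalar rewards standing in for the reward model's outputs:

```python
import math

def rm_pairwise_loss(reward_preferred, reward_rejected):
    """Rank-based reward-model loss: -log(sigmoid(r(I, R_w) - r(I, R_l)))."""
    diff = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the preferred response's reward pulls further ahead.
loss_close = rm_pairwise_loss(1.0, 0.9)
loss_clear = rm_pairwise_loss(3.0, 0.0)
```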
2.4 Reinforcement learning
We adopt the method proposed by Stiennon et al. [16], leveraging reinforcement learning to enhance the fine-tuned models with the objective of generating high-quality and helpful outputs, as well as improving the generation of medical texts, thereby aiding in the accurate description and treatment of patient conditions.
We utilize the trained reward model as the reward function. In order to prevent the model from deviating too far from its initial state, we employ Proximal Policy Optimization (PPO) as our optimization strategy. Specifically, we incorporate a penalty term in the reward function that penalizes the KL divergence between the learned reinforcement learning policy, denoted as $\pi_{\phi}^{RL}$, and the original supervised model, $\pi^{SFT}$. This is to ensure that the final model does not deviate excessively from the original supervised model. The complete reward function is defined as follows: $R(x,y)=r_{\mu}(x,y)-\beta\log(\pi_{\phi}^{RL}(y|x)/\pi^{SFT}(y|x))$, where $r_{\mu}(x,y)$ represents the output of the reward model and $\beta$ is the coefficient for KL divergence in the reward function. The loss function used in PPO optimization is given by: $L=r_{\mu}\hat{A}_{t}-\beta\,\mathrm{KL}[\pi_{\phi_{old}},\pi_{\phi}]$, where $r_{\mu}$ is the reward function, $\hat{A}_{t}$ is an estimator of the advantage function, $\phi_{old}$ represents the parameters of the policy at the previous step, and $\pi_{\phi}$ is the current policy.
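A sketch of the KL-penalized reward for a single sample, assuming the (summed) log-probabilities of the sampled response under both policies are available:

```python
def rl_reward(rm_score, logp_rl, logp_sft, beta=0.1):
    """KL-penalized reward for one sampled response y given prompt x:
    R(x, y) = r_mu(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)),
    using summed log-probabilities of y under each policy."""
    return rm_score - beta * (logp_rl - logp_sft)

# When the RL policy matches the SFT model exactly, the penalty vanishes.
r_match = rl_reward(rm_score=2.0, logp_rl=-5.0, logp_sft=-5.0)  # -> 2.0
# Probability mass shifted away from the SFT model is penalized.
r_drift = rl_reward(rm_score=2.0, logp_rl=-3.0, logp_sft=-5.0)
```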
3 Experiments and results
3.1 Implementation details
We chose BLOOM-7B [17] as our base large language model, due to its open-source nature and multilingual support. For the supervised fine-tuning process, we set the learning rate to 5e-5, with a batch size of 128 and a maximum length of 1,024, training across 3 epochs. During the training of the reward model, we utilized the last feature vector of the final output sequence as the text representation. Based on the fine-tuned model, we added a binary classification head to output the reward. We set the learning rate to 2e-5, with a batch size of 128, a maximum length of 1,024, and training over 3 epochs. For the reinforcement learning process, we applied a learning rate of 1e-5 and a maximum length of 1,024, training for 4,000 steps. To efficiently train the large language model, we adopted LoRA (Low-Rank Adaptation) [18], a parameter-efficient fine-tuning method, with r of 8, alpha of 32, and dropout of 0.1. To decrease memory usage and improve training speed, we employed ZeRO-2 [19] and made use of both TF32 (TensorFloat-32) and BF16 (bfloat16). We selected several instruction fine-tuned models for comparison, including ChatGLM-6B [20], LLAMA-7B [21] (fine-tuned on English and Chinese data), and BLOOM-7B [22] (fine-tuned on cross-lingual tasks).
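To illustrate why LoRA is parameter-efficient at the stated configuration (r = 8, alpha = 32), the sketch below computes the low-rank weight update in plain Python; it is a didactic stand-in, not the actual training code:

```python
import random

def lora_delta(B, A, r, alpha):
    """LoRA weight update: delta_W = (alpha / r) * B @ A, with B of shape
    d x r and A of shape r x k, so only r * (d + k) parameters are trained
    instead of the full d * k."""
    d, k = len(B), len(A[0])
    scale = alpha / r
    return [[scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]

# Shapes echoing the configuration above: r = 8, alpha = 32 (scale = 4.0).
d, k, r, alpha = 16, 16, 8, 32
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # B starts at zero, so the initial delta is zero
delta = lora_delta(B, A, r, alpha)
```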
3.2 Medical conversation
We conducted performance evaluation of medical conversation on the test set of MedDialog. To address the challenge of multiple rounds of conversation within each medical dialogue, we randomly truncated the dialogue at a certain round, discarding the subsequent dialogue, and using the historical dialogue prior to this round as input. The sample response is shown in Table 1. We used three evaluation metrics: BLEU [23], ROUGE [24], and GLEU, to assess the quality of the conversations. BLEU is a commonly used metric that compares a candidate translation with one or more reference translations based on n-gram precision. GLEU calculates the average score of different n-grams, providing a more comprehensive evaluation of the generated text. ROUGE, on the other hand, is a particularly useful metric for evaluating automatic summarization and machine translation, as it focuses on the recall aspect of generated summaries by comparing them with references.
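The modified n-gram precision underlying BLEU (and averaged across orders by GLEU) can be sketched as:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision used by BLEU: clipped overlap between
    candidate and reference n-grams, divided by the candidate n-gram count."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

cand = "the patient has a mild fever".split()
ref = "the patient has a fever".split()
p1 = ngram_precision(cand, ref, 1)  # 5 of 6 unigrams match -> 5/6
p2 = ngram_precision(cand, ref, 2)  # 3 of 5 bigrams match -> 3/5
```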
The experimental results are presented in Table 2. They demonstrate that ClinicalGPT achieves outstanding performance on BLEU-1 and all ROUGE scores. ClinicalGPT comes second only to BLOOM-7B in terms of BLEU-2, BLEU-3, and BLEU-4. The superior ROUGE scores achieved by ClinicalGPT indicate that the responses generated by the model cover the information provided by the reference text more effectively.
3.3 Medical examination
In this study, the medical examination assessment using the MEDQA-MCMLE dataset was evaluated on the highest-frequency categories in the dataset. The selected categories included Medical ethics, Respiratory system, Digestive system, Urinary system, Hematologic diseases, Rheumatic immune diseases, Pediatric diseases,
Description of medical conditions and history
Table 1: A sample response in medical conversation.
| 疾病:疑似皮肤paget病 |
|---|
| (Disease: Suspected Paget's disease of the skin) |
| 患病时长:大于半年 |
| (Duration of illness: more than six months.) |
| 病情描述:一直按湿疹来医已经好几年了 |
| (Disease description: Has been treated as eczema for several years.) |
| 希望获得的帮助:请医生给我一些治疗上的建议,目前病情是否需要手术?是否需要就诊?就诊前做哪些准备?请问三院有收过这种病的吗?怎么医治? |
| (Help needed: Can the doctor provide me with some treatment suggestions? Does the current condition require surgery? Do I need to seek medical attention? What preparations should be made before seeking medical treatment? May I ask if the Third Hospital has ever received this disease? How to treat it?) |
| Dialogue Assistant: 根据您的病情,确实需要手术切除病灶,不过具体手术情况可能需要现场判断,建议您前来就诊后由专业医生为您制定治疗方案。 |
| (Assistant: Based on your condition, it is indeed necessary to undergo surgery to remove the lesion, but the specific surgical situation may require on-site judgment. It is recommended that you come for treatment and have a professional doctor develop a treatment plan for you.) |
| User: 那就诊前需要准备些什么呢? |
| (User: What do you need to prepare before seeing a doctor?) |
| Assistant: 前往就诊前,您需要做好皮肤检查,包括观察皮肤颜色、质地、皮疹大小等。此外,您也需要带上病历资料、医生开的处方等,以便医生更好地为您诊断和治疗。 |
| (Assistant: Before going to the clinic, you need to undergo a skin examination, including observing skin color, texture, rash size, etc. In addition, you also need to bring medical records, doctor's prescriptions, etc., so that the doctor can better diagnose and treat you.) |
Table 2: Comparisons on medical conversation.
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | GLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|---|---|
| LLAMA-7B | 10.8 | 2.9 | 1.5 | 0.9 | 0.6 | 22.4 | 5.1 | 17.3 |
| ChatGLM-6B | 6.6 | 1.6 | 0.9 | 0.5 | 0.3 | 23.6 | 5.0 | 16.2 |
| BLOOM-7B | 12.2 | 4.4 | 2.9 | 2.2 | 2.4 | 11.0 | 1.6 | 8.6 |
| Ours | 13.9 | 3.7 | 2.0 | 1.2 | 0.9 | 27.9 | 6.5 | 21.3 |
and Pharmacology. The models were fed questions and options as input, and the generated text was subsequently used to extract answers to compute accuracy. The sample response is shown in Table 3.
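Answer extraction from free-form generations can be sketched with a simple heuristic; the exact rule used here is not specified in the text, so this regex is an assumption:

```python
import re

def extract_choice(generated, options="ABCDE"):
    """Pull the chosen option letter out of a free-form generation, taking the
    last mention of '选项X' or 'Option X'. The exact extraction rule is not
    specified, so this heuristic is an illustrative assumption."""
    hits = re.findall(r"(?:选项|[Oo]ption\s*)([%s])" % options, generated)
    return hits[-1] if hits else None

pred = extract_choice("因此选项E不符合SLE的血液系统改变。")  # -> 'E'
```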
| Prompt: |
|---|
| 下述哪一项不符合SLE的血液系统改变? |
| (Which of the following blood system changes does not comply with SLE?) |
| 回答选项: |
| (Answer options:) |
| 选项A 血小板减少 |
| (Option A: Thrombocytopenia) |
| 选项B 白细胞减少 |
| (Option B: Leukopenia) |
| 选项C 自身免疫性溶血性贫血 |
| (Option C: Autoimmune hemolytic anemia) |
| 选项D 正色素细胞性贫血 |
| (Option D: Normochromic anemia) |
| 选项E 类白血病样改变 |
| (Option E: Leukemia-like changes) |

| Response: |
|---|
| SLE是一种自身免疫性疾病,其血液系统改变包括血小板减少、自身免疫性溶血性贫血、正色素细胞性贫血等。而类白血病样改变是指骨髓中出现大量幼稚细胞,与SLE无关,因此选项E不符合SLE的血液系统改变。 |
| (SLE is an autoimmune disease, whose blood system changes include thrombocytopenia, autoimmune hemolytic anemia, normochromic anemia, etc. Leukemia-like changes refer to the presence of a large number of immature cells in the bone marrow, which is not related to SLE. Therefore, option E does not match the hematological changes of SLE.) |
Table 3: A sample response in medical examination.
The experimental results, as shown in Table 4, reveal that ClinicalGPT outperformed other LLMs such as LLAMA-7B, ChatGLM-6B, and BLOOM-7B in all evaluated categories, boasting an average accuracy of 38.4. Specifically, ClinicalGPT achieved strong performance, exceeding the average scores of ChatGLM-6B, BLOOM-7B, and LLAMA-7B, which were 19.9, 25.7, and 27.2, respectively. Among all categories, ClinicalGPT achieved its best score in Rheumatic immune diseases with an accuracy of 47.4. Conversely, it underperformed in Respiratory and Digestive diseases, with accuracies of 26.1 and 36.9, respectively. These findings suggest that while ClinicalGPT excels in understanding and generating responses related to the rheumatic immune system, further refinement is required to improve its performance in Respiratory and Digestive diseases.
Table 4: Comparisons on medical examination.
| Model | Respiratory | Urinary | Digestive | Rheumatic immune | Average |
|---|---|---|---|---|---|
| ChatGLM-6B | 24.6 | 24.4 | 20.0 | 10.5 | 19.9 |
| LLAMA-7B | 20.3 | 35.6 | 21.2 | 31.6 | 27.2 |
| BLOOM-7B | 15.9 | 31.1 | 29.4 | 26.3 | 25.7 |
| ClinicalGPT | 26.1 | 40.0 | 36.9 | 47.4 | 37.6 |
3.4 Diagnosis
The diagnostic capabilities of large language models (LLMs) were evaluated on the testing set of MD-EHR. Disease groups were selected for evaluation, including Respiratory, Digestive, Urinary, Psychiatry, Neurology, Gynecology, and Hematology. The models were provided with concatenated notes from each medical record as input and generated text as output. The accuracy of the models was calculated by comparing the generated text with the diagnosis labels in the medical records. The sample response is shown in Table 5.
Table 5: A sample response in diagnosis for an electronic medical record.
| Prompt: |
|---|
| 主诉:腹痛1天。 (Complaint:Abdominal painfor 1 day.) |
| 现病史和体格检查:腹痛,无明显规律,有呕吐,曾有发热、现无发热,无腹胀及异常哭闹等不适。尿量可,精神尚可,食纳一般。R20次/分,P110次/分,神志清,精神反应好,呼吸平稳,口唇红润,咽无充血,双肺呼吸音粗,未闻及罗音,心音有力,律齐,各瓣膜区未闻及杂音,腹部软,全腹压痛,肝脾未及,神经系统查体未见异常。 |
| (Current medical history and physical examination: Abdominal pain, no obvious pattern, vomiting, previous or current fever, no discomfort such as abdominal distension or abnormal crying. Urinary volume is acceptable, mental state is acceptable, and appetite is average. R 20 times/minute, P 110 times/minute, clear mind, good mental response, stable breathing, red lips, no congestion in the throat, thick respiratory sounds in both lungs, no rale heard, strong heart sounds, regular rhythm, no murmur heard in each valve area, soft abdomen, full abdominal tenderness, liver and spleen not reached, and no abnormalities were found in the nervous system examination.) |
| 检验检查:外院查尿淀粉酶高,流感+、腹平片:肠梗阻。影像显示肝脏切面形态大小正常,包膜光整,实质回声均匀。 |
| (Examination: high urine Amylase in the external hospital, influenza+, abdominal plain film: Bowel obstruction. The image shows that the shape and size of the liver section are normal, the capsule is smooth, and the parenchymal echo is uniform. No obvious abnormal echo is detected.) |
| 回复: |
|---|
| 您好,根据您的症状和检查结果,您被诊断为阑尾炎。阑尾炎是一种常见的急性腹痛疾病,通常需要进行手术治疗。建议您尽快进行手术治疗,以避免病情恶化。 |
| (Hello, according to your symptoms and examination results, you are diagnosed with appendicitis. Appendicitis is a common acute abdominal pain disease that usually requires surgical treatment. It is recommended that you undergo surgical treatment as soon as possible to avoid worsening the condition.) |
| 在手术前,我们会对您进行一些必要的准备工作,包括禁食和饮水,以及进行血液检查等。手术后,我们会您进行密切观察和护理,以确保您的恢复情况良好。 |
The experimental results for each disease group are presented in Table 6. ClinicalGPT outperformed the other language models, ChatGLM-6B, LLAMA-7B, and BLOOM-7B, across all disease groups. Its average accuracy across all disease groups was $80.9\%$, markedly higher than the $40.9\%$ of ChatGLM-6B, $36.6\%$ of LLAMA-7B, and $60.3\%$ of BLOOM-7B. ClinicalGPT performed particularly strongly in the Digestive and Urinary departments, achieving accuracies of $90.1\%$ and $89.9\%$, respectively, indicating a robust capability for understanding and interpreting medical records across different disease groups. ClinicalGPT exhibited slightly lower, yet still strong, performance in the Gynecology and Hematology departments, with accuracies of $78.6\%$ and $80.7\%$ respectively, suggesting room for improvement in these two specialties, although the model performed well overall across a range of medical specialties.
表6展示了各疾病组的实验结果。ClinicalGPT在所有疾病组中均优于ChatGLM-6B、LLAMA-7B和BLOOM-7B等其他语言模型。其整体平均准确率达$80.9\%$,显著高于ChatGLM-6B的$40.9\%$、LLAMA-7B的$36.6\%$和BLOOM-7B的$60.3\%$。该模型在消化科($90.1\%$)和泌尿科($89.9\%$)表现尤为突出,展现出强大的跨病种病历理解能力。尽管在妇科($78.6\%$)和血液科($80.7\%$)稍逊,但仍保持较高水平,表明这两个专科领域尚有优化空间。总体而言,ClinicalGPT在多个医学专科均展现出卓越性能。
| | Respiratory | Digestive | Urinary | Psychiatry | Neurology | Gynecology | Hematology | Average |
|---|---|---|---|---|---|---|---|---|
| ChatGLM-6B | 22.3 | 49.7 | 55.0 | 38.7 | 39.3 | 39.8 | 41.6 | 40.9 |
| LLAMA-7B | 24.2 | 43.7 | 40.9 | 34.9 | 32.8 | 40.8 | 39.2 | 36.6 |
| BLOOM-7B | 36.9 | 73.9 | 71.7 | 59.1 | 57.7 | 56.8 | 65.7 | 60.3 |
| Ours | 64.3 | 90.1 | 89.9 | 79.2 | 83.6 | 78.6 | 80.7 | 80.9 |
Table 6: Comparisons on diagnosis.
表 6: 诊断结果对比
| | 呼吸系统 | 消化系统 | 泌尿系统 | 精神科 | 神经科 | 妇科 | 血液科 | 平均分 |
|---|---|---|---|---|---|---|---|---|
| ChatGLM-6B | 22.3 | 49.7 | 55.0 | 38.7 | 39.3 | 39.8 | 41.6 | 40.9 |
| LLAMA-7B | 24.2 | 43.7 | 40.9 | 34.9 | 32.8 | 40.8 | 39.2 | 36.6 |
| BLOOM-7B | 36.9 | 73.9 | 71.7 | 59.1 | 57.7 | 56.8 | 65.7 | 60.3 |
| Ours | 64.3 | 90.1 | 89.9 | 79.2 | 83.6 | 78.6 | 80.7 | 80.9 |
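The per-department scores and the "Average" column of Table 6 are mutually consistent; a quick sanity check in plain Python, with the values copied directly from the table:

```python
# Per-department diagnosis accuracies (%) from Table 6, in the order
# Respiratory, Digestive, Urinary, Psychiatry, Neurology, Gynecology, Hematology.
scores = {
    "ChatGLM-6B": [22.3, 49.7, 55.0, 38.7, 39.3, 39.8, 41.6],
    "LLAMA-7B":   [24.2, 43.7, 40.9, 34.9, 32.8, 40.8, 39.2],
    "BLOOM-7B":   [36.9, 73.9, 71.7, 59.1, 57.7, 56.8, 65.7],
    "Ours":       [64.3, 90.1, 89.9, 79.2, 83.6, 78.6, 80.7],
}

# Unweighted mean over the seven departments, rounded to one decimal.
averages = {model: round(sum(v) / len(v), 1) for model, v in scores.items()}
print(averages)  # matches the "Average" column: 40.9, 36.6, 60.3, 80.9
```

Note that this reproduces the reported averages only if each department is weighted equally, which the table's "Average" column appears to assume.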
3.5 Medical question answering
3.5 医疗问答
For the medical question-answering (QA) assessment, our model was benchmarked against several other models on a dataset of 388 questions sampled from cMedQA2. Automated evaluation metrics were used, with GPT-4 serving as the reference model. Given each question, every model generated an answer independently; GPT-4 was then used to assess these responses based on their accuracy, helpfulness, and safety, assigning a judgment of Win, Tie, or Lose for each comparison. A "Win" indicates that ClinicalGPT provided the superior response, a "Lose" indicates that the competing model offered the better response, and a "Tie" means that no obvious difference between the responses was observed.
在医疗问答(QA)评估中,我们使用从cMedQA2采样的388个问题数据集,将模型与其他多个模型进行了基准测试。采用自动化评估指标,并以GPT-4作为参考模型。给定问题后,各模型独立生成答案,随后GPT-4根据准确性、实用性和安全性对这些回答进行评估。GPT-4为每次比较判定"胜出"、"平局"或"落败":其中"胜出"表示Clinical GPT提供了更优回答,"落败"表示竞争模型表现更好,"平局"则表明未观察到明显差异。
| | Win | Tie | Lose |
|---|---|---|---|
| Ours v.s. BLOOM-7B | 89.7% | 1.8% | 8.5% |
| Ours v.s. LLAMA-7B | 85.0% | 2.3% | 12.7% |
| Ours v.s. ChatGLM-6B | 67.2% | 10.9% | 22.0% |
Table 7: Medical question-answering on automatic evaluation.
| | 胜率 | 平局 | 负率 |
|---|---|---|---|
| 我们的模型 vs BLOOM-7B | 89.7% | 1.8% | 8.5% |
| 我们的模型 vs LLAMA-7B | 85.0% | 2.3% | 12.7% |
| 我们的模型 vs ChatGLM-6B | 67.2% | 10.9% | 22.0% |
表 7: 自动评估中的医疗问答表现。
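The percentages in Table 7 are simple tallies of per-question verdicts. A minimal sketch of the aggregation, assuming each GPT-4 judgment is recorded as a string; the verdict list below is an illustrative reconstruction from the reported percentages, not the actual GPT-4 outputs:

```python
from collections import Counter

def summarize(verdicts):
    """Turn a list of per-question verdicts ('win'/'tie'/'lose', from our
    model's perspective) into Win/Tie/Lose percentages as in Table 7."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {k: round(100 * counts[k] / n, 1) for k in ("win", "tie", "lose")}

# 388 questions; a 348/7/33 split reproduces the reported 89.7% / 1.8% / 8.5%
# for "Ours v.s. BLOOM-7B" (illustrative reconstruction only).
demo = ["win"] * 348 + ["tie"] * 7 + ["lose"] * 33
print(summarize(demo))  # {'win': 89.7, 'tie': 1.8, 'lose': 8.5}
```

Because each row is rounded independently, a row may not sum to exactly 100% (e.g. the ChatGLM-6B row totals 100.1%).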
The results of the medical question-answering evaluation are presented in Table 7. According to the results, ClinicalGPT outperformed BLOOM-7B, LLAMA-7B, and ChatGLM-6B. In comparisons against BLOOM-7B and LLAMA-7B, our model won in $89.7\%$ and $85.0\%$ of the cases, respectively; the percentage of ties was relatively small, at $1.8\%$ against BLOOM-7B and $2.3\%$ against LLAMA-7B. Against ChatGLM-6B, ClinicalGPT won in $67.2\%$ of the cases, with the tie rate rising to $10.9\%$ and the loss rate to $22.0\%$. This suggests that while ChatGLM-6B has a commendable repository of medical knowledge and fluent textual expression, the training applied to ClinicalGPT is beneficial for augmenting medical question-answering capabilities, despite the extensive knowledge reserves of larger models.
医学问答评估结果如表 7 所示。根据结果,ClinicalGPT 在性能上超越了 BLOOM-7B、LLAMA-7B 和 ChatGLM-6B。与 BLOOM-7B 和 LLAMA-7B 相比,我们的模型分别以 $89.7\%$ 和 $85.0\%$ 的胜率领先。平局案例占比较小,对抗 BLOOM-7B 时为 $1.8\%$,对抗 LLAMA-7B 时为 $2.3\%$。同时,ClinicalGPT 对抗 ChatGLM-6B 的胜率为 $67.2\%$,平局率升至 $10.9\%$,败率则为 $22.0\%$。这一表现表明,尽管 ChatGLM-6B 拥有可观的医学知识库且文本表达流畅,但通过 ClinicalGPT 的训练方式仍能有效增强医学问答能力,即使面对知识储备丰富的更大规模模型。
4 Conclusion
4 结论
In this study, we introduced ClinicalGPT, a large language model tailored for medical and clinical applications. Recognizing the limitations that generic large language models present in these specialized fields, we took steps to refine the model, assembling comprehensive datasets for its fine-tuning. These datasets incorporate real medical records, patient consultations, diverse medical knowledge, and exam data, all aimed at shaping the model's knowledge base and responsiveness. Our extensive experiments cover a range of critical tasks in the medical field, such as medical conversation, medical examination, diagnosis, and medical question answering. The empirical results highlight the superior capabilities of ClinicalGPT in understanding and generating medical and clinical-related responses.
本研究介绍了ClinicalGPT,一款专为医疗和临床应用定制的大语言模型。针对通用大语言模型在这些专业领域的局限性,我们通过整合真实病历、患者咨询记录、多样化医学知识及考试数据等综合数据集进行微调,以优化模型的知识库与响应能力。实验覆盖医疗对话、医学检查、诊断及问答等关键任务,实证结果表明ClinicalGPT在理解和生成医疗临床相关响应方面具有卓越性能。
Acknowledgments
致谢
Parts of the experiments were conducted on the InforSuperBahn Testbed. The authors appreciate the Nanjing Institute of InforSuperBahn for providing the test and evaluation platform.
部分实验在InforSuperBahn测试平台进行。作者感谢南京InforSuperBahn研究所提供的测试与评估平台。
References
参考文献
[24] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
[24] Chin-Yew Lin. ROUGE: 自动摘要评估工具包. 见 Text Summarization Branches Out, 第74–81页, 2004.
Appendix
附录
A Medical knowledge graphs
A 医学知识图谱
The CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph created through human-AI collaboration, using natural language processing and text mining techniques. It is built upon international standards such as ICD, ATC, SNOMED, and MeSH, and integrates clinical guidelines, industry standards, and medical wiki websites as diverse sources. CMeKG contains 62k entities and 374k relationship triplets, representing nine types of medical entities and 23 different relationships among them. Entities include diseases (15,962), manifestations (12,271), body parts (17,706), equipment (900), procedures (6,418), microorganisms (1,934), medical departments (356), tests (2,605), and medications (3,935). Relationships cover diverse medical aspects, with the most prominent being common symptoms (94,657) and side effects (62,339).
CMeKG (Chinese Medical Knowledge Graph) 是一个通过人机协作构建的中文医学知识图谱,运用自然语言处理与文本挖掘技术。该图谱基于ICD、ATC、SNOMED、MeSH等国际标准,整合临床指南、行业标准及医学百科网站等多源数据。CMeKG包含62k个实体和374k组关系三元组,涵盖9类医学实体及其23种关联关系。实体类型包括:疾病(15,962)、临床表现(12,271)、解剖部位(17,706)、医疗设备(900)、诊疗操作(6,418)、微生物(1,934)、科室(356)、检查项目(2,605)和药物(3,935)。关系类型覆盖多维医学关联,其中最常见症状(94,657)和不良反应(62,339)占比最高。
The xywy-KG is a medical knowledge graph generated from data sourced from a Chinese online medical consultation website. Its entities are categorized into seven groups: diseases (11,013), manifestations (5,998), procedures (554), departments (54), examination items (3,353), medications (22,359), and foods (4,993). The relationships are sorted into nine types, most notably examinations (39,531) and recommended medications (59,467), comprising 44k entities and 294k relationships in total.
xywy-KG是一个利用中国在线医疗咨询网站数据生成的医学知识图谱[20]。该图谱包含7类实体:疾病(11,013种)、症状表现(5,998种)、治疗操作(554种)、科室(54类)、检查项目(3,353项)、药品(22,359种)和食物(4,993种);关系分为9种类型,其中最主要的是检查项目关联(39,531条)和推荐用药(59,467条),共计包含4.4万个实体和29.4万条关系。
The 39Health-KG is a medical knowledge graph built from data collected from 39-health, a website dedicated to health consultation and registration. This graph integrates seven types of medical entities and eight types of relationships among them, comprising 37k entities and 210k entity relationships. The entity types are diseases (14,337), body parts (82), departments (83), examination items (3,074), clinical manifestations (5,927), treatment methods (1,493), and medications (4,966). The relationships mainly revolve around related symptoms (48,757) and examination items (31,577).
39Health-KG是一个基于39健康网(一个专注于健康咨询和挂号的网站)收集数据构建的医疗知识图谱[7]。该图谱整合了七类医疗实体和八种实体间关系,包含3.7万个实体和21万条实体关系。实体类型包括疾病(14,337种)、身体部位(82个)、科室(83个)、检查项目(3,074项)、临床表现(5,927种)、治疗方法(1,493种)和药物(4,966种)。其中关系主要集中在相关症状(48,757条)和检查项目(31,577条)方面。
B Prompt templates
B 提示模板
| Prompt (Chinese) | Response (Chinese) |
|---|---|
| {疾病}和哪些疾病有关联? | {疾病}与{疾病}可能有关联。 |
| {疾病}可能与哪些其他疾病有关? | {疾病}可能与{疾病}有关联。 |
| {疾病}有哪些常见症状? | {疾病}的常见症状包括{临床表现}。 |
| 患有{疾病}的患者可能出现哪些症状? | {疾病}患者可能出现如{临床表现}等症状。 |
| {疾病}的典型{临床表现}是什么? | {疾病}的典型临床表现包括{临床表现}。 |
| 患有{疾病}的患者在临床上通常表现为哪些症状? | 患有{疾病}的患者在临床上通常表现为{临床表现}。 |
| 诊断{疾病}需要进行哪些检查? | 诊断{疾病}需要进行如{医学检验项目}等检查。 |
| 如何检查以确定患有{疾病}? | 确定患有{疾病}需要进行{医学检验项目}等检查。 |
| {药物}主要用于治疗哪些疾病? | {药物}主要用于治疗{疾病}等疾病。 |
| {药物}的适应症是什么? | {药物}的适应症包括{疾病}。 |
| 如何治疗{疾病}? | 治疗{疾病}的方法包括{医疗程序}。 |
| {疾病}的常见治疗方法有哪些? | {疾病}的常见治疗方法包括{医疗程序}。 |
| {疾病}会引起哪些并发症? | {疾病}会引起{疾病}等并发症。 |
| 患有{疾病}的患者可能出现哪些并发症? | 患有{疾病}的患者可能出现{疾病}等并发症。 |
| {药物}与哪些药物存在相互作用? | {药物}与{药物}存在相互作用。 |
| 使用{药物}时需要注意哪些药物相互作用? | 使用{药物}时需注意与{药物}的相互作用。 |
| {药物}主要用于治疗哪些症状? | {药物}主要用于治疗{临床表现}等症状。 |
| {药物}的主要治疗作用是什么? | {药物}的主要治疗作用为治疗{临床表现}。 |
| Prompt (English) | Response (English) |
|---|---|
| What diseases are related to {disease}? | {Disease} may be related to {disease}. |
| What other diseases may be associated with {disease}? | {Disease} may be associated with {disease}. |
| What are the common symptoms of {disease}? | The common symptoms of {disease} include {clinical manifestations}. |
| What symptoms might a patient with {disease} exhibit? | Patients with {disease} may exhibit symptoms such as {clinical manifestations}. |
| What are the typical {clinical manifestations} of {disease}? | The typical clinical manifestations of {disease} include {clinical manifestations}. |
| What symptoms do patients with {disease} typically present with in a clinical setting? | Patients with {disease} typically present with {clinical manifestations} in a clinical setting. |
| What tests are needed to diagnose {disease}? | Tests such as {medical examination items} are required to diagnose {disease}. |
| How can one check to confirm if they have {disease}? | To confirm if one has {disease}, tests such as {medical examination items} are required. |
| What diseases can {drug} primarily treat? | {Drug} is primarily used to treat diseases such as {disease}. |
| What are the indications of {drug}? | The indications of {drug} include {disease}. |
| How can {disease} be treated? | {Disease} can be treated with methods such as {medical procedures}. |
| What are the common treatment methods for {disease}? | The common treatment methods for {disease} include {medical procedures}. |
| What complications can {disease} cause? | {Disease} can cause complications such as {disease}. |
| What complications might a patient with {disease} develop? | A patient with {disease} might develop complications such as {disease}. |
| What drugs interact with {drug}? | {Drug} interacts with {drug}. |
| What drug interactions should be considered when using {drug}? | When using {drug}, interactions with {drug} should be considered. |
| What symptoms can {drug} primarily treat? | {Drug} is primarily used to treat symptoms such as {clinical manifestations}. |
| What is the main therapeutic action of {drug}? | The main therapeutic action of {drug} is to treat {clinical manifestations}. |
Table 8: Prompt templates.
| 提示语(中文) | 响应(中文) |
|---|---|
| {疾病}和哪些疾病有关联? | {疾病}与{疾病}可能有关联。 |
| {疾病}可能与哪些其他疾病有关? | {疾病}可能与{疾病}有关联。 |
| {疾病}有哪些常见症状? | {疾病}的常见症状包括{临床表现}。 |
| 患有{疾病}的患者可能出现哪些症状? | {疾病}患者可能出现如{临床表现}等症状。 |
| {疾病}的典型{临床表现}是什么? | {疾病}的典型临床表现包括{临床表现}。 |
| 患有{疾病}的患者在临床上通常表现为哪些症状? | 患有{疾病}的患者在临床上通常表现为{临床表现}。 |
| 诊断{疾病}需要进行哪些检查? | 诊断{疾病}需要进行如{医学检验项目}等检查。 |
| 如何检查以确定患有{疾病}? | 确定患有{疾病}需要进行{医学检验项目}等检查。 |
| {药物}主要用于治疗哪些疾病? | {药物}主要用于治疗{疾病}等疾病。 |
| {药物}的适应症是什么? | {药物}的适应症包括{疾病}。 |
| 如何治疗{疾病}? | 治疗{疾病}的方法包括{医疗程序}。 |
| {疾病}的常见治疗方法有哪些? | {疾病}的常见治疗方法包括{医疗程序}。 |
| {疾病}会引起哪些并发症? | {疾病}会引起{疾病}等并发症。 |
| 患有{疾病}的患者可能出现哪些并发症? | 患有{疾病}的患者可能出现{疾病}等并发症。 |
| {药物}与哪些药物存在相互作用? | {药物}与{药物}存在相互作用。 |
| 使用{药物}时需要注意哪些药物相互作用? | 使用{药物}时需注意与{药物}的相互作用。 |
| {药物}主要用于治疗哪些症状? | {药物}主要用于治疗{临床表现}等症状。 |
| {药物}的主要治疗作用是什么? | {药物}的主要治疗作用为治疗{临床表现}。 |
| 提示语(英文) | 响应(英文) |
|---|---|
| What diseases are related to {disease}? | {Disease} may be related to {disease}. |
| What other diseases may be associated with {disease}? | {Disease} may be associated with {disease}. |
| What are the common symptoms of {disease}? | The common symptoms of {disease} include {clinical manifestations}. |
| What symptoms might a patient with {disease} exhibit? | Patients with {disease} may exhibit symptoms such as {clinical manifestations}. |
| What are the typical {clinical manifestations} of {disease}? | The typical clinical manifestations of {disease} include {clinical manifestations}. |
| What symptoms do patients with {disease} typically present with in a clinical setting? | Patients with {disease} typically present with {clinical manifestations} in a clinical setting. |
| What tests are needed to diagnose {disease}? | Tests such as {medical examination items} are required to diagnose {disease}. |
| How can one check to confirm if they have {disease}? | To confirm if one has {disease}, tests such as {medical examination items} are required. |
| What diseases can {drug} primarily treat? | {Drug} is primarily used to treat diseases such as {disease}. |
| What are the indications of {drug}? | The indications of {drug} include {disease}. |
| How can {disease} be treated? | {Disease} can be treated with methods such as {medical procedures}. |
| What are the common treatment methods for {disease}? | The common treatment methods for {disease} include {medical procedures}. |
| What complications can {disease} cause? | {Disease} can cause complications such as {disease}. |
| What complications might a patient with {disease} develop? | A patient with {disease} might develop complications such as {disease}. |
| What drugs interact with {drug}? | {Drug} interacts with {drug}. |
| What drug interactions should be considered when using {drug}? | When using {drug}, interactions with {drug} should be considered. |
| What symptoms can {drug} primarily treat? | {Drug} is primarily used to treat symptoms such as {clinical manifestations}. |
| What is the main therapeutic action of {drug}? | The main therapeutic action of {drug} is to treat {clinical manifestations}. |
表 8: 提示模板。
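Templates like those in Table 8 can be instantiated mechanically from knowledge-graph triplets to produce instruction pairs. A minimal sketch, assuming a simple (head, relation, tail) triplet representation; the relation names and the example triplet below are hypothetical illustrations, not drawn from CMeKG:

```python
# Map a knowledge-graph relation to a (prompt, response) template pair,
# following the wording of Table 8. The relation names are assumptions.
TEMPLATES = {
    "has_common_symptom": (
        "What are the common symptoms of {head}?",
        "The common symptoms of {head} include {tail}.",
    ),
    "has_indication": (
        "What are the indications of {head}?",
        "The indications of {head} include {tail}.",
    ),
}

def triplet_to_pair(head, relation, tail):
    """Fill one template pair with the entities of a single triplet."""
    prompt_t, response_t = TEMPLATES[relation]
    return (prompt_t.format(head=head, tail=tail),
            response_t.format(head=head, tail=tail))

# Hypothetical triplet, for illustration only.
q, a = triplet_to_pair("appendicitis", "has_common_symptom",
                       "right lower abdominal pain")
print(q)  # What are the common symptoms of appendicitis?
print(a)  # The common symptoms of appendicitis include right lower abdominal pain.
```

Iterating such a function over all triplets of a graph yields one QA pair per triplet, which is one plausible way the template-based training examples described above could be generated.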
