Guangyu Wang∗, Guoxing Yang, Zongxin Du, Longjun Fan, Xiaohu Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
∗guangyu.wang24@gmail.com
ABSTRACT
Large language models have exhibited exceptional performance on various Natural Language Processing (NLP) tasks, leveraging techniques such as pre-training and instruction fine-tuning. Despite these advances, their effectiveness in medical applications is limited, due to challenges such as factual inaccuracies, limited reasoning abilities, and a lack of grounding in real-world experience. In this study, we present ClinicalGPT, a language model explicitly designed and optimized for clinical scenarios. By incorporating extensive and diverse real-world data, such as medical records, domain-specific knowledge, and multi-round dialogue consultations in the training process, ClinicalGPT is better prepared to handle multiple clinical tasks. Furthermore, we introduce a comprehensive evaluation framework that includes medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Our results demonstrate that ClinicalGPT significantly outperforms other models in these tasks, highlighting the effectiveness of our approach in adapting large language models to the critical domain of healthcare.
Keywords deep learning $\cdot$ large language model $\cdot$ medical knowledge $\cdot$ electronic medical record $\cdot$ text generation
1 Introduction
In recent years, the paradigm of pre-training and fine-tuning large language models has brought about significant advancements in the Natural Language Processing (NLP) domain. The earliest approaches, like BERT [1], utilized optimized objectives like Masked Language Modeling (MLM) to pre-train on large text corpora such as BookCorpus [2] in an unsupervised manner to learn good representations. These representations can be fine-tuned and adapted to one or more specific downstream tasks to improve their performance. Further research aims to develop competent generalists, i.e., generalized systems that can perform multiple NLP tasks without the need for a manually labeled training dataset for each task. For instance, T5 [3] treats multiple NLP tasks as text-to-text transformation tasks and leverages an encoder-decoder architecture, achieving promising results on tasks such as text classification, question answering, and summarization, though with a larger number of parameters. In contrast, GPT-3 [4] uses a large auto-regressive model for few-shot predictions, improving performance without parameter fine-tuning by incorporating few-shot demonstrations through text interaction with the model. PaLM [5] is a Transformer-based, Pathways-enabled large-scale language model. Compared to other models, PaLM is more resource-efficient in terms of computation and achieves state-of-the-art few-shot results across hundreds of natural language, code, and mathematical reasoning tasks.
With their substantial generalization capabilities in NLP tasks, large pre-trained models are increasingly utilized for various tasks and facilitate human interaction through dialogue models. LaMDA [6], a transformer-based model designed for dialogues, leverages annotated data and external knowledge to augment its helpfulness and role consistency. InstructGPT [7] aligns with user intent across various tasks through fine-tuning and reinforcement learning with human feedback, resulting in improved truthfulness and reduced toxicity in output generation. ChatGPT can simulate human interaction, write abstracts, or create movie scripts in response to prompts, driving the AI revolution. Large language models are also effective for writing assistance and generating efficient code for programmers.
As we know, medicine and health care still face many challenges, including aging populations, lack of equitable access, rising costs, doctor and nurse burnout, and global pandemics. Information technology has the potential to transform modern medicine by offering new tools and insights for healthcare, with ChatGPT and GPT-4 promising to revolutionize clinical decision support, clinical trial recruitment, clinical data management, research support, and patient education [8, 9]. Google researchers developed Flan-PaLM, an instruction-tuned variant of PaLM, showing improved task performance via natural language instructions. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on the MultiMedQA multiple-choice datasets, but remains outperformed by clinicians. A recent perspective suggests generalist medical AI (GMAI) built on foundation models may disrupt task-specific paradigms, enabling versatile applications like interactive note-taking, bedside decision support, and patient chatbots [10]. However, there are considerable challenges to overcome in applying generative language models to the medical field. The output of generative language models may have factual errors, logic inconsistencies, and problems with coherence, such as citing article references that do not exist [11]. The models have limited reasoning abilities and lack grounding in real-world experience, leading to general and vague responses. ChatGPT has been found lacking in depth and insight [4], likely due to its alignment model used for reward-based training, which produces overly generalized answers that lack medical expertise. This evidence implies that employing these technologies in the medical field brings unique hurdles, such as the necessity for high accuracy, interpretability, and secure handling of sensitive health data.
In this study, we present ClinicalGPT, a large language model that is specifically designed for tasks across medical applications. To train the model, we leverage extensive and diverse datasets consisting of real-world medical records, allowing us to transfer domain-specific knowledge to the model. In addition, we establish a comprehensive evaluation framework that includes medical knowledge question-answering, medical examinations, patient consultations, and medical record analysis. By utilizing parameter-efficient fine-tuning methods, we were able to further improve the performance of ClinicalGPT. The results demonstrate that ClinicalGPT outperforms existing models in terms of performance, thus confirming the effectiveness of our approach.

Figure 1: Overview of ClinicalGPT.
2 Methods
2.1 Dataset
In this study, we incorporated large and diverse medical datasets, including cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, and MedDialog, for the training and evaluation of our model.
The cMedQA2 dataset [12] is a Chinese medical question-and-answer dataset that consists of 120k questions and 226k answers. The data is aggregated from a Chinese medical question-and-answer online forum. For training purposes, we followed the original dataset partition as proposed by the authors, and then we randomly selected one answer per question. We annotated 10k questions from the training set for training reward models and used 4k questions from the validation set for reinforcement learning. We sampled questions from the testing set for evaluation.
The cMedQA-KG is a medical question-answer dataset curated from knowledge graphs. It is built on three knowledge graphs: cMeKG, xywy-KG, and 39Health-KG. These knowledge graphs cover comprehensive medical entities such as diseases, medications, and symptoms, and their relationships. Detailed descriptions of the knowledge graphs can be found in Appendix A. We designed templates (see Appendix B) to transform each knowledge triplet into fine-tuning instruction data, i.e., text-to-text pairs for text generation, yielding 100k question-answer pairs. cMedQA-KG is used exclusively for training purposes.
The MEDQA-MCMLE dataset is a subset of the original MEDQA dataset [13], consisting of Chinese medical examination questions in a multiple-choice format. It includes 34k questions, each offering multiple choices, typically 4 or 5. We have followed the original authors' division of the dataset into training, validation, and testing sets. As this dataset is derived from professional medical board examinations, it effectively evaluates applied knowledge, clinical reasoning, and patient-centric skills.
The MedDialog dataset [14] is a collection of multi-turn medical conversations obtained from an online platform. MedDialog comprises 1.1 million dialogues and 4 million utterances. Due to the large volume of data, we randomly sampled 100k, 1k, and 1k dialogues for the training, validation, and testing sets, respectively. These multi-turn dialogues closely resemble real interactions between doctors and patients, aiding the model in understanding the process of clinical inquiry and decision-making.
The MD-EHR dataset is comprised of electronic health records from multi-center, large-scale hospitals in China. This dataset contains 100k records covering a range of disease groups, including Respiratory, Digestive, Urinary, Psychiatry, Neurology, Gynecology, and Hematology.
Each record within the MD-EHR dataset provides a comprehensive overview of the patient’s complaints, medical history, findings from physical examinations, ancillary test results, and the final diagnosis. We have divided the dataset into three sets: 2,000 records for the validation set, 2,000 records for the testing set, and the remaining entries for the training set. Following T5[3], we transformed the medical records into a text generation task by concatenating the notes from the records as input and using the diagnosis as the output.
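As an illustration of this record-to-text transformation, the sketch below flattens one record into an input/output pair; the field names are hypothetical stand-ins, not the actual MD-EHR schema.

```python
def ehr_to_text_pair(record):
    """Flatten one EHR record into a text-to-text training pair, T5-style:
    concatenated notes as input, diagnosis as output. The field names are
    illustrative, not the actual MD-EHR schema."""
    sections = ["complaint", "history", "physical_exam", "ancillary_tests"]
    source = " ".join(f"{name}: {record[name]}" for name in sections if name in record)
    return source, record["diagnosis"]

record = {
    "complaint": "Abdominal pain for 1 day.",
    "history": "Vomiting, previous fever.",
    "diagnosis": "Appendicitis",
}
src, tgt = ehr_to_text_pair(record)
```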
2.2 Fine-tuning
We adopt the T5 model's [3] strategy of utilizing text generation grounded in language models to complete all tasks in our study. Language models, pre-trained on extensive corpora, have demonstrated a remarkable ability to understand and generate human-like text [4]. These models calculate the probability of a sequence of words in a text, $T=(w_{1},w_{2},...,w_{L})$. Specifically, the causal language model calculates the probability of the text $T$ as $p(T)=p(w_{1})p(w_{2}|w_{1})...p(w_{L}|w_{1},w_{2},...,w_{L-1})$, where $L$ represents the length of the text. Several large language models, such as BLOOM, GLM, and others, are available for public use.
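The chain-rule factorization above can be made concrete with a toy sketch, in which a uniform conditional distribution stands in for a trained network:

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Chain-rule log-probability of a token sequence:
    log p(T) = sum_i log p(w_i | w_1, ..., w_{i-1})."""
    total = 0.0
    for i, w in enumerate(tokens):
        total += math.log(cond_prob(tokens[:i], w))
    return total

# Toy conditional model: uniform over a 4-word vocabulary.
uniform = lambda history, w: 1.0 / 4

lp = sequence_log_prob(["the", "patient", "has", "fever"], uniform)  # 4 * log(1/4)
```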
To enhance the utility of large models for downstream tasks, we apply an instruction-tuning approach with supervised fine-tuning (SFT). The language model $p_{\theta}$ is trained to generate a response $R=v_{1:n}$ for a given input prompt $I=w_{1:m}$, optimizing the likelihood $p_{\theta}(R|I)=p_{\theta}(v_{1:n}|w_{1:m})$, where $n$ and $m$ represent the lengths of the response and input prompt, respectively. Thus, the loss function is $\frac{1}{n}\sum_{i=m+1}^{m+n}-\log p_{\theta}(w_{i}|w_{1},...,w_{i-1})$, where $w_{1:m+n}$ denotes the concatenation of the prompt and response tokens.
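A minimal sketch of this SFT objective, assuming per-token log-probabilities are already available and prompt tokens are excluded from the loss:

```python
import math

def sft_loss(token_log_probs, prompt_len):
    """Supervised fine-tuning loss: mean negative log-likelihood over the n
    response tokens only; the m prompt tokens are excluded from the loss,
    i.e. (1/n) * sum_{i=m+1}^{m+n} -log p_theta(w_i | w_1, ..., w_{i-1})."""
    response_lps = token_log_probs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Toy example: m = 2 prompt tokens, n = 3 response tokens.
lps = [math.log(0.5)] * 2 + [math.log(0.25)] * 3
loss = sft_loss(lps, prompt_len=2)  # equals -log(0.25)
```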
To incorporate domain-specific knowledge into LLMs, we turn to knowledge graphs (KGs) specific to the domain for constructing prompt-response pairs. KGs capture knowledge in the form of structured triples $(s,r,o)$ , where $s$ denotes the subject, $r$ the relationship, and $o$ the object. An example of such a triple could be (Cough, SymptomOf, Pneumonia). We leverage a set of manually designed templates to transform these triples into question-answer pairs, rendering them suitable for instruction tuning. The manually designed templates can be found in Appendix B.
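A sketch of this triple-to-instruction transformation; the template strings here are illustrative stand-ins, since the templates actually used are listed in Appendix B:

```python
# Hypothetical templates keyed by relation type; the templates actually used
# are given in Appendix B, so these strings are illustrative only.
TEMPLATES = {
    "SymptomOf": ("What disease can {s} be a symptom of?",
                  "{s} can be a symptom of {o}."),
}

def triple_to_qa(s, r, o):
    """Turn a (subject, relation, object) triple into a question-answer pair
    suitable for instruction tuning."""
    q_tpl, a_tpl = TEMPLATES[r]
    return q_tpl.format(s=s, o=o), a_tpl.format(s=s, o=o)

q, a = triple_to_qa("Cough", "SymptomOf", "Pneumonia")
```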
2.3 Reward model
Existing works have demonstrated that reinforcement learning can incorporate human feedback to enhance large language models. For instance, WebGPT [15] is a browser-assisted question-answering system that utilizes human feedback for performance improvement. InstructGPT [7] also aligns with human feedback via reinforcement learning to generate helpful and safe responses.
We follow the work of [7], constructing a reward model (RM) $r_{\mu}$ to furnish the reward signal crucial for the reinforcement learning process. We employ rank-based training for the RM. Human labelers rank responses for a given input prompt $I$, generating a comparison pair for each prompt. For a comparison pair with a human-preferred response $R_{w}$ and a less preferred response $R_{l}$, the loss is given by $-\log(\sigma(r_{\mu}(I,R_{w})-r_{\mu}(I,R_{l})))$.
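The pairwise ranking loss can be sketched directly from the formula, with scalar rewards standing in for the reward model's outputs:

```python
import math

def rm_pairwise_loss(reward_preferred, reward_rejected):
    """Rank-based reward-model loss: -log(sigmoid(r(I, R_w) - r(I, R_l)))."""
    diff = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the preferred response's reward pulls further ahead.
loss_close = rm_pairwise_loss(1.0, 0.9)
loss_clear = rm_pairwise_loss(3.0, 0.0)
```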
2.4 Reinforcement learning
We adopt the method proposed by Stiennon et al. [16], leveraging reinforcement learning to enhance the fine-tuned models with the objective of generating high-quality and helpful outputs, as well as improving the generation of medical texts, thereby aiding in the accurate description and treatment of patient conditions.
We utilize the trained reward model as the reward function. In order to prevent the model from deviating too far from its initial state, we employ Proximal Policy Optimization (PPO) as our optimization strategy. Specifically, we incorporate a penalty term in the reward function that penalizes the KL divergence between the learned reinforcement learning policy, denoted as $\pi_{\phi}^{RL}$, and the original supervised model, $\pi^{SFT}$. This is to ensure that the final model does not deviate excessively from the original supervised model. The complete reward function is defined as follows: $R(x,y)=r_{\mu}(x,y)-\beta\log(\pi_{\phi}^{RL}(y|x)/\pi^{SFT}(y|x))$, where $r_{\mu}(x,y)$ represents the output of the reward model and $\beta$ is the coefficient for KL divergence in the reward function. The loss function used in PPO optimization is given by: $L=r_{\mu}\hat{A}_{t}-\beta\,\mathrm{KL}[\pi_{\phi_{old}},\pi_{\phi}]$, where $r_{\mu}$ is the reward function, $\hat{A}_{t}$ is an estimator of the advantage function, $\phi_{old}$ represents the parameters of the policy at the previous step, and $\pi_{\phi}$ is the current policy.
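A sketch of the KL-penalized reward for a single sample, assuming the (summed) log-probabilities of the sampled response under both policies are available:

```python
def rl_reward(rm_score, logp_rl, logp_sft, beta=0.1):
    """KL-penalized reward for one sampled response y given prompt x:
    R(x, y) = r_mu(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)),
    using summed log-probabilities of y under each policy."""
    return rm_score - beta * (logp_rl - logp_sft)

# When the RL policy matches the SFT model exactly, the penalty vanishes.
r_match = rl_reward(rm_score=2.0, logp_rl=-5.0, logp_sft=-5.0)  # -> 2.0
# Probability mass shifted away from the SFT model is penalized.
r_drift = rl_reward(rm_score=2.0, logp_rl=-3.0, logp_sft=-5.0)
```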
3 Experiments and results
3.1 Implementation details
We chose BLOOM-7B [17] as our base large language model, due to its open-source nature and multilingual support. For the supervised fine-tuning process, we set the learning rate to 5e-5, with a batch size of 128 and a maximum length of 1,024, training across 3 epochs. During the training of the reward model, we utilized the last feature vector of the final output sequence as the text representation. Based on the fine-tuned model, we added a binary classification head to output the reward. We set the learning rate to 2e-5, with a batch size of 128, a maximum length of 1,024, and training over 3 epochs. For the reinforcement learning process, we applied a learning rate of 1e-5 and a maximum length of 1,024, training for 4,000 steps. To efficiently train the large language model, we adopted LoRA (Low-Rank Adaptation) [18], a parameter-efficient fine-tuning method, with r of 8, alpha of 32, and dropout of 0.1. To decrease memory usage and improve training speed, we employed ZeRO-2 [19] and made use of both TF32 (TensorFloat-32) and BF16 (bfloat16). We selected several instruction fine-tuned models for comparison, including ChatGLM-6B [20], LLAMA-7B [21] (fine-tuned on English and Chinese data), and BLOOM-7B [22] (fine-tuned on cross-lingual tasks).
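To illustrate why LoRA is parameter-efficient at the stated configuration (r = 8, alpha = 32), the sketch below computes the low-rank weight update in plain Python; it is a didactic stand-in, not the actual training code:

```python
import random

def lora_delta(B, A, r, alpha):
    """LoRA weight update: delta_W = (alpha / r) * B @ A, with B of shape
    d x r and A of shape r x k, so only r * (d + k) parameters are trained
    instead of the full d * k."""
    d, k = len(B), len(A[0])
    scale = alpha / r
    return [[scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]

# Shapes echoing the configuration above: r = 8, alpha = 32 (scale = 4.0).
d, k, r, alpha = 16, 16, 8, 32
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # B starts at zero, so the initial delta is zero
delta = lora_delta(B, A, r, alpha)
```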
3.2 Medical conversation
We conducted performance evaluation of medical conversation on the test set of MedDialog. To address the challenge of multiple rounds of conversation within each medical dialogue, we randomly truncated the dialogue at a certain round, discarding the subsequent dialogue, and using the historical dialogue prior to this round as input. The sample response is shown in Table 1. We used three evaluation metrics: BLEU [23], ROUGE [24], and GLEU, to assess the quality of the conversations. BLEU is a commonly used metric that compares a candidate translation with one or more reference translations based on n-gram precision. GLEU calculates the average score of different n-grams, providing a more comprehensive evaluation of the generated text. ROUGE, on the other hand, is a particularly useful metric for evaluating automatic summarization and machine translation, as it focuses on the recall aspect of generated summaries by comparing them with references.
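The modified n-gram precision underlying BLEU (and averaged across orders by GLEU) can be sketched as:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision used by BLEU: clipped overlap between
    candidate and reference n-grams, divided by the candidate n-gram count."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

cand = "the patient has a mild fever".split()
ref = "the patient has a fever".split()
p1 = ngram_precision(cand, ref, 1)  # 5 of 6 unigrams match -> 5/6
p2 = ngram_precision(cand, ref, 2)  # 3 of 5 bigrams match -> 3/5
```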
The experimental results are presented in Table 2. They demonstrate that ClinicalGPT achieves outstanding performance on BLEU-1 and all ROUGE scores. ClinicalGPT comes second only to BLOOM-7B in terms of BLEU-2, BLEU-3, and BLEU-4. The superior ROUGE scores achieved by ClinicalGPT indicate that the responses generated by the model cover the information provided by the reference text more effectively.
3.3 Medical examination
In this study, the medical examination assessment using the MEDQA-MCMLE dataset was evaluated on the highest-frequency categories in the dataset. The selected categories included Medical ethics, Respiratory system, Digestive system, Urinary system, Hematologic diseases, Rheumatic immune diseases, Pediatric diseases,
Description of medical conditions and history
Table 1: A sample response in medical conversation.
| 疾病:疑似皮肤paget病 |
|---|
| (Disease: Suspected Paget's disease of the skin) |
| 患病时长:大于半年 |
| (Duration of illness: more than six months.) |
| 病情描述:一直按湿疹来医已经好几年了 |
| (Disease description: Has been treated as eczema for several years.) |
| 希望获得的帮助:请医生给我一些治疗上的建议,目前病情是否需要手术?是否需要就诊?就诊前做哪些准备?请问三院有收过这种病的吗?怎么医治? |
| (Help needed: Can the doctor provide me with some treatment suggestions? Does the current condition require surgery? Do I need to seek medical attention? What preparations should be made before seeking medical treatment? May I ask if the Third Hospital has ever received this disease? How to treat it?) |
| Dialogue Assistant: 根据您的病情,确实需要手术切除病灶,不过具体手术情况可能需要现场判断,建议您前来就诊后由专业医生为您制定治疗方案。 |
| (Assistant: Based on your condition, it is indeed necessary to undergo surgery to remove the lesion, but the specific surgical situation may require on-site judgment. It is recommended that you come for treatment and have a professional doctor develop a treatment plan for you.) |
| User: 那就诊前需要准备些什么呢? |
| (User: What do you need to prepare before seeing a doctor?) |
| Assistant: 前往就诊前,您需要做好皮肤检查,包括观察皮肤颜色、质地、皮疹大小等。此外,您也需要带上病历资料、医生开的处方等,以便医生更好地为您诊断和治疗。 |
| (Assistant: Before going to the clinic, you need to undergo a skin examination, including observing skin color, texture, rash size, etc. In addition, you also need to bring medical records, doctor's prescriptions, etc., so that the doctor can better diagnose and treat you.) |
Table 2: Comparisons on medical conversation.
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | GLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|---|---|
| LLAMA-7B | 10.8 | 2.9 | 1.5 | 0.9 | 0.6 | 22.4 | 5.1 | 17.3 |
| ChatGLM-6B | 6.6 | 1.6 | 0.9 | 0.5 | 0.3 | 23.6 | 5.0 | 16.2 |
| BLOOM-7B | 12.2 | 4.4 | 2.9 | 2.2 | 2.4 | 11.0 | 1.6 | 8.6 |
| Ours | 13.9 | 3.7 | 2.0 | 1.2 | 0.9 | 27.9 | 6.5 | 21.3 |
and Pharmacology. The models were fed questions and options as input, and the generated text was subsequently used to extract answers to compute accuracy. The sample response is shown in Table 3.
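Answer extraction from free-form generations can be sketched with a simple heuristic; the exact rule used here is not specified in the text, so this regex is an assumption:

```python
import re

def extract_choice(generated, options="ABCDE"):
    """Pull the chosen option letter out of a free-form generation, taking the
    last mention of '选项X' or 'Option X'. The exact extraction rule is not
    specified, so this heuristic is an illustrative assumption."""
    hits = re.findall(r"(?:选项|[Oo]ption\s*)([%s])" % options, generated)
    return hits[-1] if hits else None

pred = extract_choice("因此选项E不符合SLE的血液系统改变。")  # -> 'E'
```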
| Prompt: |
|---|
| 下述哪一项不符合SLE的血液系统改变? |
| (Which of the following blood system changes does not comply with SLE?) |
| 回答选项: |
| (Answer options:) |
| 选项A 血小板减少 |
| (Option A: Thrombocytopenia) |
| 选项B 白细胞减少 |
| (Option B: Leukopenia) |
| 选项C 自身免疫性溶血性贫血 |
| (Option C: Autoimmune hemolytic anemia) |
| 选项D 正色素细胞性贫血 |
| (Option D: Normochromic anemia) |
| 选项E 类白血病样改变 |
| (Option E: Leukemia-like changes) |

| Response: |
|---|
| SLE是一种自身免疫性疾病,其血液系统改变包括血小板减少、自身免疫性溶血性贫血、正色素细胞性贫血等。而类白血病样改变是指骨髓中出现大量幼稚细胞,与SLE无关,因此选项E不符合SLE的血液系统改变。 |
| (SLE is an autoimmune disease, whose blood system changes include thrombocytopenia, autoimmune hemolytic anemia, normochromic anemia, etc. Leukemia-like changes refer to the presence of a large number of immature cells in the bone marrow, which is not related to SLE. Therefore, option E does not match the hematological changes of SLE.) |
Table 3: A sample response in medical examination.
The experimental results, as shown in Table 4, reveal that ClinicalGPT outperformed other LLMs such as LLAMA-7B, ChatGLM-6B, and BLOOM-7B in all evaluated categories, boasting an average accuracy of 38.4. Specifically, ClinicalGPT achieved strong performance, exceeding the average scores of ChatGLM-6B, BLOOM-7B, and LLAMA-7B, which were 19.9, 25.7, and 27.2, respectively. Among all categories, ClinicalGPT achieved its best score in Rheumatic immune diseases with an accuracy of 47.4. Conversely, it underperformed in Respiratory and Digestive diseases, with accuracies of 26.1 and 36.9, respectively. These findings suggest that while ClinicalGPT excels in understanding and generating responses related to the rheumatic immune system, further refinement is required to improve its performance in Respiratory and Digestive diseases.
Table 4: Comparisons on medical examination.
| Model | Respiratory | Urinary | Digestive | Rheumatic immune | Average |
|---|---|---|---|---|---|
| ChatGLM-6B | 24.6 | 24.4 | 20.0 | 10.5 | 19.9 |
| LLAMA-7B | 20.3 | 35.6 | 21.2 | 31.6 | 27.2 |
| BLOOM-7B | 15.9 | 31.1 | 29.4 | 26.3 | 25.7 |
| ClinicalGPT | 26.1 | 40.0 | 36.9 | 47.4 | 37.6 |
3.4 Diagnosis
The diagnostic capabilities of large language models (LLMs) were evaluated on the testing set of MD-EHR. Disease groups were selected for evaluation, including Respiratory, Digestive, Urinary, Psychiatry, Neurology, Gynecology, and Hematology. The models were provided with concatenated notes from each medical record as input and generated text as output. The accuracy of the models was calculated by comparing the generated text with the diagnosis labels in the medical records. The sample response is shown in Table 5.
Table 5: A sample response in diagnosis for an electronic medical record.
| Prompt: |
|---|
| 主诉:腹痛1天。 (Complaint:Abdominal painfor 1 day.) |
| 现病史和体格检查:腹痛,无明显规律,有呕吐,曾有发热、现无发热,无腹胀及异常哭闹等不适。尿量可,精神尚可,食纳一般。R20次/分,P110次/分,神志清,精神反应好,呼吸平稳,口唇红润,咽无充血,双肺呼吸音粗,未闻及罗音,心音有力,律齐,各瓣膜区未闻及杂音,腹部软,全腹压痛,肝脾未及,神经系统查体未见异常。 |
| (Current medical history and physical examination: Abdominal pain, no obvious pattern, vomiting, previous or current fever, no discomfort such as abdominal distension or abnormal crying. Urinary volume is acceptable, mental state is acceptable, and appetite is average. R 20 times/minute, P 110 times/minute, clear mind, good mental response, stable breathing, red lips, no congestion in the throat, thick respiratory sounds in both lungs, no rale heard, strong heart sounds, regular rhythm, no murmur heard in each valve area, soft abdomen, full abdominal tenderness, liver and spleen not reached, and no abnormalities were found in the nervous system examination.) |
| 检验检查:外院查尿淀粉酶高,流感+、腹平片:肠梗阻。影像显示肝脏切面形态大小正常,包膜光整,实质回声均匀。 |
| (Examination: high urine Amylase in the external hospital, influenza+, abdominal plain film: Bowel obstruction. The image shows that the shape and size of the liver section are normal, the capsule is smooth, and the parenchymal echo is uniform. No obvious abnormal echo is detected.) |
| 回复: |
|---|
| 您好,根据您的症状和检查结果,您被诊断为阑尾炎。阑尾炎是一种常见的急性腹痛疾病,通常需要进行手术治疗。建议您尽快进行手术治疗,以避免病情恶化。 |
| (Hello, according to your symptoms and examination results, you are diagnosed with appendicitis. Appendicitis is a common acute abdominal pain disease that usually requires surgical treatment. It is recommended that you undergo surgical treatment as soon as possible to avoid worsening the condition.) |
| 在手术前,我们会对您进行一些必要的准备工作,包括禁食和饮水,以及进行血液检查等。手术后,我们会您进行密切观察和护理,以确保您的恢复情况良好。 |
The experimental results for each disease group are presented in Table 6. ClinicalGPT outperformed the other language models, ChatGLM-6B, LLAMA-7B, and BLOOM-7B, across all disease groups. Its average accuracy across all disease groups was $80.9\%$, markedly higher than the $40.9\%$ of ChatGLM-6B, $36.6\%$ of LLAMA-7B, and $60.3\%$ of BLOOM-7B. ClinicalGPT performed particularly strongly in the Digestive and Urinary departments, achieving accuracies of $90.1\%$ and $89.9\%$, respectively, indicating a robust capability for understanding and interpreting medical records across different disease groups. ClinicalGPT exhibited slightly lower, yet still strong, performance in the Gynecology and Hematology departments, with accuracies of $78.6\%$ and $80.7\%$ respectively, suggesting room for improvement in these two specialties, although the model performed well overall across a range of medical specialties.
表6展示了各疾病组的实验结果。ClinicalGPT在所有疾病组中均优于ChatGLM-6B、LLAMA-7B和BLOOM-7B等其他语言模型。其整体平均准确率达$80.9\%$,显著高于ChatGLM-6B的$40.9\%$、LLAMA-7B的$36.6\%$和BLOOM-7B的$60.3\%$。该模型在消化科($90.1\%$)和泌尿科($89.9\%$)表现尤为突出,展现出强大的跨病种病历理解能力。尽管在妇科($78.6\%$)和血液科($80.7\%$)稍逊,但仍保持较高水平,表明这两个专科领域尚有优化空间。总体而言,ClinicalGPT在多个医学专科均展现出卓越性能。
| | Respiratory | Digestive | Urinary | Psychiatry | Neurology | Gynecology | Hematology | Average |
|---|---|---|---|---|---|---|---|---|
| ChatGLM-6B | 22.3 | 49.7 | 55.0 | 38.7 | 39.3 | 39.8 | 41.6 | 40.9 |
| LLAMA-7B | 24.2 | 43.7 | 40.9 | 34.9 | 32.8 | 40.8 | 39.2 | 36.6 |
| BLOOM-7B | 36.9 | 73.9 | 71.7 | 59.1 | 57.7 | 56.8 | 65.7 | 60.3 |
| Ours | 64.3 | 90.1 | 89.9 | 79.2 | 83.6 | 78.6 | 80.7 | 80.9 |
Table 6: Comparisons on diagnosis.
表 6: 诊断结果对比
| | 呼吸系统 | 消化系统 | 泌尿系统 | 精神科 | 神经科 | 妇科 | 血液科 | 平均分 |
|---|---|---|---|---|---|---|---|---|
| ChatGLM-6B | 22.3 | 49.7 | 55.0 | 38.7 | 39.3 | 39.8 | 41.6 | 40.9 |
| LLAMA-7B | 24.2 | 43.7 | 40.9 | 34.9 | 32.8 | 40.8 | 39.2 | 36.6 |
| BLOOM-7B | 36.9 | 73.9 | 71.7 | 59.1 | 57.7 | 56.8 | 65.7 | 60.3 |
| Ours | 64.3 | 90.1 | 89.9 | 79.2 | 83.6 | 78.6 | 80.7 | 80.9 |
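The per-department scores and the "Average" column of Table 6 are mutually consistent; a quick sanity check in plain Python, with the values copied directly from the table:

```python
# Per-department diagnosis accuracies (%) from Table 6, in the order
# Respiratory, Digestive, Urinary, Psychiatry, Neurology, Gynecology, Hematology.
scores = {
    "ChatGLM-6B": [22.3, 49.7, 55.0, 38.7, 39.3, 39.8, 41.6],
    "LLAMA-7B":   [24.2, 43.7, 40.9, 34.9, 32.8, 40.8, 39.2],
    "BLOOM-7B":   [36.9, 73.9, 71.7, 59.1, 57.7, 56.8, 65.7],
    "Ours":       [64.3, 90.1, 89.9, 79.2, 83.6, 78.6, 80.7],
}

# Unweighted mean over the seven departments, rounded to one decimal.
averages = {model: round(sum(v) / len(v), 1) for model, v in scores.items()}
print(averages)  # matches the "Average" column: 40.9, 36.6, 60.3, 80.9
```

Note that this reproduces the reported averages only if each department is weighted equally, which the table's "Average" column appears to assume.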
3.5 Medical question answering
3.5 医疗问答
For the medical question-answering (QA) assessment, our model was benchmarked against several other models on a dataset of 388 questions sampled from cMedQA2. Automated evaluation metrics were used, with GPT-4 serving as the reference model. Given each question, every model generated an answer independently; GPT-4 was then used to assess these responses based on their accuracy, helpfulness, and safety, assigning a judgment of Win, Tie, or Lose for each comparison. A "Win" indicates that ClinicalGPT provided the superior response, a "Lose" indicates that the competing model offered the better response, and a "Tie" means that no obvious difference between the responses was observed.
在医疗问答(QA)评估中,我们使用从cMedQA2采样的388个问题数据集,将模型与其他多个模型进行了基准测试。采用自动化评估指标,并以GPT-4作为参考模型。给定问题后,各模型独立生成答案,随后GPT-4根据准确性、实用性和安全性对这些回答进行评估。GPT-4为每次比较判定"胜出"、"平局"或"落败":其中"胜出"表示Clinical GPT提供了更优回答,"落败"表示竞争模型表现更好,"平局"则表明未观察到明显差异。
| | Win | Tie | Lose |
|---|---|---|---|
| Ours v.s. BLOOM-7B | 89.7% | 1.8% | 8.5% |
| Ours v.s. LLAMA-7B | 85.0% | 2.3% | 12.7% |
| Ours v.s. ChatGLM-6B | 67.2% | 10.9% | 22.0% |
Table 7: Medical question-answering on automatic evaluation.
| | 胜率 | 平局 | 负率 |
|---|---|---|---|
| 我们的模型 vs BLOOM-7B | 89.7% | 1.8% | 8.5% |
| 我们的模型 vs LLAMA-7B | 85.0% | 2.3% | 12.7% |
| 我们的模型 vs ChatGLM-6B | 67.2% | 10.9% | 22.0% |
表 7: 自动评估中的医疗问答表现。
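The percentages in Table 7 are simple tallies of per-question verdicts. A minimal sketch of the aggregation, assuming each GPT-4 judgment is recorded as a string; the verdict list below is an illustrative reconstruction from the reported percentages, not the actual GPT-4 outputs:

```python
from collections import Counter

def summarize(verdicts):
    """Turn a list of per-question verdicts ('win'/'tie'/'lose', from our
    model's perspective) into Win/Tie/Lose percentages as in Table 7."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {k: round(100 * counts[k] / n, 1) for k in ("win", "tie", "lose")}

# 388 questions; a 348/7/33 split reproduces the reported 89.7% / 1.8% / 8.5%
# for "Ours v.s. BLOOM-7B" (illustrative reconstruction only).
demo = ["win"] * 348 + ["tie"] * 7 + ["lose"] * 33
print(summarize(demo))  # {'win': 89.7, 'tie': 1.8, 'lose': 8.5}
```

Because each row is rounded independently, a row may not sum to exactly 100% (e.g. the ChatGLM-6B row totals 100.1%).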
The results of the medical question-answering evaluation are presented in Table 7. According to the results, ClinicalGPT outperformed BLOOM-7B, LLAMA-7B, and ChatGLM-6B. In comparisons against BLOOM-7B and LLAMA-7B, our model won in $89.7\%$ and $85.0\%$ of the cases, respectively; the percentage of ties was relatively small, at $1.8\%$ against BLOOM-7B and $2.3\%$ against LLAMA-7B. Against ChatGLM-6B, ClinicalGPT won in $67.2\%$ of the cases, with the tie rate rising to $10.9\%$ and the loss rate to $22.0\%$. This suggests that while ChatGLM-6B has a commendable repository of medical knowledge and fluent textual expression, the training applied to ClinicalGPT is beneficial for augmenting medical question-answering capabilities, despite the extensive knowledge reserves of larger models.
医学问答评估结果如表 7 所示。根据结果,ClinicalGPT 在性能上超越了 BLOOM-7B、LLAMA-7B 和 ChatGLM-6B。与 BLOOM-7B 和 LLAMA-7B 相比,我们的模型分别以 $89.7\%$ 和 $85.0\%$ 的胜率领先。平局案例占比较小,对抗 BLOOM-7B 时为 $1.8\%$,对抗 LLAMA-7B 时为 $2.3\%$。同时,ClinicalGPT 对抗 ChatGLM-6B 的胜率为 $67.2\%$,平局率升至 $10.9\%$,败率则为 $22.0\%$。这一表现表明,尽管 ChatGLM-6B 拥有可观的医学知识库且文本表达流畅,但通过 ClinicalGPT 的训练方式仍能有效增强医学问答能力,即使面对知识储备丰富的更大规模模型。
4 Conclusion
4 结论
In this study, we introduced ClinicalGPT, a large language model tailored for medical and clinical applications. Recognizing the limitations that generic large language models present in these specialized fields, we took steps to refine the model, assembling comprehensive datasets for its fine-tuning. These datasets incorporate real medical records, patient consultations, diverse medical knowledge, and exam data, all aimed at shaping the model's knowledge base and responsiveness. Our extensive experiments cover a range of critical tasks in the medical field, such as medical conversation, medical examination, diagnosis, and medical question answering. The empirical results highlight the superior capabilities of ClinicalGPT in understanding and generating medical and clinical-related responses.
本研究介绍了ClinicalGPT,一款专为医疗和临床应用定制的大语言模型。针对通用大语言模型在这些专业领域的局限性,我们通过整合真实病历、患者咨询记录、多样化医学知识及考试数据等综合数据集进行微调,以优化模型的知识库与响应能力。实验覆盖医疗对话、医学检查、诊断及问答等关键任务,实证结果表明ClinicalGPT在理解和生成医疗临床相关响应方面具有卓越性能。
Acknowledgments
致谢
Parts of the experiments were conducted on the InforSuperBahn Testbed. The authors appreciate the Nanjing Institute of InforSuperBahn for providing the test and evaluation platform.
部分实验在InforSuperBahn测试平台进行。作者感谢南京InforSuperBahn研究所提供的测试与评估平台。
References
参考文献
[24] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
[24] Chin-Yew Lin. ROUGE: 自动摘要评估工具包. 见 Text Summarization Branches Out, 第74–81页, 2004.
Appendix
附录
A Medical knowledge graphs
A 医学知识图谱
The CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph created through human-AI collaboration, using natural language processing and text mining techniques. It is built upon international standards such as ICD, ATC, SNOMED, and MeSH, and integrates clinical guidelines, industry standards, and medical wiki websites as diverse sources. CMeKG contains 62k entities and 374k relationship triplets, representing nine types of medical entities and 23 different relationships among them. Entities include diseases (15,962), manifestations (12,271), body parts (17,706), equipment (900), procedures (6,418), microorganisms (1,934), medical departments (356), tests (2,605), and medications (3,935). Relationships cover diverse medical aspects, with the most prominent being common symptoms (94,657) and side effects (62,339).
CMeKG (Chinese Medical Knowledge Graph) 是一个通过人机协作构建的中文医学知识图谱,运用自然语言处理与文本挖掘技术。该图谱基于ICD、ATC、SNOMED、MeSH等国际标准,整合临床指南、行业标准及医学百科网站等多源数据。CMeKG包含62k个实体和374k组关系三元组,涵盖9类医学实体及其23种关联关系。实体类型包括:疾病(15,962)、临床表现(12,271)、解剖部位(17,706)、医疗设备(900)、诊疗操作(6,418)、微生物(1,934)、科室(356)、检查项目(2,605)和药物(3,935)。关系类型覆盖多维医学关联,其中最常见症状(94,657)和不良反应(62,339)占比最高。
The xywy-KG is a medical knowledge graph generated from data sourced from a Chinese online medical consultation website. Its entities are categorized into seven groups: diseases (11,013), manifestations (5,998), procedures (554), departments (54), examination items (3,353), medications (22,359), and foods (4,993). The relationships are sorted into nine types, most notably examinations (39,531) and recommended medications (59,467), comprising 44k entities and 294k relationships in total.
xywy-KG是一个利用中国在线医疗咨询网站数据生成的医学知识图谱[20]。该图谱包含7类实体:疾病(11,013种)、症状表现(5,998种)、治疗操作(554种)、科室(54类)、检查项目(3,353项)、药品(22,359种)和食物(4,993种);关系分为9种类型,其中最主要的是检查项目关联(39,531条)和推荐用药(59,467条),共计包含4.4万个实体和29.4万条关系。
The 39Health-KG is a medical knowledge graph built from data collected from 39-health, a website dedicated to health consultation and registration. This graph integrates seven types of medical entities and eight types of relationships among them, comprising 37k entities and 210k entity relationships. The entity types are diseases (14,337), body parts (82), departments (83), examination items (3,074), clinical manifestations (5,927), treatment methods (1,493), and medications (4,966). The relationships mainly revolve around related symptoms (48,757) and examination items (31,577).
39Health-KG是一个基于39健康网(一个专注于健康咨询和挂号的网站)收集数据构建的医疗知识图谱[7]。该图谱整合了七类医疗实体和八种实体间关系,包含3.7万个实体和21万条实体关系。实体类型包括疾病(14,337种)、身体部位(82个)、科室(83个)、检查项目(3,074项)、临床表现(5,927种)、治疗方法(1,493种)和药物(4,966种)。其中关系主要集中在相关症状(48,757条)和检查项目(31,577条)方面。
B Prompt templates
B 提示模板
| Prompt (Chinese) | Response (Chinese) |
|---|---|
| {疾病}和哪些疾病有关联? | {疾病}与{疾病}可能有关联。 |
| {疾病}可能与哪些其他疾病有关? | {疾病}可能与{疾病}有关联。 |
| {疾病}有哪些常见症状? | {疾病}的常见症状包括{临床表现}。 |
| 患有{疾病}的患者可能出现哪些症状? | {疾病}患者可能出现如{临床表现}等症状。 |
| {疾病}的典型{临床表现}是什么? | {疾病}的典型临床表现包括{临床表现}。 |
| 患有{疾病}的患者在临床上通常表现为哪些症状? | 患有{疾病}的患者在临床上通常表现为{临床表现}。 |
| 诊断{疾病}需要进行哪些检查? | 诊断{疾病}需要进行如{医学检验项目}等检查。 |
| 如何检查以确定患有{疾病}? | 确定患有{疾病}需要进行{医学检验项目}等检查。 |
| {药物}主要用于治疗哪些疾病? | {药物}主要用于治疗{疾病}等疾病。 |
| {药物}的适应症是什么? | {药物}的适应症包括{疾病}。 |
| 如何治疗{疾病}? | 治疗{疾病}的方法包括{医疗程序}。 |
| {疾病}的常见治疗方法有哪些? | {疾病}的常见治疗方法包括{医疗程序}。 |
| {疾病}会引起哪些并发症? | {疾病}会引起{疾病}等并发症。 |
| 患有{疾病}的患者可能出现哪些并发症? | 患有{疾病}的患者可能出现{疾病}等并发症。 |
| {药物}与哪些药物存在相互作用? | {药物}与{药物}存在相互作用。 |
| 使用{药物}时需要注意哪些药物相互作用? | 使用{药物}时需注意与{药物}的相互作用。 |
| {药物}主要用于治疗哪些症状? | {药物}主要用于治疗{临床表现}等症状。 |
| {药物}的主要治疗作用是什么? | {药物}的主要治疗作用为治疗{临床表现}。 |
| Prompt (English) | Response (English) |
|---|---|
| What diseases are related to {disease}? | {Disease} may be related to {disease}. |
| What other diseases may be associated with {disease}? | {Disease} may be associated with {disease}. |
| What are the common symptoms of {disease}? | The common symptoms of {disease} include {clinical manifestations}. |
| What symptoms might a patient with {disease} exhibit? | Patients with {disease} may exhibit symptoms such as {clinical manifestations}. |
| What are the typical {clinical manifestations} of {disease}? | The typical clinical manifestations of {disease} include {clinical manifestations}. |
| What symptoms do patients with {disease} typically present with in a clinical setting? | Patients with {disease} typically present with {clinical manifestations} in a clinical setting. |
| What tests are needed to diagnose {disease}? | Tests such as {medical examination items} are required to diagnose {disease}. |
| How can one check to confirm if they have {disease}? | To confirm if one has {disease}, tests such as {medical examination items} are required. |
| What diseases can {drug} primarily treat? | {Drug} is primarily used to treat diseases such as {disease}. |
| What are the indications of {drug}? | The indications of {drug} include {disease}. |
| How can {disease} be treated? | {Disease} can be treated with methods such as {medical procedures}. |
| What are the common treatment methods for {disease}? | The common treatment methods for {disease} include {medical procedures}. |
| What complications can {disease} cause? | {Disease} can cause complications such as {disease}. |
| What complications might a patient with {disease} develop? | A patient with {disease} might develop complications such as {disease}. |
| What drugs interact with {drug}? | {Drug} interacts with {drug}. |
| What drug interactions should be considered when using {drug}? | When using {drug}, interactions with {drug} should be considered. |
| What symptoms can {drug} primarily treat? | {Drug} is primarily used to treat symptoms such as {clinical manifestations}. |
| What is the main therapeutic action of {drug}? | The main therapeutic action of {drug} is to treat {clinical manifestations}. |
Table 8: Prompt templates.
| 提示语(中文) | 响应(中文) |
|---|---|
| {疾病}和哪些疾病有关联? | {疾病}与{疾病}可能有关联。 |
| {疾病}可能与哪些其他疾病有关? | {疾病}可能与{疾病}有关联。 |
| {疾病}有哪些常见症状? | {疾病}的常见症状包括{临床表现}。 |
| 患有{疾病}的患者可能出现哪些症状? | {疾病}患者可能出现如{临床表现}等症状。 |
| {疾病}的典型{临床表现}是什么? | {疾病}的典型临床表现包括{临床表现}。 |
| 患有{疾病}的患者在临床上通常表现为哪些症状? | 患有{疾病}的患者在临床上通常表现为{临床表现}。 |
| 诊断{疾病}需要进行哪些检查? | 诊断{疾病}需要进行如{医学检验项目}等检查。 |
| 如何检查以确定患有{疾病}? | 确定患有{疾病}需要进行{医学检验项目}等检查。 |
| {药物}主要用于治疗哪些疾病? | {药物}主要用于治疗{疾病}等疾病。 |
| {药物}的适应症是什么? | {药物}的适应症包括{疾病}。 |
| 如何治疗{疾病}? | 治疗{疾病}的方法包括{医疗程序}。 |
| {疾病}的常见治疗方法有哪些? | {疾病}的常见治疗方法包括{医疗程序}。 |
| {疾病}会引起哪些并发症? | {疾病}会引起{疾病}等并发症。 |
| 患有{疾病}的患者可能出现哪些并发症? | 患有{疾病}的患者可能出现{疾病}等并发症。 |
| {药物}与哪些药物存在相互作用? | {药物}与{药物}存在相互作用。 |
| 使用{药物}时需要注意哪些药物相互作用? | 使用{药物}时需注意与{药物}的相互作用。 |
| {药物}主要用于治疗哪些症状? | {药物}主要用于治疗{临床表现}等症状。 |
| {药物}的主要治疗作用是什么? | {药物}的主要治疗作用为治疗{临床表现}。 |
| 提示语(英文) | 响应(英文) |
|---|---|
| What diseases are related to {disease}? | {Disease} may be related to {disease}. |
| What other diseases may be associated with {disease}? | {Disease} may be associated with {disease}. |
| What are the common symptoms of {disease}? | The common symptoms of {disease} include {clinical manifestations}. |
| What symptoms might a patient with {disease} exhibit? | Patients with {disease} may exhibit symptoms such as {clinical manifestations}. |
| What are the typical {clinical manifestations} of {disease}? | The typical clinical manifestations of {disease} include {clinical manifestations}. |
| What symptoms do patients with {disease} typically present with in a clinical setting? | Patients with {disease} typically present with {clinical manifestations} in a clinical setting. |
| What tests are needed to diagnose {disease}? | Tests such as {medical examination items} are required to diagnose {disease}. |
| How can one check to confirm if they have {disease}? | To confirm if one has {disease}, tests such as {medical examination items} are required. |
| What diseases can {drug} primarily treat? | {Drug} is primarily used to treat diseases such as {disease}. |
| What are the indications of {drug}? | The indications of {drug} include {disease}. |
| How can {disease} be treated? | {Disease} can be treated with methods such as {medical procedures}. |
| What are the common treatment methods for {disease}? | The common treatment methods for {disease} include {medical procedures}. |
| What complications can {disease} cause? | {Disease} can cause complications such as {disease}. |
| What complications might a patient with {disease} develop? | A patient with {disease} might develop complications such as {disease}. |
| What drugs interact with {drug}? | {Drug} interacts with {drug}. |
| What drug interactions should be considered when using {drug}? | When using {drug}, interactions with {drug} should be considered. |
| What symptoms can {drug} primarily treat? | {Drug} is primarily used to treat symptoms such as {clinical manifestations}. |
| What is the main therapeutic action of {drug}? | The main therapeutic action of {drug} is to treat {clinical manifestations}. |
表 8: 提示模板。
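Templates like those in Table 8 can be instantiated mechanically from knowledge-graph triplets to produce instruction pairs. A minimal sketch, assuming a simple (head, relation, tail) triplet representation; the relation names and the example triplet below are hypothetical illustrations, not drawn from CMeKG:

```python
# Map a knowledge-graph relation to a (prompt, response) template pair,
# following the wording of Table 8. The relation names are assumptions.
TEMPLATES = {
    "has_common_symptom": (
        "What are the common symptoms of {head}?",
        "The common symptoms of {head} include {tail}.",
    ),
    "has_indication": (
        "What are the indications of {head}?",
        "The indications of {head} include {tail}.",
    ),
}

def triplet_to_pair(head, relation, tail):
    """Fill one template pair with the entities of a single triplet."""
    prompt_t, response_t = TEMPLATES[relation]
    return (prompt_t.format(head=head, tail=tail),
            response_t.format(head=head, tail=tail))

# Hypothetical triplet, for illustration only.
q, a = triplet_to_pair("appendicitis", "has_common_symptom",
                       "right lower abdominal pain")
print(q)  # What are the common symptoms of appendicitis?
print(a)  # The common symptoms of appendicitis include right lower abdominal pain.
```

Iterating such a function over all triplets of a graph yields one QA pair per triplet, which is one plausible way the template-based training examples described above could be generated.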
