[论文翻译]LLM-Mini-CEX: 大语言模型在诊断对话中的自动评估


原文地址:https://arxiv.org/pdf/2308.07635


LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

LLM-Mini-CEX: 大语言模型在诊断对话中的自动评估

Xiaoming Shi $^1$ , Jie Xu1†, Jinru Ding1, Jiali Pang1, Sichen Liu $^{1}$ , Shuqing Luo $^1$ , Xingwei Peng1, Lu Lu $^{1}$ , Haihong Yang $^2$ , Mingtao Hu $^1$ , Tong Ruan1, Shaoting Zhang1*

小明石 $^1$,徐杰1†,丁金茹1,庞佳丽1,刘思辰 $^{1}$,罗书青 $^1$,彭兴伟1,卢璐 $^{1}$,杨海红 $^2$,胡明涛 $^1$,阮彤1,张少亭1*

Abstract

摘要

Purpose: There is an increasing interest in developing large language models (LLMs) for medical diagnosis to improve diagnosis efficiency. Despite their alluring technological potential, there is no unified and comprehensive evaluation criterion, leading to the inability to evaluate the quality and potential risks of medical LLMs, further hindering the application of LLMs in medical treatment scenarios. Besides, current evaluations heavily rely on labor-intensive interactions with LLMs to obtain diagnostic dialogues and human evaluation on the quality of diagnosis dialogue.

目的:开发用于医疗诊断的大语言模型(LLM)以提高诊断效率的兴趣日益增长。尽管其技术潜力诱人,但目前缺乏统一全面的评估标准,导致无法评估医疗大语言模型的质量和潜在风险,进一步阻碍了大语言模型在医疗场景中的应用。此外,当前评估严重依赖与大语言模型的劳动密集型交互来获取诊断对话,并需人工评估诊断对话的质量。

Methods: To tackle the lack of unified and comprehensive evaluation criterion, we first initially establish an evaluation criterion, termed LLM-specific Mini-CEX to assess the diagnostic capabilities of LLMs effectively, based on original MiniCEX. To address the labor-intensive interaction problem, we develop a patient simulator to engage in automatic conversations with LLMs, and utilize ChatGPT for evaluating diagnosis dialogues automatically.

方法:为解决缺乏统一全面评估标准的问题,我们首先基于原始MiniCEX建立了名为LLM专用Mini-CEX的评估标准,以有效评估大语言模型的诊断能力。针对人工交互耗时的问题,我们开发了患者模拟器与大语言模型进行自动对话,并利用ChatGPT实现诊断对话的自动化评估。

Results: Experimental results show that the LLM-specific Mini-CEX is adequate and necessary to evaluate medical diagnosis dialogue. Besides, ChatGPT can replace manual evaluation on the metrics of humanistic qualities and provides reproducible and automated comparisons between different LLMs.

结果:实验结果表明,针对大语言模型(LLM)的Mini-CEX量表能有效评估医疗诊断对话。此外,ChatGPT可替代人工评估人文关怀指标,并为不同大语言模型提供可复现的自动化对比结果。

Keywords: Large language model, Medical diagnostic conversation, Automatic Evaluation, Mini-CEX

关键词:大语言模型,医学诊断对话,自动评估,Mini-CEX

1 Introduction

1 引言

The large language models (LLMs) [1–4], owing to their abilities to generate humanlike responses by learning from tremendous online resources, have shown great potential in various medical applications, such as medical education [5], medical QA [6, 7], and medical diagnosis [8]. Among these applications, using LLMs for medical diagnosis to enhance the decision-making process treatment efficiency [8] has gained attention as LLMs could answer some medical queries as well as clinicians [6, 9]. However, LLMs may produce generations misaligned with clinical and societal value [10]. To facilitate the application of LLMs for medical diagnosis in real clinical scenarios, it is crucial to evaluate LLMs’ capability and mitigate potential risks, due to the safety-critical nature of the medical domain.

大语言模型(LLM) [1-4]通过学习海量在线资源生成类人响应的能力,在医疗教育[5]、医学问答[6,7]、医疗诊断[8]等医学应用领域展现出巨大潜力。其中,利用大语言模型辅助医疗诊断以提升决策过程和治疗效率[8]备受关注,因其回答某些医学问题的能力已接近临床医生水平[6,9]。但需注意的是,大语言模型可能产生不符合临床及社会价值观的输出[10]。鉴于医疗领域的安全关键属性,评估大语言模型的能力并降低潜在风险,对实现其在真实临床场景中的医疗诊断应用至关重要。

Towards the evaluation of LLMs’ capabilities in the medical domain, current evaluation methods can be divided into three kinds, which are medical information extraction (IE), medical question-and-answer (QA), and diagnosis dialogue capability evaluation. Specifically, the medical IE aims to extract pre-specified information from medical textual sources and the medical QA focuses on the single round medical knowledge QA. However, the evaluations on medical IE and QA is insufficient for the evaluation of LLMs’ diagnostic capabilities in real clinical scenarios as they neglect either multi-turn diagnostic interviewing or rigorous diagnostic results, as shown in Table 1. Compared with medical IE and QA, the evaluation of medical diagnosis dialogue can better reflect the diagnostic capability. However, current metrics for evaluation of medical diagnosis dialogue are all word-based methods [11–14], measuring the word co-occurrence between predicted results and ground-truth references, thus failing to judge the medical diagnostic quality due to the complexity of natural language. Therefore, there is a great demand for designing a unified and comprehensive evaluation criterion for evaluating LLMs’ diagnostic capability in real clinical applications.

当前针对大语言模型(LLM)在医疗领域能力的评估方法可分为三类:医疗信息抽取(IE)、医疗问答(QA)和诊断对话能力评估。具体而言,医疗信息抽取旨在从医疗文本中提取预设信息,医疗问答则关注单轮次医学知识问答。然而如表1所示,这两种评估方式均无法充分反映大语言模型在真实临床场景中的诊断能力——医疗信息抽取缺乏多轮诊断问诊环节,医疗问答则忽略严谨的诊断结果产出。相较而言,医疗诊断对话评估能更全面体现模型的诊断能力。但现有诊断对话评估指标均为基于词汇匹配的方法[11-14],通过计算预测结果与标准答案的词汇共现率进行评估,由于自然语言的复杂性,这类方法难以准确评判医学诊断质量。因此,亟需设计一套统一且全面的评估标准,用于衡量大语言模型在真实临床应用中的诊断能力。

Inspired by the Mini-clinical evaluation exercise (Mini-CEX) [15, 16], a formative assessment tool designed to evaluate the ability of doctors to provide clinical care in actual medical consultations [17], we propose to construct LLM-specific Mini-CEX for evaluating LLMs’ diagnostic capabilities. Specifically, in the original Mini-CEX, physicians are evaluated on six aspects: medical interviewing skills, physical examination skills, professionalism, counselling skills, clinical judgement, and organisation/efficiency [15, 17]. However, the original Mini-CEX can not be directly applied to LLMs, due to a lack of physical observations, palpations, and access to laboratory testing in text-based LLM dialogues. Therefore, modifications are needed to Mini-CEX in order to evaluate LLMs’ diagnostic skills, including the removal of items related to physical examinations and laboratory tests as well as the simplification of complicated, repetitive, and ambiguous items. Specifically, an expert committee was established to draft the LLMs-specific Mini-CEX and the Delphi method was used for revision. Then, the reliability and validity analysis [18] are utilized to analyze the necessity and sufficiency of each item in the scale 1. Finally, the LLM-specific Mini-CEX is obtained, containing 4 primary items including medical interviewing skills, humanistic care, comprehensive diagnostic and treatment abilities, and overall clinical competence, with a total number of 26 secondary items.

受迷你临床演练评估 (Mini-CEX) [15, 16] 的启发(该形成性评估工具旨在评估医生在实际诊疗中提供临床护理的能力 [17]),我们提出构建针对大语言模型的Mini-CEX以评估其诊断能力。具体而言,原版Mini-CEX从六个维度评估医师:病史采集能力、体格检查能力、职业素养、沟通技巧、临床判断力及组织效率 [15, 17]。但由于大语言模型基于文本的对话缺乏实体观察、触诊和实验室检测功能,原版Mini-CEX无法直接适用。因此需对Mini-CEX进行调整以评估大语言模型的诊断技能,包括删除与体格检查和实验室检测相关的条目,并简化复杂、重复和模糊的条目。具体实施中,专家委员会起草了针对大语言模型的Mini-CEX初稿,采用德尔菲法进行修订,随后通过信效度分析 [18] 验证量表中各条目的必要性与充分性,最终形成包含4个一级条目(病史采集能力、人文关怀、综合诊疗能力、整体临床胜任力)和26个二级条目的大语言模型专用Mini-CEX评估体系。

To apply the Mini-CEX evaluation on LLMs’ diagnostic capability, human participation is needed [9]. First, annotators need act as patients to converse with LLMs to obtain medical diagnostic dialogue based on pre-defined patient portraits. Then, 10 experts are needed to comprehensively evaluate each dialogue based on the LLMspecific Mini-CEX. Despite of the effectiveness, these human participation in the evaluation is time-consuming and labor-intensive. To achieve fast and comprehensive evaluations, we propose to utilize a patient simulator to converse with LLMs and utilize the ChatGPT, replacing experts, to automatically evaluate LLMs. Specifically, the patient simulator plays the role of “patient” to automatically converse with LLMs based on given patient self-reports, in this manner collecting diagnostic dialogue automatically. For the patient simulator training, the BLOOM [19] with 7 billion parameters is selected as the LLM foundation model, and is fine-tuned on 120,000 medical dialogue data to generate patient responses based on given patient self-reports and dialogue histories.

为评估大语言模型(LLM)的诊断能力,需引入人工参与的Mini-CEX评估[9]。首先,标注人员需模拟患者身份与大语言模型对话,基于预设患者画像获取医疗诊断对话记录。随后需10位专家根据定制版Mini-CEX对每段对话进行全面评估。该方法虽有效,但人工评估过程耗时费力。为实现快速全面的评估,我们提出采用患者模拟器与大语言模型对话,并利用ChatGPT替代专家进行自动化评估。具体而言,患者模拟器基于给定患者自述扮演"患者"角色,自动与大语言模型生成诊断对话。在模拟器训练环节,我们选择70亿参数的BLOOM[19]作为基础大语言模型,通过对12万条医疗对话数据进行微调,使其能根据患者自述和对话历史生成患者应答。

Besides, to meet the need for a robust and scalable automated method to evaluate medical LLM alignment with human preferences, ChatGPT is utilized as the judge in this work. To facilitate further research on the automatic evaluation, we first construct a dataset, termed medical diagnostic evaluation (MedEval) dataset. To this end, an expert committee is established to annotate the dataset, and finally 2,680 medical consultation dialogues are collected and annotated based on the LLM-specific MiniCEX.

此外,为满足对稳健且可扩展的自动化方法的需求,以评估医疗大语言模型与人类偏好的一致性,本研究采用ChatGPT作为评判工具。为促进自动评估的进一步研究,我们首先构建了一个名为医疗诊断评估(MedEval)的数据集。为此,成立了一个专家委员会对数据集进行标注,最终基于大语言模型专用MiniCEX框架收集并标注了2,680段医疗咨询对话。

We conduct an empirical study of automatic al evaluation on the MedEval dataset. Besides, we conduct human evaluation and automatic evaluation on different LLMs, including ChatGLM [20], ERNIE Bot [3], ChatGPT [1], and GPT-4 [2]. Experimental results show that the automatic evaluation can replace experts in evaluating humanistic care and provide reproducible and automated comparisons between different LLMs, indicating that it is potential to evaluate medical dialogues automatically.

我们在MedEval数据集上对自动评估进行了实证研究。此外,我们对不同的大语言模型进行了人工评估和自动评估,包括ChatGLM [20]、ERNIE Bot [3]、ChatGPT [1]和GPT-4 [2]。实验结果表明,自动评估可以替代专家评估人文关怀,并提供不同大语言模型之间可复现的自动化比较,这表明自动评估医疗对话具有潜力。

2 Results

2 结果

This section contains two parts, including scale construction results, and LLMs evaluation results.

本节包含两部分内容:规模构建结果和大语言模型评估结果。

Table 1 Comparison of the evaluation of LLMs’ medical IE, medical QA and diagnosis dialogue capabilities. The diagnostic dialogue generation task in Prompt C BLUE is designed to compare generated responses with ground-truth responses in word-level, which is not sufficient to evaluate qualities of medical interviewing, humanistic care, and diagnosis skills.

表 1: 大语言模型在医疗信息抽取(IE)、医疗问答(QA)和诊断对话能力的评估对比。Prompt C BLUE中的诊断对话生成任务旨在通过词级匹配比较生成响应与真实响应,但不足以评估医疗问诊、人文关怀和诊断技能的质量。


Fig. 1 The process of the scale construction. Original Chinese Mini-CEX is revised through literature review, focus group research, three rounds inquiry, and reliability/validity analysis.

图 1: 量表构建流程。通过文献综述、焦点小组研究、三轮专家咨询及信效度分析,对中文迷你临床演练评估量表 (Mini-CEX) 进行修订。

2.1 Scale Construction Results

2.1 量表构建结果

Results of the First Round Inquiry. After the focus group discussion in the first round of inquiry, 4 primary items were established, including medical interviewing skills, humanistic care, comprehensive diagnosis and treatment abilities, and overall clinical competence. These 4 primary items are accepted by the experts. Besides, 30 secondary items are collected (see Appendix 1, Figure 1). No secondary items have been added or deleted.

第一轮调查结果。经过首轮焦点小组讨论,确立了4个一级指标,包括问诊技能、人文关怀、综合诊疗能力和整体临床能力。这4个一级指标均获得专家认可。此外,共收集到30个二级指标(见附录1,图1)。未新增或删除任何二级指标。

The focus group experts gave positive feedback on improving the LLM-based MiniCEX items and put forward a total of 12 valuable comments. The modification in this round focused on the obvious semantic incompleteness. For example, “Conduct disease-related inquiries accurately based on the patient’s chief complaint, including past histories, medication history, com or bidi ties, etc.” is revised to “Conduct relevant inquiries accurately around the patient’s chief complaint, including the cause of the disease, the nature and degree of symptoms, occurrence time and regularity, aggravating or mitigating factors, accompanying symptoms, diagnosis and treatment, and other medical history such as past history, personal history, allergy history, etc.” because of incomplete description. More details are listed in Appendix 1, Figure 1.

焦点小组专家对改进基于大语言模型的MiniCEX项目给予了积极反馈,并提出了共计12条宝贵意见。本轮修改主要针对明显的语义不完整问题。例如,"根据患者主诉准确进行疾病相关问诊,包括既往史、用药史、合并症等"被修订为"围绕患者主诉准确进行相关问诊,包括病因、症状性质和程度、发生时间及规律、加重或缓解因素、伴随症状、诊疗经过,以及既往史、个人史、过敏史等其他病史",原因在于原描述不完整。更多细节详见附录1中的图1。

Results of the Second Round Inquiry. The second round expert letter inquiry is discussed in the focus group based on the results of the first round. The scale contains 4 primary items and 30 secondary items, with the primary items unchanged (see Appendix 1, Figure 2).

第二轮征询结果。第二轮专家函询在首轮结果基础上经焦点小组讨论确定,量表包含4个一级条目和30个二级条目,一级条目未作调整(见附录1 图2)。

Table 2 The illustration of the LLM-specific Mini-CEX. The LLM-specific Mini-CEX contains 26 secondary items, covering 4 primary items.

Primary ItemSecondary Item
Medical Interviewing Skills1.1 Enquire about current medical history around patients’ self reports; 1.2 Enquire patients about their past medical history; 1.3 Enquire patients with open questions, and encourage patients to make statements; 1.4 Use words that patients easily understand and avoid medical terminology; 1.5 Explain to patients the basis or purpose of treatment and conclusion; 1.6 Avoid providing medically harmful information; 1.7 Make effective judgments and appropriate responses to medical emergencies; 1.8 Focus on and enquire relevant information that is helpful for assessing the condition;
Humanistic Care2.1 Show respect, sensitivity and empathy during consultation and communication 2.2 Avoid making ineffective extensions after obtaining sufficient information; 2.3 Provide reasonable guidance when patients exhibit negative emotions; 2.4 Respect individual wishes of patients; 2.5 Avoid expressing any form of bias; 2.6 Be patient when explaining problems to patients; 2.7 Use polite words appropriately; 2.8 Avoid asking privacy questions when non disease needs arise;
Comprehensive Diagnostic and Treatment Abilities3.1 Determine the accuracy of information provided by patients; 3.2 Provide disease diagnosis accurately; 3.3 Provide disease diagnosis and corresponding explanations accurately; 3.4 Provide diagnostic plan accurately; 3.5 Provide corresponding interpretation of the diagnostic plan accurately; 3.6 Provide treatment plan accurately; 3.7 Provide explanation of treatment plan accurately;
Overall Clinical Competence4.1 Unsatisfactory 4.2 Satisfactory 4.3 Excellent

表 2 大语言模型专用迷你临床演练评估表(LLM-specific Mini-CEX)说明。该评估表包含26个二级条目,涵盖4个一级条目。

一级条目 二级条目
医学问诊技能 1.1 围绕患者主诉询问现病史;1.2 询问患者既往病史;1.3 使用开放式问题询问患者,并鼓励患者陈述;1.4 使用患者易懂的词语,避免医学术语;1.5 向患者解释治疗依据或目的及结论;1.6 避免提供有医学危害的信息;1.7 对医疗紧急情况做出有效判断和恰当应对;1.8 关注并询问有助于评估病情的相关信息;
人文关怀 2.1 在问诊和沟通中展现尊重、敏感性和同理心;2.2 获取足够信息后避免无效延伸;2.3 当患者出现负面情绪时提供合理引导;2.4 尊重患者个人意愿;2.5 避免表达任何形式的偏见;2.6 向患者解释问题时保持耐心;2.7 恰当使用礼貌用语;2.8 非疾病需要时避免询问隐私问题;
综合诊疗能力 3.1 判断患者提供信息的准确性;3.2 准确提供疾病诊断;3.3 准确提供疾病诊断及相应解释;3.4 准确提供诊断方案;3.5 准确提供诊断方案的相应解读;3.6 准确提供治疗方案;3.7 准确提供治疗方案的说明;
整体临床能力 4.1 不合格;4.2 合格;4.3 优秀

In this step, modifications are all conducted on the secondary items, including the sentence adjustment, the semantic conciseness, and the identification of semantic repetition or contradictory content. For example, the order of the secondary items 1.2 and 1.3 is adjusted, and the content of the secondary item 1.1-1.3 is also adjusted to avoid repetition and ambiguity. More details are listed in Appendix 1, Figure 2.

在这一步骤中,所有修改都在二级条目上进行,包括句子调整、语义精简以及识别语义重复或矛盾内容。例如,调整了二级条目1.2和1.3的顺序,并对1.1-1.3的内容进行了修改以避免重复和歧义。更多细节见附录1中的图2。

Results of the Third Round Inquiry. The third round of expert letter inquiry is discussed in the focus group based on the results of the second round. The scale is shown in Appendix 1, Figure 3. In this step, modifications are all conducted on the secondary items, including the supplement of the entry example content, the entry involving scene screening, the adjustment of the word order in the entry content, and merge similar secondary items. For example, “Other medical histories such as past history, personal history, allergy history, etc.” is revised to “Other medical history, such as past history (previous surgery), personal history, family history, allergy history, medication history, ongoing treatment, etc.”. More details are listed in Appendix 1, Figure 3.

第三轮专家函询结果。第三轮专家函询在第二轮结果基础上进行焦点小组讨论,具体量表见附录1图3。本轮修改均针对二级条目展开,包括补充条目示例内容、筛选条目涉及场景、调整条目内容语序以及合并相似二级条目。例如将"其他病史如既往史、个人史、过敏史等"修订为"其他病史,如既往史(既往手术史)、个人史、家族史、过敏史、用药史、正在接受的治疗等"。更多修改细节详见附录1图3。

Table 3 Results of the Cronbach’s Alpha coefficient before and after item deletions. Item deletion is conducted as randomly deleting items, which is designed to check the necessity of items.

Item#of ItemBeforedeletionAfterdeletion
MedicalInterviewing80.8750.812
Humanistic Care80.8650.821
Comprehensive Diagnosis and Treatment70.8690.807
Overall Clinical Competence30.8840.814
Total260.815

表 3: 项目删除前后Cronbach's Alpha系数结果。项目删除采用随机删除方式,旨在检验项目的必要性。

项目 项目数量 删除前 删除后
MedicalInterviewing 8 0.875 0.812
Humanistic Care 8 0.865 0.821
Comprehensive Diagnosis and Treatment 7 0.869 0.807
Overall Clinical Competence 3 0.884 0.814
Total 26 0.815

Table 4 Results of KMO and Bartlett’s test for sphericity.

KMO0.812
Bartlett's test for sphericityApproximate chi-square2343.197
Degrees of freedom325
Significance0.000

表 4: KMO和Bartlett球形检验结果

KMO 0.812
Bartlett球形检验 近似卡方 (Approximate chi-square) 2343.197
自由度 (Degrees of freedom) 325
显著性 (Significance) 0.000

Statistical Analysis of the Scale. We conducted a reliability analysis and a validity analysis to evaluate the necessity and sufficiency of each item in the LLM-specific Mini-CEX. In the reliability analysis, the Cronbach’s alpha coefficient of the whole scale and each primary item was calculated. As shown in Table 3, the Cronbach’s alpha coefficient of the whole scale is 0.802. Besides, the Cronbach’s alpha coefficient of each dimension in the scale is greater than 0.8. After deleting the entry of a certain dimension, the Cronbach’s alpha coefficient did not increase. In the validity analysis, we used the KMO test and Bartlett’s test for sphericity. The KMO value was 0.835, and the Bartlett’s test of sphericity was 2278.5 ( $P$ =0.000) (Table 4).

量表统计分析。我们对大语言模型专用迷你临床演练评估(LLM-specific Mini-CEX)各条目进行了信度分析和效度分析。信度分析中计算了总量表及各主条目的Cronbach's α系数。如表3所示,总量表的Cronbach's α系数为0.802,且量表各维度的Cronbach's α系数均大于0.8。删除某维度条目后,Cronbach's α系数未出现升高。效度分析采用KMO检验和Bartlett球形检验,KMO值为0.835,Bartlett球形检验值为2278.5 ($P$=0.000)(表4)。

2.2 Patient Simulator and MedEval

2.2 患者模拟器与医疗评估系统 (MedEval)

An example of interaction between the patient simulator and the LLM doctor PLUSE $_ 2$ is shown in Appendix Figure 2. In the sample, the self report is pre-given, then the doctor LLM converses with the patient simulator to provide diagnosis and treatment to the patient. As shown in the sample, the patient simulator performs well to response to the LLM doctor’s questions based on the given self report. Statistically, the average number of tokens in each patient simulator’s response is 18.95, respectively. The results show that the patient simulator provides detailed responses and tends to interact with the user for multiple turns.

患者模拟器与大语言模型医生PLUSE $_ 2$ 的交互示例如附录图2所示。在该示例中,预先给定自述症状后,医生大语言模型通过与患者模拟器对话为患者提供诊断和治疗。如示例所示,患者模拟器能根据给定的自述症状对大语言模型医生的问题作出良好回应。据统计,患者模拟器每次回应的平均token数为18.95个。结果表明,患者模拟器能提供详细回应,并倾向于与用户进行多轮交互。

An example of MedEval is shown in Figure 2. Scores on each secondary items are annotated by experts. Statistically, there are totally 1890, 300, 500 samples in training set, validation set, and test set, respectively. Besides, the average dialogue turn, the average number of tokens in each utterance are 15.70, 19.21, respectively. The statistical data shows that dialogue and utterances are long.

图 2 展示了 MedEval 的一个示例。每个二级项目的得分均由专家标注。据统计,训练集、验证集和测试集分别包含 1890、300 和 500 个样本。此外,平均对话轮次和每句话的平均 token 数分别为 15.70 和 19.21。统计数据显示对话和语句较长。

Table 5 Results of automatic evaluation on secondary items. Results on accuracy, precision, recall, F1 score are reported. The results are expressed as percentages $(%)$

Primary ItemSecondary ItemAccuracyPrecisionRecallF1
Medical Interviewing Skills1.183.4090.3188.7289.51
1.263.8047.6471.6057.21
1.352.4065.7370.6067.83
1.488.2091.4496.5393.69
1.568.4072.8487.0779.32
1.690.2097.6292.2394.85
1.787.009.72100.017.72
Humanistic Care1.871.8073.1395.8182.95
2.1100.0100.0100.0100.0
2.259.0095.2759.6273.34
2.3100.0100.0100.0100.0
2.475.0081.0290.3985.45
2.5 2.695.20 99.4099.79 100.095.3897.54
2.793.6094.7499.4099.70
Comprehensive Diagnosis2.868.80100.098.73 68.8096.69 81.52
3.173.00
3.257.6010.05 49.3576.00 73.5617.76 59.07
3.335.8034.0591.3349.61
3.445.8039.4687.0354.30
3.531.8020.5886.7333.27
3.640.6037.7086.3252.48
3.731.2020.6979.2532.81

表 5: 二级项目自动评估结果。报告了准确率 (accuracy)、精确率 (precision)、召回率 (recall) 和 F1 分数 (F1 score) 的结果。结果以百分比 $(%)$ 表示。

主项目 二级项目 准确率 精确率 召回率 F1
Medical Interviewing Skills 1.1 83.40 90.31 88.72 89.51
1.2 63.80 47.64 71.60 57.21
1.3 52.40 65.73 70.60 67.83
1.4 88.20 91.44 96.53 93.69
1.5 68.40 72.84 87.07 79.32
1.6 90.20 97.62 92.23 94.85
1.7 87.00 9.72 100.0 17.72
Humanistic Care 1.8 71.80 73.13 95.81 82.95
2.1 100.0 100.0 100.0 100.0
2.2 59.00 95.27 59.62 73.34
2.3 100.0 100.0 100.0 100.0
2.4 75.00 81.02 90.39 85.45
2.5 2.6 95.20 99.40 99.79 100.0 95.38 97.54
2.7 93.60 94.74 99.40 99.70
Comprehensive Diagnosis 2.8 68.80 100.0 98.73 68.80 96.69 81.52
3.1 73.00
3.2 57.60 10.05 49.35 76.00 73.56 17.76 59.07
3.3 35.80 34.05 91.33 49.61
3.4 45.80 39.46 87.03 54.30
3.5 31.80 20.58 86.73 33.27
3.6 40.60 37.70 86.32 52.48
3.7 31.20 20.69 79.25 32.81

Table 6 Results of human evaluation and automatic evaluation on different LLMs, including ChatGLM, ERNIE Bot, ChatGPT, and GPT-4. The score of each primary item is reported in percentage $(%)$ .

LLMsMedicalInterviewingHumanistic CareComprehensive Diagnosis and TreatmentAverage
HumanChatGPTHumanChatGPTHumanChatGPTHumanChatGPT
ChatGLM29.8650.0046.5365.9728.5714.2635.2744.69
Yiyan35.4250.0072.2271.5347.6214.2951.9346.62
ChatGPT35.4250.0077.7872.9247.6219.8453.8648.79
GPT-435.4251.3995.8380.5650.7919.8461.1151.93

表 6 不同大语言模型的人工评估与自动评估结果,包括ChatGLM、文心一言、ChatGPT和GPT-4。各主要项目得分以百分比 $(%)$ 形式呈现。

LLMs MedicalInterviewing Humanistic Care Comprehensive Diagnosis and Treatment Average
Human ChatGPT Human ChatGPT Human ChatGPT Human ChatGPT
ChatGLM 29.86 50.00 46.53 65.97 28.57 14.26 35.27 44.69
文心一言 35.42 50.00 72.22 71.53 47.62 14.29 51.93 46.62
ChatGPT 35.42 50.00 77.78 72.92 47.62 19.84 53.86 48.79
GPT-4 35.42 51.39 95.83 80.56 50.79 19.84 61.11 51.93

2.3 LLMs Evaluation Results

2.3 大语言模型评估结果

Results of Automatic Evaluation. The results of the automatic evaluation on the annotated test data are shown in Table 5. The accuracy of all secondary items from the 4 primary items are reported.

自动评估结果。标注测试数据的自动评估结果如表5所示。报告了4个主要项目中所有次级项目的准确率。

Fig. 2 An example in MedEval.

Dialogue
患者主诉:三天前打篮球的时候,到了无名指,无名指肿了有三天,也不敢活动里面的关节。想问一下 是不是骨折了。(男,18岁)(Patient self report:Three days ago,while playingbasketball,Ihit myringfinger. It has been swollen for three days, and I'm afraid to move the joint inside. I'm wondering if it's broken. (Male, 18 years old).) DoctorLLM:请问你觉得你的关节变形扭曲得厉害吗? (Doctor LLM: Do you think your joints are severely deformed and twisted?) 患者模拟器:整个无名指都肿了,有点看不出里面有没有扭曲。(Patientsimulator:Theentireringfingeris swollen, and it's a bit difficult tosee if there is any distortion inside.) DoctorLLM:疼痛的程度怎么样?你说不能活动里面的关节,是一活动就会疼得很厉害吗?(Doctor
LLM: What is the degree of pain? Do you say that you cannot move the joints inside, as it can cause severe pain?) 患者模拟器:是的,而且一活动里面的关节,就感觉里面什么东西在摩擦。(Patientsimulator:Yes,andas
soon as you move the joints inside, you feel something rubbing inside.) DoctorLLM:从你的描述,我觉得还是有骨折的迹象,建议你去医院拍个片子,确诊一下,然后根据骨 折的情况再定治疗方案。Doctor LLM:From your description,Ithink there are still signs of fractures.I
suggest you go to the hospital to take a film, diagnose it, and then decide on a treatment plan based on the condition of the fracture.)
Secondaryitemswithascore of 0
1.2,1.3,1.5,1.7,3.1,3.3,3.6,3.7 Secondaryitemswitha score of 1 1.1,1.4,1.6,1.8,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,3.2,3.4,3.5

图 2: MedEval 中的示例

对话
患者主诉:三天前打篮球的时候,撞到了无名指,无名指肿了有三天,也不敢活动里面的关节。想问一下是不是骨折了。(男,18岁)(Patient self report: Three days ago, while playing basketball, I hit my ring finger. It has been swollen for three days, and I'm afraid to move the joint inside. I'm wondering if it's broken. (Male, 18 years old).) Doctor LLM:请问你觉得你的关节变形扭曲得厉害吗?(Doctor LLM: Do you think your joints are severely deformed and twisted?) 患者模拟器:整个无名指都肿了,有点看不出里面有没有扭曲。(Patient simulator: The entire ring finger is swollen, and it's a bit difficult to see if there is any distortion inside.) Doctor LLM:疼痛的程度怎么样?你说不能活动里面的关节,是一活动就会疼得很厉害吗?(Doctor LLM: What is the degree of pain? Do you say that you cannot move the joints inside, as it can cause severe pain?) 患者模拟器:是的,而且一活动里面的关节,就感觉里面什么东西在摩擦。(Patient simulator: Yes, and as soon as you move the joints inside, you feel something rubbing inside.) Doctor LLM:从你的描述,我觉得还是有骨折的迹象,建议你去医院拍个片子,确诊一下,然后根据骨折的情况再定治疗方案。(Doctor LLM: From your description, I think there are still signs of fractures. I suggest you go to the hospital to take a film, diagnose it, and then decide on a treatment plan based on the condition of the fracture.)
评分为0的次要项目
1.2, 1.3, 1.5, 1.7, 3.1, 3.3, 3.6, 3.7
评分为1的次要项目
1.1, 1.4, 1.6, 1.8, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 3.2, 3.4, 3.5

First, among 26 secondary items, the accuracy of 11 secondary items are higher than 80%, which means that manual evaluation on these items can be replaced by automatic evaluation. These 11 secondary items include 1.1, 1.4, 1.6, 1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8. The rest secondary items need manual evaluation.

首先,在26个二级项目中,有11个二级项目的准确率超过80%,这意味着这些项目的人工评估可以被自动评估取代。这11个二级项目包括1.1、1.4、1.6、1.8、2.1、2.3、2.4、2.5、2.6、2.7、2.8。其余二级项目仍需人工评估。

Second, the results further show that accuracy on these items of humanistic care is all high. The reason is that judging humanistic care solely needs common sense, thus it is possible to achieve good results with small-scale annotated data. However, the accuracy of the comprehensive diagnosis and treatment abilities is low. The reason is that judging comprehensive diagnosis and treatment abilities need precise and comprehensive medical diagnosis knowledge, which is not sufficient in the small-scale annotated data.

其次,结果进一步显示人文关怀相关条目的准确率都很高。原因在于判断人文关怀仅需常识性知识,因此通过小规模标注数据即可取得良好效果。然而,综合诊疗能力的准确率较低,这是因为判断综合诊疗能力需要精准全面的医学诊断知识,而小规模标注数据中的这类知识并不充分。

LLMs results. Table 6 shows the results of human evaluation and automatic evaluation on different LLMs. We select 18 patient self-reports and the following questions are the same for each LLM. Four LLMs, ChatGLM, ERNIE Bot, ChatGPT, and GPT-4, are utilized in this work. The scores of medical interviewing skills, humanistic care, and comprehensive diagnosis and treatment abilities are 144, 144, and 126, respectively. The score of each primary items is reported in percentage ( $%$ ).

大语言模型结果。表6展示了不同大语言模型的人工评估与自动评估结果。我们选取了18份患者自述报告,针对每个大语言模型提出的问题保持一致。本研究使用了ChatGLM、文心一言、ChatGPT和GPT-4四种大语言模型。医学问诊技能、人文关怀及综合诊疗能力的满分分别为144分、144分和126分。各一级指标得分以百分比( $%$ )形式呈现。

On medical interviewing skills, ERNIE Bot [3], ChatGPT [1], and GPT-4 [2] achieve half of the total score. The reason is that these four LLMs hardly inquire about further patient conditions but give safe and conservative responses, thus achieving low scores. On humanistic care, GPT-4 achieves the best results, thanks to its care, sympathy and concern for patients. Besides, ERNIE Bot and ChatGPT present an ordinary performance. ChatGLM always refuses to answer medical queries, thus achieving low results. On comprehensive diagnosis and treatment abilities, these four LLMs achieve low results. The reason is that these four LLMs are LLMs in the general domain, thus under performing in the medical diagnosis scenes.

在医患问诊技能方面,ERNIE Bot [3]、ChatGPT [1] 和 GPT-4 [2] 仅获得总分的一半。原因是这四款大语言模型几乎不会进一步询问患者病情,而是给出安全保守的回应,因此得分较低。在人文关怀维度,GPT-4 凭借对患者的关怀、同情与关切表现最佳,ERNIE Bot 和 ChatGPT 表现平平,ChatGLM 则始终拒绝回答医疗咨询因而得分最低。在综合诊疗能力方面,四款大语言模型均表现欠佳,这是由于它们均属通用领域大语言模型,在医疗诊断场景中存在局限。

Comparing the results of human evaluation and the automatic evaluation, the relative performance of LLMs is consistent. Thus, the automatic evaluation could be utilized to evaluate the relative performance between two LLMs.

对比人工评估和自动评估的结果,大语言模型的相对表现是一致的。因此,自动评估可用于评估两个大语言模型之间的相对性能。

3 Discussion

3 讨论

Using LLMs in clinical practice enhances diagnosis and treatment efficiency, and studies have tried to evaluate the diagnostic ability of LLMs to that of physicians [21, 22]. However, the evaluations of LLMs’ performance were mainly based on the word-based metrics, which is biased evaluation metrics, as shown in Table 1. For example, researchers compare LLMs’ output with ground truths with ROUGE and BLEU [21, 23, 24], which ignore semantic information. To facilitate the application of LLMs in real clinical settings, we altered Mini-CEX, a scale used to judge diagnostic abilities of physicians, and built a comprehensive scale considering LLMs’ medical interviewing skills, humanistic care, and diagnosis ability. Despite of the effectiveness, the original Mini-CEX can not be directly applied to LLMs, due to a lack of physical observations, palpations, and access to laboratory testing in text-based LLM dialogues. Therefore, modifications are needed to Mini-CEX in order to evaluate LLMs’ diagnostic skills, including the removal of items related to physical examinations and laboratory tests as well as the simplification of complicated, repetitive, and ambiguous items. Specifically, an expert committee was established to draft the LLMs-specific Mini-CEX and the Delphi method was used for revision. Then, the reliability and validity analysis [18] are utilized to analyze the necessity and sufficiency of each item in the scale 3. Finally, the LLM-specific Mini-CEX is obtained, containing 4 primary items including medical interviewing skills, humanistic care, comprehensive diagnostic and treatment abilities, and overall clinical competence, with a total number of 26 secondary items. The reliability and validity of the scale were tested by Cronbach’s alpha coefficient, KMO test and Bartlett’s test for sphericity. The overall Cronbach’s alpha coefficient was 0.802, indicating great internal consistency of the scale. KMO value was 0.835 and the Bartlett’s test for sphericity was significant (0.000), suggesting that the items in the scale effectively measure variables. Overall, the LLM-specific Mini-CEX is suitable for evaluating LLMs’ diagnostic ability and can further promote the application of LLMs in medical treatment scenarios.

在临床实践中使用大语言模型(LLM)可提升诊疗效率,已有研究尝试评估大语言模型与医生的诊断能力对比[21, 22]。然而现有评估主要基于词汇级指标(如表1所示),这类有偏评估指标(如ROUGE和BLEU[21, 23, 24])仅对比模型输出与标准答案的字面匹配度,忽略了语义信息。为促进大语言模型在真实临床场景的应用,我们改良了用于评估医生诊断能力的Mini-CEX量表,构建了涵盖医学问诊技能、人文关怀和诊断能力的综合评价体系。

由于基于文本对话的大语言模型无法进行体格检查、触诊和实验室检测,原版Mini-CEX需进行以下调整:删除体格检查和实验室检测相关条目,简化复杂、重复和模糊的评估项。具体流程包括:专家委员会起草大语言模型专用Mini-CEX量表,采用德尔菲法进行修订,通过信效度分析[18]验证量表条目的必要性和充分性。最终形成的专用量表包含医学问诊技能、人文关怀、综合诊疗能力和整体临床胜任力4个一级维度,共26个二级条目。

量表信效度通过Cronbach's α系数、KMO检验和Bartlett球形检验验证:总体Cronbach's α系数为0.802,表明量表具有良好内部一致性;KMO值达0.835,Bartlett球形检验显著(0.000),证明量表能有效测量目标变量。该专用Mini-CEX量表适用于评估大语言模型的诊断能力,可进一步推动其在医疗场景的应用。

This work also made an attempt to utilized an automatic evaluation to evaluate LLMs automatically, thus reducing labor costs, improving evaluation efficiency, and reducing potential bias in human evaluations. Previous works on automatic evaluations focused on pairwise comparison [11, 14], aiming to judge the relative quality of two LLMs’ responses. However, the pairwise comparison can not provide single-answer grading, hindering the application of LLMs in medical treatment scenarios. To alleviate the issue, we developed a automatic evaluation to provide single-answer grading [12, 13] for LLMs’ responses. Specifically, we utilized ChatGPT to make an evaluation on

本研究还尝试采用自动评估方法对大语言模型进行自动化评测,从而降低人力成本、提升评估效率,并减少人工评估中可能存在的偏差。先前关于自动评估的研究主要集中于成对比较 [11, 14],旨在判断两个大语言模型生成答案的相对质量。然而,成对比较无法提供单一答案评分,限制了该技术在医疗场景中的应用。为解决这一问题,我们开发了一套能提供单一答案评分 [12, 13] 的自动评估系统。具体而言,我们利用ChatGPT对

LLM-specific Mini-CEX automatically with pre-defined prompts for each secondary items. To promote the research, we constructed a dataset including 2,680 medical consultation dialogues with scores on the 26 secondary items in the LLM-specific MiniCEX. Experimental results show that the automatic evaluation can replace experts on the secondary items of humanistic care and provides reproducible and automated comparisons between different LLMs. This work makes a pioneering exploration of automatic LLM evaluation and gains insight toward different evaluation criteria.

大语言模型专用Mini-CEX通过预定义的各项次级指标提示自动完成评估。为推动相关研究,我们构建了一个包含2,680段医疗问诊对话的数据集,并标注了LLM专用MiniCEX中26项次级指标的评分。实验结果表明,在人文关怀类次级指标上,自动评估可替代专家评审,并为不同大语言模型提供可复现的自动化对比。本研究开创性地探索了大语言模型自动评估方法,并深入解析了不同评估标准间的差异。

Despite the effectiveness, there are still some limitations. One of the main current limitations is that automatic evaluation achieves poor performance on the comprehensive diagnosis and treatment abilities, which originates from a lack of medical knowledge in the annotated data. To alleviate the issue, precise and comprehensive medical knowledge is needed. However, it is hard to obtain comprehensive large-scale medical knowledge in the form of natural language text and structured text. Besides, ChatGPT is utilized as the judge, which is a LLM in common domain, limiting the accuracy of prediction results. Alternatively, in further work, we will attempt to finetune the LLM-based automatic evaluation on large-scale medical knowledge data to strengthen their judgment ability in medical professional fields. Another limitation is that the current patient simulator in this work is clear about their own symptoms. However, in real scenarios, many patients have no clear cognition about their symptoms, which are more difficult for doctors to guide and inquire about patients’ hidden symptoms. To this end, a more personalized, diverse, complex patient simulator is noteworthy to evaluate LLMs’ diagnosis ability comprehensively and systematically. In further work, we will make an attempt to develop a more personalized, diverse, and complex patient simulator.

尽管效果显著,但仍存在一些局限性。当前主要限制之一是自动评估在综合诊疗能力上表现不佳,这源于标注数据中医学知识的匮乏。为缓解该问题,需要精确全面的医学知识。然而,以自然语言文本和结构化文本形式获取全面的大规模医学知识十分困难。此外,采用通用领域的大语言模型ChatGPT作为评判者,限制了预测结果的准确性。在后续工作中,我们将尝试基于大规模医学知识数据对大语言模型的自动评估进行微调,以增强其在医学专业领域的判断能力。另一局限在于当前患者模拟器对自身症状有清晰认知。但实际场景中,许多患者对症状缺乏明确认知,这使得医生更难引导和询问患者的潜在症状。为此,开发更具个性化、多样性和复杂性的患者模拟器,对于全面系统评估大语言模型的诊断能力具有重要意义。后续工作中,我们将尝试开发此类更先进的患者模拟器。

4 Method

4 方法

4.1 Scale Construction

4.1 规模构建

The Mini-CEX is developed by the American Board of Internal Medicine [17], translated into Mandarin, and revised by David et al.[25] It is designed to assess the examinee’s clinical skills, attitudes, and behaviors in real clinical settings [26]. The Mini-CEX contains six primary items, including medical interviewing skills, physical examination skills, professionalism, counselling skills, clinical judgement, and organisation/efficiency.

迷你临床演练评估(Mini-CEX)由美国内科学委员会[17]开发,经David等人[25]翻译为中文并修订。该工具用于评估受试者在真实临床环境中的临床技能、态度和行为[26],包含六大核心项目:问诊技巧、体格检查技能、职业素养、沟通能力、临床判断力及组织效率。

To conduct a comprehensive study of LLMs’ capability in medical consultation scenes, we propose to construct an LLM-specific Mini-CEX based on the primary Mini-CEX. The LLM-specific Mini-CEX construction is divided into two stages, the literature review and the focus group research. The process is shown in Figure 1.

为了全面研究大语言模型(LLM)在医疗咨询场景中的能力,我们提出基于初级Mini-CEX构建一个专用于大语言模型的Mini-CEX。LLM专用Mini-CEX的构建分为两个阶段:文献综述和焦点小组研究。该过程如图1所示。

Literature Review. For a comprehensive understanding of the application and modification of Mini-CEX in the medical domain, literature research is adapted first. Literature related to Mini-CEX in the field of medical diagnosis and treatment is collected first. We search CNKI, Wanfang data, and CQVIP for Chinese journals and we search PubMed, Embase, Web of Science, and Google scholar for English journals and conferences. Then, primary Mini-CEX [25] are obtained from these literature reviews.

文献综述。为全面了解Mini-CEX在医学领域的应用与改良,首先进行文献调研。先收集与Mini-CEX在医疗诊疗领域相关的中外文献,中文文献检索CNKI、万方数据和维普数据库,英文文献及会议论文检索PubMed、Embase、Web of Science和Google Scholar。随后从这些文献综述中获取原始Mini-CEX评估表[25]。


Fig. 3 The process of the automatic evaluation.


图 3: 自动评估流程

Focus Group Research. As LLMs-dominated medical consultations are different from patient-doctor consultations, a focus group is established to draft and review the LLMs-specific Mini-CEX according to the primary Mini-CEX. The focus group consists of 20 senior doctors and 14 intermediate doctors. Then, the focus group recommended a list of inquiry experts. The focus group conducts three turns of experts consultations to collect and summarize opinions from experts and give feedback for the modification of the draft scale according to the Delphi method.

焦点小组研究。由于大语言模型主导的医疗咨询不同于传统的医患咨询,我们组建了一个焦点小组,根据基础版Mini-CEX制定并审核大语言模型专用版Mini-CEX。该小组由20名高级医师和14名中级医师组成。随后,焦点小组推选出了咨询专家名单,并依照德尔菲法(Delphi method)开展三轮专家咨询,收集汇总专家意见,为量表草案的修订提供反馈。

The Delphi method is utilized with letters of inquiry to experts. The Delphi method uses an anonymous method and uses expert experience to express the differences between different experts, avoiding the interference of echoes and authority in traditional meetings, and is a relatively scientific investigation method that takes experts as the object of information requests. The Delphi method aims to identify and align different opinions among experts, avoiding echo and authority interference in traditional meetings in an anonymous manner. Its main features are anonymity, feedback and statistics [27].

德尔菲法(Delphi method)通过专家咨询函开展工作。该方法采用匿名方式,借助专家经验表达不同专家间的意见分歧,避免了传统会议中回声效应和权威干扰,是一种以专家为信息征询对象的相对科学的调查方法。德尔菲法旨在识别并协调专家间的不同观点,以匿名方式规避传统会议中的回声效应和权威干扰。其主要特征包括匿名性、反馈性和统计性 [27]。

Specifically, the Delphi method consists of drawing up questionnaires, collecting and summarizing opinions from expert opinions. Then, these opinions are fed back to the experts. The above consultation and feedback process is iterated multiple times. Finally, a consistent Mini-CEX is obtained [28, 29]. The steps to implement the Delphi method are as follows.

具体而言,德尔菲法 (Delphi method) 的流程包括:设计问卷、收集并汇总专家意见,随后将这些意见反馈给专家。上述咨询与反馈流程会进行多轮迭代,最终获得一致的Mini-CEX评估表 [28, 29]。该方法实施步骤如下:

• The opinions are summarized, collated and analyzed. Then, the second round of questionnaires is formed after the focus group discussion. Next, the experts are required to revise their own opinions, with two to four rounds. • Statistical analysis of survey results is conducted, and the investigation conclusion is formed.

• 汇总、整理和分析专家意见,经焦点小组讨论后形成第二轮问卷。随后要求专家修改各自观点,进行2至4轮迭代。
• 对调查结果进行统计分析,形成最终研究结论。

The construction of our scale is discussed through three rounds of expert consultations. The details are listed in Appendix.

我们通过三轮专家咨询构建了量表,具体细节见附录。

In the first round of expert consultation, experts give comments on items in the scale and fill out questionnaires. Among the returned questionnaires, if an expert’s familiarity with the content of the questionnaire is “less familiar”, “very unfamiliar”, or “generally familiar”, the expert’s inquiry will be canceled in the next round. Besides, if the importance of the item is scored as “Unimportant”, those questioned items will be modified or deleted after discussion in the focus group.

在首轮专家咨询中,专家对量表条目提出意见并填写问卷。回收问卷中,若专家对问卷内容的熟悉程度为"不太熟悉""非常不熟悉"或"一般熟悉",该专家将在下一轮咨询中被取消资格。此外,若条目重要性评分为"不重要",相关条目将在焦点小组讨论后进行修改或删除。

The second round and the third round of expert consultations are based on the opinions from the previous round of inquiry. We repeated the implementation of the previous round of expert consultation and the items is revised after the focus group discussion.

第二轮和第三轮专家咨询基于上一轮征询意见,我们重复实施上一轮专家咨询,并在焦点小组讨论后修订项目。

4.2 Statistical Analysis

4.2 统计分析

After three rounds of expert consultations, the scale contains 4 primary items and 26 secondary items. Statistical methods are utilized to analyze the rationality of the scale. The 5-point Likert scale [30] was adopted, including strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree. The Mini-CEX evaluation scale includes three parts as follows.

经过三轮专家咨询,该量表包含4个一级项目和26个二级项目。采用统计方法分析量表的合理性。采用李克特5级量表 [30],包括非常不同意、不同意、既不同意也不反对、同意和非常同意。Mini-CEX评估量表包括以下三个部分。

In this study, 210 experts were sought for evaluation. For evaluate the necessity and sufficiency of each item, the reliability and validity analysis was utilized.

本研究邀请了210位专家进行评估。采用信效度分析检验各条目的必要性和充分性。

Reliability Analysis. In this study, Cronbach’s alpha was used to evaluate the internal consistency of the obtained scale, which mainly counts the entire scale, different items and the Cronbach’s alpha coefficient after deleting a specific item. If the coefficient is above 0.8, the reliability of the scale is very good. If the reliability coefficient is above 0.7, it is acceptable. If it is above 0.6, the scale should be revised, but there is still some valid content. If it is below 0.6, redesigning the items is essential.

信度分析。本研究采用Cronbach's alpha系数评估所获量表的内部一致性,主要统计整个量表、不同条目及删除特定条目后的Cronbach's alpha系数。若系数高于0.8,则量表信度非常好;信度系数高于0.7时可接受;高于0.6时需修订量表但仍保留部分有效内容;低于0.6则必须重新设计条目。

Validity Analysis. Validity refers to how accurately and effectively a scale measures what it is intended to measure. In this work, validity analysis from the aspects of content and structure is conducted. Specifically, the validity analysis of content is to explain the rationality and scientifi city of the questionnaire. Structure validity analysis refers to the correspondence between measurement primary items and secondary items.

效度分析。效度指量表准确有效测量目标特征的程度。本研究从内容和结构两方面进行效度分析:内容效度分析旨在说明问卷设计的合理性与科学性;结构效度分析则检验测量主项与子项之间的对应关系。

Correspondingly, the Kaiser-Meyer-Olkin (KMO) test statistic [31] and Bartlett’s test for sphericity [32] are commonly utilized.

相应地,Kaiser-Meyer-Olkin (KMO) 检验统计量 [31] 和 Bartlett 球形检验 [32] 被广泛使用。

The KMO test is a measure used to compare the simple correlation coefficient and partial correlation coefficient between variables. The lower the proportion, the more suited data is to factor analysis. The KMO test returns values between 0 and 1. A rule of thumb for interpreting the statistic:

KMO检验是一种用于比较变量间简单相关系数与偏相关系数的指标。该比例越低,数据越适合进行因子分析。KMO检验结果取值范围为0到1。经验判断标准如下:

The Bartlett’s test of sphericity tests whether the correlation coefficients are all 0. The test computes the probability that the correlation matrix has significant correlations among at least some of the variables in a dataset, which is a prerequisite for factor analysis to work. The corresponding $P$ value must be less than 0.05 to pass Barth’s spherical test.

Bartlett球形检验用于测试相关系数是否全部为0。该检验计算了数据集中至少部分变量间存在显著相关性的概率,这是因子分析有效的前提条件。对应的$P$值必须小于0.05才能通过Bartlett球形检验。

After the statistical analysis, the final LLM-specific Mini-CEX scale is obtained. The primary items include medical interviewing skills, humanistic care, comprehensive diagnostic and treatment abilities, and overall clinical competence. There are a total number of 26 secondary items, including 8 items for medical interviewing skills, 8 items for humanistic care, 7 items for comprehensive diagnostic and treatment abilities, and 3 items for overall clinical competence. The details of the secondary items are listed in Table 2.

统计分析后,最终得到大语言模型(LLM)专用的Mini-CEX量表。主要项目包括医患沟通技能、人文关怀、综合诊疗能力和整体临床能力。二级项目共计26项,其中医患沟通技能8项、人文关怀8项、综合诊疗能力7项、整体临床能力3项。具体二级项目详见表2:

4.3 Patient Simulator

4.3 病人模拟器

To collect LLM-patient conversations, some annotators are required to act as “patients” to converse with LLMs, according to pre-defined patient portraits, which is labor-intensive and time-consuming. In this work, we propose to develop a patient simulator to converse with LLMs to obtain conversations automatically.

为收集大语言模型(LLM)与患者的对话,传统方法需要标注人员根据预设患者画像扮演"患者"角色进行对话,这种方式人力成本高且耗时。本研究提出开发患者模拟器来自动生成与大语言模型的对话。

Patient simulators [33–36] have demonstrated the feasibility to teach medical students, interns, and residents some of the manual skills they must learn. In this work, we trained a patient simulator to collect LLM-patient dialogue automatically.

患者模拟器 [33-36] 已证明可用于教授医学生、实习生和住院医师部分必须掌握的手动操作技能。本研究中,我们训练了一个患者模拟器来自动收集大语言模型 (LLM) 与患者的对话数据。

Specifically, 120,000 doctor-patient conversations from Chunyu Doctor $^4$ were utilized to train patient simulator. In a dialogue, the first turn in a dialogue till one doctor response are treated as a dialogue history $x$ , and the following patient utterance is regarded as the utterance $y$ to be generated. BLOOM [37] with 7 billion parameters, denoted as $f$ , is utilized as the generation model in this work. Language models define probability distributions over sequences of tokens. Given a sequence $x_ {1},\ldots,x_ {n}$ , where $n$ is the input sequence length, the standard way to model its probability is via

具体而言,我们利用来自春雨医生的12万条医患对话数据$^4$训练患者模拟器。在对话中,将首轮对话至医生首次回应前的所有内容视为对话历史$x$,随后的患者话语则作为待生成话语$y$。本研究采用70亿参数的BLOOM模型[37](记作$f$)作为生成模型。语言模型通过token序列定义概率分布。给定序列$x_ {1},\ldots,x_ {n}$(其中$n$为输入序列长度),其概率的标准建模方式为

next-token prediction,

下一个token预测

$$
p(\boldsymbol x_ {1},\dots,\boldsymbol x_ {n})=\prod_ {i=1}^{n}p(\boldsymbol x_ {i}|\boldsymbol x_ {<i}),
$$

$$
p(\boldsymbol x_ {1},\dots,\boldsymbol x_ {n})=\prod_ {i=1}^{n}p(\boldsymbol x_ {i}|\boldsymbol x_ {<i}),
$$

where $x_ {<i}:=x_ {1},\ldots,x_ {i-1}$ is the sequence of tokens preceding $x_ {i}$ , also referred to as its prefix. This auto regressive model is usually implemented via a learned transformer network. The generation process is formulated as following,

其中 $x_ {<i}:=x_ {1},\ldots,x_ {i-1}$ 是位于 $x_ {i}$ 之前的token序列,也称为其前缀。这种自回归模型通常通过训练好的Transformer网络实现。生成过程表述如下:

$$
x_ {i}=\operatorname* {argmax}_ {x_ {i}\in\mathcal{V}}p(x_ {1},\dots,x_ {i}),
$$

$$
x_ {i}=\operatorname* {argmax}_ {x_ {i}\in\mathcal{V}}p(x_ {1},\dots,x_ {i}),
$$

where $\nu$ is the vocabulary. The generation is processed in an auto regressive manner. Please note that, an ending symbol is added at the end of each conversation to indicate that the current conversation is ended.

其中 $\nu$ 是词汇表。生成过程以自回归方式进行。请注意,每个对话末尾会添加一个结束符号,表示当前对话已结束。

4.4 MedEval

4.4 MedEval

The medical judging dataset MedEval is constructed as follows.

医疗评估数据集MedEval的构建流程如下。

First, selecting diagnosis dialogues is conducted. A medical diagnosis LLM is trained on a large amount of unlabelled medical diagnosis data. For each dialogue, a patient self report is given, which is randomly constructed with templates. Then, the patient simulator plays the role of patients to converse with LLMs to collect diagnosis dialogue automatically. Several strategies are utilized to improve the data quality. First, for collecting both positive and negative dialogues, we utilize different temperatures in medical LLMs when generating. This strategy aims to balance positive and negative samples, thus beneficial for training deep learning models. Second, for ethical concerns, specific regular expressions for coarse-grained filtering are employed to remove privacy. To delete those dialogues containing patients’ privacy, regular expressions, such as “My name is ...”, are utilized. Third, for data annotation, the trial annotation and the formal annotation are conducted, sequentially.

首先进行诊断对话的筛选。通过在大规模未标注的医疗诊断数据上训练医疗诊断大语言模型。针对每个对话,系统会提供由模板随机生成的病人自述。随后,病人模拟器扮演患者角色与大语言模型进行自动对话以收集诊断对话。为提高数据质量,我们采用了以下策略:第一,为同时收集正负样本对话,在医疗大语言模型生成时采用不同温度参数。该策略旨在平衡正负样本比例,从而有利于深度学习模型的训练。第二,出于伦理考量,使用特定正则表达式进行粗粒度过滤以去除隐私信息。例如采用"My name is..."等正则表达式来删除包含患者隐私的对话。第三,在数据标注环节,依次执行试标注和正式标注两个阶段。

To ensure the high quality of dialogues, trial annotation is conducted. In the trial annotation stage, three crowd sourcing teams are selected for trial annotation. There are mainly two advantages. (1) Trial annotation helps select a reliable annotation team. (2) The trial annotation helps the annotation team get familiar with the annotation task. Lastly, the team achieving the best performance in the trial annotation is selected for the formal annotation.

为确保对话的高质量,会进行试标注。在试标注阶段,会挑选三个众包团队进行试标注。主要有两个优势:(1) 试标注有助于筛选可靠的标注团队。(2) 试标注能帮助标注团队熟悉标注任务。最终,选择在试标注中表现最佳的团队进行正式标注。

After the trial annotation, the formal annotation is conducted. In the formal annotation, to ensure data quality, the fine-grained privacy removing, skipping option, and quality audit and re-annotating mechanisms are employed.

在试验标注之后,进行正式标注。正式标注中,为确保数据质量,采用了细粒度隐私去除、跳过选项以及质量审核与重新标注机制。

In the formal data annotation process, experts are required to score each dialogue based on the LLM-specific Mini-CEX scale. Finally, totally 2,680 medical consultation dialogues are collected, 1,880 samples for training, 300 samples for validation, and 500 sample s for testing.

在正式的数据标注过程中,专家需要根据针对大语言模型定制的Mini-CEX量表对每个对话进行评分。最终共收集了2,680个医疗咨询对话,其中1,880个样本用于训练,300个样本用于验证,500个样本用于测试。

4.5 Automatic Evaluation

4.5 自动评估

The evaluation of diagnostic LLMs based on the LLM-specific Mini-CEX scale requires manual evaluation, which is labor-intensive and time-consuming. To alleviate the issue, we propose automatic evaluation to evaluate LLMs automatically. To meet the need for a robust and scalable automated method to evaluate medical LLM alignment with human preferences, ChatGPT is utilized as the judge in this work.

基于大语言模型专用Mini-CEX量表对诊断型大语言模型的评估需要人工参与,耗时耗力。为解决这一问题,我们提出自动评估方法来实现大语言模型的自动化评测。为满足对稳健、可扩展的自动化医疗大语言模型对齐评估需求,本研究采用ChatGPT作为评判工具。

Specifically, specific prompts are utilized for each secondary item. For example, the prompt for the secondary item “Does the doctor express empathy to the patient in the following dialogue, such as, I am very worried about hearing about your condition. The patient-doctor dialogue is as follows.” Then, the ChatGPT gives feedback. If the feedback is “not”, then the label of the dialogue on the secondary item is set as “0”, otherwise “1”.

具体而言,每个二级项目都使用了特定提示词。例如,针对二级项目"医生在以下对话中是否对患者表现出同理心,比如'听到您的病情我非常担忧'。医患对话如下"的提示词。随后,ChatGPT会给出反馈。若反馈为"否",则该对话在此二级项目上的标签设为"0",反之为"1"。

References

参考文献

[37] Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilic, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., Gall, M., et al.: Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022)

[37] Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilic, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., Gall, M., 等: Bloom: 一个1760亿参数的开源多语言大语言模型. arXiv预印本 arXiv:2211.05100 (2022)

unN berItem ContentExpert OpinionAdoptionModifyItem after Modifieation
Medical Interviewing Skill1. Revise to * accurately inquire about the patient's chief complaint, present medical history information such asReserveMedical Interviewing Skill
1.1Condut disease-elated inquiris aurately, inding ete." past history, medication history, comorbidities, et. according to the patients self reportsaccompanying symptoms and differential symptoms, and also inquire about past history, pesnal history, allrgy history conduct relevant inquiries about the condition accurately, including the cause of the nset, te nature, degree, time and regularity of occurrence of symptoms, aggravating or alleviating factors, accompanying symptoms, diagnosis and treatment, other relevant medical history, ete."Adopt SuggestionReviseFocusing on the patient's main complaint, make accurate inquiries about the condition, including the case f the disease, the natre, degree, time and regularity of symptom, aggravating or alleviating factors, accompanying symptoms, diagnosis and treatment, and other medical history such as past history, peronal history, allergy history, c.
1.2Ask the patient with open-ended questions, ncouraging the patient toseak uasappropriatest dn xeds o sued oq uemoou suosonb popu-do qm ued oq xs appropriate
1.3Guide patients to obtain correct and suffcient disease- related informationAmended to "Guiding patients to provide* 1.RevisdtreadUe plin lnguageanavodmdilAdoptReviseGuide patients to provide corrct and sufficient disease-related information
1.4Use words that patients understand and avoid medical jargonterminology, such as tenesmus, epistaxis, ete." 2. Revised to read "Avoid unintelligible medical terminology* Adopt AllReviseUse plain language and avoid difficult medical terms such as tenesmus, epistaxis, etc.
1.5Explain to the patient the rationale or purpose of retuming to the visit, treatmentExplain to the patient the rationale or purpose of retuning to the visit,treatment
1.6Avoid providing medically harmful information to patients, such as unsubstantiated health infomation, etc. Make effective judgments and appropriate responses toAvoid providing medically harmful information to patients, such as unsubstantiated health information, etc.
1.7medical emergencies that occur in patients, such as timely informing the risk of hypoglycemia and sd sd sugar is lower than 3.2mmol/LReserveMake ffctive judgments and appropriate responses to medical mergencies that occur in patients, such as timely informing the risk of hypoglycemia and emergency treatment plans when the patient's blood sugar is lower than 3.2mmol/L
1.8For diseases that require a differential diagnosis, the patient will be actively asked and encouraged to state sufficient information to help identify the diseaseReserveFor diseases that require a differential diagnosis, th patient wil be actively aske ad encouraged to state sufficient information to help identify the disease
1.9 2.0Inquiries are clear and focused Humanistie qualiesReserve ReserveInquiries are clear and focused Humanistic qualities
2.1Demonstraterest sensitivityadempahyrin consultations and communications Effective patient conversation skills include: opening the conversation, exploring the issue, understanding theReserveDemonstrate respect, sensitivity and empathy during consultations and communications Effective patient conversation skills include: starting the conversation, exploring the
2.2patient's perspective, organizing the discussion, building a trusting relatioship with the patient, facilitating usop pue “Buuud uudx“uoedued ued the discussionpue duqsuoeau usged-ropoop poo e qsqesg, o papuoury gain the trust of patients"AdoptReviseissue, uderstanding th patien's persective, organizing the diseussion, uildin a good doctor-patient relationhip and gaining the patient's tust, facilitating ptiet participation, explaining, planning, and closing the diseussion
2.3Reasonably respond to the patient's emotional performance during the conversation, such as appropriate emotional reassuranceReserveReasonably respond to the patient's emotional performance during the conversation, such as appropriate emotional reassurance
2.4making Valuing patients individual wishes in medical decision- Avoid expressing opinions that are biased againstReserveValuing patients individual wishes in medical decision-making
2.5 2.6certain groups of people Be patient when explaining problems to patientsReserve ReserveAvoid expressing opinions that are biased against certain groups of people Be patient when explaining problems to patients
2.7pue aeudondde usn sued m seunuuo appropriate languageReserveCommunicate with patients using appropriate and appropriate language
2.8ju uqm enad used noqe uunbu piony necessaryReserveressu ou uq oed uged noqe uunbu po
2.9Reasonable handing of patients' non-medical appeals during dialogueReserveReasonable handling of patients' non-medical appeals during dialogue
3.0Comprehensive diagnosis and treatment abilities of cliniciansAmended to “Diagnostic Reasoning Ability”Not AdoptReserveComprehensive diagnosis and treatment abilities of clinicians
3.1Identify the primary and secondary content of the information provided by the patient, and respond reasonably. For example, among the main complaints providedby the atient, the infomation that is helpful for judging the condition should be focused on andAmended to read “Ability to identify chief complaint”Not AdoptReserveIdentify the primary and secondary content of the information providedby the patient, and respond reasonably.For example for the patient's main complaint, you should focus on and inquire about the information that is helpful for judging the condition;
inquired about; , if the patient is more concemed about this, a reasonable explanation should be provided Recognize whether the information provided by the patient is credible and provide further verification if1. Amended to "whether the information is reliable" 2. Amended to “but the temperature measurement is notIdentify whether the information provided by the patient is reliable and provide further verification if necessary, e.g. if the patient states that he or she has a fever but does not
3.3necessary, e.g. the patient states that he or she has a fever but does not ake a temperature measurement Combine the information provided by the patient to arrive at a reasonable initial diagnosis and differentialperfomed, the infomation is not reliable (posodsns) sisouep Kreuuoad, o1 popuouyhave a temperature measurement, the information is not reliable Combine infomation provided by the patint o arrive at a reasonable initial suspet) and differential diagnosis
diagnosis Propose an appropriate diagnostic protocol, that is, the diagnostic protocol should inelude necessary inspection1. Amend to * Consider the patient's condition and personal wishes, propose an appropriate diagnosis plan 2. Amend to “In the case of insufficient diagnostic evidence,Consider the patient's condition and personal wishes, and propse an appropriate plan for definite diagnosis when the diagnosis basis is insuficient. The plan should include
methods and not include irlevant inspection methods Correctly explain to the patient the reasons for thepropose an appropriate clear diagnosis plan" 3. Amend to * Help to confirm the diagnosis means and does not include extraneous means"necessary means that help to confim the diagnosis and do not include irelevant means Correcly explain to the patient the reasons for the proposed diagnosis
3.6proposed diagnosis Correctly explain to patients the basis for diagnosis and exclusionAmended to “Bases for excluding other differential diagnoses"AdoptReviseCorrectly explain to patients the basis for the diagnosis of the disease and the basis for excluding other differential diagnoses
3.7Propose the best treatment plan considering the patient's condition and personal wishesRevised to “propose the best treatment plan based on the patient's condition and diagnosis and treatment norms* AdoptReviseConsidering the patient's condition and personal wishes, combined with the patient's conditionanddiagosis andtreaet oms propsethbest reatn n
8Correctly explain to patients the reasons for proposed treatment optionsReserveCorrectly explain to patients the reasons for proposed treatment options
3.9Make necessary reminders and suggestions to patients ns s pou od uReserveMake necessary reminders and suggestions to patients on certain important medical
reminders of drug incompatibilityconcerns, such as reminders of drug incompatibility
Overall linicalcompetence
DissatisfiedReserve Dissatisfed
4.2SatisfactoryReserveSatisfactory
4.1
ReserveOverall linicalcompetence
4.0
unN ber Item Content Expert Opinion Adoption Modify Item after Modifieation
问诊技能 1. 修改为"准确询问患者主诉、现病史信息,如" 保留 问诊技能
1.1 准确进行疾病相关问询,包括"既往史、用药史、合并症等,根据患者自述 伴随症状和鉴别症状,并询问既往史、个人史、过敏史。准确进行病情相关问询,包括"发病原因、症状性质、程度、发生时间及规律、加重或缓解因素、伴随症状、诊疗经过、其他相关病史等。" 采纳建议 修改 围绕患者主诉,准确询问病情,包括"发病原因、症状性质、程度、发生时间及规律、加重或缓解因素、伴随症状、诊疗经过,以及既往史、个人史、过敏史等其他病史。"
1.2 采用开放式问题询问患者,鼓励患者适当表达 st dn xeds o sued oq uemoou suosonb popu-do qm ued oq xs appropriate
1.3 引导患者获取正确且充分的疾病相关信息 修改为"引导患者提供* 1.使用通俗语言并避免医疗" 采纳 修改 引导患者提供正确且充分的疾病相关信息
1.4 使用患者能理解的词语,避免医学术语 "术语,如里急后重、鼻衄等。" 2. 修改为"避免使用难以理解的医学术语* " 全部采纳 修改 使用通俗语言,避免使用里急后重、鼻衄等难懂的医学术语
1.5 向患者解释复诊、治疗的理由或目的 向患者解释复诊、治疗的理由或目的
1.6 避免向患者提供有害的医疗信息,如未经证实的健康信息等。对患者出现的医疗紧急情况做出有效判断和适当应对 避免向患者提供有害的医疗信息,如未经证实的健康信息等。
1.7 患者出现的医疗紧急情况,如及时告知低血糖风险及血糖低于3.2mmol/L时的紧急处理方案 保留 对患者出现的医疗紧急情况做出有效判断和适当应对,如及时告知低血糖风险及血糖低于3.2mmol/L时的紧急处理方案
1.8 对于需要鉴别诊断的疾病,主动询问并鼓励患者陈述足够信息以帮助识别疾病 保留 对于需要鉴别诊断的疾病,主动询问并鼓励患者陈述足够信息以帮助识别疾病
1.9 2.0 问诊清晰且有重点 人文素养 保留 保留 问诊清晰且有重点 人文素养
2.1 在咨询和沟通中表现出尊重、敏感性和同理心 有效的患者对话技巧包括:开启对话、探讨问题、理解患者的观点、组织讨论、与患者建立信任关系、促进讨论 保留 在咨询和沟通中表现出尊重、敏感性和同理心 有效的患者对话技巧包括:开启对话、探讨问题、理解患者的观点、组织讨论、与患者建立信任关系、促进讨论
2.2 患者参与、解释、计划和结束讨论 pue duqsuoeau usged-ropoop poo e qsqesg, o papuoury 获得患者信任" 采纳 修改 问题、理解患者观点、组织讨论、建立良好医患关系并获得患者信任、促进患者参与、解释、计划和结束讨论
2.3 在对话中合理回应患者的情感表现,如适当的情感安抚 保留 在对话中合理回应患者的情感表现,如适当的情感安抚
2.4 在医疗决策中重视患者的个人意愿 避免表达对特定人群的偏见观点 保留 在医疗决策中重视患者的个人意愿
2.5 2.6 向患者解释问题时保持耐心 保留 保留 避免表达对特定人群的偏见观点 向患者解释问题时保持耐心
2.7 pue aeudondde usn sued m seunuuo 适当的语言 保留 使用恰当且适当的语言与患者沟通
2.8 ju uqm enad used noqe uunbu piony 必要时 保留 ressu ou uq oed uged noqe uunbu po
2.9 在对话中合理处理患者的非医疗诉求 保留 在对话中合理处理患者的非医疗诉求
3.0 临床医生的综合诊疗能力 修改为"诊断推理能力" 不采纳 保留 临床医生的综合诊疗能力
3.1 识别患者提供信息的主次内容,并合理回应。例如,在患者提供的主诉中,应重点关注并询问有助于判断病情的信息;如果患者更关注这一点,应提供合理解释。识别患者提供的信息是否可信,必要时进一步验证 1. 修改为"信息是否可靠" 2. 修改为"但未测量体温" 识别患者提供信息的主次内容,并合理回应。例如针对患者主诉,应重点关注并询问有助于判断病情的信息;识别患者提供的信息是否可靠,必要时进一步验证,例如患者自称发热但未测量体温,则该信息不可靠。结合患者提供的信息得出合理的初步怀疑和鉴别诊断
3.3 例如患者自称发热但未测量体温,则该信息不可靠。结合患者提供的信息得出合理的初步诊断和鉴别诊断 1. 修改为"考虑患者病情和个人意愿,提出适当的诊断计划" 2. 修改为"在诊断依据不足的情况下,提出适当的明确诊断计划" 3. 修改为"有助于确诊的手段,不包括无关手段" 考虑患者病情和个人意愿,在诊断依据不足时提出适当的明确诊断计划。该计划应包括有助于确诊的必要手段,不包括无关手段。正确向患者解释所提诊断的理由
3.6 向患者正确解释诊断依据和排除依据 修改为"排除其他鉴别诊断的依据" 采纳 修改 向患者正确解释疾病诊断依据和排除其他鉴别诊断的依据
3.7 考虑患者病情和个人意愿,提出最佳治疗方案 修改为"根据患者病情和诊疗规范提出最佳治疗方案" 采纳 修改 考虑患者病情和个人意愿,结合患者病情和诊疗规范提出最佳治疗方案
8 向患者正确解释所提治疗方案的依据 保留 向患者正确解释所提治疗方案的依据
3.9 对患者做出必要的提醒和建议 ns s pou od u 保留 对某些重要医疗问题向患者做出必要的提醒和建议,如药物配伍禁忌提醒
药物配伍禁忌提醒
整体临床能力
不满意 保留 不满意
4.2 满意 保留 满意
4.1
保留 整体临床能力
4.0
unN berItem ContentExpert OpinionAdoptionModifyItem after Modification
Medical Interviewing Skill1. Revise to “accurately inquire about the patient's chief complaint, present medical history information such asReserveMedical Interviewing Skill
1.1Conduet disease-related inquiries accurately, including past history,medication history, comorbidities, et. according to the patients' self reportsose pusuodus up pue sudsuuedue inquire about past hstory, personal history, allrgyhistoy etc." 2. Revise to “According to the patient's chief complaint, conduct relevant inquiries about the condition accurately, including the cause f the nset, the nature, degree tm a regularity of occurrence of symptoms, aggravating or pue ssousep suodus Susueduooe saoe SueeAdopt Suggestion 2ReviseFocusing on the patient's main complaint, make accurate inquiries about the condition, incluing the cause ofthe disease, the natre, degre, me and regularity ofsymptoms, aggravating or alleviating factors, accompanying symptoms, diagnosis and treatment, and other medical history such as past history, personal history, allergy history, tc.
1.2Ask the patient with open-ended questions, encouraging the patient to seak upasappropriatetreatment, other relevant medical history, etc.Ask the patient with open-ended questions, encouraging the patient to speak up as appropriate
1.3Guide patients to obtain correct and sufficient disease- relatedinformationapAoud o susged Supn, o papuouy 1. Revisedto readUe plan lnguage andvoid edicalAdoptReviseGuide patients to provide corect and sufficient disease-related information
1.4Use words that patients understand and avoid medical uo&neterminology, such as tenesmus, epistaxis, etc." 2. Revised to read "Avoid unintelligible medical terminology”Adopt AllReviseUse plain language and avoid dificult medical terms such as tenesmus, epistaxis, et.
1.5Explain to the patient the rationale or purpose of returning to the visit, reatmentExplain to the patient the rationale or purpose of returning to the visit, treatment
1.6Avoid providing medically harmful infomation to patients, uch as unsubstantiated health infomation, tc. Make effctive judgments and appropriateresponses tAvoid providing medically harmful information to patients, such as unsubstantiated health information, etc.
1.7medical emergencies that occur in patients, such as timely informing the risk of hypoglycemia and emergency treatment plans when the patient's blood sugar is lower than 3.2mmol/LReserveMake ffective judgments and appropriate responses to medical emergencies that ccur in patients, such as timely informing the risk of hypoglycemia and emergency treatment plans when the patient's blood sugar is lower than 3.2mmol/L
1.8For diseases that require a diferential diagnosis, the as o pmou pue pse qm ed sufficient information to help identify the diseaseReserveFor diseases that require a differential diagnosis, the patient will be actively asked and encouraged to state sufficient information to help identify the disease
1.9 2.0Inquiries are clear and focused Humanistie qualities Demonsrateret sitivityandemthdngReserve ReserveInquiries are clear and focused Humanistic qualities
2.1consultations and communications Effective patient conversation skills include: opening the conversation, exploring the issue, understanding theReserveDemonstrate respect, sensitivity and empathy during consultations and communications Effective patient conversation skills include: starting the conversation, exploring the
2.2patient's pespective, organizing the discussion, building a trusting relationship with the patient, facilitating patient participation, explaining, planning, and closing the discussionAmended to “Establish a good doctor-patient relationship and gain the trust of patients* AdoptReviseissue, unerstanding the patient's persective, organizing the diseussion, buildin a good doctor-patient relationhip and gaining the patient's tust, facilitating patint participation, explaining, planning, and closing the disussion
2.3Reasonably respond to the patient's emotional performance during the conversation, such as appropriate emotional reassuranceReserveReasonably respond to the patient's emotional performance during the conversation, such as appropriate emotional reassurance
2.4 2.5making Valuing patients' individual wishes in medical decision- s pseq e suodo usdxa poReserveValuing patients individual wishes in medical decision-making Avoid expressing opinions that are biased against certain groups of people
2.6certain groups of people Be patient when explaining problems to patientsReserve ReserveBe patient when explaining problems to patients
2.7pue aedodde usn sued m sn appropriate languageReserveenu eadode pue eadode usn sused qm aumuuo
2.8Avoid inquiring about patient privacy when not necessaryReserveessu ou uq oeAμd ued nqe uunbu! pony
2.9Reasonable handing of patients' non-medical appeals during dialogueReserveReasonable handling of patients' non-medical appeals during dialogue
3.0Comprehensive diagnosis and treatment abilities of clinicians Identify the primary and secondary content oftheAmended to “Diagnostic Reasoning Ability”Not AdoptReserveComprehensive diagnosis and treatment abilities of clinicians
3.1information provided by the patient, and respond reasonably. For example, among the main complaints provided by the patient, the infomation that is helpful for judging the condition should be focused on and inquired about; ,if the patient is more concerned aboutAmended to read “Ability to identify chief complaint”dopv 10NReserveIdentify the primary and secondary content ofthe information provided by the patient, and respond reasonably. For example, for the patient's main complaint, you should focus on and inquire about the information that is helpful for judging the condition; Patients are more concerned about this and need to provide a reasonable explanation
3.2Recognize whether the information provided by the u paod pue s ed necesary, e.g the patient states that he orshe has a1. Amended to “whether the information is reliable" 2. Amended to “but the temperature measurement is not performed, the information is not reliableAdopt AllIdentify whether the infomation provided by the patient is reliable and provide further verification if necessary, e.g if the patient states that he or she has a fever but does not have a temperature measurement, the information is not reliable
3.3fever but does not take a temperature measurement Combine the information provided by the patient to arrive at a reasonable initial diagnosis and differential diagnosis(poodsns) sisoulep Keuuad, o popuouryAdoptReviseCombine information provided by the patient to arrive at areasonable initial suspct) and diferential diagnosis
3.4Propose an appropriate diagnostic protocol, that is, the diagnostic protocol should include necessary inspeetion1. Amend to “Consider the patient's condition and personal d su dodde esood s 2. Amend to “In the case of insufficient diagnostic evidence, d ssoudeeoAdopt AllConsider the patient's condition and personal wishes, and propose an appropriate plan for definite diagnosis when the diagnosis basis is insufficient. The plan should include necessary means that help to confirm the diagnosis and do not include irrelevant means
3.5Correetly explain to the patient the reasons for the proposed diagnosis3. Amend to* Help to confirm the diagnosis means and does not include extraneous means"Corectly explain to the patient the reasons for the proposed diagnosis
3.6Correctly explain to patients the basis for diagnosis and exelusionAmended to “Bases for excluding other differential diagnoses"AdoptReviseCorrectly explain to patients the basis for the diagnosis of the disease and the basis for excluding other differential diagnoses
3.7Propose the best treatment plan considering the patien's condition and personal wishesRevised to “propose the best treatment plan based on the patient's condition and diagnosis and treatment norms* AdoptReviseConsidering the patient's condition and personal wishes, combined with the patient's condition and diagnosis and treatment norms, propose the best treatment plan
8Correctly explain to patients the reasons for proposed treatment optionsReserveCorrectly explain to patients the reasons for proposed treatment options
3.9Make necessary reminders and suggestions to patients on certain important medical concems, such asReserveMake necessary reminders and suggestions to patients on certain mportant medical concerns, such as reminders of drug incompatibility
4.0reminders of drug incompatibility Overall clinicalcompetence
Reserve Overall clinicalcompetence
4.1 4.2Disstisfied SatisfactoryReserv Disstisfied Reserve Satisfactory
序号 项目内容 专家意见 采纳情况 修改 修改后项目
问诊技能 1. 修改为"准确询问患者主诉、现病史等信息" 保留 问诊技能
1.1 根据患者主诉准确进行疾病相关询问,包括既往史、用药史、合并症等 1. 修改为"根据患者主诉准确询问既往史、个人史、过敏史等" 2. 修改为"根据患者主诉,准确进行病情相关询问,包括发病原因、症状性质、程度、时间及规律性、加重或缓解因素、伴随症状、诊疗经过等" 采纳建议2 修改 围绕患者主诉,准确进行病情询问,包括疾病起因、症状性质、程度、时间及规律性、加重或缓解因素、伴随症状、诊疗经过等病史信息,如既往史、个人史、过敏史等
1.2 使用开放式问题询问患者,适时鼓励患者表达 使用开放式问题询问患者,适时鼓励患者表达
1.3 引导患者获取正确充分的疾病相关信息 1. 修改为"使用通俗语言,避免医学术语" 采纳 修改 引导患者提供正确充分的疾病相关信息
1.4 使用患者能理解的语言,避免医学术语 1. 修改为"使用通俗语言,避免如里急后重、鼻衄等医学术语" 2. 修改为"避免使用晦涩难懂的医学术语" 全部采纳 修改 使用通俗语言,避免如里急后重、鼻衄等难懂医学术语
1.5 向患者解释复诊、治疗的理由或目的 向患者解释复诊、治疗的理由或目的
1.6 避免向患者提供有医学危害的信息,如未经证实的健康信息等 避免向患者提供有医学危害的信息,如未经证实的健康信息等
1.7 对患者出现的医疗紧急情况做出有效判断和适当应对,如当患者血糖低于3.2mmol/L时及时告知低血糖风险及应急处理方案 保留 对患者出现的医疗紧急情况做出有效判断和适当应对,如当患者血糖低于3.2mmol/L时及时告知低血糖风险及应急处理方案
1.8 对于需要鉴别诊断的疾病,主动询问并鼓励患者陈述有助于识别疾病的充分信息 保留 对于需要鉴别诊断的疾病,主动询问并鼓励患者陈述有助于识别疾病的充分信息
1.9 询问清晰且有针对性 保留 询问清晰且有针对性
2.0 人文素养 保留 人文素养
2.1 在咨询沟通中表现出尊重、敏感性和同理心 有效的患者沟通技巧包括:开启对话、探讨问题、理解患者观点、组织讨论 保留 在咨询沟通中表现出尊重、敏感性和同理心 有效的患者沟通技巧包括:开启对话、探讨问题、理解患者观点、组织讨论
2.2 与患者建立信任关系,促进患者参与,解释、计划和结束讨论 修改为"建立良好医患关系并获得患者信任" 采纳 修改 建立良好医患关系并获得患者信任,促进患者参与,解释、计划和结束讨论
2.3 在对话中合理应对患者的情绪表现,如适当的情感安抚 保留 在对话中合理应对患者的情绪表现,如适当的情感安抚
2.4 在医疗决策中重视患者个人意愿 保留 在医疗决策中重视患者个人意愿
2.5 避免表达对某些人群有偏见的观点 保留 避免表达对某些人群有偏见的观点
2.6 向患者解释问题时保持耐心 保留 向患者解释问题时保持耐心
2.7 使用恰当的语言 保留 使用恰当的语言
2.8 非必要时避免询问患者隐私 保留 非必要时避免询问患者隐私
2.9 合理处理患者在对话中提出的非医疗诉求 保留 合理处理患者在对话中提出的非医疗诉求
3.0 临床医师综合诊疗能力 修改为"诊断推理能力" 不采纳 保留 临床医师综合诊疗能力
3.1 识别患者提供信息的主次内容并合理应对。例如在患者提供的主诉中,应重点关注和询问有助于判断病情的信息;若患者更关注 修改为"识别主诉的能力" 保留 识别患者提供信息的主次内容并合理应对。例如对患者主诉,应重点关注和询问有助于判断病情的信息;患者更关注的内容需给予合理解释
3.2 判断患者提供的信息是否可靠,必要时进一步验证,例如患者自称发热但未测量体温,则该信息不可靠 1. 修改为"信息是否可靠" 2. 修改为"但未测量体温,则信息不可靠" 全部采纳 判断患者提供的信息是否可靠,必要时进一步验证,例如患者自称发热但未测量体温,则该信息不可靠
3.3 结合患者提供的信息得出合理的初步诊断(疑诊)和鉴别诊断 采纳 修改 结合患者提供的信息得出合理的初步诊断(疑诊)和鉴别诊断
3.4 考虑患者病情和个人意愿,在诊断依据不足时提出适当的明确诊断方案。该方案应包括有助于确诊的必要手段,不包含无关手段 1. 修改为"考虑患者病情和个人意愿" 2. 修改为"在诊断依据不足时" 全部采纳 考虑患者病情和个人意愿,在诊断依据不足时提出适当的明确诊断方案。该方案应包括有助于确诊的必要手段,不包含无关手段
3.5 向患者正确解释所提诊断的理由 3. 修改为"有助于确诊的手段,不包含无关手段" 向患者正确解释所提诊断的理由
3.6 向患者正确解释疾病诊断依据和其他鉴别诊断的排除依据 修改为"其他鉴别诊断的排除依据" 采纳 修改 向患者正确解释疾病诊断依据和其他鉴别诊断的排除依据
3.7 结合患者病情和诊疗规范,提出最佳治疗方案 修改为"根据患者病情和诊疗规范提出最佳治疗方案" 采纳 修改 考虑患者病情和个人意愿,结合患者病情和诊疗规范,提出最佳治疗方案
3.8 向患者正确解释所提治疗方案的理由 保留 向患者正确解释所提治疗方案的理由
3.9 对某些重要医疗问题向患者做出必要提醒和建议,如药物配伍禁忌提醒 保留 对某些重要医疗问题向患者做出必要提醒和建议,如药物配伍禁忌提醒
4.0 整体临床能力 保留 整体临床能力
4.1 不满意 保留 不满意
4.2 满意 保留 满意
Item Medical Interviewing SkillsExpert OpinionAdoption Modify ReserveExplanation
Focus on patients self report, and conduct relevant inquiries around the current medical history accurately, including the cause of the onset, the nature, degree, timeThe content ofthe current medical history inquired should include at least one of the following items: the cause of the onset,
and regularity of symptoms, aggravating or mitigating factors, accompanying symptoms, diagnosis and 1.1 treatment, etc.1. The focus is on content, not accuracy 2. Simplify item descriptionsAdoptthe nature, degree, time and regularity of occurrence of Suueduoooe sooey Suesu no Sueaese suodus Revise symptoms, diagnosis and treatment, etc.
Guide patients to provide orret and sficint disease 1.21.Add examlesncludemedication history, previous surgery, ongoing treatment, etc. 2.The cpe of examples should be nreased tonclue family historyInquiring about past medical history should include at least one of the followingitm: suicalhstorypersonal history,family
related information, including other medical histories such as past history, personal history, allergy history, etc.3. Simplify item description 1. Ask open-ended questions based on the aforementioned chief complaints, around the history of presentAdopt AllRevise history, allergy history, medication history, etc. For example: Do not directly ask "Do you have chest tightness?", no op uu. suosonb udo n pnous nq
Use open-ended questions to encourage patients to 1.3 provide more detailed information The qustin hould be asy to be derstood, and avidillness/main complaints 2. Simplify item description10N Adoptuncomfortable?". (Example excerpted from "Clinical Diagnostics Reserve (Eight-Year Third Edition)", Human Health Publishing House)
using medical terminology that is too complicated or 1.4 diffcult to understand, such as tenesmus, epistaxis, etc. Explain to the patiet thebasis or purpose f the medicalSimplify item descriptionAdopt
1.5 terminology, diagnosis, treatment, etc.Simplify item descriptionAdoptReserve For example: CMV antibody, tenesmus, epistaxis, etc. Medical conclusions include: disease diagnosis, examination Revise methods, treatment plan, etc.
Avoid providing medically harmful information to 1.6 substantiated by scientific research, etc.Simplify item description patientsuhahaltretithaasAdoptHarmful medical information includes: such as information error: semaglutide is also known as rifampicin; scientific rumors: eating Reserve mung beans to treat tumors, etc.
Be able to effectively judge the occurrence of medical emergencies in patients and make appropriate responses,Simplify item descriptionAdopt
such as informing patients of the risk of hypoglycemia m su o e u sud u Aou pue 1.7 the patient's blood sugar is lower than 3.2mmol/LFor example, when the patient's blood sugar is 2.7mmol/L, timely informtherisk ofhypoglycmiaand themergencytreatment Reserve plan
Simplify item descriptionAdoptFor example,for the information provided by patients,focus on and inquire about the information that is helpful for judging the
Proactively ask the patient for relevant information that 1.8 can be used to aid in the differential diagnosiscondition; do not pay too much attention to and inquire about the information irrelevant to the judgment of the disease. If the patient is more concemed about this, a reasonable explanation
1.9 Inquiry ideas are clear and focusedSimilar with item 1.8Reserve Deleteshould be provided
Humanistic care Treat patients with respect, care and empathy duringReserve
2.1 consultations Effectively organize and converge the consultationSimplify item descriptionAdopt ReserveFor example: “I am very concemed to hear about your condition", etc.
process, including reasonable questioning, building a trusting relationship with the patient and encouraging the patient, obtaining comprehensive and accurate 2.2 information, and leading the discussion to a conclusionSimplify item descriptionAdopt
Coping reasonably with patients' emotional displays inReserveDo not make nvalid expansion after obtaining enough information to draw a medical conclusion
[euonou sjeudodde Supiaod pue uopesunuruos 2.3 supportSimplify item descriptionAdopt ReserveWhen the patient shows negative motions, give reasonable counseling
Able to value the patient's individual wishes in the 2.4 medical dcision-making processIt is recommended to adhere to certain medical principlesFor example,if a patient requires conservative treatment but does not violate medical principles, the patient's ideas should be
2.5 Avoid expressions of bias of any kindand simplify item description.dopv Reviseadopted Forexamle,prjdiceagainst ating habits n different rgi,
2.6 Be patient with and explain patients' questionsVery necary, examle can range frm primary dsese to complications and progressionReserve Adopt Reviseetc. For example: Rhetorical questions such as "Didn't you explain it just now?" can be judged as 0 points
2.7 Be courteous and appropriate in communicationSimplify item descriptionAdopt ReservePolite words include: hello, thank you very much, etc.
Protect patient privacy and avoid asking privacy 2.8 questions when not required by the diseaseSimplify item descriptionAdopt ReservePrivacy concerms include spouse name, home address, etc.
Reasonably respond to non-medical demands of patients 2.9 in dialogueSimilar with item 1.8 and 2.2
Comprehensive clinical diagnosis and treatment ability Identify primary and secondary content of information provided by patients, and respond reasonably. ForRevise
example, for information provided by patients, you should focus on and inquire about the information that is helpful for judging the condition; If the patient is more concerned about this, a reasonable explanation should beSimilar with item 1.8
3.1 provided Judging the credibility of the information provided by the patient and being able to verify it withappropriate questions, e.g. if a patient states that he or she has a fever but no temperature measurement is taken, theSimplify item descriptionDelete Adopt
3.2 information is not reliable Combine information provided by the patient to arrive at 3.3 a reasonable initial (suspect and differential diagnosisSimplify item descriptionReserve AdoptFor example, if the patient states that he has a fever, but cannot give the temperature value, the information is inaccurate Reserve
Consider the patient's condition and personal wishes, and put forward an appropriate plan for definite diagnosis when the diagnosis basis is insufficient. The plan should include necessary means that help to confirm the 3.4 diagnosis and do not include irrelevant means1. Similar to differential diagnosis, some definitions are needed, such as “probably", possibly","with a certain probability”, "do not rule out, etc. 2. Consider the standard of diagnosis and treatment, and adjust the order with the patient's condition and personal wishes
3. Simplify the item 1. This may not be necesary 2. Medical staffshould explain the patient's condition and medical measures during the diagnosis and treatment activities. If a special examination s required, the medical personnel shalAdopt AllRevise
3.5 diagnosis Correctly explain to patients thebasis for the diagnosis of the disease and the basis for excluding other differential 3.6 diagnosespromptly explain the medical risks, alternative medical Correctly explain to the patient the basis for the proposed plans, etc. to the patient. 3. For example, explain the reason or basis for the diagnosis plan 4. Simplify the item 1.This dos nt need tobe cxplained to th patint 2.AddTake Suggestio n 3 and 4 Revise TakeThe diagnostic program includes inspection, inspection and other means
"or explain the condition to the patient correctly" 3. Simplify the item 1. The treatment plan needs to be considered. If the position of the general practitioner is not only triage, thenSuggestio n 3 Revise
Considering the patient's condition and personal wishes, combined with the patient's condition and diagnosis and treatment orms propose the most reasonable treatment 3.7 planthis step is required, but if it is only triage, you can consider ignoring this step 2. Mention the patient’s condition wice,peat 3.Diagnosis and reatmnt om, patient condition, and personal wishes, order adjustment 4.Take recomme
Simplify the item 1.Consultion cniodobeondcafll 2. Medical personnel should explain the condition andndations 2, 4 Revise
medical measures to patients during diagnosis and
Correctly explain to patients the reasons for proposedtreatment activities. If surgery or special treatment is required, the medical personnel shall promptly explain the medical risks and alternative medical plans to the patient. 3. For example, explain the reason or basis for the
proposed treatment plan Take Suggestio n 3 Revise
3.8 treatment options
Make necessary reminders and suggestions to patients on
certain important medical concerms, such as reminders of
3.9 drug incompatibility Overall Clinical Competency 4.1 DissatisfiedSimilar with item 1.6Reserve
项目 医学问诊技能 专家意见 采纳/修改/保留 说明
关注患者主诉,围绕现病史准确进行相关询问,包括发病原因、症状性质、程度、时间规律、加重或缓解因素、伴随症状、诊断与治疗等 1. 关注内容而非精确性 2. 简化条目描述 采纳 现病史询问内容应至少包含以下一项:发病原因、症状性质、程度、发生时间规律、诊断与治疗等
引导患者提供正确且充分的疾病相关信息,包括既往史、个人史、过敏史等其他病史 1. 增加示例(如用药史、既往手术史、当前治疗等) 2. 扩充示例类型(包含家族史) 3. 简化条目描述 全部采纳 既往史询问应至少包含以下一项:手术史、个人史、家族史、过敏史、用药史等。例如:不应直接询问"是否有胸闷?",而应问"您现在感觉哪里不舒服?"(示例节选自《临床诊断学(八年制第三版)》,人民卫生出版社)
使用开放式提问鼓励患者提供详细信息。问题应易于理解,避免使用过于复杂或难懂的医学术语(如里急后重、鼻衄等) 1. 基于主诉进行开放式提问 2. 简化条目描述 采纳 保留专业术语示例:CMV抗体、里急后重、鼻衄等。医疗结论包括:疾病诊断、检查方法、治疗方案等
向患者解释医学术语、诊断、治疗等的依据或目的 简化条目描述 采纳 保留
避免提供经科学研究证实的医学有害信息 简化条目描述 采纳 有害医疗信息包括:信息错误(如索马鲁肽又称利福平)、科学谣言(如吃绿豆治肿瘤)等
能有效判断患者突发医疗状况并做出恰当应对(如告知患者血糖低于3.2mmol/L时的低血糖风险) 简化条目描述 采纳 例如当患者血糖2.7mmol/L时,及时告知低血糖风险及急救方案
主动询问有助于鉴别诊断的相关信息 简化条目描述 采纳 例如应重点关注有助于判断病情的信息,不过度关注无关信息。若患者特别在意,需给予合理解释
问诊思路清晰有重点 与条目1.8类似 保留/删除 应提供合理解释
* * 人文关怀* * 在问诊过程中尊重、关心并体谅患者 简化条目描述 采纳/保留 例如:"听到您的情况我非常关心"等
有效组织问诊流程,包括合理提问、建立信任、鼓励患者、获取全面准确信息并引导结论 简化条目描述 采纳 在获取足够信息得出医学结论后不做无效扩展
合理应对患者情绪表达并提供支持 简化条目描述 采纳/保留 当患者出现负面情绪时应给予合理疏导
在医疗决策中重视患者个人意愿 建议遵循特定医疗原则 修订 例如患者要求保守治疗但不违反医疗原则时,应采纳患者意见
避免任何偏见表达 并简化条目描述 采纳/修订 例如对不同地区饮食习惯的偏见等
耐心解答患者疑问 非常必要,示例可从原发病到并发症及进展 保留/采纳/修订 例如:"刚才不是解释过了吗?"等反问句可判0分
沟通礼貌得体 简化条目描述 采纳/保留 礼貌用语包括:您好、非常感谢等
保护患者隐私,非病情需要避免询问隐私问题 简化条目描述 采纳/保留 隐私信息包括配偶姓名、家庭住址等
合理回应患者非医疗诉求 与条目1.8和2.2类似
* * 综合临床诊疗能力* * 识别患者信息主次并合理回应。例如应重点关注有助于判断病情的信息;若患者特别在意需给予合理解释 与条目1.8类似 修订
判断患者提供信息的可信度并能通过适当提问验证(如患者自称发热但未测体温则信息不可靠) 简化条目描述 删除/采纳
结合患者信息得出合理初步(疑似和鉴别)诊断 简化条目描述 保留/采纳 例如患者自称发热但无法提供体温数值则信息不准确
考虑患者病情和个人意愿,在诊断依据不足时提出适当的明确诊断方案。方案应包含必要确诊手段,排除无关手段 1. 类似鉴别诊断需定义"可能""不排除"等表述 2. 考虑诊疗标准,调整与患者病情和个人意愿的顺序
向患者正确解释疾病诊断依据及其他鉴别诊断排除依据 3. 简化条目 1. 非必要 2. 医务人员应说明病情及医疗措施。特殊检查需及时说明医疗风险、替代方案等 3. 例如解释诊断方案依据 采纳全部/修订 诊断程序包括检查、检验等手段
考虑患者病情和个人意愿,结合诊疗规范提出最合理治疗方案 1. 治疗方案需考虑 2. 重复提及患者病情 3. 调整诊断规范、患者病情和个人意愿顺序 4. 简化条目 采纳建议2/4/修订
向患者正确解释建议治疗方案的理由 1. 会诊可远程进行 2. 医务人员应说明病情及医疗措施。手术或特殊治疗需及时说明医疗风险及替代方案 3. 例如解释治疗方案依据 采纳建议3/修订
就重要医疗事项(如药物配伍禁忌)向患者作必要提醒建议 与条目1.6类似 保留
* * 整体临床能力* * 4.1 不满意 保留
阅读全文(20积分)