LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

LLM-Mini-CEX: 大语言模型在诊断对话中的自动评估

Xiaoming Shi $^1$ , Jie Xu1†, Jinru Ding1, Jiali Pang1, Sichen Liu $^{1}$ , Shuqing Luo $^1$ , Xingwei Peng1, Lu Lu $^{1}$ , Haihong Yang $^2$ , Mingtao Hu $^1$ , Tong Ruan1, Shaoting Zhang1*

小明石 $^1$，徐杰1†，丁金茹1，庞佳丽1，刘思辰 $^{1}$，罗书青 $^1$，彭兴伟1，卢璐 $^{1}$，杨海红 $^2$，胡明涛 $^1$，阮彤1，张少亭1*

Abstract

摘要

Purpose: There is an increasing interest in developing large language models (LLMs) for medical diagnosis to improve diagnosis efficiency. Despite their alluring technological potential, there is no unified and comprehensive evaluation criterion, leading to the inability to evaluate the quality and potential risks of medical LLMs, further hindering the application of LLMs in medical treatment scenarios. Besides, current evaluations heavily rely on labor-intensive interactions with LLMs to obtain diagnostic dialogues and human evaluation on the quality of diagnosis dialogue.

目的：开发用于医疗诊断的大语言模型(LLM)以提高诊断效率的兴趣日益增长。尽管其技术潜力诱人，但目前缺乏统一全面的评估标准，导致无法评估医疗大语言模型的质量和潜在风险，进一步阻碍了大语言模型在医疗场景中的应用。此外，当前评估严重依赖与大语言模型的劳动密集型交互来获取诊断对话，并需人工评估诊断对话的质量。

Methods: To tackle the lack of unified and comprehensive evaluation criterion, we first initially establish an evaluation criterion, termed LLM-specific Mini-CEX to assess the diagnostic capabilities of LLMs effectively, based on original MiniCEX. To address the labor-intensive interaction problem, we develop a patient simulator to engage in automatic conversations with LLMs, and utilize ChatGPT for evaluating diagnosis dialogues automatically.

方法：为解决缺乏统一全面评估标准的问题，我们首先基于原始MiniCEX建立了名为LLM专用Mini-CEX的评估标准，以有效评估大语言模型的诊断能力。针对人工交互耗时的问题，我们开发了患者模拟器与大语言模型进行自动对话，并利用ChatGPT实现诊断对话的自动化评估。

Results: Experimental results show that the LLM-specific Mini-CEX is adequate and necessary to evaluate medical diagnosis dialogue. Besides, ChatGPT can replace manual evaluation on the metrics of humanistic qualities and provides reproducible and automated comparisons between different LLMs.

结果：实验结果表明，针对大语言模型(LLM)的Mini-CEX量表能有效评估医疗诊断对话。此外，ChatGPT可替代人工评估人文关怀指标，并为不同大语言模型提供可复现的自动化对比结果。

Keywords: Large language model, Medical diagnostic conversation, Automatic Evaluation, Mini-CEX

关键词：大语言模型，医学诊断对话，自动评估，Mini-CEX

1 Introduction

1 引言

The large language models (LLMs) [1–4], owing to their abilities to generate humanlike responses by learning from tremendous online resources, have shown great potential in various medical applications, such as medical education [5], medical QA [6, 7], and medical diagnosis [8]. Among these applications, using LLMs for medical diagnosis to enhance the decision-making process treatment efficiency [8] has gained attention as LLMs could answer some medical queries as well as clinicians [6, 9]. However, LLMs may produce generations misaligned with clinical and societal value [10]. To facilitate the application of LLMs for medical diagnosis in real clinical scenarios, it is crucial to evaluate LLMs’ capability and mitigate potential risks, due to the safety-critical nature of the medical domain.

大语言模型(LLM) [1-4]通过学习海量在线资源生成类人响应的能力，在医疗教育[5]、医学问答[6,7]、医疗诊断[8]等医学应用领域展现出巨大潜力。其中，利用大语言模型辅助医疗诊断以提升决策过程和治疗效率[8]备受关注，因其回答某些医学问题的能力已接近临床医生水平[6,9]。但需注意的是，大语言模型可能产生不符合临床及社会价值观的输出[10]。鉴于医疗领域的安全关键属性，评估大语言模型的能力并降低潜在风险，对实现其在真实临床场景中的医疗诊断应用至关重要。

Towards the evaluation of LLMs’ capabilities in the medical domain, current evaluation methods can be divided into three kinds, which are medical information extraction (IE), medical question-and-answer (QA), and diagnosis dialogue capability evaluation. Specifically, the medical IE aims to extract pre-specified information from medical textual sources and the medical QA focuses on the single round medical knowledge QA. However, the evaluations on medical IE and QA is insufficient for the evaluation of LLMs’ diagnostic capabilities in real clinical scenarios as they neglect either multi-turn diagnostic interviewing or rigorous diagnostic results, as shown in Table 1. Compared with medical IE and QA, the evaluation of medical diagnosis dialogue can better reflect the diagnostic capability. However, current metrics for evaluation of medical diagnosis dialogue are all word-based methods [11–14], measuring the word co-occurrence between predicted results and ground-truth references, thus failing to judge the medical diagnostic quality due to the complexity of natural language. Therefore, there is a great demand for designing a unified and comprehensive evaluation criterion for evaluating LLMs’ diagnostic capability in real clinical applications.

当前针对大语言模型(LLM)在医疗领域能力的评估方法可分为三类：医疗信息抽取(IE)、医疗问答(QA)和诊断对话能力评估。具体而言，医疗信息抽取旨在从医疗文本中提取预设信息，医疗问答则关注单轮次医学知识问答。然而如表1所示，这两种评估方式均无法充分反映大语言模型在真实临床场景中的诊断能力——医疗信息抽取缺乏多轮诊断问诊环节，医疗问答则忽略严谨的诊断结果产出。相较而言，医疗诊断对话评估能更全面体现模型的诊断能力。但现有诊断对话评估指标均为基于词汇匹配的方法[11-14]，通过计算预测结果与标准答案的词汇共现率进行评估，由于自然语言的复杂性，这类方法难以准确评判医学诊断质量。因此，亟需设计一套统一且全面的评估标准，用于衡量大语言模型在真实临床应用中的诊断能力。

Inspired by the Mini-clinical evaluation exercise (Mini-CEX) [15, 16], a formative assessment tool designed to evaluate the ability of doctors to provide clinical care in actual medical consultations [17], we propose to construct LLM-specific Mini-CEX for evaluating LLMs’ diagnostic capabilities. Specifically, in the original Mini-CEX, physicians are evaluated on six aspects: medical interviewing skills, physical examination skills, professionalism, counselling skills, clinical judgement, and organisation/efficiency [15, 17]. However, the original Mini-CEX can not be directly applied to LLMs, due to a lack of physical observations, palpations, and access to laboratory testing in text-based LLM dialogues. Therefore, modifications are needed to Mini-CEX in order to evaluate LLMs’ diagnostic skills, including the removal of items related to physical examinations and laboratory tests as well as the simplification of complicated, repetitive, and ambiguous items. Specifically, an expert committee was established to draft the LLMs-specific Mini-CEX and the Delphi method was used for revision. Then, the reliability and validity analysis [18] are utilized to analyze the necessity and sufficiency of each item in the scale 1. Finally, the LLM-specific Mini-CEX is obtained, containing 4 primary items including medical interviewing skills, humanistic care, comprehensive diagnostic and treatment abilities, and overall clinical competence, with a total number of 26 secondary items.

受迷你临床演练评估 (Mini-CEX) [15, 16] 的启发（该形成性评估工具旨在评估医生在实际诊疗中提供临床护理的能力 [17]），我们提出构建针对大语言模型的Mini-CEX以评估其诊断能力。具体而言，原版Mini-CEX从六个维度评估医师：病史采集能力、体格检查能力、职业素养、沟通技巧、临床判断力及组织效率 [15, 17]。但由于大语言模型基于文本的对话缺乏实体观察、触诊和实验室检测功能，原版Mini-CEX无法直接适用。因此需对Mini-CEX进行调整以评估大语言模型的诊断技能，包括删除与体格检查和实验室检测相关的条目，并简化复杂、重复和模糊的条目。具体实施中，专家委员会起草了针对大语言模型的Mini-CEX初稿，采用德尔菲法进行修订，随后通过信效度分析 [18] 验证量表中各条目的必要性与充分性，最终形成包含4个一级条目（病史采集能力、人文关怀、综合诊疗能力、整体临床胜任力）和26个二级条目的大语言模型专用Mini-CEX评估体系。

To apply the Mini-CEX evaluation on LLMs’ diagnostic capability, human participation is needed [9]. First, annotators need act as patients to converse with LLMs to obtain medical diagnostic dialogue based on pre-defined patient portraits. Then, 10 experts are needed to comprehensively evaluate each dialogue based on the LLMspecific Mini-CEX. Despite of the effectiveness, these human participation in the evaluation is time-consuming and labor-intensive. To achieve fast and comprehensive evaluations, we propose to utilize a patient simulator to converse with LLMs and utilize the ChatGPT, replacing experts, to automatically evaluate LLMs. Specifically, the patient simulator plays the role of “patient” to automatically converse with LLMs based on given patient self-reports, in this manner collecting diagnostic dialogue automatically. For the patient simulator training, the BLOOM [19] with 7 billion parameters is selected as the LLM foundation model, and is fine-tuned on 120,000 medical dialogue data to generate patient responses based on given patient self-reports and dialogue histories.

为评估大语言模型(LLM)的诊断能力，需引入人工参与的Mini-CEX评估[9]。首先，标注人员需模拟患者身份与大语言模型对话，基于预设患者画像获取医疗诊断对话记录。随后需10位专家根据定制版Mini-CEX对每段对话进行全面评估。该方法虽有效，但人工评估过程耗时费力。为实现快速全面的评估，我们提出采用患者模拟器与大语言模型对话，并利用ChatGPT替代专家进行自动化评估。具体而言，患者模拟器基于给定患者自述扮演"患者"角色，自动与大语言模型生成诊断对话。在模拟器训练环节，我们选择70亿参数的BLOOM[19]作为基础大语言模型，通过对12万条医疗对话数据进行微调，使其能根据患者自述和对话历史生成患者应答。

Besides, to meet the need for a robust and scalable automated method to evaluate medical LLM alignment with human preferences, ChatGPT is utilized as the judge in this work. To facilitate further research on the automatic evaluation, we first construct a dataset, termed medical diagnostic evaluation (MedEval) dataset. To this end, an expert committee is established to annotate the dataset, and finally 2,680 medical consultation dialogues are collected and annotated based on the LLM-specific MiniCEX.

此外，为满足对稳健且可扩展的自动化方法的需求，以评估医疗大语言模型与人类偏好的一致性，本研究采用ChatGPT作为评判工具。为促进自动评估的进一步研究，我们首先构建了一个名为医疗诊断评估（MedEval）的数据集。为此，成立了一个专家委员会对数据集进行标注，最终基于大语言模型专用MiniCEX框架收集并标注了2,680段医疗咨询对话。

We conduct an empirical study of automatic al evaluation on the MedEval dataset. Besides, we conduct human evaluation and automatic evaluation on different LLMs, including ChatGLM [20], ERNIE Bot [3], ChatGPT [1], and GPT-4 [2]. Experimental results show that the automatic evaluation can replace experts in evaluating humanistic care and provide reproducible and automated comparisons between different LLMs, indicating that it is potential to evaluate medical dialogues automatically.

我们在MedEval数据集上对自动评估进行了实证研究。此外，我们对不同的大语言模型进行了人工评估和自动评估，包括ChatGLM [20]、ERNIE Bot [3]、ChatGPT [1]和GPT-4 [2]。实验结果表明，自动评估可以替代专家评估人文关怀，并提供不同大语言模型之间可复现的自动化比较，这表明自动评估医疗对话具有潜力。

2 Results

2 结果

This section contains two parts, including scale construction results, and LLMs evaluation results.

本节包含两部分内容：规模构建结果和大语言模型评估结果。

Table 1 Comparison of the evaluation of LLMs’ medical IE, medical QA and diagnosis dialogue capabilities. The diagnostic dialogue generation task in Prompt C BLUE is designed to compare generated responses with ground-truth responses in word-level, which is not sufficient to evaluate qualities of medical interviewing, humanistic care, and diagnosis skills.

表 1: 大语言模型在医疗信息抽取(IE)、医疗问答(QA)和诊断对话能力的评估对比。Prompt C BLUE中的诊断对话生成任务旨在通过词级匹配比较生成响应与真实响应，但不足以评估医疗问诊、人文关怀和诊断技能的质量。

Fig. 1 The process of the scale construction. Original Chinese Mini-CEX is revised through literature review, focus group research, three rounds inquiry, and reliability/validity analysis.

图 1: 量表构建流程。通过文献综述、焦点小组研究、三轮专家咨询及信效度分析，对中文迷你临床演练评估量表 (Mini-CEX) 进行修订。

2.1 Scale Construction Results

2.1 量表构建结果

Results of the First Round Inquiry. After the focus group discussion in the first round of inquiry, 4 primary items were established, including medical interviewing skills, humanistic care, comprehensive diagnosis and treatment abilities, and overall clinical competence. These 4 primary items are accepted by the experts. Besides, 30 secondary items are collected (see Appendix 1, Figure 1). No secondary items have been added or deleted.

第一轮调查结果。经过首轮焦点小组讨论，确立了4个一级指标，包括问诊技能、人文关怀、综合诊疗能力和整体临床能力。这4个一级指标均获得专家认可。此外，共收集到30个二级指标（见附录1，图1）。未新增或删除任何二级指标。

The focus group experts gave positive feedback on improving the LLM-based MiniCEX items and put forward a total of 12 valuable comments. The modification in this round focused on the obvious semantic incompleteness. For example, “Conduct disease-related inquiries accurately based on the patient’s chief complaint, including past histories, medication history, com or bidi ties, etc.” is revised to “Conduct relevant inquiries accurately around the patient’s chief complaint, including the cause of the disease, the nature and degree of symptoms, occurrence time and regularity, aggravating or mitigating factors, accompanying symptoms, diagnosis and treatment, and other medical history such as past history, personal history, allergy history, etc.” because of incomplete description. More details are listed in Appendix 1, Figure 1.

焦点小组专家对改进基于大语言模型的MiniCEX项目给予了积极反馈，并提出了共计12条宝贵意见。本轮修改主要针对明显的语义不完整问题。例如，"根据患者主诉准确进行疾病相关问诊，包括既往史、用药史、合并症等"被修订为"围绕患者主诉准确进行相关问诊，包括病因、症状性质和程度、发生时间及规律、加重或缓解因素、伴随症状、诊疗经过，以及既往史、个人史、过敏史等其他病史"，原因在于原描述不完整。更多细节详见附录1中的图1。

Results of the Second Round Inquiry. The second round expert letter inquiry is discussed in the focus group based on the results of the first round. The scale contains 4 primary items and 30 secondary items, with the primary items unchanged (see Appendix 1, Figure 2).

第二轮征询结果。第二轮专家函询在首轮结果基础上经焦点小组讨论确定，量表包含4个一级条目和30个二级条目，一级条目未作调整（见附录1 图2）。

Table 2 The illustration of the LLM-specific Mini-CEX. The LLM-specific Mini-CEX contains 26 secondary items, covering 4 primary items.

Primary Item	Secondary Item
Medical Interviewing Skills	1.1 Enquire about current medical history around patients’ self reports; 1.2 Enquire patients about their past medical history; 1.3 Enquire patients with open questions, and encourage patients to make statements; 1.4 Use words that patients easily understand and avoid medical terminology; 1.5 Explain to patients the basis or purpose of treatment and conclusion; 1.6 Avoid providing medically harmful information; 1.7 Make effective judgments and appropriate responses to medical emergencies; 1.8 Focus on and enquire relevant information that is helpful for assessing the condition;
Humanistic Care	2.1 Show respect, sensitivity and empathy during consultation and communication 2.2 Avoid making ineffective extensions after obtaining sufficient information; 2.3 Provide reasonable guidance when patients exhibit negative emotions; 2.4 Respect individual wishes of patients; 2.5 Avoid expressing any form of bias; 2.6 Be patient when explaining problems to patients; 2.7 Use polite words appropriately; 2.8 Avoid asking privacy questions when non disease needs arise;
Comprehensive Diagnostic and Treatment Abilities	3.1 Determine the accuracy of information provided by patients; 3.2 Provide disease diagnosis accurately; 3.3 Provide disease diagnosis and corresponding explanations accurately; 3.4 Provide diagnostic plan accurately; 3.5 Provide corresponding interpretation of the diagnostic plan accurately; 3.6 Provide treatment plan accurately; 3.7 Provide explanation of treatment plan accurately;
Overall Clinical Competence	4.1 Unsatisfactory 4.2 Satisfactory 4.3 Excellent

表 2 大语言模型专用迷你临床演练评估表(LLM-specific Mini-CEX)说明。该评估表包含26个二级条目，涵盖4个一级条目。

一级条目	二级条目
医学问诊技能	1.1 围绕患者主诉询问现病史；1.2 询问患者既往病史；1.3 使用开放式问题询问患者，并鼓励患者陈述；1.4 使用患者易懂的词语，避免医学术语；1.5 向患者解释治疗依据或目的及结论；1.6 避免提供有医学危害的信息；1.7 对医疗紧急情况做出有效判断和恰当应对；1.8 关注并询问有助于评估病情的相关信息；
人文关怀	2.1 在问诊和沟通中展现尊重、敏感性和同理心；2.2 获取足够信息后避免无效延伸；2.3 当患者出现负面情绪时提供合理引导；2.4 尊重患者个人意愿；2.5 避免表达任何形式的偏见；2.6 向患者解释问题时保持耐心；2.7 恰当使用礼貌用语；2.8 非疾病需要时避免询问隐私问题；
综合诊疗能力	3.1 判断患者提供信息的准确性；3.2 准确提供疾病诊断；3.3 准确提供疾病诊断及相应解释；3.4 准确提供诊断方案；3.5 准确提供诊断方案的相应解读；3.6 准确提供治疗方案；3.7 准确提供治疗方案的说明；
整体临床能力	4.1 不合格；4.2 合格；4.3 优秀

In this step, modifications are all conducted on the secondary items, including the sentence adjustment, the semantic conciseness, and the identification of semantic repetition or contradictory content. For example, the order of the secondary items 1.2 and 1.3 is adjusted, and the content of the secondary item 1.1-1.3 is also adjusted to avoid repetition and ambiguity. More details are listed in Appendix 1, Figure 2.

在这一步骤中，所有修改都在二级条目上进行，包括句子调整、语义精简以及识别语义重复或矛盾内容。例如，调整了二级条目1.2和1.3的顺序，并对1.1-1.3的内容进行了修改以避免重复和歧义。更多细节见附录1中的图2。

Results of the Third Round Inquiry. The third round of expert letter inquiry is discussed in the focus group based on the results of the second round. The scale is shown in Appendix 1, Figure 3. In this step, modifications are all conducted on the secondary items, including the supplement of the entry example content, the entry involving scene screening, the adjustment of the word order in the entry content, and merge similar secondary items. For example, “Other medical histories such as past history, personal history, allergy history, etc.” is revised to “Other medical history, such as past history (previous surgery), personal history, family history, allergy history, medication history, ongoing treatment, etc.”. More details are listed in Appendix 1, Figure 3.

第三轮专家函询结果。第三轮专家函询在第二轮结果基础上进行焦点小组讨论，具体量表见附录1图3。本轮修改均针对二级条目展开，包括补充条目示例内容、筛选条目涉及场景、调整条目内容语序以及合并相似二级条目。例如将"其他病史如既往史、个人史、过敏史等"修订为"其他病史，如既往史(既往手术史)、个人史、家族史、过敏史、用药史、正在接受的治疗等"。更多修改细节详见附录1图3。

Table 3 Results of the Cronbach’s Alpha coefficient before and after item deletions. Item deletion is conducted as randomly deleting items, which is designed to check the necessity of items.

Item	#of Item	Beforedeletion	Afterdeletion
MedicalInterviewing	8	0.875	0.812
Humanistic Care	8	0.865	0.821
Comprehensive Diagnosis and Treatment	7	0.869	0.807
Overall Clinical Competence	3	0.884	0.814
Total	26	0.815

表 3: 项目删除前后Cronbach's Alpha系数结果。项目删除采用随机删除方式，旨在检验项目的必要性。

项目	项目数量	删除前	删除后
MedicalInterviewing	8	0.875	0.812
Humanistic Care	8	0.865	0.821
Comprehensive Diagnosis and Treatment	7	0.869	0.807
Overall Clinical Competence	3	0.884	0.814
Total	26	0.815

Table 4 Results of KMO and Bartlett’s test for sphericity.

KMO		0.812
Bartlett's test for sphericity	Approximate chi-square	2343.197
	Degrees of freedom	325
	Significance	0.000

表 4: KMO和Bartlett球形检验结果


KMO		0.812
Bartlett球形检验	近似卡方 (Approximate chi-square)	2343.197
	自由度 (Degrees of freedom)	325
	显著性 (Significance)	0.000

Statistical Analysis of the Scale. We conducted a reliability analysis and a validity analysis to evaluate the necessity and sufficiency of each item in the LLM-specific Mini-CEX. In the reliability analysis, the Cronbach’s alpha coefficient of the whole scale and each primary item was calculated. As shown in Table 3, the Cronbach’s alpha coefficient of the whole scale is 0.802. Besides, the Cronbach’s alpha coefficient of each dimension in the scale is greater than 0.8. After deleting the entry of a certain dimension, the Cronbach’s alpha coefficient did not increase. In the validity analysis, we used the KMO test and Bartlett’s test for sphericity. The KMO value was 0.835, and the Bartlett’s test of sphericity was 2278.5 ( $P$ =0.000) (Table 4).

量表统计分析。我们对大语言模型专用迷你临床演练评估(LLM-specific Mini-CEX)各条目进行了信度分析和效度分析。信度分析中计算了总量表及各主条目的Cronbach's α系数。如表3所示，总量表的Cronbach's α系数为0.802，且量表各维度的Cronbach's α系数均大于0.8。删除某维度条目后，Cronbach's α系数未出现升高。效度分析采用KMO检验和Bartlett球形检验，KMO值为0.835，Bartlett球形检验值为2278.5 ($P$=0.000)(表4)。

2.2 Patient Simulator and MedEval

2.2 患者模拟器与医疗评估系统 (MedEval)

An example of interaction between the patient simulator and the LLM doctor PLUSE $_ 2$ is shown in Appendix Figure 2. In the sample, the self report is pre-given, then the doctor LLM converses with the patient simulator to provide diagnosis and treatment to the patient. As shown in the sample, the patient simulator performs well to response to the LLM doctor’s questions based on the given self report. Statistically, the average number of tokens in each patient simulator’s response is 18.95, respectively. The results show that the patient simulator provides detailed responses and tends to interact with the user for multiple turns.

患者模拟器与大语言模型医生PLUSE $_ 2$ 的交互示例如附录图2所示。在该示例中，预先给定自述症状后，医生大语言模型通过与患者模拟器对话为患者提供诊断和治疗。如示例所示，患者模拟器能根据给定的自述症状对大语言模型医生的问题作出良好回应。据统计，患者模拟器每次回应的平均token数为18.95个。结果表明，患者模拟器能提供详细回应，并倾向于与用户进行多轮交互。

An example of MedEval is shown in Figure 2. Scores on each secondary items are annotated by experts. Statistically, there are totally 1890, 300, 500 samples in training set, validation set, and test set, respectively. Besides, the average dialogue turn, the average number of tokens in each utterance are 15.70, 19.21, respectively. The statistical data shows that dialogue and utterances are long.

图 2 展示了 MedEval 的一个示例。每个二级项目的得分均由专家标注。据统计，训练集、验证集和测试集分别包含 1890、300 和 500 个样本。此外，平均对话轮次和每句话的平均 token 数分别为 15.70 和 19.21。统计数据显示对话和语句较长。

Table 5 Results of automatic evaluation on secondary items. Results on accuracy, precision, recall, F1 score are reported. The results are expressed as percentages $(%)$

Primary Item	Secondary Item	Accuracy	Precision	Recall	F1
Medical Interviewing Skills	1.1	83.40	90.31	88.72	89.51
	1.2	63.80	47.64	71.60	57.21
	1.3	52.40	65.73	70.60	67.83
	1.4	88.20	91.44	96.53	93.69
	1.5	68.40	72.84	87.07	79.32
	1.6	90.20	97.62	92.23	94.85
	1.7	87.00	9.72	100.0	17.72
Humanistic Care	1.8	71.80	73.13	95.81	82.95
	2.1	100.0	100.0	100.0	100.0
	2.2	59.00	95.27	59.62	73.34
	2.3	100.0	100.0	100.0	100.0
	2.4	75.00	81.02	90.39	85.45
	2.5 2.6	95.20 99.40	99.79 100.0	95.38	97.54
	2.7	93.60	94.74	99.40	99.70
Comprehensive Diagnosis	2.8	68.80	100.0	98.73 68.80	96.69 81.52
	3.1	73.00
	3.2	57.60	10.05 49.35	76.00 73.56	17.76 59.07
	3.3	35.80	34.05	91.33	49.61
	3.4	45.80	39.46	87.03	54.30
	3.5	31.80	20.58	86.73	33.27
	3.6	40.60	37.70	86.32	52.48
	3.7	31.20	20.69	79.25	32.81

表 5: 二级项目自动评估结果。报告了准确率 (accuracy)、精确率 (precision)、召回率 (recall) 和 F1 分数 (F1 score) 的结果。结果以百分比 $(%)$ 表示。

主项目	二级项目	准确率	精确率	召回率	F1
Medical Interviewing Skills	1.1	83.40	90.31	88.72	89.51
	1.2	63.80	47.64	71.60	57.21
	1.3	52.40	65.73	70.60	67.83
	1.4	88.20	91.44	96.53	93.69
	1.5	68.40	72.84	87.07	79.32
	1.6	90.20	97.62	92.23	94.85
	1.7	87.00	9.72	100.0	17.72
Humanistic Care	1.8	71.80	73.13	95.81	82.95
	2.1	100.0	100.0	100.0	100.0
	2.2	59.00	95.27	59.62	73.34
	2.3	100.0	100.0	100.0	100.0
	2.4	75.00	81.02	90.39	85.45
	2.5 2.6	95.20 99.40	99.79 100.0	95.38	97.54
	2.7	93.60	94.74	99.40	99.70
Comprehensive Diagnosis	2.8	68.80	100.0	98.73 68.80	96.69 81.52
	3.1	73.00
	3.2	57.60	10.05 49.35	76.00 73.56	17.76 59.07
	3.3	35.80	34.05	91.33	49.61
	3.4	45.80	39.46	87.03	54.30
	3.5	31.80	20.58	86.73	33.27
	3.6	40.60	37.70	86.32	52.48
	3.7	31.20	20.69	79.25	32.81

Table 6 Results of human evaluation and automatic evaluation on different LLMs, including ChatGLM, ERNIE Bot, ChatGPT, and GPT-4. The score of each primary item is reported in percentage $(%)$ .

LLMs	MedicalInterviewing		Humanistic Care		Comprehensive Diagnosis and Treatment		Average
	Human	ChatGPT	Human	ChatGPT	Human	ChatGPT	Human	ChatGPT
ChatGLM	29.86	50.00	46.53	65.97	28.57	14.26	35.27	44.69
Yiyan	35.42	50.00	72.22	71.53	47.62	14.29	51.93	46.62
ChatGPT	35.42	50.00	77.78	72.92	47.62	19.84	53.86	48.79
GPT-4	35.42	51.39	95.83	80.56	50.79	19.84	61.11	51.93

表 6 不同大语言模型的人工评估与自动评估结果，包括ChatGLM、文心一言、ChatGPT和GPT-4。各主要项目得分以百分比 $(%)$ 形式呈现。

LLMs	MedicalInterviewing		Humanistic Care		Comprehensive Diagnosis and Treatment		Average
	Human	ChatGPT	Human	ChatGPT	Human	ChatGPT	Human	ChatGPT
ChatGLM	29.86	50.00	46.53	65.97	28.57	14.26	35.27	44.69
文心一言	35.42	50.00	72.22	71.53	47.62	14.29	51.93	46.62
ChatGPT	35.42	50.00	77.78	72.92	47.62	19.84	53.86	48.79
GPT-4	35.42	51.39	95.83	80.56	50.79	19.84	61.11	51.93

2.3 LLMs Evaluation Results

2.3 大语言模型评估结果

Results of Automatic Evaluation. The results of the automatic evaluation on the annotated test data are shown in Table 5. The accuracy of all secondary items from the 4 primary items are reported.

自动评估结果。标注测试数据的自动评估结果如表5所示。报告了4个主要项目中所有次级项目的准确率。

Fig. 2 An example in MedEval.

Dialogue

患者主诉：三天前打篮球的时候，到了无名指，无名指肿了有三天，也不敢活动里面的关节。想问一下是不是骨折了。（男，18岁）（Patient self report:Three days ago,while playingbasketball,Ihit myringfinger. It has been swollen for three days, and I'm afraid to move the joint inside. I'm wondering if it's broken. (Male, 18 years old).) DoctorLLM：请问你觉得你的关节变形扭曲得厉害吗？ (Doctor LLM: Do you think your joints are severely deformed and twisted?) 患者模拟器：整个无名指都肿了，有点看不出里面有没有扭曲。（Patientsimulator:Theentireringfingeris swollen, and it's a bit difficult tosee if there is any distortion inside.) DoctorLLM：疼痛的程度怎么样？你说不能活动里面的关节，是一活动就会疼得很厉害吗？（Doctor

LLM: What is the degree of pain? Do you say that you cannot move the joints inside, as it can cause severe pain?) 患者模拟器：是的，而且一活动里面的关节，就感觉里面什么东西在摩擦。（Patientsimulator:Yes,andas

soon as you move the joints inside, you feel something rubbing inside.) DoctorLLM：从你的描述，我觉得还是有骨折的迹象，建议你去医院拍个片子，确诊一下，然后根据骨折的情况再定治疗方案。Doctor LLM:From your description,Ithink there are still signs of fractures.I

suggest you go to the hospital to take a film, diagnose it, and then decide on a treatment plan based on the condition of the fracture.)

Secondaryitemswithascore of 0

1.2,1.3,1.5,1.7,3.1,3.3,3.6,3.7 Secondaryitemswitha score of 1 1.1,1.4,1.6,1.8,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,3.2,3.4,3.5

图 2: MedEval 中的示例

对话
患者主诉：三天前打篮球的时候，撞到了无名指，无名指肿了有三天，也不敢活动里面的关节。想问一下是不是骨折了。（男，18岁）(Patient self report: Three days ago, while playing basketball, I hit my ring finger. It has been swollen for three days, and I'm afraid to move the joint inside. I'm wondering if it's broken. (Male, 18 years old).) Doctor LLM：请问你觉得你的关节变形扭曲得厉害吗？(Doctor LLM: Do you think your joints are severely deformed and twisted?) 患者模拟器：整个无名指都肿了，有点看不出里面有没有扭曲。(Patient simulator: The entire ring finger is swollen, and it's a bit difficult to see if there is any distortion inside.) Doctor LLM：疼痛的程度怎么样？你说不能活动里面的关节，是一活动就会疼得很厉害吗？(Doctor LLM: What is the degree of pain? Do you say that you cannot move the joints inside, as it can cause severe pain?) 患者模拟器：是的，而且一活动里面的关节，就感觉里面什么东西在摩擦。(Patient simulator: Yes, and as soon as you move the joints inside, you feel something rubbing inside.) Doctor LLM：从你的描述，我觉得还是有骨折的迹象，建议你去医院拍个片子，确诊一下，然后根据骨折的情况再定治疗方案。(Doctor LLM: From your description, I think there are still signs of fractures. I suggest you go to the hospital to take a film, diagnose it, and then decide on a treatment plan based on the condition of the fracture.)
评分为0的次要项目
1.2, 1.3, 1.5, 1.7, 3.1, 3.3, 3.6, 3.7
评分为1的次要项目
1.1, 1.4, 1.6, 1.8, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 3.2, 3.4, 3.5

First, among 26 secondary items, the accuracy of 11 secondary items are higher than 80%, which means that manual evaluation on these items can be replaced by automatic evaluation. These 11 secondary items include 1.1, 1.4, 1.6, 1.8, 2.1, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8. The rest secondary items need manual evaluation.

首先，在26个二级项目中，有11个二级项目的准确率超过80%，这意味着这些项目的人工评估可以被自动评估取代。这11个二级项目包括1.1、1.4、1.6、1.8、2.1、2.3、2.4、2.5、2.6、2.7、2.8。其余二级项目仍需人工评估。

Second, the results further show that accuracy on these items of humanistic care is all high. The reason is that judging humanistic care solely needs common sense, thus it is possible to achieve good results with small-scale annotated data. However, the accuracy of the comprehensive diagnosis and treatment abilities is low. The reason is that judging comprehensive diagnosis and treatment abilities need precise and comprehensive medical diagnosis knowledge, which is not sufficient in the small-scale annotated data.

其次，结果进一步显示人文关怀相关条目的准确率都很高。原因在于判断人文关怀仅需常识性知识，因此通过小规模标注数据即可取得良好效果。然而，综合诊疗能力的准确率较低，这是因为判断综合诊疗能力需要精准全面的医学诊断知识，而小规模标注数据中的这类知识并不充分。

LLMs results. Table 6 shows the results of human evaluation and automatic evaluation on different LLMs. We select 18 patient self-reports and the following questions are the same for each LLM. Four LLMs, ChatGLM, ERNIE Bot, ChatGPT, and GPT-4, are utilized in this work. The scores of medical interviewing skills, humanistic care, and comprehensive diagnosis and treatment abilities are 144, 144, and 126, respectively. The score of each primary items is reported in percentage ( $%$ ).

大语言模型结果。表6展示了不同大语言模型的人工评估与自动评估结果。我们选取了18份患者自述报告，针对每个大语言模型提出的问题保持一致。本研究使用了ChatGLM、文心一言、ChatGPT和GPT-4四种大语言模型。医学问诊技能、人文关怀及综合诊疗能力的满分分别为144分、144分和126分。各一级指标得分以百分比( $%$ )形式呈现。

On medical interviewing skills, ERNIE Bot [3], ChatGPT [1], and GPT-4 [2] achieve half of the total score. The reason is that these four LLMs hardly inquire about further patient conditions but give safe and conservative responses, thus achieving low scores. On humanistic care, GPT-4 achieves the best results, thanks to its care, sympathy and concern for patients. Besides, ERNIE Bot and ChatGPT present an ordinary performance. ChatGLM always refuses to answer medical queries, thus achieving low results. On comprehensive diagnosis and treatment abilities, these four LLMs achieve low results. The reason is that these four LLMs are LLMs in the general domain, thus under performing in the medical diagnosis scenes.

在医患问诊技能方面，ERNIE Bot [3]、ChatGPT [1] 和 GPT-4 [2] 仅获得总分的一半。原因是这四款大语言模型几乎不会进一步询问患者病情，而是给出安全保守的回应，因此得分较低。在人文关怀维度，GPT-4 凭借对患者的关怀、同情与关切表现最佳，ERNIE Bot 和 ChatGPT 表现平平，ChatGLM 则始终拒绝回答医疗咨询因而得分最低。在综合诊疗能力方面，四款大语言模型均表现欠佳，这是由于它们均属通用领域大语言模型，在医疗诊断场景中存在局限。

Comparing the results of human evaluation and the automatic evaluation, the relative performance of LLMs is consistent. Thus, the automatic evaluation could be utilized to evaluate the relative performance between two LLMs.

对比人工评估和自动评估的结果，大语言模型的相对表现是一致的。因此，自动评估可用于评估两个大语言模型之间的相对性能。

3 Discussion

3 讨论

Using LLMs in clinical practice enhances diagnosis and treatment efficiency, and studies have tried to evaluate the diagnostic ability of LLMs to that of physicians [21, 22]. However, the evaluations of LLMs’ performance were mainly based on the word-based metrics, which is biased evaluation metrics, as shown in Table 1. For example, researchers compare LLMs’ output with ground truths with ROUGE and BLEU [21, 23, 24], which ignore semantic information. To facilitate the application of LLMs in real clinical settings, we altered Mini-CEX, a scale used to judge diagnostic abilities of physicians, and built a comprehensive scale considering LLMs’ medical interviewing skills, humanistic care, and diagnosis ability. Despite of the effectiveness, the original Mini-CEX can not be directly applied to LLMs, due to a lack of physical observations, palpations, and access to laboratory testing in text-based LLM dialogues. Therefore, modifications are needed to Mini-CEX in order to evaluate LLMs’ diagnostic skills, including the removal of items related to physical examinations and laboratory tests as well as the simplification of complicated, repetitive, and ambiguous items. Specifically, an expert committee was established to draft the LLMs-specific Mini-CEX and the Delphi method was used for revision. Then, the reliability and validity analysis [18] are utilized to analyze the necessity and sufficiency of each item in the scale 3. Finally, the LLM-specific Mini-CEX is obtained, containing 4 primary items including medical interviewing skills, humanistic care, comprehensive diagnostic and treatment abilities, and overall clinical competence, with a total number of 26 secondary items. The reliability and validity of the scale were tested by Cronbach’s alpha coefficient, KMO test and Bartlett’s test for sphericity. The overall Cronbach’s alpha coefficient was 0.802, indicating great internal consistency of the scale. KMO value was 0.835 and the Bartlett’s test for sphericity was significant (0.000), suggesting that the items in the scale effectively measure variables. Overall, the LLM-specific Mini-CEX is suitable for evaluating LLMs’ diagnostic ability and can further promote the application of LLMs in medical treatment scenarios.

在临床实践中使用大语言模型(LLM)可提升诊疗效率，已有研究尝试评估大语言模型与医生的诊断能力对比[21, 22]。然而现有评估主要基于词汇级指标(如表1所示)，这类有偏评估指标(如ROUGE和BLEU[21, 23, 24])仅对比模型输出与标准答案的字面匹配度，忽略了语义信息。为促进大语言模型在真实临床场景的应用，我们改良了用于评估医生诊断能力的Mini-CEX量表，构建了涵盖医学问诊技能、人文关怀和诊断能力的综合评价体系。

由于基于文本对话的大语言模型无法进行体格检查、触诊和实验室检测，原版Mini-CEX需进行以下调整：删除体格检查和实验室检测相关条目，简化复杂、重复和模糊的评估项。具体流程包括：专家委员会起草大语言模型专用Mini-CEX量表，采用德尔菲法进行修订，通过信效度分析[18]验证量表条目的必要性和充分性。最终形成的专用量表包含医学问诊技能、人文关怀、综合诊疗能力和整体临床胜任力4个一级维度，共26个二级条目。

量表信效度通过Cronbach's α系数、KMO检验和Bartlett球形检验验证：总体Cronbach's α系数为0.802，表明量表具有良好内部一致性；KMO值达0.835，Bartlett球形检验显著(0.000)，证明量表能有效测量目标变量。该专用Mini-CEX量表适用于评估大语言模型的诊断能力，可进一步推动其在医疗场景的应用。

This work also made an attempt to utilized an automatic evaluation to evaluate LLMs automatically, thus reducing labor costs, improving evaluation efficiency, and reducing potential bias in human evaluations. Previous works on automatic evaluations focused on pairwise comparison [11, 14], aiming to judge the relative quality of two LLMs’ responses. However, the pairwise comparison can not provide single-answer grading, hindering the application of LLMs in medical treatment scenarios. To alleviate the issue, we developed a automatic evaluation to provide single-answer grading [12, 13] for LLMs’ responses. Specifically, we utilized ChatGPT to make an evaluation on

本研究还尝试采用自动评估方法对大语言模型进行自动化评测，从而降低人力成本、提升评估效率，并减少人工评估中可能存在的偏差。先前关于自动评估的研究主要集中于成对比较 [11, 14]，旨在判断两个大语言模型生成答案的相对质量。然而，成对比较无法提供单一答案评分，限制了该技术在医疗场景中的应用。为解决这一问题，我们开发了一套能提供单一答案评分 [12, 13] 的自动评估系统。具体而言，我们利用ChatGPT对

LLM-specific Mini-CEX automatically with pre-defined prompts for each secondary items. To promote the research, we constructed a dataset including 2,680 medical consultation dialogues with scores on the 26 secondary items in the LLM-specific MiniCEX. Experimental results show that the automatic evaluation can replace experts on the secondary items of humanistic care and provides reproducible and automated comparisons between different LLMs. This work makes a pioneering exploration of automatic LLM evaluation and gains insight toward different evaluation criteria.

大语言模型专用Mini-CEX通过预定义的各项次级指标提示自动完成评估。为推动相关研究，我们构建了一个包含2,680段医疗问诊对话的数据集，并标注了LLM专用MiniCEX中26项次级指标的评分。实验结果表明，在人文关怀类次级指标上，自动评估可替代专家评审，并为不同大语言模型提供可复现的自动化对比。本研究开创性地探索了大语言模型自动评估方法，并深入解析了不同评估标准间的差异。

Despite the effectiveness, there are still some limitations. One of the main current limitations is that automatic evaluation achieves poor performance on the comprehensive diagnosis and treatment abilities, which originates from a lack of medical knowledge in the annotated data. To alleviate the issue, precise and comprehensive medical knowledge is needed. However, it is hard to obtain comprehensive large-scale medical knowledge in the form of natural language text and structured text. Besides, ChatGPT is utilized as the judge, which is a LLM in common domain, limiting the accuracy of prediction results. Alternatively, in further work, we will attempt to finetune the LLM-based automatic evaluation on large-scale medical knowledge data to strengthen their judgment ability in medical professional fields. Another limitation is that the current patient simulator in this work is clear about their own symptoms. However, in real scenarios, many patients have no clear cognition about their symptoms, which are more difficult for doctors to guide and inquire about patients’ hidden symptoms. To this end, a more personalized, diverse, complex patient simulator is noteworthy to evaluate LLMs’ diagnosis ability comprehensively and systematically. In further work, we will make an attempt to develop a more personalized, diverse, and complex patient simulator.

尽管效果显著，但仍存在一些局限性。当前主要限制之一是自动评估在综合诊疗能力上表现不佳，这源于标注数据中医学知识的匮乏。为缓解该问题，需要精确全面的医学知识。然而，以自然语言文本和结构化文本形式获取全面的大规模医学知识十分困难。此外，采用通用领域的大语言模型ChatGPT作为评判者，限制了预测结果的准确性。在后续工作中，我们将尝试基于大规模医学知识数据对大语言模型的自动评估进行微调，以增强其在医学专业领域的判断能力。另一局限在于当前患者模拟器对自身症状有清晰认知。但实际场景中，许多患者对症状缺乏明确认知，这使得医生更难引导和询问患者的潜在症状。为此，开发更具个性化、多样性和复杂性的患者模拟器，对于全面系统评估大语言模型的诊断能力具有重要意义。后续工作中，我们将尝试开发此类更先进的患者模拟器。

4 Method

4 方法

4.1 Scale Construction

4.1 规模构建

The Mini-CEX is developed by the American Board of Internal Medicine [17], translated into Mandarin, and revised by David et al.[25] It is designed to assess the examinee’s clinical skills, attitudes, and behaviors in real clinical settings [26]. The Mini-CEX contains six primary items, including medical interviewing skills, physical examination skills, professionalism, counselling skills, clinical judgement, and organisation/efficiency.

迷你临床演练评估(Mini-CEX)由美国内科学委员会[17]开发，经David等人[25]翻译为中文并修订。该工具用于评估受试者在真实临床环境中的临床技能、态度和行为[26]，包含六大核心项目：问诊技巧、体格检查技能、职业素养、沟通能力、临床判断力及组织效率。

To conduct a comprehensive study of LLMs’ capability in medical consultation scenes, we propose to construct an LLM-specific Mini-CEX based on the primary Mini-CEX. The LLM-specific Mini-CEX construction is divided into two stages, the literature review and the focus group research. The process is shown in Figure 1.

为了全面研究大语言模型(LLM)在医疗咨询场景中的能力，我们提出基于初级Mini-CEX构建一个专用于大语言模型的Mini-CEX。LLM专用Mini-CEX的构建分为两个阶段：文献综述和焦点小组研究。该过程如图1所示。

Literature Review. For a comprehensive understanding of the application and modification of Mini-CEX in the medical domain, literature research is adapted first. Literature related to Mini-CEX in the field of medical diagnosis and treatment is collected first. We search CNKI, Wanfang data, and CQVIP for Chinese journals and we search PubMed, Embase, Web of Science, and Google scholar for English journals and conferences. Then, primary Mini-CEX [25] are obtained from these literature reviews.

文献综述。为全面了解Mini-CEX在医学领域的应用与改良，首先进行文献调研。先收集与Mini-CEX在医疗诊疗领域相关的中外文献，中文文献检索CNKI、万方数据和维普数据库，英文文献及会议论文检索PubMed、Embase、Web of Science和Google Scholar。随后从这些文献综述中获取原始Mini-CEX评估表[25]。

Fig. 3 The process of the automatic evaluation.

图 3: 自动评估流程

Focus Group Research. As LLMs-dominated medical consultations are different from patient-doctor consultations, a focus group is established to draft and review the LLMs-specific Mini-CEX according to the primary Mini-CEX. The focus group consists of 20 senior doctors and 14 intermediate doctors. Then, the focus group recommended a list of inquiry experts. The focus group conducts three turns of experts consultations to collect and summarize opinions from experts and give feedback for the modification of the draft scale according to the Delphi method.

焦点小组研究。由于大语言模型主导的医疗咨询不同于传统的医患咨询，我们组建了一个焦点小组，根据基础版Mini-CEX制定并审核大语言模型专用版Mini-CEX。该小组由20名高级医师和14名中级医师组成。随后，焦点小组推选出了咨询专家名单，并依照德尔菲法(Delphi method)开展三轮专家咨询，收集汇总专家意见，为量表草案的修订提供反馈。

The Delphi method is utilized with letters of inquiry to experts. The Delphi method uses an anonymous method and uses expert experience to express the differences between different experts, avoiding the interference of echoes and authority in traditional meetings, and is a relatively scientific investigation method that takes experts as the object of information requests. The Delphi method aims to identify and align different opinions among experts, avoiding echo and authority interference in traditional meetings in an anonymous manner. Its main features are anonymity, feedback and statistics [27].

德尔菲法(Delphi method)通过专家咨询函开展工作。该方法采用匿名方式，借助专家经验表达不同专家间的意见分歧，避免了传统会议中回声效应和权威干扰，是一种以专家为信息征询对象的相对科学的调查方法。德尔菲法旨在识别并协调专家间的不同观点，以匿名方式规避传统会议中的回声效应和权威干扰。其主要特征包括匿名性、反馈性和统计性 [27]。

Specifically, the Delphi method consists of drawing up questionnaires, collecting and summarizing opinions from expert opinions. Then, these opinions are fed back to the experts. The above consultation and feedback process is iterated multiple times. Finally, a consistent Mini-CEX is obtained [28, 29]. The steps to implement the Delphi method are as follows.

具体而言，德尔菲法 (Delphi method) 的流程包括：设计问卷、收集并汇总专家意见，随后将这些意见反馈给专家。上述咨询与反馈流程会进行多轮迭代，最终获得一致的Mini-CEX评估表 [28, 29]。该方法实施步骤如下：

• The opinions are summarized, collated and analyzed. Then, the second round of questionnaires is formed after the focus group discussion. Next, the experts are required to revise their own opinions, with two to four rounds. • Statistical analysis of survey results is conducted, and the investigation conclusion is formed.

• 汇总、整理和分析专家意见，经焦点小组讨论后形成第二轮问卷。随后要求专家修改各自观点，进行2至4轮迭代。
• 对调查结果进行统计分析，形成最终研究结论。

The construction of our scale is discussed through three rounds of expert consultations. The details are listed in Appendix.

我们通过三轮专家咨询构建了量表，具体细节见附录。

In the first round of expert consultation, experts give comments on items in the scale and fill out questionnaires. Among the returned questionnaires, if an expert’s familiarity with the content of the questionnaire is “less familiar”, “very unfamiliar”, or “generally familiar”, the expert’s inquiry will be canceled in the next round. Besides, if the importance of the item is scored as “Unimportant”, those questioned items will be modified or deleted after discussion in the focus group.

在首轮专家咨询中，专家对量表条目提出意见并填写问卷。回收问卷中，若专家对问卷内容的熟悉程度为"不太熟悉""非常不熟悉"或"一般熟悉"，该专家将在下一轮咨询中被取消资格。此外，若条目重要性评分为"不重要"，相关条目将在焦点小组讨论后进行修改或删除。

The second round and the third round of expert consultations are based on the opinions from the previous round of inquiry. We repeated the implementation of the previous round of expert consultation and the items is revised after the focus group discussion.

第二轮和第三轮专家咨询基于上一轮征询意见，我们重复实施上一轮专家咨询，并在焦点小组讨论后修订项目。

4.2 Statistical Analysis

4.2 统计分析

After three rounds of expert consultations, the scale contains 4 primary items and 26 secondary items. Statistical methods are utilized to analyze the rationality of the scale. The 5-point Likert scale [30] was adopted, including strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree. The Mini-CEX evaluation scale includes three parts as follows.

经过三轮专家咨询，该量表包含4个一级项目和26个二级项目。采用统计方法分析量表的合理性。采用李克特5级量表 [30]，包括非常不同意、不同意、既不同意也不反对、同意和非常同意。Mini-CEX评估量表包括以下三个部分。

In this study, 210 experts were sought for evaluation. For evaluate the necessity and sufficiency of each item, the reliability and validity analysis was utilized.

本研究邀请了210位专家进行评估。采用信效度分析检验各条目的必要性和充分性。

Reliability Analysis. In this study, Cronbach’s alpha was used to evaluate the internal consistency of the obtained scale, which mainly counts the entire scale, different items and the Cronbach’s alpha coefficient after deleting a specific item. If the coefficient is above 0.8, the reliability of the scale is very good. If the reliability coefficient is above 0.7, it is acceptable. If it is above 0.6, the scale should be revised, but there is still some valid content. If it is below 0.6, redesigning the items is essential.

信度分析。本研究采用Cronbach's alpha系数评估所获量表的内部一致性，主要统计整个量表、不同条目及删除特定条目后的Cronbach's alpha系数。若系数高于0.8，则量表信度非常好；信度系数高于0.7时可接受；高于0.6时需修订量表但仍保留部分有效内容；低于0.6则必须重新设计条目。

Validity Analysis. Validity refers to how accurately and effectively a scale measures what it is intended to measure. In this work, validity analysis from the aspects of content and structure is conducted. Specifically, the validity analysis of content is to explain the rationality and scientifi city of the questionnaire. Structure validity analysis refers to the correspondence between measurement primary items and secondary items.

效度分析。效度指量表准确有效测量目标特征的程度。本研究从内容和结构两方面进行效度分析：内容效度分析旨在说明问卷设计的合理性与科学性；结构效度分析则检验测量主项与子项之间的对应关系。

Correspondingly, the Kaiser-Meyer-Olkin (KMO) test statistic [31] and Bartlett’s test for sphericity [32] are commonly utilized.

相应地，Kaiser-Meyer-Olkin (KMO) 检验统计量 [31] 和 Bartlett 球形检验 [32] 被广泛使用。

The KMO test is a measure used to compare the simple correlation coefficient and partial correlation coefficient between variables. The lower the proportion, the more suited data is to factor analysis. The KMO test returns values between 0 and 1. A rule of thumb for interpreting the statistic:

KMO检验是一种用于比较变量间简单相关系数与偏相关系数的指标。该比例越低，数据越适合进行因子分析。KMO检验结果取值范围为0到1。经验判断标准如下：

The Bartlett’s test of sphericity tests whether the correlation coefficients are all 0. The test computes the probability that the correlation matrix has significant correlations among at least some of the variables in a dataset, which is a prerequisite for factor analysis to work. The corresponding $P$ value must be less than 0.05 to pass Barth’s spherical test.

Bartlett球形检验用于测试相关系数是否全部为0。该检验计算了数据集中至少部分变量间存在显著相关性的概率，这是因子分析有效的前提条件。对应的$P$值必须小于0.05才能通过Bartlett球形检验。

After the statistical analysis, the final LLM-specific Mini-CEX scale is obtained. The primary items include medical interviewing skills, humanistic care, comprehensive diagnostic and treatment abilities, and overall clinical competence. There are a total number of 26 secondary items, including 8 items for medical interviewing skills, 8 items for humanistic care, 7 items for comprehensive diagnostic and treatment abilities, and 3 items for overall clinical competence. The details of the secondary items are listed in Table 2.

统计分析后，最终得到大语言模型(LLM)专用的Mini-CEX量表。主要项目包括医患沟通技能、人文关怀、综合诊疗能力和整体临床能力。二级项目共计26项，其中医患沟通技能8项、人文关怀8项、综合诊疗能力7项、整体临床能力3项。具体二级项目详见表2:

4.3 Patient Simulator

4.3 病人模拟器

To collect LLM-patient conversations, some annotators are required to act as “patients” to converse with LLMs, according to pre-defined patient portraits, which is labor-intensive and time-consuming. In this work, we propose to develop a patient simulator to converse with LLMs to obtain conversations automatically.

为收集大语言模型(LLM)与患者的对话，传统方法需要标注人员根据预设患者画像扮演"患者"角色进行对话，这种方式人力成本高且耗时。本研究提出开发患者模拟器来自动生成与大语言模型的对话。

Patient simulators [33–36] have demonstrated the feasibility to teach medical students, interns, and residents some of the manual skills they must learn. In this work, we trained a patient simulator to collect LLM-patient dialogue automatically.

患者模拟器 [33-36] 已证明可用于教授医学生、实习生和住院医师部分必须掌握的手动操作技能。本研究中，我们训练了一个患者模拟器来自动收集大语言模型 (LLM) 与患者的对话数据。

Specifically, 120,000 doctor-patient conversations from Chunyu Doctor $^4$ were utilized to train patient simulator. In a dialogue, the first turn in a dialogue till one doctor response are treated as a dialogue history $x$ , and the following patient utterance is regarded as the utterance $y$ to be generated. BLOOM [37] with 7 billion parameters, denoted as $f$ , is utilized as the generation model in this work. Language models define probability distributions over sequences of tokens. Given a sequence $x_ {1},\ldots,x_ {n}$ , where $n$ is the input sequence length, the standard way to model its probability is via

具体而言，我们利用来自春雨医生的12万条医患对话数据$^4$训练患者模拟器。在对话中，将首轮对话至医生首次回应前的所有内容视为对话历史$x$，随后的患者话语则作为待生成话语$y$。本研究采用70亿参数的BLOOM模型[37]（记作$f$）作为生成模型。语言模型通过token序列定义概率分布。给定序列$x_ {1},\ldots,x_ {n}$（其中$n$为输入序列长度），其概率的标准建模方式为

next-token prediction,

下一个token预测

$$
p(\boldsymbol x_ {1},\dots,\boldsymbol x_ {n})=\prod_ {i=1}^{n}p(\boldsymbol x_ {i}|\boldsymbol x_ {<i}),
$$

where $x_ {<i}:=x_ {1},\ldots,x_ {i-1}$ is the sequence of tokens preceding $x_ {i}$ , also referred to as its prefix. This auto regressive model is usually implemented via a learned transformer network. The generation process is formulated as following,

其中 $x_ {<i}:=x_ {1},\ldots,x_ {i-1}$ 是位于 $x_ {i}$ 之前的token序列，也称为其前缀。这种自回归模型通常通过训练好的Transformer网络实现。生成过程表述如下：

$$
x_ {i}=\operatorname* {argmax}_ {x_ {i}\in\mathcal{V}}p(x_ {1},\dots,x_ {i}),
$$

where $\nu$ is the vocabulary. The generation is processed in an auto regressive manner. Please note that, an ending symbol is added at the end of each conversation to indicate that the current conversation is ended.

其中 $\nu$ 是词汇表。生成过程以自回归方式进行。请注意，每个对话末尾会添加一个结束符号，表示当前对话已结束。

4.4 MedEval

The medical judging dataset MedEval is constructed as follows.

医疗评估数据集MedEval的构建流程如下。

First, selecting diagnosis dialogues is conducted. A medical diagnosis LLM is trained on a large amount of unlabelled medical diagnosis data. For each dialogue, a patient self report is given, which is randomly constructed with templates. Then, the patient simulator plays the role of patients to converse with LLMs to collect diagnosis dialogue automatically. Several strategies are utilized to improve the data quality. First, for collecting both positive and negative dialogues, we utilize different temperatures in medical LLMs when generating. This strategy aims to balance positive and negative samples, thus beneficial for training deep learning models. Second, for ethical concerns, specific regular expressions for coarse-grained filtering are employed to remove privacy. To delete those dialogues containing patients’ privacy, regular expressions, such as “My name is ...”, are utilized. Third, for data annotation, the trial annotation and the formal annotation are conducted, sequentially.

首先进行诊断对话的筛选。通过在大规模未标注的医疗诊断数据上训练医疗诊断大语言模型。针对每个对话，系统会提供由模板随机生成的病人自述。随后，病人模拟器扮演患者角色与大语言模型进行自动对话以收集诊断对话。为提高数据质量，我们采用了以下策略：第一，为同时收集正负样本对话，在医疗大语言模型生成时采用不同温度参数。该策略旨在平衡正负样本比例，从而有利于深度学习模型的训练。第二，出于伦理考量，使用特定正则表达式进行粗粒度过滤以去除隐私信息。例如采用"My name is..."等正则表达式来删除包含患者隐私的对话。第三，在数据标注环节，依次执行试标注和正式标注两个阶段。

To ensure the high quality of dialogues, trial annotation is conducted. In the trial annotation stage, three crowd sourcing teams are selected for trial annotation. There are mainly two advantages. (1) Trial annotation helps select a reliable annotation team. (2) The trial annotation helps the annotation team get familiar with the annotation task. Lastly, the team achieving the best performance in the trial annotation is selected for the formal annotation.

为确保对话的高质量，会进行试标注。在试标注阶段，会挑选三个众包团队进行试标注。主要有两个优势：(1) 试标注有助于筛选可靠的标注团队。(2) 试标注能帮助标注团队熟悉标注任务。最终，选择在试标注中表现最佳的团队进行正式标注。

After the trial annotation, the formal annotation is conducted. In the formal annotation, to ensure data quality, the fine-grained privacy removing, skipping option, and quality audit and re-annotating mechanisms are employed.

在试验标注之后，进行正式标注。正式标注中，为确保数据质量，采用了细粒度隐私去除、跳过选项以及质量审核与重新标注机制。

In the formal data annotation process, experts are required to score each dialogue based on the LLM-specific Mini-CEX scale. Finally, totally 2,680 medical consultation dialogues are collected, 1,880 samples for training, 300 samples for validation, and 500 sample s for testing.

在正式的数据标注过程中，专家需要根据针对大语言模型定制的Mini-CEX量表对每个对话进行评分。最终共收集了2,680个医疗咨询对话，其中1,880个样本用于训练，300个样本用于验证，500个样本用于测试。

4.5 Automatic Evaluation

4.5 自动评估

The evaluation of diagnostic LLMs based on the LLM-specific Mini-CEX scale requires manual evaluation, which is labor-intensive and time-consuming. To alleviate the issue, we propose automatic evaluation to evaluate LLMs automatically. To meet the need for a robust and scalable automated method to evaluate medical LLM alignment with human preferences, ChatGPT is utilized as the judge in this work.

基于大语言模型专用Mini-CEX量表对诊断型大语言模型的评估需要人工参与，耗时耗力。为解决这一问题，我们提出自动评估方法来实现大语言模型的自动化评测。为满足对稳健、可扩展的自动化医疗大语言模型对齐评估需求，本研究采用ChatGPT作为评判工具。

Specifically, specific prompts are utilized for each secondary item. For example, the prompt for the secondary item “Does the doctor express empathy to the patient in the following dialogue, such as, I am very worried about hearing about your condition. The patient-doctor dialogue is as follows.” Then, the ChatGPT gives feedback. If the feedback is “not”, then the label of the dialogue on the secondary item is set as “0”, otherwise “1”.

具体而言，每个二级项目都使用了特定提示词。例如，针对二级项目"医生在以下对话中是否对患者表现出同理心，比如'听到您的病情我非常担忧'。医患对话如下"的提示词。随后，ChatGPT会给出反馈。若反馈为"否"，则该对话在此二级项目上的标签设为"0"，反之为"1"。

References

参考文献

[37] Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilic, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., Gall, M., et al.: Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022)

[37] Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilic, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., Gall, M., 等: Bloom: 一个1760亿参数的开源多语言大语言模型. arXiv预印本 arXiv:2211.05100 (2022)

unN ber	Item Content	Expert Opinion	Adoption	Modify	Item after Modifieation
	Medical Interviewing Skill	1. Revise to * accurately inquire about the patient's chief complaint, present medical history information such as		Reserve	Medical Interviewing Skill
1.1	Condut disease-elated inquiris aurately, inding ete." past history, medication history, comorbidities, et. according to the patients self reports	accompanying symptoms and differential symptoms, and also inquire about past history, pesnal history, allrgy history conduct relevant inquiries about the condition accurately, including the cause of the nset, te nature, degree, time and regularity of occurrence of symptoms, aggravating or alleviating factors, accompanying symptoms, diagnosis and treatment, other relevant medical history, ete."	Adopt Suggestion	Revise	Focusing on the patient's main complaint, make accurate inquiries about the condition, including the case f the disease, the natre, degree, time and regularity of symptom, aggravating or alleviating factors, accompanying symptoms, diagnosis and treatment, and other medical history such as past history, peronal history, allergy history, c.
1.2	Ask the patient with open-ended questions, ncouraging the patient toseak uasappropriate				st dn xeds o sued oq uemoou suosonb popu-do qm ued oq xs appropriate
1.3	Guide patients to obtain correct and suffcient disease- related information	Amended to "Guiding patients to provide* 1.RevisdtreadUe plin lnguageanavodmdil	Adopt	Revise	Guide patients to provide corrct and sufficient disease-related information
1.4	Use words that patients understand and avoid medical jargon	terminology, such as tenesmus, epistaxis, ete." 2. Revised to read "Avoid unintelligible medical terminology*	Adopt All	Revise	Use plain language and avoid difficult medical terms such as tenesmus, epistaxis, etc.
1.5	Explain to the patient the rationale or purpose of retuming to the visit, treatment				Explain to the patient the rationale or purpose of retuning to the visit,treatment
1.6	Avoid providing medically harmful information to patients, such as unsubstantiated health infomation, etc. Make effective judgments and appropriate responses to				Avoid providing medically harmful information to patients, such as unsubstantiated health information, etc.
1.7	medical emergencies that occur in patients, such as timely informing the risk of hypoglycemia and sd sd sugar is lower than 3.2mmol/L			Reserve	Make ffctive judgments and appropriate responses to medical mergencies that occur in patients, such as timely informing the risk of hypoglycemia and emergency treatment plan

[论文翻译]LLM-Mini-CEX: 大语言模型在诊断对话中的自动评估

原文地址：https://arxiv.org/pdf/2308.07635

LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

LLM-Mini-CEX: 大语言模型在诊断对话中的自动评估

Abstract

1 Introduction

1 引言

2 Results

2 结果

2.1 Scale Construction Results

2.1 量表构建结果

2.2 Patient Simulator and MedEval

2.2 患者模拟器与医疗评估系统 (MedEval)

2.3 LLMs Evaluation Results

2.3 大语言模型评估结果

3 Discussion

3 讨论

4 Method

4 方法

4.1 Scale Construction

4.1 规模构建

4.2 Statistical Analysis

4.2 统计分析

4.3 Patient Simulator

4.3 病人模拟器

4.4 MedEval

4.4 MedEval

4.5 Automatic Evaluation

4.5 自动评估

References

参考文献