MedGPTEval: A Dataset and Benchmark to Evaluate the Responses of Large Language Models in Medicine
Jie Xu1, Lu Lu1, Sen Yang1, Bilin Liang1, Xinwei Peng1, Jiali Pang1, Jinru Ding1, Xiaoming Shi1, Lingrui Yang2, Huan Song3,4, Kang Li3,4, Xin Sun2, Shaoting Zhang1*
1Shanghai Artificial Intelligence Laboratory, Shanghai, China
2Clinical Research and Innovation Unit (CRIU), Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
3West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
4Med-X Center for Informatics, Sichuan University, Chengdu, China
Abstract
Background: Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs’ responses create significant safety risks, potentially threatening patients’ physical safety. Thus, to perceive and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and establish a systematic evaluation.
Objective: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical datasets in Chinese, and publicly available benchmarks.
Methods: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized by 5 experts in medicine and engineering using a Delphi method. Third, 3 clinical experts designed a set of medical datasets to interact with LLMs. Finally, benchmarking experiments were conducted on the datasets. The responses generated by chatbots based on LLMs were recorded for blind evaluations by 5 licensed medical experts. The obtained evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical datasets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT, by OpenAI; ERNIE Bot, by Baidu, Inc.; and Doctor PuJiang (Dr. PJ), by the Shanghai Artificial Intelligence Laboratory.
Results: Dr. PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogue and case report scenarios. Dr. PJ also outperformed ChatGPT in the semantic consistency rate and the complete error rate, indicating better robustness. However, Dr. PJ had slightly lower scores in medical professional capabilities compared with ChatGPT in the multiple-turn dialogue scenario.
Conclusions: MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source datasets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr. PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, this assessment system can be readily adopted by researchers in the community to augment the open-source datasets.
1 Introduction
The development of large language models (LLMs) has revolutionized natural language processing, raising significant interest in LLMs as a solution for complex tasks such as instruction execution and elaborate question answering in various domains (Sezgin et al., 2022). Among these domains, the medical field has received significant attention because of its actual demands. Recently, progress has been achieved in medical education (Anders, 2023), patient care management (Schulman et al., 2022), medical exams (Levine et al., 2023), and other medical applications.
Despite their capabilities, LLMs are prone to generating hallucinations (not fully reliable responses) (Lee et al., 2023; Hagendorff et al., 2022). Hallucinations in LLMs’ responses create significant safety risks, potentially threatening patients’ physical safety and leading to serious medical malpractice. Thus, to perceive and prevent this safety risk, we must conduct an exhaustive evaluation of LLMs in the medical domain and establish a systematic evaluation.
However, conducting an exhaustive evaluation for LLMs is nontrivial. First, LLMs lack robustness; that is, their performance is highly sensitive to prompts. White et al. (2023) showed that a meticulously crafted and thoroughly tested prompt could greatly improve performance and produce superior results. Thus, the robustness of LLMs must be evaluated through in-depth research. Second, the evaluation criteria of LLMs are critical. Recent evaluations have been mainly based on automatic metrics (Kung et al., 2023; Balas and Ing, 2023; Johnson et al., 2023) (e.g., BLEU, ROUGE, accuracy) in popular applications such as machine translation and text summarization. Despite their high efficiency, these automatic metrics are insufficient for assessing LLMs employed in real-world medical scenarios. Other factors, such as the logical coherence of responses, social characteristics like tone, and the ability to understand contextual information, are also essential (Hagendorff et al., 2022; Hu et al., 2018; Liang and Li, 2021; Holmes et al., 2023; Chaves and Gerosa, 2021; Yang et al., 2023; Tan et al., 2023; West, 2023).
To conduct an exhaustive study, we developed a comprehensive assessment system, MedGPTEval, composed of criteria, medical datasets in Chinese, and publicly available benchmarks. First, 5 interdisciplinary experts in medicine and engineering summarized existing criteria, based on a comprehensive literature review on the assessment of medical applications. Second, these candidate criteria were optimized using a Delphi method. The obtained evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. Third, 3 clinical experts designed a set of medical datasets to interact with LLMs, including 27 medical dialogues and 7 case reports in Chinese. Finally, benchmarking experiments were conducted on the datasets. The responses generated by the LLMs were recorded for blind evaluation by 5 licensed, practicing medical experts.
In the benchmarking experiments, 3 LLM-based chatbots were selected for evaluation. 1) ChatGPT, an LLM created by OpenAI, has gained global popularity owing to its exceptional language capabilities (Anders, 2023). However, ChatGPT has not been specifically trained for the medical domain (King, 2023). 2) ERNIE Bot is an LLM developed by Baidu, Inc., a Chinese computer technology company. It has been primarily trained on Chinese text and predominantly supports the Chinese language for general purposes. 3) Doctor PuJiang (Dr. PJ) is an LLM created by the medical research group of the Shanghai Artificial Intelligence Laboratory. Dr. PJ has been trained on massive Chinese medical corpora and supports various application scenarios, such as diagnosis, triage, and medical question answering. Note that ChatGPT and ERNIE Bot are general-purpose conversational artificial intelligence (AI) systems, while Dr. PJ is an LLM fine-tuned specifically for medical use. To promote research on medical LLM evaluation, we conducted benchmarking experiments on the proposed medical datasets in Chinese. Experimental results show that Dr. PJ outperformed ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue scenario (13.95 vs. 13.41 vs. 12.56) and the case report scenario (10.14 vs. 8.71 vs. 8.00).
The scale of the dataset remains limited. We urge researchers in this community to join this open project via email. MedGPTEval is open to researchers, that is, people affiliated with a research organization (in academia or industry), as well as to people whose technical and professional expertise is relevant to the social aspects of the project.
The contribution of this work is twofold:
• By conducting a thorough study of LLMs employed in the medical context and collaborating with domain experts, we established comprehensive evaluation criteria to assess the medical responses of LLMs;
• Based on the criteria, we released a set of open-source datasets for the evaluation of medical responses in Chinese and conducted benchmark experiments on 3 chatbots, including ChatGPT.
2 Methods
2.1 Evaluation Criteria
The evaluation criteria for assessing the LLMs were summarized through a thorough literature review. Then, the evaluation criteria were optimized using the Delphi method (Côrte-Real et al., 2019). The general process involved sending the criteria to designated experts in the field and obtaining their opinions on linguistic embellishment, ambiguity, and readability. After making generalizations and corrections, we provided anonymous feedback to each expert. This cycle of seeking opinions, refining focus, and giving feedback was repeated until a unanimous consensus was reached. A team of 5 interdisciplinary experts in medicine and engineering collaborated to determine the final evaluation aspects, specific details, and scoring standards. All members of the team held doctoral degrees in their specialties, with titles of associate professor or above, including 2 clinical medicine specialists, 2 computer specialists, and 1 medical management specialist.
2.2 Medical Datasets in Chinese
To apply the evaluation criteria, 3 licensed medical experts, each with over 10 years of extensive clinical experience, worked together to create a set of medical datasets in Chinese: a multiple-turn dialogue dataset and a case report dataset. The case report dataset involved a single round of questioning and encompassed an elaborate medical record of the patient, including age, gender, medical history (personal and familial), symptoms, medication history, and other relevant information. In addition, the medical problem consulted had to be clearly described. In contrast, the multiple-turn dialogue dataset was derived through an iterative process comprising 4 rounds: the initial round started with the patient’s symptoms, followed by supplementary descriptions of medication, examination, or other symptom-related queries. The multiple-turn dataset required careful consideration to assess contextual relevance.
2.3 Benchmark
The LLMs’ responses were recorded by an impartial programmer to ensure an unbiased evaluation. During the evaluation process, the sources of the LLMs’ responses were concealed from a different group of 5 clinical medical experts, all licensed practitioners. Then, the performances of the 3 LLMs (ChatGPT, ERNIE Bot, and Dr. PJ) were compared based on the assessment criteria outlined above and the proposed medical datasets. Both datasets were used to evaluate the medical and social capabilities of the LLMs, while the multiple-turn dialogue dataset was used to additionally assess their contextual abilities. The maximum scores available for the LLMs in the multiple-turn dialogue dataset and the case report dataset were 16 and 13, respectively, where a higher score indicated superior performance. Furthermore, the computational robustness of the LLMs was assessed using extended datasets derived from the multiple-turn dialogue dataset. Lastly, a subset of the case reports was randomly selected and comprehensively reviewed by 5 medical experts. The benchmark assessment methods are summarized in Table 1.
3 Results
3.1 Comprehensive Assessment Criteria
The draft evaluation criteria for assessing the LLMs were summarized by a thorough literature review (Hagendorff et al., 2022; Wei et al., 2023; White et al., 2023; Hu et al., 2018; Liang and Li, 2021; Holmes et al., 2023; Tan et al., 2023; West, 2023; Chaves and Gerosa, 2021) from 4 aspects: medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness. All 5 interdisciplinary experts made suggestions for fine-tuning the assessment method, and they reached a consensus using the Delphi method to make it more scientifically rigorous and easier to read (Côrte-Real et al., 2019).
3.1.1 Medical Professional Capabilities
The professional comprehensive capabilities of the LLMs’ answers were evaluated using 6 indicators (White et al., 2023; Liang and Li, 2021; West, 2023): 1) Accuracy, requiring that there are no medical errors in the answers and that the answers do not provide any harmful information to patients; accuracy can also include the evaluation of safety; 2) Informativeness, where a 3-point Likert scale was used to evaluate the informativeness of the answers (0 – incomplete, 1 – adequate, 2 – comprehensive); 3) Expansiveness, meaning that the answers contain useful information besides the medical knowledge included in the question; 4) Logic, with a 3-point Likert scale (0 – the answer is irrelevant to the topic, 1 – off-topic, the answer does not directly address the topic but is still relevant, 2 – on-topic, the answer addresses the topic directly and positively); 5) Prohibitiveness, where the LLMs correctly identify medical vocabulary or prohibited vocabulary; 6) Sensitivity, ensuring that the LLMs’ answers do not contain any politically sensitive expressions. Note that if the score for either accuracy or logic is 0, the score for the overall professional comprehensive capabilities is set to 0.
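The gating rule above lends itself to a compact sketch. The indicator names and score ranges follow the criteria described in this section; the function name and dictionary layout are illustrative assumptions, not part of MedGPTEval's released tooling:

```python
# Sketch of the medical-professional scoring rule described above.
# Indicator maxima follow the paper's criteria; the dict layout and
# function name are illustrative assumptions.

def score_medical_professional(ind: dict) -> int:
    """Sum the 6 medical-professional indicators, applying the gating
    rule: if accuracy or logic scores 0, the whole aspect scores 0."""
    if ind["accuracy"] == 0 or ind["logic"] == 0:
        return 0
    return (ind["accuracy"]          # 0-1: no medical errors, harmless
            + ind["informativeness"] # 0-2: 3-point Likert scale
            + ind["expansiveness"]   # 0-1: useful extra information
            + ind["logic"]           # 0-2: on-topic and direct
            + ind["prohibitiveness"] # 0-1: flags prohibited terms
            + ind["sensitivity"])    # 0-1: no politically sensitive text

perfect = {"accuracy": 1, "informativeness": 2, "expansiveness": 1,
           "logic": 2, "prohibitiveness": 1, "sensitivity": 1}
assert score_medical_professional(perfect) == 8  # aspect maximum

# An inaccurate answer zeroes the whole aspect, however informative:
assert score_medical_professional({**perfect, "accuracy": 0}) == 0
```

The aspect maximum of 8, together with 5 for social capabilities and 3 for contextual capabilities, accounts for the multiple-turn dialogue ceiling of 16.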
3.1.2 Social Comprehensive Capabilities
Table 1: Summary of benchmark assessment.

| Datasets | Assessment aspects | Assessment approaches |
|---|---|---|
| Medical Dialogue | medical professional capabilities, social comprehensive capabilities, contextual capabilities | maximum score of 16 |
| | computational robustness | percentage |
| Case Report | medical professional capabilities, social comprehensive capabilities | maximum score of 13 |
| | computational robustness | percentage |
| | comprehensive review | comments |

We conducted an overall evaluation of the social comprehensive performance using 4 indicators (Hagendorff et al., 2022; Hu et al., 2018; Liang and Li, 2021; Chaves and Gerosa, 2021): 1) Comprehension, where a binary scale is used to evaluate the readability of the answers (0 – awkward-sounding: all answers are professional and not explanatory, 1 – understandable: intuitive and easy to understand); 2) Tone, which pertains to the appropriate use of mood/tone in the responses generated by the LLMs, including the use of mood particles, symbols, emotional rhythm, and emotional intensity; 3) Empathy, where the accuracy of the scenario analysis is considered, including emotional understanding and reasoning; 4) Social decorum, using a 3-point Likert scale (0 – rude: not matching any friendly social keywords or displaying malicious language attacks, 1 – general: matching 1 to 2 keywords, 2 – graceful: matching 3 or more keywords).
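A minimal sketch of the social decorum rubric, assuming a hypothetical stand-in list of friendly keywords (the paper does not publish its exact lexicon, and in practice the matching is done on Chinese text):

```python
# Illustrative sketch of the 3-point social-decorum Likert rule.
# FRIENDLY_KEYWORDS is a hypothetical stand-in lexicon.
FRIENDLY_KEYWORDS = ["please", "thank you", "glad", "sorry", "welcome"]

def score_social_decorum(answer: str, keywords=FRIENDLY_KEYWORDS) -> int:
    """Count matched friendly keywords and map to the 0/1/2 scale."""
    matches = sum(1 for kw in keywords if kw in answer.lower())
    if matches == 0:
        return 0  # rude: no friendly keywords matched
    if matches <= 2:
        return 1  # general: 1-2 keywords matched
    return 2      # graceful: 3 or more keywords matched

assert score_social_decorum("Take the tablet twice daily.") == 0
assert score_social_decorum("Please rest; glad to help.") == 1
assert score_social_decorum("Please rest and thank you for asking; "
                            "glad to help, you're welcome.") == 2
```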
3.1.3 Contextual Capabilities
Three indicators were used to assess the contextual capabilities (Wei et al., 2023; Holmes et al., 2023), applied only to the multiple-turn dialogue dataset, as follows: 1) Repeated answer, meaning that no duplicate answers should appear in the responses generated by the LLMs; 2) Anaphora matching, which involves correctly identifying and matching the abbreviations or aliases of medical professional terms used in the dialogue; 3) Key information, where the LLMs recognize and include all relevant information from the question in the response, particularly details repeated 2 or more times across the questions. The content performance criteria used for scoring are outlined in Table 2.
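The key information indicator, for instance, amounts to a coverage check: any detail mentioned in 2 or more question turns must reappear in the response. A minimal sketch, where the phrase extractor is an assumed component (in practice a medical term tagger; the naive word-level extractor below is only for demonstration):

```python
from collections import Counter

def key_information_covered(question_turns, response, extract_phrases):
    """Return True if every phrase appearing in >= 2 question turns
    also appears in the model response (key-information indicator).
    `extract_phrases` is an assumed callable: turn -> list of phrases."""
    counts = Counter()
    for turn in question_turns:
        counts.update(set(extract_phrases(turn)))  # count once per turn
    repeated = [p for p, c in counts.items() if c >= 2]
    return all(p in response for p in repeated)

# Toy example with a naive word-level "extractor":
turns = ["headache since Monday", "the headache worsens at night"]
extractor = lambda t: t.split()
assert key_information_covered(turns, "Your headache may be tension-related.", extractor)
assert not key_information_covered(turns, "Take ibuprofen.", extractor)
```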
3.1.4 Computational Robustness
To evaluate the robustness of the LLMs, 5 extended datasets were created based on the first-round questions in the multiple-turn dialogue dataset described above. Specifically, the following strategies were employed to rephrase each original question and create 10 rephrased questions: 1) Rephrasing the question sentence while maintaining the semantics (Dataset-A); 2) Rephrasing the question sentence and changing the semantics (Dataset-B); 3) Rephrasing the question sentence by introducing punctuation errors (Dataset-C); 4) Rephrasing the question sentence by introducing grammatical errors (Dataset-D); 5) Rephrasing the question sentence by introducing spelling errors (Dataset-E). Datasets A-E were used to evaluate the robustness of the LLMs in different common scenarios, which could be classified into 3 anomaly categories. Specifically, Dataset-A was used for the adversarial success rate (ASR); Dataset-B, for the noise success rate (NSR); and Datasets C-E, for the input error success rate (IESR).
For each dataset, the original and rephrased questions were inputted into the LLMs, and 3 metrics were calculated from the LLMs’ answers as follows (Tan et al., 2023; West, 2023): 1) The semantic consistency rate $(R_ {1})$, the proportion of answers that maintain the same semantics when a rephrased question is inputted; 2) The semantically inconsistent but medically sound rate $(R_ {2})$, the proportion of answers whose semantics changed but remain medically sound; 3) The complete error rate $(R_ {3})$, the proportion of answers whose semantics changed and which contain a medical error.
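Given an outcome label per rephrased question, the three rates reduce to simple proportions. The label names and function below are our own illustrative convention, not the paper's:

```python
def robustness_rates(outcomes):
    """Compute (R1, R2, R3) from a list of outcome labels, one per
    rephrased question: 'consistent' (same semantics), 'sound'
    (semantics changed but medically sound), 'error' (medical error).
    Label names are illustrative assumptions."""
    n = len(outcomes)
    r1 = outcomes.count("consistent") / n  # semantic consistency rate
    r2 = outcomes.count("sound") / n       # inconsistent but sound rate
    r3 = outcomes.count("error") / n       # complete error rate
    return r1, r2, r3

r1, r2, r3 = robustness_rates(["consistent"] * 7 + ["sound"] * 2 + ["error"])
assert (r1, r2, r3) == (0.7, 0.2, 0.1)
assert abs(r1 + r2 + r3 - 1.0) < 1e-9  # the three rates partition all cases
```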
3.2 Medical Datasets in Chinese
Two medical datasets in Chinese were created: medical multiple-turn dialogues and case reports. The datasets include a total of 34 cases: 27 multiple-turn dialogues and 7 case reports. The datasets include medical scenarios, questions, suspected diagnoses given by LLMs, disease types, and classifications of the medical questions. The medical questions were sorted into 6 categories: clinical manifestations, treatment, ancillary tests, lifestyle habits, etiology, and prognosis. Most questions focused on patients’ self-reported symptoms and their respective treatments. The datasets contain 14 types of diseases: systemic diseases, digestive system diseases, brain diseases, heart diseases, bone diseases, chest diseases, vascular diseases, eye diseases, uterine diseases, urinary system diseases, nasopharyngeal diseases, oral diseases, skin diseases, and accidental injuries. Some specific common diseases featured in the datasets are metabolic diseases like diabetes mellitus, gastrointestinal diseases such as gastritis and hyperacidity, and critical diseases like Parkinson’s disease and heart failure.
3.3 Benchmarks Based on ChatGPT, ERNIE Bot, and Dr. PJ
3.3.1 Analysis of Results in 2 Medical Scenarios
As shown in Table 3, 3 assessment aspects were covered in the multiple-turn dialogue evaluation: medical professional capabilities, social comprehensive capabilities, and contextual capabilities. Table 3 shows the total scores of each assessment and the scores of specific indicators. Dr. PJ outperformed ChatGPT and ERNIE Bot, with total scores of 13.95, 13.41, and 12.56, respectively. ChatGPT achieved a slightly higher score of 6.30 in medical professional capabilities, compared to 6.25 for Dr. PJ and 5.63 for ERNIE Bot. Although ChatGPT performed better in the assessment of medical professional capabilities, Dr. PJ had a higher score for accuracy, meaning that its answers were harmless and that Dr. PJ performed better in the evaluation of safety. As for social comprehensive capabilities, ChatGPT, ERNIE Bot, and Dr. PJ achieved scores of 4.26, 4.33, and 4.70, respectively. Dr. PJ achieved a score of 3.00 for contextual capabilities, while ChatGPT and ERNIE Bot achieved scores of 2.85 and 2.59, respectively.
As shown in Table 4, 2 assessment aspects were covered in the case report evaluation: medical professional capabilities and social comprehensive capabilities. Dr. PJ outperformed ChatGPT and ERNIE Bot, with total scores of 10.14, 8.71, and 8.00, respectively. As for medical professional capabilities, Dr. PJ achieved 6.86, higher than ChatGPT (6.43) and ERNIE Bot (5.71). Similarly, Dr. PJ had the highest score (1.00) for accuracy in the evaluation of medical professional capabilities. In addition, Dr. PJ had the same scores as ChatGPT regarding informativeness and expansiveness. As for social comprehensive capabilities, the scores for Dr. PJ, ChatGPT, and ERNIE Bot were 3.29, 2.29, and 2.29, respectively. Specific scores for each indicator can be found in Table 4.
3.3.2 Comprehensive Review of Detailed Case Reports
The comments on 2 case reports by 5 medical experts are shown in Figure 1. Overall, all 3 LLMs performed well in correctly understanding patients’ questions: they comprehended the questions asked by patients and responded with logical answers. However, Dr. PJ outperformed the others in terms of social capabilities. Additionally, Dr. PJ answered the questions in an orderly manner, with clear and intuitive serial numbers listed.
3.3.3 Computational Robustness Performance
The results in Table 5 show that Dr. PJ outperformed ChatGPT and ERNIE Bot in the semantic consistency rate, with higher ASR, NSR, and IESR. This indicates that Dr. PJ was the best at maintaining the same semantics in its answers when questions were paraphrased. Furthermore, in the complete error rate category, both Dr. PJ and ERNIE Bot had lower error rates than ChatGPT, suggesting that when a paraphrased question did change the semantics of their answers, they were less likely to introduce medical errors. Dr. PJ thus also had a low probability of medical errors.
4 Discussion
In this study, we introduced a set of comprehensive evaluation criteria for assessing LLMs’ performances in medical contexts, considering aspects such as medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness. We compared ChatGPT and ERNIE Bot with Dr. PJ in 2 medical scenarios: multiple-turn dialogues and case reports. Experimental results show that Dr. PJ outperformed ChatGPT and ERNIE Bot in handling various forms of the same question in these 2 scenarios.
Recently, LLMs have achieved rapid advancements and demonstrated technical potential. However, only a few question-and-answer evaluation methods have been developed, mostly for nonmedical fields or accuracy alone. Liu et al. (2023b) presented a research summary for ChatGPT/GPT-4 suggesting that there are several evaluation aspects to consider, such as engineering performance, scenario, user feedback, and negative impacts. Similarly, West et al. evaluated the accuracy of ChatGPT-3.5 and ChatGPT-4 in answering conceptual physics questions by assessing correctness, confidence, error type, and stability (West, 2023). Further, Tan et al. compared responses from 6 English and 2 multilingual datasets, totaling 190,000 cases, and discovered that ChatGPT outperformed similar models in most results but struggled with questions requiring numerical or time-based answers. However, that team’s evaluation metrics, such as the minimal functionality test (MFT), invariance test (INV), and directional expectation test (DIR) (Tan et al., 2023), are primarily focused on model performance and stability. Unlike general question-answering domains, medical datasets require a more comprehensive evaluation approach. It is essential not only to focus on the LLMs’ performances but also to consider, from a medical professional’s perspective, the physical and psychological state of the questioner, as well as potential patients seeking medical assistance. As a result, we propose content evaluation criteria covering both medical and social capabilities. Simultaneously, in a recent publication comparing physicians’ vs LLMs’ responses to patient questions, the researchers assessed the quality of information and empathy of the responses on a 5-point scale (Ayers et al., 2023). Moreover, a recent study on radiation oncology physics showed that GPT-4 performed better in answering highly specialized radiation oncology physics questions after labeling.
However, in some cases human expertise still won out, suggesting the importance of the diversity of expertise and contextual inference capabilities (Holmes et al., 2023). Similarly, contextual capabilities are incorporated as a crucial component to evaluate the LLMs’ contextual inference professionally and objectively. We believe that the comprehensiveness of Chinese datasets is equally important. For example, our newly proposed medical datasets in Chinese include common and critical diseases from 14 different clinical departments. Furthermore, our open-source datasets can facilitate a fairer evaluation process and expedite the global assessment and advancement of LLMs applied to medical datasets in Chinese.
Table 2: Summary of evaluation aspects, indicators, criteria, and datasets.
| Evaluation aspects | Datasets | Indicators | Evaluation criteria | Score |
|---|---|---|---|---|
| Medical professional capabilities | Both | Accuracy * | No medical knowledge errors are present in the answer | 1 |
| | | Informativeness | Comprehensive: answers include additional information beyond expectations | 2 |
| | | Expansiveness | Answers include content from aspects other than the medical knowledge included in the question | 1 |
| | | Logic * | On-topic: the answers address the topic directly and positively | 2 |
| | | Prohibitiveness | The model can correctly identify medical or prohibited terms | 1 |
| | | Sensitivity | There is no political sensitivity expressed in the answers | 1 |
| Social comprehensive capabilities | Both | Comprehension | Understandable: the answers are intuitive and easy to understand | 1 |
| | | Tone | The answers use correct modal particles and symbols | 1 |
| | | Empathy | The answers can accurately empathize with the patient | 1 |
| | | Social decorum | Appropriate: matching 3 or more keywords | 2 |
| Contextual capabilities | Multiple-turn | Repeated answer | The model has no duplicate answers | 1 |
| | | Anaphora matching | The model can identify medical professional abbreviations and aliases | 1 |
| | | Key information | The model can identify key information that appears 2 or more times | 1 |

* Highest priority. If a starred item scores 0, the remaining medical professional capability items are not evaluated.
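The rubric in Table 2 is mechanical enough to express directly. The sketch below encodes the per-indicator maximum scores and the starred-item priority rule; the dictionary layout and the `score_response` helper are illustrative assumptions, not the authors' released code.

```python
# Maximum scores per indicator, taken from Table 2.
MAX_SCORES = {
    "medical": {  # medical professional capabilities (max 8)
        "accuracy": 1, "informativeness": 2, "expansiveness": 1,
        "logic": 2, "prohibitiveness": 1, "sensitivity": 1,
    },
    "social": {  # social comprehensive capabilities (max 5)
        "comprehension": 1, "tone": 1, "empathy": 1, "social_decorum": 2,
    },
    "contextual": {  # scored on multiple-turn dialogues only (max 3)
        "repeated_answer": 1, "anaphora_matching": 1, "key_information": 1,
    },
}

# Starred indicators have the highest priority: if either scores 0, the
# remaining medical professional items are not evaluated.
PRIORITY = ("accuracy", "logic")

def score_response(ratings: dict, multi_turn: bool = False) -> int:
    """Sum expert ratings, applying the priority rule for medical items."""
    total = 0
    medical = MAX_SCORES["medical"]
    if any(ratings.get(k, 0) == 0 for k in PRIORITY):
        # Priority rule: only the starred items themselves contribute;
        # the other medical items are skipped.
        total += sum(min(ratings.get(k, 0), medical[k]) for k in PRIORITY)
    else:
        total += sum(min(ratings.get(k, 0), v) for k, v in medical.items())
    total += sum(min(ratings.get(k, 0), v)
                 for k, v in MAX_SCORES["social"].items())
    if multi_turn:
        total += sum(min(ratings.get(k, 0), v)
                     for k, v in MAX_SCORES["contextual"].items())
    return total
```

Under this reading, a multiple-turn dialogue can score at most 16 points (8 medical, 5 social, 3 contextual), while a single-turn case tops out at 13.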
Table 3: The content performances of chatbots in medical scenarios on multiple-turn dialogues.
| Evaluation indicators | ChatGPT | ERNIE Bot | Dr. PJ |
|---|---|---|---|
| Total score | 13.41 | 12.56 | 13.95 |
| Medical professional capabilities | 6.30 | 5.63 | 6.25 |
| Accuracy | 0.91 | 0.79 | 0.94 |
| Informativeness | 1.40 | 1.22 | 1.31 |
| Expansiveness | 0.19 | 0.12 | 0.17 |
| Logic | 1.81 | 1.50 | 1.84 |
| Prohibitiveness | 1.00 | 1.00 | 1.00 |
| Sensitivity | 1.00 | 1.00 | 1.00 |
| Social comprehensive capabilities | 4.26 | 4.33 | 4.70 |
| Comprehension | 0.96 | 0.96 | 0.96 |
| Tone | 0.96 | 1.00 | 1.00 |
| Empathy | 0.70 | 0.70 | 0.85 |
| Social decorum | 1.63 | 1.67 | 1.89 |
| Contextual capabilities | 2.85 | 2.59 | 3.00 |
| Repeated answer | 0.96 | 0.81 | 1.00 |
| Anaphora matching | 0.96 | 0.85 | 1.00 |
| Key information | 0.93 | 0.93 | 1.00 |
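The per-indicator values in Tables 3 and 4 are mean scores over the evaluated cases. A minimal aggregation sketch, assuming each rated case yields a `{indicator: score}` dictionary (the input format is our assumption, not the authors' pipeline):

```python
from collections import defaultdict
from statistics import mean

def aggregate(case_scores):
    """Average per-indicator scores across rated cases, producing the
    kind of summary rows shown in Tables 3 and 4."""
    by_indicator = defaultdict(list)
    for scores in case_scores:
        for indicator, value in scores.items():
            by_indicator[indicator].append(value)
    # Two decimal places, matching the tables' precision.
    return {ind: round(mean(vals), 2) for ind, vals in by_indicator.items()}
```

For example, three cases rated `{"accuracy": 1, "logic": 2}`, `{"accuracy": 1, "logic": 1}`, and `{"accuracy": 1, "logic": 2}` aggregate to an accuracy of 1.00 and a logic score of 1.67.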
Table 4: The content performances of chatbots in medical scenarios with the case report.
| Evaluation indicators | ChatGPT | ERNIE Bot | Dr. PJ |
|---|---|---|---|
| Total score | 8.71 | 8.00 | 10.14 |
| Medical professional capabilities | 6.43 | 5.71 | 6.86 |
| Accuracy | 0.86 | 0.71 | 1.00 |
| Informativeness | 1.43 | 1.14 | 1.43 |
| Expansiveness | 0.43 | 0.43 | 0.43 |
| Logic | 1.71 | 1.43 | 2.00 |
| Prohibitiveness | 1.00 | 1.00 | 1.00 |
| Sensitivity | 1.00 | 1.00 | 1.00 |
| Social comprehensive capabilities | 2.29 | 2.29 | 3.29 |
| Comprehension | 1.00 | 1.00 | 1.00 |
| Tone | 0.29 | 0.14 | 0.71 |
| Empathy | 0.00 | 0.14 | 0.29 |
| Social decorum | 1.00 | 1.00 | 1.29 |
Table 5: The robustness of 3 chatbots for the medical consultation detailed answer task. Values are expressed as percentages (%).
| Chatbots | Anomaly category | Datasets | R1 | R2 | R3 |
|---|---|---|---|---|---|
| ChatGPT | ASR | Dataset-A | 15 | 65 | 20 |
| | NSR | Dataset-B | 15 | 55 | 30 |
| | IESR | Dataset-C | 0 | 100 | 0 |
| | | Dataset-D | 30 | 40 | 30 |
| | | Dataset-E | 20 | 80 | 0 |
| ERNIE Bot | ASR | Dataset-A | 10 | 85 | 5 |
| | NSR | Dataset-B | 0 | 100 | 0 |
| | IESR | Dataset-C | 0 | 100 | 0 |
| | | Dataset-D | 20 | 80 | 0 |
| | | Dataset-E | 20 | 80 | 0 |
| Dr. PJ | ASR | Dataset-A | 15 | 80 | 5 |
| | NSR | Dataset-B | 35 | 65 | 0 |
| | IESR | Dataset-C | 60 | 40 | 0 |
| | | Dataset-D | 50 | 40 | 10 |
| | | Dataset-E | 80 | 20 | 0 |
Abbreviations: ASR, adversarial success rate; NSR, noise success rate; IESR, input error success rate; R1, rate of semantically consistent responses; R2, rate of semantically inconsistent but medically sound responses; R3, rate of completely erroneous responses.
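Each row of Table 5 is the share of perturbed cases whose response received a given rater label, with R1, R2, and R3 summing to 100% per dataset. A small sketch of that tally (the `robustness_rates` helper is hypothetical, not from the paper's scripts):

```python
from collections import Counter

# R1: semantically consistent; R2: semantically inconsistent but
# medically sound; R3: completely erroneous. Labels are assigned by
# expert raters per perturbed case.

def robustness_rates(labels):
    """Turn a list of per-case labels ('R1'/'R2'/'R3') into the
    percentage columns of Table 5."""
    counts = Counter(labels)
    n = len(labels)
    return {r: 100 * counts.get(r, 0) / n for r in ("R1", "R2", "R3")}
```

For instance, 3 R1, 13 R2, and 4 R3 labels over 20 perturbed cases would reproduce ChatGPT's Dataset-A row (15 / 65 / 20).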
Many current models are data-hungry and necessitate labor-intensive labeling (Ghassemi et al., 2020). The advent of medical knowledge graphs and foundation models, which enable training without labeled data or manually encoded professional medical knowledge, has driven the application of AI throughout the clinical workflow, including triage, diagnosis, and clinical management (Levine et al., 2023; Korngiebel and Mooney, 2021; Rao et al., 2023). Inspired by these advancements, we developed Dr. PJ, an LLM trained on massive medical datasets in Chinese. Given the highly specialized nature of medical care, training LLMs in this field requires strict supervision to ensure medical professionalism. At the same time, humanistic care, a fundamental aspect of doctor-patient communication, is crucial for human-computer interaction (Verghese et al., 2018). Unlike ChatGPT and ERNIE Bot, which are general AI models pretrained on general internet data, Dr. PJ was built for medical applications and trained on medical texts. On multiple-turn dialogues, our model achieved the highest total score. The results show that ChatGPT's slightly higher medical expertise score stemmed from informativeness and expansiveness, while our model achieved better accuracy and medical safety. Additionally, we evaluated the robustness of the models by changing the input method or the order of words. In the real world, patients may describe their symptoms in different ways or misremember disease or drug names, and word order can also affect natural language understanding (Pham et al., 2021). It is therefore important to measure how robustly medical models handle varied inputs. Dr. PJ had higher semantic consistency and a lower complete error rate than ChatGPT, indicating better robustness. Although OpenAI reports that ChatGPT performs well in translation, it does not perform stably under different modes of questioning.
This indicates that the language barrier in foundation models is an important factor to consider.
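Perturbations of the kind described above, i.e. reordering words and corrupting a character of a misremembered disease or drug name, can be sketched as follows. Both helper names are illustrative assumptions, not part of the paper's tooling:

```python
import random

def shuffle_word_order(tokens, seed=0):
    """Perturb word order while keeping the same tokens, probing whether
    the model's answer stays semantically consistent (cf. Pham et al.)."""
    rng = random.Random(seed)  # seeded for reproducible test sets
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled

def inject_typo(text, position, wrong_char):
    """Simulate an input error, e.g. one wrong character in a drug name."""
    return text[:position] + wrong_char + text[position + 1:]
```

Running the original and perturbed queries through the same model and comparing the rater labels yields the R1/R2/R3 rates of Table 5.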
However, limitations remain in both the evaluation system and LLM development. First, the evaluation criteria rely primarily on subjective scoring by a group of medical professionals. Although this approach aligns with the principles of the medical domain, it can bias the results, and human scoring is time- and labor-intensive. To improve evaluation efficiency and reduce bias, future work should incorporate automated model evaluation. Moreover, the scale of the medical evaluation datasets is still limited, so we encourage research collaborations to expand the current dataset with more Chinese medical data and construct a more comprehensive evaluation set. In addition, foundation models with more parameters have the potential to yield better accuracy, so scaling up model size may also enhance performance. Finally, note that different prompts may affect model output (Liu et al., 2023a); therefore, different prompting strategies should be evaluated to select those suitable for medical scenarios.
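One standard way to quantify the subjectivity of human scoring, before moving to automated evaluation, is a chance-corrected agreement statistic such as Cohen's kappa between two raters. This generic sketch is not part of MedGPTEval:

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items;
    a low kappa would flag the subjective-scoring bias noted above."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under chance, from each rater's label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    if expected == 1:
        return 1.0  # both raters use a single identical label
    return (observed - expected) / (1 - expected)
```

Reporting such an agreement statistic alongside the mean scores would make the human-evaluation bias measurable rather than merely acknowledged.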
5 Conclusion
This work proposed an assessment system composed of a set of evaluation criteria, open-source medical datasets in Chinese, and a benchmark of 3 chatbots. Medical experts evaluated the LLMs and found that all 3 chatbots (ChatGPT, ERNIE Bot, and Dr. PJ) could understand patients' questions and provide logical answers. Through a comparison using the proposed evaluation criteria, we found that Dr. PJ outperformed the other 2 models, offering more accurate medical knowledge and greater humanistic care. Overall, the results underscore the need for continuous research and development of LLMs to ensure their safe and effective use in medical scenarios.
References
Brent A Anders. 2023. Why ChatGPT is such a big deal for education. C2C Digital Magazine, 1(18):4.

John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kelley, Dennis J. Faix, Aaron M. Goodman, Christopher A.
Longhurst, Michael Hogarth, and Davey M. Smith. 2023. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine.
Michael Balas and Edsel B. Ing. 2023. Conversational AI models for ophthalmic diagnosis: Comparison of ChatGPT and the Isabel Pro differential diagnosis generator. JFO Open Ophthalmology, 1:100005.
Ana Paula Chaves and Marco Aurelio Gerosa. 2021. How should my chatbot interact? a survey on social characteristics in human–chatbot interaction design. International Journal of Human–Computer Interaction, 37(8):729–758.
Nadine Côrte-Real, Pedro Ruivo, Tiago Oliveira, and Ales Popovic. 2019. Unlocking the drivers of big data analytics value in firms. Journal of Business Research, 97:160–173.
Marzyeh Ghassemi, Tristan Naumann, Peter Schulam, Andrew L Beam, Irene Y Chen, and Rajesh Ranganath. 2020. A review of challenges and opportunities in machine learning for health. AMIA Summits on Translational Science Proceedings, 2020:191.
Thilo Hagendorff, Sarah Fabi, and Michal Kosinski. 2022. Machine intuition: Uncovering human-like intuitive decision-making in GPT-3.5.
Jason Holmes, Zhengliang Liu, Lian Zhang, Yuzhen Ding, Terence T. Sio, Lisa A. McGee, Jonathan B. Ashman, Xiang Li, Tianming Liu, Jiajian Shen, and Wei Liu. 2023. Evaluating large language models on a highly-specialized topic, radiation oncology physics.
Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch your heart: A tone-aware chatbot for customer care on social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, page 1–12, New York, NY, USA. Association for Computing Machinery.
Douglas Johnson, Rachel Goodman, J Patrinely, Cosby Stone, Eli Zimmerman, Rebecca Donald, Sam Chang, Sean Berkowitz, Avni Finn, Eiman Jahangir, et al. 2023. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model.
Michael R King. 2023. The future of AI in medicine: a perspective from a chatbot. Annals of Biomedical Engineering, 51(2):291–295.
Diane M Korngiebel and Sean D Mooney. 2021. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digital Medicine, 4(1):93.
Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, and Victor Tseng. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2):1–12.
Peter Lee, Sebastien Bubeck, and Joseph Petro. 2023. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239. PMID: 36988602.
David M Levine, Rudraksh Tuwani, Benjamin Kompa, Amita Varma, Samuel G. Finlayson, Ateev Mehrotra, and Andrew Beam. 2023. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv.
Hongru Liang and Huaqing Li. 2021. Towards standard criteria for human evaluation of chatbots: A survey.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. Pretrain, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9).
Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Dajiang Zhu, Xiang Li, Ning Qiang, Dinggang Shen, Tianming Liu, and Bao Ge. 2023b. Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models.
Thang M. Pham, Trung Bui, Long Mai, and Anh Nguyen. 2021. Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks?
Arya S Rao, Michael Pang, John Kim, Meghana Kamineni, Winston Lie, Anoop K Prasad, Adam Landman, Keith Dryer, and Marc D Succi. 2023. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv, pages 2023–02.
J Schulman, B Zoph, C Kim, J Hilton, J Menick, J Weng, JFC Uribe, L Fedus, L Metz, M Pokorny, et al. 2022. ChatGPT: Optimizing language models for dialogue.
Emre Sezgin, Joseph Sirrianni, and Simon L Linwood. 2022. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the US health care system: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a service model. JMIR Med Inform, 10(2):e32875.
Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, and Guilin Qi. 2023. Evaluation of ChatGPT as a question answering system for answering complex questions.
