Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations
Zhichao Yang, MSc; Zonghai Yao, MSc; Mahbuba Tasmin, BSc; Parth Vashisht, BSc1; Won Seok Jang, RN, MSc; Feiyun Ouyang, PhD2; Beining Wang, BSc; Dan Berlowitz, MD, MPH4,5; Hong Yu, PhD1,2,5,6
Author Affiliations:
*These authors contributed equally: Zhichao Yang, Zonghai Yao
Corresponding Author Information:
Main Figures: 2; Tables: 3
Keywords: Artificial Intelligence, Large Language Model, ChatGPT, Multimodality, GPT-4V, USMLE, Medical License Exam, Clinical Decision Support
1-2 sentence description:
In this study the authors show that GPT-4V, a large multimodal chatbot, achieved accuracy on medical licensing exam questions with images equivalent to the 70th to 80th percentile of AMBOSS medical students. The authors also identify issues with GPT-4V, including uneven performance across clinical subdomains and inconsistent explanation quality, which may hamper its clinical use.
Abstract
Background: Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy required for use in clinical decision making. The power of AI in large language model (LLM)-related technologies may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images.
Methods: We used three sets of multiple-choice questions with images from the United States Medical Licensing Examination (USMLE), a USMLE question bank for medical students with questions of varying difficulty (AMBOSS), and the Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two state-of-the-art LLMs, GPT-4 and ChatGPT. We also assessed the preferences and feedback of healthcare professionals on GPT-4V’s explanations. Finally, we present a case scenario showing how GPT-4V can be used for clinical decision support.
Results: GPT-4V outperformed ChatGPT (58.4%) and GPT-4 (83.6%), passing the full USMLE exam with an overall accuracy of 90.7%. In comparison, the passing threshold was 60% for medical students. For questions with images, GPT-4V achieved performance equivalent to the 70th to 80th percentile of AMBOSS medical students, with accuracies of 86.2%, 73.1%, and 62.0% on USMLE, DRQCE, and AMBOSS, respectively. While medical students’ accuracies decreased quickly as question difficulty increased, GPT-4V’s performance remained relatively stable. On the other hand, GPT-4V’s performance varied across medical subdomains, with the highest accuracy in immunology (100%) and otolaryngology (100%) and the lowest accuracy in anatomy (25%) and emergency medicine (25%). When GPT-4V answered correctly, its explanations were almost as good as those written by domain experts. However, when GPT-4V answered incorrectly, the quality of the generated explanations was poor: 18.2% of wrong answers contained made-up text, 45.5% contained inferencing errors, and 76.3% contained image misunderstandings. Our results show that after experts gave GPT-4V a short hint about the image, it reduced errors by 40.5% on average, with larger gains on more difficult questions. Therefore, a hypothetical clinical decision support system, as shown in our case scenario, is a human-AI-in-the-loop system in which a clinician can interact with GPT-4V through hints to maximize its clinical use.
Conclusion: GPT-4V outperformed other LLMs and typical medical student performance on medical licensing examination questions with images. However, uneven subdomain performance and inconsistent explanation quality may restrict its practical application in clinical settings. The observation that physicians’ hints significantly improved GPT-4V's performance suggests that future research could focus on developing more effective human-AI collaborative systems. Such systems could potentially overcome current limitations and make GPT-4V more suitable for clinical use.
Introduction
Using computers to help make clinical diagnoses and guide treatments has been a goal of artificial intelligence (AI) since its inception. The adoption of electronic health record (EHR) systems by hospitals in the US has resulted in an unprecedented amount of digital data associated with patient encounters. Computer-assisted clinical diagnostic support systems (CDSS) endeavor to enhance clinicians' decisions with patient information and clinical knowledge.2 There is burgeoning interest in CDSS for enhanced imaging3, often termed radiomics, in various disciplines such as breast cancer detection, Covid detection, diagnosing congenital cataracts, and hidden fracture location. For a decision to be trustworthy for clinicians, CDSS should not only make the prediction but also provide accurate explanations.8–10 However, most previous imaging CDSS only highlights areas deemed significant by AI,11–14 providing limited insight into the explanation of the diagnosis.
Recent advances in large language models (LLMs) have sparked much discussion in healthcare. State-of-the-art LLMs include Chat Generative Pre-trained Transformer (ChatGPT), a chatbot released by OpenAI in November 2022, and its successor Generative Pre-trained Transformer 4 (GPT-4), released in March 2023. The success of ChatGPT and GPT-4 is attributed to their conversational prowess and their performance, which has approached or matched human-level competence in cognitive tasks spanning various domains, including medicine. Both ChatGPT and GPT-4 have achieved commendable results on the United States Medical Licensing Examination, leading to discussions about the readiness of LLM applications for integration into clinical17–19 and educational20–22 environments.
One limitation of ChatGPT and GPT-4 is that they can only read and generate text and are unable to process other data modalities, such as images. This limitation, known as "single-modality," is a common issue among many LLMs.23,24 Advancements in multimodal LLMs promise enhanced capabilities and integration with diverse data sources.25–27 OpenAI's GPT-4V is a state-of-the-art multimodal LLM equipped with visual processing and understanding ability. By incorporating GPT-4V into current imaging CDSS, physicians could ask open-ended questions pertaining to a patient’s medical evaluation, taking into account all available information including images, symptoms, and lab results, allowing for an interactive experience in which the AI suggests both a decision and an explanation to support physicians.
However, the ability of GPT-4V to analyze medical images remains unknown. For GPT-4V to be useful to medical professionals, it should provide not only correct responses but also the reasoning behind them. In this work, we assess GPT-4V's performance on medical licensing examination questions with images. We also analyze the explanations it generates for its answers to the examination questions.
Method
This cross-sectional study compared the performance of GPT-4V, GPT-4, and ChatGPT in answering medical licensing examination questions. This study also investigated the quality of GPT-4V's explanations when answering these questions. The study protocol was deemed exempt by the Institutional Review Board at the VA Bedford Healthcare System, and informed consent was waived due to minimal risk to patients. This study was conducted in October 2023.
Medical Exams and a Patient Case Report Collection
We obtained study questions from three sources. The United States Medical Licensing Examination (USMLE) consists of three steps required to obtain a medical license in the United States. The USMLE assesses a physician's ability to apply knowledge, concepts, and principles, which is critical to both health and disease management and is the foundation for safe, efficient patient care. The Step 1, Step 2 Clinical Knowledge (CK), and Step 3 sections of the USMLE sample exam released by the National Board of Medical Examiners (NBME) consist of 119, 120, and 137 questions, respectively. Each question contained multiple options to choose from. We then selected all questions with images, resulting in 19, 13, and 18 questions from Step 1, Step 2 CK, and Step 3, respectively. Medical subdomains include, but are not limited to, radiology, dermatology, orthopedics, ophthalmology, cardiology, and general surgery.
The sample exam included only a limited number of questions with images. Thus, we further collected similar questions from AMBOSS, a widely used question bank for medical students, which reports exam performance statistics from its users. The performance of past AMBOSS students enabled us to assess the comparative effectiveness of the model. For each question, AMBOSS provides an expert-written hint to help the student answer the question and a difficulty level ranging from 1 to 5. Levels 1, 2, 3, 4, and 5 represent the easiest 20%, 20–50%, 50–80%, 80–95%, and 95–100% of questions, respectively. Since AMBOSS is proprietary, we randomly selected and manually downloaded 10 questions from each of the 5 difficulty levels, and we repeated this process for Step 1, Step 2 CK, and Step 3. This resulted in a total of 150 questions.
In addition, we collected questions from the Diagnostic Radiology Qualifying Core Exam (DRQCE), an image-rich exam offered after 36 months of residency training that evaluates a candidate’s core fund of knowledge and clinical judgment across the practice domains of diagnostic radiology. Since DRQCE is proprietary, we randomly selected and manually downloaded 26 questions with images from the preparation exam offered by the American Board of Radiology (ABR). In total, we had 226 questions with images from the three aforementioned sources.
To illustrate GPT-4V’s potential as an imaging diagnostic support tool, we modified a patient case report30 to resemble a typical “curbside consult” question between medical professionals.31
Answering Image Questions with GPT-4V Prompts
GPT-4V took image and text data as inputs to generate textual outputs. Given that the input format (prompt) plays a key role in optimizing model performance, we followed the standard prompting guidelines of the visual question-answering task. Specifically, we prompted GPT-4V by first adding the image, then appending the context (i.e., patient information) and question, and finally providing the multiple-choice options, each separated by a new line. An example user prompt and GPT-4V response are shown in Figure 1. When multiple sub-images existed in the image, we uploaded each sub-image to GPT-4V. When a hint was provided, we appended it to the end of the question. The response consists of the selected option as an answer, supported by a textual explanation to substantiate the selected decision. When using the ChatGPT and GPT-4 models, which cannot handle image data, images were omitted from the prompt. Responses were collected from the September 25, 2023 version of the models. Each question was entered manually and independently into the ChatGPT website (in a new chat window).
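For illustration, the prompt layout described above can also be assembled programmatically. The sketch below is a minimal Python example under stated assumptions: the helper name and question fields are hypothetical, and in this study questions were actually entered manually through the ChatGPT website.

```python
# Minimal sketch of the prompt layout described above (hypothetical helper;
# the study entered questions manually via the ChatGPT website).

def build_prompt(context: str, question: str, options: list[str],
                 hint: str | None = None) -> str:
    """Assemble the text part of a GPT-4V prompt: the image is uploaded first,
    then the context and question follow, then one option per line; an optional
    expert hint is appended to the end of the question."""
    question_text = f"{question} Hint: {hint}" if hint else question
    return "\n".join([context, question_text, *options])

# Hypothetical multiple-choice item for demonstration.
print(build_prompt(
    context="A 58-year-old woman presents with acute dyspnea and pleuritic chest pain.",
    question="Which of the following is the most likely diagnosis?",
    options=["A. Bacterial pneumonia", "B. Pulmonary embolism", "C. Pneumothorax"],
))
```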
Evaluation Metrics
For answer accuracy, we evaluated the model’s performance by comparing the model’s choice with the correct choice provided by the exam board or question bank website. We defined accuracy as the ratio of the number of correct choices to the total number of questions.
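As a minimal sketch, the accuracy metric defined above can be computed as follows; the model choices and answer key shown are hypothetical placeholders, not study data.

```python
# Minimal sketch of the accuracy metric: correct choices / total questions.
# The model choices and answer key below are hypothetical placeholders.

def accuracy(model_choices: list[str], answer_key: list[str]) -> float:
    correct = sum(m == a for m, a in zip(model_choices, answer_key))
    return correct / len(answer_key)

print(accuracy(["B", "C", "A", "D"], ["B", "C", "E", "D"]))  # 0.75
```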
We also evaluated the quality of the explanations through the preferences of 3 healthcare professionals (one medical doctor, one registered nurse, and one medical student). For each question from the AMBOSS dataset (n=150), we asked the healthcare professionals to choose their preference between the explanation by GPT-4V, the explanation by an expert, or a tie.
Additionally, we asked the healthcare professionals to evaluate GPT-4V's explanations from the perspectives of sufficiency and comprehensiveness.32,33 They determined whether the following information existed in the explanation:
- Image interpretation: GPT-4V tried to interpret the image in the explanation, and such interpretation was sufficient to support its choice.
- Question information: The explanation contained information related to the textual context (i.e., patient information) of the question, and such information was essential for GPT-4V's choice.
- Comprehensive explanation: The explanation included comprehensive reasoning for all possible evidence (e.g., symptoms, lab results) that led to the final answer.
Finally, for each question answered incorrectly, we asked healthcare professionals to check if the explanation contained any of the following errors:
- Image misunderstanding: if the sentence in the explanation showed an incorrect interpretation of the image. Example: GPT-4V said that a bone in the image was from the hand, but it was in fact from the foot.
- Text hallucination: if the sentence in the explanation contained made-up information. Example: Claiming Saxenda was insulin.
- Reasoning error: if the sentence did not properly infer knowledge from either the image or the text to reach an answer. Example: GPT-4V reasoned that a patient had taken a trip within the last 3 months and therefore diagnosed the patient with Chagas disease, despite the clinical knowledge that Chagas disease usually develops 10–20 years after infection.
- Non-medical error: if the sentence contained an error unrelated to medical knowledge. For example, GPT is known to struggle with tasks requiring precise spatial localization, such as identifying chess positions on the board.28
Statistical Analysis
GPT-4V’s accuracies on the AMBOSS dataset were compared across difficulty levels using unpaired chi-square tests with a significance level of 0.05. All analyses were conducted in Python (version 3.10.11).
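As a rough sketch of this comparison (not the study's actual analysis script), an unpaired chi-square test on correct/incorrect counts for two difficulty levels could be run with scipy; the counts below are hypothetical placeholders.

```python
# Sketch of an unpaired chi-square comparison between two difficulty levels;
# correct/incorrect counts are hypothetical placeholders, not study data.
from scipy.stats import chi2_contingency

table = [[23, 7],   # difficulty level 1: (correct, incorrect) out of 30 questions
         [16, 14]]  # difficulty level 5: (correct, incorrect) out of 30 questions

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")  # difference is significant if p < 0.05
```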
Results
Overall Answer Accuracy
For all questions in the USMLE sample exam (including those without images), GPT-4V achieved accuracies of 88.2%, 90.8%, and 92.7% on Step 1, Step 2 CK, and Step 3 questions, respectively, outperforming ChatGPT and GPT-4 by 33.1% and 6.7% in Step 1, 31.7% and 10.0% in Step 2 CK, and 31.8% and 4.4% in Step 3 (Table 1). GPT-4V's score exceeds the passing standard for the USMLE (about 60%). The performance of GPT-4V across different subdomains is shown in Supplementary Table 1.
For questions with images, GPT-4V achieved accuracies of 84.2%, 85.7%, and 88.9% on Step 1, Step 2 CK, and Step 3 questions, respectively, outperforming ChatGPT and GPT-4 by 42.1% and 21.1% in Step 1, 35.7% and 21.4% in Step 2 CK, and 38.9% and 22.2% in Step 3 (Table 1). Similarly, GPT-4V achieved an accuracy of 73.1% on DRQCE, outperforming ChatGPT (19.2%) and GPT-4 (26.9%).
Impact of Difficulty Level and Use of Hints
When asked questions without the hint, GPT-4V achieved accuracies of 60%, 64%, and 66% on AMBOSS Step 1, Step 2 CK, and Step 3 (Table 2), placing it in the 72nd, 76th, and 80th percentile of AMBOSS users preparing for Step 1, Step 2 CK, and Step 3, respectively. When asked questions with the hint, it achieved accuracies of 84%, 86%, and 88% on AMBOSS Step 1, Step 2 CK, and Step 3. Supplementary Figure 1 shows an example in which GPT-4V switched its answer from incorrect to correct when the hint was provided.
Figure 2 shows a decreasing trend in GPT-4V’s performance on the AMBOSS dataset as question difficulty increased (P < 0.05) without the hint. However, with the hint, GPT-4V’s performance plateaued across the five difficulty levels. Importantly, the accuracy of GPT-4V, with or without the hint, generally exceeded that of medical students, and the gap between GPT-4V and medical students widened as difficulty increased.
As shown in Figure 2, for easy questions (difficulty level = 1), medical students achieved accuracies between 75% and 99%. GPT-4V with and without the hint performed at 90% and 77%, respectively. When the difficulty level increased to 2, the performance of medical students decreased to between 56% and 68%. In contrast, GPT-4V with and without the hint was more stable, performing at 87% and 77%, respectively. The performance of medical students continued to decrease linearly, to between 39% and 55%, when the difficulty level was 3. At difficulty levels 4 and 5, the performance of medical students was very poor, ranging from 27% to 37% and from 14% to 24%, respectively. In contrast, the performance of GPT-4V with the hint remained stable at 83% and 80%, respectively. The performance of GPT-4V without the hint decreased when the difficulty level increased from 2 to 3, but then remained stable at 57% and 53% for difficulty levels 4 and 5, respectively.
Quality of Explanation
We evaluated users’ preferences between GPT-4V-generated explanations and expert-generated explanations. When GPT-4V answered incorrectly, healthcare professionals overwhelmingly preferred the expert explanations, as shown in Table 3. When GPT-4V answered correctly, the quality of GPT-4V-generated explanations was close to that of expert-generated explanations: out of 95 votes, 19 preferred the expert explanation, 15 preferred GPT-4V, and 61 were ties.
We further evaluated the quality of the GPT-4V-generated explanations by verifying whether each explanation included an interpretation of the image and of the question text (Supplementary Table 2). Among the 95 correct answers, 84.2% (n=80) of the responses contained an interpretation of the image, while 96.8% (n=92) aptly captured the information presented in the question. On the other hand, for the 55 incorrect answers, 92.8% (n=51) interpreted the image, and 89.1% (n=49) depicted the question's details. In terms of comprehensiveness, GPT-4V offered a comprehensive explanation in 79.0% (n=75) of correct responses. In contrast, only 7.2% (n=4) of the wrong responses had a comprehensive explanation that led to GPT-4V’s choice.
We also evaluated the explanations of GPT-4V's incorrect responses and grouped the errors into the following categories: image misunderstanding, text hallucination, reasoning error, and non-medical error. Among GPT-4V responses with wrong answers (n=55), we found that 76.3% (n=42) of responses included a misunderstanding of the image, 45.5% (n=25) included a reasoning error, 18.2% (n=10) included text hallucination, and no responses included non-medical errors.
A Case Study of a Consult Conversation
We present a clinical case study of a 45-year-old woman with hypertension and altered mental status, in which GPT-4V can be used as a clinical decision support system. As shown in Supplementary Figure 2, the interactive design of GPT-4V allows communication between GPT-4V and physicians. In this hypothetical scenario, GPT-4V initially provided an irrelevant response when asked to interpret the CT scan. However, it was able to adjust its response and accurately identify the potential medical condition depicted in the image after receiving a physician’s visual hint: an arrow pointing to the part of the CT scan that the physician wanted GPT-4V to analyze.
By comparing GPT-4V's responses with the case report, we also found that GPT-4V generally offered clear and coherent responses through interaction with experts. When asked about the differential diagnosis, GPT-4V listed 3 diseases (primary aldosteronism, hypertension, and Cushing's syndrome) along with explanations that were deemed relevant by a medical doctor. Following a query about the subsequent steps to ascertain the origin of the anomaly, GPT-4V recommended a PET-CT scan. Utilizing the patient's PET-CT scan, it was able to locate a tumor in the mediastinum, lending credence to the suspicion of Cushing's syndrome. Finally, GPT-4V asked for further tests, such as a biopsy of the mass, to confirm the diagnosis.
Discussion
We found that GPT-4V outperformed ChatGPT and GPT-4 (Table 1). When evaluating all questions in the USMLE sample exam, GPT-4V achieved an accuracy of 90.7%, outperforming ChatGPT (58.5%) and GPT-4 (83.8%). In comparison, medical students can pass the USMLE exam with ≥60% accuracy, indicating that GPT-4V performed at a level similar to or above that of a medical graduate in the final year of study. The accuracy of GPT-4V not only highlights its grasp of the biomedical and clinical sciences essential for medical practice but also showcases its patient-management and problem-solving skills, both of which indicate potential for clinical routines such as summarizing radiology reports35 and differential diagnosis.36,37
For medical exam questions with images, we found that GPT-4V achieved an accuracy of 62%, which was equivalent to the 70th to 80th percentile of AMBOSS medical students. This finding indicates that GPT-4V has the capability to integrate information from both text and images to answer questions, making it a promising tool for answering clinical questions involving images. This is the first study to evaluate GPT-4V's performance on questions with images. Previous evaluations excluded questions with images because of the single-modality limitation of ChatGPT and GPT-4.20,38–40
Our findings revealed that while medical students’ performance decreased linearly as question difficulty increased, GPT-4V’s performance stayed relatively stable. When hints were provided, GPT-4V’s performance stayed almost the same across all difficulty levels, as shown in Figure 2. Therefore, compared with medical students, GPT-4V was effective in answering more difficult questions. There may be multiple factors contributing to this result. Instrument methods (e.g., item response theory (IRT)41) are typically used for the construction and evaluation of measurement scales and tests. For example, IRT employs a statistical model that links an individual’s responses to individual test items (questions on a test) to the person’s ability to respond correctly and to the items’ features. Therefore, medical examination test sets have been specifically selected and tailored to medical students’ performance, with an intended distribution in which performance decreases as the difficulty level increases. Although more evaluation is needed to conclude that GPT-4V substantially outperformed medical students on difficult questions, our results at least show that GPT-4V performed differently. This may make GPT-4V a useful clinical decision support system, as it may be complementary to physicians’ knowledge and thinking.
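As an illustration of the IRT framework mentioned above (not a model fitted in this study), the one-parameter (Rasch) model expresses the probability that an examinee with ability $\theta_j$ answers an item with difficulty $b_i$ correctly as

$$P(X_{ij}=1 \mid \theta_j, b_i) = \frac{1}{1 + e^{-(\theta_j - b_i)}}$$

Items are calibrated so that this probability, aggregated over examinees, declines as item difficulty $b_i$ increases, which produces the intended difficulty-performance distribution described above.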
On the other hand, we found that GPT-4V's performance was inconsistent across medical subdomains. As shown in Supplementary Table 1, GPT-4V achieved high accuracy in subdomains such as immunology (100%), otolaryngology (100%), and pulmonology (75%), and low accuracy in others such as anatomy (25%), emergency medicine (25%), and pathology (50%). This suggests that while CDSS shows potential in some specialties or subdomains, it may require further development to be reliable across the board. The uneven performance highlights the need for tailored approaches to enhancing the model's capabilities where it falls short.
In terms of explanation quality, we found that when GPT-4V answered correctly, the quality of its generated explanations was close to those created by domain experts. We also found that more than 80% of GPT-4V's responses provided an interpretation of both the image and the question underlying its answer selection, regardless of correctness. This suggests that GPT-4V consistently takes into account both the image and question elements while generating responses. Figure 1 illustrates an example of a high-quality explanation that utilizes both text and image in answering a hard question. In this example, more than 70% of students answered incorrectly on the first try, because both bacterial pneumonia and pulmonary embolism may involve symptoms such as cough. To differentiate them, GPT-4V correctly interpreted the X-ray as showing the radiologic sign of a Hampton hump, which further increased the suspicion of pulmonary infarction.