[Paper Translation] Diagnostic Accuracy of GPT Multimodal Analysis on USMLE Questions Including Text and Visuals


原文地址:https://www.medrxiv.org/content/10.1101/2023.10.29.23297733v1.full.pdf


Diagnostic Accuracy of GPT Multimodal Analysis on USMLE Questions Including Text and Visuals


Vera Sorin, $\mathrm{MD}^{1,2}$; Benjamin S. Glicksberg, $\mathrm{PhD}^{3}$; Yiftach Barash, $\mathrm{MD}^{1,2}$; Eli Konen, $\mathrm{MD}^{1}$; Girish Nadkarni, $\mathrm{MD}^{3-5}$; Eyal Klang, $\mathrm{MD}^{1-5}$


at Mount Sinai, New York, New York, USA.


medRxiv preprint doi: https://doi.org/10.1101/2023.10.29.23297733; this version posted October 31, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.


Abstract


Objective: Large Language Models (LLMs) have demonstrated proficiency in free-text analysis in healthcare. With recent advancements, GPT-4 now has the capability to analyze both text and accompanying images. The aim of this study was to evaluate the performance of the multimodal GPT-4 in analyzing medical images, using USMLE questions that incorporate visuals.


Methods: We analyzed GPT-4's performance on 55 USMLE sample questions across the three steps. In separate chat instances we provided the model with each question both with and without the images. We calculated accuracy with and without the images provided.


Results: GPT-4 achieved an accuracy of 80.0% with images and 65.0% without. There were no cases in which the model answered correctly without images and incorrectly with them. Performance varied across USMLE steps and was significantly better for questions with figures compared to graphs.


Conclusion: GPT-4 demonstrated an ability to analyze medical images from USMLE questions, including graphs and figures. A multimodal LLM in healthcare could potentially accelerate both patient care and research, by integrating visual data and text in analysis processes.



Introduction


Large Language Models (LLMs) such as ChatGPT by OpenAI have shown impressive skills in free-text analysis and generation across various healthcare-related tasks1-6. Numerous publications detail LLMs' performance on the United States Medical Licensing Examination (USMLE), with GPT-4 achieving an 80-90% success rate7. However, performance was evaluated on textual data only, either excluding questions with visuals or including them without the associated images.


While these abilities of LLMs are remarkable, language analysis is only one aspect of patient care. Diagnosis and management rely not only on anamnesis and clinical history, but also on physical examination and supplemental tests such as laboratory results, ECG, imaging, and pathology images8,9. These types of data can include images, audio, and video.


GPT-4 by OpenAI can now analyze images alongside textual data10. This advancement brings the algorithm one step closer to artificial general intelligence (AGI). In healthcare, it may enable more comprehensive and practical medical care, with many additional potential uses. The aim of our study was to evaluate GPT-4's performance in analyzing medical images, specifically using USMLE questions that incorporate visuals.


Methods


Experimental Design


We evaluated OpenAI's GPT-4 multimodal model on a set of 55 USMLE sample questions. These questions were sourced from the sample tests for Steps 1, 2CK, and 3 available on the official USMLE website. The same sample was used in a publication by Nori et al. (Microsoft and OpenAI) evaluating GPT-4 performance on the USMLE11. As text-only questions have been extensively studied before, we did not analyze them. Instead, we used all questions that included images and conducted two sets of experiments: one in which the model was provided with both the text and the accompanying images, and another including only the text without images. Questions were further divided into two groups based on the type of accompanying image: those with graphs versus those with figures. All interactions with the model were carried out using OpenAI's web interface.

Prompts


Two distinct prompt types were used to elicit responses from the model:


"Please answer the following USMLE question based on the attached image."


"The following USMLE question is based on an image that is not provided. Please answer the question to the best of your ability using the information available."


While both prompts incorporated the text from the questions, only the first prompt included the images alongside the text. Each prompt type was assessed in a separate chat instance for every question.


Statistical Analysis


Accuracy was the primary metric for evaluating the model's performance, calculated as the percentage of correctly answered questions over the total number of questions in each category, and computed separately for the same set of questions answered with and without the images provided. Fisher's exact test was employed to assess the statistical significance of the difference in performance between questions answered with and without images, and between questions with figures and questions with graphs.


All data preprocessing and statistical analyses were conducted using Python version 3.10, with Pandas for data manipulation and SciPy for statistical testing.

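The accuracy and significance calculations described above can be sketched as follows. This is a minimal illustration, not the authors' code: the counts (44/55 correct with images, 36/55 without) are taken from the Results section, and the variable names are illustrative.

```python
from scipy.stats import fisher_exact

# Correct-answer counts for the same 55 questions,
# answered with and without the accompanying images
# (counts taken from the Results section).
total = 55
correct_with = 44
correct_without = 36

accuracy_with = correct_with / total        # 0.800
accuracy_without = correct_without / total  # ~0.655

# 2x2 contingency table: rows = condition (with / without images),
# columns = (correct, incorrect).
table = [
    [correct_with, total - correct_with],
    [correct_without, total - correct_without],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

print(f"accuracy with images:    {accuracy_with:.1%}")
print(f"accuracy without images: {accuracy_without:.1%}")
print(f"Fisher's exact p-value:  {p_value:.3f}")
```

The same call, with the per-group counts substituted, covers the figures-versus-graphs comparison as well.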

Results


Dataset Description


The dataset comprised 55 USMLE sample questions distributed over three examination steps: 29 from Step 1, 13 from Step 2CK, and 13 from Step 3 (Figure 1). Questions accompanied by graphs were fewer: 11/29 in Step 1, 1/13 in Step 2CK, and 3/13 in Step 3.


Overall Performance


GPT-4 achieved an overall accuracy of 80.0% (44/55 correct answers) when images were provided and 65.0% (36/55 correct answers) without them (p=0.089). There was no case in which the model answered a question correctly without the image and then incorrectly with it. For the questions answered correctly without the images provided, it is interesting to examine the model's "reasoning": in some cases the model "chose" the correct answer based on overall probability, while in others the question contained enough clinical detail or image description to enable the correct diagnosis. Example cases are provided in the Supplement.


Performance by USMLE Step


GPT-4's accuracy differed across the three steps, both with and without the accompanying images (Table 1). The model achieved 100% accuracy (13/13 correct answers) on Step 3 questions when images were provided. The model's performance dipped to 55.2% (16/29 correct answers) on Step 1 questions without the images provided.

Performance by Image Type


When comparing performance on questions with graphs versus questions with figures, overall performance was better on questions with figures, whether or not the images were provided. For questions with graphs, accuracy was 20% (comparable to random guessing) when images were not provided, improving to 40% when images were included (Table 2).


Discussion


In this study we evaluated the performance of multimodal GPT-4 using USMLE questions that incorporate visuals. There are several important findings: (1) Overall accuracy on USMLE questions reached 80% when images were provided. This is close to the previously reported accuracy of 80-90% based on text analysis alone7. (2) The model's accuracy improved with the inclusion of images, though the difference was not statistically significant. (3) GPT-4 performed better on questions with figures than on questions with graphs.