[论文翻译]ChatGPT在医疗健康领域的应用:分类与系统综述


原文地址:https://www.medrxiv.org/content/10.1101/2023.03.30.23287899v1.full.pdf


ChatGPT in Healthcare: A Taxonomy and Systematic Review

ChatGPT在医疗健康领域的应用:分类与系统综述

Jianning Li, Amin Dada, Jens Kleesiek, Jan Egger∗

Jianning Li, Amin Dada, Jens Kleesiek, Jan Egger∗

Institute of Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße, 45131 Essen, Germany.

德国埃森大学医院人工智能医学研究所 (AöR), Girardetstraße, 45131 埃森

* Corresponding author: jan.egger (at) uk-essen.de (J.E.)

* 通讯作者: jan.egger (at) uk-essen.de (J.E.)

March 2023

2023年3月

Abstract

摘要

The recent release of ChatGPT, a chatbot research project/product of natural language processing (NLP) by OpenAI, stirs up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the ‘productization’ of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience in artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the ‘status quo’ of ChatGPT in medical applications, for general readers, healthcare professionals as well as NLP scientists. The large biomedical literature database PubMed is used to retrieve published works on this topic using the keyword ‘ChatGPT’. An inclusion criterion and a taxonomy are further proposed to filter the search results and categorize the selected publications, respectively. The review finds that the current release of ChatGPT has achieved only moderate or ‘passing’ performance in a variety of tests, and that it is unreliable for actual clinical deployment, since it is not intended for clinical applications by design. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction to pursue for critical clinical applications.

ChatGPT是OpenAI近期发布的自然语言处理(NLP)聊天机器人研究项目/产品,在公众和医疗专业人士中引发轰动,短时间内积累了庞大的用户基础。这是尖端技术"产品化"的典型案例,让非技术背景的普通大众能亲身体验人工智能(AI),类似于AlphaGo(英国DeepMind Technologies)和自动驾驶汽车(谷歌、特斯拉等)引发的AI热潮。然而,医疗研究人员尤其需要在这种热潮中保持谨慎。本文系统综述了ChatGPT在医疗领域应用的现有文献,为普通读者、医疗从业者和NLP科学家阐明ChatGPT在医学应用中的"现状"。研究使用生物医学文献数据库PubMed,以"ChatGPT"为关键词检索相关文献,进一步提出纳入标准和分类法来筛选结果并归类文献。综述发现,当前版本的ChatGPT在各种测试中仅达到中等或"及格"水平,由于设计初衷并非临床用途,其实际临床部署并不可靠。我们得出结论:针对关键临床应用,基于(生物)医学数据集训练的专业NLP模型仍是正确发展方向。

Keywords: ChatGPT; Healthcare; NLP; Transformer; LLM; OpenAI; Taxonomy; Bard; BERT; LLaMA.

关键词: ChatGPT; 医疗保健; NLP; Transformer; 大语言模型(LLM); OpenAI; 分类法; Bard; BERT; LLaMA.

1 Introduction

1 引言

In November 2022, a chatbot called ChatGPT was released. According to itself, it is ‘a conversational AI language model developed by OpenAI. It uses deep learning techniques to generate human-like responses to natural language inputs. The model has been trained on a large dataset of text and has the ability to understand and generate text for a wide range of topics. ChatGPT can be used for various applications such as customer service, content creation, and language translation’. Since its release, ChatGPT has taken humans by storm, and its user base has grown even faster than that of the previous record holder, TikTok, reaching 100 million users just two months after its launch. ChatGPT is already used to generate textual content, presentations and even source code for all kinds of topics. But what does that mean specifically for the healthcare sector? What if the general public or medical professionals turn to ChatGPT for treatment decisions? To answer these questions, we will look at published works that have already reported the usage of ChatGPT in the medical field. In doing so, we will explore and discuss ethical concerns when using ChatGPT, specifically within the healthcare sector (e.g., in clinical routines). We also identify specific action items that we believe have to be undertaken by creators and providers of chatbots to avoid catastrophic consequences that go far beyond letting a chatbot do someone’s homework. This review makes William B. Schwartz’s 1970 description of conversational agents that will serve as consultants by enhancing the intellectual functions of physicians through interactions [94] seem as up-to-date as ever.

2022年11月,一款名为ChatGPT的聊天机器人发布。根据其自我描述,它是"由OpenAI开发的对话式AI语言模型,采用深度学习技术生成类人化的自然语言响应。该模型基于海量文本数据集训练,能够理解并生成涵盖广泛主题的文本,可应用于客户服务、内容创作和语言翻译等多种场景"。自发布以来,ChatGPT迅速引发全球热潮,其用户增长速度甚至超越了当前纪录保持者TikTok,上线仅两个月就突破1亿用户。该工具已被用于生成各类主题的文本内容、演示文稿乃至源代码。但这对医疗健康领域究竟意味着什么?当公众或医疗专业人员转向ChatGPT寻求诊疗决策时会发生什么?为解答这些问题,我们将审视已发表的关于ChatGPT在医学领域应用的研究报告。通过这一过程,我们将探讨并讨论使用ChatGPT(特别是在医疗健康领域,如临床常规工作中)涉及的伦理问题。同时,我们提出了一系列具体行动建议,认为聊天机器人的开发者和提供商必须采取这些措施,以避免产生远超"让聊天机器人代写作业"的灾难性后果。本综述使得William B. Schwartz在1970年关于"对话代理将通过增强医生智力功能来担任顾问角色"的论述[94] 显得前所未有的前瞻。

Even though the application of natural language processing (NLP) in healthcare is not new [34, 101, 111, 77], the recent release of ChatGPT, a direct product of NLP, still generated a hype in artificial intelligence (AI), sparked a heated discussion about ChatGPT’s potential capabilities and pitfalls in healthcare, and attracted the attention of researchers from different medical specialities. The sensation can largely be attributed to ChatGPT’s barrier-free (browser-based) and user-friendly interface, allowing medical professionals and the general public without a technical background to easily communicate with the Transformer- and reinforcement learning-based language model. Currently, the interface is designed for question answering (QA), i.e., ChatGPT responds in text to the questions/prompts from users. All established or potential applications of ChatGPT in different medical specialities and/or clinical scenarios hinge on the QA feature, distinguished only by how the prompts are formulated (format-wise: open-ended, multiple choice, etc.; content-wise: radiology, parasitology, toxicology, diagnosis, medical education and consultation, etc.). Numerous publications featuring these applications have also been generated and indexed in PubMed since the release. This systematic review dives into these publications, aiming to elucidate the current state of employment, as well as the limitations and pitfalls, of ChatGPT in healthcare, amidst the ChatGPT AI hype.

尽管自然语言处理(NLP)在医疗健康领域的应用并非新鲜事[34, 101, 111, 77],但作为NLP直接产物的ChatGPT近期发布,仍在人工智能(AI)领域引发热潮,并激起关于ChatGPT在医疗健康领域潜在能力与缺陷的热烈讨论,吸引了来自不同医学专业研究人员的关注。这一现象很大程度上归功于ChatGPT的无障碍(基于浏览器)和用户友好界面,使得没有技术背景的医疗从业者和普通大众也能轻松与基于Transformer和强化学习的语言模型交流。目前该界面专为问答(QA)设计,即ChatGPT以文本形式回应用户的问题/提示。ChatGPT在不同医学专业和/或临床场景中所有已确立或潜在的应用都依赖于这一QA特性,区别仅在于提示的构建方式(形式上:开放式、多选题等;内容上:放射学、寄生虫学、毒理学、诊断、医学教育与咨询等)。自发布以来,PubMed已收录大量展示这些应用的出版物。本系统综述深入分析这些文献,旨在阐明ChatGPT在医疗健康领域的应用现状、局限性与缺陷,以回应当前的ChatGPT AI热潮。

Table 1: Summary of Level 1 and Level 2 papers.

| Ref. | Scenario | Category | Main Content | Tag |
| [93] | clinical workflow | editorial | discussion of the potential use, limitations and risks of ChatGPT in nursing practice | Level 1 |
| [92] | medical research | perspective | comments about ChatGPT in scientific writing; use ChatGPT to summarize and compare across papers | Level 1 |
| [81] | medical research | editorial | generic comments on using ChatGPT in orthopaedic research | Level 1 |
| [79] | medical research | letter to the editor | comments on using ChatGPT in scientific publications and generating research ideas | Level 1 |
| [59] | miscellaneous | letter to the editor | comments on the potential use and pitfalls of ChatGPT in healthcare | Level 1 |
| [106] | miscellaneous | editorial | discussion with ChatGPT about synthetic biology (e.g., applications, ethical regulations, history, research trends, etc.) | Level 1 |
| [25] | medical research | editorial | comments on the pros and cons of using ChatGPT in medical research | Level 1 |
| [65] | miscellaneous | original article | comments on the potential usage of ChatGPT in radiology (generate radiological reports, education, diagnostic decision-making, communicate with patients, compose radiological research articles) | Level 1 |
| [8] | medical education & research | letter to the editor | comments on the pros and cons of ChatGPT in medical education and research | Level 1 |
| [35] | miscellaneous | primer | short comment on ChatGPT for urologists | Level 1 |
| [49] | consultation | correspondence | ChatGPT for antimicrobial consultation | Level 1 |
| [48] | medical research | article (preprint) | comments on ChatGPT in peer review | Level 1 |
| [72] | miscellaneous | editorial | comments on ChatGPT in translational medicine | Level 1 |

表 1: Level 1 和 Level 2 论文总结。

| 文献 | 场景 | 类别 | 主要内容 | 标签 |
| [93] | 临床工作流 | 社论 | 讨论 ChatGPT 在护理实践中的潜在用途、局限性和风险 | Level 1 |
| [92] | 医学研究 | 观点 | 关于 ChatGPT 在科学写作中的评论;使用 ChatGPT 总结和比较论文 | Level 1 |
| [81] | 医学研究 | 社论 | 对在骨科研究中使用 ChatGPT 的通用评论 | Level 1 |
| [79] | 医学研究 | 致编辑的信 | 关于在科学出版物中使用 ChatGPT 及生成研究想法的评论 | Level 1 |
| [59] | 其他 | 致编辑的信 | 关于 ChatGPT 在医疗保健中的潜在用途和陷阱的评论 | Level 1 |
| [106] | 其他 | 社论 | 与 ChatGPT 讨论合成生物学(如应用、伦理规范、历史、研究趋势等) | Level 1 |
| [25] | 医学研究 | 社论 | 关于在医学研究中使用 ChatGPT 的利弊评论 | Level 1 |
| [65] | 其他 | 原创文章 | 关于 ChatGPT 在放射学中的潜在用途的评论(生成放射报告、教育、诊断决策、与患者沟通、撰写放射学研究文章) | Level 1 |
| [8] | 医学教育与研究 | 致编辑的信 | 关于 ChatGPT 在医学教育和研究中的利弊评论 | Level 1 |
| [35] | 其他 | 入门 | 对泌尿科医师使用 ChatGPT 的简短评论 | Level 1 |
| [49] | 咨询 | 通信 | 使用 ChatGPT 进行抗菌咨询 | Level 1 |
| [48] | 医学研究 | 文章(预印本) | 关于 ChatGPT 在同行评审中的评论 | Level 1 |
| [72] | 其他 | 社论 | 关于 ChatGPT 在转化医学中的评论 | Level 1 |
| [14] | consultation | letter to the editor | comments on the pros and cons of ChatGPT in public/community health (e.g., answer generic public health questions) | Level 1 |
| [73] | miscellaneous | article | comments on the ethics of using ChatGPT in Health Professions Education | Level 1 |
| [60] | medical research | letter to the editor | brief comments on using ChatGPT in medical writing | Level 1 |
| [80] | medical education | editorial | comment on ChatGPT in nursing education | Level 1 |
| [11] | miscellaneous | commentary | comment on ChatGPT in translational medicine | Level 1 |
| [113] | miscellaneous | editorial | comment on ChatGPT in healthcare | Level 1 |
| [58] | medical research | editorial | comment on ChatGPT in medical writing | Level 1 |
| [7] | medical research | editorial | comment on using ChatGPT for scientific writing in sports & exercise medicine | Level 1 |
| [12] | medical research | perspective | comment on medical writing | Level 1 |
| [91] | miscellaneous | review | systematic review on ChatGPT in healthcare | Level 1 |
| [5] | medical research | editorial | comment on the hallucination issue of ChatGPT in medical writing | Level 1 |
| [71] | medical research | editorial | ChatGPT drafts an article on vaccine effectiveness | Level 2 |
| [108] | medical research | review | review on ChatGPT in medical research, including use examples | Level 2 |
| [6] | medical research | original article | use ChatGPT to compile a review article on Digital Twin in healthcare | Level 2 |
| [82] | clinical workflow | comment | use ChatGPT to generate a discharge summary for a patient who had hip replacement surgery (including follow-up care suggestions) | Level 2 |
| [89] | clinical workflow | letter to the editor | ChatGPT gives diagnosis, prognosis and explanation for a clinical toxicology case of acute organophosphate poisoning | Level 2 |

| [14] | 咨询 | 致编辑的信 | 评论ChatGPT在公共/社区健康中的利弊(例如回答一般公众健康问题) | Level 1 |
| [73] | 其他 | 文章 | 评论在健康职业教育中使用ChatGPT的伦理问题 | Level 1 |
| [60] | 医学研究 | 致编辑的信 | 简要评论在医学写作中使用ChatGPT | Level 1 |
| [80] | 医学教育 | 社论 | 评论ChatGPT在护理教育中的应用 | Level 1 |
| [11] | 其他 | 评论 | 评论ChatGPT在转化医学中的作用 | Level 1 |
| [113] | 其他 | 社论 | 评论ChatGPT在医疗保健中的应用 | Level 1 |
| [58] | 医学研究 | 社论 | 评论ChatGPT在医学写作中的应用 | Level 1 |
| [7] | 医学研究 | 社论 | 评论在运动与运动医学科学写作中使用ChatGPT | Level 1 |
| [12] | 医学研究 | 观点 | 评论医学写作 | Level 1 |
| [91] | 其他 | 综述 | 关于ChatGPT在医疗保健中的系统综述 | Level 1 |
| [5] | 医学研究 | 社论 | 评论ChatGPT在医学写作中的幻觉问题 | Level 1 |
| [71] | 医学研究 | 社论 | ChatGPT起草一篇关于疫苗有效性的文章 | Level 2 |
| [108] | 医学研究 | 综述 | 关于ChatGPT在医学研究中的应用综述(含使用示例) | Level 2 |
| [6] | 医学研究 | 原创文章 | 使用ChatGPT编写关于医疗保健中数字孪生的综述文章 | Level 2 |
| [82] | 临床工作流程 | 评论 | 使用ChatGPT为接受髋关节置换手术的患者生成出院摘要(含随访护理建议) | Level 2 |
| [89] | 临床工作流程 | 致编辑的信 | ChatGPT为急性有机磷中毒临床毒理学病例提供诊断、预后和解释 | Level 2 |

| [19] | medical research | editorial | ChatGPT answers questions about computational systems biology in stem cell research, but its answers lack depth | Level 2 |
| [40] | medical research | letter to the editor | use ChatGPT to search the literature of a given topic, but the majority of returned publications are fabricated | Level 2 |
| [78] | medical (anatomy) education | letter to the editor | ChatGPT answers anatomy-related questions; results show ChatGPT is currently incapable of giving accurate anatomy information | Level 2 |
| [1] | consultation | letter to the editor | ChatGPT answers questions on cardiopulmonary resuscitation | Level 2 |
| [75] | miscellaneous | Discussions with Leaders (invitation only) | comments and use examples of ChatGPT in nuclear medicine | Level 2 |
| [3] | medical education | editorial | ChatGPT answers multiple-choice questions on nuclear medicine; results suggest ChatGPT does not possess the knowledge of a nuclear medicine physician | Level 2 |
| [20] | medical research | brief report | comments on using ChatGPT in healthcare (e.g., compose medical notes) and medical research (e.g., generate abstracts, research topics) | Level 2 |
| [47] | consultation | commentary | ChatGPT answers cancer-related questions | Level 2 |
| [15] | consultation | commentary | ChatGPT answers epilepsy-related questions | Level 2 |
| [100] | consultation | article | comments on ChatGPT in diabetes self-management and education (DSME) | Level 2 |
| [31] | medical research | editorial | ChatGPT generates a curriculum about AI for medical students and a list of recommended readings | Level 2 |

| [19] | 医学研究 | 社论 | ChatGPT回答干细胞研究中关于计算系统生物学的问题,但其答案缺乏深度 | Level 2 |
| [40] | 医学研究 | 致编辑的信 | 使用ChatGPT搜索特定主题的文献,但大多数返回的出版物是捏造的 | Level 2 |
| [78] | 医学(解剖学)教育 | 致编辑的信 | ChatGPT回答解剖学相关问题;结果显示ChatGPT目前无法提供准确的解剖学信息 | Level 2 |
| [1] | 咨询 | 致编辑的信 | ChatGPT回答关于心肺复苏的问题 | Level 2 |
| [75] | 其他 | 领导者对话(仅限邀请) | 评论并举例说明ChatGPT在核医学中的应用 | Level 2 |
| [3] | 医学教育 | 社论 | ChatGPT回答核医学选择题;结果表明ChatGPT不具备核医学医师的知识 | Level 2 |
| [20] | 医学研究 | 简报 | 评论在医疗保健(如撰写病历)和医学研究(如生成摘要、研究主题)中使用ChatGPT | Level 2 |
| [47] | 咨询 | 评论 | ChatGPT回答癌症相关问题 | Level 2 |
| [15] | 咨询 | 评论 | ChatGPT回答癫痫相关问题 | Level 2 |
| [100] | 咨询 | 文章 | 评论ChatGPT在糖尿病自我管理与教育(DSME)中的应用 | Level 2 |
| [31] | 医学研究 | 社论 | ChatGPT为医学生生成关于AI的课程大纲和推荐阅读清单 | Level 2 |

Based on the findings derived from existing publications on ChatGPT in healthcare, this systematic review addresses the following research questions:

基于现有关于ChatGPT在医疗健康领域的研究成果,本系统性综述旨在解答以下研究问题:

The rest of the manuscript is organized as follows: Section 2 briefly introduces NLP, Transformers and large language models (LLMs), on which ChatGPT is built. Section 3 introduces the inclusion criteria and taxonomy used in the systematic review, and discusses in detail the selected publications. Section 4 presents the answers to the above research questions (RQ1 - RQ4), and Section 5 summarizes and concludes the review.

本文其余部分的结构安排如下:第2节简要介绍自然语言处理(NLP)、Transformer架构以及ChatGPT所基于的大语言模型(LLM)。第3节阐述系统综述的纳入标准与分类体系,并详细讨论入选文献。第4节针对前述研究问题(RQ1-RQ4)给出解答,第5节对综述进行总结与结论。

2 Background

2 背景

2.1 Natural Language Processing (NLP)

2.1 自然语言处理 (NLP)

Natural Language Processing (NLP) [22] is an interdisciplinary research field that aims to develop algorithms for the computational understanding of written and spoken languages. Some of the most prominent applications include text classification, question answering, speech recognition, language translation, chatbots, and the generation or summarization of texts. Over the past decade, the progress of NLP has been accelerated by deep learning techniques, in conjunction with increasing hardware capabilities and the availability of massive text corpora. Given the fast growth of digital data and the growing need for automated language processing, NLP has become an indispensable technology in various industries, such as healthcare, finance, education, and marketing.

自然语言处理(NLP) [22] 是一个跨学科研究领域,旨在开发能够计算理解书面和口头语言的算法。其最突出的应用包括文本分类、问答系统、语音识别、语言翻译、聊天机器人以及文本生成或摘要。过去十年间,随着硬件性能提升和海量文本语料库的可用性,深度学习技术加速了NLP的发展。鉴于数字数据的快速增长以及对自动化语言处理日益增长的需求,NLP已成为医疗、金融、教育和营销等行业不可或缺的技术。

2.2 Transformer

2.2 Transformer

In 2017, Vaswani et al. [109] introduced the Transformer model architecture, replacing the previously widespread recurrent neural networks (RNNs) [76], long short-term memory networks (LSTMs) [45] and Word2Vec [23]. Transformers are feed-forward networks combined with specialized attention blocks that enable the model to selectively attend to distinct segments of its input. Attention blocks overcome two important limitations of RNNs. First, they enable Transformers to process input in parallel, whereas in RNNs each computation step depends on the previous one. Second, they allow Transformers to learn long-term dependencies. Since their introduction, Transformers have consecutively achieved state-of-the-art results on various NLP benchmarks. Further developments include novel training tasks [24, 54, 114], adaptations of the network architecture [42, 64], and reductions of computational complexity [57, 64, 41]. However, the amount of training data and model complexity remain primary factors determining model performance. Transformers have also been used for tasks beyond NLP, such as image and video processing [95], and they are an active area of research in the deep learning community.

2017年,Vaswani等人[109]提出了Transformer模型架构,取代了之前广泛使用的循环神经网络(RNN)[76]、长短期记忆网络(LSTM)[45]和Word2Vec[23]。Transformer是由前馈网络与特殊注意力模块组成的架构,使模型能够选择性地关注输入的不同片段。注意力模块克服了RNN的两个重要局限:首先,它使Transformer能够并行处理输入,而RNN的每个计算步骤都依赖于前一步骤;其次,它让Transformer能够学习长期依赖关系。自问世以来,Transformer在各种自然语言处理基准测试中连续取得最先进成果。后续发展包括新型训练任务[24,54,114]、网络架构调整[42,64]以及计算复杂度降低[57,64,41]。然而有限的训练数据和模型复杂度仍是影响性能的主要因素之一。Transformer也被应用于自然语言处理之外的任务,如图像和视频处理[95],目前仍是深度学习领域的研究热点。
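The attention computation at the heart of the Transformer can be sketched in plain Python. This is a minimal, single-head illustration of scaled dot-product attention; real models add learned projections, multiple heads and masking:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention (Vaswani et al., 2017).

    Q, K, V are lists of vectors (lists of floats). Every query attends
    to every key at once, which is what lets Transformers process input
    in parallel and link long-range positions directly.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        # output = attention-weighted sum of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# Toy self-attention over two 2-dimensional tokens
x = [[1.0, 0.0], [0.0, 1.0]]
print(attention(x, x, x))
```

Each output row is a convex combination of the value vectors, so the result always stays within the range spanned by `V`.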

2.3 Large Language Models (LLMs)

2.3 大语言模型 (LLMs)

Large language models (LLMs) [17] refer to massive Transformer models trained on extensive datasets. Substantial research has been conducted on scaling the size of Transformer models. The popular BERT model [26], which in 2019 achieved record-breaking performance on seven tasks of the GLUE benchmark [110], possesses 110 million parameters. On the other hand, GPT-3 [18] had already reached 175 billion parameters by 2021. At the same time, the size of the training datasets has continued to grow. BERT, for example, was trained on a dataset comprising 3.3 billion words, while the recently published LLaMA [107] was trained on 1.4 trillion tokens. Despite their success, LLMs face several challenges, including the need for massive computational resources and the potential of adopting bias and misinformation from training data. Additionally, overconfidence when expressing wrong statements and a general lack of uncertainty remain significant concerns in NLP applications. As LLMs continue to improve and become more widespread, addressing these challenges and ensuring they are used ethically and responsibly is essential. ChatGPT is another representative LLM, released by OpenAI; in response, other tech giants have released their own LLMs, such as the previously mentioned LLaMA from Meta. Figure 1 illustrates the evolution of LLMs.

大语言模型 (LLMs) [17] 指的是基于海量数据训练的巨型Transformer模型。关于扩展Transformer模型规模的研究已取得显著进展:2019年在Glue Benchmark [110] 七项任务中创下纪录的BERT模型 [26] 具有1.1亿参数,而GPT-3 [18] 到2021年已达到1750亿参数。与此同时,训练数据集规模持续扩大——BERT的训练数据包含33亿单词,近期发布的LLaMA [107] 则使用了1.4万亿token进行训练。尽管大语言模型取得成功,仍面临多重挑战:需要消耗巨大计算资源、可能继承训练数据中的偏见与错误信息。此外,在自然语言处理应用中,模型对错误陈述的过度自信及普遍缺乏不确定性仍是重要隐患。随着大语言模型持续改进和普及,解决这些挑战并确保其符合伦理规范至关重要。ChatGPT是OpenAI发布的代表性大语言模型,其他科技巨头也相继推出相关产品作为回应,例如Meta此前发布的LLaMA。图1展示了大语言模型的演进历程。


Figure 1: Evolution of large language models (LLMs) (adapted from [96])

图 1: 大语言模型 (LLM) 的演进 (改编自 [96])

3 Methodology

3 方法论

The search strategy used in this systematic review is illustrated in Figure 2, following the PRISMA guidelines. We use PubMed as the only source to search for candidate publications. Since the majority of the papers are very short (without abstracts), eligibility is determined at first screening based on the inclusion criteria below.

本系统综述采用的检索策略如图 2 所示,遵循 PRISMA 指南。我们仅使用 PubMed 作为候选文献的检索来源。由于大多数论文篇幅较短(无摘要),初步筛选时根据以下纳入标准判定文献合格性。
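The keyword search itself can also be reproduced programmatically. The sketch below only builds a query URL for NCBI's public E-utilities `esearch` endpoint; the review used the PubMed web interface, and the date window here is our assumption based on ChatGPT's release month and the review's retrieval cut-off:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, maxdate, retmax=200):
    """Build an NCBI E-utilities esearch URL for a PubMed keyword query,
    restricted to publication dates up to `maxdate` (YYYY/MM/DD)."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "datetype": "pdat",           # filter on publication date
        "mindate": "2022/11/01",      # assumed window start (ChatGPT release)
        "maxdate": maxdate,
        "retmode": "json",
    }
    return EUTILS + "?" + urlencode(params)

print(pubmed_search_url("ChatGPT", "2023/03/20"))
```

Fetching this URL (e.g., with `urllib.request`) returns the matching PubMed IDs, which can then be screened against the inclusion criteria.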

3.1 Inclusion Criteria

3.1 纳入标准

The review is expressly dedicated to the ChatGPT released in November 2022 by OpenAI, excluding its predecessor (GPT-3.5) and successor (GPT-4), other large language models (LLMs) such as InstructGPT, and general NLP medical applications [69]. By March 20, 2023, a total of 140 publications had been retrieved from PubMed (https://pubmed.ncbi.nlm.nih.gov/) using the keyword ChatGPT. Among them, articles written in languages other than English (e.g., French [84]), without full-text access (e.g., [62]), or whose main content has little to do with (or is not specific to) either ChatGPT (e.g., [46, 104, 33, 37]) or healthcare (e.g., [97, 103, 27, 6, 39, 13, 88, 21, 66, 115, 102, 43]) are excluded. Other representative exclusions include [44, 55], which deal with GPT-3, and [56, 30, 90, 2], where the authors claimed that ChatGPT assisted with the writing of the papers or case reports, but did not provide any discussion of the appropriateness of the generated texts or of how the texts were incorporated into the main content. Generic comments that are not specific to healthcare, such as [105, 115, 16, 50], where the authors comment on the authorship of ChatGPT and on using ChatGPT in scientific writing, are also excluded. Several repetitive articles were found in the PubMed search results. Table 1 and Table 2 show the full list of selected publications based on the inclusion (exclusion) criteria.

本综述专门针对OpenAI于2022年11月发布的ChatGPT,不包括其前代版本(GPT-3.5)、后继版本(GPT-4)、其他大语言模型(如InstructGPT)以及通用NLP医学应用[69]。截至2023年3月20日,在PubMed (https://pubmed.ncbi.nlm.nih.gov/) 使用关键词ChatGPT共检索到140篇文献。其中,非英语撰写的文章(如法语[84])、无法获取全文的文献(如[62])、主要内容与ChatGPT(如[46,104,33,37])或医疗健康领域(如[97,103,27,6,39,13,88,21,66,115,102,43])关联性较低的文献均被排除。其他典型排除案例包括涉及GPT-3的[44,55],以及作者声明使用ChatGPT辅助论文或病例报告撰写但未讨论生成文本的适用性及文本整合方式的[56,30,90,2]。与医疗健康领域无关的通用评论(如[105,115,16,50]中关于ChatGPT作者身份及科研写作应用的讨论)同样被排除。PubMed检索结果中发现若干重复文献。表1和表2列示了基于纳入(排除)标准的最终文献清单。
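As a minimal illustration, the screening step amounts to a simple predicate applied to each retrieved record. The field names below are hypothetical, not from the paper:

```python
def is_included(rec):
    """Apply the review's inclusion criteria to one retrieved record.

    `rec` is a hypothetical dict with keys 'language', 'has_full_text',
    'about_chatgpt' and 'about_healthcare' (illustrative field names).
    A paper is kept only if it is in English, its full text is
    accessible, and it is specific to both ChatGPT and healthcare.
    """
    if rec["language"] != "English":
        return False          # e.g., the French article [84]
    if not rec["has_full_text"]:
        return False          # e.g., [62]
    return rec["about_chatgpt"] and rec["about_healthcare"]

records = [
    {"language": "English", "has_full_text": True,  "about_chatgpt": True,  "about_healthcare": True},
    {"language": "French",  "has_full_text": True,  "about_chatgpt": True,  "about_healthcare": True},
    {"language": "English", "has_full_text": False, "about_chatgpt": True,  "about_healthcare": True},
    {"language": "English", "has_full_text": True,  "about_chatgpt": True,  "about_healthcare": False},
]
selected = [r for r in records if is_included(r)]
print(len(selected))  # 1
```

Only the first toy record survives all three checks, mirroring how the 140 retrieved publications were narrowed down.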

3.2 Taxonomy

3.2 分类体系

We propose a taxonomy, as shown in Figure 3, to categorize the selected publications included in the review. The taxonomy is based on applications, including ‘triage’, ‘translation’, ‘medical research’, ‘clinical workflow’, ‘medical education’, ‘consultation’, ‘multimodal’, each targeting one or multiple enduser groups, such as patients, healthcare professionals, researchers, medical students and teachers, etc. An application-based taxonomy allows more compact and inclusive grouping of papers, compared to categorizing papers by specific medical specialities. For example, scientific progress and findings generated through clinical practices are documented in the form of publications and/or reports, and literature reviews and novel ideas are usually required for medical researchers of all disciplines to publish their works. Thus, papers on ‘scientific writing’, ‘literature reviews’, ‘research ideas generation’, etc., can be grouped into the ‘medical research’ category. Similarly, the ‘consultation’ category comprises papers where ChatGPT is used in medical consulting settings for both corporations (e.g., insurance companies, medical consulting agencies, etc.) and individuals (e.g., patients) seeking medical information and advice. The ‘clinical workflow’ category includes ChatGPT’s applications in a variety of clinical scenarios, such as diagnostic decision-making, treatment and imaging procedure recommendation, and writing of discharge summary, patient letter and medical note. Furthermore, clinical departments, regardless of medical specialities, may benefit from a translation system for patients/visitors who are non-native language speakers (‘translation’). A triage system [10] guiding patients to the right departments would reduce the burden of clinical facilities and centers in general. Note that different categories are not necessarily completely independent, since all applications are reliant upon the QA-based interface of ChatGPT. 
By formulating the same questions differently according to different scenarios, ChatGPT’s role can change. For instance, by reformulating multiple-choice questions about a medical speciality in medical exams into open-ended questions, ChatGPT’s role changes from a medical student (‘medical education’) to a medical consultant (‘consultation’) or a clinician providing diagnoses or prescriptions (‘clinical workflow’). To avoid such ambiguity, the categorization of a paper is based solely on the scenario explicitly reported in the paper. The connections between the applications and end-users in Figure 3 are also not unique. In this review, only the most obvious connections are established, such as ‘medical education’ - ‘students/teachers/exam agencies’ and ‘medical research’ - ‘researchers’. The remainder of the review will show that all existing publications on ChatGPT in healthcare can find a proper categorization in the proposed taxonomy.

我们提出如图3所示的分类法,对综述涵盖的文献进行归类。该分类体系基于应用场景划分,包括"分诊"、"翻译"、"医学研究"、"临床工作流"、"医学教育"、"咨询"、"多模态"七大类,每类面向患者、医护人员、研究者、医学生与教师等一个或多个终端用户群体。相比按专科分类,基于应用的分类法能实现更紧凑且包容的论文归类。例如:通过临床实践产生的科学进展通常以论文/报告形式记载,而文献综述与创新思路是所有学科医学研究者发表成果的普遍需求,因此"科学写作"、"文献综述"、"研究创意生成"等主题可归入"医学研究"类。同理,"咨询"类涵盖ChatGPT应用于企业(如保险公司、医疗咨询机构)和个人(如患者)医疗咨询场景的论文。"临床工作流"类则包含ChatGPT在诊断决策、治疗方案与影像检查推荐、出院小结/患者信函/病历书写等多样化临床场景的应用。此外,各临床科室都可能需要为非母语患者/访客提供翻译服务("翻译"),而分诊系统[10]能通过引导患者正确就诊减轻医疗机构整体负担。需注意不同类别并非完全独立,因为所有应用都依赖ChatGPT基于问答的交互界面——通过根据不同场景调整问题表述,ChatGPT的角色可能转变。例如将医学考试中某专科选择题改写为开放式问题时,其角色就从医学生("医学教育")转变为医疗顾问("咨询")或提供诊断处方的临床医生("临床工作流")。为避免歧义,本文献分类仅依据论文明确记载的应用场景。图3中应用与终端用户的关联也非唯一,本综述仅建立最显著的关联(如"医学教育"-"学生/教师/考试机构"、"医学研究"-"研究者")。后续内容将表明,现有医疗领域ChatGPT研究都能在该分类体系中找到准确定位。
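The taxonomy can be encoded as a simple mapping from application categories to end-user groups. This is a hypothetical sketch: the user groups listed per category are our reading of Figure 3, and, as noted above, the connections are not unique:

```python
# Illustrative encoding of the proposed application-based taxonomy.
# Each category maps to its most obvious end-user groups (cf. Figure 3).
TAXONOMY = {
    "triage":            ["patients", "clinical facilities"],
    "translation":       ["patients", "healthcare professionals"],
    "medical research":  ["researchers"],
    "clinical workflow": ["healthcare professionals"],
    "medical education": ["students", "teachers", "exam agencies"],
    "consultation":      ["patients", "corporations"],
    "multimodal":        ["healthcare professionals", "patients"],
}

def users_for(category):
    """Return the end-user groups linked to an application category;
    unknown categories yield an empty list."""
    return TAXONOMY.get(category, [])

print(users_for("medical education"))  # ['students', 'teachers', 'exam agencies']
```

A paper tagged with a medical speciality (e.g., radiology) would still be filed under one of these application categories, which is what makes the grouping more compact than a speciality-based one.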


Figure 2: Search strategy used in this systematic review.

图 2: 本系统综述采用的检索策略。

medRxiv preprint doi: https://doi.org/10.1101/2023.03.30.23287899; this version posted March 30, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license .

medRxiv预印本 doi: https://doi.org/10.1101/2023.03.30.23287899; 本版本发布于2023年3月30日。该预印本的版权持有者 (未经同行评审认证) 是作者/资助者,其已授予medRxiv永久展示该预印本的许可。本作品采用CC-BY-NC 4.0国际许可协议提供。


Figure 3: Application- and user-oriented Taxonomy used in the ChatGPT review. The references shown in the application boxes are the Level 3 publications.

图 3: ChatGPT综述中采用的面向应用和用户的分类体系。应用框内标注的参考文献为第三层级文献。

Besides the taxonomy, we further assign a tag (Level 1 - Level 3 ) to the selected papers to indicate the depth and particularity of the papers on the ‘ChatGPT in Healthcare’ topic:

除了分类法之外,我们进一步为选定的论文分配标签(Level 1 - Level 3),以表明这些论文在"医疗健康领域的ChatGPT"主题上的深度和特殊性:

Shortly prior to our review, a systematic review of ChatGPT in healthcare was published by Sallam, M. [91]. It lacks, however, an inclusive taxonomy and a proper differentiation among the selected publications (tags: Level 1, Level 2, Level 3). We believe that the tag helps readers quickly filter and locate papers of interest. This review puts more emphasis on Level 3 papers, since they provide a clearer picture of the real capability of ChatGPT in different healthcare applications.

在我们开展本次综述前不久,Sallam, M. [91] 发表了一篇关于 ChatGPT 在医疗领域应用的系统性综述。然而,该综述缺乏一个包容性分类体系,且未对入选文献进行明确分级 (标签: Level 1, Level 2, Level 3)。我们认为分级标签能帮助读者快速筛选和定位目标论文。本次综述更侧重 Level 3 级别论文,因为它们能更清晰地展现 ChatGPT 在不同医疗应用场景中的实际能力。

3.3 General Profile of Level 1 and Level 2 Papers

3.3 一级和二级论文的总体概况

A list of Level 1 and Level 2 papers is summarized in Table 1. It is not unexpected that the majority of shortlisted papers fall into the Level 1 and Level 2 categories. As seen from Table 1, most Level 1 and Level 2 papers are short editorial comments or letters to the editor from multidisciplinary journals like Nature (https://www.nature.com/) and Science (https://www.science.org/), or from speciality journals in nuclear medicine [3, 59], plastic surgery [79, 38], synthetic biology [106] and orthopaedics [81]. These publications usually deliver high-level comments about the potential impact and pitfalls of ChatGPT in healthcare [113], with a focus on medical publishing. Scientific journals are among the immediate stakeholders of the publishing industry, on which ChatGPT will exert a significant impact. Thus, publishers have introduced new regulations regarding the use of ChatGPT in scientific publications, in particular whether ChatGPT is eligible as an author and whether ChatGPT-generated texts are allowed. Answers from leading publishers like Science are in the negative [105, 16]. Nature also bans ChatGPT authorship but takes a slightly more tolerant stance regarding ChatGPT-generated content, subject to a clear statement of whether, how and to what extent ChatGPT contributed to the submitted manuscript [103, 27]. The main argument for the decision is that ChatGPT can neither properly source the literature from which its answers are derived, causing unintentional plagiarism, nor take accountability as human authors do [105, 27]. The decision is echoed by the academic community [58, 97, 115, 66], agreeing that

表1总结了1级和$\mathcal{L}$级论文的清单。不出所料,大多数入围论文属于1级和2级类别。从表1可见,多数1级和2级论文是来自《自然》(https://www.nature.com/)、《科学》(https://www.science.org/)等多学科期刊的简短社论评论或致编辑信,或是核医学[3,59]、整形外科[79,38]、合成生物学[106]和骨科[81]等专业期刊的投稿。这些出版物通常就ChatGPT在医疗健康领域的潜在影响和陷阱发表高层级评论[113],重点关注医学出版领域。科学期刊是出版业中受ChatGPT显著影响的直接利益相关方。因此,出版商针对ChatGPT在科学出版物中的使用出台了新规,特别是关于ChatGPT是否具备作者资格及其生成文本是否被允许的问题。《科学》等主流出版商的答案是否定的[105,16]。《自然》也禁止将ChatGPT列为作者,但对ChatGPT生成内容持相对宽容态度,前提是需明确声明ChatGPT对投稿文稿的贡献程度、方式及范围[103,27]。该决定的主要依据是:ChatGPT无法妥善标注其答案的文献来源,可能导致无意识抄袭,也无法像人类作者那样承担责任[105,27]。这一立场获得了学术界[58,97,115,66]的普遍认同,认为...

ChatGPT-generated content must be scrutinized by human experts before being used [58], as the generated content, such as references [105, 12, 40, 31], could be fabricated. Lee, J.Y. et al. [66] reiterated from a legal (e.g., copyright law) perspective the inappropriateness of listing ChatGPT as an author, emphasizing that a non-human cannot take legal responsibility or bear the consequences. However, banning ChatGPT from scientific writing is not easily enforceable, since ChatGPT is trained to produce human-like texts that even scientists and specifically trained AI detectors sometimes fail to detect [29, 7]. In short, even though the prospect is promising [92, 25, 43, 102], new regulations and substantial improvements are needed before ChatGPT can be safely and widely used for scientific writing, publishing, or medical research in general [105]. The scenario column in Table 1 corresponds to the taxonomy categorization. If an article concerns healthcare or a medical speciality only in general terms, it is categorized as ‘miscellaneous’. The category column indicates the type of the publication.

ChatGPT生成的内容必须经过人类专家审查后才能使用[58],因为生成的内容(如参考文献[105, 12, 40, 31])可能存在捏造。Lee, J.Y.等人[66]从法律(如著作权法)角度重申将ChatGPT列为作者的不当性,强调非人类实体无法承担法律责任与后果。然而,完全禁止ChatGPT参与科学写作难以执行,因为ChatGPT生成的类人文本甚至可能骗过科学家和专用AI检测器[29, 7]。简言之,尽管前景广阔[92, 25, 43, 102],在ChatGPT能安全广泛应用于科研写作、出版或医学研究前[105],仍需新规制定和实质性改进。表1中的场景列对应分类体系,若论文涉及医疗健康或广义医学专业领域,则归类为"其他"。类别列表示出版物类型。

3.4 Reviews of Level 3 Papers

3.4 三级论文综述

Level 3 papers feature extensive experiments conducted to assess the suitability of ChatGPT for a medical speciality or clinical scenario. For open-ended (OE) questions, human experts are usually involved to assess the appropriateness of the answers. To quantify the subjective assessments, a scoring criterion and scheme (e.g., a 5-point, 6-point or 10-point Likert scale) is usually required. For multiple-choice questions, it is desirable not only to quantify the accuracies but also to evaluate whether the ‘justification’ given by ChatGPT and the choice are in congruence. When it comes to comparisons (with humans or other language models), statistical analysis is usually performed. As shown in Table 2, many Level 3 papers are still preprints (under review) at the time of writing this review. Most current ChatGPT evaluations concern ‘medical education’ (medical exams in particular), which requires no ethical approval to conduct. Representative works include [36, 61], where the authors test ChatGPT on the US Medical Licensing Examination (USMLE). Even though the evaluations were carried out independently ([36] and [61] were published almost at the same time), similar results were reported, i.e., ChatGPT achieved only moderate passing performance. [36] further showed that ChatGPT outperformed two other language models, InstructGPT and GPT-3, in the exam. In both studies, ChatGPT was asked to give not only the answers but also the justifications, which were taken into consideration during evaluation (by physicians). [36] further found that ChatGPT performed better on fact-check questions than on complex ‘know-how’ type questions. It is worth noting that the exam contains questions from different medical specialities. However, Mbakwe, A.B. et al. [74] raised the concern that ChatGPT, a language model, passing the exam indicates flaws in the exam system$^1$.
Besides the USMLE, ChatGPT was also tested on the Chinese National Medical Licensing Examination [112] and the AHA BLS/ACLS Exams 2016 [32], on both of which ChatGPT failed to achieve passing scores.

三级论文通过大量实验评估ChatGPT在医学专科或临床场景中的适用性。对于开放式问题(OE),通常需要人类专家参与评估回答的恰当性。为量化主观评价,需制定评分标准与方案(如5分制、6分制或10分李克特量表)。针对选择题,不仅需量化准确率,还应评估ChatGPT提供的"论证"是否与选项逻辑一致。进行比较研究(与人类或其他语言模型对比)时,通常需进行统计分析。如表2所示,截至本文撰写时,多数三级论文仍处于预印本(评审中)状态。当前ChatGPT评估主要集中于"医学教育"(特别是医学考试)领域,这类研究无需伦理审批。代表性研究包括[36,61],作者在美国医师执照考试(USMLE)中测试ChatGPT。尽管评估是独立进行的([36]与[61]几乎同期发表),两项研究仍报告了相似结果,即ChatGPT仅达到中等通过水平。[36]进一步显示ChatGPT在该考试中表现优于另两个语言模型InstructGPT和GPT-3。两项研究均要求ChatGPT同时提供答案与论证,并由医师在评估时综合考量。[36]还发现ChatGPT在事实核查类问题上的表现优于复杂"技术诀窍"类问题。值得注意的是,该考试涵盖多个医学专科的试题。然而Mbakwe, A.B.等[74]提出质疑:一个语言模型能通过考试,恰恰暴露了考试体系存在的缺陷$^1$。除USMLE外,ChatGPT还在中国国家医学资格考试[112]和2016年美国心脏协会基础/高级生命支持(BLS/ACLS)考试[32]中接受测试,但均未达到及格线。
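The evaluation protocol described above (accuracy for multiple-choice items plus expert Likert ratings for open-ended answers) can be illustrated with a minimal Python sketch. All function names, answers and ratings below are hypothetical and not taken from the cited studies:

```python
from statistics import mean

def mcq_accuracy(answers, key):
    """Fraction of multiple-choice answers that match the answer key."""
    assert len(answers) == len(key)
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def likert_means(ratings_per_item):
    """Mean Likert rating (e.g., on a 5-point scale) per open-ended item,
    averaged over several expert raters."""
    return [mean(ratings) for ratings in ratings_per_item]

# Hypothetical data: five MCQ answers and three expert ratings
# for each of two open-ended answers.
acc = mcq_accuracy(["B", "C", "A", "D", "B"], ["B", "C", "A", "A", "B"])
oe = likert_means([[4, 5, 4], [2, 3, 2]])
print(acc)  # 0.8
print(oe)
```

In the reviewed studies, such per-item scores would then feed into the statistical comparisons (against students or other language models) mentioned above.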

ChatGPT achieved performance similar to student examinees on a Doctor of Veterinary Medicine (DVM) exam containing 288 parasitology exam questions. One major limitation of using ChatGPT in medical exams is that the current release of ChatGPT can only process text inputs, whereas some questions are diagram-/figure-based$^2$. Such questions are either excluded or translated into text descriptions.

ChatGPT在包含288道寄生虫学考试题的兽医学博士(DVM)考试中表现出与学生考生相当的水平。当前版本ChatGPT在医学考试应用中的一个主要局限是只能处理文本输入,而部分试题基于图表/图示。这类题目要么被排除,要么被转化为文字描述。

Besides the standard medical exams, ChatGPT achieved promising results on cancer-related questions [47, 53]. In [53], ChatGPT’s answers to common cancer myths and misconceptions were evaluated by expert reviewers and compared with the standard answers from the National Cancer Institute (NCI). Results showed that ChatGPT is able to achieve very high accuracy, indicating that the current ChatGPT is already a reliable source of cancer-related information for cancer patients [47]. Furthermore, [83] tested ChatGPT with 100 questions related to retinal disease. The answers were evaluated based on a 5-point Likert scale by domain experts. It was found that ChatGPT answers general questions with high accuracy, while its answers are less satisfactory, and sometimes harmful, when it comes to treatment/prescription recommendations. On 85 multiple-choice questions concerning genetics/genomics, ChatGPT achieved performance similar to human respondents [28]. Interestingly, based on the test results, [28] also reached the conclusion that ChatGPT fares better on ‘memorization (fact-lookup)’ type questions than on those requiring critical thinking, similar to [83]. The performance of ChatGPT in these question-answering scenarios$^3$ shows its potential for medical consultation and education.

除了标准医学考试外,ChatGPT在癌症相关问题中也取得了令人瞩目的成果[47,53]。在[53]中,专家评审评估了ChatGPT对常见癌症迷思与误解的回答,并将其与美国国家癌症研究所(NCI)的标准答案进行对比。结果显示ChatGPT能达到极高准确率,表明当前版本已能作为癌症患者可靠的癌症信息源[47]。此外,[83]用100道视网膜疾病相关问题测试ChatGPT,由领域专家基于5级李克特量表评估答案。研究发现ChatGPT对一般性问题回答准确率较高,但在治疗/处方建议方面表现欠佳,有时甚至有害。在涉及遗传学/基因组学的85道选择题上,ChatGPT达到了与人类受访者相当的水平[28]。值得注意的是,[28]根据测试结果得出结论:与[83]类似,ChatGPT更擅长"记忆性(事实查询)"类问题,而非需要批判性思维的问题。ChatGPT在这些问答场景中的表现$^3$展现了其在医疗咨询和教育领域的潜力。

A few studies evaluate the use of ChatGPT in medical research, particularly in scientific writing [67] and in generating research questions [63] and systematic review topics [38]. In [67], the authors use ChatGPT to generate full abstracts, providing only the title and result sections of the abstracts from 50 real scientific publications. Even though previous studies [29] have shown that scientists cannot tell apart abstracts generated by ChatGPT from those written by humans, [67] found that the two groups can simply be differentiated based on Grammarly scores. Discriminative features of ChatGPT-generated texts include the mixed use of English dialects and linguistic perfection, e.g., very few typos, more unique words, proper preposition usage and no misuse of conjunctions and commas. These characteristics can be captured by Grammarly scores. The finding indicates that Grammarly could potentially be adopted by scientific journals to enforce a ’no-AI-generated-texts’ policy. In [63], the authors use ChatGPT to identify research questions in gastroenterology. The answers generated by ChatGPT proved to be highly relevant but lacked depth and novelty. In [38], ChatGPT is used to generate systematic review topics in plastic surgery. Similar to [63], ChatGPT-generated research topics are generally not novel. The version column in Table 2 shows the version of ChatGPT used for evaluation. [63] found that newer versions of ChatGPT tend to perform better on the same questions. In contrast to using ChatGPT directly for writing, which is expressly banned by many scientific journals, exploring new research ideas/topics with the assistance of ChatGPT raises fewer ethical concerns. However, [63, 38] demonstrated that the current version of ChatGPT is not sufficiently qualified for such tasks. Humans still play the dominant role in ingenious and innovative research.

少数研究评估了ChatGPT在医学研究中的应用,特别是在科学写作[67]、生成研究问题[63]和系统综述选题[38]方面。在[67]中,作者使用ChatGPT生成完整摘要,仅提供50篇真实科学出版物摘要的标题和结果部分。尽管先前研究[29]表明科学家无法区分ChatGPT生成的摘要与人工撰写的摘要,但[67]发现这两组可以通过Grammarly评分简单区分。ChatGPT生成文本的鉴别特征包括英语方言混用和语言完美性(如极少拼写错误、更多独特词汇、正确介词使用以及无连词和逗号误用),这些特征可通过Grammarly评分捕捉。该发现表明科学期刊可能采用Grammarly来执行"禁止AI生成文本"政策。在[63]中,作者使用ChatGPT识别胃肠病学领域的研究问题,其生成的答案虽高度相关但缺乏深度和新颖性。在[38]中,ChatGPT被用于生成整形外科的系统综述主题,与[63]类似,其生成的研究主题普遍缺乏创新性。表2中的版本列显示了用于评估的ChatGPT版本。[63]发现新版ChatGPT在相同问题上表现更优。与直接使用ChatGPT写作(被多数科学期刊明令禁止)相比,借助ChatGPT探索新研究想法/主题面临的伦理问题较少。然而[63, 38]证明当前版本的ChatGPT尚不足以胜任此类任务,人类仍在创新性研究中占据主导地位。
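The Grammarly-based separation reported in [67] rests on surface stylometry (few typos, more unique words, careful comma use). The toy sketch below computes two such surface features; the feature definitions and the example sentences are illustrative assumptions, not Grammarly’s actual scoring:

```python
import re

def surface_features(text):
    """Two crude stylometric features: lexical diversity (unique-word ratio)
    and comma density per word. Purely illustrative."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    unique_ratio = len(set(words)) / max(len(words), 1)
    commas_per_word = text.count(",") / max(len(words), 1)
    return unique_ratio, commas_per_word

# Hypothetical sentences standing in for human- vs. ChatGPT-written prose.
human_like = "The results, the methods, and the data, taken together, suggest the effect is real."
model_like = "The findings indicate a consistent effect across conditions and support our hypothesis."

h = surface_features(human_like)
m = surface_features(model_like)
print(h, m)  # the model-like text scores higher lexical diversity, fewer commas
```

A real detector would of course use many more features and a trained classifier; the point here is only that simple surface statistics can already carry signal.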

[87, 86, 4, 68] evaluate the application of ChatGPT in the clinical workflow. In [87], ChatGPT is used to decide the appropriate imaging procedure (e.g., mammography, MRI, US, etc.) for breast cancer screening and breast pain, given a description of the patient’s condition. ChatGPT’s responses were evaluated against the corresponding American College of Radiology (ACR) appropriateness criteria. Results showed that ChatGPT achieved moderate overall results, and its performance is noticeably better for breast cancer screening than for breast pain. The finding is in accordance with previous discussions that ChatGPT is already highly accurate on cancer-related information [47, 53]. The authors concluded that, even though ChatGPT showed impressive performance on the task, specialized AI tools are desired to support the clinical decision-making process more reliably. In a follow-up study [86], the authors tested ChatGPT with 36 clinical vignettes from Merck Sharp & Dohme (MSD), covering the entire clinical workflow (differential diagnosis, final diagnosis and subsequent clinical management of the patients). Overall, ChatGPT obtained an accuracy of 71.8% in the test, and its performance on differential diagnosis is significantly lower than on final diagnosis. ChatGPT achieved the highest accuracy on a cancer vignette. The patients and their conditions in these vignettes are only hypothetical, which removes the ethical barrier to conducting the evaluation. In [4], ChatGPT is used to write patient clinic letters for 38 hypothetical clinical scenarios (e.g., basal cell carcinoma, malignant melanoma, etc.), where ChatGPT communicates the diagnosis results and treatment advice to the patients in a friendly and easily understandable manner. The letters were evaluated by clinicians with respect to factual correctness and humanness, and ChatGPT achieved high scores on both criteria.
In [68], ChatGPT was supplied with seven types of clinical decision support (CDS) alerts (e.g., pediatric bronchiolitis, immunization, postoperative anesthesia nausea and vomiting, etc.) and asked to give suggestions. However, ChatGPT’s answers, even though highly relevant to the alerts, did not meet the acceptability standards of CDS experts.

[87, 86, 4, 68] 评估了 ChatGPT 在临床工作流程中的应用。在 [87] 中,ChatGPT 被用于根据患者病情描述,决定乳腺癌筛查和乳房疼痛的适当影像学检查程序(如乳腺X线摄影、MRI、超声等)。ChatGPT 的回答根据美国放射学会 (ACR) 的适用性标准进行评估。结果显示,ChatGPT 总体表现中等,且在乳腺癌筛查任务中的表现明显优于乳房疼痛。这一发现与此前的讨论一致,即 ChatGPT 在癌症相关信息上已具有较高准确性 [47, 53]。作者总结称,尽管 ChatGPT 在该任务中表现出色,但仍需要专门的 AI 工具以更可靠地支持临床决策过程。在后续研究 [86] 中,作者使用默沙东 (MSD) 的 36 个临床案例对 ChatGPT 进行测试,涵盖完整临床工作流程(鉴别诊断、最终诊断及后续临床管理)。总体而言,ChatGPT 在测试中取得了 71.8% 的准确率,但其在鉴别诊断上的表现显著低于最终诊断。ChatGPT 在癌症案例中取得了最高准确率。这些案例中的患者及其病情均为假设性设定,从而规避了伦理评估障碍。在 [4] 中,ChatGPT 被用于在 38 个假设性临床场景(如基底细胞癌、恶性黑色素瘤等)中撰写患者门诊信件,以友好易懂的方式向患者传达诊断结果和治疗建议。临床医生从事实准确性和人性化角度评估这些信件,ChatGPT 在两项标准上均获得高分。在 [68] 中,研究者向 ChatGPT 提供七类临床决策支持 (CDS) 警报(如儿科支气管炎、免疫接种、术后麻醉恶心呕吐等)并要求其给出建议。然而,尽管 ChatGPT 的回答与警报高度相关,但按照 CDS 专家标准仍未能达到充分可接受水平。
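Vignette-based evaluations like [86] rest on small samples (36 cases), so point accuracies such as 71.8% carry wide uncertainty. A Wilson score interval makes this explicit; the counts below are hypothetical, merely sized to roughly match that study:

```python
from math import sqrt

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Hypothetical: 26 of 36 vignettes answered correctly (~72% accuracy).
lo, hi = wilson_interval(26, 36)
print(f"acc={26/36:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With n = 36, the interval spans roughly 0.56 to 0.84, which is one reason the reviewed studies complement accuracies with statistical analysis before drawing comparisons.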

Table 2: Summary of Level 3 papers.

| Ref. | Scenario | Summary | Results/Conclusion | Version | Journal |
| [87] | clinical workflow | decide an imaging procedure or evaluate whether a procedure is proper for breast cancer/pain patients | specialized ChatGPT is needed | Jan. 9, 2023 | preprint |
| [86] | clinical workflow | ChatGPT supports clinical decision-making by answering questions from Merck Sharp & Dohme (MSD) clinical vignettes | ChatGPT achieves an overall accuracy of 71.7% on 36 clinical vignettes covering the entire clinical workflow | Jan. 9, 2023 | preprint |
| [51] | medical education | compare ChatGPT with medical students in (an internal) parasitology exam (79 questions) | ChatGPT is not comparable to medical students (Acc. 89.6%) on parasitology questions | Dec. 15, 2022 | JEEHP |
| [36] | medical education | ChatGPT takes the US Medical Licensing Examination (USMLE) | ChatGPT achieved a passing score | Dec. 15, 2022 | JMIR |
| [4] | clinical workflow | ChatGPT writes patient letters (e.g., communicates diagnostic results, gives treatment advice) for 38 clinical scenarios | ChatGPT achieved high scores on both the factual correctness and humanness criteria | | Lancet Digit Health |
| [66] | medical education | compare ChatGPT with medical students in a parasitology exam (288 questions) from the Doctor of Veterinary Medicine (DVM) exam | ChatGPT and students achieve similar scores | | Cell |

表2: 三级论文汇总

| 文献 | 场景 | 摘要 | 结果/结论 | 版本 | 期刊 |
| [87] | 临床工作流 | 决定影像学检查程序或评估乳腺癌/疼痛患者是否适合某项检查 | 需要专用ChatGPT | 2023年1月9日 | 预印本 |
| [86] | 临床工作流 | ChatGPT通过回答默沙东(MSD)临床小案例问题支持临床决策 | ChatGPT在36个涵盖完整临床工作流的临床小案例中总体准确率达71.7% | 2023年1月9日 | 预印本 |
| [51] | 医学教育 | 比较ChatGPT与医学生在(内部)寄生虫学考试(79题)中的表现 | ChatGPT在寄生虫学问题上不及医学生(准确率89.6%) | 2022年12月15日 | JEEHP |
| [36] | 医学教育 | ChatGPT参加美国医师执照考试(USMLE) | ChatGPT达到及格分数 | 2022年12月15日 | JMIR |
| [4] | 临床工作流 | ChatGPT为38个临床场景撰写患者信件(如传达诊断结果、提供治疗建议) | ChatGPT在事实准确性和人性化标准上均获高分 | | 《柳叶刀》数字健康 |
| [66] | 医学教育 | 比较ChatGPT与学生在兽医学博士(DVM)考试寄生虫学考题(288题)中的表现 | ChatGPT与学生成绩相当 | | 《细胞》 |
| [68] | clinical workflow | ChatGPT answers clinical decision support (CDS) alerts from Epic EHR | ChatGPT’s answers are biased and redundant; their acceptability in CDS is low | | preprint |
| [61] | medical education | ChatGPT takes the USMLE (June 2022) | ChatGPT achieved a passing score, and its explanations contain novel insights | | PLOS Digital Health |
| [32] | medical education | ChatGPT takes life-support exams (AHA BLS/ACLS Exams 2016) | ChatGPT did not reach a passing score | Jan. 9 and 30, 2023 | Resuscitation |
| [53] | consultation | ChatGPT provides cancer-related information and feedback on cancer misconceptions | ChatGPT provides highly accurate cancer information | Dec. 15, 2022 | JNCI Cancer Spectrum |
| [67] | medical research | compared 50 ChatGPT-generated abstracts with real abstracts from scientific publications | Grammarly can detect ChatGPT-generated abstracts with high accuracy | | AJOG |
| [83] | consultation | evaluate ChatGPT using 100 questions about retinal diseases | ChatGPT is highly accurate on general questions but less accurate for treatment options | | Acta Ophthalmologica |
| [28] | consultation | compare ChatGPT with humans on 85 genetics/genomics questions | ChatGPT and humans perform similarly | | preprint |
| [52] | medical education | ChatGPT answers 284 questions from various medical specialities | ChatGPT achieved overall high accuracies | | preprint |

| [68] | 临床工作流 | ChatGPT回答来自Epic电子健康记录(EHR)的临床决策支持(CDS)警报 | ChatGPT的回答存在偏见和冗余,在CDS中的可接受性较低 | | 预印本 |
| [61] | 医学教育 | ChatGPT参加美国医师执照考试(USMLE)(2022年6月) | ChatGPT达到及格分数,其解释包含新颖见解 | | 《PLOS数字健康》 |
| [32] | 医学教育 | ChatGPT参加生命支持考试(AHA BLS/ACLS 2016版) | ChatGPT未达到及格分数 | 2023年1月9日及30日 | 《复苏》 |
| [53] | 咨询 | ChatGPT提供癌症相关信息并纠正癌症误解 | ChatGPT提供高度准确的癌症信息 | 2022年12月15日 | 《JNCI癌症谱系》 |
| [67] | 医学研究 | 将50篇ChatGPT生成的摘要与科学出版物的真实摘要进行对比 | Grammarly能高准确率检测ChatGPT生成的摘要 | | 《美国妇产科杂志》(AJOG) |
| [83] | 咨询 | 使用100个视网膜疾病相关问题评估ChatGPT | 在常规问题上准确率高,但治疗方案准确性较低 | | 《眼科学报》 |
| [28] | 咨询 | 在85个遗传学/基因组学问题上对比ChatGPT与人类 | ChatGPT与人类表现相当 | | 预印本 |
| [52] | 医学教育 | ChatGPT回答来自多个医学专业的284个问题 | ChatGPT总体准确率较高 | | 预印本 |

| [112] | medical education | ChatGPT takes the Chinese National Medical Licensing Examination | ChatGPT’s performance on the exam is well below the passing level | | preprint |
| [63] | medical research | ChatGPT identifies research questions in gastroenterology (e.g., microbiome, endoscopy) | ChatGPT generates highly relevant but non-novel research questions | Dec. 15, 2022 | Scientific Reports |
| [38] | medical research | ChatGPT generates systematic review topics in plastic surgery | ChatGPT performs moderately in generating novel systematic review ideas | | Aesthetic Surgery Journal |
| [98] | consultation, medical education | evaluate ChatGPT using 100 OE questions about pathology | ChatGPT scored around 80% | Jan. 30, 2023 | Cureus |

| [112] | 医学教育 | ChatGPT参加中国国家医学资格考试 | ChatGPT在考试中的表现远低于及格水平 | | 预印本 |
| [63] | 医学研究 | ChatGPT识别胃肠病学研究问题(如微生物组、内窥镜) | ChatGPT生成高度相关但非新颖的研究问题 | 2022年12月15日 | 《科学报告》 |
| [38] | 医学研究 | ChatGPT生成整形外科系统综述主题 | ChatGPT在生成新颖系统综述想法方面表现中等 | | 《美容外科杂志》 |
| [98] | 咨询/医学教育 | 使用100个病理学开放式(OE)问题评估ChatGPT | ChatGPT得分约80% | 2023年1月30日 | 《Cureus》 |

4 Results

4 结果

The following presents the answers to the four research questions (RQ1-RQ4) based on the discussion in Section 3.

根据第3节的讨论,以下是对四个研究问题(RQ1-RQ4)的解答。

4.1 Medical Applications of ChatGPT

4.1 ChatGPT的医疗应用

According to Table 1, Table 2 and the taxonomy (Figure 3), it is straightforward to see that ChatGPT is mostly evaluated in medical education, consultation and research, as well as in various scenarios in the clinical workflow, such as diagnosis, decision-making and clinical documentation (patient letters, medical notes, discharge summaries, etc.). However, it is important to note that these ‘applications’ are carried out in a ‘laboratory environment’, by providing ChatGPT with question samples from standard medical exams (question banks), CDS alerts from the Epic EHR or clinical vignettes from Merck Sharp & Dohme (MSD), through its QA interface. None of the reviewed publications have reported an actual deployment of ChatGPT in clinical settings. Furthermore, due to the current strict policies on AI-generated content imposed by publishers, the unsolved ethical issues as well as its incapability of generating novel research topics, using ChatGPT for medical research remains experimental as well. For medical consultation, the fact that ChatGPT is already capable of providing highly accurate cancer-related information cannot be generalized to all medical specialities, since reliable sources of cancer information, such as the National Cancer Institute (NCI), are publicly accessible and could already have been part of ChatGPT’s training set. Its qualification as a medical consultant remains to be further evaluated.

根据表1、表2和分类体系(图3)可以看出,ChatGPT主要在医学教育、咨询和研究以及临床工作流程中的各种场景(如诊断、决策和临床文档记录(患者信函、医疗记录、出院摘要等))中进行评估。但需注意的是,这些"应用"是在"实验室环境"中进行的,通过其问答界面向ChatGPT提供标准医学考试(题库)的问题样本、Epic电子健康档案(EHR)的CDS警报或默沙东(MSD)的临床案例。所有被评论文献均未报告ChatGPT在临床环境中的实际部署。此外,由于目前出版商对AI生成内容的严格政策、未解决的伦理问题以及其无法生成新研究课题,使用ChatGPT进行医学研究仍处于实验阶段。对于医疗咨询,ChatGPT已能提供高度准确的癌症相关信息这一事实并不能推广到所有医学专业领域,因为美国国家癌症研究所(NCI)等可靠的癌症信息来源是公开可获取的,可能已成为ChatGPT训练集的一部分。其作为医疗顾问的资格仍有待进一步评估。

4.2 Strengths and Limitations of ChatGPT in Healthcare

4.2 ChatGPT在医疗健康领域的优势与局限性

Strengths The QA design of ChatGPT’s interface makes it easy to integrate into existing clinical workflows, providing feedback in real time. ChatGPT can not only give answers to specific questions but also provide ’justifications’ for its answers. Sometimes, ChatGPT’s ’justifications’ and answers to open-ended questions contain novel insights and perspectives, which might inspire novel research ideas. ChatGPT also shows superior performance in healthcare compared to other general large language models, such as InstructGPT and GPT-3.5.

优势
ChatGPT的问答式界面设计便于融入现有临床工作流程,提供实时反馈。它不仅能够回答具体问题,还能为答案提供"依据"。有时,ChatGPT对开放式问题的"依据"和回答会包含新颖见解,可能激发创新研究思路。与其他通用大语言模型(如InstructGPT、GPT-3.5)相比,ChatGPT在医疗领域展现出更优异的性能。

Limitations The current release of ChatGPT can only take input and give feedback in text, so ChatGPT cannot handle questions requiring the interpretation of images. ChatGPT is incapable of ’reasoning’ like an expert system, and the ’justifications’ it provides are merely the result of predicting the next words according to probability. It is possible for ChatGPT to make a correct choice but give a completely nonsensical explanation.

The accuracy of ChatGPT’s answers depends largely on the quality of its training data, and the information ChatGPT is trained on determines how it responds to a question. However, ChatGPT itself cannot distinguish between real and fake information fed into it, so its answers can be highly misleading, biased and dangerous when it comes to healthcare. For example, one of the most concerning issues with the current release of ChatGPT, as confirmed by the reviewed publications, is that it can ’fabricate’ information and convey it in a persuasive tone. Therefore, its answers should always be fact-checked by human experts before adoption.

Furthermore, ChatGPT’s answers, even when highly relevant, remain superficial most of the time and lack depth and novelty. Most importantly, ChatGPT is not fine-tuned for healthcare by design, and should not be used as such without specialization.

Last but not least, the use of ChatGPT is not without barriers. Reformulating the prompt for the same question might change ChatGPT’s answer; the proper formulation of prompts is another factor in obtaining desirable answers from ChatGPT. Moreover, ChatGPT is a proprietary product, and feeding sensitive patient information into its interface in order to obtain feedback might therefore violate privacy regulations.

局限性
当前版本的ChatGPT仅能接收文本输入并给出文本反馈,因此无法处理需要图像解读的问题。ChatGPT不具备类似专家系统的"推理"能力,其提供的"解释"仅是依据概率预测下一个词的结果。可能出现ChatGPT做出正确选择却给出完全荒谬解释的情况。

ChatGPT回答的准确性很大程度上取决于训练数据的质量,其训练信息决定了它如何回应问题。然而,ChatGPT自身无法区分输入信息的真伪,因此在医疗健康领域,其回答可能具有高度误导性、偏见性和危险性。例如,根据文献综述确认,当前版本ChatGPT最令人担忧的问题之一是其能以令人信服的语气"捏造"信息。因此,其回答在采纳前应始终由人类专家进行事实核查。

此外,ChatGPT的回答即使高度相关,多数时候仍停留在表面,缺乏深度和新颖性。最重要的是,ChatGPT在设计上并未针对医疗健康领域进行微调,未经专门优化不应在此领域使用。

最后,使用ChatGPT并非没有障碍。对同一问题重新组织提示词可能改变ChatGPT的答案。恰当构建提示词是从ChatGPT获得理想回答的另一关键因素。尤为重要的是,ChatGPT是专有产品,向其界面输入敏感患者信息以获取反馈可能违反隐私法规。

4.3 Research Gaps and Future Works

4.3 研究空白与未来工作

Prior to the deployment of any product in clinical settings, extensive evaluations of the product in a laboratory environment are required to identify its limitations and improve the product iteratively. Since ChatGPT was released no more than half a year ago, it has only been tested in a limited number of scenarios (Table 2). ChatGPT clearly is still at an experimental stage, and clinical deployment faces substantial unsolved technical and regulatory challenges. The Level 3 publications provide a sound paradigm for how ChatGPT should continue to be evaluated in different specialities, for future works to follow. However, before further pursuing this direction, researchers should be aware that, even though these evaluations provide, at best, a general picture of ChatGPT’s capability in a medical speciality, they contribute little to the improvement of the underlying language model. The limitations identified through these evaluations have also long been known in NLP research and are not specific to ChatGPT. Most importantly, whether or not ChatGPT has achieved good performance in an application scenario, it is unlikely that a ChatGPT with only general knowledge will be clinically deployed in the future. Specialized AI models in healthcare, which the NLP community has long been working on, are more promising for practical and reliable clinical applications than ChatGPT.

在将任何产品部署到临床环境之前,需要在实验室环境中对该产品进行广泛评估,以识别其局限性并迭代改进产品。由于ChatGPT发布至今不足半年,目前仅在小范围场景中进行了测试(表2)。显然,ChatGPT仍处于实验阶段,临床部署面临着大量尚未解决的技术和监管挑战。三级(Level 3)文献为不同专科领域如何持续评估ChatGPT提供了完善范式,可供后续研究参考。但需注意的是,这些评估最多只能展现ChatGPT在某个医学专科领域的整体能力,对改进底层大语言模型贡献有限。通过评估发现的局限性在自然语言处理(NLP)研究中早已存在,并非ChatGPT特有。最关键的是,无论ChatGPT在应用场景中表现如何,具备通用知识的ChatGPT未来都不太可能直接用于临床。相比之下,NLP领域长期研发的医疗专用AI模型,比ChatGPT更有可能实现可靠的实际临床应用。

4.4 Categorization of Publications based on a Taxonomy

4.4 基于分类法的出版物分类

Finally, we have shown in our review that existing publications on ChatGPT in healthcare can be compactly grouped according to applications and target user groups. Thus, we come up with an application- and user-oriented taxonomy to categorize the selected publications, as discussed in Section 3.

最后,我们在综述中表明,现有的关于ChatGPT在医疗健康领域应用的文献可以简洁地按应用场景和目标用户群体进行归类。因此,我们提出了一种以应用和用户为导向的分类法来对选定的文献进行分类,如第3节所述。

5 Discussion and Conclusion

5 讨论与结论

In this systematic review, we review published works (from Nov. 2022 to Mar. 2023) that used ChatGPT within the healthcare sector. In doing so, we extract publications from PubMed using the keyword ‘ChatGPT’ and propose a two-sided taxonomy (application-oriented and user-oriented) to categorize these publications, which we see as a building block for new publications on ChatGPT in healthcare. Even though the current taxonomy is already quite inclusive, it can be easily extended to emerging new applications or user groups. This first taxonomy is not limited to ChatGPT; rather, it can also be applied to other (existing or upcoming) NLP models, like Bard from Google. On the one hand, the taxonomy helps interested readers to identify relevant works. On the other hand, it also helps identify areas where ChatGPT has not yet been applied. The automatic processing of multimodal input, like text and images, is an exciting development for future healthcare. For example, Contrastive Language-Image Pre-Training (CLIP) [85], a neural network trained on large-scale image-text pairs, possesses both vision and language capabilities, and is therefore a promising research direction towards AI-assisted multimodal healthcare. In general, a physician also takes several sources of information into account when making diagnosis and treatment decisions, such as written reports and image acquisitions from a patient. GPT-4, an enhanced successor of the model behind ChatGPT released recently, is able to analyse and summarize images and texts, as seen in a live demo given by its developers.

在这篇系统性综述中,我们回顾了2022年11月至2023年3月期间在医疗健康领域应用ChatGPT的已发表文献。通过PubMed数据库以"ChatGPT"为关键词检索文献,我们提出了一个双向分类体系(应用导向型与用户导向型)对这些文献进行归类,该体系可作为医疗健康领域ChatGPT新研究的基础框架。虽然当前分类体系已具备较高包容性,但仍可轻松扩展至新兴应用场景或用户群体。这一首创的分类框架不仅限于ChatGPT,同样适用于其他(现有或即将出现的)自然语言处理模型,如谷歌的Bard。该分类体系一方面帮助目标读者定位相关研究,另一方面也揭示了ChatGPT尚未涉足的应用领域。未来医疗健康领域最令人期待的发展方向之一是多模态输入(如文本与图像)的自动处理。例如对比语言-图像预训练模型CLIP[85]——一个基于海量图文对训练的神经网络——同时具备视觉与语言处理能力,这为AI辅助的多模态医疗健康研究指明了前景。通常医生在制定诊疗决策时会综合考量多种信息源,包括患者书面报告和影像资料。最新发布的ChatGPT增强版模型GPT-4已具备图文分析与摘要能力,其开发者在实时演示中展示了这一特性。

The barrier-free user interface, the ability to produce human-like texts and the breadth of its knowledge on a variety of topics are the key reasons why ChatGPT amassed a phenomenally large user base shortly after its release. Besides the architectural design of the LLM, the immeasurable human effort invested in training the LLM through reinforcement learning contributes greatly to its impressive performance in human-like conversations. Even though ChatGPT technically represents the productization of an NLP model by OpenAI, rather than a fundamental technological advance or breakthrough, it is undeniable that ChatGPT is a living embodiment of state-of-the-art NLP techniques. The efforts devoted to making the product a reality still greatly push the field forward as a whole. Speaking from the perspective of a tech product, existing publications on ChatGPT’s healthcare applications boil down to ‘reviews and testing of a new NLP product in healthcare’. However, the product is not intended for medical applications by design, and it is therefore not unexpected that most ‘test reports’ evaluated ChatGPT as ‘unqualified’ or ‘of merely passing grade’ for healthcare. Moreover, the reported limitations (see Section 4) of ChatGPT are not specific to the product, but apply to language models in general, as discussed in Section 2. These limitations can mostly be addressed by improving the underlying language model through NLP innovations. Nevertheless, the fact that ChatGPT is monetized$^4$ and therefore not (fully) open-sourced makes it difficult for the community to pinpoint the issues and come up with specific solutions for future improvement. In particular, the sources of the datasets used for training the language model, which determine the type of questions and topics of conversation ChatGPT can handle, remain unclear. As suggested by van Dis et al. [27], the community should invest in truly open LLMs that perform on par with proprietary NLP products like ChatGPT, in order to fully address these limitations. Currently, for healthcare applications, specialized AI models trained on biomedical datasets, such as BioGPT [70], are always more desirable than ChatGPT.

无障碍用户界面、生成类人文本的能力以及广泛的多领域知识储备,是ChatGPT在发布后短期内获得现象级用户增长的关键原因。除了大语言模型的架构设计外,通过强化学习投入的巨量人工训练也极大提升了其在拟人对话中的表现。尽管从技术层面看,ChatGPT本质上是OpenAI对NLP模型的产品化实践,而非基础性技术突破,但不可否认它代表了当前最先进的NLP技术集成。实现该产品的努力仍显著推动了整个领域发展。从科技产品视角而言,现有关于ChatGPT医疗应用的出版物可归结为"对新型NLP产品在医疗领域的测评"。但该产品设计初衷并非医疗用途,因此多数"测试报告"判定其医疗应用"不合格"或"仅达及格线"并不意外。然而如第4节所述,ChatGPT的局限性并非该产品特有,而是如第2节讨论的通用语言模型共性问题,这些问题大多能通过NLP技术创新改进底层语言模型来解决。但ChatGPT的商用性质$^4$导致其未(完全)开源,使得学界难以准确定位问题并提出具体改进方案。尤其关键的是,决定ChatGPT问答范围的语言模型训练数据来源仍不透明。正如van Dis等人[27]建议的,学界应投入开发性能媲美ChatGPT等专有NLP产品的真正开源大语言模型,才能彻底解决这些局限。当前在医疗应用场景中,基于生物医学数据集训练的专用AI模型(如BioGPT[70])始终比ChatGPT更具适用性。

As discussed in this review (Section 3), these evaluation studies of ChatGPT’s performance in healthcare provide a general picture of the capability of the current release of ChatGPT. By and large, the training set and the underlying language model decide the quality (accuracy, unbiasedness, humanness, etc.) of an AI chatbot’s responses to certain questions. Therefore, this review concludes that healthcare researchers in particular should step back from the AI hype generated by the product and focus their attention on NLP research in general and on developing/evaluating specialized language models for healthcare applications.

正如本综述(第3节)所讨论的,这些关于ChatGPT在医疗领域表现的评价研究,为当前版本ChatGPT的能力提供了总体概览。总体而言,训练数据集和底层语言模型决定了AI聊天机器人对特定问题回答的质量(准确性、无偏见性、拟人性等)。因此,本综述认为医疗领域的研究者尤其应当远离该产品引发的AI炒作,将注意力集中在自然语言处理(NLP)研究本身,以及开发/评估适用于医疗应用的专用语言模型上。

Acknowledgments

致谢

This work was supported by the REACT-EU project KITE (Plattform für KI-Translation Essen, EFRE-0801977, https://kite.ikim.nrw/) and the Cancer Research Center Cologne Essen (CCCE).

本工作由REACT-EU项目KITE (Plattform für KI-Translation Essen, 埃森AI转化平台, EFRE-0801977, https://kite.ikim.nrw/) 和科隆埃森癌症研究中心 (CCCE) 资助。

References

参考文献
