[论文翻译]ChatGPT在医疗健康领域的应用:分类与系统综述


原文地址:https://www.medrxiv.org/content/10.1101/2023.03.30.23287899v1.full.pdf


ChatGPT in Healthcare: A Taxonomy and Systematic Review

ChatGPT在医疗健康领域的应用:分类与系统综述

Jianning Li, Amin Dada, Jens Kleesiek, Jan Egger∗

Jianning Li, Amin Dada, Jens Kleesiek, Jan Egger∗

Institute of Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße, 45131 Essen, Germany.

德国埃森大学医院人工智能医学研究所 (AöR), Girardetstraße, 45131 埃森

∗ Corresponding author: jan.egger (at) uk-essen.de (J.E.)

∗ 通讯作者: jan.egger (at) uk-essen.de (J.E.)

March 2023

2023年3月

Abstract

摘要

The recent release of ChatGPT, a chatbot research project/product of natural language processing (NLP) by OpenAI, stirs up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the ‘productization’ of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience in artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the ‘status quo’ of ChatGPT in medical applications, for general readers, healthcare professionals as well as NLP scientists. The large biomedical literature database PubMed is used to retrieve published works on this topic using the keyword ‘ChatGPT’. An inclusion criterion and a taxonomy are further proposed to filter the search results and categorize the selected publications, respectively. It is found through the review that the current release of ChatGPT has achieved only moderate or ‘passing’ performance in a variety of tests, and is unreliable for actual clinical deployment, since it is not intended for clinical applications by design. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction to pursue for critical clinical applications.

ChatGPT是OpenAI近期发布的自然语言处理(NLP)聊天机器人研究项目/产品,在公众和医疗专业人士中引发轰动,短时间内积累了庞大的用户基础。这是尖端技术"产品化"的典型案例,让非技术背景的普通大众能亲身体验人工智能(AI),类似于AlphaGo(英国DeepMind Technologies)和自动驾驶汽车(谷歌、特斯拉等)引发的AI热潮。然而,医疗研究人员尤其需要在这种热潮中保持谨慎。本文系统综述了ChatGPT在医疗领域应用的现有文献,为普通读者、医疗从业者和NLP科学家阐明ChatGPT在医学应用中的"现状"。研究使用生物医学文献数据库PubMed,以"ChatGPT"为关键词检索相关文献,进一步提出纳入标准和分类法来筛选结果并归类文献。综述发现,当前版本的ChatGPT在各种测试中仅达到中等或"及格"水平,由于设计初衷并非临床用途,其实际临床部署并不可靠。我们得出结论:针对关键临床应用,基于(生物)医学数据集训练的专业NLP模型仍是正确发展方向。

Keywords: ChatGPT; Healthcare; NLP; Transformer; LLM; OpenAI; Taxonomy; Bard; BERT; LLaMA.

关键词: ChatGPT; 医疗保健; NLP; Transformer; 大语言模型(LLM); OpenAI; 分类法; Bard; BERT; LLaMA.

1 Introduction

1 引言

In November 2022, a chatbot called ChatGPT was released. According to itself, it is ‘a conversational AI language model developed by OpenAI. It uses deep learning techniques to generate human-like responses to natural language inputs. The model has been trained on a large dataset of text and has the ability to understand and generate text for a wide range of topics. ChatGPT can be used for various applications such as customer service, content creation, and language translation’. Since its release, ChatGPT has taken the world by storm, and its user base has grown even faster than that of the previous record holder TikTok, reaching 100 million users in just two months after its launch. ChatGPT is already used to generate textual content, presentations and even source code for all kinds of topics. But what does that mean specifically for the healthcare sector? What if the general public or medical professionals turn to ChatGPT for treatment decisions? To answer these questions, we will look at published works that have already reported the usage of ChatGPT in the medical field. In doing so, we will explore and discuss ethical concerns when using ChatGPT, specifically within the healthcare sector (e.g., in clinical routines). We also identify specific action items that we believe have to be undertaken by creators and providers of chatbots to avoid catastrophic consequences that go far beyond letting a chatbot do someone’s homework. This review shows that William B. Schwartz’s 1970 description of conversational agents that will serve as consultants by enhancing the intellectual functions of physicians through interactions [94] is as up-to-date as ever.

2022年11月,一款名为ChatGPT的聊天机器人发布。根据其自我描述,它是"由OpenAI开发的对话式AI语言模型,采用深度学习技术生成类人化的自然语言响应。该模型基于海量文本数据集训练,能够理解并生成涵盖广泛主题的文本,可应用于客户服务、内容创作和语言翻译等多种场景"。自发布以来,ChatGPT迅速引发全球热潮,其用户增长速度甚至超越了当前纪录保持者TikTok,上线仅两个月就突破1亿用户。该工具已被用于生成各类主题的文本内容、演示文稿乃至源代码。但这对医疗健康领域究竟意味着什么?当公众或医疗专业人员转向ChatGPT寻求诊疗决策时会发生什么?为解答这些问题,我们将审视已发表的关于ChatGPT在医学领域应用的研究报告。通过这一过程,我们将探讨并讨论使用ChatGPT(特别是在医疗健康领域,如临床常规工作中)涉及的伦理问题。同时,我们提出了一系列具体行动建议,认为聊天机器人的开发者和提供商必须采取这些措施,以避免产生远超"让聊天机器人代写作业"的灾难性后果。本综述使得William B. Schwartz在1970年关于"对话代理将通过增强医生智力功能来担任顾问角色"的论述[94] 显得前所未有的前瞻。

Even though the application of natural language processing (NLP) in healthcare is not new [34, 101, 111, 77], the recent release of ChatGPT, a direct product of NLP, still generated hype in artificial intelligence (AI), sparked a heated discussion about ChatGPT’s potential capabilities and pitfalls in healthcare, and attracted the attention of researchers from different medical specialities. The sensation can largely be attributed to ChatGPT’s barrier-free (browser-based) and user-friendly interface, allowing medical professionals and the general public without a technical background to easily communicate with the Transformer- and reinforcement learning-based language model. Currently, the interface is designed for question answering (QA), i.e., ChatGPT responds in text to the questions/prompts from users. All established or potential applications of ChatGPT in different medical specialities and/or clinical scenarios hinge on the QA feature, distinguished only by how the prompts are formulated (format-wise: open-ended, multiple choice, etc.; content-wise: radiology, parasitology, toxicology, diagnosis, medical education and consultation, etc.). Numerous publications featuring these applications have also been generated and indexed in PubMed since the release. This systematic review dives into these publications, aiming to elucidate the current state of employment, as well as the limitations and pitfalls of ChatGPT in healthcare, amidst the ChatGPT AI hype.

尽管自然语言处理(NLP)在医疗健康领域的应用并非新鲜事[34, 101, 111, 77],但作为NLP直接产物的ChatGPT近期发布,仍在人工智能(AI)领域引发热潮,并激起关于ChatGPT在医疗健康领域潜在能力与缺陷的热烈讨论,吸引了来自不同医学专业研究人员的关注。这一现象很大程度上归功于ChatGPT的无障碍(基于浏览器)和用户友好界面,使得没有技术背景的医疗从业者和普通大众也能轻松与基于Transformer和强化学习的语言模型交流。目前该界面专为问答(QA)设计,即ChatGPT以文本形式回应用户的问题/提示。ChatGPT在不同医学专业和/或临床场景中所有已确立或潜在的应用都依赖于这一QA特性,区别仅在于提示的构建方式(形式上:开放式、多选题等;内容上:放射学、寄生虫学、毒理学、诊断、医学教育与咨询等)。自发布以来,PubMed已收录大量展示这些应用的出版物。本系统综述深入分析这些文献,旨在阐明ChatGPT在医疗健康领域的应用现状、局限性与缺陷,以回应当前的ChatGPT AI热潮。

Table 1: Summary of Level 1 and Level 2 papers.

| Ref. | Scenario | Category | Main Content | Tag |
| [93] | clinical workflow | editorial | discussion of the potential use, limitations and risks of ChatGPT in nursing practice | Level 1 |
| [92] | medical research | perspective | comments about ChatGPT in scientific writing; use ChatGPT to summarize and compare across papers | Level 1 |
| [81] | medical research | editorial | generic comments on using ChatGPT in orthopaedic research | Level 1 |
| [79] | medical research | letter to the editor | comments on using ChatGPT in scientific publications and generating research ideas | Level 1 |
| [59] | miscellaneous | letter to the editor | comments on the potential use and pitfalls of ChatGPT in healthcare | Level 1 |
| [106] | miscellaneous | editorial | discuss with ChatGPT about synthetic biology (e.g., applications, ethical regulations, history, research trends, etc.) | Level 1 |
| [25] | medical research | editorial | comments on the pros and cons of using ChatGPT in medical research | Level 1 |
| [65] | miscellaneous | original article | comments on the potential usage of ChatGPT in radiology (generate radiological reports, education, diagnostic decision-making, communicate with patients, compose radiological research article) | Level 1 |
| [8] | medical education & research | letter to the editor | comments on the pros and cons of ChatGPT in medical education and research | Level 1 |
| [35] | miscellaneous | primer | short comment on ChatGPT for urologists | Level 1 |
| [49] | consultation | correspondence | ChatGPT for antimicrobial consultation | Level 1 |
| [48] | medical research | article (preprint) | comments on ChatGPT in peer review | Level 1 |
| [72] | miscellaneous | editorial | comments on ChatGPT in translational medicine | Level 1 |

表 1: Level 1 和 Level 2 论文总结。

| 文献 | 场景 | 类别 | 主要内容 | 标签 |
| [93] | 临床工作流 | 社论 | 讨论 ChatGPT 在护理实践中的潜在用途、局限性和风险 | Level 1 |
| [92] | 医学研究 | 观点 | 关于 ChatGPT 在科学写作中的评论;使用 ChatGPT 总结和比较论文 | Level 1 |
| [81] | 医学研究 | 社论 | 对在骨科研究中使用 ChatGPT 的通用评论 | Level 1 |
| [79] | 医学研究 | 致编辑的信 | 关于在科学出版物中使用 ChatGPT 及生成研究想法的评论 | Level 1 |
| [59] | 其他 | 致编辑的信 | 关于 ChatGPT 在医疗保健中的潜在用途和陷阱的评论 | Level 1 |
| [106] | 其他 | 社论 | 与 ChatGPT 讨论合成生物学(如应用、伦理规范、历史、研究趋势等) | Level 1 |
| [25] | 医学研究 | 社论 | 关于在医学研究中使用 ChatGPT 的利弊评论 | Level 1 |
| [65] | 其他 | 原创文章 | 关于 ChatGPT 在放射学中的潜在用途的评论(生成放射报告、教育、诊断决策、与患者沟通、撰写放射学研究文章) | Level 1 |
| [8] | 医学教育与研究 | 致编辑的信 | 关于 ChatGPT 在医学教育和研究中的利弊评论 | Level 1 |
| [35] | 其他 | 入门 | 对泌尿科医师使用 ChatGPT 的简短评论 | Level 1 |
| [49] | 咨询 | 通信 | 使用 ChatGPT 进行抗菌咨询 | Level 1 |
| [48] | 医学研究 | 文章(预印本) | 关于 ChatGPT 在同行评审中的评论 | Level 1 |
| [72] | 其他 | 社论 | 关于 ChatGPT 在转化医学中的评论 | Level 1 |
| [14] | consultation | letter to the editor | comments on the pros and cons of ChatGPT in public/community health (e.g., answer generic public health questions) | Level 1 |
| [73] | miscellaneous | article | comments on the ethics of using ChatGPT in Health Professions Education | Level 1 |
| [60] | medical research | letter to the editor | brief comments on using ChatGPT in medical writing | Level 1 |
| [80] | medical education | editorial | comment on ChatGPT in nursing education | Level 1 |
| [11] | miscellaneous | commentary | comment on ChatGPT in translational medicine | Level 1 |
| [113] | miscellaneous | editorial | comment on ChatGPT in healthcare | Level 1 |
| [58] | medical research | editorial | comment on ChatGPT in medical writing | Level 1 |
| [7] | medical research | editorial | comment on using ChatGPT for scientific writing in sports & exercise medicine | Level 1 |
| [12] | medical research | perspective | comment on medical writing | Level 1 |
| [91] | miscellaneous | review | systematic review on ChatGPT in healthcare | Level 1 |
| [5] | medical research | editorial | comment on the hallucination issue of ChatGPT in medical writing | Level 1 |
| [71] | medical research | editorial | ChatGPT drafts an article on vaccine effectiveness | Level 2 |
| [108] | medical research | review | review on ChatGPT in medical research, including use examples | Level 2 |
| [6] | medical research | original article | use ChatGPT to compile a review article on Digital Twin in healthcare | Level 2 |
| [82] | clinical workflow | comment | use ChatGPT to generate a discharge summary for a patient who had hip replacement surgeries (including follow-up care suggestions) | Level 2 |
| [89] | clinical workflow | letter to the editor | ChatGPT gives diagnosis, prognosis and explanation for a clinical toxicology case of acute organophosphate poisoning | Level 2 |

| [14] | 咨询 | 读者来信 | 评论ChatGPT在公共/社区健康中的利弊(例如回答一般公众健康问题) | 1级 |
| [73] | 其他 | 文章 | 评论在健康职业教育中使用ChatGPT的伦理问题 | 1级 |
| [60] | 医学研究 | 读者来信 | 简要评论在医学写作中使用ChatGPT | 1级 |
| [80] | 医学教育 | 社论 | 评论ChatGPT在护理教育中的应用 | 1级 |
| [11] | 其他 | 评论 | 评论ChatGPT在转化医学中的作用 | 1级 |
| [113] | 其他 | 社论 | 评论ChatGPT在医疗保健中的应用 | 1级 |
| [58] | 医学研究 | 社论 | 评论ChatGPT在医学写作中的应用 | 1级 |
| [7] | 医学研究 | 社论 | 评论在运动与运动医学科学写作中使用ChatGPT | 1级 |
| [12] | 医学研究 | 观点 | 评论医学写作 | 1级 |
| [91] | 其他 | 综述 | 关于ChatGPT在医疗保健中的系统综述 | 1级 |
| [5] | 医学研究 | 社论 | 评论ChatGPT在医学写作中的幻觉问题 | 1级 |
| [71] | 医学研究 | 社论 | ChatGPT起草一篇关于疫苗有效性的文章 | 2级 |
| [108] | 医学研究 | 综述 | 关于ChatGPT在医学研究中的应用综述(含使用示例) | 2级 |
| [6] | 医学研究 | 原创文章 | 使用ChatGPT编写关于医疗保健中数字孪生的综述文章 | 2级 |
| [82] | 临床工作流程 | 评论 | 使用ChatGPT为接受髋关节置换手术的患者生成出院摘要(含随访护理建议) | 2级 |
| [89] | 临床工作流程 | 读者来信 | ChatGPT为急性有机磷中毒临床毒理学病例提供诊断、预后和解释 | 2级 |

| [19] | medical research | editorial | ChatGPT answers questions about computational systems biology in stem cell research but its answers lack depth | Level 2 |
| [40] | medical research | letter to the editor | use ChatGPT to search literature on a given topic, but the majority of returned publications are fabricated | Level 2 |
| [78] | medical (anatomy) education | letter to the editor | ChatGPT answers anatomy-related questions; results show ChatGPT is currently incapable of giving accurate anatomy information | Level 2 |
| [1] | consultation | letter to the editor | ChatGPT answers questions on cardiopulmonary resuscitation | Level 2 |
| [75] | miscellaneous | Discussions with Leaders (Invitation Only) | comment and use examples of ChatGPT in nuclear medicine | Level 2 |
| [3] | medical education | editorial | ChatGPT answers multiple-choice questions on nuclear medicine; results suggest ChatGPT does not possess the knowledge of a nuclear medicine physician | Level 2 |
| [20] | medical research | brief report | comments on using ChatGPT in healthcare (e.g., compose medical notes) and medical research (e.g., generate abstracts, research topics) | Level 2 |
| [47] | consultation | commentary | ChatGPT answers cancer-related questions | Level 2 |
| [15] | consultation | commentary | ChatGPT answers epilepsy-related questions | Level 2 |
| [100] | consultation | article | comments on ChatGPT in diabetes self-management and education (DSME) | Level 2 |
| [31] | medical research | editorial | ChatGPT generates a curriculum about AI for medical students and a list of recommended readings | Level 2 |

| [19] | 医学研究 | 社论 | ChatGPT回答干细胞研究中关于计算系统生物学的问题,但其答案缺乏深度 | 2级 |
| [40] | 医学研究 | 致编辑的信 | 使用ChatGPT搜索特定主题的文献,但大多数返回的出版物是捏造的 | 2级 |
| [78] | 医学(解剖学)教育 | 致编辑的信 | ChatGPT回答解剖学相关问题;结果显示ChatGPT目前无法提供准确信息 | 2级 |
| [1] | 咨询 | 致编辑的信 | ChatGPT回答关于心肺复苏的问题 | 2级 |
| [75] | 其他 | 领导者对话(特邀) | 评论并举例说明ChatGPT在核医学中的应用 | 2级 |
| [3] | 医学教育 | 社论 | ChatGPT回答核医学选择题;结果表明ChatGPT不具备核医学医师的知识 | 2级 |
| [20] | 医学研究 | 简报 | 评论在医疗保健(如撰写病历)和医学研究(如生成摘要、研究主题)中使用ChatGPT | 2级 |
| [47] | 咨询 | 评论 | ChatGPT回答癌症相关问题 | 2级 |
| [15] | 咨询 | 评论 | ChatGPT回答癫痫相关问题 | 2级 |
| [100] | 咨询 | 文章 | 评论ChatGPT在糖尿病自我管理与教育(DSME)中的应用 | 2级 |
| [31] | 医学研究 | 社论 | ChatGPT为医学生生成关于AI的课程大纲和推荐阅读清单 | 2级 |

Based on the findings derived from existing publications on ChatGPT in healthcare, this systematic review addresses the following research questions:

基于现有关于ChatGPT在医疗健康领域的研究成果,本系统性综述旨在解答以下研究问题:

The rest of the manuscript is organized as follows: Section 2 briefly introduces NLP, Transformers and large language models (LLMs), on which ChatGPT is built. Section 3 introduces the inclusion criteria and taxonomy used in the systematic review, and discusses in detail the selected publications. Section 4 presents the answers to the above research questions (RQ1 - RQ4), and Section 5 summarizes and concludes the review.

本文其余部分的结构安排如下:第2节简要介绍自然语言处理(NLP)、Transformer架构以及ChatGPT所基于的大语言模型(LLM)。第3节阐述系统综述的纳入标准与分类体系,并详细讨论入选文献。第4节针对前述研究问题(RQ1-RQ4)给出解答,第5节对综述进行总结与结论。

2 Background

2 背景

2.1 Natural Language Processing (NLP)

2.1 自然语言处理 (NLP)

Natural Language Processing (NLP) [22] is an interdisciplinary research field that aims to develop algorithms for the computational understanding of written and spoken languages. Some of the most prominent applications include text classification, question answering, speech recognition, language translation, chatbots, and the generation or summarization of texts. Over the past decade, the progress of NLP has been accelerated by deep learning techniques, in conjunction with increasing hardware capabilities and the availability of massive text corpora. Given the fast growth of digital data and the growing need for automated language processing, NLP has become an indispensable technology in various industries, such as healthcare, finance, education, and marketing.

自然语言处理(NLP) [22] 是一个跨学科研究领域,旨在开发能够计算理解书面和口头语言的算法。其最突出的应用包括文本分类、问答系统、语音识别、语言翻译、聊天机器人以及文本生成或摘要。过去十年间,随着硬件性能提升和海量文本语料库的可用性,深度学习技术加速了NLP的发展。鉴于数字数据的快速增长以及对自动化语言处理日益增长的需求,NLP已成为医疗、金融、教育和营销等行业不可或缺的技术。

2.2 Transformer

2.2 Transformer

In 2017, Vaswani et al. [109] introduced the Transformer model architecture, replacing the previously widespread recurrent neural networks (RNNs) [76], long short-term memory networks (LSTMs) [45] and Word2Vec [23]. Transformers are feedforward networks combined with specialized attention blocks that enable the model to selectively attend to distinct segments of its input. Attention blocks overcome two important limitations of RNNs. First, they enable Transformers to process input in parallel, whereas in RNNs each computation step depends on the previous one. Second, they allow Transformers to learn long-term dependencies. Since their introduction, Transformers have consecutively achieved state-of-the-art results on various NLP benchmarks. Further developments include novel training tasks [24, 54, 114], adaptations of the network architecture [42, 64], and reductions of computational complexity [57, 64, 41]. However, the limited training data and the model complexity remained primary factors of model performance. Transformers have also been used for tasks beyond NLP, such as image and video processing [95], and they are an active area of research in the deep learning community.

2017年,Vaswani等人[109]提出了Transformer模型架构,取代了之前广泛使用的循环神经网络(RNN)[76]、长短期记忆网络(LSTM)[45]和Word2Vec[23]。Transformer是由前馈网络与特殊注意力模块组成的架构,使模型能够选择性地关注输入的不同片段。注意力模块克服了RNN的两个重要局限:首先,它使Transformer能够并行处理输入,而RNN的每个计算步骤都依赖于前一步骤;其次,它让Transformer能够学习长期依赖关系。自问世以来,Transformer在各种自然语言处理基准测试中连续取得最先进成果。后续发展包括新型训练任务[24,54,114]、网络架构调整[42,64]以及计算复杂度降低[57,64,41]。然而有限的训练数据和模型复杂度仍是影响性能的主要因素之一。Transformer也被应用于自然语言处理之外的任务,如图像和视频处理[95],目前仍是深度学习领域的研究热点。
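The scaled dot-product attention at the core of these blocks can be sketched in a few lines of plain Python. This is an illustrative toy version of the mechanism from [109] (a single head, no masking and no learned projections), not a faithful reimplementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: lists of vectors (lists of floats); keys have depth d_k.
    For each query, returns a softmax-weighted mix of the value vectors,
    so the output 'attends' more to values whose keys match the query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of the query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))]
        outputs.append(out)
    return outputs
```

Because each query row is computed independently of the others, the loop over queries can run in parallel, which is exactly the advantage over the step-by-step recurrence of RNNs described above.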

2.3 Large Language Models (LLMs)

2.3 大语言模型 (LLMs)

Large language models (LLMs) [17] refer to massive Transformer models trained on extensive datasets. Substantial research has been conducted on scaling the size of Transformer models. The popular BERT model [26], which in 2019 achieved record-breaking performance on seven tasks of the GLUE benchmark [110], possesses 110 million parameters. In contrast, GPT-3 [18] had already reached 175 billion parameters by 2021. At the same time, the size of the training datasets has continued to grow. BERT, for example, was trained on a dataset comprising 3.3 billion words, while the recently published LLaMA [107] was trained on 1.4 trillion tokens. Despite their success, LLMs face several challenges, including the need for massive computational resources and the risk of adopting bias and misinformation from the training data. Additionally, overconfidence when expressing wrong statements and a general lack of uncertainty estimates remain significant concerns in NLP applications. As LLMs continue to improve and become more widespread, addressing these challenges and ensuring that they are used ethically and responsibly is essential. ChatGPT is another representative LLM, released by OpenAI; in response, other tech giants have released their own LLMs, such as the previously mentioned LLaMA from Meta. Figure 1 illustrates the evolution of LLMs.

大语言模型 (LLMs) [17] 指的是基于海量数据训练的巨型Transformer模型。关于扩展Transformer模型规模的研究已取得显著进展:2019年在Glue Benchmark [110] 七项任务中创下纪录的BERT模型 [26] 具有1.1亿参数,而GPT-3 [18] 到2021年已达到1750亿参数。与此同时,训练数据集规模持续扩大——BERT的训练数据包含33亿单词,近期发布的LLaMA [107] 则使用了1.4万亿token进行训练。尽管大语言模型取得成功,仍面临多重挑战:需要消耗巨大计算资源、可能继承训练数据中的偏见与错误信息。此外,在自然语言处理应用中,模型对错误陈述的过度自信及普遍缺乏不确定性仍是重要隐患。随着大语言模型持续改进和普及,解决这些挑战并确保其符合伦理规范至关重要。ChatGPT是OpenAI发布的代表性大语言模型,其他科技巨头也相继推出相关产品作为回应,例如Meta此前发布的LLaMA。图1展示了大语言模型的演进历程。
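The parameter counts quoted above follow roughly from the model shapes. As a back-of-the-envelope sketch (ignoring embeddings, biases and layer norms, and assuming the standard 4x feed-forward expansion), a decoder-style Transformer has about 12·L·d² weights:

```python
def approx_transformer_params(n_layers, d_model):
    """Rough rule of thumb for a Transformer stack: each block holds
    ~4*d^2 attention weights (Q, K, V and output projections) plus
    ~8*d^2 feed-forward weights (4x hidden expansion), i.e. ~12*L*d^2
    overall, ignoring embeddings, biases and layer norms."""
    return 12 * n_layers * d_model ** 2

# GPT-3's published shape (96 layers, d_model = 12288) gives ~174B,
# close to the reported 175B parameters. BERT-base (12 layers, d = 768)
# gives ~85M; its published 110M additionally includes the token
# embedding table (~30k vocabulary x 768 dimensions).
gpt3_estimate = approx_transformer_params(96, 12288)
```

The estimate is deliberately crude, but it shows why parameter counts scale quadratically with the model width and only linearly with depth.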


Figure 1: Evolution of large language models (LLMs) (adapted from [96])

图 1: 大语言模型 (LLM) 的演进 (改编自 [96])

3 Methodology

3 方法论

The search strategy used in this systematic review is illustrated in Figure 2, following the PRISMA guidelines. We use PubMed as the only source to search for candidate publications. Since the majority of the papers are very short (without abstracts), eligibility is determined at the first screening based on the inclusion criteria below.

本系统综述采用的检索策略如图 2 所示,遵循 PRISMA 指南。我们仅使用 PubMed 作为候选文献的检索来源。由于大多数论文篇幅较短(无摘要),初步筛选时根据以下纳入标准判定文献是否符合条件。

3.1 Inclusion Criteria

3.1 纳入标准

The review is expressly dedicated to the ChatGPT released in November 2022 by OpenAI, excluding its predecessors (GPT-3.5, GPT-4), other large language models (LLMs) such as InstructGPT, and general NLP medical applications [69]. By March 20, 2023, a total of 140 publications were retrieved from PubMed (https://pubmed.ncbi.nlm.nih.gov/) using the keyword ChatGPT. Among them, articles written in languages other than English (e.g., French [84]), without full-text access (e.g., [62]), or whose main content has little to do with (or is not specific to) either ChatGPT (e.g., [46, 104, 33, 37]) or healthcare (e.g., [97, 103, 27, 6, 39, 13, 88, 21, 66, 115, 102, 43]) are excluded. Other representative exclusions include [44, 55], which deal with GPT-3, and [56, 30, 90, 2], where the authors claimed that ChatGPT assisted with the writing of the papers or case reports but did not provide any discussion of the appropriateness of the generated texts or of how the texts were incorporated into the main content. Generic comments that are not specific to healthcare, such as [105, 115, 16, 50], where the authors comment on the authorship of ChatGPT and using ChatGPT in scientific writing, are also excluded. Several duplicate articles were found in the PubMed search results. Table 1 and Table 2 show the full list of selected publications based on the inclusion (exclusion) criteria.

本综述专门针对OpenAI于2022年11月发布的ChatGPT,不包括其前代版本(GPT-3.5、GPT-4)、其他大语言模型(如InstructGPT)以及通用NLP医学应用[69]。截至2023年3月20日,在PubMed (https://pubmed.ncbi.nlm.nih.gov/) 使用关键词ChatGPT共检索到140篇文献。其中,非英语撰写的文章(如法语[84])、无法获取全文的文献(如[62])、主要内容与ChatGPT(如[46,104,33,37])或医疗健康领域(如[97,103,27,6,39,13,88,21,66,115,102,43])关联性较低的文献均被排除。其他典型排除案例包括涉及GPT-3的[44,55],以及作者声明使用ChatGPT辅助论文或病例报告撰写但未讨论生成文本的适用性及文本整合方式的[56,30,90,2]。与医疗健康领域无关的通用评论(如[105,115,16,50]中关于ChatGPT作者身份及科研写作应用的讨论)同样被排除。PubMed检索结果中发现若干重复文献。表1和表2列示了基于纳入(排除)标准的最终文献清单。
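The first-screening logic described above can be mimicked as a small filter over record dictionaries. The field names (`about_chatgpt`, `full_text`, etc.) are illustrative assumptions for this sketch, not part of any PubMed API:

```python
def passes_inclusion(record):
    """Apply the review's stated inclusion criteria to one record:
    English language, full text accessible, content specific to
    ChatGPT (not GPT-3, LLaMA, etc.) and to healthcare."""
    return (
        record.get("language") == "English"
        and record.get("full_text", False)
        and record.get("about_chatgpt", False)
        and record.get("about_healthcare", False)
    )

def screen(records):
    """Deduplicate by PubMed ID, then keep records that pass the
    inclusion criteria (mirroring the first-screening step)."""
    seen, selected = set(), []
    for r in records:
        pmid = r.get("pmid")
        if pmid in seen:
            continue  # duplicate search result
        seen.add(pmid)
        if passes_inclusion(r):
            selected.append(r)
    return selected
```

In the actual review the per-record judgments (e.g., whether a paper is "specific to ChatGPT") were of course made by human reviewers; the sketch only makes the decision rules explicit.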

3.2 Taxonomy

3.2 分类体系

We propose a taxonomy, as shown in Figure 3, to categorize the selected publications included in the review. The taxonomy is based on applications, including ‘triage’, ‘translation’, ‘medical research’, ‘clinical workflow’, ‘medical education’, ‘consultation’ and ‘multimodal’, each targeting one or multiple end-user groups, such as patients, healthcare professionals, researchers, medical students and teachers, etc. An application-based taxonomy allows a more compact and inclusive grouping of papers than categorizing papers by specific medical specialities. For example, scientific progress and findings generated through clinical practice are documented in the form of publications and/or reports, and literature reviews and novel ideas are usually required for medical researchers of all disciplines to publish their works. Thus, papers on ‘scientific writing’, ‘literature reviews’, ‘research idea generation’, etc., can be grouped into the ‘medical research’ category. Similarly, the ‘consultation’ category comprises papers where ChatGPT is used in medical consulting settings for both corporations (e.g., insurance companies, medical consulting agencies, etc.) and individuals (e.g., patients) seeking medical information and advice. The ‘clinical workflow’ category includes ChatGPT’s applications in a variety of clinical scenarios, such as diagnostic decision-making, treatment and imaging procedure recommendation, and the writing of discharge summaries, patient letters and medical notes. Furthermore, clinical departments, regardless of medical speciality, may benefit from a translation system for patients/visitors who are non-native speakers (‘translation’). A triage system [10] guiding patients to the right departments would reduce the burden on clinical facilities and centers in general. Note that the different categories are not necessarily completely independent, since all applications rely on the QA-based interface of ChatGPT.
By formulating the same questions differently according to different scenarios, ChatGPT’s role can change. For instance, by reformulating multiple-choice questions about a medical speciality in medical exams into open-ended questions, ChatGPT’s role changes from a medical student (‘medical education’) to a medical consultant (‘consultation’) or a clinician providing diagnoses or giving prescriptions (‘clinical workflow’). To avoid such ambiguity, the categorization of a paper is solely based on the scenario explicitly reported in the paper. The connections between the applications and end-users in Figure 3 are also not unique. In this review, only the most obvious connections are established, such as ‘medical education’ - ‘students/teachers/exam agencies’ and ‘medical research’ - ‘researchers’. The remainder of the review will show that all existing publications on ChatGPT in healthcare can find a proper categorization in the proposed taxonomy.

我们提出如图3所示的分类法,对综述涵盖的文献进行归类。该分类体系基于应用场景划分,包括"分诊"、"翻译"、"医学研究"、"临床工作流"、"医学教育"、"咨询"、"多模态"七大类,每类面向患者、医护人员、研究者、医学生与教师等一个或多个终端用户群体。相比按专科分类,基于应用的分类法能实现更紧凑且包容的论文归类。例如:通过临床实践产生的科学进展通常以论文/报告形式记载,而文献综述与创新思路是所有学科医学研究者发表成果的普遍需求,因此"科学写作"、"文献综述"、"研究创意生成"等主题可归入"医学研究"类。同理,"咨询"类涵盖ChatGPT应用于企业(如保险公司、医疗咨询机构)和个人(如患者)医疗咨询场景的论文。"临床工作流"类则包含ChatGPT在诊断决策、治疗方案与影像检查推荐、出院小结/患者信函/病历书写等多样化临床场景的应用。此外,各临床科室都可能需要为非母语患者/访客提供翻译服务("翻译"),而分诊系统[10]能通过引导患者正确就诊减轻医疗机构整体负担。需注意不同类别并非完全独立,因为所有应用都依赖ChatGPT基于问答的交互界面——通过根据不同场景调整问题表述,ChatGPT的角色可能转变。例如将医学考试中某专科选择题改写为开放式问题时,其角色就从医学生("医学教育")转变为医疗顾问("咨询")或提供诊断处方的临床医生("临床工作流")。为避免歧义,本文献分类仅依据论文明确记载的应用场景。图3中应用与终端用户的关联也非唯一,本综述仅建立最显著的关联(如"医学教育"-"学生/教师/考试机构"、"医学研究"-"研究者")。后续内容将表明,现有医疗领域ChatGPT研究都能在该分类体系中找到准确定位。


Figure 2: Search strategy used in this systematic review.

图 2: 本系统综述采用的检索策略。

medRxiv preprint doi: https://doi.org/10.1101/2023.03.30.23287899; this version posted March 30, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license .

medRxiv预印本 doi: https://doi.org/10.1101/2023.03.30.23287899; 本版本发布于2023年3月30日。该预印本的版权持有者 (未经同行评审认证) 是作者/资助者,其已授予medRxiv永久展示该预印本的许可。本作品采用CC-BY-NC 4.0国际许可协议提供。


Figure 3: Application- and user-oriented Taxonomy used in the ChatGPT review. The references shown in the application boxes are the Level 3 publications.

图 3: ChatGPT综述中采用的面向应用和用户的分类体系。应用框内标注的参考文献为第三层级文献。

Besides the taxonomy, we further assign a tag (Level 1 - Level 3 ) to the selected papers to indicate the depth and particularity of the papers on the ‘ChatGPT in Healthcare’ topic:

除了分类法之外,我们进一步为选定的论文分配标签(Level 1 - Level 3),以表明这些论文在"医疗健康领域的ChatGPT"主题上的深度和特殊性:

Shortly prior to our review, a systematic review of ChatGPT in healthcare was published by Sallam, M. [91]. An inclusive taxonomy and a proper differentiation among the selected publications (tag: Level 1, Level 2, Level 3) are, however, lacking. We believe that the tag helps readers quickly filter and locate papers of interest. This review puts more emphasis on Level 3 papers, since they provide a clearer picture of the real capability of ChatGPT in different healthcare applications.

在我们开展本次综述前不久,Sallam, M. [91] 发表了一篇关于 ChatGPT 在医疗领域应用的系统性综述。然而,该综述缺乏一个包容性分类体系,且未对入选文献进行明确分级 (标签: Level 1, Level 2, Level 3)。我们认为分级标签能帮助读者快速筛选和定位目标论文。本次综述更侧重 Level 3 级别论文,因为它们能更清晰地展现 ChatGPT 在不同医疗应用场景中的实际能力。

3.3 General Profile of Level 1 and Level 2 Papers

3.3 一级和二级论文的总体概况

A list of Level 1 and Level 2 papers is summarized in Table 1. It is not unexpected that the majority of shortlisted papers fall into the Level 1 and Level 2 categories. As seen from Table 1, most Level 1 and Level 2 papers are short editorial comments or letters to the editor from multidisciplinary journals like Nature (https://www.nature.com/) and Science (https://www.science.org/), or speciality journals in nuclear medicine [3, 59], plastic surgery [79, 38], synthetic biology [106] and orthopaedics [81]. These publications usually deliver high-level comments about the potential impact and pitfalls of ChatGPT in healthcare [113], with a focus on medical publishing. Scientific journals are among the immediate stakeholders of the publishing industry, on which ChatGPT will exert a significant impact. Thus, publishers have introduced new regulations regarding the use of ChatGPT in scientific publications, in particular whether ChatGPT is eligible as an author and whether ChatGPT-generated texts are allowed. Answers from leading publishers like Science are in the negative [105, 16]. Nature also bans ChatGPT authorship but takes a slightly more tolerant stance regarding ChatGPT-generated content, subject to a clear statement of whether, how and to what extent ChatGPT contributed to the submitted manuscript [103, 27]. The main argument for the decision is that ChatGPT cannot properly source the literature from which its answers are derived, causing unintentional plagiarism, nor can it take accountability as human authors do [105, 27]. The decision is echoed by the academic community [58, 97, 115, 66], agreeing that

表1总结了1级和$\mathcal{L}$级论文的清单。不出所料,大多数入围论文属于1级和2级类别。从表1可见,多数1级和2级论文是来自《自然》(https://www.nature.com/)、《科学》(https://www.science.org/)等多学科期刊的简短社论评论或致编辑信,或是核医学[3,59]、整形外科[79,38]、合成生物学[106]和骨科[81]等专业期刊的投稿。这些出版物通常就ChatGPT在医疗健康领域的潜在影响和陷阱发表高层级评论[113],重点关注医学出版领域。科学期刊是出版业中受ChatGPT显著影响的直接利益相关方。因此,出版商针对ChatGPT在科学出版物中的使用出台了新规,特别是关于ChatGPT是否具备作者资格及其生成文本是否被允许的问题。《科学》等主流出版商的答案是否定的[105,16]。《自然》也禁止将ChatGPT列为作者,但对ChatGPT生成内容持相对宽容态度,前提是需明确声明ChatGPT对投稿文稿的贡献程度、方式及范围[103,27]。该决定的主要依据是:ChatGPT无法妥善标注其答案的文献来源,可能导致无意识抄袭,也无法像人类作者那样承担责任[105,27]。这一立场获得了学术界[58,97,115,66]的普遍认同,认为...

ChatGPT-generated content must be scrutinized by human experts before being used [58], as the generated content, such as references [105, 12, 40, 31], could be fabricated. Lee, J.Y. et al. [66] reiterated from a legal (e.g., copyright law) perspective the inappropriateness of listing ChatGPT as an author, emphasizing that a non-human cannot bear legal responsibilities and consequences. However, banning ChatGPT from scientific writing is not easily enforceable, since ChatGPT is trained to produce human-like texts that even scientists and specifically trained AI detectors sometimes fail to detect [29, 7]. In short, even though the prospect is promising [92, 25, 43, 102], new regulations and substantial improvements are needed before ChatGPT can be safely and widely used for scientific writing, publishing, or medical research in general [105]. The scenario column in Table 1 corresponds to the taxonomy categorization. If an article concerns healthcare or a medical speciality only in general terms, it is categorized as ‘miscellaneous’. The category column indicates the type of the publication.

ChatGPT生成的内容必须经过人类专家审查后才能使用[58],因为生成的内容(如参考文献[105, 12, 40, 31])可能存在捏造。Lee, J.Y.等人[66]从法律(如著作权法)角度重申将ChatGPT列为作者的不当性,强调非人类实体无法承担法律责任与后果。然而,完全禁止ChatGPT参与科学写作难以执行,因为ChatGPT生成的类人文本甚至可能骗过科学家和专用AI检测器[29, 7]。简言之,尽管前景广阔[92, 25, 43, 102],在ChatGPT能安全广泛应用于科研写作、出版或医学研究前[105],仍需新规制定和实质性改进。表1中的场景列对应分类体系,若论文涉及医疗健康或广义医学专业领域,则归类为"其他"。类别列表示出版物类型。

3.4 Reviews of Level 3 Papers

3.4 三级论文综述

Level 3 papers feature extensive experiments conducted to assess the suitability of ChatGPT for a medical speciality or clinical scenario. For open-ended (OE) questions, human experts are usually involved to assess the appropriateness of the answers. To quantify the subjective assessments, a scoring criterion and scheme (e.g., a 5-point, 6-point or 10-point Likert scale) is usually required. For multiple-choice questions, it is desirable not only to quantify the accuracy but also to evaluate whether the ‘justification’ given by ChatGPT and the chosen answer are congruent. When it comes to comparisons (with humans or other language models), statistical analysis is usually performed. As shown in Table 2, many Level 3 papers were still pre-prints (under review) at the time of writing this review. Most current ChatGPT evaluations are on ‘medical education’ (medical exams in particular), which requires no ethical approval to conduct. Representative works include [36, 61], where the authors test ChatGPT on the US Medical Licensing Examination (USMLE). Even though the evaluations were carried out independently ([36] and [61] were published almost at the same time), similar results were reported, i.e., ChatGPT achieved only a moderate passing performance. [36] further showed that ChatGPT outperformed two other language models, InstructGPT and GPT-3, on the exam. In both studies, ChatGPT was asked to give not only the answers but also the justifications, which were taken into consideration during evaluation (by physicians). [36] further found that ChatGPT performed better on fact-check questions than on complex ‘knowhow’ type questions. It is worth noting that the exam contains questions from different medical specialities. However, Mbakwe, A.B. et al. [74] raised the concern that ChatGPT, a language model, passing the exam indicates flaws in the exam system $^1$.
Besides the USMLE, ChatGPT was also tested on the Chinese National Medical Licensing Examination [112] and the AHA BLS/ACLS Exams 2016 [32], on both of which ChatGPT failed to achieve passing scores.

三级论文通过大量实验评估ChatGPT在医学专科或临床场景中的适用性。对于开放式问题(OE),通常需要人类专家参与评估回答的恰当性。为量化主观评价,需制定评分标准与方案(如5分制、6分制或10分李克特量表)。针对选择题,不仅需量化准确率,还应评估ChatGPT提供的"论证"是否与选项逻辑一致。进行比较研究(与人类或其他语言模型对比)时,通常需进行统计分析。如表2所示,截至本文撰写时,多数三级论文仍处于预印本(评审中)状态。当前ChatGPT评估主要集中于"医学教育"(特别是医学考试)领域,这类研究无需伦理审批。代表性研究包括[36,61],作者在美国医师执照考试(USMLE)中测试ChatGPT。尽管两项独立研究([36]与[61]几乎同期发表)得出了相似结论:ChatGPT仅达到中等通过水平。[36]进一步显示ChatGPT在该考试中表现优于另两个语言模型Instruct GPT和GPT-3。两项研究均要求ChatGPT同时提供答案与论证,并由医师在评估时综合考量。[36]还发现ChatGPT在事实核查类问题的表现优于复杂"技术诀窍"类问题。值得注意的是,该考试涵盖多个医学专科的试题。然而Mbakwe, A.B.等[74]提出质疑:一个语言模型能通过考试,可能暴露考试体系存在缺陷1。除USMLE外,ChatGPT还在中国国家医学资格考试[112]和2016年美国心脏协会基础/高级生命支持考试[32]中接受测试,但均未达到及格线。
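The quantitative side of such a Level 3 evaluation can be sketched as a small helper. The field names and the 60% pass mark are illustrative assumptions for this sketch, not values taken from any of the cited studies:

```python
from statistics import mean

def summarize_evaluation(likert_scores, mcq_results, pass_mark=0.6):
    """likert_scores: expert ratings of open-ended answers (e.g., 1-5).
    mcq_results: list of (is_correct, justification_congruent) pairs for
    multiple-choice questions, where the second flag records whether the
    model's justification actually supports its chosen answer.
    Returns the mean rating, accuracy, congruence rate and a pass flag."""
    accuracy = mean(1.0 if ok else 0.0 for ok, _ in mcq_results)
    congruence = mean(1.0 if cong else 0.0 for _, cong in mcq_results)
    return {
        "mean_likert": mean(likert_scores),
        "accuracy": accuracy,
        "justification_congruence": congruence,
        "passed": accuracy >= pass_mark,
    }
```

Separating accuracy from justification congruence matters: a model can pick the right option for the wrong reason, which the studies above caught by having physicians read the justifications.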

ChatGPT achieved a similar performance to student examinees on a Doctor of Veterinary Medicine (DVM) exam containing 288 parasitology exam questions. One major limitation of using ChatGPT in medical exams is that the current release of ChatGPT can only process text inputs, whereas some questions are diagram-/figure-based $^2$. Such questions are either excluded or translated into text descriptions.

ChatGPT在包含288道寄生虫学考试题的兽医学博士(DVM)考试中表现出与学生考生相当的水平。当前版本ChatGPT在医学考试应用中的一个主要局限是只能处理文本输入,而部分试题基于图表/图示。这类题目要么被排除,要么被转化为文字描述。

Besides the standard medical exams, ChatGPT achieved promising results on cancer-related questions [47, 53]. In [53], ChatGPT’s answers to common cancer myths and misconceptions were evaluated by expert reviewers and compared with the standard answers from the National Cancer Institute (NCI). Results showed that ChatGPT is able to achieve very high accuracy, suggesting that the current ChatGPT is already a reliable source of cancer-related information for cancer patients [47]. Furthermore, [83] tested ChatGPT with 100 questions related to retinal disease. The answers were evaluated by domain experts based on a 5-point Likert scale. It was found that ChatGPT answers general questions with high accuracy, while the answers are less satisfactory, and sometimes harmful, when it comes to treatment/prescription recommendations. On 85 multiple-choice questions concerning genetics/genomics, ChatGPT achieved a similar performance to human respondents [28]. Interestingly, based on the test results, [28] also reached the conclusion that ChatGPT fares better on ‘memoriz