ChatCAD+: Towards a Universal and Reliable Interactive CAD using LLMs
Zihao Zhao, Sheng Wang, Jinchen Gu, Yitao Zhu, Lanzhuju Mei, Zixu Zhuang, Zhiming Cui, Qian Wang, Dinggang Shen, Fellow, IEEE
Abstract— The integration of Computer-Aided Diagnosis (CAD) with Large Language Models (LLMs) presents a promising frontier in clinical applications, notably in automating diagnostic processes akin to those performed by radiologists and in providing consultations similar to a virtual family doctor. Despite this promising potential, current works face at least two limitations: (1) From the perspective of a radiologist, existing studies typically cover a restricted scope of applicable imaging domains, failing to meet the diagnostic needs of different patients. Moreover, the insufficient diagnostic capability of LLMs further undermines the quality and reliability of the generated medical reports. (2) Current LLMs lack the requisite depth of medical expertise, rendering them less effective as virtual family doctors due to the potential unreliability of the advice provided during patient consultations. To address these limitations, we introduce ChatCAD+, designed to be universal and reliable. Specifically, it features two main modules: (1) Reliable Report Generation and (2) Reliable Interaction. The Reliable Report Generation module is capable of interpreting medical images from diverse domains and generating high-quality medical reports via our proposed hierarchical in-context learning. Concurrently, the interaction module leverages up-to-date information from reputable medical websites to provide reliable medical advice. Together, these modules synergize to closely align with the expertise of human medical professionals, offering enhanced consistency and reliability for interpretation and advice. The source code is available at GitHub.
Index Terms— Large Language Models, Multi-modality System, Medical Dialogue, Computer-Assisted Diagnosis
Zihao Zhao, Sheng Wang, Jinchen Gu and Yitao Zhu contributed equally and are listed as first authors.
Zihao Zhao, Jinchen Gu, Yitao Zhu, Lanzhuju Mei, and Zhiming Cui are with the School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China (e-mail: zihao.zhao10@gmail.com; {gujch12022, zhuyt, meilzhj, cuizm}@shanghaitech.edu.cn).

Fig. 1. Overview of our proposed ChatCAD$^+$ system. (a) For patients seeking a diagnosis, ChatCAD$^+$ generates reliable medical reports based on the input medical image(s) by referring to a local report database. (b) Additionally, for any inquiry from patients, ChatCAD$^+$ retrieves related knowledge from an online database and lets the large language model generate a reliable response.
Corresponding authors: Qian Wang and Dinggang Shen.
I. INTRODUCTION
Sheng Wang and Zixu Zhuang are with the School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China, Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200230, China, and also ShanghaiTech University, Shanghai 201210, China (e-mail: {wsheng, zixuzhuang}@sjtu.edu.cn).
Qian Wang and Dinggang Shen are with the School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai 201210, China, and Shanghai Clinical Research and Trial Center, Shanghai 201210, China. Dinggang Shen is also with Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200230, China (e-mail: {qianwang, dgshen}@shanghaitech.edu.cn).
Large Language Models (LLMs) have emerged as artificial intelligence systems extensively trained on text data. Drawing on the combined power of sophisticated deep learning techniques, large-scale datasets, and increased model sizes, LLMs demonstrate extraordinary capabilities in understanding and generating human-like text, as substantiated by prominent projects such as ChatGPT [1] and LLaMA [2]. These techniques are highly suitable for a wide range of scenarios, such as customer service, marketing, education, and healthcare consultation. Notably, ChatGPT has passed part of the US medical licensing exams, highlighting the potential of LLMs in the medical domain [3], [4].
So far, several studies have been conducted to integrate LLMs into Computer-Assisted Diagnosis (CAD) systems for medical images. Conventional CAD typically operates following pure computer vision [5], [6] or vision-language paradigms [7], [8]. LLMs have shown the ability to effectively interpret findings from medical images, thus mitigating the limitations of purely visual interpretation. For example, in our pilot study, we leverage the intermediate results obtained from image CAD networks and then utilize LLMs to generate final diagnostic reports [9].
Although some efforts have been made to combine CAD networks and LLMs [9]–[11], it should be noted that these studies are limited in scope, often focusing on specific image domains. That is, such a system may only support a single image modality, organ, or application (such as chest X-ray), which greatly limits generalizability in the real clinical workflow. The primary reason for this limitation comes from the notable topologic and semantic variations observed among medical image data, which present distinct challenges when attempting to encode various images with a unified model. Furthermore, although LLMs have demonstrated the capability to articulate medical concepts and clinical interpretations [12], [13], there still exists a significant gap between LLMs and radiologists. This gap undoubtedly undermines patient confidence in an LLM-generated report. Consequently, these issues hinder the practical utilization of LLMs in the automated generation of medical reports. In addition, the integration of LLMs and CAD systems may lead to medical dialogue systems [14]–[16]. In this way, patients will be able to interact through LLMs and acquire more medical advice and explanation, a functionality often missing in conventional CAD systems. However, existing studies show that general LLMs typically produce medical advice based solely on their encoded knowledge, without considering specific knowledge in the medical domain [15]. As a result, patients may receive unreliable responses, eroding trust in CAD systems and thereby hindering their use in real medical scenarios.
In order to tackle the challenges mentioned above, we propose ChatCAD$^+$ in this paper. Our contributions are made in the following aspects. (1) Universal image interpretation. Due to the current difficulty of obtaining a unified CAD network that handles various images, ChatCAD$^+$ incorporates a domain identification module to work with a variety of CAD models (cf. Fig. 2(a)). ChatCAD$^+$ can adaptively select a corresponding model given the input medical image. The tentative output of the CAD network is converted into a text description to reflect image features, making it applicable for subsequent diagnostic reporting. (2) Hierarchical in-context learning for enhanced report generation. The top-$k$ reports that are semantically similar to the LLM-generated report are retrieved from a clinical database via the proposed retrieval module (cf. Fig. 3). The retrieved $k$ reports then serve as in-context examples to refine the LLM-generated report. (3) Knowledge-based reliable interaction. As illustrated in Fig. 1(b), ChatCAD$^+$ does not directly provide medical advice. Instead, it first invokes our proposed knowledge retrieval module to obtain relevant knowledge from professional sources, e.g., the Merck Manuals, Mayo Clinic, and Cleveland Clinic. Then, the LLM uses the retrieved knowledge as a reference to provide reliable medical advice.
In summary, our ChatCAD$^+$ builds, for the first time, a universal and reliable medical dialogue system. The improved quality of the chatbot's answers and diagnostic reports reveals the potential of LLMs in interactive medical consultation.
II. RELATED WORKS
A. Large Language Models in Healthcare
Following the impressive potential demonstrated by ChatGPT with its 100B-scale parameters, researchers have further expanded the application of language models in healthcare unprecedentedly, yielding highly promising results. MedPaLM [17] was developed using meticulously curated biomedical corpora and human feedback, showcasing its potential with notable achievements, including a remarkable $67.6\%$ accuracy on the MedQA exam. Notably, ChatGPT, which did not receive specific medical training, successfully passed all three parts of the USMLE and achieved an overall accuracy of over $50\%$ across all exams, with several exams surpassing $60\%$ accuracy [18]. ChatDoctor [14], fine-tuned on the LLaMA model using synthetic clinical QA data generated by ChatGPT, also incorporated medical domain knowledge to provide reliable responses. However, our ChatCAD+ distinguishes its reliable interaction from ChatDoctor in several aspects. First, ChatDoctor relies on Wikipedia, which is editable by all users and may compromise the rigor and reliability required in the medical domain. Second, ChatDoctor simply divides the knowledge of each medical topic into equal sections and relies on rule-based ranking to select relevant information. In contrast, our reliable interaction leverages the Merck Manual [19], a comprehensive and trusted medical knowledge resource. It organizes medical knowledge into a hierarchical structure, thus enabling us to fully exploit the reasoning capabilities of the LLM in recursively retrieving the most pertinent and valuable knowledge.
B. Multi-modality Large Models
In present times, multi-modality large models are typically constructed based on pre-trained models rather than trained from scratch. Frozen [20] performed fine-tuning on an image encoder, utilizing its outputs as soft prompts for the language model. Flamingo [21] introduced cross-attention layers into the LLM to incorporate visual features, exclusively pre-training these new layers. BLIP-2 [22] leveraged both frozen image encoders and frozen LLMs, connecting them through their proposed Q-Former. Following the architecture of BLIP-2, XrayGPT [23] was specifically developed on chest X-ray. While the training cost for XrayGPT is demanding, our proposed ChatCAD$^+$ can seamlessly integrate existing pre-trained CAD models without requiring further training. This highlights the inherent advantage of ChatCAD$^+$ in continual learning scenarios, where additional CAD models can be effortlessly incorporated.
In more recent studies, researchers have taken a different approach, i.e., designing prompts to allow LLMs to utilize vision models for avoiding training altogether. For instance, VisualChatGPT [24] established a connection between ChatGPT and a series of visual foundation models, enabling exchange of images during conversations. ChatCAD [9] linked existing CAD models with LLMs to enhance diagnostic accuracy and improve patient care. Following a similar paradigm as ChatCAD [9], Visual Med-Alpaca [25] incorporated multiple vision models to generate text descriptions for medical images.

Fig. 2. Overview of the reliable report generation. (a) Universal interpretation: To enhance the vision capability of LLMs, multiple CAD models are incorporated. The numerical results obtained from these models are transformed into visual descriptions following the rule of prob2text. (b) Hierarchical in-context learning: The LLM initially generates a preliminary report based on the visual description, which is then enhanced through in-context learning using retrieved semantically similar reports. LLM and CLIP models are kept frozen, while CAD models are denoted as trainable as they can be continually trained and updated without interfering with other components.
However, compared to our study, Visual Med-Alpaca is limited to a single medical imaging domain and also lacks reliability in both report generation and interactive question answering.
III. METHOD
The proposed ChatCAD$^+$ is a multi-modality system capable of processing both image and text inputs (as illustrated in Fig. 1) through reliable report generation and reliable interaction, respectively. It is important to note that, in this paper, the term “modality” refers to language and image, which differs from the commonly used concept of medical image modality. The reliable report generation takes a medical image as input, identifies its domain, and provides enhanced medical reports by referencing a local report database. Simultaneously, for the user's text query, the reliable interaction retrieves clinically-sound knowledge from a professional knowledge database, which serves as a reliable reference for the LLM (e.g., ChatGPT) to enhance the reliability of its response.
A. Reliable Report Generation
This section outlines our proposed reliable report generation, as depicted in Fig. 2, which encompasses two primary stages: universal interpretation and hierarchical in-context learning. Section III-A.1 illustrates the paradigm to interpret medical images from diverse domains into descriptions that can be understood by LLMs. The resulting visual description then serves as the input for subsequent hierarchical in-context learning. Following this, Section III-A.2 elucidates hierarchical in-context learning, within which the retrieval module (in Fig. 3) acts as the key component.
TABLE I
THE ILLUSTRATION OF PROB2TEXT. P3 IS THE DEFAULT SETTING. CAD Model $\rightarrow$ (disease: prob)

| Setting | Prompt |
| --- | --- |
| P1 (direct) | "{disease} score: {prob}" |
| P2 (simplistic) | prob∈[0, 0.5): "No Finding"; prob∈[0.5, 1]: "The prediction is {disease}" |
| P3 (illustrative) | prob∈[0, 0.2): "No sign of {disease}"; prob∈[0.2, 0.5): "Small possibility of {disease}"; prob∈[0.5, 0.9): "Patient is likely to have {disease}"; prob∈[0.9, 1]: "Definitely have {disease}" |
1) Universal Interpretation: The process of universal interpretation can be divided into three steps. Initially, the proposed domain identification module is employed to determine the specific domain of the medical image. Subsequently, the corresponding domain-specific model is activated to interpret the image. Finally, the interpretation result is converted into a text prompt describing the visual information via the rule-based prob2text module for further processing.
Domain identification chooses CAD models to interpret the input medical image, which is fundamental for the subsequent operations of ChatCAD+. To achieve this, we employ a method of visual-language contrast that computes the cosine similarity between the input image and textual representations of various potential domains. This approach takes advantage of language's ability to densely describe the features of a particular type of image, as well as its ease of extensibility. In particular, we employ the pre-trained BiomedCLIP model [26] to encode both the medical image and the text associated with the domains of interest. In this study, we demonstrate on three domains: chest X-ray, dental X-ray, and knee MRI. The workflow is depicted in the upper right portion of Fig. 2. Assuming there are three domains $\mathrm{D}_{1},\mathrm{D}_{2},\mathrm{D}_{3}$, along with their textual representations $\mathbf{M}_{1},\mathbf{M}_{2},\mathbf{M}_{3}$, and a visual representation denoted as $\mathbf{I}$ for the input image, we define

Fig. 3. The illustration of the retrieval module within reliable report generation. It adopts the TF-IDF algorithm to preserve the semantics of each report and converts it into a latent embedding during offline modeling and online inference. To facilitate highly efficient retrieval, we perform spherical projection on all TF-IDF embeddings, whether during building or querying. In this manner, we can utilize the KD-Tree structure to store these data and implement retrieval with a low time complexity.

$$
\mathbf{D}_{\mathrm{pred}}=\operatorname*{argmax}_{i\in\{1,2,3\}}\frac{\mathbf{I}\cdot\mathbf{M}_{i}}{\Vert\mathbf{I}\Vert\,\Vert\mathbf{M}_{i}\Vert},
$$
where $\mathrm{D_{pred}}$ denotes the predicted domain of the medical image. The module can thereby invoke the domain-specific CAD model to analyze visual information given $\mathrm{D_{pred}}$.
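As a sketch of this selection step, the cosine-similarity argmax can be written in a few lines of NumPy, assuming the image and domain-text embeddings have already been produced by an encoder such as BiomedCLIP (the 4-d vectors below are illustrative toy values, not real CLIP outputs):

```python
import numpy as np

def identify_domain(image_emb, domain_embs):
    """Return the domain whose text embedding is most cosine-similar to
    the image embedding, i.e., argmax_i (I . M_i) / (||I|| ||M_i||)."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(domain_embs, key=lambda name: cos(image_emb, domain_embs[name]))

# Toy embeddings standing in for BiomedCLIP outputs (hypothetical values).
domains = {
    "chest X-ray":  np.array([0.9, 0.1, 0.0, 0.1]),
    "dental X-ray": np.array([0.1, 0.9, 0.1, 0.0]),
    "knee MRI":     np.array([0.0, 0.1, 0.9, 0.1]),
}
image = np.array([0.8, 0.2, 0.1, 0.1])
print(identify_domain(image, domains))  # chest X-ray
```

Extending the system to a new domain then only requires adding one more text embedding to the dictionary, which is the extensibility advantage noted above.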
Since most CAD models generate outputs that can hardly be understood by language models, further processing is needed to bridge this gap. For example, an image diagnosis model typically outputs tensors representing the likelihood of certain clinical findings. To establish a link between image and text, these tensors are transformed into textual descriptions according to diagnostic-related rules, which is denoted as prob2text in Fig. 2(a). The prob2text module is designed to present clinically relevant information in a manner that is more easily interpretable by LLMs. The details of the prompt design are illustrated in Table I. Using chest X-ray as an example, we follow the three types (P1-P3) of prompt designs in [9] and adopt P3 (illustrative) as the recommended setting in this study. Concretely, it employs a grading system that maps the numerical scores into a clinically illustrative description of disease severity. The scores are divided into four levels based on their magnitude: “No sign”, “Small possibility”, “Likely”, and “Definitely”. The corresponding texts are then used to describe the likelihood of different observations in the chest X-ray, providing a concise and informative summary of the patient's condition. The prompt designs for dental X-ray and knee MRI follow a similar approach. Other prompt designs, such as P1 and P2 [9], will be discussed in the experiments below.
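Concretely, the P3 (illustrative) grading in Table I amounts to a simple threshold mapping; the following is a minimal re-implementation of that rule, not the authors' exact code:

```python
def prob2text(disease, prob):
    """P3 (illustrative) grading: map a CAD model's probability for a
    finding into a clinically readable phrase (thresholds from Table I)."""
    if prob < 0.2:
        return f"No sign of {disease}"
    if prob < 0.5:
        return f"Small possibility of {disease}"
    if prob < 0.9:
        return f"Patient is likely to have {disease}"
    return f"Definitely have {disease}"

# One phrase per finding forms the visual description passed to the LLM.
for disease, prob in [("edema", 0.12), ("pneumonia", 0.73)]:
    print(prob2text(disease, prob))
# No sign of edema
# Patient is likely to have pneumonia
```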
2) Hierarchical In-context Learning: The proposed strategy for hierarchical in-context learning, as shown in Fig. 2(b), consists of (1) preliminary report generation and (2) retrieval-based in-context enhancement.
In the first step, the LLM utilizes the visual description to generate a preliminary diagnostic report. This process is facilitated through prompts such as “Write a report based on results from Network(s).” It is important to note that the generation of this preliminary report relies solely on the capability of the LLM and is still far from matching the capability of human specialists. As a result, even powerful LLMs like ChatGPT still struggle to achieve satisfactory results.
To address this concern, the next step involves retrieving the top-$k$ reports that share similar semantics with the preliminary report as in-context examples. The retrieval module is designed to preserve semantic information and ensure efficient implementation. Specifically, we employ TF-IDF [27] to model the semantics of each report, and the KD-Tree [28] data structure is utilized to facilitate high-speed retrieval. Fig. 3 illustrates the detailed pipeline of the proposed retrieval module. For simplicity, we use chest X-ray in MIMIC-CXR as an example in this section. To better preserve disease-related information in the report, we focus on 17 medical terms related to thoracic diseases during the modeling of TF-IDF embeddings, which form a term set $\mathcal{T}$. Assuming the training set of MIMIC-CXR is $\mathcal{D}$, the TF-IDF score for each term $t\in\mathcal{T}$ in each document $d\in\mathcal{D}$ is computed as:
$$
\mathrm{TF\text{-}IDF}(t,d)=\mathrm{TF}(t,d)\cdot\mathrm{IDF}(t)=\frac{|\{w\in d:w=t\}|}{|d|}\cdot\log\frac{|\mathcal{D}|}{|\{d'\in\mathcal{D}:t\in d'\}|},
$$
where $w$ indicates a word within the document $d$. In brief, $\mathrm{TF}(t,d)$ represents the frequency of term $t$ occurring in $d$, and $\mathrm{IDF}(t)$ denotes the log-scaled inverse of the fraction of documents in the training set $\mathcal{D}$ that contain $t$. Following this, we define the TF-IDF embedding (TIE) of a document $d$ as:
$$
\mathrm{TIE}(d)=\{\mathrm{TF\text{-}IDF}(t,d)\mid t\in\mathcal{T}\}.
$$
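Under these definitions, a TIE can be computed with plain Python; the tokenized toy reports and the two-term set below are hypothetical stand-ins for MIMIC-CXR reports and the 17 thoracic terms:

```python
import math

def tie(doc, corpus, terms):
    """TF-IDF embedding (TIE) of one tokenized report, restricted to the
    term set T: TF(t,d) * log(|D| / |{d in D : t in d}|) for each t."""
    emb = []
    for t in terms:
        tf = doc.count(t) / len(doc)                 # term frequency in d
        df = sum(1 for d in corpus if t in d)        # document frequency
        emb.append(tf * math.log(len(corpus) / df) if df else 0.0)
    return emb

terms = ["edema", "pneumonia"]                       # toy term set T
reports = [["mild", "edema", "seen"],
           ["no", "pneumonia", "seen"],
           ["edema", "and", "pneumonia"]]
emb = tie(reports[0], reports, terms)
print(emb)  # nonzero weight for "edema", 0.0 for absent "pneumonia"
```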
After calculating the TIEs for all $d\in\mathcal{D}$, we organize them using the KD-Tree [28] data structure, which enables fast online querying. The KD-Tree is a space-partitioning data structure that can implement $k$-nearest-neighbor querying in $O(\log(n))$ time on average, making it suitable for information searching. However, since the implementation of the KD-Tree relies on the $L_{2}$ distance instead of cosine similarity, it cannot be directly applied to our TIEs. To address this issue, we project all TIEs onto the surface of a unit hypersphere, as shown in the right panel of Fig. 3. After spherical projection, we have:
$$
L_{2}\Big(\frac{\vec{q}}{\|\vec{q}\|},\frac{\vec{v}}{\|\vec{v}\|}\Big)=2r\cdot\sin\frac{\theta}{2},
$$
where $\vec{q}$ and $\vec{v}$ represent the TIE of the query report and a selected report, respectively, $r$ indicates the radius of the hypersphere ($r=1$ in this study), and $\theta\in[0,\pi]$ stands for the angle between $\frac{\vec{q}}{\|\vec{q}\|}$ and $\frac{\vec{v}}{\|\vec{v}\|}$. The $L_{2}$ distance between two projected TIEs increases monotonically with the angle between them, which enables the application of the KD-Tree data structure for efficient retrieval.
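The monotone relation can be verified directly via the law of cosines: for the unit-norm projections,

$$
\Big\|\frac{\vec{q}}{\|\vec{q}\|}-\frac{\vec{v}}{\|\vec{v}\|}\Big\|_{2}^{2}=2-2\cos\theta=4\sin^{2}\frac{\theta}{2},
$$

so the $L_{2}$ distance equals $2\sin(\theta/2)$, which increases strictly on $\theta\in[0,\pi]$. Nearest neighbors under $L_{2}$ on the unit sphere are therefore exactly the most cosine-similar reports.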
During online querying, the KD-Tree engine retrieves the top-$k$ (i.e., $k=3$) reports that share the most similar semantics with the query report (i.e., the preliminary report illustrated in Fig. 2). It then asks the LLM to refine its preliminary report with the retrieved $k$ reports as references, and to generate the enhanced report as the final output.
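The projection-plus-KD-Tree retrieval can be sketched with SciPy's `cKDTree`; the three-dimensional TIEs here are toy values (real TIEs would be 17-dimensional for MIMIC-CXR):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy TIEs for a small report database (one row per report).
db = np.array([[0.4, 0.0, 0.1],
               [0.0, 0.5, 0.2],
               [0.3, 0.1, 0.0]])

# Spherical projection: normalize every TIE onto the unit hypersphere,
# so L2 distance becomes a monotone function of the angle, 2*sin(theta/2).
db_unit = db / np.linalg.norm(db, axis=1, keepdims=True)
tree = cKDTree(db_unit)

query = np.array([0.5, 0.05, 0.1])           # TIE of the preliminary report
dist, idx = tree.query(query / np.linalg.norm(query), k=2)
print(idx)  # indices of the top-2 most cosine-similar reports
```

The reports at these indices would then be pasted into the prompt as in-context examples for the refinement step.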

Fig. 4. Overview of the reliable interaction. (a) The illustration of the structured medical knowledge database, organized as a tree-like dictionary, where each medical topic has multiple sections, and sections can be further divided into subsections. (b) An LLM-based knowledge retrieval method is proposed to search relevant knowledge in a backtracking manner. (c) The LLM is prompted to answer the question based on the retrieved knowledge.
B. Reliable Interaction
This section outlines the reliable interaction, which is responsible for answering the user's text query. It serves a dual role within our proposed ChatCAD+. First, reliable interaction can function as a sequential module after reliable report generation to enhance the interpretability and understandability of medical reports for patients. In this capacity, it takes the generated report from the previous module as input and provides patients with explanations of the clinical terms involved. Second, it can also operate independently to answer clinically-related questions and provide reliable medical advice.
The proposed ChatCAD$^+$ offers reliable interaction via the construction of a professional knowledge database and LLM-based knowledge retrieval. In this study, we demonstrate using the Merck Manuals, a series of healthcare reference books that provide evidence-based information on the diagnosis and treatment of diseases and medical conditions.
The implementation of LLM-based knowledge retrieval leverages the Chain-of-Thought (CoT) technique, which is widely known for its ability to enhance the performance of LLMs in problem-solving. CoT breaks down a problem into a series of sub-problems and then synthesizes the answers to these sub-problems progressively to solve the original problem. Humans typically do not read an entire knowledge base but rather skim through topic titles to find what they are looking for. Inspired by this, we have designed a prompt system that automatically guides the LLM to execute such searches. Correspondingly, we have structured the Merck Manuals database as a hierarchical dictionary, with topic titles of different levels serving as keys, as shown in Fig. 4(a). Fig. 4(b) demonstrates the proposed knowledge retrieval method. Initially, we provide the LLM with only the titles of five related medical topics in the database, and then ask the LLM to select the most relevant topic to begin the retrieval. Once the LLM has made its choice, we provide it with the content of the “abstract” section and present the names of all other sections under this topic. Given that medical topics often exhibit hierarchical organization across three or four tiers, we repeat this process iteratively, empowering the LLM to efficiently navigate through all tiers of a given medical topic. Upon identifying relevant medical knowledge, the LLM is prompted to return the retrieved knowledge, terminating the process. Otherwise, if the LLM finds none of the provided information relevant, our algorithm backtracks to the parent tier, which makes it a kind of depth-first search [30]. Finally, the LLM is prompted to provide a reliable response based on the retrieved knowledge (Fig. 4(c)).
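The title-guided depth-first traversal described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `keyword_llm` is a crude keyword-matching stand-in for the real LLM call (any function mapping a query and a list of section titles to the chosen title, or `None` when nothing is relevant), and the toy `manual` dictionary is illustrative, not actual Merck Manuals content.

```python
# LLM-guided depth-first retrieval over a hierarchical knowledge dictionary.
# Leaves are knowledge text; inner nodes map topic/section titles to subtrees.

def retrieve(node, query, ask_llm, path=()):
    """Depth-first search in which the LLM chooses which branch to enter."""
    if isinstance(node, str):                 # leaf: actual knowledge text
        return node, path
    titles = list(node.keys())
    choice = ask_llm(query, titles)
    if choice is None:                        # nothing relevant: backtrack
        return None, path
    result, hit = retrieve(node[choice], query, ask_llm, path + (choice,))
    if result is not None:
        return result, hit
    for title in titles:                      # chosen branch failed: try siblings
        if title != choice:
            result, hit = retrieve(node[title], query, ask_llm, path + (title,))
            if result is not None:
                return result, hit
    return None, path

# Toy knowledge base (illustrative content only).
manual = {
    "Pulmonary Disorders": {
        "Pleural Effusion": {"Diagnosis": "CT can detect effusions as small as 10 mL."},
    },
    "Dermatologic Disorders": {
        "Acne": {"Treatment": "Topical retinoids are first-line therapy."},
    },
}

def keyword_llm(query, titles):               # crude stand-in for the LLM
    for t in titles:
        if any(w in t.lower() for w in query.lower().split()):
            return t
    return None

text, path = retrieve(manual, "pulmonary pleural effusion diagnosis", keyword_llm)
print(path, "->", text)
```

In the real system the choice at each tier is made by prompting the LLM with the candidate titles, and the "abstract" section of the chosen topic is also fed back as context.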
IV. EXPERIMENTAL RESULTS
To comprehensively assess our proposed ChatCAD+, we conducted evaluations across three crucial aspects that influence its effectiveness in real-world scenarios: (1) the capability to handle medical images from different domains, (2) report generation quality, and (3) the efficacy in clinical question answering. This section commences with an introduction to the datasets used and the implementation details of ChatCAD+. Then, we delve into the evaluation of these aspects in turn.
A. Dataset and implementations
Domain identification: To investigate the robustness of domain identification, the fundamental component of our proposed universal CAD system, we crafted a dataset covering nine medical imaging domains by randomly selecting samples from different datasets. The statistics of this dataset are presented in Table II. During the evaluation, we progressively increase the number of domains following the left-to-right order in Table II and report the accuracy (ACC) at each step. Since a pre-trained CLIP model is utilized to perform domain identification (as shown in Fig. 2), we also investigate the impact of different CLIP variants in Sec. IV-B.
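The domain identification step is a CLIP-style zero-shot classification: embed the image and one text prompt per candidate domain, then pick the domain whose prompt embedding is closest in cosine similarity. The sketch below only illustrates this decision rule; the 3-d vectors are hand-made toys standing in for real (Biomed)CLIP embeddings, and the prompt wording is an assumption.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify_domain(image_emb, prompt_embs):
    """prompt_embs: {domain name: embedding of e.g. 'a photo of a chest X-ray'}."""
    return max(prompt_embs, key=lambda d: cosine(image_emb, prompt_embs[d]))

# Toy prompt embeddings, one per candidate imaging domain.
prompt_embs = {
    "chest X-ray":  [0.9, 0.1, 0.0],
    "dental X-ray": [0.1, 0.9, 0.0],
    "fundus":       [0.0, 0.1, 0.9],
}
print(identify_domain([0.8, 0.2, 0.1], prompt_embs))  # -> chest X-ray
```

Because the decision is a nearest-prompt lookup, swapping the encoder (CLIP, BiomedCLIP, PMC-CLIP) changes only the embeddings, not this rule, which is what the comparison in Sec. IV-B varies.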
TABLE II STATISTICS OF THE COMPOSED DATASET FOR THE EVALUATION OF DOMAIN IDENTIFICATION.
|  | chest X-ray | dental X-ray | knee X-ray | mammography | chest CT | fundus | endoscopy | dermoscopy | blood cell |
|---|---|---|---|---|---|---|---|---|---|
| number | 400 | 360 | 800 | 360 | 320 | 120 | 200 | 360 | 200 |
| source | MedFMC [31] | Tufts Dental [32] | OAI [33] | INbreast [34] | COVID-CT [35] | DIARETDB1 [36] | MedFMC [31] | ISIC2016 [37] | PBC [38] |

Fig. 5. Evaluation of domain identification using different CLIP models.
Report generation: For a fair comparison, the report generation performance of ChatCAD+ is assessed on the public MIMIC-CXR dataset [39]. MIMIC-CXR is a large public dataset of chest X-ray images, associated radiology reports, and other clinical data, consisting of 377,110 de-identified chest X-ray images and 227,827 associated radiology reports. The MIMIC-CXR dataset allows us to measure both diagnostic accuracy and report generation quality, as most report generation methods are designed specifically for chest X-ray. The quality of report generation was tested on the official test set of the MIMIC-CXR dataset, focusing on five findings (cardiomegaly, edema, consolidation, atelectasis, and pleural effusion). Since ChatGPT was adopted as the default LLM in the report generation experiments, we randomly selected 200 samples for each finding of interest due to the per-hour access limits of the OpenAI API key, resulting in a total of 1,000 test samples. To evaluate the performance of report generation, we used several widely-used Natural Language Generation (NLG) metrics, including BLEU [40], METEOR [41], and ROUGE-L [42]. Specifically, BLEU measures the similarity between generated and ground-truth reports by counting word overlap, and we denote Corpus BLEU as C-BLEU by default. METEOR considers synonym substitution and evaluates performance at both the sentence and report levels. Meanwhile, ROUGE-L evaluates the length of the longest common subsequence. We also measured the clinical efficacy (CE) of reports using the open-source CheXbert library [43]. CheXbert takes text reports as input and generates multi-label classification labels for each report, with each label corresponding to one of the pre-defined thoracic diseases.
Based on these extracted multi-hot labels, we compute precision (PR), recall (RC), and F1-score (F1) on the 1,000 test samples. These metrics provide additional insight into the performance of report generation. Unless otherwise specified, we choose P3 as the default prompt design in universal interpretation and $k{=}3$ for hierarchical in-context learning.
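The clinical-efficacy scores reduce to precision/recall/F1 over the two multi-hot label matrices (CheXbert labels of generated vs. ground-truth reports). The paper does not pin down the averaging scheme, so the sketch below shows micro-averaging as one common choice, with toy labels:

```python
# Precision/recall/F1 over CheXbert-style multi-hot labels. Each row is one
# report, each column one pre-defined finding. Micro-averaging pools all
# (report, finding) cells before computing the scores.

def micro_prf(pred, gold):
    tp = fp = fn = 0
    for row_p, row_g in zip(pred, gold):
        for p, g in zip(row_p, row_g):
            tp += 1 if p and g else 0          # finding predicted and present
            fp += 1 if p and not g else 0      # predicted but absent
            fn += 1 if g and not p else 0      # present but missed
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f1

pred = [[1, 0, 1], [0, 1, 0]]   # labels extracted from generated reports
gold = [[1, 1, 0], [0, 1, 0]]   # labels extracted from ground-truth reports
print(micro_prf(pred, gold))
```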
Clinical question answering: The evaluation of reliable interaction was performed using a subset of the CMExam dataset [44]. CMExam is a collection of multiple-choice questions with official explanations, sourced from the Chinese National Medical Licensing Examination. Since CMExam covers a diversity of areas, including many clinically unrelated questions, we focused on a subset of the CMExam test set that specifically addresses “Disease Diagnosis” and “Disease Treatment,” while excluding questions belonging to the department of “Traditional Chinese Medicine.” This resulted in a subset of 2,190 question-answer pairs. Since the CMExam dataset is entirely in Chinese, we selected ChatGLM2 [45] and ChatGLM3 [45], two 6B-scale models that have demonstrated superior Chinese language capabilities compared to other open-source LLMs, as the LLMs for validation. We employed the official prompt design of CMExam, which requires the LLM to make a choice and provide the accompanying explanation simultaneously. We acknowledge that our proposed LLM-based knowledge retrieval involves recursive requests to the LLM and consumes a large number of tokens, making experiments using ChatGPT and GPT-4 unaffordable for us. In the experimental results, we report ACC and F1 for multi-choice selection, and NLG metrics for reason explanation.
Implementation of domain-specific CADs: In Section III-A.1, we proposed to incorporate domain-specific CAD models to interpret images from different medical imaging domains. In this study, we adopted a Chest CAD, a Tooth CAD, and a Knee CAD. The Chest CAD is composed of a thoracic disease classifier [46] and a chest X-ray report generation network [7]. The former [46] was trained on the official training split of the CheXpert [47] dataset, using a learning rate ($lr$) of $1e^{-4}$ for 10 epochs. The report generation network [7] was trained on the training split of MIMIC-CXR [39] using a single NVIDIA A100 GPU, with the Adam optimizer and a learning rate of $5e^{-5}$. For the Tooth CAD, we utilized the periodontal diagnosis model proposed in [48], which was trained on 300 panoramic dental X-ray images of size $2903\times1536$ collected from real-world clinics. The training was conducted on an NVIDIA A100 GPU with 80GB memory for a total of 200 epochs, with the learning rate set to $2e^{-5}$ using the Adam optimizer. For the Knee CAD, we adopted the model proposed in [49], which was trained on 964 knee MRIs captured with a Philips Achieva 3.0T TX MRI scanner at Shanghai Sixth People's Hospital. The Adam optimizer was used with a weight decay of $1e^{-4}$.

Fig. 6. Examples of universal and interactive medical consultation using ChatCAD+, with ChatGPT as the default LLM. The underlined text signifies information obtained from reputable medical websites, which is not comprehensively incorporated in ChatGPT.

The learning rate was set to $3e^{-5}$ for the Transformer and graph attention network components, and $3e^{-4}$ for the others.
B. Evaluation of a Universal System
To investigate the effectiveness of ChatCAD+ as a universal CAD system, we evaluated it by (1) quantitatively investigating its domain identification capability when encountering images from various domains, and (2) qualitatively investigating its ability to provide reliable responses in different scenarios.
- Robustness of domain identification: In Fig. 5, we visualize the domain identification performance of different CLIP models. Compared with the baseline CLIP model, BiomedCLIP [26] achieves an impressive performance, showing 100% accuracy on 2,560 images from 7 different domains. It even maintains an accuracy of 98.9% when all 9 domains are involved, while the performance of CLIP degrades to 86.3%. These observations demonstrate that BiomedCLIP has a more in-depth understanding of professional medical knowledge than CLIP, and is thus suitable for domain identification in our proposed ChatCAD+.
However, to our surprise, PMC-CLIP [50] performs rather poorly on this task. As shown in Fig. 5, it only achieves approximately 50% accuracy when three domains are involved. We hypothesize that this can be attributed to its lack of zero-shot capability. While PMC-CLIP has demonstrated promising quality when fine-tuned on downstream tasks, its lack of zero-shot inference capability may hinder its widespread application. In conclusion, BiomedCLIP is a better choice than both CLIP and PMC-CLIP for domain identification.
- Reliability of consultations: Fig. 6 demonstrates the universal application and reliability of our proposed method. ChatCAD+ can process different medical images and provide proper diagnostic reports. Users can interact with ChatCAD+ and ask additional questions for further clarification. By leveraging external medical knowledge, the enhanced ChatCAD+ is capable of delivering medical advice in a more professional manner. In Fig. 6, each reply from ChatCAD+ includes a reference, and the pertinent knowledge obtained from the external database is underlined for emphasis. For instance, when asked whether a chest X-ray is sufficient for diagnosing pleural effusion, ChatGPT would simply recommend a CT scan without providing a convincing explanation. In contrast, ChatCAD+ is capable of informing the patient that CT scans can detect effusions as small as 10 milliliters, thereby providing a more detailed and informative response.
C. Evaluation of Report Generation
We begin by comparing our approach to state-of-the-art (SOTA) report generation methods, while also examining the impact of various prompt designs. Subsequently, we assess the generalizability of our reliable report generation module by utilizing diverse LLMs. Following this, we explore the impact of hierarchical in-context learning by changing the number of retrieved reports. We then proceed to qualitatively showcase the advanced capabilities of ChatCAD+ in report generation. Finally, we compare our specialist-based system against generalist medical AI systems.
- Comparison with SOTA methods: The numerical results of different methods are compared in Table III. The advantage of ChatCAD+ lies in its superior performance across multiple evaluation metrics compared to both previous models and different prompt settings. In particular, ChatCAD+ (P3) stands out with the highest scores in terms of C-BLEU, METEOR, REC, and F1. Despite trailing on a few specific metrics, ChatCAD+ (P3) still holds a significant advantage across the others, which validates its advanced overall performance. Note that ChatCAD+ (P3) shows a significant advantage over ChatCAD (P3) on all NLG and CE metrics, which underscores the rationality of hierarchical in-context learning. These results suggest that ChatCAD+ (P3) is capable of generating high-quality responses that are not only fluent and grammatical but also clinically accurate, making it a promising model for conversational tasks.
TABLE III COMPARATIVE RESULTS WITH PREVIOUS STUDIES AND DIFFERENT PROMPT SETTINGS. P3 IS THE DEFAULT PROMPT SETTING IN THIS STUDY.
| Model | Size | C-BLEU | ROUGE-L | METEOR | PRE | REC | F1 |
|---|---|---|---|---|---|---|---|
| R2GenCMN [7] | 244M | 3.525 | 18.351 | 21.045 | 0.578 | 0.411 | 0.458 |
| VLCI [8] | 357M | 3.834 | 16.217 | 20.806 | 0.617 | 0.299 | 0.377 |
| ChatCAD (P3) | 256M+175B | 3.594 | 16.791 | 23.146 | 0.526 | 0.603 | 0.553 |
| ChatCAD+ (P1) | 256M+175B | 4.073 | 16.820 | 22.700 | 0.548 | 0.484 | 0.502 |
| ChatCAD+ (P2) | 256M+175B | 4.407 | 17.266 | 23.302 | 0.538 | 0.597 | 0.557 |
| ChatCAD+ (P3) | 256M+175B | 4.409 | 17.344 | 24.337 | 0.531 | 0.615 | 0.564 |
TABLE IV EVALUATION OF RELIABLE REPORT GENERATION USING DIFFERENT LLMS. “✓”/“×” INDICATE WITH/WITHOUT OUR RELIABLE REPORT GENERATION.
| Model | Size | Reliable | C-BLEU | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | PRE | REC | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGLM2 [45] | 6B | × | 2.073 | 16.999 | 3.426 | 0.926 | 0.313 | 14.119 | 23.008 | 0.518 | 0.632 | 0.553 |
| ChatGLM2 [45] | 6B | ✓ | 3.236 | 25.172 | 5.268 | 1.461 | 0.566 | 15.559 | 23.397 | 0.501 | 0.639 | 0.555 |
| PMC-LLaMA [51] | 7B | × | 1.709 | 15.003 | 2.432 | 0.748 | 0.313 | 13.983 | 20.590 | 0.492 | 0.702 | 0.532 |
| PMC-LLaMA [51] | 7B | ✓ | 1.059 | 6.5059 | 1.591 | 0.504 | 0.241 | 8.140 | 18.551 | 0.477 | 0.664 | 0.525 |
| MedAlpaca [52] | 7B | × | 1.554 | 30.710 | 6.143 | 1.906 | 0.819 | 12.729 | 11.865 | 0.526 | 0.464 | 0.484 |
| MedAlpaca [52] | 7B | ✓ | 3.031 | 30.795 | 7.107 | 2.408 | 1.107 | 14.761 | 15.711 | 0.507 | 0.518 | 0.505 |
| Mistral [53] | 7B | × | 3.261 | 27.806 | 6.068 | 1.463 | 0.458 | 16.132 | 21.999 | 0.558 | 0.444 | 0.485 |
| Mistral [53] | 7B | ✓ | 3.547 | 30.851 | 7.178 | 1.698 | 0.493 | 17.493 | 22.347 | 0.543 | 0.597 | 0.558 |
| LLaMA [2] | 7B | × | 0.807 | 7.575 | 1.265 | 0.337 | 0.128 | 10.016 | 17.194 | 0.495 | 0.589 | 0.519 |
| LLaMA [2] | 7B | ✓ | 0.966 | 7.790 | 1.525 | 0.423 | 0.178 | 10.237 | 18.344 | 0.478 | 0.625 | 0.535 |
| LLaMA2 [54] | 7B | × | 1.298 | 14.929 | 2.717 | 0.531 | 0.132 | 12.016 | 19.958 | 0.481 | 0.588 | 0.519 |
| LLaMA2 [54] | 7B | ✓ | 3.579 | 24.798 | 5.826 | 1.638 | 0.693 | 16.850 | 23.950 | 0.512 | 0.561 | 0.528 |
| LLaMA [2] | 13B | × | 0.810 | 6.824 | 1.393 | 0.331 | 0.113 | 8.959 | 16.647 | 0.486 | 0.452 | 0.459 |
| LLaMA [2] | 13B | ✓ | 0.973 | 8.295 | 1.521 | 0.439 | 0.197 | 9.813 | 18.073 | 0.498 | 0.465 | 0.463 |
| LLaMA2 [54] | 13B | × | 2.940 | 23.536 | 5.173 | 1.390 | 0.441 | 15.053 | 21.938 | 0.531 | 0.564 | 0.538 |
| LLaMA2 [54] | 13B | ✓ | 3.650 | 25.080 | 6.148 | 1.728 | 0.666 | 16.646 | 24.015 | 0.533 | 0.571 | 0.543 |
| ChatGPT [1] | 175B | × | 3.594 | 28.453 | 6.371 | 1.656 | 0.556 | 16.791 | 23.146 | 0.526 | 0.603 | 0.553 |
| ChatGPT [1] | 175B | ✓ | 4.409 | 31.625 | 7.570 | 2.107 | 0.760 | 17.344 | 24.337 | 0.531 | 0.615 | 0.564 |
Additionally, the impact of prompt designs can also be observed. Generally speaking, P3 achieves the best performance while P1 leads to unsatisfactory results. This phenomenon may be attributed to the similarity of P3 to human language style, as it conveys the severity of the disease using rhetoric, whereas P1 directly displays the probability without post-processing.
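The contrast between P1 and P3 can be made concrete: P1 exposes the classifier's raw probability, while P3 rewrites it as a human-like severity phrase. The thresholds and phrasings below are assumptions for illustration; the paper does not list ChatCAD+'s exact templates.

```python
# Hypothetical illustration of the P1 vs. P3 prompt styles.

def prompt_p1(finding, prob):
    # P1: raw network output, no post-processing
    return f"The probability of {finding} is {prob:.2f}."

def prompt_p3(finding, prob):
    # P3: probability rephrased as rhetoric about severity (thresholds assumed)
    if prob < 0.2:
        degree = "no sign of"
    elif prob < 0.5:
        degree = "a small possibility of"
    elif prob < 0.8:
        degree = "probable"
    else:
        degree = "definite"
    return f"There is {degree} {finding}."

print(prompt_p1("cardiomegaly", 0.63))  # The probability of cardiomegaly is 0.63.
print(prompt_p3("cardiomegaly", 0.63))  # There is probable cardiomegaly.
```

The P3 string reads like a radiologist's hedged language, which plausibly matches the LLM's training distribution better than a bare number.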
- Investigation of generalizability: In Table IV, we explore the generalizability of reliable report generation by adopting different LLMs. From the results, several noteworthy observations can be drawn:
• Our proposed method consistently improves the quality of generated radiology reports when combined with different LLMs, except for PMC-LLaMA. This finding demonstrates the effectiveness of our approach.
• Domain-specific LLMs, such as PMC-LLaMA and MedAlpaca, do not yield sufficiently competitive results on reliable report generation. Despite being specifically trained on medical corpora, their performance falls short. Even with the enhancement provided by reliable report generation, MedAlpaca only achieves an F1 score of 0.505. Moreover, PMC-LLaMA does not work well with reliable report generation, suffering significant performance degradation. We hypothesize that their inferior instruction-following capability contributes to these results. Previous studies have emphasized the importance of properly controlling domain-specific fine-tuning to preserve conversational capability in LLMs [55]. Unlike MedAlpaca, PMC-LLaMA fully fine-tuned its model weights during training, which could result in more severe degradation than the parameter-efficiently tuned MedAlpaca. We believe this is why reliable report generation shows inferior performance when combined with PMC-LLaMA.
• A larger model size does not always guarantee better performance. For example, LLaMA (7B) and LLaMA (13B) exhibit inferior performance compared to ChatGLM2 (6B) and other 7B-scale LLMs. Thus, blindly pursuing model size may not be the optimal solution.
• While ChatGPT unsurprisingly achieves the best overall performance, other LLMs like LLaMA2 (13B) and Mistral (7B) still demonstrate comparable performance. In particular, Mistral (7B) achieves the second-best F1 score among all the LLMs evaluated while maintaining competitive NLG metrics. This highlights the potential application of consumer-level LLMs in clinical workflows.

Fig. 7. Ablation study of hierarchical in-context learning w.r.t. the number of retrieved reports. A range of values, varying from 0 to 5, was chosen for $k$, and its influence is evaluated on several NLG metrics.

Fig. 8. Distributions of report length with respect to the change of $k$.
TABLE V PERFORMANCE COMPARISON BETWEEN CHATCAD+ AND GENERALIST MODELS IN THE MEDICAL IMAGING DOMAIN. EXCESSIVELY LOW RESULTS ARE REPRESENTED BY A SLASH.
| Method | C-BLEU | ROUGE-L | METEOR | PRE | REC | F1 |
|---|---|---|---|---|---|---|
| PMC-VQA [56] | / | 5.978 | 2.685 | 0.221 | 0.109 | 0.115 |
| RadFM [57] | / | 7.366 | 4.756 | 0.487 | 0.095 | 0.156 |
| ChatCAD+ (ours) | 4.409 | 17.344 | 24.337 | 0.531 | 0.615 | 0.564 |
- Ablation study of hierarchical in-context learning: We investigate the impact of the top-$k$ retrieved reports on reliable report generation in Fig. 7. As $k$ increases, the overall performance shows an upward trend. Moreover, in most cases, the largest improvement occurs when $k$ changes from 0 to 1, indicating that, regardless of quantity, providing templates as references significantly enhances the quality of generated reports. Meanwhile, the improvement tends to saturate around $k{=}3$, and further increasing $k$ does not yield more significant gains, which validates the rationality of the adopted default setting.
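The quantity being ablated is the number of historical reports handed to the LLM as in-context templates. A minimal sketch of this retrieval step is shown below; ranking by overlap of finding labels is an illustrative stand-in for the actual similarity measure used in hierarchical in-context learning, and the toy corpus is invented.

```python
# Top-k template retrieval for in-context learning: rank historical reports
# by similarity to the current case and keep the k best as exemplars.

def top_k_templates(query_labels, corpus, k=3):
    """corpus: list of (finding labels, report text); returns the k most similar reports."""
    def score(item):
        labels, _ = item
        # similarity = number of agreeing finding labels (illustrative choice)
        return sum(a == b for a, b in zip(query_labels, labels))
    return [text for _, text in sorted(corpus, key=score, reverse=True)[:k]]

corpus = [
    ([1, 0, 1], "Report A: cardiomegaly with pleural effusion."),
    ([1, 1, 1], "Report B: cardiomegaly, edema, and effusion."),
    ([0, 0, 0], "Report C: no acute findings."),
    ([0, 1, 0], "Report D: mild edema."),
]
print(top_k_templates([1, 0, 1], corpus, k=2))
```

With $k{=}0$ the LLM writes from the classifier output alone; each retrieved template adds an example of radiologist phrasing, which is consistent with the large 0-to-1 jump observed in Fig. 7.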
The distribution of report length is also demonstrated in Fig. 8 by varying the value of $k$. Report length serves as an important criterion [7] to measure the similarity to ground-truth radiology reports. A straightforward observation is that the distribution of ChatCAD exhibits a significant shift from that of the radiologist. In contrast, ChatCAD+ shows a better-fitted curve irrespective of the value of $k$.
- Qualitative analysis: In Fig. 9, each row showcases a chest X-ray along with its corresponding ground-truth annotation. To enhance clarity, we have color-coded the presence of certain findings in the generated reports. This visual representation clearly illustrates the differences among the methods. For example, the first chest X-ray image presents four abnormalities, including cardiomegaly. However, both the R2GenCMN and VLCI methods fail to acknowledge the existence of cardiomegaly in their generated reports. In the second row, the other two methods struggle to detect any abnormality in the chest X-ray, while our method successfully identifies the possibility of cardiomegaly. The same trend can be observed in the third row. By including this qualitative analysis, we aim to highlight the advantages of our ChatCAD+ method and provide a comprehensive evaluation of the template retrieval system (hierarchical in-context learning). This comparison demonstrates the improved report quality achieved by our proposed method.
- Comparison with generalists: While ChatCAD+ aims to integrate multiple specialist AI models incrementally, some recent studies [51], [57] have investigated generalist AI models for medical imaging. Hence, we quantitatively compare ChatCAD+ to generalists to investigate its superiority.
Existing generalist models typically follow the architecture of DeepMind's Flamingo [21] or Salesforce's BLIP-2 [22]. For instance, RadFM [57] and PMC-VQA [56] are trained based on Flamingo and BLIP-2, respectively. We compared ChatCAD+ with PMC-VQA and RadFM using the same test set, conducting zero-shot inference, as shown in Table V. Overall, our method demonstrates significant advantages over the generalist models. It is worth noting that these generalist models exhibit extremely low C-BLEU values, indicating that they can only generate very short responses and lack the ability to provide comprehensive diagnostic information like ours. Generally speaking, a generalist model implies that a single image encoder must interpret semantic information for all possible input images. This is challenging to achieve in the medical imaging domain due to the distinct morphological and clinical semantic differences among medical imaging modalities, which require a deep understanding of professional medical knowledge. For these reasons, we believe the "ChatGPT moment" for medical imaging analysis has not yet arrived, and specialist models will remain a preferable solution for a long time. A similar preference for specialist models is also noted in a related survey [58].
It is also important to note that a major advantage of ChatCAD+ lies in its capability for continual learning. By incorporating the LLM, we can combine decisions from multiple CAD models, enabling fine-tuning of each CAD model individually as well as their incremental ensembling. This flexibility allows us to adapt and integrate new models rapidly in response to emerging situations without the need for huge computational resources.

Fig. 9. Qualitative analysis of different report generation methods. Chest X-rays and corresponding ground-truth annotations are provided. The presence of a certain finding in a report is marked in a certain color for clear illustration.
TABLE VI VALIDATION OF RELIABLE INTERACTION FROM THE PERSPECTIVES OF BOTH MULTI-CHOICE SELECTION (ACC, F1) AND REASON EXPLANATION (C-BLEU, ROUGE-L).
| Method | ACC | F1 | C-BLEU | ROUGE-L |
|---|---|---|---|---|
| ChatGLM2 [45] | 0.435 | 0.427 | 2.430 | 11.396 |
| ChatGLM2+Ours | 0.465 | 0.461 | 7.930 | 18.957 |
| ChatGLM3 [45] | 0.458 | 0.455 | 3.270 | 14.915 |
| ChatGLM3+Ours | 0.472 | 0.505 | 7.974 | 19.302 |
D. Evaluation of Clinical Question Answering
To evaluate the efficacy of clinical question answering, we adopt a comprehensive approach that combines quantitative analysis (through testing on the CMExam dataset) and qualitative analysis (by posing potential questions from real-world scenarios). This enables a thorough examination and provides insight into how the system functions in a practical context.
In Table VI, we choose ChatGLM2 and ChatGLM3 as the LLM and present the performance of multi-choice selection and reason explanation, with and without our proposed knowledge retrieval method, on a subset of the CMExam dataset (see Sec. IV-A). It is important to note that the multi-choice selections are generated directly by the LLM via prompt design, without the intervention of any text encoders. We observe that our proposed method consistently improves question-answering performance for both ChatGLM2 and ChatGLM3. Interestingly, the generation of reason explanations is enhanced significantly more than multi-choice selection. For ChatGLM2, the ROUGE-L score improves from 11.396 to 18.957, and for ChatGLM3 from 14.915 to 19.302, whereas the improvement in multi-choice selection is less pronounced. We hypothesize that this discrepancy arises because the LLM may occasionally select the correct answer based on unrigorous or even incorrect reasoning; in such cases, the LLM's reliability is compromised. With our proposed method, the reasoning behind the correct answer becomes more convincing, enabling reliable interaction for patients in need.
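For reference, ROUGE-L scores the quality of a generated explanation by the longest common subsequence (LCS) of tokens it shares with the reference answer. A minimal self-contained sketch of the standard LCS-based F-score (whitespace tokenization; published implementations add stemming and other normalization) looks like this:

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L F-score between a candidate and a reference string."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS preserves word order without requiring contiguity, ROUGE-L rewards explanations that follow the reference's reasoning structure, which is why it is a natural choice for the reason-explanation track.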
For qualitative analysis, the effectiveness of our proposed LLM-based knowledge retrieval method is compared with a baseline that does not rely on an LLM. Such methods tend to follow the paradigm of LangChain [59]: the text is divided into paragraphs, and sentence transformers are then used to compare the similarity between the user query and each paragraph. This kind of knowledge retrieval can hardly handle a professional medical knowledge database. We select the implementation in [59] as the baseline and compare it with our proposed method in Table VII and Table VIII, where "K" and "A" denote "retrieved knowledge" and "answer from the LLM", respectively. Note that ChatGPT is the default LLM in this qualitative analysis.
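The baseline paradigm described above can be sketched in a few lines. This is an illustrative stand-in, not the actual implementation of [59]: a real system would use a sentence-transformer encoder, whereas here `embed` is a bag-of-words count vector so the sketch stays self-contained, and the retrieval step (embed every paragraph, return the one most similar to the query) is the same:

```python
import math
from collections import Counter

def embed(text):
    # stand-in for a sentence-transformer embedding: bag-of-words counts
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, paragraphs):
    # baseline retrieval: return the single most similar paragraph
    q = embed(query)
    return max(paragraphs, key=lambda p: cosine(q, embed(p)))
```

The weakness the paper points out is visible in this structure: similarity is computed purely over surface vocabulary, so a query that shares few terms with the correct paragraph (or shares terms with a wrong medical topic) retrieves the wrong section.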
In accordance with the structure of the Merck Manual, each medical topic is divided into five main sections: "abstract", "symptoms and signs", "diagnosis", "treatment", and "prognosis", as shown in Fig. 4. Therefore, to assess the effectiveness of knowledge retrieval, all the questions presented in Table VII and Table VIII are implicitly associated with certain sections. If the retrieved knowledge comes from the corresponding section, we label it as "related". If the knowledge is retrieved from a different section but still belongs to the same medical topic, we label it as "partially related". Otherwise, if the knowledge comes from a different medical topic altogether, we consider it "unrelated". For example, in the case of Q1 ("Is pleural effusion serious?"), the target section is "symptoms and signs". Unfortunately, the baseline method retrieves information from the "abstract" section, which is only "partially related" and cannot directly provide a reliable answer. In the case of Q3, we ask for the treatment of periodontitis. Our proposed method successfully retrieves the "treatment" section of periodontitis, while the baseline method mistakenly retrieves information about palmoplantar keratodermas; as a result, we label it as "unrelated". We hypothesize that this mistake occurs because palmoplantar keratodermas can lead to secondary infections of periodontitis. These results clearly illustrate the advantage of our method. In contrast, [59] struggles to identify relevant medical topics due to its limited vocabulary size and understanding. This issue can worsen when the user's question does not explicitly point to any medical topic or involves multiple medical entities.
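The three-way labeling rule above is mechanical once each retrieved passage is tagged with its (topic, section) pair, and can be sketched directly (the tuple representation is an assumption for illustration; the paper labels results manually):

```python
def relevance(retrieved, target):
    """Label retrieved knowledge against the question's target.

    retrieved/target are (topic, section) pairs,
    e.g. ("periodontitis", "treatment").
    """
    if retrieved == target:
        return "related"           # same topic, same section
    if retrieved[0] == target[0]:
        return "partially related"  # same topic, different section
    return "unrelated"              # different medical topic
```

Running it on the paper's two examples: Q1's baseline retrieval, ("pleural effusion", "abstract") against target ("pleural effusion", "symptoms and signs"), is "partially related"; Q3's baseline retrieval from palmoplantar keratodermas against the periodontitis "treatment" target is "unrelated".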
TABLE VII COMPARISON BETWEEN KNOWLEDGE RETRIEVAL METHODS. CHATGPT IS SELECTED AS THE DEFAULT LLM. COMPLETELY UNRELATED KNOWLEDGE IS MARKED IN RED, PARTIALLY RELATED KNOWLEDGE IS MARKED IN PURPLE. RELATED AND IMPORTANT KNOWLEDGE IS IN GREEN.
Q1: Is pleural effusion serious?

K1 (Baseline): Under normal circumstances, a thin layer of pleural fluid (10-20 mL) covers the surfaces of the visceral and parietal pleura, with a composition based on laboratory characteristics of the fluid (see criteria for diagnosing exudative pleural effusions in the table). Regardless of unilateral or bilateral involvement, leaky effusions usually require little investigation and can be treated, while exudative effusions require identification of the underlying cause. Various factors can cause pleural effusions (see the table for causes of pleural effusions). Leaky pleural effusions are caused by an increase in systemic circulatory hydrostatic pressure and a decrease in plasma colloid osmotic pressure.

K1 (Ours): Some pleural effusions are asymptomatic and are discovered incidentally during physical examination or on chest x-ray. Many cause dyspnea, pleuritic chest pain, or both. Pleuritic chest pain, a vague discomfort or sharp pain that worsens during inspiration, indicates inflammation of the parietal pleura. Pain is usually felt over the inflamed site, but referred pain is possible. The posterior and peripheral portions of the diaphragmatic pleura are supplied … reveals absent tactile fremitus, dullness to percussion, and decreased breath sounds on the side of the effusion. These findings can also be caused by pleural thickening. Pericardial rub is best heard over the left border of the sternum in the 3rd and 4th intercostal spaces, is characteristically a to-and-fro sound synchronous with the heartbeat, and is not influenced significantly by respiration. Sensitivity and specificity of the physical examination for detecting effusion are probably low.

A1 (ChatGPT Only): Pleural effusion can be serious depending on the underlying cause and the amount of fluid accumulated in the pleural space. It can be a symptom of various medical conditions such as congestive heart failure, pneumonia, cancer, or kidney disease. If left untreated, pleural effusion can lead to breathing difficulties and complications. It's important to consult with a healthcare provider for proper diagnosis and treatment.

A1 (Ours): The severity of pleural effusion needs to be determined based on the patient's specific situation. Generally, a large amount of effusion can cause symptoms such as dyspnea and chest pain, but some patients may not have obvious symptoms. Physical examination can reveal reduced breath sounds on the effusion side, dullness to percussion, and possibly pleural friction rub. However, the sensitivity and specificity of physical examination for pleural effusion are both low, so clinical symptoms and imaging examinations need to be combined to determine the severity of the effusion and the treatment plan. Common causes of pleural effusion include heart failure, pneumonia, and tuberculosis.

Q2: What should I do after getting lung atelectasis?

K2 (Baseline): … do not recommend oral or intravenous corticosteroids because they have not shown efficacy in metal fume fever. Patients with pre-existing lung disease (such as asthma or chronic obstructive pulmonary disease) rarely require treatment and hospitalization due to worsening of pre-existing conditions. If severe cases of either syndrome cause respiratory distress, treatment methods used for ARDS (such as mechanical ventilation and positive end-expiratory pressure [PEEP]) may also be used. … Atelectasis is collapse of lung tissue with loss of volume. Patients may have dyspnea or respiratory failure if atelectasis is extensive. They may also develop pneumonia. Atelectasis is usually asymptomatic, but hypoxemia and pleuritic chest pain may be present in certain cases. Diagnosis is by chest x-ray. Treatment includes maintaining coughing and deep breathing and treating the cause.

K2 (Ours): Maximizing cough and deep breathing. If obstruction by tumor or foreign body is suspected, bronchoscopy. Evidence for the efficacy of most treatments for atelectasis is weak or absent. Nonetheless, commonly recommended measures include chest physiotherapy to help maintain ventilation and clearance of secretions, and encouragement of lung expansion techniques such as directed cough, deep breathing exercises, and use of an incentive spirometer. In ambulatory patients, exercise (e.g., walking) is a desirable way to promote deep breathing. For patients who are not intubated and do not have excessive secretions, continuous positive airway pressure may help. For patients who are intubated and mechanically ventilated, positive end-expiratory pressure and/or higher tidal volume ventilation may help. For mucus plugging, nebulized dornase alfa and sometimes bronchodilators are tried. N-Acetylcysteine is usually avoided because it can cause bronchoconstriction. If …

A2 (ChatGPT Only): I'm not a doctor, but I can provide you with general information on lung atelectasis. It's important to follow your healthcare provider's advice for your specific situation. In general, treatment for lung atelectasis may involve addressing the underlying cause, such as removing any blockages … such as deep breathing exercises, chest physiotherapy, or using a device called a spirometer. Make sure to follow up with your healthcare provider for proper diagnosis and treatment.

A2 (Ours): … positive airway pressure, etc. However, the evidence for the effectiveness of these treatments is not sufficient, so the treatment plan needs to be developed according to the specific situation of the patient. If there is suspicion of tumor or foreign body obstruction, bronchoscopy examination is needed. Opioid drugs can be used to relieve severe pleural pain, but cough suppressant drugs should be avoided. For the causes of pulmonary atelectasis such as mucus plugs, foreign bodies, tumors, masses, pleural effusions, etc., targeted treatments should be used.
We also demonstrate ChatGPT's answers without additional knowledge. The results reveal that ChatGPT's responses are less useful and accurate. Specifically, for the first question, ChatGPT fails to elaborate that the severity of pleural effusion depends on the specific circumstances, instead merely stating that untreated pleural effusion can lead to complications; this omission prevents a thorough understanding of the condition. Furthermore, in Q4, ChatGPT provides an extensive list of possible tests for the patient, which can cause unnecessary anxiety and expense. In contrast, our reliable interaction highlights that an MRI is not imperative for initial assessment and should only be considered if knee pain does not improve after several weeks of conservative treatment. This clarification offers valuable guidance and prevents unwarranted medical interventions.
TABLE VIII COMPARISON BETWEEN KNOWLEDGE RETRIEVAL METHODS. CHATGPT IS SELECTED AS THE DEFAULT LLM. COMPLETELY UNRELATED KNOWLEDGE IS MARKED IN RED, PARTIALLY RELATED KNOWLEDGE IS MARKED IN PURPLE. RELATED AND IMPORTANT KNOWLEDGE IS IN GREEN.
Q3: How to treat periodontitis? Is taking medication enough?

K3 (Baseline): … syndrome: autosomal recessive inheritance, onset within 6 months after birth. Severe periodontitis can lead to tooth loss. Vohwinkel syndrome: autosomal …

K3 (Ours): … clean. It may include chlorhexidine swabs or rinses. A therapist should help teach the patient how to do these procedures. The patient is re-evaluated after 3 … access for scaling and planing of deeper parts of the root. If deeper pockets persist, systemic antibiotics can be used. A common regimen is amoxicillin 500 mg orally 3 times a day for 10 days. In addition, a gel containing doxycycline or microspheres of minocycline can be placed into isolated recalcitrant pockets. These drugs are resorbed in 2 weeks. Another approach is to surgically eliminate the pocket and recontour the bone (pocket reduction/elimination surgery) so that the patient can clean the depth of the normal crevice (sulcus) between the tooth and gingiva. In certain patients, regenerative surgery and bone grafting are done to encourage alveolar bone growth. Splinting of loose teeth and selective reshaping of tooth surfaces to eliminate traumatic occlusion may be necessary. Patients with necrotizing ulcerative periodontitis due to HIV (HIV-associated periodontitis) respond to combined treatment with scaling and planing, irrigation of the sulcus with povidone-iodine (which the dentist applies with a syringe), regular use of chlorhexidine mouth rinses, and systemic antibiotics, usually metronidazole 250 mg orally 3 times a day for 14 days. Localized aggressive periodontitis requires periodontal surgery plus oral antibiotics (e.g., amoxicillin 500 mg 4 times a day or metronidazole 250 mg 3 times a day for 14 days).

A3 (ChatGPT Only): Treating periodontitis usually involves a combination of professional dental treatments. Medication alone may not be enough to fully treat periodontitis. Professional dental treatments may include scaling and root planing to remove plaque and tartar from the teeth and roots, as well as possibly surgical procedures to correct any damage to the gums and bone. It's important to follow your dentist's recommendations closely and attend regular dental …

A3 (Ours): The treatment of periodontitis requires a comprehensive approach that considers risk factors, scaling and root planing, oral antibiotics, and local hygiene. If deep periodontal pockets are present, systemic antibiotics may be used, with amoxicillin 500 mg orally, three times a day for 10 days being a commonly used regimen. The second stage of treatment involves surgical removal of periodontal pockets and bone reshaping to allow the patient to self-clean the base of the gingival sulcus. Localized aggressive periodontitis requires periodontal surgery in combination with oral antibiotics. At the same time, systemic factors such as poor oral hygiene, diabetes, and smoking must be controlled to improve the prognosis. Therefore, drug therapy alone is not sufficient and comprehensive treatment is necessary.

Q4: What tests should be done for knee pain?

K4 (Baseline): Patellar tendon tear: the patella is palpable and displaced upwards (high-riding patella). Transverse patellar fracture: a gap between the two pieces of the patella may be palpable. However, due to significant local swelling, these symptoms may be mistaken for knee ligament injury with joint effusion. If a patient presents with knee swelling and pain after injury, the clinician will ask the patient to sit down and attempt to extend the injured leg to test knee extension actively, or lie flat and elevate the injured leg. After knee swelling following trauma, remember to test the knee's ability for independent extension. Routine knee joint X-ray examination should be performed. Knee X-rays can show a high or low position of the patella. X-rays can reveal dislocation or …

K4 (Ours): Not every patient requires x-rays. However, anteroposterior, lateral, and oblique x-rays are often taken to exclude fractures. The Ottawa knee rules are used to limit x-rays to patients more likely to have a fracture that requires specific treatment. X-rays should be taken only if one of the following is present: age > 55; isolated tenderness of the patella (with no other bone tenderness of the knee); tenderness of the fibular head; inability to flex the knee to 90°; inability to bear weight both immediately and in the emergency department for 4 steps (with or without limping). MRI is usually not needed at the initial evaluation. A reasonable approach is to do MRI if symptoms do not resolve after a few weeks of conservative management. However, MRI is often done when severe injury or significant intra-articular injury is suspected or cannot be otherwise excluded. Other tests may be done to check for related injuries: arteriography or CT angiography to check for suspected arterial injuries; electromyography and/or nerve conduction studies, which are rarely done immediately but more typically when nerve symptoms persist weeks to months after the injury.

A4 (ChatGPT Only): When diagnosing knee pain, there are several tests that healthcare professionals may recommend to determine the underlying cause. These tests may include: (1) Physical examination: a healthcare provider will assess your knee by examining its range of motion, stability, and tenderness. (2) X-rays: X-rays can help identify issues such as fractures, arthritis, or bone abnormalities in the knee. (3) MRI (magnetic resonance imaging): MRI scans provide detailed images of soft tissues like ligaments, tendons, and cartilage, which can help diagnose conditions such as ligament tears or cartilage damage. (4) CT scan (computed tomography): CT scans can provide detailed images of the bones and joints in the knee, helping to diagnose fractures or bone abnormalities. (5) Ultrasound: ultrasound imaging can help evaluate soft tissue structures like tendons and ligaments for injuries or inflammation. (6) Arthroscopy: in some cases, a minimally invasive procedure called arthroscopy may be recommended to directly visualize and diagnose issues within the knee …
We believe that this comparison could further highlight the effectiveness of our proposed reliable interaction.
V. CONCLUSION
In this work, we have developed ChatCAD+, a universal and reliable interactive CAD system that accepts images from different domains as valid input. We expect the proposed paradigm of universal CAD networks to motivate advances in cross-domain modeling for medical imaging. Furthermore, hierarchical in-context learning enhances the report generation ability of ChatCAD+ in a training-free manner, letting it act more like a radiologist. The LLM-based knowledge retrieval method enables the LLM to search by itself and utilize an external medical knowledge database to provide convincing medical advice. Currently, ChatCAD+ uses the Merck Manuals as its primary external knowledge database, which may be considered limited in scope. However, we are confident in the versatility of the proposed LLM-based knowledge retrieval method, as it can be readily adapted to incorporate other databases; by doing so, we anticipate that the reliability of ChatCAD+ will be further enhanced. Several limitations must be acknowledged. First, while we have succeeded in enhancing the report generation capacity of our model to closely resemble that of a radiologist, it relies on an additional report database. Second, the knowledge retrieval method we employed is relatively slow, as its efficacy is closely tied to the response speed of the OpenAI API.
These limitations provide an opportunity for future research to address them and improve upon the current findings.
