Advanced prompting as a catalyst: Empowering large language models in the management of gastrointestinal cancers

高级提示作为催化剂：赋能大语言模型管理胃肠癌

Jiajia Yuan,1,8 Peng Bao,2,8 Zifan Chen,2,8 Mingze Yuan,2,8 Jie Zhao,3,7 Jiahua Pan,7 Yi Xie,1 Yanshuo Cao,1 Yakun Wang,1 Zhenghang Wang,1 Zhihao Lu, Xiaotian Zhang,1 Jian Li,1 Lei Ma,6 Yang Chen,1,* Li Zhang,2,6,* Lin Shen,1,* and Bin Dong4,5,6,7,* * Correspondence: yang_ chen@bjcancer.org (Y.C.); zhang li pku@pku.edu.cn (L.Z.); shenlin@bjmu.edu.cn (L.S.); dongbin@math.pku.edu.cn (B.D.) Received: July 24, 2023; Accepted: August 8, 2023; Published Online: August 14, 2023; https://doi.org/10.59717/j.xinn-med.2023.100019 $\circledcirc$ 2023 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creative commons.org/licenses/by-nc-nd/4.0/).

Jiajia Yuan,1,8 Peng Bao,2,8 Zifan Chen,2,8 Mingze Yuan,2,8 Jie Zhao,3,7 Jiahua Pan,7 Yi Xie,1 Yanshuo Cao,1 Yakun Wang,1 Zhenghang Wang,1 Zhihao Lu, Xiaotian Zhang,1 Jian Li,1 Lei Ma,6 Yang Chen,1,* Li Zhang,2,6,* Lin Shen,1,* and Bin Dong4,5,6,7,* * 通讯作者: yang_ chen@bjcancer.org (Y.C.); zhang li pku@pku.edu.cn (L.Z.); shenlin@bjmu.edu.cn (L.S.); dongbin@math.pku.edu.cn (B.D.) 收稿日期: 2023年7月24日; 接受日期: 2023年8月8日; 在线发表日期: 2023年8月14日; https://doi.org/10.59717/j.xinn-med.2023.100019 $\circledcirc$ 2023 作者。本文是一篇基于CC BY-NC-ND许可协议 (http://creativecommons.org/licenses/by-nc-nd/4.0/) 的开放获取文章。

GRAPHICAL ABSTRACT

图文摘要

PUBLIC SUMMARY

公开摘要

Advanced prompting as a catalyst: Empowering large language models in the management of gastrointestinal cancers

高级提示作为催化剂：赋能大语言模型在胃肠癌管理中的应用

Jiajia Yuan,1,8 Peng Bao,2,8 Zifan Chen,2,8 Mingze Yuan,2,8 Jie Zhao,3,7 Jiahua Pan,7 Yi Xie,1 Yanshuo Cao,1 Yakun Wang,1 Zhenghang Wang,1 Zhihao Lu Xiaotian Zhang,1 Jian Li,1 Lei Ma,6 Yang Chen,1,* Li Zhang,2,6,* Lin Shen,1,* and Bin Dong4,5,6,7,*

Large Language Models' (LLMs) performance in healthcare can be significantly impacted by prompt engineering. However, the area of study remains relatively uncharted in gastrointestinal oncology until now. Our research delves into this unexplored territory, investigating the efficacy of varied prompting strategies, including simple prompts, templated prompts, incontext learning (ICL), and multi-round iterative questioning, for optimizing the performance of LLMs within a medical setting. We develop a comprehensive evaluation system to assess the performance of LLMs across multiple dimensions. This robust evaluation system ensures a thorough assessment of the LLMs' capabilities in the field of medicine. Our findings suggest a positive relationship between the comprehensiveness of the prompts and the LLMs' performance. Notably, the multi-round strategy, which is characterized by iterative question-and-answer rounds, consistently yields the best results. ICL, a strategy that capitalizes on interrelated contextual learning, also displays significant promise, surpassing the outcomes achieved with simpler prompts. The research underscores the potential of advanced prompt engineering and iterative learning approaches for boosting the applicability of LLMs in healthcare. We recommend that additional research be conducted to refine these strategies and investigate their potential integration, to truly harness the full potential of LLMs in medical applications.

大语言模型(LLM)在医疗领域的表现受提示工程(prompt engineering)影响显著。然而在胃肠肿瘤学领域，相关研究至今仍处于探索阶段。本研究深入这一空白领域，系统评估了简单提示、模板化提示、上下文学习(ICL)以及多轮迭代提问等不同提示策略对优化大语言模型医疗场景表现的效果。我们开发了一套多维度的综合评估体系，确保全面衡量大语言模型的医学能力。研究发现提示的完整性与模型表现呈正相关，其中以问答循环为特征的多轮策略始终表现最佳。利用关联上下文学习的ICL策略也展现出显著优势，其效果远超简单提示。本研究证实了高级提示工程与迭代学习策略对提升大语言模型医疗应用价值的潜力，建议通过进一步研究优化这些策略并探索其协同效应，以充分释放大语言模型在医学应用中的潜能。

INTRODUCTION

引言

Large Language Models (LLMs), exemplified by cutting-edge architectures like GPT-4,1 have demonstrated considerable potential in transforming healthcare delivery2-4 and competency in medical examinations.5 This influence is manifested across various healthcare sectors, including online patient interaction,6 preventive oncology,7-9 neuro psychiatry,10 dermatology,11 and aesthetic surgery consultation,12,13 underscoring their remarkable versatility. However, the application of LLMs such as GPT-4 in digestive system cancer treatment remains an under explored area. The complexities inherent to this field, from patient consultation, and diagnosis to treatment planning and follow-up care, pose formidable challenges for LLMs. Additionally, the existing body of research2,6,7 primarily evaluates LLMs' responses to common medical inquiries via rudimentary prompting, which may not fully leverage their potential in medical settings. This highlights the need for a more comprehensive assessment of GPT-4's capability to provide personalized cancer treatment recommendations via sophisticated prompts.

大语言模型 (LLMs) ，以 GPT-4 等前沿架构为例，已展现出变革医疗服务的巨大潜力[2-4] ，并在医学考试中表现出竞争力[5] 。其影响力覆盖在线患者交互[6] 、肿瘤预防[7-9] 、神经精神医学[10] 、皮肤病学[11] 以及整形外科咨询[12,13] 等多个医疗领域，凸显了其卓越的多功能性。然而，GPT-4 等大语言模型在消化系统癌症治疗中的应用仍属待开发领域。该领域固有的复杂性——从患者问诊、诊断到治疗方案制定与随访护理——对大语言模型构成了严峻挑战。此外，现有研究[2,6,7] 主要通过基础提示词评估大语言模型对常规医学咨询的应答，这可能未充分释放其在医疗场景中的潜力。这凸显了需要通过更复杂的提示词设计，对 GPT-4 提供个性化癌症治疗建议的能力进行更全面评估的必要性。

To harness the full potential of LLMs, it is crucial to employ effective prompt engineering.14-19 Prompt engineering, a process of creating, testing, and optimizing input prompts, serves as a crucial tool in controlling and enhancing interactions with LLMs. Various techniques such as in-context learning,15 retrieval-augmented generation,16 chain-of-thought,17 and least-tomost prompting 18 have been shown to significantly improve the performance of LLMs in tasks demanding logical thinking and reasoning. In-context learning offers models a few demonstrations before attempting a task, while retrieval-augmented generation enhances this process by retrieving relevant examples from a given database. Chain-of-thought prompting improves LLMs' reasoning ability by directing them to generate a series of intermediate steps toward a solution, and least-to-most prompting dissects complex problems into simpler sub-problems to be solved sequentially. Intuitively, these techniques could effectively boost LLMs' performance in complex medical tasks, including cancer treatment recommendations.

要充分发挥大语言模型 (LLM) 的潜力，关键在于运用有效的提示工程 [14-19]。提示工程作为控制和优化与大语言模型交互的关键工具，其过程包括创建、测试和优化输入提示。诸如上下文学习 [15]、检索增强生成 [16]、思维链 [17] 和最少到最多提示 [18] 等技术已被证明能显著提升大语言模型在需要逻辑思维和推理任务中的表现。上下文学习会在模型执行任务前提供少量示例，而检索增强生成则通过从给定数据库中检索相关案例来增强这一过程。思维链提示通过引导模型生成一系列解决问题的中间步骤来提升其推理能力，最少到最多提示则将复杂问题拆解为可依次解决的简单子问题。直观而言，这些技术能有效提升大语言模型在复杂医疗任务（包括癌症治疗建议）中的表现。

In this study, we aimed to unleash GPT-4’s potential to provide personalized digestive system cancer treatment plans through prompt engineering. Inspired by the thinking, reasoning, and action processes of digestive oncologists, we initially conceived the iterative procedure of prompt engineering as a method of amassing information regarding gastrointestinal tumors within a distinct storage of knowledge and in turn, educating the LLM. However, these knowledge repositories, when embedded in rudimentary prompts, are often devoid of substantial content, thus limiting their potential to effectively guide LLMs. Consequently, we established an empirically effective multi-step prompt template consisting of: (i) declaring the role, this process involves assigning a particular role to GPT-4 that emulates a real-world professional or function; (ii) stating the main task, this step essentially provides GPT-4 with a clear directive of what it is required to accomplish; (iii) declaring the workflow, which we view as a generalized chain of thought that allows GPT-4 to approach problem-solving or deliver answers in an organized, step-bystep manner; (iv) specifying constraints, it involves defining the boundaries within which GPT-4 should operate. Then, we iterative ly refined this template to align GPT-4’s responses with physicians’ requirements and added elements to generate comforting responses for patients. An experienced oncology specialist subsequently interacted with GPT-4 over multiple rounds to further guide and optimize the recommended treatment plans. Furthermore, motivated by the exemplar-based teaching approach in medicine, we also assessed the impact of in-context learning by providing GPT-4 with examples of ideal treatment suggestions through document retrieval. We evaluated the performance of diverse prompt engineering strategies on 43 case reports, encompassing a wide range of digestive system cancer types, utilizing a clinically standardized evaluation metric.

本研究旨在通过提示工程释放GPT-4为消化系统癌症提供个性化治疗方案的潜力。基于消化肿瘤科医师的思维、推理和行动流程，我们最初将提示工程的迭代过程构想为：在独立知识库中积累胃肠道肿瘤信息并以此训练大语言模型。但发现当这些知识库被嵌入基础提示时，往往缺乏实质性内容，难以有效引导大语言模型。因此，我们建立了一个经验证有效的多步骤提示模板，包含：(i) 角色声明——为GPT-4分配模拟现实专业人士的特定角色；(ii) 任务声明——明确告知GPT-4需要完成的目标；(iii) 流程声明——构建通用思维链，使GPT-4能分步骤解决问题；(iv) 约束声明——界定GPT-4的操作边界。通过迭代优化该模板，我们使GPT-4的输出更符合医师需求，并添加了生成患者安抚话术的模块。随后由资深肿瘤专家与GPT-4进行多轮交互，进一步优化治疗方案推荐。受医学案例教学法启发，我们还通过文档检索为GPT-4提供理想治疗建议范例，评估了上下文学习的效果。最终采用临床标准化评估指标，在43例涵盖多种消化系统癌症的病例报告上测试了不同提示工程策略的效能。

In summary, we are the first to conduct a comprehensive assessment of prompt engineering on GPT-4’s ability to provide personalized digestive system cancer treatment recommendations, as per our comprehensive search in the existing literature. We developed a sophisticated prompt template to generate personalized cancer treatment plans that emphasize patient comfort, which significantly outperforms rudimentary prompts and offers valuable insights for prompt design in the medical domain. We evaluated various prompt engineering strategies, including rudimentary prompts, templated prompts, in-context learning, and multi-round interaction, using a clinically standardized metric. Our results highlight the promise of prompt engineering for medical applications of LLMs.

总结来说，据我们现有文献的全面检索，我们是首个对GPT-4通过提示工程(prompt engineering)提供个性化消化系统癌症治疗建议的能力进行全面评估的研究。我们开发了一套复杂的提示模板，用于生成强调患者舒适度的个性化癌症治疗方案，其表现显著优于基础提示，并为医疗领域的提示设计提供了宝贵见解。我们采用临床标准化指标评估了多种提示工程策略，包括基础提示、模板化提示、上下文学习以及多轮交互。研究结果凸显了提示工程在大语言模型医疗应用中的潜力。

Figure 1. An illustration showcasing the effects of various prompting strategies on Language Learning Models' (LLMs') performance, mediated by a 'storage of knowledge' Simple prompts leave this storage empty, offering no enhancement for GI tumor decision-making. Conversely, templated prompts and ICL populate the storage with role assumptions and case examples, respectively, helping to standardize LLMs' output, thus improving performance. The multi-round interaction strategy fills the storage with the complete physician-LLM dialogue, potentially allowing more accurate comprehension and utilization of decision-assisting information.

图 1: 展示不同提示策略通过"知识存储"中介作用对大语言模型(LLMs)性能的影响示意图。简单提示使该存储保持空白，无法提升胃肠道肿瘤决策能力；模板化提示和上下文学习(ICL)分别通过角色预设和案例填充存储，帮助标准化大语言模型输出；多轮交互策略则通过完整的医生-大语言模型对话填满存储，可能实现更精准的决策辅助信息理解与运用。

MATERIALS AND METHODS

材料与方法

Materials

材料

In this study, we propose an innovative methodology to augment the learning capability of LLMs by incorporating multifaceted prompt design and dynamic training approaches. As shown in Figure 1, diverse prompt designs can be perceived as varying modifications to the storage of knowledge, encompassing manual alterations meticulously orchestrated based on GI tumor expertise, automatic modifications that explore the hospital's preexisting data for analogous cases as pedagogical instances for the LLM, and dynamic modifications consistently interrogated and addressed during the deployment of the consultation process. Consequently, the design of the prompts was executed as follows: Initially, the models are subjected to a more sophisticated introduction prompt, intricately crafted with complex semantic and structural nuances, thereby priming the LLMs to comprehend and respond to intricate queries. Furthermore, an advanced method of incontext learning is introduced, encouraging the models to extract knowledge and patterns from various contexts rather than individual sentences, fostering a more comprehensive understanding of the text. To accommodate evolving data patterns, we also incorporate online learning techniques, enabling the LLMs to continually learn and adapt from real-time, dynamic data. Lastly, we implement an iterative feedback loop through multi-round question-and-answer sessions, reinforcing the model’s ability to comprehend, retain, and apply information over successive interactions. This combination of sophisticated prompt architecture, in-context learning, online learning, and iterative interactions aims to substantially enhance the LLM’s predictive and interpretative capabilities, pushing the frontiers of AI language understanding. We used publicly available medical licensing examination cases, oncology residency and attending physician exam cases as text source.

在本研究中，我们提出了一种创新方法，通过整合多层面提示设计和动态训练策略来增强大语言模型 (LLM) 的学习能力。如图 1 所示，多样化的提示设计可视为对知识存储的不同修改方式，包括：基于胃肠道肿瘤专业知识精心设计的手动修改、通过挖掘医院既有数据寻找类似病例作为教学范例的自动修改，以及在咨询流程部署过程中持续交互解决的动态修改。具体提示设计实施如下：首先采用经过复杂语义和结构设计的精妙引导提示，使大语言模型具备处理复杂查询的能力；其次引入进阶的上下文学习方法，促使模型从多语境而非单句中提取知识模式；为适应动态数据变化，我们整合在线学习技术使模型能持续从实时数据中学习；最后通过多轮问答建立迭代反馈机制，强化模型在连续交互中的信息理解、保持与应用能力。这种融合精妙提示架构、上下文学习、在线学习和迭代交互的方案，旨在显著提升大语言模型的预测与解释能力，拓展人工智能语言理解的前沿边界。本研究采用公开的医师资格考试题库、肿瘤专科住院医师及主治医师考核病例作为文本数据源。

Templated prompts

模板化提示

Past studies have shown that a good use of different prompt engineering,17,20,21 as well as properly designed prompt templates 22 can significantly improve the problem-solving ability of large language models, and this phenomenon was similarly observed in our study. As shown in Figure S1, we developed our prompt template by adopting a four-pronged approach as follows:

过去的研究表明，合理运用不同的提示工程 (prompt engineering) [17,20,21] 以及精心设计的提示模板 [22] 能显著提升大语言模型的问题解决能力，这一现象在我们的研究中也得到了印证。如图 S1 所示，我们通过以下四步法构建了提示模板：

Declaring the role. Assigning a 'role' or 'identity' to large language models is one of the commonly used techniques for interacting with these models. Previous research22 supports that this method can effectively guide what type of output the models generate and what details they prioritize. In our study, we assigned the role of a digestive oncology specialist to GPT-4, emphasizing its range of skills that included clinical diagnosis, treatment, and communication techniques. We found this strategy successfully influenced GPT-4's behaviors, responses, and interaction styles to align with the expectations of the role.

声明角色。为大语言模型分配"角色"或"身份"是与这些模型交互的常用技术之一。先前研究[22]表明，该方法能有效引导模型生成何种输出及关注哪些细节。在我们的研究中，我们为GPT-4赋予了消化肿瘤学专家的角色，强调其临床诊断、治疗和沟通技巧等技能范围。我们发现该策略成功使GPT-4的行为、响应和交互风格符合该角色的预期。

Stating the main task. This approach essentially provides GPT-4 with a clear directive on what it is expected to accomplish. In our study, the primary task of our model is to deliver detailed and accurate advice to patients with digestive system cancers. This involves defining the central task that GPT-4 needs to perform. Given the context of our research, our model, acting as a digestive oncology specialist, is tasked with generating personalized treatment plans for digestive system cancer patients. By articulating the main task, we direct GPT-4's focus, streamline its reasoning process, and enhance its ability to produce task-specific, relevant, and actionable outputs. In addition, to enable GPT-4 to produce complex and con textually accurate responses, we've included a wide range of scenarios and contexts, from simple situations to the complexity of academic discourse in hospitals. We also encourage GPT-4 to link different pieces of information together. This approach aids GPT-4 in moving beyond simple pattern recognition, facilitat ing a deeper understanding when executing tasks.

阐明主要任务。该方法本质上为GPT-4提供了明确的执行指令。在我们的研究中，模型的核心任务是为消化系统癌症患者提供详尽准确的诊疗建议。这需要明确定义GPT-4需完成的核心工作。基于研究背景，我们的模型作为消化肿瘤专家，需为消化系统癌症患者生成个性化治疗方案。通过阐明主要任务，我们引导GPT-4聚焦目标、优化推理流程，并提升其生成任务相关、可执行输出的能力。此外，为使GPT-4能生成复杂且符合语境的响应，我们设置了从简单场景到医院学术讨论等不同复杂度的情境。同时鼓励GPT-4建立信息间的关联，这种方法能帮助模型超越简单模式识别，在执行任务时实现更深层次的理解。

Declaring the workflow. We have defined a comprehensive workflow in the prompt templates, which includes case analysis, clinical examination, scheduling examination, diagnosis and treatment, execution and adjustment of treatment, and follow-ups. This is also the general workflow of a professional digestive oncology specialist. We believe this represents a generalized chain of thought and many studies17,20,21,23 have already demonstrated that this approach can stimulate LLM's reasoning ability. We find this strategy ensures that GPT-4's output is more consistent and logical, using a planned, step-by-step approach to accomplish tasks, which is very similar to the process a human expert uses to solve problems. By structuring GPT-4's thinking in this way, we can effectively manage its output, improve overall consistency, and reduce the likelihood of generating irrelevant or erroneous information.

声明工作流程。我们已在提示模板中定义了一套完整的工作流程，包括病例分析、临床检查、安排检查、诊断与治疗、执行并调整治疗方案以及随访。这也是专业消化肿瘤科医师的通用工作流程。我们认为这代表了一种通用思维链，多项研究[17,20,21,23]已证明该方法能激发大语言模型的推理能力。该策略能确保GPT-4的输出更具一致性和逻辑性，通过规划好的分步方法完成任务，这与人类专家解决问题的过程高度相似。通过这样结构化GPT-4的思维，我们可以有效管控其输出，提升整体一致性，并降低生成无关或错误信息的可能性。

Specifying constraints. In this process, we've incorporated certain constraints into the prompt templates. We require GPT-4 not to make responses when uncertain or additional information is needed, but rather, it must first gather sufficient information. In addition, we require GPT-4 to provide detailed and correct guidance for a specific case, as GPT-4 tends to give general and non-specific answers that may not be wrong but lack specificity. This approach ensures that GPT-4 avoids generating responses that are undesirable or beyond its scope, thereby enhancing its effectiveness and minimizing potential deviations. We also advised GPT-4 to build a trusting doctor-patient relationship in a warm, humorous manner rather than in a cold and impersonal way when answering.

指定约束。在此过程中，我们已将特定约束融入提示模板。要求GPT-4在不确定或需要补充信息时不得直接回应，而必须先收集充分信息。此外，我们要求GPT-4针对具体案例提供详尽准确的指导，因其常给出笼统而非针对性的回答——这些回答虽非错误但缺乏针对性。该方法确保GPT-4避免生成超出能力范围或不理想的回应，从而提升有效性并减少潜在偏差。我们还建议GPT-4在应答时以温暖幽默的方式建立医患信任关系，而非采用冷漠疏离的态度。

In-context learning

上下文学习

In this study, we introduce an automated in-context learning (ICL) approach to refine GPT-4's capabilities, focusing on the integration of doctors' habits and cognition. This method assimilates insights drawn from analogous past cases and is comprised of three main components: firstly, transposing past patient conditions into a designated embedding space; secondly, gauging the similarity between the current condition and these archived cases to identify its $\mathsf{k}$ -nearest counterparts; and finally, building incontext learning prompts based on these identified cases. We provide a detailed exposition of these three components in the following:

在本研究中，我们引入了一种自动化上下文学习 (in-context learning, ICL) 方法，旨在提升 GPT-4 的能力，重点关注医生习惯与认知的融合。该方法整合了从类似历史病例中提取的洞见，并包含三个主要组成部分：首先，将既往患者病情映射到特定嵌入空间；其次，衡量当前病情与存档病例的相似度以确定其 $\mathsf{k}$ 近邻；最后，基于这些筛选病例构建上下文学习提示。以下我们将详细阐述这三个组成部分：

Encoding patient conditions using pre-trained chinese BERT model. A pre-trained Chinese BERT model in Hugging Face (https://hugging face. co/hfl/chinese-bert-wwm-ext), specifically the “hfl/chinese-bert-wwm-ext”, is utilized to translate patient conditions into a high-dimensional embedding space (768 dimensions in this study), capturing the context of the condition effectively. The BERT tokenizer is used to convert condition text into input vectors, which are then fed into the BERT model. Operating in a no-gradient update setting, the “poole r output” from the model serves as the sentence embedding for each patient condition.

使用预训练的中文BERT模型编码患者病情。采用Hugging Face平台(https://huggingface.co/hfl/chinese-bert-wwm-ext)上的预训练中文BERT模型"hfl/chinese-bert-wwm-ext"，将患者病情有效映射到高维嵌入空间(本研究采用768维)。通过BERT分词器将病情文本转换为输入向量后输入模型，在无梯度更新设置下，模型的"pooler output"作为每个患者病情的句子嵌入表示。

Calculation of cosine similarity and identification of k-nearest neighbors. Once the embeddings for all patient conditions have been computed, we calculate the cosine similarity between them to derive a similarity score. This metric provides a measure of the contextual similarity between different patient conditions. Based on these similarity scores, we identify up to k-nearest neighbors for each patient condition (with k being up to four depending on the token limitation of GPT-4).

计算余弦相似度并识别k近邻。在计算完所有患者病况的嵌入向量后，我们计算它们之间的余弦相似度以得出相似度分数。该指标用于衡量不同患者病况之间的上下文相似性。根据这些相似度分数，我们为每个患者病况识别最多k个最近邻（k最多为4，具体取决于GPT-4的token限制）。

Generation of in-context learning prompts. For each patient condition, we generate an enriched prompt that includes the top-k similar past cases and the corresponding doctor's suggestions. To ensure consistency and readability in these prompts, a pre-defined template is used: "As an experienced clinician, your responsibilities include understanding and analyzing patient information and chief complaints, […]. Now, let’s look at these examples: [...]. After analyzing these examples, here is a new patient: [...]. Please give specific treatment plan suggestions based on the above examples and relevant literature. (see Figures 4 & S6 for details).

生成上下文学习提示。针对每位患者的病情，我们会生成一个包含前k个相似历史病例及对应医生建议的增强提示。为确保这些提示的一致性和可读性，采用预定义模板："作为经验丰富的临床医生，您的职责包括理解和分析患者信息及主诉[...]。现在请看以下示例：[...]。分析这些案例后，这里有一位新患者：[...]。请根据上述案例及相关文献给出具体治疗方案建议（详见图4 和图S6）。

Metrics

指标

We have developed a unique set of metrics, drawing from those typically used for evaluating clinicians' examinations, to quantitatively assess the results generated by various methods. These metrics encompass six key aspects:

我们开发了一套独特的评估指标，借鉴了临床医生检查常用的标准，用于定量评估不同方法产生的结果。这些指标涵盖六个关键方面：

Understanding medical history (0-20): This metric assesses how accurately and comprehensively an LLM captures and interprets a patient's medical history. This includes consideration of the patient's previous diagnoses, surgeries, hospitalizations, allergies, family history, lifestyle, and other relevant information.
理解病史 (0-20): 该指标评估大语言模型 (LLM) 对患者病史的捕捉和解释是否准确全面。包括患者既往诊断、手术史、住院史、过敏史、家族史、生活方式及其他相关信息。
Diagnosis and differential diagnosis (0-20): This metric assesses the ability of the LLM to accurately diagnose the patient's condition based on the medical history. It includes both the primary diagnosis and any differential diagnoses.
诊断与鉴别诊断 (0-20): 该指标评估大语言模型根据病史准确诊断患者病情的能力，包括初步诊断及任何鉴别诊断。
Further examination and reason (0-10): This metric evaluates the appropriateness of any additional examinations suggested by the LLM. It measures not only whether the recommended examinations are suitable, but also if they are justified based on the patient's condition and symptoms. The LLM should also provide a clear rationale for why these additional examinations are needed.
进一步检查和原因 (0-10): 该指标评估大语言模型建议的任何额外检查是否恰当。它不仅衡量推荐的检查是否合适，还评估这些检查是否基于患者的病情和症状具有合理性。大语言模型还应明确说明为何需要进行这些额外检查。
Principles and plans of treatment (0-20): This metric evaluates the LLM's ability to propose a suitable treatment plan. The plan should be personalized for the patient, taking into account factors like age, overall health, potential side effects, and patient preferences.
治疗原则与方案 (0-20): 该指标评估大语言模型提出合适治疗方案的能力。方案应针对患者个性化定制，考虑年龄、整体健康状况、潜在副作用及患者偏好等因素。
Breadth and depth of results (0-20): This metric measures how comprehensively the LLM covers the scope of medical knowledge in its results (breadth), as well as how much detail it provides (depth). Breadth refers to the range of different topics or areas covered in the results, while depth refers to the level of detail or complexity within those topics.
结果的广度和深度 (0-20): 该指标衡量大语言模型 (LLM) 在其结果中覆盖医学知识范围的全面性 (广度) 以及提供的细节量 (深度)。广度指结果中涵盖的不同主题或领域的范围，深度指这些主题内部的细节或复杂程度。
Thinking and expressing ability (0-10): This is a measure of how effectively the LLM reasons and communicates its findings. Thinking refers to the LLM's ability to logically process and interpret data, make connections, draw conclusions, and anticipate potential outcomes. The expressing ability should not only be clear and accurate but also demonstrate empathy in line with a real clinician's interaction. This includes sensitivity to the patient's emotional state, using comforting and supportive language, and showing understanding and respect for the patient's experiences and concerns. By effectively incorporating empathy, the LLM can build trust, encourage open communication, and provide emotional support in addition to addressing physical health issues.
思维与表达能力 (0-10): 该指标评估大语言模型(LLM)推理和传达结论的有效性。思维指LLM逻辑处理与解读数据、建立关联、得出结论及预判潜在结果的能力。表达能力不仅需清晰准确，还应展现与真实临床医生互动相符的同理心，包括对患者情绪状态的敏感性、使用安抚性支持性语言，并体现对患者经历与诉求的理解与尊重。通过有效融入同理心，LLM能在解决身体健康问题之外建立信任、促进开放沟通并提供情感支持。

To gain a clearer understanding of performance based on the total scores, we have defined the following expertise levels:

为了更清晰地理解基于总分的表现，我们定义了以下专业水平等级：

Level A (90-100 points): Top-level expertise, capable of independently managing complex and rare cases, demonstrating exceptional skills and professional knowledge. 2. Level B (80-89 points): Experienced level, capable of handling most cases, but requires guidance for complex or rare cases. 3. Level C (70-79 points): Mid-level competence, capable of independently addressing common cases, requires guidance for complex ones. 4. Level D (60-69 points): Junior level, capable of handling some common cases, but requires close guidance for complex cases. 5. Level E (below 60 points): Initial training level, needs guidance from experienced clinicians in all aspects.
A级 (90-100分): 顶级专业水平，能够独立处理复杂罕见病例，展现出卓越技能和专业知识。
B级 (80-89分): 资深水平，能处理多数病例，但复杂或罕见病例需指导。
C级 (70-79分): 中级水平，可独立处理常见病例，复杂病例需指导。
D级 (60-69分): 初级水平，能处理部分常见病例，复杂病例需密切指导。
E级 (60分以下): 培训初期水平，所有方面均需资深临床医师指导。

RESULTS

结果

Templated evaluation

模板化评估

Figures 2 & S2 provide a comparison between our designed prompting template (Figure 2B) and the standard, direct prompting (Figure 2A) utilized by GPT-4. The findings underscore that the designed template for role assumption (Figure S1) can improve GPT-4 to make more complex decisions based on the patient's individual circumstances. In the provided example, our designed prompting can prioritize the control of disease progression, symptom relief, enhancement of life quality, and survival extension, instead of merely pursuing a cure unconditionally. Moreover, the template manifests an exceptional ability to interweave quality-of-life considerations within the treatment strategies and provides comprehensive guidance (Figure S3). It also underscores the significance of continuous patient assessment and the pursuit of innovative, custom treatments (Figure S4). As opposed to direct prompting, our designed prompting template possesses the ability to mimic the intricate treatment ideation process, enhancing GPT-4’s efficacy as a therapeutic advisory tool when acting as a senior oncologist.

图 2 和图 S2 展示了我们设计的提示模板 (图 2B) 与 GPT-4 使用的标准直接提示 (图 2A) 之间的对比。研究结果表明，角色假设设计模板 (图 S1) 能帮助 GPT-4 根据患者个体情况做出更复杂的决策。在给定示例中，我们设计的提示能优先考虑控制疾病进展、缓解症状、提升生活质量及延长生存期，而非无条件追求治愈。此外，该模板展现出将生活质量考量融入治疗策略的卓越能力，并提供全面指导 (图 S3)。同时强调了持续患者评估和探索创新定制治疗方案的重要性 (图 S4)。与直接提示相比，我们设计的提示模板能够模拟复杂的治疗构思过程，从而提升 GPT-4 作为资深肿瘤学家角色时的治疗建议工具效能。

Multi-round evaluation

多轮评估

Figures 3 & S5 illustrate an interaction with GPT-4 for cancer treatment advice. Initially, GPT-4 prematurely diagnosed the patient with late-stage cancer and proposed a treatment plan. However, this was inappropriate, given the necessity for a more accurate staging diagnosis for this patient. As highlighted in Figure 3, the clinician directed GPT-4 to offer a detailed staging diagnosis, subsequently pointing out its error. Following multiple questionand-answer interactions with the clinician, GPT-4 acknowledged its mistake and adjusted its response. It began by determining the cancer's stage, before suggesting a specific treatment plan. This revised response is not only more suitable for the patient but also provides her with hope. This multi-round interaction demonstrates the learning capability of large language models like GPT-4, highlighting their ability to quickly integrate human logical reasoning within the context of intricate medical scenarios.

图 3 和图 S5 展示了与 GPT-4 就癌症治疗建议进行的交互过程。最初，GPT-4 过早地将患者诊断为晚期癌症并提出了治疗方案。然而，考虑到该患者需要更精确的分期诊断，这一建议并不恰当。如图 3 所示，临床医生引导 GPT-4 提供详细的分期诊断，随后指出了其错误。经过与临床医生的多轮问答交互后，GPT-4 承认了错误并调整了回答。它首先确定了癌症的分期，然后才提出具体的治疗方案。这一修订后的回应不仅更符合患者情况，还为她带来了希望。这种多轮交互展示了大语言模型（如 GPT-4）的学习能力，突显了它们在复杂医疗场景中快速整合人类逻辑推理的能力。

A Direct prompting

直接提示法

Doctor

医生

Male, 68 years old. The patient was found to have a gastric mass in an external hospital. G astros copy showed that at 41-47cm from the incisors, there is an ulcerative tumor at the e soph ago gastric junction. Pathology: E soph ago gastric junction. Biopsy: moderately differentiated a de no carcinoma. Lauren classification: intestinal type. Immuno his to chemical results: HER2(0) [HER2 positive control $(3+)$ , HER2 negative control (0)]. CT: Thickening of the gastric wall, considered as Borrmann type 3 gastric cancer(cT4aN3a) ; Multiple filling defects in the splenic vein and portal vein, suggesting tumor thrombus. The tumor markers are normal. Considering the patient's multiple portal vein tumor thrombi, it is not suitable for clinical trial enrollment.

男性，68岁。外院发现胃部占位。胃镜检查示距门齿41-47cm处贲门胃底见溃疡型肿物。病理：贲门胃底。活检：中分化腺癌。Lauren分型：肠型。免疫组化结果：HER2(0) [HER2阳性对照$(3+)$，HER2阴性对照(0)]。CT：胃壁增厚，考虑Borrmann 3型胃癌(cT4aN3a)；脾静脉及门静脉多发充盈缺损，考虑癌栓。肿瘤标志物正常。考虑患者门静脉多发癌栓，不适合入组临床试验。

This is the information of a patient. Please search for relevant literature and provide specific treatment plan recommendations, and explain the reasons.

这是一位患者的资料。请检索相关文献并提供具体的治疗方案建议，同时说明理由。

This prompt is summarized into “{Sex}, {Age}, {Disease description}, {Question}.” and will be referred to in other content.

该提示总结为“{性别}, {年龄}, {疾病描述}, {问题}。”并将在其他内容中引用。

Based on the information provided, immuno his to chemistry shows HER2 positive. Due to the presence of multiple tumor thrombi in the portal vein, the patient is not suitable for clinical trial enrollment.

根据所提供的信息，免疫组化显示HER2阳性。由于门静脉存在多发癌栓，该患者不适合参加临床试验。

B Make role assumptions before prompting

B 在提示前预设角色假设

Doctor

医生

{Role declaration} {Definition of a Gastrointestinal Oncologist} {Skills of a Gastrointestinal Oncologist} {Work Process of a Gastrointestinal Oncologist} {Specific Requirements and Rules for the Response} {Sex}, {Age}, {Disease description}.

{角色声明} {胃肠肿瘤学家定义} {胃肠肿瘤学家的技能} {胃肠肿瘤学家的工作流程} {回应的具体要求与规则} {性别}, {年龄}, {疾病描述}。

Based on the information you provided, the patient has been diagnosed with moderately differentiated a de no carcinoma of the e soph ago gastric junction, with a pathological type of intestinal, HER2-negative. CT scans show the existence of tumor thrombus, which may suggest the disease is in the late stage. However, please note that a more detailed clinical data and expert evaluation are needed for a definite diagnosis. For patients with advanced gastric cancer of this type, systemic chemotherapy is often the first choice, such as the FOLFOX regimen (fluorouracil $^+$ cisplatin $^+$ ox a lip latin) or the SOX regimen (S-1 $^+$ ox a lip latin). Moreover, considering the HER2-negative pathological result, targeted drugs like tra stu zuma b may not be applicable here. Additionally, if the disease is well-controlled, palliative surgery can be considered to alleviate symptoms.

根据您提供的信息，患者被诊断为食管胃结合部中分化腺癌（adenocarcinoma），病理类型为肠型，HER2阴性。CT扫描显示存在瘤栓，可能提示疾病已进入晚期。但请注意，确诊仍需更详细的临床数据和专家评估。对于此类晚期胃癌患者，系统化疗通常是首选方案，例如FOLFOX方案（氟尿嘧啶+顺铂+奥沙利铂）或SOX方案（S-1+奥沙利铂）。此外，考虑到HER2阴性病理结果，曲妥珠单抗等靶向药物可能不适用。若病情控制良好，可考虑姑息性手术以缓解症状。

Regarding the tumor thrombus, we need to evaluate in detail whether there is a possibility for surgery or interventional treatment, or if anticoagulant therapy is needed. If the tumor thrombus causes severe clinical symptoms, surgery or interventional treatment may be considered, but this requires a comprehensive assessment of the patient's overall condition and other complications.

关于癌栓，我们需要详细评估是否存在手术或介入治疗的可能性，或者是否需要抗凝治疗。如果癌栓引起严重的临床症状，可以考虑手术或介入治疗，但这需要对患者的整体状况及其他并发症进行全面评估。

At the same time, it is also recommended to conduct relevant genetic testing, such as MLH1, MSH2, MSH6, PMS2, etc., because a family history of cancer may be related to hereditary cancer syndromes.

同时，也建议进行相关基因检测，例如 MLH1、MSH2、MSH6、PMS2 等，因为癌症家族史可能与遗传性癌症综合征相关。

This is a possible diagnosis and treatment plan based on the information I currently have. The actual treatment plan needs to be made based on the patient's specific situation and a comprehensive assessment by the medical team. Please discuss these suggestions with your medical team to develop the most suitable treatment plan for the patient.

这是基于我目前掌握的信息提出的可能诊疗方案。实际治疗方案需根据患者具体情况及医疗团队全面评估后制定。请与您的医疗团队讨论这些建议，为患者制定最合适的治疗方案。

Figure 2. Evaluation of prompting template design (A) Direct prompting. GPT-4 is directly provided with a patient's condition and asked to suggest a treatment plan. (B) The proposed templated prompting (make role assumptions before prompting). GPT-4 is initially primed with a specifically designed role-embedded prompt, ensuring the AI understands the task requirements and responsibilities. Subsequently, the patient's condition is provided, and GPT-4 is asked to suggest a treatment plan. Note that this figure is a reduced version of Figure S2.

图 2: 提示模板设计评估 (A) 直接提示。直接向GPT-4提供患者病情并要求其提出治疗方案。(B) 提出的模板化提示(在提示前进行角色假设)。GPT-4首先通过专门设计的角色嵌入提示进行初始化，确保AI理解任务要求和职责。随后提供患者病情，并要求GPT-4提出治疗方案。请注意，该图是图S2的简化版本。

ICL evaluation

ICL评估

As demonstrated in Tables 1-4, the performance of In-Context Learning (ICL) exceeded that of rudimentary prompting by a substantial margin across

如表 1-4 所示，上下文学习 (In-Context Learning, ICL) 的性能显著超越了基础提示方法。

various types of digestive system cancer treatments, with a notable difference of 13.4 points in overall performance. Figures 4 & S6 provide an illustrative comparison between in-context learning and rudimentary prompting.

多种消化系统癌症治疗方式，整体性能差异显著达13.4分。图4和图S6展示了上下文学习与基础提示法的对比示例。

Doctor

医生

Female, 76 years old. Poor appetite. G astros copy revealed a raised lesion with a concave surface at the cardia, extending to the gastric fundus (Siewert II type, a de no carcinoma), and coarse gastric mucosa consistent with Borrmann IV type gastric cancer presentation (adeno carcinoma). Pathology report: a small amount of poorly differentiated cancer (gastric body), poorly differentiated a de no carcinoma; Cmet $(1+)$ , EGFR(0), HER2(0), pMMR, pan-TRK(-), PD-L1 (CPS<1), EBER(-); (Cardia) moderately differentiated a de no carcinoma, intestinal type; Cmet $(2+)$ , EGFR( $^{1+)}$ , HER2(0), pMMR, pan-TRK(-), PD-L1 ( $.C P S=70$ ), EBER(-). Normal lab results, chest-abdominal-pelvic enhanced CT showed diffuse thickening of the entire stomach, lymph node metastasis between the liver and stomach. Currently, the patient's food intake has decreased, bowel movements are not abnormal, and she has lost about $5\mathsf{k g}$ in the past 2 months.

女性，76岁。食欲不振。胃镜检查显示贲门处有一隆起性病变，表面凹陷，延伸至胃底(Siewert II型，原发癌)，胃黏膜粗糙符合Borrmann IV型胃癌表现(腺癌)。病理报告：(胃体)少量低分化癌，低分化原发癌；Cmet $(1+)$，EGFR(0)，HER2(0)，pMMR，pan-TRK(-)，PD-L1(CPS<1)，EBER(-)；(贲门)中分化原发癌，肠型；Cmet $(2+)$，EGFR$(^{1+})$，HER2(0)，pMMR，pan-TRK(-)，PD-L1$(CPS=70)$，EBER(-)。实验室检查结果正常，胸腹盆增强CT显示全胃弥漫性增厚，肝胃间淋巴结转移。目前患者进食量减少，排便无异常，近2个月体重下降约$5\mathsf{kg}$。

Doctor

医生

How do you analyze and determine the staging diagnosis and future prognosis for this patient?

如何分析和确定该患者的分期诊断及未来预后？

The staging of gastric cancer primarily refers to the American Joint Committee on Cancer (AJCC) TNM staging system, which assesses the depth of tumor invasion (T), the status of lymph node metastasis (N), and the condition of distant metastasis (M). According to the information you provided, the patient's gastric cancer has developed lymph node metastasis (LNM), indicating that the N part of the TNM stage is at least N1 (1-2 regional lymph nodes metastasis) or higher. Simultaneously, the gastric cancer is Borrmann IV type, suggesting that the tumor has extensively invaded the stomach wall, and in the TNM stage, the grading of the T part is usually T3 (tumor invades the serosa) or T4 (tumor penetrates the serosa or directly invades adjacent structures). Based on this information, it can be preliminarily judged that the patient is in at least stage III of gastric cancer (including T3N1M0, T4aN1M0, T4bN1M0, etc.) or a higher stage.

胃癌的分期主要依据美国癌症联合委员会 (AJCC) 的TNM分期系统，评估肿瘤浸润深度 (T)、淋巴结转移状态 (N) 和远处转移情况 (M)。根据您提供的信息，患者胃癌已出现淋巴结转移 (LNM)，表明TNM分期中的N部分至少为N1 (1-2枚区域淋巴结转移) 或更高。同时，胃癌属于Borrmann IV型，提示肿瘤已广泛侵犯胃壁，在TNM分期中T部分通常分级为T3 (肿瘤侵及浆膜下层) 或T4 (肿瘤穿透浆膜或直接侵犯邻近结构)。基于这些信息，可初步判断患者处于至少III期胃癌 (包括T3N1M0、T4aN1M0、T4bN1M0等) 或更高分期。

Doctor

医生

The patient's lymph node metastasis (LNM) between the liver and stomach could be classified as either N1 or N3, and a deta

[论文翻译]高级提示作为催化剂：赋能大语言模型管理胃肠癌

原文地址：https://the-innovation.org/data/article/medicine/preview/pdf/XINNMEDICINE-2023-0065.pdf

Advanced prompting as a catalyst: Empowering large language models in the management of gastrointestinal cancers

高级提示作为催化剂：赋能大语言模型管理胃肠癌

GRAPHICAL ABSTRACT

图文摘要

PUBLIC SUMMARY

Advanced prompting as a catalyst: Empowering large language models in the management of gastrointestinal cancers

高级提示作为催化剂：赋能大语言模型在胃肠癌管理中的应用

INTRODUCTION

引言

MATERIALS AND METHODS

材料与方法

Materials

材料

Templated prompts

In-context learning

Metrics

RESULTS

结果

Templated evaluation

模板化评估

Multi-round evaluation

A Direct prompting

Doctor

B Make role assumptions before prompting

B 在提示前预设角色假设

Doctor

ICL evaluation

Doctor

Doctor

Doctor